
Implement first method transfer manager #4

Merged
merged 11 commits into boto:develop from kyleknap:upload on Mar 18, 2016

Conversation

@kyleknap
Contributor

kyleknap commented Feb 3, 2016

Has the ability to upload files to s3 by providing the filename.

Sorry that it is really long. I am hoping that, with the way I designed the interface of the internals, it will take significantly less effort and code to add the other methods and functionality. It just did not make sense to me to break the internals up into separate PRs without any knowledge or reference of how I was going to use them. I would be more than happy to explain how everything fits together in person to help with the review.

cc @jamesls @mtdowling @rayluo @JordonPhillips

Has the ability to upload files to s3 by providing the filename.
@kyleknap
Contributor Author

kyleknap commented Feb 3, 2016

Forgot that this botocore PR for stubbing (boto/botocore#784) needs to be merged before tests can pass...

@kyleknap
Contributor Author

I did some initial testing between the transfer manager and the CLI cp command. Here is what I got for the speed of uploading a single 10GB file:

s3 cp        s3transfer
1m55.227s    1m31.954s
1m33.227s    1m32.433s
1m32.077s    1m33.514s
1m33.663s    1m38.684s

So the times look similar, which is good. Memory and CPU usage were roughly the same according to top. Next I am going to look at a directory with a bunch of files and see how that performs.

@kyleknap
Contributor Author

Hmmm. Uploading many small files is pretty slow compared to the CLI. Uploading 10,000 files of 10KB each takes about 1m10s with the cp command, and s3transfer is currently 5 to 7 times slower...

The good news is that a pure futures implementation is not the reason. This code gets me under 1m10s:

import glob
from concurrent.futures import ThreadPoolExecutor

def upload_file(client, filename, bucket, key):
    with open(filename, 'rb') as f:
        client.put_object(Body=f, Bucket=bucket, Key=key)

def upload_many_files_with_client(client):
    bucket = 'mybucketfoo'
    file_list = glob.glob('many-files/*/*/*')
    with ThreadPoolExecutor(max_workers=10) as executor:
        for name in file_list:
            executor.submit(
                upload_file, client, filename=name, bucket=bucket, key=name)

So it is probably something to do with the implementation. I have yet to figure that out.

@kyleknap
Contributor Author

It is the use of the ReadFileChunk class. Wrapping the body in that class makes it much slower than just using the normal open() function with a context manager. When I remove all of the ReadFileChunk logic, s3transfer runs at about 1m10s, the same as the CLI.
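
For context, a wrapper like ReadFileChunk is essentially a file-like object that fires progress callbacks on every read. A minimal sketch of that shape (the class name and details here are assumed for illustration, not the actual s3transfer implementation):

class CallbackReader(object):
    # Hypothetical sketch of a ReadFileChunk-style wrapper; not the
    # actual s3transfer class.
    def __init__(self, fileobj, callbacks):
        self._fileobj = fileobj
        self._callbacks = callbacks  # progress callbacks to invoke

    def read(self, amount=-1):
        data = self._fileobj.read(amount)
        # Every read pays the cost of invoking each callback, so any
        # per-read overhead is multiplied across the whole transfer.
        for callback in self._callbacks:
            callback(len(data))
        return data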

@jamesls
Member

jamesls commented Feb 16, 2016

Do you have a profile of what in ReadFileChunk is slow? Are there subscribers to the read callbacks that cause this slowdown?

@kyleknap
Contributor Author

Possibly. That is what I am trying to figure out next. In my perf script I have no progress callbacks registered, but I still need to profile it.

@kyleknap
Contributor Author

Specifically, it looks like the slowness is caused by the registering and unregistering that enables the callbacks. Having to register and unregister a unique event on a single event emitter really bogs down the individual threads. When I remove the logic that registers and unregisters the handlers, I achieve a speed a bit faster than the CLI.

I am tempted to remove the dependency on ReadFileChunk, as it has already been problematic in boto3 (such as with the sigv4 workaround), and to try to find another way to determine progress that revolves around socket or HTTP connection reading.

Another thing I found is that opening files in the main thread through ReadFileChunk causes IOErrors saying too many files are open. Moving the file opening into each individual thread fixes this issue.
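
To make that last point concrete, here is a minimal sketch of the fix (illustrative only, not the PR's code). Opening the file inside the worker bounds the number of concurrently open files to roughly max_workers, whereas opening it in the main thread at submit time leaves one open file per queued task:

from concurrent.futures import ThreadPoolExecutor

def upload_opened_in_worker(client, filename, bucket, key):
    # The file is opened and closed inside the worker thread, so at most
    # roughly max_workers files are open at any one time.
    with open(filename, 'rb') as f:
        client.put_object(Body=f, Bucket=bucket, Key=key)

def upload_all(client, filenames, bucket):
    with ThreadPoolExecutor(max_workers=10) as executor:
        for name in filenames:
            # Only the filename crosses the thread boundary. Opening the
            # file here in the main thread instead would leave one open
            # file object per queued task, which is how the "too many
            # open files" IOError shows up.
            executor.submit(
                upload_opened_in_worker, client, name, bucket, name)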

@jamesls
Member

jamesls commented Feb 23, 2016

@kyleknap Is this related to the naive cache invalidation we have when unregistering handlers? Right now in botocore, any modification to the handler registrations unconditionally clears the cache.
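
To illustrate why that hurts here, a toy sketch of naive cache invalidation (illustrative only, not botocore's actual emitter): if the emitter memoizes event-name-to-handler lookups and clears the whole cache on any registration change, then registering and unregistering a unique handler per upload keeps the cache effectively empty.

class NaiveEmitter(object):
    # Toy model of naive cache invalidation; not botocore's code.
    def __init__(self):
        self._handlers = {}  # event name -> list of handlers
        self._cache = {}     # memoized handler lookups

    def register(self, event_name, handler):
        self._handlers.setdefault(event_name, []).append(handler)
        self._cache = {}  # any modification wipes the entire cache...

    def unregister(self, event_name, handler):
        handlers = self._handlers.get(event_name, [])
        if handler in handlers:
            handlers.remove(handler)
        self._cache = {}  # ...so per-upload churn defeats it completely

    def emit(self, event_name, **kwargs):
        if event_name not in self._cache:
            # With the cache always empty, handler resolution (the
            # expensive part in a hierarchical emitter) reruns on every
            # single emit.
            self._cache[event_name] = list(self._handlers.get(event_name, []))
        for handler in self._cache[event_name]:
            handler(**kwargs)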

@kyleknap
Contributor Author

That may be the reason. I have not looked too deeply into it, but it would make sense: with thousands of handlers being registered and no cache, you would have to search through them from scratch every time.

Honestly, I want to figure out a way to get progress without registering handlers, so this logic should hopefully be temporary as I backport the boto3/CLI functionality to s3transfer.

There was not really a need to register and unregister the handlers in the
upload method. It also slowed uploading many files by quite a bit because
the event emitter wipes out its cache on every register and unregister.
@JordonPhillips
Contributor

Looks good to me, pending the updates mentioned above.

# If there are too many futures running, wait till some
# complete and save the remaining running futures as the
# next set to wait for.
self._currently_running_futures = futures.wait(
Member

Do you know if this is creating a copy of the incomplete futures each time?

I'm wondering about the case where the input rate is greater than the output rate, which means we're essentially calling .wait() every time we try to submit. If the max size is large enough, and we are creating a copy of the incomplete futures on every wait() call, it seems like this could create a lot of overhead.

Thoughts?

Contributor Author

I looked at the implementation. The future is not copied when it is added to the set of unfinished futures. Is that what you were asking? If so, I do not think it adds much overhead, as it is just reusing the provided futures to build a new set of unfinished ones.

Each time wait is called, a new waiter gets added to each unfinished future's list of current waiters, but that seems necessary for the implementation.
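
For reference, the bounded-submission pattern under discussion looks roughly like this (a sketch of the idea with assumed names, not the PR's exact code). wait() hands back the same future objects in its not_done set, so the futures themselves are not copied, though a new set is allocated on each call:

from concurrent import futures

def submit_bounded(executor, fn, tasks, max_in_flight=100):
    # Sketch: never let more than max_in_flight futures run at once.
    running = set()
    for task in tasks:
        if len(running) >= max_in_flight:
            # wait() returns a (done, not_done) pair of sets built from
            # the same future objects that were passed in.
            done, running = futures.wait(
                running, return_when=futures.FIRST_COMPLETED)
        running.add(executor.submit(fn, task))
    futures.wait(running)  # drain whatever is still in flight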

@jamesls
Member

jamesls commented Mar 7, 2016

I've gone through everything except the tests, and so far I think this looks good. It's pretty straightforward to follow, and seems like a big improvement over what we have currently.

I'll go through the tests shortly.

If the file got opened in the main thread, it caused the OS to bug out about
too many open file handles.
@kyleknap
Contributor Author

I updated the PR with the following:

  • Improved performance when uploading many files. It is now at the speed of the CLI.
  • Made a BaseSubscriber class. It is a much better approach, from both a functionality and a documentation perspective, than before (a rough sketch of the hook shape follows below).

Should be good to look at again, along with the tests.
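
For that sketch (hook names patterned on what BaseSubscriber eventually exposes in s3transfer; treat the exact signatures as assumptions here): a subscriber is just an object whose callbacks the transfer manager invokes at lifecycle points.

class ProgressPrinter(object):
    # Illustrative subscriber sketch; in practice you would subclass
    # s3transfer's BaseSubscriber rather than define the hooks from
    # scratch.
    def on_queued(self, future, **kwargs):
        print('transfer queued')

    def on_progress(self, future, bytes_transferred, **kwargs):
        print('transferred %d more bytes' % bytes_transferred)

    def on_done(self, future, **kwargs):
        print('transfer done')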

@kyleknap
Contributor Author

@jamesls so I just implemented uploading streams to s3: #11. If you have not gotten too far in this PR (and have a decent understanding of the internals), I would suggest that you take a look at that PR, because I did a fair amount of refactoring in both the code and the tests. In the end, I feel that because of that refactoring, the implementation is better abstracted, making it easier to read, understand, and test.

@jamesls
Member

jamesls commented Mar 18, 2016

:shipit: Looks good. I'll take a look at the streaming upload.

kyleknap added a commit that referenced this pull request Mar 18, 2016
Implement first method transfer manager
@kyleknap merged commit 692e62d into boto:develop on Mar 18, 2016
@kyleknap deleted the upload branch on March 18, 2016 at 23:24
@iamahuman

iamahuman commented Sep 21, 2021

> Another thing I found is that opening files in the main thread through ReadFileChunk causes IOErrors saying too many files are open. Moving the file opening into each individual thread fixes this issue.

I suspect that the commit which (supposedly) fixed this issue, 2f3d12c, actually introduced the regression described in #80.

It should be noted that the fixing commit introduced an additional behavior: it closes the file argument. I wonder whether the thread from which the files were opened was not the issue in itself, but merely affected the likelihood of the OSError: Too many open files (EMFILE) cropping up, given the concurrency of the threads.

@kyleknap: Would it be possible that the main issue was, in fact, the (stress) test code not taking responsibility for closing the files itself after put_object (and its multipart counterparts, etc.) finished, resulting in file descriptor leakage that manifested as EMFILE?
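
To make the hypothesis concrete, a sketch of the suspected anti-pattern (illustrative only, not the actual test code):

import glob

def leaky_upload_all(client, bucket):
    for name in glob.glob('many-files/*/*/*'):
        # Hypothesized anti-pattern: the file object is never closed by
        # the caller. If the library does not close it either, each
        # iteration leaks a file descriptor until the process hits
        # EMFILE ("Too many open files").
        client.put_object(Body=open(name, 'rb'), Bucket=bucket, Key=name)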
