
Implement first method transfer manager #4

Merged
merged 11 commits into boto:develop from kyleknap:upload on Mar 18, 2016

Conversation

@kyleknap
Contributor

kyleknap commented Feb 3, 2016

Has the ability to upload files to s3 by providing the filename.

Sorry that it is really long. I am hoping that, with the way I designed the interface of the internals, it will take significantly less effort and code to add the other methods and functionality. It just did not make sense to me to break the internals up into separate PRs without any knowledge or reference of how I was going to use them. I would be more than happy to explain how everything fits together in person to help with the review.

cc @jamesls @mtdowling @rayluo @JordonPhillips

Has the ability to upload files to s3 by providing the filename.
@kyleknap
Contributor Author

kyleknap commented Feb 3, 2016

Forgot that this botocore PR for stubbing (boto/botocore#784) needs to be merged before tests can pass...

@kyleknap
Contributor Author

I did some initial testing between the transfer manager and the CLI cp command. Here is what I got for the speed of uploading a single 10GB file:

s3 cp        s3transfer
1m55.227s    1m31.954s
1m33.227s    1m32.433s
1m32.077s    1m33.514s
1m33.663s    1m38.684s

So the times look similar, which is good. Memory and CPU usage were roughly the same according to top. Next I am going to look at a directory with a bunch of files and see how that performs.

@kyleknap
Contributor Author

Hmmm. Uploading many small files is pretty slow compared to the CLI. Uploading 10,000 files of 10KB each takes about 1m10s with the cp command, and s3transfer is currently 5 to 7 times slower...

The good news is that a pure futures implementation is not the reason. This code gets me under 1m10s:

import glob
from concurrent.futures import ThreadPoolExecutor

def upload_file(client, filename, bucket, key):
    with open(filename, 'rb') as f:
        client.put_object(Body=f, Bucket=bucket, Key=key)

def upload_many_files_with_client(client):
    bucket = 'mybucketfoo'
    file_list = glob.glob('many-files/*/*/*')
    with ThreadPoolExecutor(max_workers=10) as executor:
        for name in file_list:
            executor.submit(
                upload_file, client, filename=name, bucket=bucket, key=name)

So it is probably something to do with the implementation. I have yet to figure that out.

@kyleknap
Contributor Author

It is the use of the ReadFileChunk class. Wrapping the body in that class makes it much slower than just using the normal open() function with a context manager. When I remove all of the ReadFileChunk logic, s3transfer runs at about 1m10s, the same as the CLI.
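
For context, a wrapper like ReadFileChunk is essentially a file-like object that fires progress callbacks on every read. A minimal sketch of that shape (the class name and details here are assumed for illustration, not the actual s3transfer implementation):

class CallbackReader(object):
    # Hypothetical sketch of a ReadFileChunk-style wrapper; not the
    # actual s3transfer class.
    def __init__(self, fileobj, callbacks):
        self._fileobj = fileobj
        self._callbacks = callbacks  # progress callbacks to invoke

    def read(self, amount=-1):
        data = self._fileobj.read(amount)
        # Every read pays the cost of invoking each callback, so any
        # per-read overhead is multiplied across the whole transfer.
        for callback in self._callbacks:
            callback(len(data))
        return data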

@jamesls
Member

jamesls commented Feb 16, 2016

Do you have a profile of what in ReadFileChunk is slow? Are there subscribers to the read callbacks that cause this slowdown?

@kyleknap
Contributor Author

Possibly. That is what I am trying to figure out next. In my perf script I have no progress callbacks registered, but I still need to profile it.

@kyleknap
Contributor Author

Specifically, it looks like the slowness is caused by the registering and unregistering that enables the callbacks. Having to register and unregister a unique event on a single event emitter really bogs down the individual threads. When I remove the logic that registers and unregisters the handlers, I achieve a speed a bit faster than the CLI.

I am tempted to remove the dependency on ReadFileChunk, as it has already been problematic in boto3 (such as with the sigv4 workaround), and to try to find another way to determine progress that revolves around socket or HTTP connection reading.

Another thing I found is that opening files in the main thread through ReadFileChunk causes IOErrors saying too many files are open. Moving the file opening into each individual thread fixes this issue.
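
To make that last point concrete, here is a minimal sketch of the fix (illustrative only, not the PR's code). Opening the file inside the worker bounds the number of concurrently open files to roughly max_workers, whereas opening it in the main thread at submit time leaves one open file per queued task:

from concurrent.futures import ThreadPoolExecutor

def upload_opened_in_worker(client, filename, bucket, key):
    # The file is opened and closed inside the worker thread, so at most
    # roughly max_workers files are open at any one time.
    with open(filename, 'rb') as f:
        client.put_object(Body=f, Bucket=bucket, Key=key)

def upload_all(client, filenames, bucket):
    with ThreadPoolExecutor(max_workers=10) as executor:
        for name in filenames:
            # Only the filename crosses the thread boundary. Opening the
            # file here in the main thread instead would leave one open
            # file object per queued task, which is how the "too many
            # open files" IOError shows up.
            executor.submit(
                upload_opened_in_worker, client, name, bucket, name)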

@jamesls
Member

jamesls commented Feb 23, 2016

@kyleknap Is this related to the naive cache invalidation we have when unregistering handlers? Right now in botocore, any modification to the handler registrations unconditionally clears the cache.
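
To illustrate why that hurts here, a toy sketch of naive cache invalidation (illustrative only, not botocore's actual emitter): if the emitter memoizes event-name-to-handler lookups and clears the whole cache on any registration change, then registering and unregistering a unique handler per upload keeps the cache effectively empty.

class NaiveEmitter(object):
    # Toy model of naive cache invalidation; not botocore's code.
    def __init__(self):
        self._handlers = {}  # event name -> list of handlers
        self._cache = {}     # memoized handler lookups

    def register(self, event_name, handler):
        self._handlers.setdefault(event_name, []).append(handler)
        self._cache = {}  # any modification wipes the entire cache...

    def unregister(self, event_name, handler):
        handlers = self._handlers.get(event_name, [])
        if handler in handlers:
            handlers.remove(handler)
        self._cache = {}  # ...so per-upload churn defeats it completely

    def emit(self, event_name, **kwargs):
        if event_name not in self._cache:
            # With the cache always empty, handler resolution (the
            # expensive part in a hierarchical emitter) reruns on every
            # single emit.
            self._cache[event_name] = list(self._handlers.get(event_name, []))
        for handler in self._cache[event_name]:
            handler(**kwargs)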

@kyleknap
Contributor Author

That may be the reason. I have not looked too deeply into it, but it would make sense: with thousands of handlers being registered and no cache, you would have to search through them from scratch every time.

Honestly, I want to figure out a way to get progress without registering handlers, so this logic should hopefully be temporary as I backport the boto3/CLI functionality to s3transfer.

There was not really a need to register and unregister the handlers in the
upload method. It also slowed uploading many files by quite a bit because
the event emitter wipes out its cache on every register and unregister.
@JordonPhillips
Contributor

Looks good to me, pending the updates mentioned above.

# If there are too many futures running, wait till some
# complete and save the remaining running futures as the
# next set to wait for.
self._currently_running_futures = futures.wait(
Member

Do you know if this is creating a copy of the incomplete futures each time?

I'm wondering about the case where the input rate is greater than the output rate, which means we're essentially calling .wait() every time we try to submit. If the max size is large enough, and we are creating a copy of the incomplete futures on every wait() call, it seems like this could create a lot of overhead.

Thoughts?

Contributor Author

I looked at the implementation. The future is not copied when it is added to the set of unfinished futures. Is that what you were asking? If so, I do not think it adds much overhead, as it is just reusing the provided futures to build a new set of unfinished ones.

Each time wait is called, a new waiter gets added to each unfinished future's list of current waiters, but that seems necessary for the implementation.
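
For reference, the bounded-submission pattern under discussion looks roughly like this (a sketch of the idea with assumed names, not the PR's exact code). wait() hands back the same future objects in its not_done set, so the futures themselves are not copied, though a new set is allocated on each call:

from concurrent import futures

def submit_bounded(executor, fn, tasks, max_in_flight=100):
    # Sketch: never let more than max_in_flight futures run at once.
    running = set()
    for task in tasks:
        if len(running) >= max_in_flight:
            # wait() returns a (done, not_done) pair of sets built from
            # the same future objects that were passed in.
            done, running = futures.wait(
                running, return_when=futures.FIRST_COMPLETED)
        running.add(executor.submit(fn, task))
    futures.wait(running)  # drain whatever is still in flight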

@jamesls
Member

jamesls commented Mar 7, 2016

I've gone through everything except the tests, and so far I think this looks good. It's pretty straightforward to follow, and seems like a big improvement over what we have currently.

I'll go through the tests shortly.

If the file got opened in the main thread, it caused the OS to bug out about
too many open file handles.
@kyleknap
Contributor Author

I updated the PR with the following:

  • Improved performance when uploading many files. It is now at the speed of the CLI.
  • Made a BaseSubscriber class. It is a much better approach, from both a functionality and a documentation perspective, than before (a rough sketch of the hook shape follows below).

Should be good to look at again, along with the tests.
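
For that sketch (hook names patterned on what BaseSubscriber eventually exposes in s3transfer; treat the exact signatures as assumptions here): a subscriber is just an object whose callbacks the transfer manager invokes at lifecycle points.

class ProgressPrinter(object):
    # Illustrative subscriber sketch; in practice you would subclass
    # s3transfer's BaseSubscriber rather than define the hooks from
    # scratch.
    def on_queued(self, future, **kwargs):
        print('transfer queued')

    def on_progress(self, future, bytes_transferred, **kwargs):
        print('transferred %d more bytes' % bytes_transferred)

    def on_done(self, future, **kwargs):
        print('transfer done')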

@kyleknap
Contributor Author

@jamesls so I just implemented uploading streams to s3: #11. If you have not gotten too far in this PR (and have a decent understanding of the internals), I would suggest that you take a look at that PR, because I did a fair amount of refactoring in both the code and the tests. In the end, I feel that because of that refactoring, the implementation is better abstracted, making it easier to read, understand, and test.

@jamesls
Member

jamesls commented Mar 18, 2016

:shipit: Looks good. I'll take a look at the streaming upload.

kyleknap added a commit that referenced this pull request Mar 18, 2016
Implement first method transfer manager
@kyleknap merged commit 692e62d into boto:develop on Mar 18, 2016
@kyleknap deleted the upload branch on March 18, 2016 at 23:24
@iamahuman

iamahuman commented Sep 21, 2021

> Another thing I found is that opening files in the main thread through ReadFileChunk causes IOErrors saying too many files are open. Moving the file opening into each individual thread fixes this issue.

I suspect that the commit which (supposedly) fixed this issue, 2f3d12c, actually introduced the regression described in #80.

It should be noted that the fixing commit introduced an additional behavior: it closes the file argument. I wonder whether the thread from which the files were opened was not the issue in itself, but merely affected the likelihood of the OSError: Too many open files (EMFILE) cropping up, given the concurrency of the threads.

@kyleknap: Would it be possible that the main issue was, in fact, the (stress) test code not taking responsibility for closing the files itself after put_object (and its multipart counterparts, etc.) finished, resulting in file descriptor leakage that manifested as EMFILE?
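
To make the hypothesis concrete, a sketch of the suspected anti-pattern (illustrative only, not the actual test code):

import glob

def leaky_upload_all(client, bucket):
    for name in glob.glob('many-files/*/*/*'):
        # Hypothesized anti-pattern: the file object is never closed by
        # the caller. If the library does not close it either, each
        # iteration leaks a file descriptor until the process hits
        # EMFILE ("Too many open files").
        client.put_object(Body=open(name, 'rb'), Bucket=bucket, Key=name)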
