Upload big io.Buffer to S3 #380
Thanks for reporting this. Could you simplify the example a bit? It's too long and requires external data (wind_2012_test_parts.zip). From your description of the problem, it sounds like the following should reproduce it:

```python
import smart_open

with open('some_large_file.bin', 'rb') as fin:
    with smart_open.open('s3://bucket/key.bin', 'wb') as fout:
        buf = fin.read(int(10e9))  # read 10 GB into memory, oof
        fout.write(buf)
```

Can you confirm whether the above reproduces your problem? If not, let's look into reducing your original example; it's a bit too much for me to look at.
I can confirm that I'm encountering this error when trying to upload a file over 5 GB via smart_open. This Stack Overflow post appears to explain the cause: https://stackoverflow.com/questions/26319815/entitytoolarge-error-when-uploading-a-5g-file-to-amazon-s3
@davidparks21 Thank you for confirming the problem. I think we can resolve the issue by ensuring that a single write call never puts more than 5 GB. If there is more data, then subsequent write calls should handle it. Are you able to make a PR?
Oh, so just raise an exception when one write() call tries to write more than 5 GB? That would be an easy solution for me to deal with. I think I could do a PR for that. There was one other small thing I wanted to do a PR for too, so this would probably get me off my butt to do both.
smart_open's promise is to handle large uploads (and downloads) transparently. So instead of raising an exception, isn't it better to split the chunk into multipart pieces, each smaller than 5 GB? IIRC smart_open is already handling multipart uploads transparently under the hood, so this should be no different.
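A minimal sketch of that splitting idea (not smart_open's actual internals; the helper name and size constant below are made up for illustration):

```python
# Illustrative only: split one oversized write() payload into slices that
# each stay under the S3 per-part ceiling, so no single part exceeds 5 GB.
MAX_PART_SIZE = 5 * 1024 ** 3  # 5 GiB, the S3 upper bound for one part

def write_in_chunks(fout, data, chunk_size=MAX_PART_SIZE):
    """Write `data` to `fout` in slices of at most `chunk_size` bytes."""
    view = memoryview(data)  # slicing a memoryview avoids copying the buffer
    for start in range(0, len(view), chunk_size):
        fout.write(view[start:start + chunk_size])
```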
I have a similar issue when trying to stream/write large files to S3 via smart_open. Is this still 'needs-info', or is the problem understood?
I think we understand the problem, now we "just" need to fix it.
@JamalRahman @pythric @ivanhigueram could you help with a fix and prepare a PR?
@mpenkov is this still an open issue? I found it while considering smart_open for uploading a 5 TB file to S3.
I started working on a solution, but it wasn't a trivial change the way the code is currently structured. I ran out of time and abandoned the effort back when I posted. I'm not sure about the current status, but my solution was to simply chunk the calls to write().
I've been chugging away at this, and finally hit upon a solution that wouldn't need to make any extra copies of the data into the buffer, which would be a significant improvement when dealing with files of the size we're talking about. Unfortunately, boto/boto3#3423 stopped me from reaching that perfect solution. I'll be opening a PR soon with a compromise, but if my PR to botocore is accepted and released, it'll open up not needing to buffer the data at all before sending (unless writes smaller than the minimum part size are involved).
I think I'm running into a very similar problem:
I seem to be running out of memory on a small-core machine with plenty of /tmp space, so I'm thinking I need to buffer the write/read? I thought this was handled with the tp?
Solved this. I was missing the write iterator. |
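Presumably the "write iterator" here means reading the source in bounded chunks and writing each one, so nothing close to the whole file ever sits in memory. A rough sketch under that assumption (the paths, bucket, and chunk size are hypothetical):

```python
import smart_open

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per iteration; tune to taste

# Hypothetical source file and destination key, for illustration only.
with open('some_large_file.bin', 'rb') as fin:
    with smart_open.open('s3://bucket/key.bin', 'wb') as fout:
        # iter(callable, sentinel) keeps yielding chunks until read() returns b''
        for chunk in iter(lambda: fin.read(CHUNK_SIZE), b''):
            fout.write(chunk)
```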
Problem description
I am requesting a set of files, zipping them, and then uploading the zipped data to S3 using smart_open and an io.BytesIO() object. The size of the compressed data exceeds the 5 GB S3 limit, and I know that in that case a multipart approach should be used (just like in boto3). I am using smart_open.s3.open() for this, but I do not completely understand how to configure the multipart upload to avoid the EntityTooLarge error. I keep getting the error when using my code. Should I divide my file beforehand, or specify the number of parts? Checking the source code, I don't see a num_parts option.

My function is the following:
You can test the function by running:
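The original function isn't reproduced above, but as a rough sketch of the workflow described (the file list, bucket, key, and part size are placeholders, not the original code, and the min_part_size transport param assumes a reasonably recent smart_open release), it might look something like this:

```python
import io
import zipfile
import smart_open

def zip_and_upload(paths, s3_url, min_part_size=256 * 1024 * 1024):
    """Zip the given files into an in-memory buffer and stream it to S3."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            zf.write(path)
    buf.seek(0)

    # smart_open handles the multipart upload; writing in bounded chunks
    # keeps any single write() well below the 5 GB per-part limit.
    with smart_open.open(s3_url, 'wb',
                         transport_params={'min_part_size': min_part_size}) as fout:
        for chunk in iter(lambda: buf.read(min_part_size), b''):
            fout.write(chunk)

# Example invocation (hypothetical files and bucket):
# zip_and_upload(['file1.nc', 'file2.nc'], 's3://my-bucket/archive.zip')
```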
Versions