We've noticed that using S3Storage.open("file.x").read() does a lot of HEAD requests in addition to the GET. These are caused by:
- The first HEAD request happens in the S3File constructor. Opening a file doesn't necessarily mean you're going to read it, so a HEAD request is made to ensure the file actually exists.
- The second HEAD request is caused by boto3, because we use download_fileobj. That method is a high-level managed transfer: it issues a HEAD request to find out the file size so it can potentially do a multipart download of large files using multiple threads.
When called in a tight loop these extra requests can slow things down a fair bit, especially for large numbers of small files.
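To make the request count concrete, here is a minimal simulation of the flow described above. It does not talk to S3 and is not the library's actual code: FakeS3Object and LazyFile are hypothetical stand-ins, and the fake only mimics the method names load() and download_fileobj() from boto3's Object resource while recording which requests each step would issue.

```python
import io


class FakeS3Object:
    """Hypothetical stand-in for a boto3 S3 Object; records simulated requests."""

    def __init__(self, body: bytes):
        self._body = body
        self.requests = []

    def load(self):
        # Simulates the metadata lookup used as an existence check.
        self.requests.append("HEAD")

    def download_fileobj(self, fileobj):
        # Simulates a managed transfer: size lookup first, then the fetch.
        self.requests.append("HEAD")
        self.requests.append("GET")
        fileobj.write(self._body)


class LazyFile:
    """Hypothetical file wrapper: checks existence on open, downloads on first access."""

    def __init__(self, obj):
        self.obj = obj
        self._file = None
        obj.load()  # first HEAD: happens even if the file is never read

    @property
    def file(self):
        if self._file is None:
            self._file = io.BytesIO()
            self.obj.download_fileobj(self._file)  # second HEAD, then GET
            self._file.seek(0)
        return self._file

    def read(self):
        return self.file.read()


def requests_for_one_read(body=b"hello"):
    """Open and read one simulated object; return the data and the request log."""
    obj = FakeS3Object(body)
    data = LazyFile(obj).read()
    return data, obj.requests
```

Under these assumptions, a single open-and-read costs three requests for one small payload, which is what adds up in a tight loop.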
I propose:
1. Eliminate the request in the constructor by accessing self.file right away, which triggers download_fileobj immediately. Most callers will probably call .read() immediately anyway. Add a config option (EAGER_DOWNLOAD?) to opt out if you really don't want this, but I don't see any common reason you wouldn't: if you don't want to read the file but just want the object size or something, you don't need to call S3Storage.open() at all; you can use S3Storage.size().
2. Eliminate the request in download_fileobj by using get() instead of download_fileobj. This will probably be context-dependent (for larger files, download_fileobj may perform better), so it probably needs to be opt-in via a setting. What about USE_MULTIPART_DOWNLOAD?
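A rough sketch of what both changes combined might look like, again as a simulation rather than a patch: EagerFile, FakeS3Object, and FakeNoSuchKey are hypothetical names, and the fake get() only imitates the shape of boto3's Object.get() response (a dict with a "Body" stream). With real boto3 I would expect a missing key to surface as a client error from get(), but that is an assumption to verify, not something shown here.

```python
import io


class FakeNoSuchKey(Exception):
    """Hypothetical stand-in for the error raised when the key is missing."""


class FakeS3Object:
    """Hypothetical stand-in for a boto3 S3 Object; records simulated requests."""

    def __init__(self, body=None):
        self._body = body  # None simulates an object that does not exist
        self.requests = []

    def get(self):
        self.requests.append("GET")
        if self._body is None:
            raise FakeNoSuchKey("object does not exist")
        return {"Body": io.BytesIO(self._body)}


class EagerFile:
    """Hypothetical file wrapper: downloads in the constructor with a single GET."""

    def __init__(self, obj):
        self.obj = obj
        # One GET both proves the object exists and fetches its content,
        # so a missing object still fails at open time.
        self.file = io.BytesIO(obj.get()["Body"].read())

    def read(self):
        return self.file.read()
```

In this sketch an existing object costs one request per open-and-read, and opening a missing object raises immediately instead of deferring the failure to the first read.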
Thanks for your consideration :)
Thanks for opening this. People also pay for these requests, so it's best to minimize them.
I strongly want to avoid adding settings where possible.
For option 1, would we still get an exception if you try to read a file that doesn't exist? As long as we maintain that invariant, I think that is certainly the best way.