-
Notifications
You must be signed in to change notification settings - Fork 125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SNOW-535791: PUT operations read full input data into memory #536
Comments
thank you for submitting this issue with us! read through the linked PR 538 and it has another reference to another PR , #527 which is merged, so I guess this should be resolved for now. |
I can confirm that this is still an issue. There are at least instances where the stream is read fully into memory still here and here (for the s3 uploader). The concept of having the entire contents of the stream available seems pretty baked-in, for example here where the size of the stream is needed. I can imagine it will take some effort to fully refactor this but it would be be useful to handle streaming PUTs in a memory-efficient way. |
thank you for reviewing and commenting and especially for providing useful details! reopening this Issue and marking it as an enhancement for the product team to consider |
We use the streaming PUT feature of gosnowflake to upload data to snowflake internal stages. Recently, we started throwing GB sized files at it and saw our memory usage explode. Through some experimentation with an isolated production workload, we saw that for a 1GB input file, the driver was using approximately 8GB of memory when the PUT operation was configured as a stream, and 2-3GB when it was configured as a file on disk.
Looking at the code, the problem seems to be that alleged "streams" are often passed around as
bytes.Buffer
s that read the entire input data into memory multiple times over, the exact amount depending on the command and options (e.g. here, here, here, everywhere this is invoked, etc). Some crude experimentation with a fork I've created suggest there are likely more places.Beyond the obvious issue that the documentation on streaming puts is misleading, I would go so far as to say the driver should never have to read the entire input contents into memory. The major operations it's responsible for (compression, encryption, calculating digests, etc) are all possible with modest buffers that don't need to scale with the input size.
I'd be happy to contribute here, but wanted to start a discussion since the required changes seem to be nontrivial.
The text was updated successfully, but these errors were encountered: