-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove S3FS dependency #332
Conversation
with s3.open(object_path, "rb") as f: | ||
source_version = f.version_id | ||
with s3.open_input_stream( | ||
"{}/{}".format(input_bucket, input_key), buffer_size=FIVE_MB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buffer size is same as s3fs: https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L246
protection against automated library upgrade PRs that may silently introduce | ||
issues. | ||
""" | ||
assert s3transfer.__version__ == "0.6.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed this with @HieronymusLex - may seem a bit over cautious but I think it's worth it given Dependabot may make very easy to silently upgrade and leave dangerous things unnoticed.
Description of changes:
This PR removes s3fs to simplify handling communications with S3.
It uses pyarrow's S3FileSystem to read, and boto3 to write back.
The reason I am using pyarrow to read is that it allows us to get rid of s3fs and I haven't noticed differences in terms of performance.
I tried to use pyarrow to write back to s3 too, but unfortunately I don't have an API that gives me back the VersionId of the write. If I would read the latest version using boto3, I would lose confidence that the version I'm reading is the one I just wrote, and therefore I wouldn't be able to perform consistency checks.
In order to remove s3fs, I investigated boto3 and found a (new?) convenient method called upload_fileobj that is smart enough to figure out if doing multipart or not, and I noticed an increase in performance compared to s3fs. The issue is that this method doesn't return the versionID as well, but I managed to find a way to monkey patch the code while we wait for this issue to be resolved. I found some Pull Requests too so perhaps we can see this soon landing in boto.
I ran the acceptance tests as well as some manual tests with very large files (8GB snappy compressed parquet), for which I didn't see any difference in reads, and some performance improvement (3x) in writing.
PR Checklist:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.