Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove S3FS dependency #332

Merged
merged 5 commits into from
Aug 24, 2022
Merged

Remove S3FS dependency #332

merged 5 commits into from
Aug 24, 2022

Conversation

matteofigus
Copy link
Member

@matteofigus matteofigus commented Jul 28, 2022

Description of changes:

This PR removes s3fs to simplify handling communications with S3.

It uses pyarrow's S3FileSystem to read, and boto3 to write back.

The reason I am using pyarrow to read is that it allows us to get rid of s3fs and I haven't noticed differences in terms of performance.

I tried to use pyarrow to write back to s3 too, but unfortunately I don't have an API that gives me back the VersionId of the write. If I would read the latest version using boto3, I would lose confidence that the version I'm reading is the one I just wrote, and therefore I wouldn't be able to perform consistency checks.

In order to remove s3fs, I investigated boto3 and found a (new?) convenient method called upload_fileobj that is smart enough to figure out if doing multipart or not, and I noticed an increase in performance compared to s3fs. The issue is that this method doesn't return the versionID as well, but I managed to find a way to monkey patch the code while we wait for this issue to be resolved. I found some Pull Requests too so perhaps we can see this soon landing in boto.

I ran the acceptance tests as well as some manual tests with very large files (8GB snappy compressed parquet), for which I didn't see any difference in reads, and some performance improvement (3x) in writing.

PR Checklist:

  • Changelog updated
  • Unit tests (and integration tests if applicable) provided
  • All tests pass
  • Pre-commit checks pass
  • Debugging code removed
  • If releasing a new version, have you bumped the version in the main CFN template?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

with s3.open(object_path, "rb") as f:
source_version = f.version_id
with s3.open_input_stream(
"{}/{}".format(input_bucket, input_key), buffer_size=FIVE_MB
Copy link
Member Author

@matteofigus matteofigus Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protection against automated library upgrade PRs that may silently introduce
issues.
"""
assert s3transfer.__version__ == "0.6.0"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed this with @HieronymusLex - may seem a bit over cautious but I think it's worth it given Dependabot may make very easy to silently upgrade and leave dangerous things unnoticed.

@matteofigus matteofigus changed the title S3fs 2 Remove S3FS dependency Jul 28, 2022
HieronymusLex
HieronymusLex previously approved these changes Jul 28, 2022
@matteofigus matteofigus merged commit c03e66f into master Aug 24, 2022
@matteofigus matteofigus deleted the s3fs-2 branch August 24, 2022 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants