-
Notifications
You must be signed in to change notification settings - Fork 316
[WIP] feat: add S3 download and upload mappers for distributed processing #839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of ChangesHello @kyo-tom, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the data processing capabilities by integrating robust S3 download and upload functionalities. It introduces two specialized mappers designed for distributed environments, allowing users to efficiently manage data transfer between S3 and local processing nodes. This enables complex workflows, such as processing large media files, directly from and to cloud storage, streamlining distributed data pipelines. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces two new mappers, S3DownloadFileMapper and S3UploadFileMapper, to facilitate file transfers with S3 in distributed environments. The overall implementation is solid, featuring concurrent operations and lazy client initialization for Ray compatibility. My review focuses on a critical bug fix in the download mapper, improving type hints for correctness, removing unused code, and enhancing security practices in the provided example configuration. With these adjustments, the new mappers will be more robust and reliable.
demos/process_video_on_ray/configs/s3_video_processing_config.yaml
Outdated
Show resolved
Hide resolved
92312f4 to
49958c3
Compare
49958c3 to
9ccf2bf
Compare
|
Formatted code |
cyruszhang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall agree with the problem we are trying to tackle here: s3 url in individual field. couple of comments to clean it up. Can you also share a successful run with the code?
|
Thanks to @cyruszhang for reviewing the code. I have submitted a new commit to fix the issue you mentioned. This commit refactors both s3_download_file_mapper and s3_upload_file_mapper Changes to data_juicer/ops/mapper/s3_download_file_mapper.py:
Changes to data_juicer/ops/mapper/s3_upload_file_mapper.py:
Once the PR is approved, I will squash all commits. |
|
LGTM |
|
ready to merge; please resolve comments, merge master |
49ff42b to
3cf4c86
Compare
|
It is ready to merge. @cyruszhang |
Dludora
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approve
Dludora
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approve and run test
|
Hi @kyo-tom Thanks for your contribution! There are two issues from latest PRs or updates of third-party dependency and I have fixed them in this PR to the branch in your fork here. It's supposed to solve the test issues now in this PR. Please check and merge it before this PR is finished. Thanks again! |
Hi, these modifications have been merged into the main branch in PR #876 . Please pull the latest code in the main branch and merge it into this PR. |
3cf4c86 to
f06c3bb
Compare
|
Hi @HYLcool Thanks for your help! I've already pulled the latest code from the main branch (including the fixes from #876) and force-pushed to this PR branch. The commit Please let me know if there's anything else I need to address. Thanks again for your assistance! 🙏 |
|
@HYLcool It looks like the unit tests ran correctly and finished, but the exit code is 1. |
Recently the network issues are more often and it leads to failure of some API-based test cases. Besides, it seems some fixed bugs still exist in this PR. But it's fine anyway. After the unit test, if there are no errors about the contents in this PR, we can merge it still. We will check the whole system later in new PRs. |
|
The issues occured in the unit test of this PR are resolved in other PRs (e.g. #824 ). Here we just ignore them and merge this PR. |


feat: add S3 download and upload mappers for distributed processing
Add two new mappers to support S3-based workflows in distributed Ray environments:
Key features:
This addresses the distributed file management challenge where processed
files are scattered across Ray worker nodes without centralized tracking.
Related Issues
#838