add data rank split #167

Merged
merged 3 commits into from
Dec 4, 2024

@samsja (Collaborator) commented Dec 3, 2024

What this PR does:

  • allows splitting the data locally by data rank, to handle local experiments after the new datasets PR
  • adds an easy-to-use install script

This graph shows the old and the new behavior: the data is no longer duplicated locally.
Screenshot from 2024-12-03 18-52-24
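
For readers unfamiliar with the idea, here is a minimal sketch of what splitting by data rank could look like. It is an illustration only, not the code from this PR: the function name `split_by_data_rank`, the parameters `data_rank` and `data_world_size`, and the shard file names are all hypothetical. The assumption is that each local worker takes a strided slice of the file list, so no two workers read the same file.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_by_data_rank(
    items: Sequence[T], data_rank: int, data_world_size: int
) -> List[T]:
    """Return the strided slice of `items` owned by `data_rank`.

    Hypothetical helper: rank r takes items r, r + N, r + 2N, ...
    where N is `data_world_size`. Shards are disjoint and together
    cover every item exactly once.
    """
    if data_world_size <= 0:
        raise ValueError("data_world_size must be positive")
    if not 0 <= data_rank < data_world_size:
        raise ValueError("data_rank must be in [0, data_world_size)")
    return list(items[data_rank::data_world_size])


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    files = [f"shard_{i:03d}.parquet" for i in range(10)]
    for rank in range(3):
        print(rank, split_by_data_rank(files, rank, 3))
```

A strided slice keeps shard sizes within one item of each other without needing to know the file sizes; a contiguous split would work just as well if file order matters.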

@samsja force-pushed the add-back-data-rank-split branch 6 times, most recently from fec3ba0 to e2a22dc, December 3, 2024 23:28
@samsja force-pushed the add-back-data-rank-split branch from 3144e13 to 894b281, December 3, 2024 23:47
@samsja force-pushed the add-back-data-rank-split branch from 894b281 to 42b2ab0, December 4, 2024 01:22

@Jackmin801 (Member) left a comment

LGTM. Should we have the code, rather than the script, download the data in a future PR? I like the HF datasets approach, where you can just specify the dataset repo / path.

@samsja (Collaborator, Author) commented Dec 4, 2024

> LGTM. Should we have the code, rather than the script, download the data in a future PR? I like the HF datasets approach, where you can just specify the dataset repo / path.

Yeah, I am planning on refactoring the dataset part to also support streaming from an HF repo, so I might as well do the downloading option at the same time.

@samsja merged commit 63efaf0 into main Dec 4, 2024
2 checks passed
@samsja deleted the add-back-data-rank-split branch December 4, 2024 22:54