Have the nodes ping out their dataloader state before the all-reduce. #98

Open
Jackmin801 opened this issue Oct 11, 2024 · 0 comments

  • If a node leaves by crashing, we cannot recover its exact dataloader state.
  • This forces us to skip shards manually to avoid duplicate data.
  • Ideally, a node could resume automatically from a dataloader state stored remotely.
  • The dataloader state is small, so sending it should add little overhead.
  • We could interleave the ping with the all-reduce: a completed all-reduce then validates the pinged dataloader state as the latest.
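
A rough sketch of the ping-then-validate idea, to make the proposal concrete. Everything here is an assumption for illustration: `RemoteStateStore`, `training_step`, the `shard`/`offset` fields, and the in-process `all_reduce_mean` are hypothetical stand-ins, not the project's actual API. A real version would use the real collective and a real remote store.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RemoteStateStore:
    """Hypothetical in-memory stand-in for a remote key-value store."""
    pending: dict = field(default_factory=dict)    # node_id -> (step, payload)
    committed: dict = field(default_factory=dict)  # node_id -> (step, payload)

    def ping(self, node_id, step, state):
        # Serialize so the stored copy is independent of the live object.
        self.pending[node_id] = (step, json.dumps(state))

    def commit(self, node_id, step):
        # Promote the pending state only if it belongs to the finished step.
        entry = self.pending.get(node_id)
        if entry is not None and entry[0] == step:
            self.committed[node_id] = entry

    def latest(self, node_id):
        # Return the last validated (step, state) pair, or None if there is none.
        entry = self.committed.get(node_id)
        if entry is None:
            return None
        step, payload = entry
        return step, json.loads(payload)


def all_reduce_mean(values):
    """Toy all-reduce over in-process values; a real run would use a collective."""
    return sum(values) / len(values)


def training_step(store, step, node_states, node_values, fail=False):
    # 1. Every node pings out its dataloader state before the all-reduce.
    for node_id, state in node_states.items():
        store.ping(node_id, step, state)
    # 2. Run the all-reduce; a crash here leaves the pings uncommitted.
    if fail:
        raise RuntimeError("simulated node crash during all-reduce")
    result = all_reduce_mean(list(node_values.values()))
    # 3. Completing the all-reduce validates the pinged states as the latest.
    for node_id in node_states:
        store.commit(node_id, step)
    return result


store = RemoteStateStore()
values = {"node0": 1.0, "node1": 3.0}

# Step 1 completes, so its dataloader states become the validated ones.
states_1 = {"node0": {"shard": 3, "offset": 128}, "node1": {"shard": 7, "offset": 64}}
result = training_step(store, 1, states_1, values)

# Step 2 crashes mid all-reduce, so its pings stay pending and are never promoted.
states_2 = {"node0": {"shard": 4, "offset": 0}, "node1": {"shard": 8, "offset": 0}}
try:
    training_step(store, 2, states_2, values, fail=True)
except RuntimeError:
    pass

# A rejoining node0 would resume from the step-1 state.
resume_step, resume_state = store.latest("node0")
```

Keeping pending and committed states separate means a node that crashes mid all-reduce resumes from the last step every peer agreed on, rather than from a state whose gradients were never applied.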