
[fix] fail when diloco is used with global pg size of 1 #21

Closed
wants to merge 1 commit

Conversation

Jackmin801
Member

Without this, running the wrong simulate_multi_node shell script doesn't raise an error; instead it creates DiLoCo workers that don't communicate with each other.

This PR makes training fail with an error in that case, since you would never want this to happen silently.

@Jackmin801 Jackmin801 requested a review from samsja September 30, 2024 03:58
Collaborator

@samsja samsja left a comment


I don't understand why this PR is needed. I might want to run DiLoCo with only one worker, especially for ablations.

@Jackmin801
Member Author

Hrmm, makes sense. What if we set the default global size to 0 and error in DiLoCo for that? The DiLoCo launch script will set it to 1 and pass, but the non-DiLoCo launch script sets nothing, gets 0, and will thus fail if it tries to create the DiLoCo class.
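A minimal sketch of the proposed check, assuming a `GLOBAL_WORLD_SIZE` environment variable and a `Diloco` class; these names are illustrative, not necessarily the repo's actual identifiers:

```python
import os


def get_global_world_size() -> int:
    # Hypothetical: default to 0 so a launch script that never sets the
    # variable (i.e. the non-DiLoCo script) is detectable later.
    return int(os.environ.get("GLOBAL_WORLD_SIZE", "0"))


class Diloco:
    """Illustrative stand-in for the DiLoCo wrapper class."""

    def __init__(self, global_world_size: int):
        if global_world_size < 1:
            # Fail loudly instead of silently creating workers that
            # never communicate.
            raise ValueError(
                "global world size is 0: GLOBAL_WORLD_SIZE was not set, "
                "so the non-DiLoCo launch script was likely used; "
                "refusing to construct DiLoCo."
            )
        self.global_world_size = global_world_size
```

Under this scheme, a single-worker DiLoCo run (global size 1) still constructs fine; only the unset (0) case errors out.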

@Jackmin801 Jackmin801 requested a review from samsja September 30, 2024 04:25
@samsja
Collaborator

samsja commented Sep 30, 2024

> hrmm makes sense. what if we set the default global size to 0 and error in diloco for that? the diloco launch script will set 1 and pass but the non-diloco launch script sets nothing and gets 0 and will thus fail if it tries to create the diloco class.

This feels hacky IMO. And I still want to be able to start DiLoCo with one worker purely using torchrun.

@Jackmin801
Member Author

Hrmm, makes sense. I guess we just have to be careful with the scripts.

@Jackmin801 Jackmin801 closed this Oct 1, 2024