dpu #178

Open: samsja wants to merge 3 commits into main

Conversation

@samsja (Collaborator) commented on Dec 15, 2024

No description provided.

@samsja changed the title from dou to dpu on Dec 15, 2024
@samsja requested a review from Jackmin801 on Dec 15, 2024 at 01:04
Comment on lines +63 to +67
```
:Args:
    model: The model to be trained
    elastic_device_mesh: The elastic device mesh to be used
    dpu: Whether to use delayed parameter updates
```

Member: Args have changed
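
(For context, a minimal sketch of the constructor these Args appear to document; the names come from the docstring above and the usage example below, while the exact signature and the default for `dpu` are assumptions, not the PR's code:)

```
class GlobalDDP:
    def __init__(self, model, elastic_device_mesh, dpu: bool = False):
        # model: the model to be trained
        # elastic_device_mesh: the elastic device mesh used for gradient communication
        # dpu: whether to use delayed parameter updates (default assumed here)
        self.model = model
        self.elastic_device_mesh = elastic_device_mesh
        self.dpu = dpu
```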

Example usage:

```
global_ddp = GlobalDDP(model, elastic_device_mesh)
```
Member: Example changed

Comment on lines +74 to +81
```
for micro_bs in range(num_micro_bs):
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()

global_ddp.all_reduce()
optimizer.step()
diloco.step(model)
```
Member: The optimizer zero_grad() is in the wrong place? This would mean only the last micro batch contributes gradients.

Collaborator (Author): lol good catch


```
self.model = model

self._staling_grad_work: list[AllReduceGradWork] | None = None
```
Member: staling -> stalling

Collaborator (Author): staline gradient ?

Member: stalingrad
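
(For readers landing here: "dpu" stands for delayed parameter updates, i.e. the gradient all-reduce is launched asynchronously and its result is only consumed one step later, so the optimizer steps on slightly stale averaged gradients while communication overlaps with compute. Below is a minimal sketch of that pattern using torch.distributed; the class name, the `_stale_grad_work` bookkeeping, and the exact flow are assumptions, not this PR's implementation:)

```
import torch.distributed as dist


class DelayedGradAllReduce:
    """Sketch of delayed parameter updates: step t applies gradients averaged at step t-1."""

    def __init__(self, model):
        self.model = model
        # (work handle, parameter, buffer being reduced) from the previous step, if any
        self._stale_grad_work = None

    def all_reduce(self):
        # 1. Snapshot this step's local gradients before they are overwritten below;
        #    their all-reduce runs asynchronously and is consumed at the next call.
        new_bufs = [
            (p, p.grad.detach().clone())
            for p in self.model.parameters()
            if p.grad is not None
        ]
        # 2. Finish last step's async all-reduce and copy the averaged, one-step-stale
        #    gradients into .grad so the upcoming optimizer.step() uses them.
        #    (Nothing to wait on at the very first call, so step 0 uses local grads.)
        if self._stale_grad_work is not None:
            for work, param, buf in self._stale_grad_work:
                work.wait()
                param.grad.copy_(buf)
        # 3. Launch this step's async all-reduce and keep the work handles for next time.
        #    ReduceOp.AVG needs NCCL; with other backends use SUM and divide by world size.
        self._stale_grad_work = [
            (dist.all_reduce(buf, op=dist.ReduceOp.AVG, async_op=True), p, buf)
            for p, buf in new_bufs
        ]
```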
