Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use rolling window for mfu calculation #10

Merged
merged 2 commits into from
Sep 26, 2024
Merged

fix: use rolling window for mfu calculation #10

merged 2 commits into from
Sep 26, 2024

Conversation

samsja
Copy link
Collaborator

@samsja samsja commented Sep 25, 2024

this pr fix the weird unstable mfu. It use a rolling average of tokens_per_second. The old implementation that was measuring the time per steps was unstable, lead to weird mfu (>100 with grad_acc=1).

[h100 sxm]
1b with 4 gpus : 48.70
1b with 8 gpus: 47.68

@samsja samsja requested a review from Jackmin801 September 25, 2024 20:12
@samsja samsja marked this pull request as draft September 25, 2024 21:59
@samsja samsja changed the title feat: use monotmic time fix: use rolling window for mfu calculaction Sep 25, 2024
@samsja samsja changed the title fix: use rolling window for mfu calculaction fix: use rolling window for mfu calculation Sep 25, 2024
@samsja samsja marked this pull request as ready for review September 26, 2024 02:50
@samsja
Copy link
Collaborator Author

samsja commented Sep 26, 2024

mfu seems correct, it start high tho and then reduce, I don't know why: https://wandb.ai/primeintellect/debug_1B_zero_band/runs/c403bo2n?nw=nwusersamsja_pi
Screenshot from 2024-09-25 19-50-59

I also checked the reported tokens_per_sec is 62k which exactly matched the total_tokens = 5502402560 divided by the time of training 147.75 * 60 seconds.

So I would say this pr is ready to merge very confident in the reported mfu

@samsja samsja requested a review from JohannesHa September 26, 2024 02:54
Copy link
Member

@Jackmin801 Jackmin801 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@JohannesHa JohannesHa merged commit b624c43 into main Sep 26, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants