[Feat] custom ring all-reduce #23

Merged
merged 26 commits into from
Oct 4, 2024

@Jackmin801 Jackmin801 commented Sep 30, 2024

`torchrun --nproc-per-node 2 scripts/all_reduce_test.py --backend custom --transfer_dtype uint8` on saturn:

| Backend | Dtype | Time Taken (s) | Bandwidth (GB/s) |
| --- | --- | --- | --- |
| gloo | float32 | 3.12 | 1.28 |
| custom | float32 | 6.90 | 0.58 |
| custom | bfloat16 | 4.73 | 0.85 |
| custom | uint8 | 8.99 | 0.45 |
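
In every row above, time multiplied by bandwidth comes out at roughly 4 GB, which suggests the bandwidth column is the size of the reduced tensor divided by the wall-clock time, regardless of the transfer dtype. The sketch below reproduces the column under that assumption. The tensor size (1e9 float32 elements) and the helper name `effective_bandwidth_gb_s` are inferred or hypothetical; the formula actually used in `scripts/all_reduce_test.py` is not shown in this thread.

```python
# Assumption (inferred from the table, not confirmed by the PR):
# the benchmark reduces about 1e9 float32 elements, i.e. ~4 GB.
ASSUMED_TENSOR_BYTES = 1_000_000_000 * 4


def effective_bandwidth_gb_s(time_taken_s: float,
                             tensor_bytes: int = ASSUMED_TENSOR_BYTES) -> float:
    """Tensor size divided by wall-clock time, in GB/s (1 GB = 1e9 bytes)."""
    return tensor_bytes / time_taken_s / 1e9


# (backend, transfer dtype, time in seconds) copied from the table above.
rows = [
    ("gloo", "float32", 3.12),
    ("custom", "float32", 6.90),
    ("custom", "bfloat16", 4.73),
    ("custom", "uint8", 8.99),
]

for backend, dtype, seconds in rows:
    bandwidth = effective_bandwidth_gb_s(seconds)
    print(f"{backend:6s} {dtype:8s} {seconds:5.2f} s  {bandwidth:.2f} GB/s")
```

Because the reported times are rounded to two decimals, the recomputed values can differ from the table in the last digit.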

However, under the 500 Mbps test, it beats the gloo backend.


A 1B all-reduce takes 102 s under 500 Mbps conditions. (I think the 4 procs share the limit though, so we may be able to divide the time taken by 4 if the links are each individually 500 Mbps.)
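
To reason about the shared-versus-individual link question, here is a rough back-of-the-envelope sketch. It assumes a textbook ring all-reduce (reduce-scatter followed by all-gather), in which each rank sends 2(N-1)/N times the payload, and it ignores latency, protocol overhead, and compute. The function name and parameters are hypothetical, and the transfer dtype used for the 102 s run is not stated in this thread, so several are tried.

```python
def ring_allreduce_seconds(num_elements: int,
                           bytes_per_element: int,
                           world_size: int,
                           link_mbps: float,
                           ranks_sharing_link: int = 1) -> float:
    """Bandwidth-only estimate for a textbook ring all-reduce.

    Each rank sends 2 * (N - 1) / N times the payload over its link.
    ranks_sharing_link > 1 models several ranks splitting one rate limit.
    """
    payload_bits = num_elements * bytes_per_element * 8
    per_rank_bits = 2 * (world_size - 1) / world_size * payload_bits
    per_rank_bits_per_s = link_mbps * 1e6 / ranks_sharing_link
    return per_rank_bits / per_rank_bits_per_s


# Assumptions: 1e9 elements, 4 ranks, a 500 Mbps limit.
for name, size in [("float32", 4), ("bfloat16", 2), ("uint8", 1)]:
    own_link = ring_allreduce_seconds(1_000_000_000, size, 4, 500)
    shared = ring_allreduce_seconds(1_000_000_000, size, 4, 500,
                                    ranks_sharing_link=4)
    print(f"{name:8s} own link: {own_link:6.1f} s   shared link: {shared:6.1f} s")
```

In this model, sharing one limit among 4 ranks makes the transfer exactly 4 times slower, which is the same factor of 4 suggested above for converting between the two cases.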

Jackmin801 and others added 9 commits September 30, 2024 15:35
* add all reduce test

* introduce different all reduce backend and add testing

* add custom backend to all reduce test

* fix group passing gloo all reduce

* use cap because jack ask for it even tho its ugly
@Jackmin801 Jackmin801 requested a review from samsja October 3, 2024 09:45

@samsja samsja left a comment

LGTM, added some small non-blocking comments.

@Jackmin801 Jackmin801 requested a review from samsja October 4, 2024 21:45
@Jackmin801 Jackmin801 requested a review from samsja October 4, 2024 22:43

@samsja samsja left a comment

lfgtm

@samsja samsja merged commit 753fb78 into main Oct 4, 2024
1 check passed