[Example] Multi node training on XPU device #9490
base: master
Conversation
@DamianSzwichtenberg I plan to add a doc for it. Where should I put the doc, in the current folder?
Thanks. Is there any reason we cannot merge this with `multi_gpu/distributed_sampling.py`?
@rusty1s Good question. Actually, I had the same confusion as you, and I do think it is possible to merge them. However, I want to clarify why I didn't. As a new user, I believe it's essential to follow the existing conventions of the repo, and I noticed that for NVIDIA GPUs two separate files are already provided.

Additionally, if you closely examine the code under `multi_gpu`, you'll find several redundant pieces of code; some variations are only due to different datasets or training configurations. I counted a total of 12 files for NVIDIA GPUs. Ideally, by introducing some conditional branches, we could consolidate many of these files, potentially reducing them to 3-4. What are your thoughts on this approach?
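To make the consolidation idea concrete, here is a minimal sketch (not code from this PR) of how a single example could branch on an accelerator argument instead of duplicating whole files. The `--accelerator` flag and the `get_device_and_backend` helper are hypothetical names, not part of the existing examples:

```python
# Hypothetical sketch: one example script, device choice behind a flag.
import argparse
import os

import torch
import torch.distributed as dist


def get_device_and_backend(accelerator: str, local_rank: int):
    """Map the requested accelerator to a torch.device and a distributed backend."""
    if accelerator == 'xpu':
        # The 'ccl' backend additionally requires importing oneccl_bindings_for_pytorch.
        return torch.device('xpu', local_rank), 'ccl'    # Intel GPUs via oneCCL
    if accelerator == 'cuda':
        return torch.device('cuda', local_rank), 'nccl'  # NVIDIA GPUs via NCCL
    return torch.device('cpu'), 'gloo'                   # CPU fallback


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--accelerator', default='cuda',
                        choices=['cuda', 'xpu', 'cpu'])
    args = parser.parse_args()

    local_rank = int(os.environ.get('LOCAL_RANK', '0'))
    device, backend = get_device_and_backend(args.accelerator, local_rank)
    dist.init_process_group(backend=backend)  # env:// rendezvous set up by the launcher
    # ... the dataset, sampler, model, and training loop stay shared ...
```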
@rusty1s Updated with a README and guide. Please have a look. Thanks.
This is great! Do you have an idea which part of the example script is XPU-specific? We could consider merging this file with `distributed_sampling_multinode.py` if the only change is in the launch command and the `torch.device`.
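For reference, a rough sketch of the pieces that are likely XPU-specific, assuming the Intel extension packages named below are what the example relies on; everything else (sampler, model, training loop) would match `distributed_sampling_multinode.py`. This is an illustration, not the PR's actual diff:

```python
# Sketch of the likely XPU-specific lines; the two Intel extension imports are
# assumptions about the required packages, not confirmed by this PR.
import os

import torch
import torch.distributed as dist

import intel_extension_for_pytorch  # noqa: F401  # provides the 'xpu' device type
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the 'ccl' backend

local_rank = int(os.environ['LOCAL_RANK'])

# The CUDA version would use backend='nccl' and torch.device('cuda', local_rank):
dist.init_process_group(backend='ccl')    # oneCCL instead of NCCL
device = torch.device('xpu', local_rank)  # 'xpu' instead of 'cuda'
```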
This is a follow-up PR for issue #9464.
I referred to distributed_sampling_multinode.py to create an XPU version of multi-node training.
I used two nodes for testing, each node with 1 XPU card.
See the following screenshot; it has passed the test on an internal cluster.
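For context, here is an illustrative mapping of how the two processes in this "two nodes, one XPU card each" setup line up with the standard PyTorch launcher environment variables; the values shown are derived from the topology described above, not from the PR's launch command:

```python
# Illustrative mapping for a 2-node x 1-XPU run; RANK/LOCAL_RANK/WORLD_SIZE are
# standard variables set by torchrun-style launchers, not values from the PR.
import os

# Node 0: RANK=0, LOCAL_RANK=0, WORLD_SIZE=2
# Node 1: RANK=1, LOCAL_RANK=0, WORLD_SIZE=2

rank = int(os.environ.get('RANK', '0'))
local_rank = int(os.environ.get('LOCAL_RANK', '0'))  # always 0 with one card per node
world_size = int(os.environ.get('WORLD_SIZE', '1'))  # 2 = 2 nodes x 1 XPU each
print(f'rank={rank}, local_rank={local_rank}, world_size={world_size}')
```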