[Example] Multi node training on XPU device #9490
base: master
Conversation
@DamianSzwichtenberg I plan to add a doc for it. Where should I put the doc, in the current folder?
Thanks. Is there any reason we cannot merge this with `multi_gpu/distributed_sampling.py`?
@rusty1s Good question. Actually, I had the same confusion as you, and I do think it is possible to merge them. However, I want to clarify why I didn't. As a new user, I believe it's essential to follow the existing conventions of the repo, and I noticed that for NVIDIA GPUs two separate files are already provided.

Additionally, if you closely examine the code under `multi_gpu`, you'll find several redundant pieces of code; some variations are only due to different datasets or training configurations. I counted a total of 12 files for NVIDIA GPUs. Ideally, by introducing some conditional branches, we could consolidate many of these files, potentially reducing them to 3-4. What are your thoughts on this approach?
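To make the consolidation idea concrete, here is a minimal sketch (not code from this PR) of how a single example could branch on an accelerator argument instead of duplicating whole files. The `--accelerator` flag and the `get_device_and_backend` helper are hypothetical names, not part of the existing examples:

```python
# Hypothetical sketch: one example script, device choice behind a flag.
import argparse
import os

import torch
import torch.distributed as dist


def get_device_and_backend(accelerator: str, local_rank: int):
    """Map the requested accelerator to a torch.device and a distributed backend."""
    if accelerator == 'xpu':
        # The 'ccl' backend additionally requires importing oneccl_bindings_for_pytorch.
        return torch.device('xpu', local_rank), 'ccl'    # Intel GPUs via oneCCL
    if accelerator == 'cuda':
        return torch.device('cuda', local_rank), 'nccl'  # NVIDIA GPUs via NCCL
    return torch.device('cpu'), 'gloo'                   # CPU fallback


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--accelerator', default='cuda',
                        choices=['cuda', 'xpu', 'cpu'])
    args = parser.parse_args()

    local_rank = int(os.environ.get('LOCAL_RANK', '0'))
    device, backend = get_device_and_backend(args.accelerator, local_rank)
    dist.init_process_group(backend=backend)  # env:// rendezvous set up by the launcher
    # ... the dataset, sampler, model, and training loop stay shared ...
```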
@rusty1s Updated with a README and guide. Please have a look. Thanks.
This is great! Do you have an idea which part of the example script is XPU-specific? We could consider merging this file with `distributed_sampling_multinode.py` if the only change is in the launch command and the `torch.device`.
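For reference, a rough sketch of the pieces that are likely XPU-specific, assuming the Intel extension packages named below are what the example relies on; everything else (sampler, model, training loop) would match `distributed_sampling_multinode.py`. This is an illustration, not the PR's actual diff:

```python
# Sketch of the likely XPU-specific lines; the two Intel extension imports are
# assumptions about the required packages, not confirmed by this PR.
import os

import torch
import torch.distributed as dist

import intel_extension_for_pytorch  # noqa: F401  # provides the 'xpu' device type
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the 'ccl' backend

local_rank = int(os.environ['LOCAL_RANK'])

# The CUDA version would use backend='nccl' and torch.device('cuda', local_rank):
dist.init_process_group(backend='ccl')    # oneCCL instead of NCCL
device = torch.device('xpu', local_rank)  # 'xpu' instead of 'cuda'
```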
This is a follow-up PR for issue #9464.
I referred to distributed_sampling_multinode.py to create an XPU version of multi-node training.
I used two nodes for testing, each node with 1 XPU card.
See the following screenshot; it has passed the test on an internal cluster.
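For context, here is an illustrative mapping of how the two processes in this "two nodes, one XPU card each" setup line up with the standard PyTorch launcher environment variables; the values shown are derived from the topology described above, not from the PR's launch command:

```python
# Illustrative mapping for a 2-node x 1-XPU run; RANK/LOCAL_RANK/WORLD_SIZE are
# standard variables set by torchrun-style launchers, not values from the PR.
import os

# Node 0: RANK=0, LOCAL_RANK=0, WORLD_SIZE=2
# Node 1: RANK=1, LOCAL_RANK=0, WORLD_SIZE=2

rank = int(os.environ.get('RANK', '0'))
local_rank = int(os.environ.get('LOCAL_RANK', '0'))  # always 0 with one card per node
world_size = int(os.environ.get('WORLD_SIZE', '1'))  # 2 = 2 nodes x 1 XPU each
print(f'rank={rank}, local_rank={local_rank}, world_size={world_size}')
```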