Timed out initializing process group in store based barrier on rank: 2 #48

Open
yotaroshimose opened this issue Jul 4, 2022 · 2 comments

Comments

@yotaroshimose

Hi, thank you for sharing your great work!

I tried to run your training scripts, but my machine only has 4 GPUs, so I changed WORLD SIZE from 8 to 4 in the original yaml file.

Training then fails with "Timed out initializing process group in store based barrier on rank: 2", or sometimes it crashes partway through an epoch and my Docker container shuts down (possibly a memory leak?).

Any advice on running your training code successfully?

Thank you for your cooperation.

@WjzZwd

WjzZwd commented Sep 23, 2022

@yotaroshimose
My English is not very good, so I may have misunderstood your question, but here is my advice.
You can add the statement -> CUDA_VISIBLE_DEVICES=0,1
(I have 4 GPUs but only use the first and the second)
before the statement -> python ./scripts/validate.py \
in the file named validate.sh or train.sh.
I hope this helps.
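
A store-based barrier waits until as many processes as the configured world size have joined, so one common cause of this timeout is a world size that does not match the number of processes that actually start. The sketch below is only an illustration of checking that before launching, assuming a single node with one process per visible GPU. The helper names `visible_gpu_ids` and `world_size_matches` are hypothetical and are not part of this repository or of PyTorch.

```python
import os


def visible_gpu_ids(env=None):
    """Return the ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Unset: all GPUs are visible, but we cannot count them without
        # querying the driver, so report "unknown".
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def world_size_matches(world_size, env=None):
    """Check world_size against the number of visible GPUs.

    Assumption: single node, one training process per visible GPU.
    Returns True when the variable is unset, because nothing can be checked.
    """
    ids = visible_gpu_ids(env)
    if ids is None:
        return True
    return world_size == len(ids)


if __name__ == "__main__":
    # Example: two visible GPUs, as in CUDA_VISIBLE_DEVICES=0,1
    example_env = {"CUDA_VISIBLE_DEVICES": "0,1"}
    print(visible_gpu_ids(example_env))
    print(world_size_matches(2, example_env))
    print(world_size_matches(4, example_env))
```

With CUDA_VISIBLE_DEVICES=0,1 only two GPUs are visible, so under the assumption above the world size in the yaml file would also need to be 2.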

@yotaroshimose

Thank you for your kind reply.
I will try restricting the visible devices as you suggest. Thank you.
