Timed out initializing process group in store based barrier on rank: 2 #48

yotaroshimose · 2022-07-04T10:23:49Z

Hi, thank you for sharing your great work!

I tried to run your training scripts, but my machine has only 4 GPUs, so I changed WORLD SIZE from 8 to 4 in the original yaml file.

Now it either fails with "Timed out initializing process group in store based barrier on rank: 2", or it crashes partway through an epoch and my Docker container shuts down (possibly a memory leak?).

Any advice on running your training code successfully?

Thank you for your cooperation.

WjzZwd · 2022-09-23T12:21:13Z

@yotaroshimose
My English is not very good, so I may have misunderstood your question, but here is my advice.
In validate.sh or train.sh, add CUDA_VISIBLE_DEVICES=0,1 directly in front of the line that starts with python ./scripts/validate.py \
(I have 4 GPUs but only use the first and the second.)
Hope this helps.
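
To make the idea concrete, here is a minimal sketch of a sanity check you could run before launching. It is not code from this repository: the helper names `visible_gpu_ids` and `check_world_size` are hypothetical, and it assumes one training process per GPU. It only reads the CUDA_VISIBLE_DEVICES environment variable and compares the number of listed GPUs with the world size you configured. If the world size is larger than the number of GPUs the processes can see, some ranks may never join the group, which is one possible cause of the barrier timeout.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Unset means no restriction is applied through this variable.
        return None
    # Split "0,1" into ["0", "1"], ignoring empty entries and stray spaces.
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def check_world_size(world_size, env=None):
    """Return True if world_size fits the visible GPUs (assumes one process per GPU)."""
    ids = visible_gpu_ids(env)
    if ids is None:
        # Nothing to compare against; the variable does not limit the devices.
        return True
    return world_size <= len(ids)


if __name__ == "__main__":
    # Hypothetical value: use whatever world size your yaml file sets.
    configured_world_size = 2
    ok = check_world_size(configured_world_size)
    print("visible GPUs:", visible_gpu_ids())
    print("world size fits visible GPUs:", ok)
```

With CUDA_VISIBLE_DEVICES=0,1 as in my setup, a world size of 2 passes this check and a world size of 4 does not.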

yotaroshimose · 2022-09-26T01:35:28Z

Thank you for your kind reply.
I will try restricting the devices as you suggest. Thank you.
