Hi, thank you for sharing your great work!
I tried to run your training scripts, but my machine only has 4 GPUs, so I changed WORLD SIZE from 8 to 4 in the original yaml file.
Then it says "Timed out initializing process group in store based barrier on rank: 2", or sometimes it suddenly crashes partway through an epoch and my Docker container shuts down (possibly a memory leak?).
Any advice on running your training code successfully?
Thank you for your cooperation.
@yotaroshimose
My English is not very good, so I may have misunderstood your question, but here is my advice.
In validate.sh or train.sh, add CUDA_VISIBLE_DEVICES=0,1 directly before the command that starts with python ./scripts/validate.py \
(I have 4 GPUs but only use the first and the second, so adjust the list to the GPUs you want to use.)
Hope this helps.
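To make the suggestion above easier to check, here is a minimal sketch in plain Python that compares the configured world size with the GPUs listed in CUDA_VISIBLE_DEVICES. It assumes one process per GPU on a single machine, which this thread does not confirm. The helper names `visible_gpu_ids` and `check_world_size` are hypothetical and are not part of the repository. A store-based barrier waits for every rank to join, so a world size larger than the number of processes actually launched is one plausible cause of the timeout, though not necessarily the cause here.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Unset means CUDA exposes every GPU on the machine.
        return None
    # An empty string yields an empty list, i.e. no GPU is visible.
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def check_world_size(world_size, env=None):
    """Raise ValueError if world_size exceeds the visible GPU count.

    Assumption (hypothetical setup): one process per GPU, single node.
    """
    ids = visible_gpu_ids(env)
    if ids is not None and world_size > len(ids):
        raise ValueError(
            f"world size {world_size} exceeds the {len(ids)} visible GPU(s): {ids}"
        )
    return True


if __name__ == "__main__":
    # Example: a world size of 4 with only GPUs 0 and 1 visible is rejected.
    try:
        check_world_size(4, {"CUDA_VISIBLE_DEVICES": "0,1"})
    except ValueError as err:
        print(err)
```

Running a check like this before launching training would flag a mismatch between the yaml setting and the visible GPUs early, instead of waiting for the process group to time out.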