NCCL Timeout Bug During Dataset Building with (Large) Datasets #2276
Comments
Thanks for the detailed report! I have not encountered this before so don't have a ready answer, but here are some thoughts/ideas:
So memory is being allocated on both GPUs but only one is being utilized? Does it stay consistently at 0% throughout the training, or does it vary?
The behavior occurs only during dataset initialization; during training, both GPUs are utilized as expected. To clarify, the issue is observed during the dataset build phase, before training begins. Judging from the logs, the processes initialize the dataset sequentially rather than in parallel. While the first process, which uses GPU:0, initializes the dataset, GPU:1 shows a consistent utilization of 100% and GPU:0 shows 0%. Conversely, during the initialization of the second process, GPU:0 shows 100% utilization and GPU:1 shows 0%. This behavior was observed while monitoring GPU utilization.
Ah, I see. Raster Vision does build the datasets sequentially: first for the master DDP process (the rank 0 process) and then for the rest. This is to avoid downloading files (if using remote files) multiple times; the master process downloads the files and the others just use the already-downloaded files. This can be seen in the code here: raster-vision/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py, lines 1145 to 1157 (at commit 0cac03c).
At each step, the idle processes wait for the others to finish via a distributed barrier.
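For illustration, here is a minimal sketch of that rank-0-first pattern, assuming the synchronization is a `dist.barrier()`-style collective (the `build_datasets_ddp` and `build_fn` names are hypothetical; see the linked learner.py lines for the actual implementation). With the NCCL backend a barrier is itself a collective operation, so if rank 0 takes longer than the collective timeout (600000 ms by default) to build its datasets, the waiting rank's watchdog aborts with an ALLREDUCE timeout like the one in the logs below.

```python
import torch.distributed as dist

def build_datasets_ddp(build_fn):
    """Hypothetical sketch of a rank-0-first dataset build under DDP."""
    rank = dist.get_rank()
    if rank == 0:
        # Rank 0 builds (and downloads) the datasets first so that remote
        # files are fetched only once.
        datasets = build_fn()
    # Non-zero ranks block here while rank 0 builds. Under NCCL this barrier
    # is a collective op and is subject to the collective timeout.
    dist.barrier()
    if rank != 0:
        # The remaining ranks build using the already-downloaded files.
        datasets = build_fn()
    # Rank 0 now waits for the others to finish.
    dist.barrier()
    return datasets
```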
🐛 Bug
When training a model using the Raster Vision pipeline with "larger" datasets on a dual-GPU setup, the process crashes after reaching the "Building datasets..." stage. The issue appears to be caused by a timeout during an NCCL collective operation, specifically an ALLREDUCE. Reducing the number of scenes in the dataset allows training to proceed. Throughout the dataset build, one GPU sits at 100% utilization while the other is idle.
Attempted Workarounds:
- Setting the TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC environment variable did not resolve the issue: the new value was not applied and the timeout remained at 600000 ms.
- Disabling monitoring by setting TORCH_NCCL_ENABLE_MONITORING=0 avoids the timeout but disables essential monitoring, which is not ideal. Edit: false positive; TORCH_NCCL_ENABLE_MONITORING=0 also did not solve the issue.
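For context, a sketch of how these knobs are typically applied (untested against this bug, so treat it as an assumption rather than a fix): the TORCH_NCCL_* environment variables have to be set in every worker process before the NCCL process group is created, and PyTorch also accepts an explicit `timeout` argument to `torch.distributed.init_process_group`, which governs collective operations such as the ALLREDUCE seen in the logs below.

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Must be set before the process group is initialized, in every worker
# process (e.g. exported in the shell that launches the training command).
os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"] = "1800"

# Alternative knob: an explicit collective timeout at process-group creation.
# Only applicable where you control the init_process_group call yourself.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))
```

Whether a longer timeout actually helps here is unclear, since the underlying cause is the long sequential dataset build rather than the timeout value itself.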
Logs

Logs captured with NCCL_DEBUG=INFO:
building-panoptic:875384:875432 [1] NCCL INFO ncclCommInitRank comm 0x564b81859bd0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 110 commId 0x1a221728bd4a5a4a - Init COMPLETE
2024-11-18 14:43:22:rastervision.pytorch_learner.learner: INFO - Building datasets ...
[rank1]:[E1118 14:53:22.694659270 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.