NCCL Timeout Bug During Dataset Building with (Large) Datasets #2276

Open
lumoe opened this issue Nov 18, 2024 · 5 comments

lumoe commented Nov 18, 2024

🐛 Bug

When training a model using the Raster Vision pipeline with a "larger" dataset on a dual-GPU setup, the process crashes after reaching the "Building datasets..." stage. The issue appears to be caused by a timeout during NCCL operations, specifically an ALLREDUCE operation. Reducing the number of scenes in the dataset allows the training to proceed. All the while, GPU utilization is at 100% on one GPU while the second is idle.

Attempted Workarounds:

  1. Increasing the TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC environment variable did not resolve the issue: the new value was not applied and the timeout remained at 600000 ms (see the note on the timeout right after this list).
  2. Disabling monitoring by setting TORCH_NCCL_ENABLE_MONITORING=0 avoids the timeout but disables essential monitoring, which is not ideal. Edit: false positive; see item 4.
  3. Reducing the number of scenes (10 instead of 3500) in the dataset allows training to proceed.
  4. Disabling heartbeat monitoring with TORCH_NCCL_ENABLE_MONITORING=0 also did not solve the issue.
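
Note on the 600000 ms value: it appears to match PyTorch's default collective timeout for NCCL process groups (10 minutes), which is controlled by the timeout argument to torch.distributed.init_process_group rather than by the heartbeat environment variables. The process group here is set up internally (by Raster Vision / torchrun), so whether this can be raised from the pipeline is not clear; the snippet below is only a generic PyTorch sketch of where the value comes from.

```python
# Generic PyTorch sketch (not Raster Vision's actual setup): the 600000 ms
# watchdog timeout in the log corresponds to the NCCL process group's
# default `timeout` of 10 minutes, which can be raised like this.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend='nccl',
    timeout=timedelta(hours=2),  # NCCL default is timedelta(minutes=10)
)
```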

Logs

Logs captured with NCCL_DEBUG=INFO

building-panoptic:875384:875432 [1] NCCL INFO ncclCommInitRank comm 0x564b81859bd0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 110 commId 0x1a221728bd4a5a4a - Init COMPLETE     
2024-11-18 14:43:22:rastervision.pytorch_learner.learner: INFO - Building datasets ...                                                                                                  
[rank1]:[E1118 14:53:22.694659270 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
AdeelH (Collaborator) commented Nov 18, 2024

Thanks for the detailed report! I have not encountered this before, so I don't have a ready answer, but here are some thoughts/ideas:

  • Disabling monitoring by setting TORCH_NCCL_ENABLE_MONITORING=0 avoids the timeout but disables essential monitoring, which is not ideal.

    • Does it take longer than 600000ms to build the datasets in this case?
    • Does the training proceed correctly?
    • Are both GPUs utilized during training?
  • All the while, GPU utilization is at 100% on one GPU while the second is idle.

    • Raster Vision logs the number of GPUs detected at the start. Does it detect both GPUs?
    • Unlikely, but could there possibly be something wrong with one of the GPUs? Can you successfully do single-GPU training on each, e.g. by setting CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1 and then training? (A quick check is sketched right after this list.)
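
For the single-GPU check, a minimal generic snippet (plain PyTorch, not Raster Vision-specific):

```python
# Run once with CUDA_VISIBLE_DEVICES=0 and once with CUDA_VISIBLE_DEVICES=1
# to confirm each GPU is visible and usable on its own.
import torch

print('CUDA available:', torch.cuda.is_available())
print('Device count:', torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```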

lumoe (Author) commented Nov 18, 2024

I updated my initial description. Disabling heartbeat monitoring with TORCH_NCCL_ENABLE_MONITORING=0 did not solve this issue - although I initially thought that it worked.

Does it take longer than 600000ms to build the datasets in this case?

Yes.

Does the training proceed correctly?

Yes.

Are both GPUs utilized during training?

Yes.

Raster Vision logs the number of GPUs detected at the start. Does it detect both GPUs?

Yes.

Update:
I now have it up and running by not passing a single large GeoJSON file, but splitting it up alongside the TIF files. This brought the data loading time to just under 10 minutes - 9:30 or so 😅
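
For anyone hitting the same issue, a rough sketch of this kind of split is below. It assumes geopandas and rasterio are available and that the labels can simply be partitioned by each raster's extent; it is not the exact procedure used here, and the file names are placeholders.

```python
# Hypothetical sketch: split one large GeoJSON of labels into one GeoJSON
# per raster, so each scene only loads the labels it needs.
import geopandas as gpd
import rasterio
from shapely.geometry import box

labels = gpd.read_file('labels.geojson')  # placeholder path

for tif_path in ['scene_0001.tif', 'scene_0002.tif']:  # placeholder paths
    with rasterio.open(tif_path) as src:
        raster_extent = box(*src.bounds)
        raster_crs_wkt = src.crs.to_wkt()
    # Reproject labels to the raster's CRS and keep only intersecting features.
    per_scene = labels.to_crs(raster_crs_wkt)
    per_scene = per_scene[per_scene.intersects(raster_extent)]
    per_scene.to_file(tif_path.replace('.tif', '.geojson'), driver='GeoJSON')
```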

One more observation: I am using the rastervision CLI to launch the training, run according to the docs with torchrun --standalone --nnodes=1 --nproc-per-node=2 --no-python. This means PyTorch spawns a new process per GPU, and rastervision therefore initializes the dataset in each process.
While the dataset is being initialized for the process on GPU:0, nvidia-smi shows 100% GPU utilization on GPU:1, and vice versa.

[screenshot: nvidia-smi output showing the GPU utilization pattern described above]

AdeelH (Collaborator) commented Nov 20, 2024

So memory is being allocated on both GPUs but only one is being utilized? Does it stay consistently at 0% throughout the training or does it vary?

lumoe (Author) commented Nov 20, 2024

The behavior occurs only during dataset initialization; during training, both GPUs are utilized as expected.

To clarify, the issue is observed during the dataset build phase, before training begins. Judging by the logs, the processes are initialized sequentially rather than in parallel.

When the first process, which uses GPU:0, initializes the dataset, GPU:1 shows a consistent utilization of 100%, while GPU:0 shows 0% utilization.

Conversely, during the initialization of the second process, GPU:0 shows 100% utilization, and GPU:1 shows 0% utilization.

This behavior was observed using watch -n 1 nvidia-smi.

AdeelH (Collaborator) commented Nov 20, 2024

Ah, I see. Raster Vision does build the datasets sequentially: first for the master DDP process (the rank 0 process) and then for the rest. This is to avoid downloading files (if using remote files) multiple times; the master process downloads the files and the others just use the already-downloaded files.

This can be seen in code here:

if distributed:  # pragma: no cover
    if self.is_ddp_local_master:
        train_ds, valid_ds, test_ds = self.build_datasets()
        log.debug(f'{self.ddp_rank=} Done.')
    else:
        log.debug(f'{self.ddp_rank=} Waiting.')
    dist.barrier()
    if not self.is_ddp_local_master:
        train_ds, valid_ds, test_ds = self.build_datasets()
        log.debug(f'{self.ddp_rank=} Done.')
    else:
        log.debug(f'{self.ddp_rank=} Waiting.')
    dist.barrier()

At each step, the idle processes wait for the others to finish via dist.barrier(). It seems like waiting at dist.barrier() is manifesting as 100% utilization in nvidia-smi. I am not sure why that might be.
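
A plausible explanation (not verified here): with the NCCL backend, dist.barrier() is implemented as a one-element all-reduce, and the waiting rank's NCCL kernel spins on the GPU until its peer arrives, which nvidia-smi reports as 100% utilization. That would also match the timed-out WorkNCCL(OpType=ALLREDUCE, NumelIn=1) in the original log: the barrier itself appears to be the collective that times out once dataset building exceeds the 10-minute NCCL timeout. One possible mitigation would be to synchronize this phase on a CPU-side gloo group with a longer timeout. A minimal sketch, assuming access to the process group setup (build_datasets is a placeholder, and the real code uses the local-master check rather than rank 0):

```python
# Minimal sketch (not Raster Vision's actual code): wait out a long,
# rank-0-first dataset build on a gloo-backed (CPU) group so idle ranks
# neither spin on the GPU nor hit the NCCL collective timeout.
from datetime import timedelta

import torch.distributed as dist

# Separate CPU-side group used only for coarse synchronization.
cpu_group = dist.new_group(backend='gloo', timeout=timedelta(hours=2))

if dist.get_rank() == 0:
    train_ds, valid_ds, test_ds = build_datasets()  # placeholder for the long build
dist.barrier(group=cpu_group)
if dist.get_rank() != 0:
    train_ds, valid_ds, test_ds = build_datasets()
dist.barrier(group=cpu_group)
```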
