NCCL Timeout Bug During Dataset Building with (Large) Datasets #2276

Open
lumoe opened this issue Nov 18, 2024 · 5 comments

lumoe commented Nov 18, 2024

🐛 Bug

When training a model using the Raster Vision pipeline with a "larger" dataset on a dual-GPU setup, the process crashes after reaching the "Building datasets..." stage. The issue appears to be caused by a timeout during NCCL operations, specifically an ALLREDUCE operation. Reducing the number of scenes in the dataset allows the training to proceed. All the while, GPU utilization is at 100% on one GPU while the second is idle.

Attempted Workarounds:

  1. Increasing the TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC environment variable did not resolve the issue: the new value was not applied and the timeout remained at 600000 ms (see the note on the timeout right after this list).
  2. Disabling monitoring by setting TORCH_NCCL_ENABLE_MONITORING=0 avoids the timeout but disables essential monitoring, which is not ideal. Edit: false positive; see item 4.
  3. Reducing the number of scenes (10 instead of 3500) in the dataset allows training to proceed.
  4. Disabling heartbeat monitoring with TORCH_NCCL_ENABLE_MONITORING=0 also did not solve the issue.
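
Note on the 600000 ms value: it appears to match PyTorch's default collective timeout for NCCL process groups (10 minutes), which is controlled by the timeout argument to torch.distributed.init_process_group rather than by the heartbeat environment variables. The process group here is set up internally (by Raster Vision / torchrun), so whether this can be raised from the pipeline is not clear; the snippet below is only a generic PyTorch sketch of where the value comes from.

```python
# Generic PyTorch sketch (not Raster Vision's actual setup): the 600000 ms
# watchdog timeout in the log corresponds to the NCCL process group's
# default `timeout` of 10 minutes, which can be raised like this.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend='nccl',
    timeout=timedelta(hours=2),  # NCCL default is timedelta(minutes=10)
)
```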

Logs

Logs captured with NCCL_DEBUG=INFO

building-panoptic:875384:875432 [1] NCCL INFO ncclCommInitRank comm 0x564b81859bd0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 110 commId 0x1a221728bd4a5a4a - Init COMPLETE     
2024-11-18 14:43:22:rastervision.pytorch_learner.learner: INFO - Building datasets ...                                                                                                  
[rank1]:[E1118 14:53:22.694659270 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
AdeelH (Collaborator) commented Nov 18, 2024

Thanks for the detailed report! I have not encountered this before, so I don't have a ready answer, but here are some thoughts/ideas:

  • Disabling monitoring by setting TORCH_NCCL_ENABLE_MONITORING=0 avoids the timeout but disables essential monitoring, which is not ideal.

    • Does it take longer than 600000ms to build the datasets in this case?
    • Does the training proceed correctly?
    • Are both GPUs utilized during training?
  • All the while, GPU utilization is at 100% on one GPU while the second is idle.

    • Raster Vision logs the number of GPUs detected at the start. Does it detect both GPUs?
    • Unlikely, but could there possibly be something wrong with one of the GPUs? Can you successfully do single-GPU training on each, e.g. by setting CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1 and then training? (A quick check is sketched right after this list.)
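
For the single-GPU check, a minimal generic snippet (plain PyTorch, not Raster Vision-specific):

```python
# Run once with CUDA_VISIBLE_DEVICES=0 and once with CUDA_VISIBLE_DEVICES=1
# to confirm each GPU is visible and usable on its own.
import torch

print('CUDA available:', torch.cuda.is_available())
print('Device count:', torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```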

lumoe (Author) commented Nov 18, 2024

I updated my initial description. Disabling heartbeat monitoring with TORCH_NCCL_ENABLE_MONITORING=0 did not solve this issue - although I initially thought that it worked.

Does it take longer than 600000ms to build the datasets in this case?

Yes.

Does the training proceed correctly?

Yes.

Are both GPUs utilized during training?

Yes.

Raster Vision logs the number of GPUs detected at the start. Does it detect both GPUs?

Yes.

Update:
I now have it up and running by not passing a single large GeoJSON file, but splitting it up alongside the TIF files. This brought the data loading time to just under 10 minutes - 9:30 or so 😅
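
For anyone hitting the same issue, a rough sketch of this kind of split is below. It assumes geopandas and rasterio are available and that the labels can simply be partitioned by each raster's extent; it is not the exact procedure used here, and the file names are placeholders.

```python
# Hypothetical sketch: split one large GeoJSON of labels into one GeoJSON
# per raster, so each scene only loads the labels it needs.
import geopandas as gpd
import rasterio
from shapely.geometry import box

labels = gpd.read_file('labels.geojson')  # placeholder path

for tif_path in ['scene_0001.tif', 'scene_0002.tif']:  # placeholder paths
    with rasterio.open(tif_path) as src:
        raster_extent = box(*src.bounds)
        raster_crs_wkt = src.crs.to_wkt()
    # Reproject labels to the raster's CRS and keep only intersecting features.
    per_scene = labels.to_crs(raster_crs_wkt)
    per_scene = per_scene[per_scene.intersects(raster_extent)]
    per_scene.to_file(tif_path.replace('.tif', '.geojson'), driver='GeoJSON')
```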

One more observation: I am using the rastervision CLI to launch the training, run according to the docs with torchrun --standalone --nnodes=1 --nproc-per-node=2 --no-python. This means PyTorch spawns a new process per GPU, and rastervision therefore initializes the dataset in each process.
While the dataset is being initialized for the process on GPU:0, nvidia-smi shows 100% GPU utilization on GPU:1, and vice versa.

[screenshot: nvidia-smi output showing the GPU utilization pattern described above]

AdeelH (Collaborator) commented Nov 20, 2024

So memory is being allocated on both GPUs but only one is being utilized? Does it stay consistently at 0% throughout the training or does it vary?

lumoe (Author) commented Nov 20, 2024

The behavior occurs only during dataset initialization; during training, both GPUs are utilized as expected.

To clarify, the issue is observed during the dataset build phase, before training begins. Judging by the logs, the processes are initialized sequentially rather than in parallel.

When the first process, which uses GPU:0, initializes the dataset, GPU:1 shows a consistent utilization of 100%, while GPU:0 shows 0% utilization.

Conversely, during the initialization of the second process, GPU:0 shows 100% utilization, and GPU:1 shows 0% utilization.

This behavior was observed using watch -n 1 nvidia-smi.

AdeelH (Collaborator) commented Nov 20, 2024

Ah, I see. Raster Vision does build the datasets sequentially: first for the master DDP process (the rank 0 process) and then for the rest. This is to avoid downloading files (if using remote files) multiple times; the master process downloads the files and the others just use the already-downloaded files.

This can be seen in code here:

if distributed:  # pragma: no cover
    if self.is_ddp_local_master:
        train_ds, valid_ds, test_ds = self.build_datasets()
        log.debug(f'{self.ddp_rank=} Done.')
    else:
        log.debug(f'{self.ddp_rank=} Waiting.')
    dist.barrier()
    if not self.is_ddp_local_master:
        train_ds, valid_ds, test_ds = self.build_datasets()
        log.debug(f'{self.ddp_rank=} Done.')
    else:
        log.debug(f'{self.ddp_rank=} Waiting.')
    dist.barrier()

At each step, the idle processes wait for the others to finish via dist.barrier(). It seems like waiting at dist.barrier() is manifesting as 100% utilization in nvidia-smi. I am not sure why that might be.
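
A plausible explanation (not verified here): with the NCCL backend, dist.barrier() is implemented as a one-element all-reduce, and the waiting rank's NCCL kernel spins on the GPU until its peer arrives, which nvidia-smi reports as 100% utilization. That would also match the timed-out WorkNCCL(OpType=ALLREDUCE, NumelIn=1) in the original log: the barrier itself appears to be the collective that times out once dataset building exceeds the 10-minute NCCL timeout. One possible mitigation would be to synchronize this phase on a CPU-side gloo group with a longer timeout. A minimal sketch, assuming access to the process group setup (build_datasets is a placeholder, and the real code uses the local-master check rather than rank 0):

```python
# Minimal sketch (not Raster Vision's actual code): wait out a long,
# rank-0-first dataset build on a gloo-backed (CPU) group so idle ranks
# neither spin on the GPU nor hit the NCCL collective timeout.
from datetime import timedelta

import torch.distributed as dist

# Separate CPU-side group used only for coarse synchronization.
cpu_group = dist.new_group(backend='gloo', timeout=timedelta(hours=2))

if dist.get_rank() == 0:
    train_ds, valid_ds, test_ds = build_datasets()  # placeholder for the long build
dist.barrier(group=cpu_group)
if dist.get_rank() != 0:
    train_ds, valid_ds, test_ds = build_datasets()
dist.barrier(group=cpu_group)
```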
