We used CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7 to work around a bug in NCCL that caused a NIC port usage conflict at a specific tensor-parallel size. The bug has since been fixed, and using the ascending-order mapping should yield the same performance.
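For illustration, a minimal sketch of how the two mappings could look in a shell config such as config_common.sh; the exact line and comments here are assumptions, not the repository's actual contents:

```bash
# Workaround mapping used in the submission: interleaved GPU order to avoid
# the NCCL NIC port usage conflict at a specific tensor-parallel size.
export CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7

# With the NCCL bug fixed, the plain ascending mapping should perform the same:
# export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```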
@erhoo82 Thank you for your reply. Can you explain why using CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7 works around it? Or is there an NCCL PR related to the fix? Thank you.
Why is the value of CUDA_VISIBLE_DEVICES not configured in ascending order? For example, is CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 better suited for PXN?

Reference: training_results_v3.1/Azure+NVIDIA/benchmarks/gpt3/implementations/pytorch/config_common.sh, line 1 in 5b62935
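As a hedged illustration of what the non-ascending list means in practice (assuming the common convention that the launcher binds local rank i to logical CUDA device i, which is not confirmed by this thread):

```bash
# CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7 makes logical CUDA device i refer to
# the i-th physical GPU in the list, so e.g. local rank 1 lands on physical
# GPU 4 rather than GPU 1.
order=(0 4 2 6 1 5 3 7)
for local_rank in 0 1 2 3 4 5 6 7; do
  echo "local rank ${local_rank} -> logical CUDA device ${local_rank} -> physical GPU ${order[$local_rank]}"
done
```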