
Better device handling #1301

Open · wants to merge 1 commit into main
Conversation

@EIFY EIFY commented Dec 1, 2024

I can't reproduce the issue of every process allocating memory on GPU 0 (#969), so the underlying issue may have been fixed. Regardless, use of torch.cuda.set_device is now discouraged in favor of simply setting CUDA_VISIBLE_DEVICES. Furthermore, validate() and AverageMeter previously tried to determine on their own which device to use; we should just pass in the agreed-upon device.
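As a rough sketch of the CUDA_VISIBLE_DEVICES approach (the helper name and the one-GPU-per-rank mapping here are illustrative, not code from this PR): each spawned worker restricts visibility to its own GPU before any CUDA context is created, so inside every process the chosen GPU is always addressed as "cuda:0".

```python
import os


def pin_worker_to_gpu(rank: int) -> str:
    """Illustrative per-process setup for a DDP worker.

    Instead of calling torch.cuda.set_device(rank), set
    CUDA_VISIBLE_DEVICES before any CUDA context exists in this
    process. The framework then only ever sees one device, which
    it addresses as "cuda:0".
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
    return "cuda:0"


# Example: the worker with rank 3 in an 8-GPU job.
device = pin_worker_to_gpu(3)
print(os.environ["CUDA_VISIBLE_DEVICES"], device)
```

The key constraint is ordering: the environment variable must be set before the first CUDA call in the process, which is why this belongs at the top of the spawned worker function.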

Tested on an 8-GPU instance. I verified that both commands below still work:

python main.py -a resnet50 --gpu 1 --evaluate --batch-size 1024 /data/ImageNet/

python main.py -a resnet50 --dist-url 'tcp://127.0.0.1:60000' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --batch-size 1024 /data/ImageNet/
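The other change described above, passing the agreed-upon device into validate() and AverageMeter rather than letting them guess, could look roughly like this (a torch-free sketch; the names and signatures are illustrative, not the actual diff):

```python
class AverageMeter:
    """Tracks a running average; the device is supplied by the caller
    rather than inferred inside the class (illustrative sketch)."""

    def __init__(self, name: str, device: str):
        self.name = name
        self.device = device  # e.g. "cuda:0", agreed upon by the caller
        self.sum = 0.0
        self.count = 0

    def update(self, val: float, n: int = 1) -> None:
        self.sum += val * n
        self.count += n

    @property
    def avg(self) -> float:
        return self.sum / self.count if self.count else 0.0


def validate(batches, device: str) -> float:
    """Toy stand-in for the example's validate(): it receives the
    device explicitly instead of detecting it on its own."""
    meter = AverageMeter("loss", device)
    for loss, batch_size in batches:
        meter.update(loss, batch_size)
    return meter.avg


# Two batches of size 4 with losses 0.5 and 0.3 -> average loss ~0.4
avg = validate([(0.5, 4), (0.3, 4)], "cuda:0")
print(avg)
```

Threading the device through as a parameter keeps a single source of truth for device selection, instead of each helper re-deriving it from global CUDA state.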

# After detaching the screen, `nvidia-smi` shows that the second command results in the expected, distributed GPU usage:

$ nvidia-smi
Sun Dec  1 01:01:17 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:07:00.0 Off |                    0 |
| N/A   56C    P0             94W /  400W |   14949MiB /  40960MiB |     62%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:08:00.0 Off |                    0 |
| N/A   49C    P0             99W /  400W |   15093MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:09:00.0 Off |                    0 |
| N/A   52C    P0            105W /  400W |   15097MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:0A:00.0 Off |                    0 |
| N/A   55C    P0            103W /  400W |   15095MiB /  40960MiB |     92%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  |   00000000:0B:00.0 Off |                    0 |
| N/A   54C    P0            102W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  |   00000000:0C:00.0 Off |                    0 |
| N/A   50C    P0            103W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  |   00000000:0D:00.0 Off |                    0 |
| N/A   47C    P0             97W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  |   00000000:0E:00.0 Off |                    0 |
| N/A   49C    P0             85W /  400W |   14953MiB /  40960MiB |     73%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2800326      C   /usr/bin/python                             14940MiB |
|    1   N/A  N/A   2800327      C   /usr/bin/python                             15084MiB |
|    2   N/A  N/A   2800328      C   /usr/bin/python                             15088MiB |
|    3   N/A  N/A   2800329      C   /usr/bin/python                             15084MiB |
|    4   N/A  N/A   2800330      C   /usr/bin/python                             15084MiB |
|    5   N/A  N/A   2800331      C   /usr/bin/python                             15084MiB |
|    6   N/A  N/A   2800332      C   /usr/bin/python                             15084MiB |
|    7   N/A  N/A   2800333      C   /usr/bin/python                             14944MiB |
+-----------------------------------------------------------------------------------------+

See pytorch#969
Setting CUDA_VISIBLE_DEVICES is now the recommended way to handle DDP device IDs (https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html).

netlify bot commented Dec 1, 2024

Deploy Preview for pytorch-examples-preview canceled.

Latest commit: d10a469
Latest deploy log: https://app.netlify.com/sites/pytorch-examples-preview/deploys/674c035432f73f0008c38141

EIFY (author) commented Dec 1, 2024

By the way, there are quite a few unattended PRs for this example, some of them years old:
#1291
#1118
#1001
#825
#551
