
Better device handling #1301

Open · wants to merge 1 commit into main
Conversation

@EIFY EIFY commented Dec 1, 2024

I can't reproduce the issue of every process allocating memory on GPU 0 (#969), so the underlying issue may have been fixed. Regardless, use of torch.cuda.set_device is now discouraged in favor of simply setting CUDA_VISIBLE_DEVICES. Furthermore, validate() and AverageMeter previously tried to determine on their own which device to use; we should just pass in the agreed-upon device.
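As a rough sketch of the CUDA_VISIBLE_DEVICES approach (the helper name and the one-GPU-per-rank mapping here are illustrative, not code from this PR): each spawned worker restricts visibility to its own GPU before any CUDA context is created, so inside every process the chosen GPU is always addressed as "cuda:0".

```python
import os


def pin_worker_to_gpu(rank: int) -> str:
    """Illustrative per-process setup for a DDP worker.

    Instead of calling torch.cuda.set_device(rank), set
    CUDA_VISIBLE_DEVICES before any CUDA context exists in this
    process. The framework then only ever sees one device, which
    it addresses as "cuda:0".
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
    return "cuda:0"


# Example: the worker with rank 3 in an 8-GPU job.
device = pin_worker_to_gpu(3)
print(os.environ["CUDA_VISIBLE_DEVICES"], device)
```

The key constraint is ordering: the environment variable must be set before the first CUDA call in the process, which is why this belongs at the top of the spawned worker function.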

Tested on an 8-GPU instance. I verified that both commands below still work:

python main.py -a resnet50 --gpu 1 --evaluate --batch-size 1024 /data/ImageNet/

python main.py -a resnet50 --dist-url 'tcp://127.0.0.1:60000' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --batch-size 1024 /data/ImageNet/
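The other change described above, passing the agreed-upon device into validate() and AverageMeter rather than letting them guess, could look roughly like this (a torch-free sketch; the names and signatures are illustrative, not the actual diff):

```python
class AverageMeter:
    """Tracks a running average; the device is supplied by the caller
    rather than inferred inside the class (illustrative sketch)."""

    def __init__(self, name: str, device: str):
        self.name = name
        self.device = device  # e.g. "cuda:0", agreed upon by the caller
        self.sum = 0.0
        self.count = 0

    def update(self, val: float, n: int = 1) -> None:
        self.sum += val * n
        self.count += n

    @property
    def avg(self) -> float:
        return self.sum / self.count if self.count else 0.0


def validate(batches, device: str) -> float:
    """Toy stand-in for the example's validate(): it receives the
    device explicitly instead of detecting it on its own."""
    meter = AverageMeter("loss", device)
    for loss, batch_size in batches:
        meter.update(loss, batch_size)
    return meter.avg


# Two batches of size 4 with losses 0.5 and 0.3 -> average loss ~0.4
avg = validate([(0.5, 4), (0.3, 4)], "cuda:0")
print(avg)
```

Threading the device through as a parameter keeps a single source of truth for device selection, instead of each helper re-deriving it from global CUDA state.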

# After detaching the screen, `nvidia-smi` shows that the second command results in the expected, distributed GPU usage:

$ nvidia-smi
Sun Dec  1 01:01:17 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:07:00.0 Off |                    0 |
| N/A   56C    P0             94W /  400W |   14949MiB /  40960MiB |     62%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:08:00.0 Off |                    0 |
| N/A   49C    P0             99W /  400W |   15093MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:09:00.0 Off |                    0 |
| N/A   52C    P0            105W /  400W |   15097MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:0A:00.0 Off |                    0 |
| N/A   55C    P0            103W /  400W |   15095MiB /  40960MiB |     92%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  |   00000000:0B:00.0 Off |                    0 |
| N/A   54C    P0            102W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  |   00000000:0C:00.0 Off |                    0 |
| N/A   50C    P0            103W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  |   00000000:0D:00.0 Off |                    0 |
| N/A   47C    P0             97W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  |   00000000:0E:00.0 Off |                    0 |
| N/A   49C    P0             85W /  400W |   14953MiB /  40960MiB |     73%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2800326      C   /usr/bin/python                             14940MiB |
|    1   N/A  N/A   2800327      C   /usr/bin/python                             15084MiB |
|    2   N/A  N/A   2800328      C   /usr/bin/python                             15088MiB |
|    3   N/A  N/A   2800329      C   /usr/bin/python                             15084MiB |
|    4   N/A  N/A   2800330      C   /usr/bin/python                             15084MiB |
|    5   N/A  N/A   2800331      C   /usr/bin/python                             15084MiB |
|    6   N/A  N/A   2800332      C   /usr/bin/python                             15084MiB |
|    7   N/A  N/A   2800333      C   /usr/bin/python                             14944MiB |
+-----------------------------------------------------------------------------------------+

See pytorch#969
Setting CUDA_VISIBLE_DEVICES is now the recommended way to handle DDP device IDs (https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html).

netlify bot commented Dec 1, 2024

Deploy Preview for pytorch-examples-preview canceled.

Latest commit: d10a469
Latest deploy log: https://app.netlify.com/sites/pytorch-examples-preview/deploys/674c035432f73f0008c38141

EIFY (author) commented Dec 1, 2024

By the way, there are quite a few unattended PRs for this example, some of them years old:
#1291
#1118
#1001
#825
#551
