Run hangs indefinitely on two A100s #111

Closed
luokehan opened this issue Dec 11, 2024 · 6 comments

Comments

@luokehan

I installed everything following the "🚀 Parallel Inference on Multiple GPUs by xDiT" section of the documentation. The command I ran was:

torchrun --nproc_per_node=2 sample_video.py \
--video-size 960 960 \
--video-length 129 \
--infer-steps 20 \
--prompt "A cat walks on the grass, realistic style." \
--flow-reverse \
--seed -1 \
--ulysses-degree 1 \
--ring-degree 2 \
--save-path ./results

Everything before this point ran normally (no errors), but it stays stuck at this step:

0%| | 0/20 [00:00<?, ?it/s

Here is the GPU status:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:AD:00.0 Off | 0 |
| N/A 43C P0 101W / 300W | 47373MiB / 81920MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:AF:00.0 Off | 0 |
| N/A 42C P0 100W / 300W | 46875MiB / 81920MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
I don't know how to fix this.

@feifeibear
Contributor

Where exactly does it hang?

--ulysses-degree 2 \ --ring-degree 1

Could you try setting the ulysses degree to 2?
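
For reference, the two settings discussed here can be enumerated with a short sketch. It assumes (this is not confirmed anywhere in this thread) that the product of `--ulysses-degree` and `--ring-degree` has to equal the number of processes given to `--nproc_per_node`. The helper name `degree_pairs` is hypothetical and not part of the project.

```python
def degree_pairs(num_gpus: int) -> list[tuple[int, int]]:
    """Return every (ulysses_degree, ring_degree) pair whose product is num_gpus.

    Assumption: the two degrees must multiply to the process count.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be a positive integer")
    # Every divisor u of num_gpus gives one valid pair (u, num_gpus // u).
    return [(u, num_gpus // u) for u in range(1, num_gpus + 1) if num_gpus % u == 0]


if __name__ == "__main__":
    # With two GPUs there are only two candidates: the original setting and the suggested one.
    for ulysses, ring in degree_pairs(2):
        print(f"--ulysses-degree {ulysses} --ring-degree {ring}")
```

Under that assumption, two GPUs leave only the pair from the original command (1, 2) and the pair suggested here (2, 1).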

@luokehan
Author

It's probably because I don't have enough memory; I only have 90 GB 🤡 Sorry about that.

@feifeibear
Contributor

Does each GPU have only 40 GB of memory?

@luokehan
Author

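No, each GPU has 80 GB (VRAM), and there are two A100s. But this runs on a cluster, and the machine I requested has only 90 GB of RAM. I'm not sure whether that is enough 😳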

@luokehan
Author

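Where exactly does it hang?

--ulysses-degree 2 \ --ring-degree 1

Could you try setting the ulysses degree to 2?

I did try that setting; it is in the documentation 🤒 But I don't know why inference doesn't move 😂 It stays stuck at step 0 (the step count is set to 20). On a single A100 a video takes only about 15 minutes, so I wanted to see how two cards would do. I'm a beginner too and was just following the documentation above. I'm giving up on this for now, though; my guess is that the machine doesn't have enough RAM, or there is some other problem.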

@feifeibear
Contributor

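No, each GPU has 80 GB (VRAM), and there are two A100s. But this runs on a cluster, and the machine I requested has only 90 GB of RAM. I'm not sure whether that is enough 😳

This has nothing to do with CPU memory; what matters is how much memory each individual GPU has.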
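
To make the per-GPU figure concrete, here is a minimal sketch that reads the Memory-Usage column from `nvidia-smi` text like the output posted at the top of this thread. The function name `gpu_memory_usage` and the regular expression are illustrative assumptions, not part of any tool.

```python
import re

# Matches the "used / total" pair in the Memory-Usage column, e.g. "47373MiB / 81920MiB".
MEMORY_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def gpu_memory_usage(smi_text: str) -> list[tuple[int, int]]:
    """Return one (used_mib, total_mib) tuple per GPU found in the text."""
    return [(int(used), int(total)) for used, total in MEMORY_RE.findall(smi_text)]


if __name__ == "__main__":
    # The two Memory-Usage rows copied from the output in the original post.
    sample = """
    | N/A 43C P0 101W / 300W | 47373MiB / 81920MiB | 100% Default |
    | N/A 42C P0 100W / 300W | 46875MiB / 81920MiB | 100% Default |
    """
    for index, (used, total) in enumerate(gpu_memory_usage(sample)):
        print(f"GPU {index}: {used} MiB of {total} MiB used ({used / total:.0%})")
```

On the figures from the original post, both cards report well under their 80 GB capacity while stuck at step 0.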
