Training hangs mid-run: the process is in a sleeping state and GPU utilization is 0 #3290

Open
2043078895 opened this issue Feb 26, 2025 · 2 comments
Comments

@2043078895

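The symptom is as described in the title.
It started after I installed who knows how many libraries in a messy round of changes while trying to upgrade versions.
After I killed the hung process, the following traceback was printed: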
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/work/.local/swift/cli/main.py", line 76, in <module>
    cli_main()
  File "/home/work/.local/swift/cli/main.py", line 70, in cli_main
    result = subprocess.run(args)
             ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/llm/lib/python3.11/subprocess.py", line 550, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/llm/lib/python3.11/subprocess.py", line 1201, in communicate
    self.wait()
  File "/opt/conda/envs/llm/lib/python3.11/subprocess.py", line 1277, in wait
    self._wait(timeout=sigint_timeout)
  File "/opt/conda/envs/llm/lib/python3.11/subprocess.py", line 2047, in _wait
    time.sleep(delay)
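Another possibly related observation: across repeated runs, training always hangs at the same step. In this single-node multi-GPU setup, one GPU's memory is nearly full while the other GPUs clearly have memory to spare. If this were an out-of-memory condition, I would normally expect an error rather than a hang. Also, before I reinstalled the libraries, a model with the same configuration trained normally without any problems.
Uninstalling deepspeed did not fix the problem.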
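The uneven memory usage described above could be recorded more precisely at the moment of the hang. The following is a minimal sketch, not part of ms-swift: the helper names and the 0.95 threshold are assumptions chosen for illustration, and `query_gpus` assumes `nvidia-smi` is available on the machine.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (values in MiB) into (index, used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def nearly_full(rows, threshold=0.95):
    """Return the indices of GPUs whose used/total ratio is at or above `threshold`."""
    return [index for index, used, total in rows if used / total >= threshold]


def query_gpus():
    """Query per-GPU memory through nvidia-smi (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

Calling `nearly_full(query_gpus())` while the job is stuck would list the GPUs that are close to their memory limit, which makes it easier to check whether the same device fills up on every run.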
Versions of some possibly relevant libraries are listed below:
transformers 4.47.0
transformers-stream-generator 0.0.5
triton 2.1.0
trl 0.14.0
scikit-learn 1.6.1
scipy 1.15.2
sentence-transformers 3.2.1
sentencepiece 0.2.0
seqeval 1.2.2
ms-swift 3.0.3
ms-vlmeval 0.0.13
torch 2.4.0
torchvision 0.19.0
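
Since the hang appeared after a round of package changes, a version list like the one above is most useful when it can be compared against a working environment. The sketch below is one possible way to do that with the standard library; `snapshot` and `diff_snapshots` are hypothetical helper names, not part of any tool mentioned in this thread.

```python
from importlib import metadata


def snapshot(names):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_snapshots(old, new):
    """Return {name: (old_version, new_version)} for every entry that differs."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}
```

Running `snapshot(["torch", "transformers", "trl", "ms-swift"])` in each environment and passing the two results to `diff_snapshots` would narrow the search to the packages whose versions actually changed.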

@tastelikefeet
Collaborator

Could you try again in a fresh environment:

conda create -n new_env python=3.11
conda activate new_env
pip install 'ms-swift[llm]'
or, from a source checkout:
pip install '.[llm]'

@2043078895
Author

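After rebuilding the environment the problem did go away, but I would still like to understand the underlying cause. Is there a way to pinpoint it?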
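
One observation that may help with pinpointing: the traceback in the original report comes from the launcher, which is blocked in `subprocess.run` waiting for a child process, so it does not show where the training process itself is stuck. A minimal sketch of one way to get that information, assuming a Unix system and that the function can be called early in the training process (for example from a custom entry script, which is an assumption and not something ms-swift is known to provide), uses the standard-library `faulthandler` module:

```python
import faulthandler
import signal
import sys


def enable_stack_dump(stream=sys.stderr):
    """Dump the Python stack of every thread to `stream` when SIGUSR1 arrives.

    `stream` must stay open and expose a real file descriptor, because
    faulthandler writes to it directly from the signal handler.
    """
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
```

With this in place, sending `kill -USR1 <pid>` to each hung training process would print where every rank is waiting. If the training code cannot be modified, an external sampler such as `py-spy dump --pid <pid>` is another option for inspecting a running Python process.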
