The symptom is as described in the title.
It started after I installed an unknown number of packages in a flurry of ad hoc operations while trying to update versions.
After I killed the hung process, the following traceback was printed:
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/work/.local/swift/cli/main.py", line 76, in <module>
    cli_main()
  File "/home/work/.local/swift/cli/main.py", line 70, in cli_main
    result = subprocess.run(args)
             ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/llm/lib/python3.11/subprocess.py", line 550, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/llm/lib/python3.11/subprocess.py", line 1201, in communicate
    self.wait()
  File "/opt/conda/envs/llm/lib/python3.11/subprocess.py", line 1277, in wait
    self._wait(timeout=sigint_timeout)
  File "/opt/conda/envs/llm/lib/python3.11/subprocess.py", line 2047, in _wait
    time.sleep(delay)
Another possibly related observation: across repeated runs, training always hangs at the same step. In this single-node multi-GPU setup, the memory of one GPU is almost completely used while the other GPUs still have plenty of free memory. If this were an out-of-memory condition, I would normally expect an error rather than a hang. Also, before I reinstalled the packages, the same model with the same configuration trained without any problems.
Uninstalling deepspeed did not help.
Versions of some possibly relevant packages:
transformers 4.47.0
transformers-stream-generator 0.0.5
triton 2.1.0
trl 0.14.0
scikit-learn 1.6.1
scipy 1.15.2
sentence-transformers 3.2.1
sentencepiece 0.2.0
seqeval 1.2.2
ms-swift 3.0.3
ms-vlmeval 0.0.13
torch 2.4.0
torchvision 0.19.0
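The traceback in the report comes only from the launcher: `swift/cli/main.py` is sitting in `subprocess.run`, waiting for its child process, so it does not show where the training process itself is stuck. One possible way to get that information is the standard-library `faulthandler` module, which can dump the Python stack of every thread when the process receives a signal. The sketch below is a minimal illustration, not part of ms-swift; the helper name `enable_stack_dump` and the choice of `SIGUSR1` are assumptions, and it only works on Unix-like systems and only if you can call it early in the training entry point.

```python
import faulthandler
import signal
import sys


def enable_stack_dump(sig=signal.SIGUSR1, stream=sys.stderr):
    """Hypothetical helper: dump all Python thread stacks when `sig` arrives.

    `stream` must be a real file (it needs a file descriptor) and must stay
    open for as long as the handler is registered.
    """
    # faulthandler.register is Unix-only; chain=False replaces any previous
    # handler for this signal instead of calling it afterwards.
    faulthandler.register(sig, file=stream, all_threads=True, chain=False)
    return sig


if __name__ == "__main__":
    # Call this once near the top of the script that each worker runs.
    enable_stack_dump()
```

With this in place, sending the signal to a hung worker (for example `kill -USR1 <pid>` for each training process) should print that worker's current stack to its stderr without terminating it, which can help tell whether the ranks are waiting at different points.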
Could you try again in a fresh environment?
conda create -n new_env python==3.11
conda install swift[llm]
or
conda install .[llm]
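Reinstalling the environment did make the problem go away, but I would still like to understand the underlying cause. Is there a way to pinpoint it?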
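One way to narrow this down, assuming the old environment still exists or a package list like the one in the report was saved, is to compare the installed versions of the broken and the working environment and then change the differing packages one at a time. The sketch below is only an illustration of that comparison; the function names `parse_packages` and `diff_packages` and the file names are hypothetical. It accepts both `pip freeze` output (`name==version`) and `pip list` output (`name version`).

```python
import sys
from pathlib import Path


def parse_packages(text):
    """Parse `pip freeze` or `pip list` output into {name: version}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, the dashed rule of `pip list`, and `-e` lines.
        if not line or line.startswith(("#", "-")):
            continue
        if "==" in line:
            name, version = line.split("==", 1)
        else:
            parts = line.split()
            # Skip the `pip list` header and anything that is not "name version".
            if len(parts) != 2 or parts == ["Package", "Version"]:
                continue
            name, version = parts
        # Normalize names so "Foo_Bar" and "foo-bar" compare equal.
        packages[name.strip().lower().replace("_", "-")] = version.strip()
    return packages


def diff_packages(old, new):
    """Return {name: (old_version, new_version)}; None means not installed."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (file names are placeholders): python diff_envs.py broken.txt fresh.txt
    broken = parse_packages(Path(sys.argv[1]).read_text())
    fresh = parse_packages(Path(sys.argv[2]).read_text())
    for name, (before, after) in diff_packages(broken, fresh).items():
        print(f"{name}: {before} -> {after}")
```

The two input files could be produced by running `pip freeze > broken.txt` in the old environment and `pip freeze > fresh.txt` in the new one. The diff only lists candidates; confirming which package is responsible would still require installing the suspect versions into the working environment one by one and rerunning the training job.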