[2024-08-22 06:28:30,941] torch.distributed.run: [WARNING]
[2024-08-22 06:28:30,941] torch.distributed.run: [WARNING] *****************************************
[2024-08-22 06:28:30,941] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-22 06:28:30,941] torch.distributed.run: [WARNING] *****************************************
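(For reference, the `OMP_NUM_THREADS` default mentioned in the warning above can be overridden before launch; a minimal sketch, where the value 4 is only an arbitrary example, not a recommendation:)

```python
import os

# Pin OMP_NUM_THREADS explicitly instead of relying on torchrun's default of 1.
# This must be set before torch/OpenMP initializes; "4" is just an example,
# tune it for your CPU count and number of worker processes.
os.environ["OMP_NUM_THREADS"] = "4"
print(os.environ["OMP_NUM_THREADS"])
```

(Equivalently, `export OMP_NUM_THREADS=4` in the shell before invoking `torchrun`.)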
[2024-08-22 06:28:32,833] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-22 06:28:32,837] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-22 06:28:32,838] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-08-22 06:28:34,447] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-22 06:28:34,447] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-22 06:28:34,448] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-22 06:28:34,448] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:08<00:00, 6.86s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:08<00:00, 6.84s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:08<00:00, 6.85s/it]
Loading data...
Formatting inputs...Skip in lazy mode
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
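(As a side note, the kernel warning above compares the running kernel release against 5.5.0; a hedged sketch of the same check in Python — the helper name `kernel_at_least` is ours, not from the trainer:)

```python
import platform

def kernel_at_least(minimum: str = "5.5.0") -> bool:
    """Check the running kernel release (e.g. "5.4.0-150-generic") against a
    minimum version, ignoring the distro suffix after the first "-"."""
    def parse(version: str) -> tuple:
        return tuple(int(part) for part in version.split("."))
    release = platform.release().split("-")[0]
    return parse(release) >= parse(minimum)
```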
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
[2024-08-22 06:30:06,085] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 348 closing signal SIGTERM
[2024-08-22 06:30:06,450] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 347) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
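(For what it's worth, a negative `exitcode` from torch's elastic launcher means the worker was killed by a signal, the signal number being the absolute value; a short sketch decoding it:)

```python
import signal

# torchrun reported exitcode -7, i.e. the worker died from signal 7, which is
# SIGBUS on Linux. SIGBUS during data loading is commonly caused by exhausted
# shared memory, e.g. a small /dev/shm inside a Docker container.
exitcode = -7
print(signal.Signals(-exitcode).name)  # prints "SIGBUS" on Linux
```

(If this really is a shared-memory SIGBUS, enlarging `/dev/shm`, e.g. Docker's `--shm-size`, is a common fix; that is an assumption about this particular failure, not a confirmed diagnosis.)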
The main errors seem to be in the following part:
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
Using a single 4090 with finetune_lora_single_ds.sh runs fine, but with two 4090s and finetune_lora_ds.sh it keeps erroring out; the same problem also occurs with QLoRA. This time I switched to fine-tuning on 4x3090s, but it still fails. The error output is shown above; the main errors seem to be in the quoted warnings.
Of those two messages, the first says the Linux kernel is outdated, but that should not affect training; otherwise single-GPU training should not have succeeded either.
As for the second, it seems to say the pretrained weights are in an old checkpointing format, but I checked and it really is the latest Qwen-VL-Chat model.
Does anyone have a solution, or could you share some pointers on multi-GPU fine-tuning? I have swapped GPUs several times over the past few days without success.
Expected Behavior
No response
Steps To Reproduce
1. Python 3.8, CUDA 12.1, torch 2.1.2, transformers 4.36
2. sh finetune/finetune_lora_ds.sh
Environment
Anything else?
No response