
[BUG] <Failed to Finetune for multi GPUs / Multi-GPU fine-tuning keeps failing> #454

Open
2 tasks done
mokby opened this issue Aug 22, 2024 · 0 comments
mokby commented Aug 22, 2024

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • I have searched the FAQ

Current Behavior

Fine-tuning on a single 4090 with finetune_lora_single_ds.sh runs fine, but on two 4090s with finetune_lora_ds.sh it keeps failing, and the same problem also occurs with QLoRA. This time I switched to 4x 3090s for fine-tuning, but it still fails. The error output is as follows:

[2024-08-22 06:28:30,941] torch.distributed.run: [WARNING]                                                                                                 
[2024-08-22 06:28:30,941] torch.distributed.run: [WARNING] *****************************************                                                       
[2024-08-22 06:28:30,941] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-22 06:28:30,941] torch.distributed.run: [WARNING] *****************************************                                                       
[2024-08-22 06:28:32,833] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)                                    
[2024-08-22 06:28:32,837] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)                                    
[2024-08-22 06:28:32,838] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)                                    
/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-08-22 06:28:34,447] [INFO] [comm.py:637:init_distributed] cdb=None                                                                                   
[2024-08-22 06:28:34,447] [INFO] [comm.py:637:init_distributed] cdb=None                                                                                   
[2024-08-22 06:28:34,448] [INFO] [comm.py:637:init_distributed] cdb=None                                                                                   
[2024-08-22 06:28:34,448] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl                                   
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:08<00:00,  6.86s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:08<00:00,  6.84s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:08<00:00,  6.85s/it]
Loading data...                                                                                                                                            
Formatting inputs...Skip in lazy mode
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
[2024-08-22 06:30:06,085] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 348 closing signal SIGTERM
[2024-08-22 06:30:06,450] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 347) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

The main errors seem to be in the following part:

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.

Of these two warnings, the first says the Linux kernel is old, but that should not affect training; otherwise single-GPU training would not have succeeded either.
As for the second one, it seems to say the pretrained weights are an older version, but I checked and I am indeed using the latest Qwen-VL-Chat model.
Does anyone have a solution, or could you share some pointers on multi-GPU fine-tuning? I have tried several different GPU setups over the past few days and none of them worked.
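
For reference, a minimal sketch of loading the model with the current transformers gradient-checkpointing API (an illustration only, not the code from finetune_lora_ds.sh; the model id is assumed, and bf16=True just follows the hint printed in the log above):

# Hedged sketch only -- not the repository's fine-tuning code.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat",      # assumed model id
    trust_remote_code=True,   # Qwen-VL ships its own modeling_qwen.py
    bf16=True,                # per the "automatically converting to bf16" notice above
)

# The deprecation warning apparently fires because the remote modeling code
# still defines `_set_gradient_checkpointing`; checkpointing itself can still
# be enabled through the public API:
model.gradient_checkpointing_enable()
print(model.is_gradient_checkpointing)  # True if checkpointing is active

If that reading is right, the warning refers to the modeling file bundled with the checkpoint rather than to the weights being outdated.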

Expected Behavior

No response

Steps To Reproduce

1. Python 3.8, CUDA 12.1, torch 2.12, transformers 4.36
2. sh finetune/finetune_lora_ds.sh (a distributed smoke-test sketch follows this list)
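
Since the failure only shows up with multiple GPUs, a bare torchrun + NCCL smoke test can help separate a communication problem from a problem in the fine-tuning script. This is a hypothetical diagnostic, not part of the Qwen-VL repo; the file name ddp_smoke_test.py and the --nproc_per_node value are placeholders:

# ddp_smoke_test.py -- hypothetical diagnostic, not part of Qwen-VL.
# Launch with: torchrun --nproc_per_node=4 ddp_smoke_test.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for every worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A single all_reduce across all GPUs; if this hangs or a worker dies,
    # the problem is in the multi-GPU communication path rather than in
    # the fine-tuning code itself.
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()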

Environment

- OS: Ubuntu 18.04
- Python: 3.8
- Transformers: 4.36.0
- PyTorch: 2.12
- CUDA: 12.1

Anything else?

No response
