
Minimum GPU requirements for full-parameter fine-tuning of Qwen2-72B #4141

Closed
1 task done
zhangbin1997 opened this issue Jun 7, 2024 · 6 comments
Labels
solved (This problem has been already solved)

Comments

@zhangbin1997

Reminder

  • I have read the README and searched the existing issues.

System Info

Reproduction

As the title says:
For full-parameter fine-tuning of Qwen2-72B with a training max length of 32768, on 128 A800 GPUs (80 GB each) with ZeRO-3 + offload, is this configuration simply bound to run out of GPU memory, or is there a way to get it running?
Thanks!

Expected behavior

No response

Others

No response

@hiyouga
Owner

hiyouga commented Jun 7, 2024

Is your batch size 1? Try lowering cutoff_len by a factor of two or so and see whether it runs.

@hiyouga hiyouga added the pending (This problem is yet to be addressed) label Jun 7, 2024
@zhangbin1997
Author

Is your batch size 1? Try lowering cutoff_len by a factor of two or so and see whether it runs.

Yes, per_device_train_batch_size=1, but I really do need packing up to 32768, the maximum length Qwen supports. Is it just that my GPUs aren't enough (a budget problem), or is there some configuration that uses less GPU memory?

Here is my ZeRO-3 + offload config:
{
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "weight_decay": "auto",
      "torch_adam": true,
      "adam_w_mode": true
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "sub_group_size": 1e9,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "zero_allow_untested_optimizer": true
}
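
For a rough sense of why this setup can still hit the memory limit even though ZeRO-3 with CPU offload moves the parameters, gradients, and optimizer states off the GPUs: the activations of each micro-batch stay on the GPU and grow linearly with sequence length. The sketch below is a back-of-envelope estimate only; it assumes Qwen2-72B's published configuration (80 decoder layers, hidden size 8192) and bf16 activations, none of which comes from this thread.

# Back-of-envelope estimate of per-GPU activation memory for one packed
# 32k-token micro-batch (per_device_train_batch_size=1, as in this thread).
# Assumed, not from the thread: 80 decoder layers, hidden size 8192, bf16.

seq_len = 32_768   # packed training length used in this thread
hidden  = 8_192    # assumed Qwen2-72B hidden size
layers  = 80       # assumed Qwen2-72B decoder layer count
bf16    = 2        # bytes per element

# Even with full activation checkpointing, each layer still stores its
# input, a [seq_len, hidden] bf16 tensor, until backward reaches it.
checkpointed_inputs = seq_len * hidden * bf16 * layers
print(f"checkpointed layer inputs: ~{checkpointed_inputs / 2**30:.0f} GiB per GPU")  # ~40 GiB

# Without activation checkpointing, every intermediate tensor (QKV, MLP,
# norms, ...) is also kept, which is roughly an order of magnitude more,
# so it cannot fit on an 80 GB card regardless of how ZeRO shards the states.

On an 80 GB card, roughly 40 GiB of checkpointed activations leaves little headroom once attention/MLP working buffers, ZeRO-3's gathered parameter shards, and communication buffers are added, which is broadly consistent with the OOM reported here and with the suggestion to lower cutoff_len.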

@hiyouga
Owner

hiyouga commented Jun 7, 2024

Fine-tuning with sequences this long probably isn't well supported at the moment; we will add methods for it later. I'd suggest training with an 8k length for now.

@zuxin666

zuxin666 commented Jun 8, 2024

@hiyouga Could this repo perhaps be used as a reference later on to improve long-sequence fine-tuning? https://github.com/jzhang38/EasyContext

@silvercherry

When I do full-parameter SFT on qwen2-72b, it runs fine with only 1,000 samples, but once I use the 2M + 1.1M datasets, the same configuration runs out of GPU memory. Have you run into this problem?

@Huarong

Huarong commented Jun 27, 2024

When I do full-parameter SFT on qwen2-72b, it runs fine with only 1,000 samples, but once I use the 2M + 1.1M datasets, the same configuration runs out of GPU memory. Have you run into this problem?

@silvercherry Most likely some very long training samples are mixed into the 2M dataset. Compute the token-length statistics for the 1,000-sample set and the 2M set separately, and filter out the long samples.
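
A minimal sketch of that check, assuming an alpaca-style JSON file with instruction/input/output fields and the Qwen2 tokenizer from Hugging Face; the file names and field names are placeholders, so adjust them to the actual dataset format:

# Sketch: measure token lengths in a dataset and drop overly long samples.
# Assumed placeholders: "train_2m.json" is an alpaca-style JSON list with
# "instruction", "input" and "output" fields. For millions of samples you
# would want batched or multi-process tokenization; this is the plain version.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B")
max_tokens = 32_768  # the training max length used in this thread

with open("train_2m.json", "r", encoding="utf-8") as f:   # placeholder path
    samples = json.load(f)

lengths, kept = [], []
for sample in samples:
    text = sample.get("instruction", "") + sample.get("input", "") + sample.get("output", "")
    n = len(tokenizer(text, add_special_tokens=False)["input_ids"])
    lengths.append(n)
    if n <= max_tokens:
        kept.append(sample)

lengths.sort()
print(f"samples: {len(lengths)}, max: {lengths[-1]}, "
      f"p99: {lengths[int(0.99 * len(lengths))]}")
print(f"kept {len(kept)} / {len(samples)} samples under {max_tokens} tokens")

with open("train_2m_filtered.json", "w", encoding="utf-8") as f:  # placeholder path
    json.dump(kept, f, ensure_ascii=False)

Filtering (or truncating) the long samples keeps the realized sequence length, and therefore the activation memory, close to what the 1,000-sample run used.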

Repository owner locked and limited conversation to collaborators Jun 28, 2024
@hiyouga hiyouga converted this issue into discussion #4615 Jun 28, 2024
@hiyouga hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label Jun 28, 2024

