
Minimum GPU requirements for full-parameter fine-tuning of Qwen2-72B #4141

Closed
1 task done
zhangbin1997 opened this issue Jun 7, 2024 · 6 comments
Labels
solved (This problem has been already solved)

Comments

@zhangbin1997

Reminder

  • I have read the README and searched the existing issues.

System Info

Reproduction

As the title says:
For full-parameter fine-tuning of Qwen2-72B with a training max length of 32768, on 128 A800 GPUs (80 GB each) with ZeRO-3 + offload, is this configuration simply bound to run out of GPU memory, or is there a way to get it running?
Thanks!

Expected behavior

No response

Others

No response

@hiyouga
Owner

hiyouga commented Jun 7, 2024

Is your batch size 1? Try lowering cutoff_len by a factor of two or so and see whether it runs.

@hiyouga hiyouga added the pending (This problem is yet to be addressed) label Jun 7, 2024
@zhangbin1997
Author

Is your batch size 1? Try lowering cutoff_len by a factor of two or so and see whether it runs.

Yes, per_device_train_batch_size=1, but I really do need packing up to 32768, the maximum length Qwen supports. Is it just that my GPUs aren't enough (a budget problem), or is there some configuration that uses less GPU memory?

Here is my ZeRO-3 + offload config:
{
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "weight_decay": "auto",
      "torch_adam": true,
      "adam_w_mode": true
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "sub_group_size": 1e9,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "zero_allow_untested_optimizer": true
}
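
For a rough sense of why this setup can still hit the memory limit even though ZeRO-3 with CPU offload moves the parameters, gradients, and optimizer states off the GPUs: the activations of each micro-batch stay on the GPU and grow linearly with sequence length. The sketch below is a back-of-envelope estimate only; it assumes Qwen2-72B's published configuration (80 decoder layers, hidden size 8192) and bf16 activations, none of which comes from this thread.

# Back-of-envelope estimate of per-GPU activation memory for one packed
# 32k-token micro-batch (per_device_train_batch_size=1, as in this thread).
# Assumed, not from the thread: 80 decoder layers, hidden size 8192, bf16.

seq_len = 32_768   # packed training length used in this thread
hidden  = 8_192    # assumed Qwen2-72B hidden size
layers  = 80       # assumed Qwen2-72B decoder layer count
bf16    = 2        # bytes per element

# Even with full activation checkpointing, each layer still stores its
# input, a [seq_len, hidden] bf16 tensor, until backward reaches it.
checkpointed_inputs = seq_len * hidden * bf16 * layers
print(f"checkpointed layer inputs: ~{checkpointed_inputs / 2**30:.0f} GiB per GPU")  # ~40 GiB

# Without activation checkpointing, every intermediate tensor (QKV, MLP,
# norms, ...) is also kept, which is roughly an order of magnitude more,
# so it cannot fit on an 80 GB card regardless of how ZeRO shards the states.

On an 80 GB card, roughly 40 GiB of checkpointed activations leaves little headroom once attention/MLP working buffers, ZeRO-3's gathered parameter shards, and communication buffers are added, which is broadly consistent with the OOM reported here and with the suggestion to lower cutoff_len.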

@hiyouga
Owner

hiyouga commented Jun 7, 2024

Fine-tuning with sequences this long probably isn't well supported at the moment; we will add methods for it later. I'd suggest training with an 8k length for now.

@zuxin666

zuxin666 commented Jun 8, 2024

@hiyouga Could this repo perhaps be used as a reference later on to improve long-sequence fine-tuning? https://github.com/jzhang38/EasyContext

@silvercherry

When I do full-parameter SFT on qwen2-72b, it runs fine with only 1,000 samples, but once I use the 2M + 1.1M datasets, the same configuration runs out of GPU memory. Have you run into this problem?

@Huarong

Huarong commented Jun 27, 2024

When I do full-parameter SFT on qwen2-72b, it runs fine with only 1,000 samples, but once I use the 2M + 1.1M datasets, the same configuration runs out of GPU memory. Have you run into this problem?

@silvercherry Most likely some very long training samples are mixed into the 2M dataset. Compute the token-length statistics for the 1,000-sample set and the 2M set separately, and filter out the long samples.
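
A minimal sketch of that check, assuming an alpaca-style JSON file with instruction/input/output fields and the Qwen2 tokenizer from Hugging Face; the file names and field names are placeholders, so adjust them to the actual dataset format:

# Sketch: measure token lengths in a dataset and drop overly long samples.
# Assumed placeholders: "train_2m.json" is an alpaca-style JSON list with
# "instruction", "input" and "output" fields. For millions of samples you
# would want batched or multi-process tokenization; this is the plain version.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B")
max_tokens = 32_768  # the training max length used in this thread

with open("train_2m.json", "r", encoding="utf-8") as f:   # placeholder path
    samples = json.load(f)

lengths, kept = [], []
for sample in samples:
    text = sample.get("instruction", "") + sample.get("input", "") + sample.get("output", "")
    n = len(tokenizer(text, add_special_tokens=False)["input_ids"])
    lengths.append(n)
    if n <= max_tokens:
        kept.append(sample)

lengths.sort()
print(f"samples: {len(lengths)}, max: {lengths[-1]}, "
      f"p99: {lengths[int(0.99 * len(lengths))]}")
print(f"kept {len(kept)} / {len(samples)} samples under {max_tokens} tokens")

with open("train_2m_filtered.json", "w", encoding="utf-8") as f:  # placeholder path
    json.dump(kept, f, ensure_ascii=False)

Filtering (or truncating) the long samples keeps the realized sequence length, and therefore the activation memory, close to what the 1,000-sample run used.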

Repository owner locked and limited conversation to collaborators Jun 28, 2024
@hiyouga hiyouga converted this issue into discussion #4615 Jun 28, 2024
@hiyouga hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label Jun 28, 2024

