
Question about second-stage training #247

Open
jianhai0527 opened this issue Jun 2, 2024 · 4 comments

Comments

@jianhai0527

  1. Using the qwen1.5 moe model.
  2. After the first training stage finishes, the saved checkpoint cannot be used as the pretrain checkpoint for the second stage; only the weights are loaded.
  3. There are two problems: first, the necessary configuration files are missing; second, after adding the configuration files, the following error is reported:
    py", line 757, in get_parameter_state_dp_zero
    state_dict = optimizer.get_parameter_state_dp_zero()
    File "/nas-wulanchabu/tanfan.zjh/Pai-Megatron-Patch/Megatron-LM-240405/megatron/core/optimizer/distrib_optimizer.py", line 757, in get_parameter_state_dp_zero
    tensors[key].detach().cpu()
    tensors[key].detach().cpu()
    KeyError: 'exp_avg'
    KeyError: 'exp_avg'tensors[key].detach().cpu() tensors[key].detach().cpu()

tensors[key].detach().cpu()

tensors[key].detach().cpu()
KeyErrorKeyErrorKeyErrorKeyError: : : : tensors[key].detach().cpu()'exp_avg' 'exp_avg''exp_avg''exp_avg'tensors[key].detach().cpu()
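For reference, this KeyError means the distributed optimizer is reading Adam state ('exp_avg') that was never created. A minimal sketch of the underlying PyTorch behaviour (plain torch.optim.Adam, not the Pai-Megatron-Patch code itself): Adam populates 'exp_avg' / 'exp_avg_sq' for a parameter only after its first step(), so touching that state earlier raises exactly this error.

```python
import torch

# Minimal sketch: Adam state is created lazily, so reading 'exp_avg' before the
# first step() raises the same KeyError seen in get_parameter_state_dp_zero.
param = torch.nn.Parameter(torch.zeros(4))
opt = torch.optim.Adam([param])

try:
    opt.state[param]["exp_avg"]       # no step() has run yet, the state dict is empty
except KeyError as err:
    print("KeyError:", err)           # -> KeyError: 'exp_avg'

param.grad = torch.ones_like(param)
opt.step()                            # first step creates 'exp_avg' / 'exp_avg_sq'
print("exp_avg" in opt.state[param])  # -> True
```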

@jerryli1981
Collaborator

Hi, I have run into this problem before. It seems that simply not loading the optimizer state when loading for the second stage fixes it.

@jianhai0527
Author

> Hi, I have run into this problem before. It seems that simply not loading the optimizer state when loading for the second stage fixes it.

Thanks~~ I have already added the no-load-optim flag, but it has no effect. What should I do?

@divisionblur

> Hi, I have run into this problem before. It seems that simply not loading the optimizer state when loading for the second stage fixes it.

Doesn't resuming from the checkpoint need the optimizer state?
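As an aside, one way to check what the stage-1 checkpoint actually contains is to open a shard directly. A minimal sketch, assuming the usual Megatron-LM checkpoint layout; the path and key names below are assumptions and vary across versions and distributed-optimizer settings:

```python
import torch

# Hedged sketch: the path below is a placeholder; point it at a shard saved by your
# run. Whether an 'optimizer' entry is present depends on how the checkpoint was
# saved (e.g. --no-save-optim) and on the distributed-optimizer configuration.
# On PyTorch >= 2.6 you may also need to pass weights_only=False.
ckpt = torch.load(
    "checkpoints/iter_0001000/mp_rank_00/model_optim_rng.pt",  # assumed layout
    map_location="cpu",
)
print(sorted(ckpt.keys()))  # e.g. ['args', 'iteration', 'model', 'optimizer', ...]
```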

@divisionblur

> Hi, I have run into this problem before. It seems that simply not loading the optimizer state when loading for the second stage fixes it.

> Thanks~~ I have already added the no-load-optim flag, but it has no effect. What should I do?

How was this resolved in the end?
