AssertionError: Rank 11: found NaN in local grad norm in backward pass before data-parallel communication collective. Device: 3 #366

Open
lanfengmo opened this issue Oct 21, 2024 · 3 comments


@lanfengmo

AssertionError: Rank 11: found NaN in local grad norm in backward pass before data-parallel communication collective. Device: 3

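I am training llama3-70B on 4 machines. The error is raised after a few iterations have been logged. The training script is as follows: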

cat pretrain_llama3_70B_tp4_pp8.sh
cd /workspace/Pai-Megatron-Patch/examples/llama3
sh run_pretrain_llama_70b.sh \
    dsw \
    70B \
    1 \
    1024 \
    1e-7 \
    1e-8 \
    128 \
    128 \
    bf16 \
    4 \
    8 \
    70B \
    sel \
    true \
    false \
    false \
    true \
    100000 \
    /mnt/llama3-datasets/wudao_llama3bpe_content_document \
    /mnt/llama3-ckpts/Meta-Llama-3-70B-tp4-pp8 \
    10000000 \
    1 \
    /mnt/output_megatron_llama3_70B
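
For readers unfamiliar with the assertion quoted above, here is a rough picture of the kind of check its message describes: each rank computes a norm over its own gradients and asserts that it is not NaN before anything is exchanged with other data-parallel ranks. This is a minimal pure-Python sketch, not the framework's actual code. The helper names `local_grad_norm` and `check_local_grads` are hypothetical, and a flat list of floats stands in for a real gradient buffer.

```python
import math


def local_grad_norm(grads):
    """L2 norm of a flat list of gradient values (stand-in for a grad buffer)."""
    return math.sqrt(sum(g * g for g in grads))


def check_local_grads(grads, rank, device):
    """Hypothetical helper: fail fast if this rank's local grad norm is NaN."""
    norm = local_grad_norm(grads)
    # A single NaN element makes the whole sum, and therefore the norm, NaN.
    assert not math.isnan(norm), (
        f"Rank {rank}: found NaN in local grad norm in backward pass "
        f"before data-parallel communication collective. Device: {device}"
    )
    return norm


if __name__ == "__main__":
    print(check_local_grads([3.0, 4.0], rank=0, device=0))
    try:
        check_local_grads([3.0, float("nan")], rank=11, device=3)
    except AssertionError as err:
        print(err)
```

Because the check is per rank, the rank and device in the message (here Rank 11, Device 3) indicate where the NaN was first observed locally, which can help narrow down whether one machine or one pipeline stage is involved.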

@jerryli1981
Collaborator

Hello, I ran four-machine continued pretraining on the Qwen2.5 72B model for a long time and never hit this problem. Could you try llama3.1 or qwen2.5 first?

@jerryli1981
Collaborator

Also, would it be convenient for you to join our group and add us so we can discuss this in detail?

@lanfengmo
Author

> Hello, I ran four-machine continued pretraining on the Qwen2.5 72B model for a long time and never hit this problem. Could you try llama3.1 or qwen2.5 first?

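OK, I will try switching models. For this gradient explosion problem, I have already tried lowering the learning rate and limiting gradients with --clip-grad 1.0, among other things, but the problem persists.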
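
One possible reason clipping does not help here: norm-based clipping rescales gradients by a factor derived from their norm, so it can shrink large finite gradients but cannot repair a gradient that is already NaN. The error message also says the check runs in the backward pass before the data-parallel collective, which suggests it fires before any clipping in the optimizer step gets a chance to run. The sketch below illustrates the general idea in pure Python. The function `clip_by_global_norm` is a hypothetical stand-in, not the code path behind `--clip-grad`.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads so their L2 norm is at most max_norm (illustrative only)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    clip_coef = max_norm / (total_norm + eps)
    # Comparisons with NaN are always False, so NaN gradients pass through
    # unscaled here; scaling by a NaN coefficient would not fix them either.
    if clip_coef < 1.0:
        grads = [g * clip_coef for g in grads]
    return grads, total_norm


if __name__ == "__main__":
    # Large but finite gradients are rescaled to roughly unit norm.
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
    # A NaN gradient is still NaN after "clipping".
    clipped, norm = clip_by_global_norm([3.0, float("nan")], max_norm=1.0)
    print(math.isnan(norm), any(math.isnan(g) for g in clipped))
```

If this reading is right, the more useful question is where the first NaN or Inf appears (data, loss, or a particular layer's activations) rather than how strongly the gradients are clipped.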
