AssertionError: Rank 11: found NaN in local grad norm in backward pass before data-parallel communication collective. Device: 3
Training llama3-70B on 4 machines: the error is raised after a few iterations have been logged. The training script is as follows:
cat pretrain_llama3_70B_tp4_pp8.sh
cd /workspace/Pai-Megatron-Patch/examples/llama3
sh run_pretrain_llama_70b.sh dsw 70B 1 1024 1e-7 1e-8 128 128 bf16 4 8 70B sel true false false true 100000 /mnt/llama3-datasets/wudao_llama3bpe_content_document /mnt/llama3-ckpts/Meta-Llama-3-70B-tp4-pp8 10000000 1 /mnt/output_megatron_llama3_70B
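When reading the error message, it can help to work out where "Rank 11" sits in the parallel layout. The sketch below is only an illustration under stated assumptions: 8 GPUs per machine (32 ranks in total), ranks assigned contiguously per machine, TP=4 and PP=8 as the script name suggests, and a rank ordering in which tensor parallelism varies fastest, then data parallelism, then pipeline parallelism. The function `locate_rank` is a hypothetical helper, not part of Megatron-LM or Pai-Megatron-Patch.

```python
def locate_rank(global_rank, gpus_per_node, tp_size, pp_size, world_size):
    """Map a global rank to (node, local device, TP/DP/PP coordinates).

    Assumes contiguous rank placement per node and a TP-fastest,
    then DP, then PP rank ordering. Both are assumptions, not facts
    taken from the issue.
    """
    model_parallel = tp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size")
    dp_size = world_size // model_parallel

    return {
        "node": global_rank // gpus_per_node,          # which machine
        "local_device": global_rank % gpus_per_node,   # GPU index on that machine
        "tp_rank": global_rank % tp_size,
        "dp_rank": (global_rank // tp_size) % dp_size,
        "pp_rank": global_rank // (tp_size * dp_size),
        "dp_size": dp_size,
    }


if __name__ == "__main__":
    # Rank 11 from the error message, under the assumed 4 x 8 GPU layout.
    print(locate_rank(11, gpus_per_node=8, tp_size=4, pp_size=8, world_size=32))
```

Under these assumptions the local device index comes out as 3, which agrees with "Device: 3" in the error, and the data-parallel size is 1. If the real cluster uses a different number of GPUs per machine or a different rank ordering, the coordinates will differ.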
Hello, I ran continued pretraining of the 72B qwen2.5 model on four machines for a long time and never hit this problem. Could you try llama3.1 or qwen2.5 first?
Also, would it be convenient for you to join our group so we can discuss this in detail?
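OK, I'll try switching models. For this gradient explosion problem I have already tried lowering the learning rate and clipping gradients with --clip-grad 1.0, among other things, but the problem persists.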
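One possible reason clipping made no difference: norm-based clipping rescales finite gradients, but it cannot repair a gradient that is already NaN. The error text also says the check runs in the backward pass before the data-parallel collective, which presumably comes before any clipping in the optimizer step. The pure-Python sketch below illustrates the first point with a generic global-norm clip. `clip_by_global_norm` is a hypothetical helper written for this illustration, not the Megatron implementation.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads so their global L2 norm is at most max_norm.

    Generic illustration only. Returns (possibly scaled grads, norm).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    clip_coef = max_norm / (total_norm + eps)
    # Any comparison with NaN is False, so a NaN norm skips the scaling
    # and the NaN entries stay in the gradient list.
    if clip_coef < 1.0:
        grads = [g * clip_coef for g in grads]
    return grads, total_norm


if __name__ == "__main__":
    # Finite gradients: the norm of [3, 4] is 5, so they are scaled down.
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))

    # One NaN entry: the norm is NaN and clipping cannot remove it.
    print(clip_by_global_norm([3.0, float("nan")], max_norm=1.0))
```

If this is what is happening, the NaN originates upstream of clipping (for example in the forward or backward computation itself), so the first step at which a non-finite value appears is the thing to locate.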