qwen2-7b problem when tp=2, pp=1 #285
Comments
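When I read the code earlier, I noticed there seems to be a rule that tp must be divisible by world_size. You have node=1 and tp=2, which probably doesn't divide evenly, so that may be why it fails.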
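My understanding is that world_size is the number of machines, i.e. it matters for the multi-node, multi-GPU case. You can see that some of the .sh files set NNODES=${WORLD_SIZE}.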
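You're right, world_size should be the number of GPUs per machine multiplied by the number of machines; I just printed it to check. That said, I tried tp=2 on my side and training works, so I'm not sure what is different in your setup. Did you also use tp=2 when you converted the model?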
It was most likely a problem with my Docker environment. After switching to 24.03, training runs normally.
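For readers following the world_size discussion above, the relationship can be sketched in a few lines of Python. This is only an illustration, not code from this repository: the function name `data_parallel_size` and its parameters are hypothetical, and it assumes the usual Megatron-style convention that world_size (GPUs per node times number of nodes) must be divisible by tp × pp, with the quotient used as the data-parallel size.

```python
def data_parallel_size(gpus_per_node: int, nnodes: int, tp: int, pp: int) -> int:
    """Return the data-parallel size implied by a tp/pp layout.

    Hypothetical helper for illustration only. Assumes world_size is the
    total number of GPUs and must be divisible by tp * pp.
    """
    world_size = gpus_per_node * nnodes  # total GPUs across all machines
    model_parallel = tp * pp             # GPUs needed for one model replica
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size ({world_size}) is not divisible by tp * pp ({model_parallel})"
        )
    return world_size // model_parallel


# One machine with 8 GPUs, tp=2, pp=1, as in this issue.
print(data_parallel_size(gpus_per_node=8, nnodes=1, tp=2, pp=1))  # 4
```

Under these assumptions, tp=2 and pp=1 on a single 8-GPU machine is a valid layout, which is consistent with the failure coming from the environment rather than from the parallel configuration.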
Hi,
I tried fine-tuning qwen2-7B on a machine with 8xA100-80G and made the following attempts:
What could be the reason?