
qwen2-7b problem when tp=2, pp=1 #285

Closed
MrWaterZhou opened this issue Jul 13, 2024 · 6 comments
@MrWaterZhou

Hi,
I tried fine-tuning qwen2-7B on a machine with 8xA100-80G and made the following attempts:

1.	tp=1, pp=1: Training was normal.
2.	tp=1, pp=2: Training was normal.
3.	tp=2, pp=1: After loading the dataset, training hung and never started.

What could be the reason?

@tu2022

tu2022 commented Jul 16, 2024

When I read the code earlier, I recall it seemed to require that world_size be evenly divisible by tp. You have node=1 and tp=2, so the division probably doesn't work out, which is why it's failing.
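
For context, here is a minimal Python sketch of the kind of divisibility check Megatron-style frameworks apply when building tensor/pipeline parallel groups (function and variable names are illustrative, not this project's actual code):

```python
# Minimal, illustrative sketch (not this project's actual code) of the
# divisibility rule Megatron-style frameworks enforce when building
# tensor/pipeline parallel groups.
def check_parallel_config(world_size: int, tp: int, pp: int) -> int:
    """Return the data-parallel size, or raise if the configuration is invalid."""
    model_parallel_size = tp * pp
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by tp*pp ({model_parallel_size})"
        )
    return world_size // model_parallel_size

# 8 GPUs with tp=2, pp=1 gives a data-parallel size of 4, so that config itself is valid.
print(check_parallel_config(world_size=8, tp=2, pp=1))  # -> 4
```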

@MrWaterZhou
Author

> When I read the code earlier, I recall it seemed to require that world_size be evenly divisible by tp. You have node=1 and tp=2, so the division probably doesn't work out, which is why it's failing.

My understanding is that world_size refers to the number of GPUs? This is single-node training on 4 GPUs.
[screenshot]

@tu2022

tu2022 commented Jul 18, 2024

My understanding is that world_size refers to the number of machines, i.e. the multi-node multi-GPU case. If you look at some of the .sh files, they set NNODES=${WORLD_SIZE}.

@MrWaterZhou
Author

> My understanding is that world_size refers to the number of machines, i.e. the multi-node multi-GPU case. If you look at some of the .sh files, they set NNODES=${WORLD_SIZE}.

[screenshot]
I haven't looked at it very closely... but this project's scripts seem to define it differently from Megatron-LM.
[screenshot]
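
For what it's worth, under a torchrun-style launch world_size is the total process (GPU) count, i.e. nodes × GPUs per node, not the number of machines. A minimal sketch (names illustrative):

```python
# Illustrative sketch: under torchrun, WORLD_SIZE is the total number of
# processes (nodes * GPUs per node), not the number of machines.
import os
import torch.distributed as dist

def report_world_size() -> None:
    dist.init_process_group(backend="nccl")      # reads RANK/WORLD_SIZE from env
    world_size = dist.get_world_size()           # e.g. 4 for one node with 4 GPUs
    local_rank = int(os.environ["LOCAL_RANK"])   # GPU index within this node
    print(f"rank {dist.get_rank()}/{world_size}, local_rank {local_rank}")

if __name__ == "__main__":
    report_world_size()
```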

@tu2022

tu2022 commented Jul 18, 2024

You're right, world_size should be the number of GPUs times the number of machines; I just printed it to check. But on my side training works fine with tp=2, so I'm not sure what's going on at your end. Did you also use tp=2 when converting the model?
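
As a side note, a Megatron-style checkpoint saved or converted with tensor parallelism usually contains one mp_rank_* shard directory per tensor-parallel rank, so a quick sanity check (illustrative, not this project's actual tooling) is to compare the shard count against the tp size used for training:

```python
# Illustrative sanity check (not this project's actual tooling): compare the
# number of mp_rank_* shard directories in a converted checkpoint against the
# tensor-parallel size the training job expects.
from pathlib import Path

def check_checkpoint_tp(checkpoint_dir: str, expected_tp: int) -> None:
    shards = sorted(Path(checkpoint_dir).glob("mp_rank_*"))
    if len(shards) != expected_tp:
        raise ValueError(
            f"found {len(shards)} tensor-parallel shards in {checkpoint_dir}, "
            f"but training expects tp={expected_tp}"
        )

# Hypothetical usage:
# check_checkpoint_tp("./qwen2-7b-mcore-tp2-pp1/release", expected_tp=2)
```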

@MrWaterZhou
Author

It was probably a problem with my Docker environment; after switching to the 24.03 image, training works normally.
Training with tp=2 now works fine on both 8xL40S and 8xA100 machines.
