
qwen2-7b problem when tp=2, pp=1 #285

Closed
MrWaterZhou opened this issue Jul 13, 2024 · 6 comments
@MrWaterZhou

Hi,
I tried fine-tuning qwen2-7B on a machine with 8xA100-80G and made the following attempts:

1.	tp=1, pp=1: Training was normal.
2.	tp=1, pp=2: Training was normal.
3.	tp=2, pp=1: After loading the dataset, training hung and never started.

What could be the reason?

@tu2022

tu2022 commented Jul 16, 2024

When I read the code earlier, I recall it seemed to require that world_size be evenly divisible by tp. You have node=1 and tp=2, so the division probably doesn't work out, which is why it's failing.
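
For context, here is a minimal Python sketch of the kind of divisibility check Megatron-style frameworks apply when building tensor/pipeline parallel groups (function and variable names are illustrative, not this project's actual code):

```python
# Minimal, illustrative sketch (not this project's actual code) of the
# divisibility rule Megatron-style frameworks enforce when building
# tensor/pipeline parallel groups.
def check_parallel_config(world_size: int, tp: int, pp: int) -> int:
    """Return the data-parallel size, or raise if the configuration is invalid."""
    model_parallel_size = tp * pp
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by tp*pp ({model_parallel_size})"
        )
    return world_size // model_parallel_size

# 8 GPUs with tp=2, pp=1 gives a data-parallel size of 4, so that config itself is valid.
print(check_parallel_config(world_size=8, tp=2, pp=1))  # -> 4
```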

@MrWaterZhou
Author

> When I read the code earlier, I recall it seemed to require that world_size be evenly divisible by tp. You have node=1 and tp=2, so the division probably doesn't work out, which is why it's failing.

My understanding is that world_size refers to the number of GPUs? This is single-node training on 4 GPUs.
[screenshot]

@tu2022

tu2022 commented Jul 18, 2024

My understanding is that world_size refers to the number of machines, i.e. the multi-node multi-GPU case. If you look at some of the .sh files, they set NNODES=${WORLD_SIZE}.

@MrWaterZhou
Author

> My understanding is that world_size refers to the number of machines, i.e. the multi-node multi-GPU case. If you look at some of the .sh files, they set NNODES=${WORLD_SIZE}.

[screenshot]
I haven't looked at it very closely... but this project's scripts seem to define it differently from Megatron-LM.
[screenshot]
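
For what it's worth, under a torchrun-style launch world_size is the total process (GPU) count, i.e. nodes × GPUs per node, not the number of machines. A minimal sketch (names illustrative):

```python
# Illustrative sketch: under torchrun, WORLD_SIZE is the total number of
# processes (nodes * GPUs per node), not the number of machines.
import os
import torch.distributed as dist

def report_world_size() -> None:
    dist.init_process_group(backend="nccl")      # reads RANK/WORLD_SIZE from env
    world_size = dist.get_world_size()           # e.g. 4 for one node with 4 GPUs
    local_rank = int(os.environ["LOCAL_RANK"])   # GPU index within this node
    print(f"rank {dist.get_rank()}/{world_size}, local_rank {local_rank}")

if __name__ == "__main__":
    report_world_size()
```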

@tu2022

tu2022 commented Jul 18, 2024

You're right, world_size should be the number of GPUs times the number of machines; I just printed it to check. But on my side training works fine with tp=2, so I'm not sure what's going on at your end. Did you also use tp=2 when converting the model?
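
As a side note, a Megatron-style checkpoint saved or converted with tensor parallelism usually contains one mp_rank_* shard directory per tensor-parallel rank, so a quick sanity check (illustrative, not this project's actual tooling) is to compare the shard count against the tp size used for training:

```python
# Illustrative sanity check (not this project's actual tooling): compare the
# number of mp_rank_* shard directories in a converted checkpoint against the
# tensor-parallel size the training job expects.
from pathlib import Path

def check_checkpoint_tp(checkpoint_dir: str, expected_tp: int) -> None:
    shards = sorted(Path(checkpoint_dir).glob("mp_rank_*"))
    if len(shards) != expected_tp:
        raise ValueError(
            f"found {len(shards)} tensor-parallel shards in {checkpoint_dir}, "
            f"but training expects tp={expected_tp}"
        )

# Hypothetical usage:
# check_checkpoint_tp("./qwen2-7b-mcore-tp2-pp1/release", expected_tp=2)
```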

@MrWaterZhou
Author

It was probably a problem with my Docker environment; after switching to the 24.03 image, training works normally.
Training with tp=2 now works fine on both 8xL40S and 8xA100 machines.
