With ZeRO Optimizer Stage 3, is it recommended not to use intra-layer model parallelism (Megatron-LM)? #948
SantoshGuptaML started this conversation in General
Replies: 1 comment
-
Since ZeRO Stage 3 also partitions the model parameters, my intuition is that intra-layer model parallelism would not increase memory efficiency, and might even interfere with ZeRO's efficiency: under model parallelism each GPU owns a very specific set of operations, while ZeRO-3 already distributes the model states evenly (memory-wise) across the GPUs.
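To make the intuition concrete, here is a rough back-of-envelope comparison of per-GPU memory for model states (a sketch of my own, ignoring activations, communication buffers, and real-world overheads):

```python
# Rough back-of-envelope: per-GPU memory for model states only,
# ignoring activations and communication buffers. Illustrative numbers.
def per_gpu_gb(params_billions, n_gpus, zero3=True, mp_degree=1):
    # Mixed-precision training with Adam: fp16 params (2 B) + fp16 grads (2 B)
    # + fp32 master params, momentum, variance (12 B) = 16 bytes/parameter.
    total_gb = params_billions * 16
    if zero3:
        # ZeRO-3 partitions all model states across every data-parallel rank.
        return total_gb / n_gpus
    # Intra-layer (tensor) model parallelism only splits states across
    # the model-parallel group, not the full set of GPUs.
    return total_gb / mp_degree

# A 10B-parameter model on 16 GPUs:
print(per_gpu_gb(10, 16, zero3=True))                # 10.0 GB per GPU
print(per_gpu_gb(10, 16, zero3=False, mp_degree=4))  # 40.0 GB per GPU
```

Since ZeRO-3 already divides the model states by the full GPU count, adding tensor parallelism on top would not shrink this term further; it mainly changes how the compute and activations are split.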
-
Hi @SantoshGuptaML! Could you please point me to an example where ZeRO is used with Megatron-LM (model parallelism)? Does the Megatron-DeepSpeed repository use ZeRO? I didn't see a ZeRO stage being set in the examples in the Megatron-DeepSpeed repo.
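If it helps, the stage is usually selected through the `zero_optimization` block of the DeepSpeed config. A minimal sketch (the toy model and hyperparameters here are placeholders, not taken from Megatron-DeepSpeed):

```python
# Minimal sketch of enabling ZeRO Stage 3 through the DeepSpeed config.
# The toy model and hyperparameters are placeholders for illustration.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,  # partition optimizer states, gradients, and parameters
    },
}

model = torch.nn.Linear(1024, 1024)  # placeholder model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

In the Megatron-DeepSpeed scripts the equivalent settings would live in the JSON config passed via `--deepspeed_config`, so that is where a ZeRO stage would show up if it were enabled.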