hku-systems/pipemesh

PipeMesh

This repository contains the official implementation for the paper:

PIPEMESH: Achieving Memory-Efficient Computation-Communication Overlap for Training Large Language Models

Base Implementation

This implementation is based on Megatron-LM (commit 20574f7553e66dbb3e8de72ca6c26c9faa2e1b18).

Usage

PipeMesh introduces two new arguments to control the elastic pipeline schedule:

--forward-groups <group_1> <group_2> ...
--backward-groups <group_1> <group_2> ...

--forward-groups defines the enqueue (forward) sequence and --backward-groups defines the dequeue (backward) sequence used for micro-batch scheduling within each pipeline group. Each value is the number of micro-batches in one group.

For example, for 12 micro-batches per pipeline group, you can use:

--forward-groups 8 4 \
--backward-groups 6 6

This configuration enqueues 8 micro-batches first and then the remaining 4 for the forward pass, and dequeues 6 micro-batches at a time during the backward pass, achieving fine-grained overlap between computation and communication.
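To sanity-check a pair of group lists before launching a run, something like the following sketch may help. It is not taken from the PipeMesh code base: the helper names `build_parser`, `check_groups`, and `group_ranges` are hypothetical, and the rule that each list must sum to the number of micro-batches per pipeline group is an assumption inferred from the example above (8 + 4 = 12 and 6 + 6 = 12).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-alone parser that mirrors the two PipeMesh flags."""
    parser = argparse.ArgumentParser()
    # nargs="+" collects one or more integers into a list, e.g. [8, 4].
    parser.add_argument("--forward-groups", type=int, nargs="+", required=True)
    parser.add_argument("--backward-groups", type=int, nargs="+", required=True)
    return parser


def check_groups(forward_groups, backward_groups, num_microbatches):
    """Raise ValueError if a group list is inconsistent.

    Assumption (not confirmed by the README): every group size is positive
    and each list sums to the micro-batch count per pipeline group.
    """
    for name, groups in (("forward", forward_groups), ("backward", backward_groups)):
        if any(size <= 0 for size in groups):
            raise ValueError(f"{name} groups must all be positive: {groups}")
        if sum(groups) != num_microbatches:
            raise ValueError(
                f"{name} groups {groups} sum to {sum(groups)}, "
                f"expected {num_microbatches}"
            )


def group_ranges(groups):
    """Return half-open [start, end) micro-batch index ranges, one per group."""
    ranges = []
    start = 0
    for size in groups:
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--forward-groups", "8", "4", "--backward-groups", "6", "6"]
    )
    check_groups(args.forward_groups, args.backward_groups, num_microbatches=12)
    print("forward:", group_ranges(args.forward_groups))
    print("backward:", group_ranges(args.backward_groups))
```

Under these assumptions, the example configuration splits micro-batches 0 to 11 into forward groups covering indices 0-7 and 8-11, and backward groups covering indices 0-5 and 6-11.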
