BurstEngine: An Efficient Distributed Framework for Training Transformers on Extremely Long Sequences
BurstEngine is a highly efficient distributed framework designed for training transformer-based large language models (LLMs) and large multimodal models (LMMs) on extremely long sequences of over 1M tokens. It addresses the quadratic time and space complexity of the attention mechanism at such lengths while outperforming existing state-of-the-art training frameworks.
Key features (a minimal illustrative sketch of each follows this list):
- BurstAttention: A highly optimized distributed attention implementation with:
  - Backward Communication Optimization: Reduces communication overhead by ~25% compared to RingAttention
  - Topology-Aware Ring Communication: Maximizes network bandwidth utilization across intra-node and inter-node connections
  - Fine-Grained Communication-Computation Overlap: Minimizes overall communication overhead through a specialized double-buffer design
- Sequence-Level Selective Checkpointing: Optimizes the trade-off between memory overhead and recomputation overhead at the sequence level
- Sequence-Level Fusion of LM Head and Loss Function: Reduces the memory overhead of storing intermediate states in the language modeling head
- Sparse Attention Integration: Supports various sparse attention patterns, including causal masking, sliding-window masking, and block-wise sparse patterns
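The first sketch below illustrates the kind of double-buffered, ring-style key/value exchange that communication-computation overlap relies on: the next K/V block is posted with `torch.distributed` point-to-point ops before attention runs on the block already resident on the rank. The function name, tensor layout, and the simplified accumulation (no online-softmax rescaling across blocks) are assumptions for illustration, not BurstEngine's actual implementation.

```python
# Minimal sketch of double-buffered ring-style attention (illustrative only).
import torch
import torch.distributed as dist
import torch.nn.functional as F

def ring_attention_sketch(q, k, v):
    """q, k, v: [batch, heads, local_seq, head_dim] CUDA shards on each rank."""
    rank, world = dist.get_rank(), dist.get_world_size()
    send_to, recv_from = (rank + 1) % world, (rank - 1) % world

    out = torch.zeros_like(q)
    k_cur, v_cur = k.contiguous(), v.contiguous()
    for step in range(world):
        if step < world - 1:
            # Post the next K/V transfer *before* computing on the current
            # block, so the NCCL transfer overlaps with the attention below.
            k_next, v_next = torch.empty_like(k_cur), torch.empty_like(v_cur)
            reqs = dist.batch_isend_irecv([
                dist.P2POp(dist.isend, k_cur, send_to),
                dist.P2POp(dist.isend, v_cur, send_to),
                dist.P2POp(dist.irecv, k_next, recv_from),
                dist.P2POp(dist.irecv, v_next, recv_from),
            ])
        # Local attention on the resident block. A real implementation
        # accumulates with an online softmax instead of the plain sum here.
        out = out + F.scaled_dot_product_attention(q, k_cur, v_cur)
        if step < world - 1:
            for r in reqs:
                r.wait()
            k_cur, v_cur = k_next, v_next
    return out
```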
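For sequence-level selective checkpointing, the toy heuristic below shows the general idea: split the hidden states along the sequence dimension and recompute only some chunks in the backward pass, storing activations for the rest. The chunking policy and the `mlp_over_chunks` helper are hypothetical, not BurstEngine's selection algorithm.

```python
# Toy sketch of sequence-level selective checkpointing (illustrative only).
import torch
from torch.utils.checkpoint import checkpoint

def mlp_over_chunks(mlp, hidden, num_chunks=4, ckpt_chunks=2):
    """hidden: [batch, seq, dim]; recompute the first `ckpt_chunks` chunks."""
    outs = []
    for i, chunk in enumerate(hidden.chunk(num_chunks, dim=1)):
        if i < ckpt_chunks:
            # Checkpointed chunks: activations are dropped in forward and
            # recomputed in backward, trading compute for memory.
            outs.append(checkpoint(mlp, chunk, use_reentrant=False))
        else:
            # Stored chunks: activations are kept, avoiding recomputation.
            outs.append(mlp(chunk))
    return torch.cat(outs, dim=1)
```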
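The fusion of the LM head with the loss can be pictured as the chunked computation below: logits for the full [batch, seq, vocab] tensor are never materialized at once, and each chunk's logits are recomputed in the backward pass. The helper names, chunk size, and `ignore_index` convention are assumptions; this loop is only a conceptual stand-in for the fused implementation.

```python
# Conceptual sketch of a sequence-chunked LM head + loss (illustrative only).
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def _chunk_loss(h, w, y):
    # Project one sequence chunk to vocabulary logits and reduce to a loss.
    logits = h @ w.t()                                  # [batch, chunk, vocab]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1),
                           ignore_index=-100, reduction="sum")

def chunked_lm_loss(hidden, lm_head_weight, labels, chunk_size=4096):
    """hidden: [batch, seq, dim]; lm_head_weight: [vocab, dim]; labels: [batch, seq]."""
    total_loss, total_tokens = hidden.new_zeros(()), 0
    for h, y in zip(hidden.split(chunk_size, dim=1), labels.split(chunk_size, dim=1)):
        # Recompute each chunk's logits in backward so no [batch, seq, vocab]
        # tensor is ever kept alive for the whole sequence.
        total_loss = total_loss + checkpoint(_chunk_loss, h, lm_head_weight, y,
                                             use_reentrant=False)
        total_tokens += (y != -100).sum().item()
    return total_loss / max(total_tokens, 1)
```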
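The supported sparsity patterns can be described with boolean masks like the ones below (True = attend). The dense [seq, seq] masks are only for clarity; an actual sparse-attention kernel skips masked blocks instead of materializing them, and the function names here are illustrative.

```python
# Illustrative mask constructions for the supported sparsity patterns.
import torch

def causal_sliding_window_mask(seq_len, window, device="cpu"):
    i = torch.arange(seq_len, device=device)
    causal = i[:, None] >= i[None, :]                 # attend only to past tokens
    in_window = (i[:, None] - i[None, :]) < window    # and at most `window` back
    return causal & in_window

def blockwise_mask(keep_blocks, block_size):
    # keep_blocks: [nb, nb] boolean block map, expanded to token resolution.
    return keep_blocks.repeat_interleave(block_size, 0).repeat_interleave(block_size, 1)
```

For a dense reference check, such a mask can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`.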
Performance highlights:
- 1.2× speedup over state-of-the-art baselines on extremely long sequences (>1M tokens)
- 26.4% memory reduction compared to most memory-efficient baselines
- Linear scaling with the number of devices along the sequence dimension
- Support for sequences up to 4M tokens on 64×A800 GPUs
Prerequisites:
- Docker
- CUDA-compatible GPUs
- InfiniBand network (for multi-node training)
- Shared storage accessible by all nodes
Installation:
- Clone the repository:
```bash
git clone https://github.com/your-org/BurstEngine.git
cd BurstEngine
git submodule update --init --recursive
```
- Set up environment variables: create and configure `env.sh` with your network settings:
```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export UCX_NET_DEVICES=bond0
export GLOO_SOCKET_IFNAME=bond0
export NCCL_SOCKET_IFNAME=bond0
export NCCL_IB_HCA="mlx5_2,mlx5_3,mlx5_5,mlx5_6"
```
Important: Replace `bond0` with your actual network interface name; use `ifconfig` to list the available interfaces.
Example network configuration:
```text
# Node 0
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet 10.0.2.12  netmask 255.255.255.0  broadcast 10.0.2.255
# Node 1
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet 10.0.2.13  netmask 255.255.255.0  broadcast 10.0.2.255
```
Note: Contact your hardware provider to confirm the correct `NCCL_IB_HCA` values, as some InfiniBand HCAs are dedicated to shared storage rather than program communication. A quick NCCL connectivity check is sketched after these installation steps.
- Build the Docker image:
```bash
docker build -t burst_engine:latest .
```
- Distribute the image to all nodes. On the build machine:
```bash
docker save burst_engine:latest | gzip > /shared/burst_engine.tar.gz
```
On each target node:
```bash
docker load -i /shared/burst_engine.tar.gz
```
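Before moving on, it can help to verify that NCCL traffic actually flows between nodes over the interfaces configured in `env.sh`. The short script below is not part of the repository; it is a generic sanity check using `torch.distributed`, and the `torchrun` arguments shown are examples only.

```python
# nccl_check.py -- generic NCCL sanity check (not part of BurstEngine).
# Example launch on each of two nodes (adjust ranks, address, and GPU count):
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0 or 1> \
#            --master_addr=10.0.2.12 --master_port=29500 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")   # reads the env vars set by torchrun
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)                        # expect world_size on every rank
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this hangs or is unexpectedly slow, revisit `NCCL_SOCKET_IFNAME` and `NCCL_IB_HCA`.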
Before running experiments, check which GPU nodes are available:
```bash
# Modify check.sh to specify your node prefix and range
node_prefix="bjdx"   # Replace with your node prefix
# Check nodes bjdx1, bjdx2, bjdx3, bjdx4
bash check.sh
```
- Configure your experiment: edit `code/BurstEngine/apps/llama/multi_exp.sh`:
```bash
#!/bin/bash
sizes=( 1048576 2097152 )   # Sequence lengths to test
methods=(
    "burst"                 # BurstEngine method
)
export NODES="bjdx1 bjdx2 bjdx3 bjdx4"   # Your node list
export MODEL="7b"                        # Model size
export CP_SIZE=32                        # Context parallel size
```
- Set the `PROJECT_DIR` environment variable:
```bash
export PROJECT_DIR=/shared/sc_workspace/BE/
```
- Launch multi-node training:
```bash
cd code/BurstEngine/apps/llama
bash multi_exp.sh
```
Compare BurstEngine with other methods:
```bash
# Megatron-LM
cd evaluation/baselines/Megatron-LM && bash multi_exp.sh
# Megatron-DeepSpeed
cd evaluation/baselines/Megatron-DeepSpeed/script && bash multi_exp.sh
# InternEvo
cd evaluation/baselines/InternEvo && bash multi_exp.sh
```
Run the kernel-level benchmarks:
```bash
cd evaluation/kernel_bench/
bash bench_all.sh
```
For detailed guidance on reproducing the results, please refer to this.