
Conversation

@h-guo18
Contributor

@h-guo18 h-guo18 commented Jan 8, 2026

What does this PR do?

Type of change: New Feature

Overview:

  • Added Context Parallel (CP) support by patching torch ring attention (a minimal sketch of the idea follows this list).
  • Requires the following library versions for stable CP:
    • torch 2.8.0
    • transformers 5.0.0
    • accelerate 1.12.0
  • Moved to FSDP2.
  • Removed unused arguments from the training script (--multi_gpu, fsdp_wrap_layer).
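
A minimal sketch of the underlying idea, assuming torch's experimental context_parallel API (function and mesh names are illustrative; the actual patch in this PR may differ):

# Illustrative sketch only: run SDPA under torch's experimental ring-attention
# context parallelism. Not the exact patch from this PR.
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.experimental import context_parallel

def cp_attention(q, k, v, cp_size: int):
    """q/k/v: full-sequence [B, H, S, D] tensors, identical on every CP rank."""
    cp_mesh = init_device_mesh("cuda", (cp_size,), mesh_dim_names=("cp",))

    # Shard the buffers along the sequence dim (dim=2); inside the context,
    # scaled_dot_product_attention dispatches to ring attention across ranks.
    with context_parallel(cp_mesh, buffers=(q, k, v), buffer_seq_dims=(2, 2, 2)):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out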

Usage

./launch_train.sh --model $MODEL \
    --output_dir $OUTPUT_DIR \
    --data $DATA \
    --num_epochs 0.1 \
    --train_bs 1 \
    --eagle_config eagle_config.json \
    --training_seq_len 1024 \
    --cp_size 2    # newly added
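
The relationship between --cp_size and the launcher topology is assumed to be the usual one (the exact checks live in the training scripts and may differ): cp_size should divide the number of GPUs, and the remaining factor becomes the data-parallel width. For example:

# Assumed relationship only; the actual validation lives in launch_train.sh /
# the training script and may differ.
import os

world_size = int(os.environ.get("WORLD_SIZE", "8"))  # e.g. 8 x H100
cp_size = 2                                          # value passed via --cp_size

assert world_size % cp_size == 0, "cp_size must divide the number of GPUs"
dp_size = world_size // cp_size                      # data-parallel (FSDP2) width
print(f"world={world_size}, cp={cp_size}, dp={dp_size}")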

Testing

  • SDPA-level correctness: tested TTT attention with and without CP; diffs < 1% (a sketch of the comparison helper appears after the tables below).
=== Compare context-parallel (CP) outputs and grads with non-CP ===
Forward output comparison (CP vs Non-CP):
  Absolute diff (adiff) cp_out vs out: 0.001953125
  Relative diff (rdiff) cp_out vs out: 0.00182342529296875
WQ (query proj) grad comparison (CP vs Non-CP):
  Absolute diff (adiff) cp_wq_grad vs wq_grad: 0.0078125
  Relative diff (rdiff) cp_wq_grad vs wq_grad: 0.00347900390625
WK (key proj) grad comparison (CP vs Non-CP):
  Absolute diff (adiff) cp_wk_grad vs wk_grad: 0.0078125
  Relative diff (rdiff) cp_wk_grad vs wk_grad: 0.002471923828125
WV (value proj) grad comparison (CP vs Non-CP):
  Absolute diff (adiff) cp_wv_grad vs wv_grad: 0.25
  Relative diff (rdiff) cp_wv_grad vs wv_grad: 0.0069580078125
==============================================================
  • E2E training accuracy
    (Llama3.1-8B, unsynthesized Magpie)
    [image: E2E training accuracy comparison, CP vs non-CP]
  • Peak memory
    (Llama3.1-8B, 8xH100, train_length=4k)

    cp_size   max_memory_allocated (MB)   max_memory_reserved (MB)
    1         65040.20                    79018.00
    2         50409.17                    73098.00
    4         45120.92                    72052.00
    8         38882.12                    66484.00
  • Max training length test
    (Llama3.1-8B, H100)

    cp_size   6k    12k   24k   48k
    1         OK    OOM   OOM   OOM
    2         OK    OK    OOM   OOM
    4         OK    OK    OK    OOM
    8         OK    OK    OK    OK
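
For reference, a minimal sketch of how adiff/rdiff comparisons like the ones above can be computed (helper and variable names are illustrative, not the exact test code from this PR):

# Illustrative comparison helper, assuming CP outputs have already been
# gathered along the sequence dim so both tensors have the same shape.
import torch

def report_diff(name: str, cp_tensor: torch.Tensor, ref_tensor: torch.Tensor) -> None:
    """Print max absolute diff and a scale-normalized relative diff."""
    diff = (cp_tensor - ref_tensor).abs()
    adiff = diff.max().item()
    rdiff = (diff.max() / ref_tensor.abs().max().clamp_min(1e-12)).item()
    print(f"  Absolute diff (adiff) {name}: {adiff}")
    print(f"  Relative diff (rdiff) {name}: {rdiff}")

# Example usage (hypothetical tensor names):
# report_diff("cp_out vs out", cp_out, out)
# report_diff("cp_wq_grad vs wq_grad", cp_wq_grad, wq_grad)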

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Signed-off-by: h-guo18 <[email protected]>
@copy-pr-bot

copy-pr-bot bot commented Jan 8, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@h-guo18 h-guo18 self-assigned this Jan 8, 2026
@copy-pr-bot

copy-pr-bot bot commented Jan 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@h-guo18 h-guo18 marked this pull request as ready for review January 9, 2026 23:42
@h-guo18 h-guo18 requested a review from a team as a code owner January 9, 2026 23:42