
Conversation

@GeYuhong commented Sep 2, 2025

Description

This PR adapts Transformer Engine to activation offloading, a new feature in Megatron-LM (NVIDIA/Megatron-LM#1752).

Activation offloading selects the inputs of specific modules (such as core_attn, qkv_linear, router_fc1), offloads them to CPU during the forward pass, and reloads them to GPU during the backward pass.
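For readers unfamiliar with the mechanism, here is a minimal sketch of the general offload/reload idea using PyTorch's public saved-tensors hooks. This is not the Megatron-LM or TE implementation referenced above; the hook functions and their names are illustrative assumptions.

```python
# Minimal sketch (not the actual Megatron-LM/TE code): offload tensors saved
# for backward to CPU in the forward pass and reload them to the original
# device when the backward pass needs them.
import torch

def offload_to_cpu(tensor):
    # Pack hook: called in the forward pass when autograd saves a tensor.
    # Remember the original device so the tensor can be restored there.
    return (tensor.device, tensor.to("cpu", non_blocking=True))

def reload_to_gpu(packed):
    # Unpack hook: called in the backward pass when the saved tensor is used.
    device, cpu_tensor = packed
    return cpu_tensor.to(device, non_blocking=True)

def run_with_activation_offload(module, inp):
    # All tensors saved for backward inside this context go through the hooks.
    with torch.autograd.graph.saved_tensors_hooks(offload_to_cpu, reload_to_gpu):
        return module(inp)
```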

When offloading modules that include weights (nn.Parameter), the attributes attached to those weights (such as main_grad and grad_added_to_main_grad) are stripped by torch. Therefore, this feature requires modifying the basic modules in TE (such as grouped_linear.py and layernorm_linear.py) to preserve these necessary attributes.
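A hedged illustration of the problem and one way to work around it: moving a tensor through CPU produces a new tensor object, so Python-level attributes on the original are lost unless they are copied over explicitly. The attribute list and helper names below are assumptions for the example, not the TE code in this PR.

```python
# Illustrative sketch only: preserve Python-level attributes on a weight
# tensor across a CPU round trip by saving and re-attaching them.
import torch

_PRESERVED_ATTRS = ("main_grad", "grad_added_to_main_grad")

def offload_preserving_attrs(weight: torch.Tensor):
    # Collect the attributes that a plain .to("cpu") / .to(device) would drop.
    saved = {name: getattr(weight, name)
             for name in _PRESERVED_ATTRS if hasattr(weight, name)}
    return weight.detach().to("cpu"), weight.device, saved

def reload_preserving_attrs(cpu_copy, device, saved):
    restored = cpu_copy.to(device)
    for name, value in saved.items():
        setattr(restored, name, value)  # re-attach the stripped attributes
    return restored
```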

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Add support in grouped_linear.py, linear.py, and layernorm_linear.py for retrieving the offload_activation flag, based on whether the input tensor carries the offloading_activation attribute.
  • Save the grad_added_to_main_grad attribute in the forward pass and restore it in the backward pass in grouped_linear.py, linear.py, and layernorm_linear.py (see the sketch after this list).
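The sketch below shows the pattern the two items above describe, reduced to a toy autograd.Function: read the offload flag from the input tensor in forward, stash grad_added_to_main_grad on the context, and restore it in backward. It is not the actual TE diff; the class and the attribute-handling details are assumptions based on the description above.

```python
# Hedged sketch of the save/restore pattern, not the real TE modules.
import torch

class _LinearLike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp, weight):
        # Flag detection: the input tensor may carry an offloading attribute.
        ctx.offload_activation = getattr(inp, "offload_activation", False)
        # Preserve a weight attribute that offloading would otherwise strip.
        ctx.grad_added_to_main_grad = getattr(weight, "grad_added_to_main_grad", False)
        ctx.save_for_backward(inp, weight)
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        inp, weight = ctx.saved_tensors
        # Restore the preserved attribute before the backward weight update.
        weight.grad_added_to_main_grad = ctx.grad_added_to_main_grad
        return grad_out @ weight, grad_out.t() @ inp
```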

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

lhb8125 and others added 5 commits September 18, 2025 07:00
Signed-off-by: Hongbin Liu <[email protected]>

Hongbinl/adapt for offload activation
@nvMelissa added the megatron, community-contribution, and waiting-for-feedback labels on Oct 9, 2025
@nvMelissa (Collaborator) commented Oct 15, 2025

Hi @GeYuhong, thanks for your contribution!
Our repository requires all commits to be signed off to comply with the Developer Certificate of Origin (DCO).
One or more of your commits are missing the Signed-off-by line. Please correct this issue so we can proceed with checks. For more details, refer to our CONTRIBUTING file: https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst
Thank you! cc: @timmoon10
