[main][feature][work in progress] zero-overhead activation offload #1752
base: main
Conversation
rank 1 | 0 1 2 0 1 2 3 4 3 4
"""

offload_mlp_input: bool = False
Do we still need this flag?
Fixed; see 1555e6d.
class ChunkOffloadHandler:
We should try to reuse the code of cpu_offload.py in TE as much as possible. IIUC, the class should derive from TE's AsyncDoubleBufferGroupOffloadHandler().
Fixed (although the PipelineOffload class is used, it achieves only limited reuse); see e845344.
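For reference, a minimal sketch of the subclassing idea, assuming TE's AsyncDoubleBufferGroupOffloadHandler can be imported from transformer_engine.pytorch.cpu_offload and that bulk_offload_group is the hook to specialize (the actual reuse in e845344 is more limited):

```python
# Sketch only: how a pipeline-aware handler could derive from TE's
# AsyncDoubleBufferGroupOffloadHandler instead of reimplementing offloading.
from transformer_engine.pytorch.cpu_offload import AsyncDoubleBufferGroupOffloadHandler


class PipelineOffloadHandler(AsyncDoubleBufferGroupOffloadHandler):
    """Adds pipeline-chunk bookkeeping on top of TE's double-buffered offload."""

    def bulk_offload_group(self, group_to_offload):
        # Let TE issue the async D2H copies, then track which group has been
        # offloaded so prefetch can be scheduled per pipeline chunk.
        super().bulk_offload_group(group_to_offload)
        self._offloaded_group_count = group_to_offload + 1
```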
tensor_on_device.record_stream(self.d2h_stream)
self._tensor_tag_to_state[tensor_tag] = state
self._offloaded_group_count = group_to_offload + 1
self._f_event.record(self.d2h_stream)
Maybe we can use stream synchronization instead, to reduce the discrepancy from TE's AsyncDoubleBufferGroupOffloadHandler(). Event synchronization is lightweight, but I don't think it affects performance much here.
Fixed; see e845344.
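For context, a small sketch of the two synchronization styles being discussed (the stream names are illustrative, not the PR's actual attributes):

```python
import torch

# Illustrative streams; the PR keeps its own d2h_stream inside the handler.
compute_stream = torch.cuda.current_stream()
d2h_stream = torch.cuda.Stream()

# Event-based synchronization: record an event on the offload stream after the
# copies are enqueued, then make the compute stream wait on that event.
copy_done = torch.cuda.Event()
with torch.cuda.stream(d2h_stream):
    # ... enqueue async device-to-host copies here ...
    copy_done.record(d2h_stream)
compute_stream.wait_event(copy_done)

# Stream-based synchronization (closer to TE's handler): make the compute
# stream wait for all work currently enqueued on the offload stream.
compute_stream.wait_stream(d2h_stream)
```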
return GroupStartFunction.apply(tensor, cur_forward_chunk)

def offloading_checker(tensor):
Do we still need the checker?
Fixed; see 7168ccd.
return len(self._queue)

def reset_chunk_handler(self, num_layer, offload_mlp_input=True):
    cur_vpp_rank = parallel_state.get_virtual_pipeline_model_parallel_rank()
get_virtual_pipeline_model_parallel_rank() is deprecated now. The vpp_size (now named vp_stage) is passed at runtime. The MR is here.
Fixed; see c9f00c7.
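A minimal sketch of the suggested signature change (the method body here is illustrative, not the code from c9f00c7):

```python
class PipelineOffloadManager:
    def reset_chunk_handler(self, num_layer, vp_stage=None, offload_mlp_input=True):
        # vp_stage is passed in at runtime instead of querying the deprecated
        # parallel_state.get_virtual_pipeline_model_parallel_rank().
        cur_vpp_rank = vp_stage if vp_stage is not None else 0
        self._cur_vpp_rank = cur_vpp_rank
        self._num_layer = num_layer
        self._offload_mlp_input = offload_mlp_input
```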
    MoEAuxLossAutoScaler.set_loss_scale(loss_scale)
else:
    if config.offload_activation:
        MoEPositiveAuxLossAutoScaler.set_loss_scale(loss_scale / num_microbatches)
Why do we need an extra loss scaler?
Fixed; see 4b0d3f1.
return hidden_states

def _offload_qkv_linear_forward(
Can we make it a factory function to simplify the calling logic for registering and offloading tensors?
Fixed; see b00acbc.
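A hedged sketch of the factory idea, reusing the names that appear elsewhere in this PR (PipelineOffloadManager, group_prefetch_offload_commit); their import paths and exact signatures are assumptions, not the code from b00acbc:

```python
def make_offload_forward(linear_fn):
    """Wrap a linear forward so that tagging the saved input for offload and
    committing the output to the prefetch/offload group happen in one place."""

    def forward(hidden_states, *args, **kwargs):
        hidden_states.offloading_activation = True  # mark the saved activation
        with PipelineOffloadManager.get_instance():
            output, bias = linear_fn(hidden_states, *args, **kwargs)
        # Commit the group and allow the input to be released once offloaded.
        return group_prefetch_offload_commit(
            output, bias, release_tensors=[hidden_states]
        )

    return forward


# Hypothetical usage inside the attention module:
# self._offload_qkv_forward = make_offload_forward(self.linear_qkv)
```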
Yes, this is the bug we encountered, and we have fixed it. Thank you!
Is this PR ready for use, or are there limitations to applying this patch?
Hongbinl/activation offloading: add arguments.py and minor fixes, OOTB runnable now
Support activation offloading with PP=1, PP, and VPP
core_attn_out.offloading_activation = True
with PipelineOffloadManager.get_instance():
    output, bias = self.linear_proj(core_attn_out)
output, bias = group_prefetch_offload_commit(output, bias, release_tensors=[core_attn_out])
core_attn_out is also saved by the fused attn; it may not work as expected.
I think offloading for the input of the activation function (swiglu) should also be added.
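For illustration, the same pattern as the attention snippet above could be applied to the swiglu input (the fc1 output); the call sites and signatures below are assumptions, not the PR's code:

```python
# Sketch: tag the fc1 output (the swiglu input) for offload as well.
intermediate, bias_parallel = self.linear_fc1(hidden_states)
intermediate.offloading_activation = True  # tensor saved by swiglu's backward
with PipelineOffloadManager.get_instance():
    activated = self.activation_func(intermediate)
activated, bias_parallel = group_prefetch_offload_commit(
    activated, bias_parallel, release_tensors=[intermediate]
)
```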
self.bulk_offload_group(self._layer_index)
if len(release_tensors) > 0:
    cur_stream = torch.cuda.current_stream()
    for release_tensor in release_tensors:
This cannot work with FP8. Can the saved FP8 tensors be released in time?
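As a generic illustration of the early-release pattern (not the PR's actual loop body), a tensor whose D2H copy runs on another stream is usually released with record_stream plus an explicit storage resize; the FP8 concern above is whether the saved quantized tensors can be dropped the same way in time:

```python
import torch


def release_after_offload(tensor: torch.Tensor, d2h_stream: torch.cuda.Stream):
    # Tell the caching allocator that d2h_stream still reads this memory, so the
    # block is not handed out again before the async D2H copy has finished ...
    tensor.record_stream(d2h_stream)
    # ... then drop the device storage eagerly instead of waiting for refcounting.
    tensor.untyped_storage().resize_(0)
```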
megatron/training/arguments.py
)

if args.offload_activation:
    assert not args.overlap_grad_reduce, "overlap_grad_reduce is not supported with offload_activation"
Why can't they be enabled together?
Hongbinl/activation offloading
Support mixed dense & MoE layers and a2a overlap
… into the last stage
This feature offloads model activations to host memory during the forward pass and prefetches them back during the backward pass.
Note: this requires TransformerEngine 2.5 with the feature PR (NVIDIA/TransformerEngine#2145).
Currently, this feature can be used in a few modules, such as core_attention and router_fc1; we will support more modules (including qkv_linear, router_fc2, and shared_experts) as soon as possible.
We rewrite _indices_to_multihot() in the token_dispatcher to remove all implicit synchronization without using fused ops, ensuring bitwise consistency.
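For illustration, one sync-free way to do such a conversion (the helper name and signature below are illustrative; the PR's rewrite may differ) is to scatter padded entries into a scratch column instead of using boolean-mask indexing, which would launch a synchronizing nonzero():

```python
import torch


def indices_to_multihot_sync_free(indices, probs, num_local_experts):
    """indices: [num_tokens, topk] expert ids padded with -1; probs: same shape.
    Returns a bool routing map and per-expert probs without any implicit
    device-to-host synchronization."""
    num_tokens = indices.shape[0]
    valid = indices != -1
    # Send padded entries to an extra scratch column instead of filtering them
    # with a boolean mask (which would force a D2H sync to size the result).
    scatter_index = torch.where(
        valid, indices, torch.full_like(indices, num_local_experts)
    ).long()
    routing_map = torch.zeros(
        num_tokens, num_local_experts + 1, dtype=probs.dtype, device=indices.device
    )
    routing_probs = torch.zeros_like(routing_map)
    routing_map.scatter_(1, scatter_index, valid.to(probs.dtype))
    routing_probs.scatter_(1, scatter_index, probs * valid.to(probs.dtype))
    # Drop the scratch column; valid probabilities are copied bit-for-bit.
    return (
        routing_map[:, :num_local_experts].bool(),
        routing_probs[:, :num_local_experts],
    )
```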
The following are the experimental results (dp4tp1cp1ep4pp2vpp2), including end-to-end performance and peak memory.
End-to-end performance:
Peak memory ($R$ is the ratio of the actual decrease in peak memory to the theoretical value, where the theoretical values of stage 0 and stage 1 are 1440 MB and 800 MB, respectively):
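For clarity, the ratio reported as $R$ follows directly from the definition above, where $M_{\text{peak}}^{\text{baseline}}$ and $M_{\text{peak}}^{\text{offload}}$ denote peak memory without and with offloading:

$$
R = \frac{M_{\text{peak}}^{\text{baseline}} - M_{\text{peak}}^{\text{offload}}}{\Delta M_{\text{theory}}},
\qquad
\Delta M_{\text{theory}} =
\begin{cases}
1440\ \text{MB} & \text{pipeline stage 0} \\
800\ \text{MB} & \text{pipeline stage 1}
\end{cases}
$$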