[main][feature][work in progress] zero-overhead activation offload #1752
base: main
Conversation
rank 1 | 0 1 2 0 1 2 3 4 3 4
"""

offload_mlp_input: bool = False
Do we still need this flag?
Fixed; see 1555e6d.
class ChunkOffloadHandler:
We should try to reuse the code of cpu_offload.py in TE as much as possible. IIUC, the class should derive from TE's AsyncDoubleBufferGroupOffloadHandler().
Fixed (although the PipelineOffload class is used, it achieves only limited reuse); see e845344.
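For reference, a minimal sketch of the subclassing idea, assuming TE's AsyncDoubleBufferGroupOffloadHandler can be imported from transformer_engine.pytorch.cpu_offload and that bulk_offload_group is the hook to specialize (the actual reuse in e845344 is more limited):

```python
# Sketch only: how a pipeline-aware handler could derive from TE's
# AsyncDoubleBufferGroupOffloadHandler instead of reimplementing offloading.
from transformer_engine.pytorch.cpu_offload import AsyncDoubleBufferGroupOffloadHandler


class PipelineOffloadHandler(AsyncDoubleBufferGroupOffloadHandler):
    """Adds pipeline-chunk bookkeeping on top of TE's double-buffered offload."""

    def bulk_offload_group(self, group_to_offload):
        # Let TE issue the async D2H copies, then track which group has been
        # offloaded so prefetch can be scheduled per pipeline chunk.
        super().bulk_offload_group(group_to_offload)
        self._offloaded_group_count = group_to_offload + 1
```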
tensor_on_device.record_stream(self.d2h_stream)
self._tensor_tag_to_state[tensor_tag] = state
self._offloaded_group_count = group_to_offload + 1
self._f_event.record(self.d2h_stream)
Maybe we can use stream synchronization instead, to reduce the discrepancy from TE's AsyncDoubleBufferGroupOffloadHandler(). Event synchronization is lightweight, but I don't think it affects performance much here.
Fixed; see e845344.
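For context, a small sketch of the two synchronization styles being discussed (the stream names are illustrative, not the PR's actual attributes):

```python
import torch

# Illustrative streams; the PR keeps its own d2h_stream inside the handler.
compute_stream = torch.cuda.current_stream()
d2h_stream = torch.cuda.Stream()

# Event-based synchronization: record an event on the offload stream after the
# copies are enqueued, then make the compute stream wait on that event.
copy_done = torch.cuda.Event()
with torch.cuda.stream(d2h_stream):
    # ... enqueue async device-to-host copies here ...
    copy_done.record(d2h_stream)
compute_stream.wait_event(copy_done)

# Stream-based synchronization (closer to TE's handler): make the compute
# stream wait for all work currently enqueued on the offload stream.
compute_stream.wait_stream(d2h_stream)
```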
return GroupStartFunction.apply(tensor, cur_forward_chunk)

def offloading_checker(tensor):
Do we still need the checker?
Fixed; see 7168ccd.
return len(self._queue)

def reset_chunk_handler(self, num_layer, offload_mlp_input=True):
    cur_vpp_rank = parallel_state.get_virtual_pipeline_model_parallel_rank()
get_virtual_pipeline_model_parallel_rank() is deprecated now. The vpp_size (now named vp_stage) is passed at runtime. The MR is here.
Fixed; see c9f00c7.
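A minimal sketch of the suggested signature change (the method body here is illustrative, not the code from c9f00c7):

```python
class PipelineOffloadManager:
    def reset_chunk_handler(self, num_layer, vp_stage=None, offload_mlp_input=True):
        # vp_stage is passed in at runtime instead of querying the deprecated
        # parallel_state.get_virtual_pipeline_model_parallel_rank().
        cur_vpp_rank = vp_stage if vp_stage is not None else 0
        self._cur_vpp_rank = cur_vpp_rank
        self._num_layer = num_layer
        self._offload_mlp_input = offload_mlp_input
```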
    MoEAuxLossAutoScaler.set_loss_scale(loss_scale)
else:
    if config.offload_activation:
        MoEPositiveAuxLossAutoScaler.set_loss_scale(loss_scale / num_microbatches)
Why do we need an extra loss scaler?
Fixed; see 4b0d3f1.
return hidden_states

def _offload_qkv_linear_forward(
Can we make it a factory function to simplify the calling logic for registering and offloading tensors?
Fixed; see b00acbc.
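A hedged sketch of the factory idea, reusing the names that appear elsewhere in this PR (PipelineOffloadManager, group_prefetch_offload_commit); their import paths and exact signatures are assumptions, not the code from b00acbc:

```python
def make_offload_forward(linear_fn):
    """Wrap a linear forward so that tagging the saved input for offload and
    committing the output to the prefetch/offload group happen in one place."""

    def forward(hidden_states, *args, **kwargs):
        hidden_states.offloading_activation = True  # mark the saved activation
        with PipelineOffloadManager.get_instance():
            output, bias = linear_fn(hidden_states, *args, **kwargs)
        # Commit the group and allow the input to be released once offloaded.
        return group_prefetch_offload_commit(
            output, bias, release_tensors=[hidden_states]
        )

    return forward


# Hypothetical usage inside the attention module:
# self._offload_qkv_forward = make_offload_forward(self.linear_qkv)
```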
Yes, this is the bug we encountered, and we have fixed it. Thank you!
Is this PR ready for use, or are there limitations to applying this patch?
Hongbinl/activation offloading: add arguments.py and minor fixes, OOTB runnable now
Support activation offloading with PP=1, PP, and VPP
core_attn_out.offloading_activation = True
with PipelineOffloadManager.get_instance():
    output, bias = self.linear_proj(core_attn_out)
output, bias = group_prefetch_offload_commit(output, bias, release_tensors=[core_attn_out])
core_attn_out is also saved by the fused attn; it may not work as expected.
I think offloading for the input of the activation function (swiglu) should also be added.
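For illustration, the same pattern as the attention snippet above could be applied to the swiglu input (the fc1 output); the call sites and signatures below are assumptions, not the PR's code:

```python
# Sketch: tag the fc1 output (the swiglu input) for offload as well.
intermediate, bias_parallel = self.linear_fc1(hidden_states)
intermediate.offloading_activation = True  # tensor saved by swiglu's backward
with PipelineOffloadManager.get_instance():
    activated = self.activation_func(intermediate)
activated, bias_parallel = group_prefetch_offload_commit(
    activated, bias_parallel, release_tensors=[intermediate]
)
```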
self.bulk_offload_group(self._layer_index)
if len(release_tensors) > 0:
    cur_stream = torch.cuda.current_stream()
    for release_tensor in release_tensors:
This cannot work with FP8. Can the saved FP8 tensors be released in time?
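As a generic illustration of the early-release pattern (not the PR's actual loop body), a tensor whose D2H copy runs on another stream is usually released with record_stream plus an explicit storage resize; the FP8 concern above is whether the saved quantized tensors can be dropped the same way in time:

```python
import torch


def release_after_offload(tensor: torch.Tensor, d2h_stream: torch.cuda.Stream):
    # Tell the caching allocator that d2h_stream still reads this memory, so the
    # block is not handed out again before the async D2H copy has finished ...
    tensor.record_stream(d2h_stream)
    # ... then drop the device storage eagerly instead of waiting for refcounting.
    tensor.untyped_storage().resize_(0)
```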
megatron/training/arguments.py
)

if args.offload_activation:
    assert not args.overlap_grad_reduce, "overlap_grad_reduce is not supported with offload_activation"
Why can't they be enabled together?
Hongbinl/activation offloading
Support mixed dense & MoE layers and a2a overlap
… into the last stage
This feature offloads model activations to host memory during the forward pass and prefetches them back during the backward pass.
Note: this requires TransformerEngine 2.5 with the feature PR (NVIDIA/TransformerEngine#2145).
Currently, this feature can be used in a few modules, such as core_attention and router_fc1; we will support more modules (including qkv_linear, router_fc2, and shared_experts) as soon as possible.
We rewrite _indices_to_multihot() in the token_dispatcher to remove all implicit synchronization without using fused ops, ensuring bitwise consistency.
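For illustration, one sync-free way to do such a conversion (the helper name and signature below are illustrative; the PR's rewrite may differ) is to scatter padded entries into a scratch column instead of using boolean-mask indexing, which would launch a synchronizing nonzero():

```python
import torch


def indices_to_multihot_sync_free(indices, probs, num_local_experts):
    """indices: [num_tokens, topk] expert ids padded with -1; probs: same shape.
    Returns a bool routing map and per-expert probs without any implicit
    device-to-host synchronization."""
    num_tokens = indices.shape[0]
    valid = indices != -1
    # Send padded entries to an extra scratch column instead of filtering them
    # with a boolean mask (which would force a D2H sync to size the result).
    scatter_index = torch.where(
        valid, indices, torch.full_like(indices, num_local_experts)
    ).long()
    routing_map = torch.zeros(
        num_tokens, num_local_experts + 1, dtype=probs.dtype, device=indices.device
    )
    routing_probs = torch.zeros_like(routing_map)
    routing_map.scatter_(1, scatter_index, valid.to(probs.dtype))
    routing_probs.scatter_(1, scatter_index, probs * valid.to(probs.dtype))
    # Drop the scratch column; valid probabilities are copied bit-for-bit.
    return (
        routing_map[:, :num_local_experts].bool(),
        routing_probs[:, :num_local_experts],
    )
```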
The following are the experimental results (dp4tp1cp1ep4pp2vpp2), including end-to-end performance and peak memory.
End-to-end performance:
Peak memory ($R$ is the ratio of the actual decrease in peak memory to the theoretical value, where the theoretical values of stage 0 and stage 1 are 1440 MB and 800 MB, respectively):
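For clarity, the ratio reported as $R$ follows directly from the definition above, where $M_{\text{peak}}^{\text{baseline}}$ and $M_{\text{peak}}^{\text{offload}}$ denote peak memory without and with offloading:

$$
R = \frac{M_{\text{peak}}^{\text{baseline}} - M_{\text{peak}}^{\text{offload}}}{\Delta M_{\text{theory}}},
\qquad
\Delta M_{\text{theory}} =
\begin{cases}
1440\ \text{MB} & \text{pipeline stage 0} \\
800\ \text{MB} & \text{pipeline stage 1}
\end{cases}
$$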