
Z3: optimizations for grad norm calculation and gradient clipping #5504

Open · wants to merge 11 commits into master
Conversation

nelyahu
Contributor

@nelyahu nelyahu commented May 7, 2024

This PR adds the following functionality:

  1. complete_grad_norm_calculation_for_cpu_offload: move total_norm to the CPU, since the expected device in this case is the CPU.
  2. Replace get_global_norm() with torch.linalg.norm() for better performance.
  3. unscale_and_clip_grads: replace if-statement-based clipping with torch.clamp for better performance.

Change (3) is taken from #5547 (which was closed).
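The three changes above can be sketched in plain PyTorch. This is a hedged illustration, not DeepSpeed's actual code: the function names `global_grad_norm` and `unscale_and_clip` are hypothetical stand-ins for the PR's helpers, and `grads` stands in for the partitioned ZeRO-3 gradients.

```python
import torch

def global_grad_norm(grads, cpu_offload=False):
    # (2) Stack per-tensor norms and reduce them with torch.linalg.norm
    # instead of a Python-level get_global_norm() loop.
    norms = torch.stack([torch.linalg.norm(g.float()) for g in grads])
    total_norm = torch.linalg.norm(norms)
    # (1) With CPU offload, downstream code expects the norm on the CPU,
    # so move it there explicitly.
    if cpu_offload:
        total_norm = total_norm.cpu()
    return total_norm

def unscale_and_clip(grads, total_norm, loss_scale=1.0, clip=1.0):
    # (3) Branch-free clipping: clamp the scale factor with torch.clamp
    # instead of an `if total_norm > clip:` statement. When the norm is
    # already within the limit, clamp(..., min=1.0) leaves it unscaled.
    clip_coef = torch.clamp(total_norm / clip, min=1.0)
    combined_scale = 1.0 / (loss_scale * clip_coef)
    for g in grads:
        g.mul_(combined_scale)
    return grads
```

The clamp form avoids a data-dependent Python branch, which keeps the operation on-device and friendlier to CUDA graphs and tracing.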

@jomayeri
Contributor

Changing this line has been associated with several bugs #5422, #5538

@nelyahu nelyahu changed the title z3 scaled_global_grad_norm: repalce get_global_norm with torch.norm Z3: optimizations for grad norm calculation and gradient clipping May 27, 2024
@loadams
Contributor

loadams commented Jun 26, 2024

Changing this line has been associated with several bugs #5422, #5538

@nelyahu - thoughts on this comment, seems last time this line was modified users ran into issues?

@nelyahu
Contributor Author

nelyahu commented Jun 26, 2024

Changing this line has been associated with several bugs #5422, #5538

@nelyahu - thoughts on this comment, seems last time this line was modified users ran into issues?

@loadams, yes - this optimization was already pushed and reverted due to ds-chat failures in cpu-offload configurations.
I did offline debugging of those failures and improved the code change so that it passes. Since then, ds-chat tests were added to the DeepSpeed repo CI, and it is now passing.
Are there any other tests (full model training, for example) that do not exist in the CI and can be run manually?

@tjruwase
Contributor

I did offline debugging of those failures and improved the code change so that it passes

@nelyahu, it's great that you narrowed this down. Do you think a unit test can be added for this case?
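A minimal unit test along the lines suggested here might check only the device contract of the CPU-offload path. This is a hedged sketch, not part of DeepSpeed's test suite: `complete_grad_norm_for_cpu_offload` below is a hypothetical helper mimicking the contract of the PR's complete_grad_norm_calculation_for_cpu_offload, namely that the returned total norm lives on the CPU when offload is enabled.

```python
import torch

def complete_grad_norm_for_cpu_offload(total_norm):
    # Mimics the fix: move the computed norm to the CPU, the device
    # downstream code expects when cpu-offload is enabled.
    return total_norm.cpu()

def test_grad_norm_device_with_cpu_offload():
    # In practice total_norm would be computed on the accelerator.
    total_norm = torch.tensor(5.0)
    result = complete_grad_norm_for_cpu_offload(total_norm)
    assert result.device.type == "cpu"
    assert result.item() == 5.0

test_grad_norm_device_with_cpu_offload()
```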
