add muon code #168
base: main
Conversation
lgtm! just some small nits
if isinstance(g, DTensor):
    g, meta = to_local(g, keep_sharded=False)
This will result in every rank doing the same orthogonalization computation?
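For illustration, here is a minimal sketch of what a `to_local` helper like the one in this diff might do, assuming `keep_sharded=False` means gathering the full tensor onto every rank (the names and behavior are an assumption, not necessarily the PR's actual helper). Under that assumption, each rank ends up with an identical full matrix, so the orthogonalization that follows would indeed be repeated on every rank:

```python
import torch
from torch.distributed.tensor import DTensor  # torch.distributed._tensor on older PyTorch

def to_local(g: DTensor, keep_sharded: bool = False):
    """Hypothetical sketch: convert a DTensor gradient to a plain torch.Tensor,
    returning the metadata needed to rebuild a DTensor later."""
    meta = (g.device_mesh, g.placements)
    if keep_sharded:
        # keep only this rank's local shard
        return g.to_local(), meta
    # gather the full tensor onto every rank: each rank then holds an identical
    # copy, so any follow-up computation (e.g. Newton-Schulz) is duplicated per rank
    return g.full_tensor(), meta
```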
meta = None
if isinstance(g, DTensor):
    g, meta = to_local(g, keep_sharded=False)
    # gives NaNs when done with DTensor, instead of throwing a typical op-not-supported error; quite sneaky
that's not good 🥲
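For context, the orthogonalization that hits this is Muon's Newton-Schulz iteration. A sketch along the lines of the reference Muon implementation, operating on a plain local `torch.Tensor` (shown for illustration; the PR's exact code may differ):

```python
import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic Newton-Schulz iteration used by Muon to approximately
    # orthogonalize a 2D momentum/gradient matrix.
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    X /= (X.norm() + eps)  # scale so the spectral norm is at most ~1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X
```

Run on a local tensor this is just a handful of matmuls; the report above is that routing the same matmuls through `DTensor` silently produced NaNs rather than an unsupported-op error.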
The code is mostly copy-pasted from this repo btw: https://github.com/ethansmith2000/fsdp_optimizers
How do you think it should be handled?
Personally, if the optimizer code is meant to run locally per rank without any collectives and you own the optimizer implementation, then converting `DTensor`s to local `torch.Tensor`s for all of the computation seems fine (and will have slightly lower eager-mode overhead due to avoiding `DTensor.__torch_dispatch__`).

The main value add of `DTensor` in that case is providing the sharding info/metadata on the tensor, which could be useful in the state dict, for example. For that, you would still want the optimizer states to be saved as `DTensor`s.
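A minimal sketch of that pattern, with a made-up `muon_like_update` step (the function name, state layout, and hyperparameters are illustrative, not the PR's API): do the math on plain local tensors, then re-wrap anything that needs sharding metadata as `DTensor` so the state dict still carries it.

```python
import torch
from torch.distributed.tensor import DTensor, distribute_tensor

@torch.no_grad()
def muon_like_update(p, grad, state, lr=0.02, momentum=0.95):
    # 1) Unwrap: remember mesh/placements, then work on plain torch.Tensors.
    meta = None
    if isinstance(grad, DTensor):
        meta = (grad.device_mesh, grad.placements)
        grad = grad.full_tensor()      # plain tensor; avoids DTensor dispatch per op

    buf = state.get("momentum_buffer")
    if buf is None:
        buf = torch.zeros_like(grad)
    elif isinstance(buf, DTensor):
        buf = buf.full_tensor()        # state was stored as DTensor; gather for local math
    buf = buf.mul(momentum).add(grad)

    # Local-only compute, no collectives (Newton-Schulz sketch from earlier in the thread).
    update = zeropower_via_newtonschulz5(buf).to(p.dtype)

    # 2) Re-wrap: shard the update so it can be applied to the DTensor parameter,
    #    and keep the momentum state as a DTensor so sharding info lands in the state dict.
    if meta is not None:
        device_mesh, placements = meta
        update = distribute_tensor(update, device_mesh, placements)
        buf = distribute_tensor(buf, device_mesh, placements)

    state["momentum_buffer"] = buf
    p.add_(update, alpha=-lr)
```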
I see, makes sense. Thanks!