Test hypothesis: scaling batch size widens the gap between Muon & AdamW #1558
Conversation
Hi @WhenWen! I'm curious what my next steps here should be. Also, would it be possible to request access to the raw result files?
Hi @leloykun, I have run some experiments on more model sizes and Chinchilla ratios and will post some raw results here this afternoon. After that, I think we should merge these experiments into the main branch.
This pull request has been inactive for 23 days and is marked as stale.
@WhenWen did we ever run this?
So sorry for the delay! I have been under the weather these past few days. The runs have finished and are here: https://wandb.ai/marin-community/BatchSize
Pull request overview
This PR introduces an experimental script to test the hypothesis that the Muon optimizer benefits more from larger batch sizes than AdamW does, based on recent theoretical work showing that Muon has a larger critical batch size. The script performs a batch size sweep (64, 128, 256, 512, 1024) for both Muon and AdamW across two model sizes (130M and 300M parameters), using Chinchilla-optimal training steps and scaling the learning rate with the square root of the batch size; a rough sketch of the sweep structure is shown after the list of key changes.
Key Changes
- Implements batch size sweep comparing Muon vs AdamW optimizers
- Applies learning rate scaling based on the lr ~ sqrt(batch_size) relationship
- Uses pre-tuned optimizer configurations from prior hyperparameter searches
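A minimal sketch of how such a sweep could be structured, assuming placeholder constants (the sequence length, token budgets, and tuned values below are illustrative assumptions, not the values from the actual script):

```python
import itertools

# All constants here are illustrative assumptions, not the PR's actual values.
BASELINE_BATCH_SIZE = 128
BATCH_SIZES = [64, 128, 256, 512, 1024]
MODEL_SIZES = {"130m": 130e6, "300m": 300e6}   # parameter counts
OPTIMIZERS = ["adamw", "muon"]
SEQ_LEN = 2048                                 # assumed sequence length

for (size_name, n_params), opt, batch_size in itertools.product(
    MODEL_SIZES.items(), OPTIMIZERS, BATCH_SIZES
):
    # Chinchilla-optimal budget: roughly 20 tokens per parameter, fixed per
    # model size, so larger batches translate into fewer training steps.
    num_train_steps = int(20 * n_params / (batch_size * SEQ_LEN))
    # lr ~ sqrt(batch_size) scaling relative to the tuned baseline batch size.
    lr_scale = (batch_size / BASELINE_BATCH_SIZE) ** 0.5
    print(f"{size_name} {opt:5s} bs={batch_size:4d} steps={num_train_steps:6d} lr_scale={lr_scale:.2f}")
```

In the actual script, as the review comments below point out, the scaled learning rate also needs to be written back into the optimizer config, since the optimizer config takes precedence over the value passed to the trainer.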
# Taken from Simo Ryu's observation that lr ~ sqrt(BS) also holds for Shampoo & Muon:
# https://x.com/cloneofsimo/status/1907731069878825400
baseline_batch_size = 128
learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5
Copilot AI (Nov 23, 2025)
For MuonConfig, both learning_rate and adam_lr should be scaled with batch size according to the sqrt(BS) scaling rule mentioned in the comment on line 222. Currently, only learning_rate is being scaled.
The fix should scale both learning rates:
learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5
if optimizer_name == "muon":
    adam_lr = optimizer_config.adam_lr * (batch_size / baseline_batch_size)**0.5
    optimizer_config = dataclasses.replace(optimizer_config, learning_rate=learning_rate, adam_lr=adam_lr)
else:
    optimizer_config = dataclasses.replace(optimizer_config, learning_rate=learning_rate)
learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5
train = SimpleTrainConfig(
    resource_config,
    train_batch_size=batch_size,
    num_train_steps=num_train_steps,
    learning_rate=learning_rate,
    optimizer_config=optimizer_config,
)
Copilot AI (Nov 23, 2025)
The learning rate scaling is not being applied correctly. The scaled learning_rate (line 224) is passed to SimpleTrainConfig, but since optimizer_config is also provided (line 230), the training code will use optimizer_config.learning_rate directly and ignore the scaled value passed to SimpleTrainConfig.
To fix this, you need to update the optimizer_config itself with the scaled learning rate using dataclasses.replace:
learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5
optimizer_config = dataclasses.replace(optimizer_config, learning_rate=learning_rate)
train = SimpleTrainConfig(
    resource_config,
    train_batch_size=batch_size,
    num_train_steps=num_train_steps,
    learning_rate=learning_rate,  # This can stay for consistency but won't be used
    optimizer_config=optimizer_config,
)
The results are now documented at #1565
This pull request has been inactive for 23 days and is marked as stale.
I think we should merge this into a branch, tag it, and then reference it (and the blog post!) on our website. WDYT? We'd like to move away from storing every single thing in main forever (where we have to maintain it).
Great result, btw!
This pull request has been inactive for 23 days and is marked as stale.
This pull request has been automatically closed due to inactivity.
Description
Recent work has shown that Muon has a larger critical batch size than AdamW. Intuitively, this means that Muon can, theoretically, squeeze more juice out of larger batch sizes (or, more accurately, more tokens per step) than AdamW. See "Convergence Bound and Critical Batch Size of Muon Optimizer" on arXiv.
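As a rough illustration of the "tokens per step" framing and the sqrt learning-rate scaling used in the sweep (the sequence length and baseline learning rate below are assumptions, not values from this PR):

```python
seq_len = 2048            # assumed sequence length
baseline_lr = 3e-3        # hypothetical tuned learning rate at batch size 128
baseline_batch_size = 128

for batch_size in [64, 128, 256, 512, 1024]:
    tokens_per_step = batch_size * seq_len
    scaled_lr = baseline_lr * (batch_size / baseline_batch_size) ** 0.5
    # e.g. going from 128 -> 1024 multiplies tokens/step by 8x and lr by sqrt(8) ~ 2.83x
    print(f"bs={batch_size:4d}  tokens/step={tokens_per_step:8,d}  lr={scaled_lr:.2e}")
```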
I've somewhat verified this to be true in the modded-nanogpt speedruns, but it's best to double-check just to be sure.
Notes: