Test hypothesis: scaling batch size widens the gap between Muon & AdamW #1558
Conversation
Hi @WhenWen! I'm curious what my next steps here should be. Also, would it be possible to request access to the raw result files?
Hi @leloykun, I have run some experiments on more model sizes and Chinchilla ratios and will post some raw results here this afternoon. After that, I think we should merge these experiments into the main branch.
This pull request has been inactive for 23 days and is marked as stale.
@WhenWen did we ever run this?
So sorry for the delay! I have been under the weather these past few days. The runs have finished and are here: https://wandb.ai/marin-community/BatchSize
Pull request overview
This PR introduces an experimental script to test the hypothesis that the Muon optimizer benefits more from larger batch sizes than AdamW does, based on recent theoretical work showing that Muon has a larger critical batch size. The script performs a batch size sweep (64, 128, 256, 512, 1024) for both Muon and AdamW across two model sizes (130M and 300M parameters), using Chinchilla-optimal training steps and scaling the learning rate with the square root of the batch size; a rough sketch of the sweep structure is shown after the list of key changes.
Key Changes
- Implements batch size sweep comparing Muon vs AdamW optimizers
- Applies learning rate scaling based on the lr ~ sqrt(batch_size) relationship
- Uses pre-tuned optimizer configurations from prior hyperparameter searches
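A minimal sketch of how such a sweep could be structured, assuming placeholder constants (the sequence length, token budgets, and tuned values below are illustrative assumptions, not the values from the actual script):

```python
import itertools

# All constants here are illustrative assumptions, not the PR's actual values.
BASELINE_BATCH_SIZE = 128
BATCH_SIZES = [64, 128, 256, 512, 1024]
MODEL_SIZES = {"130m": 130e6, "300m": 300e6}   # parameter counts
OPTIMIZERS = ["adamw", "muon"]
SEQ_LEN = 2048                                 # assumed sequence length

for (size_name, n_params), opt, batch_size in itertools.product(
    MODEL_SIZES.items(), OPTIMIZERS, BATCH_SIZES
):
    # Chinchilla-optimal budget: roughly 20 tokens per parameter, fixed per
    # model size, so larger batches translate into fewer training steps.
    num_train_steps = int(20 * n_params / (batch_size * SEQ_LEN))
    # lr ~ sqrt(batch_size) scaling relative to the tuned baseline batch size.
    lr_scale = (batch_size / BASELINE_BATCH_SIZE) ** 0.5
    print(f"{size_name} {opt:5s} bs={batch_size:4d} steps={num_train_steps:6d} lr_scale={lr_scale:.2f}")
```

In the actual script, as the review comments below point out, the scaled learning rate also needs to be written back into the optimizer config, since the optimizer config takes precedence over the value passed to the trainer.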
# Taken from Simo Ryu's observation that lr ~ sqrt(BS) also holds for Shampoo & Muon:
# https://x.com/cloneofsimo/status/1907731069878825400
baseline_batch_size = 128
learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5
Copilot AI (Nov 23, 2025)
For MuonConfig, both learning_rate and adam_lr should be scaled with batch size according to the sqrt(BS) scaling rule mentioned in the comment on line 222. Currently, only learning_rate is being scaled.
The fix should scale both learning rates:
learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5
if optimizer_name == "muon":
    adam_lr = optimizer_config.adam_lr * (batch_size / baseline_batch_size)**0.5
    optimizer_config = dataclasses.replace(optimizer_config, learning_rate=learning_rate, adam_lr=adam_lr)
else:
    optimizer_config = dataclasses.replace(optimizer_config, learning_rate=learning_rate)
learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5
train = SimpleTrainConfig(
    resource_config,
    train_batch_size=batch_size,
    num_train_steps=num_train_steps,
    learning_rate=learning_rate,
    optimizer_config=optimizer_config,
)
Copilot AI (Nov 23, 2025)
The learning rate scaling is not being applied correctly. The scaled learning_rate (line 224) is passed to SimpleTrainConfig, but since optimizer_config is also provided (line 230), the training code will use optimizer_config.learning_rate directly and ignore the scaled value passed to SimpleTrainConfig.
To fix this, you need to update the optimizer_config itself with the scaled learning rate using dataclasses.replace:
learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5
optimizer_config = dataclasses.replace(optimizer_config, learning_rate=learning_rate)
train = SimpleTrainConfig(
    resource_config,
    train_batch_size=batch_size,
    num_train_steps=num_train_steps,
    learning_rate=learning_rate,  # This can stay for consistency but won't be used
    optimizer_config=optimizer_config,
)
The results are now documented at #1565
This pull request has been inactive for 23 days and is marked as stale.
I think we should merge this into a branch, tag it, and then reference it (and the blog post!) on our website. WDYT? We'd like to move away from storing every single thing in main forever (where we have to maintain it).
Great result, btw!
This pull request has been inactive for 23 days and is marked as stale.
This pull request has been automatically closed due to inactivity.
Description
Recent work has shown that Muon has a larger critical batch size than AdamW. Intuitively, this means that Muon can, theoretically, squeeze more juice out of larger batch sizes (or, more accurately, more tokens per step) than AdamW. See "Convergence Bound and Critical Batch Size of Muon Optimizer" on arXiv.
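As a rough illustration of the "tokens per step" framing and the sqrt learning-rate scaling used in the sweep (the sequence length and baseline learning rate below are assumptions, not values from this PR):

```python
seq_len = 2048            # assumed sequence length
baseline_lr = 3e-3        # hypothetical tuned learning rate at batch size 128
baseline_batch_size = 128

for batch_size in [64, 128, 256, 512, 1024]:
    tokens_per_step = batch_size * seq_len
    scaled_lr = baseline_lr * (batch_size / baseline_batch_size) ** 0.5
    # e.g. going from 128 -> 1024 multiplies tokens/step by 8x and lr by sqrt(8) ~ 2.83x
    print(f"bs={batch_size:4d}  tokens/step={tokens_per_step:8,d}  lr={scaled_lr:.2e}")
```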
I've somewhat verified this to be true in the modded-nanogpt speedruns, but it's best to double-check just to be sure.
Notes: