Conversation

@leloykun
Contributor

@leloykun leloykun commented Sep 5, 2025

Description

Recent work has shown that Muon has a larger critical batch size than AdamW. Intuitively, this means that Muon can, in theory, squeeze more out of larger batch sizes (or, more accurately, more tokens per step) than AdamW. See: Convergence Bound and Critical Batch Size of Muon Optimizer on arXiv.

I've informally verified this in the modded-nanogpt speedruns, but it's worth double-checking here.
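
For context (my framing, not a result from the linked paper): a standard heuristic from the large-batch-training literature (e.g. McCandlish et al., 2018, "An Empirical Model of Large-Batch Training") models the steps S(B) and examples E(B) needed to reach a target loss at batch size B as

S(B) \approx S_{\min}\left(1 + \frac{B_{\mathrm{crit}}}{B}\right), \qquad E(B) \approx E_{\min}\left(1 + \frac{B}{B_{\mathrm{crit}}}\right)

Below B_crit, doubling the batch size roughly halves the number of steps to the target; above it, the step savings flatten out while the token cost keeps growing. A larger B_crit for Muon would mean it keeps near-linear step savings at batch sizes where AdamW has already saturated.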


Notes:

@WhenWen WhenWen self-requested a review September 5, 2025 16:36
@leloykun
Contributor Author

leloykun commented Oct 4, 2025

Hi @WhenWen ! I'm curious what my next steps here should be. Also, would it be possible to request access to the raw result files?

@WhenWen
Contributor

WhenWen commented Oct 5, 2025

Hi @leloykun, I have run some experiments on more model sizes and Chinchilla ratios and will post some raw results here this afternoon. After that, I think we should merge these experiments into the main branch.

@github-actions
Contributor

github-actions bot commented Nov 5, 2025

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

@github-actions github-actions bot added the stale label Nov 5, 2025
@dlwh
Member

dlwh commented Nov 5, 2025

@WhenWen did we ever run this?

@github-actions github-actions bot removed the stale label Nov 7, 2025
@leloykun
Contributor Author

leloykun commented Nov 8, 2025

Hi @WhenWen @dlwh! Are there other things I need to do on my end for the runs to get started?

@WhenWen
Contributor

WhenWen commented Nov 8, 2025

So sorry for the delay! I have been under the weather these past few days. The runs have finished and are here: https://wandb.ai/marin-community/BatchSize

Copilot AI review requested due to automatic review settings November 23, 2025 02:13
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces an experimental script to test the hypothesis that the Muon optimizer benefits more from larger batch sizes than AdamW, based on recent theoretical work showing that Muon has a larger critical batch size. The script performs a batch size sweep (64, 128, 256, 512, 1024) for both Muon and AdamW across two model sizes (130M and 300M parameters), using Chinchilla-optimal training steps and scaling the learning rate with the square root of the batch size (a rough sketch of this wiring follows after the list below).

Key Changes

  • Implements batch size sweep comparing Muon vs AdamW optimizers
  • Applies learning rate scaling based on the lr ~ sqrt(batch_size) relationship
  • Uses pre-tuned optimizer configurations from prior hyperparameter searches
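
A minimal sketch of how the sweep wiring could look, assuming the SimpleTrainConfig / MuonConfig fields and the dataclasses.replace pattern quoted in the review comments below; the chinchilla_steps helper and the 20-tokens-per-parameter constant are illustrative, not taken from the PR:

import dataclasses

BATCH_SIZES = [64, 128, 256, 512, 1024]
BASELINE_BATCH_SIZE = 128  # batch size at which the base learning rates were tuned

def chinchilla_steps(num_params: int, batch_size: int, seq_len: int) -> int:
    """Steps needed to reach roughly 20 tokens per parameter at this batch size (illustrative)."""
    total_tokens = 20 * num_params
    return max(1, total_tokens // (batch_size * seq_len))

def scale_optimizer_config(optimizer_name: str, optimizer_config, batch_size: int):
    """Scale the learning rate(s) by sqrt(batch_size / baseline), per lr ~ sqrt(BS)."""
    scale = (batch_size / BASELINE_BATCH_SIZE) ** 0.5
    new_lr = optimizer_config.learning_rate * scale
    if optimizer_name == "muon":
        # Muon also carries an Adam learning rate for the non-matrix parameters;
        # the review below suggests scaling it the same way.
        return dataclasses.replace(
            optimizer_config,
            learning_rate=new_lr,
            adam_lr=optimizer_config.adam_lr * scale,
        )
    return dataclasses.replace(optimizer_config, learning_rate=new_lr)

The scaled config is then handed to SimpleTrainConfig with train_batch_size=batch_size and num_train_steps computed from the Chinchilla-optimal token budget, as in the diff hunks quoted below.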


# Taken from Simo Ryu's observation that lr ~ sqrt(BS) also holds for Shampoo & Muon: https://x.com/cloneofsimo/status/1907731069878825400
baseline_batch_size = 128
learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5

Copilot AI Nov 23, 2025

For MuonConfig, both learning_rate and adam_lr should be scaled with batch size according to the sqrt(BS) scaling rule mentioned in the comment on line 222. Currently, only learning_rate is being scaled.

The fix should scale both learning rates:

learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5
if optimizer_name == "muon":
    adam_lr = optimizer_config.adam_lr * (batch_size / baseline_batch_size)**0.5
    optimizer_config = dataclasses.replace(optimizer_config, learning_rate=learning_rate, adam_lr=adam_lr)
else:
    optimizer_config = dataclasses.replace(optimizer_config, learning_rate=learning_rate)

Comment on lines +224 to +230
learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5
train = SimpleTrainConfig(
resource_config,
train_batch_size=batch_size,
num_train_steps=num_train_steps,
learning_rate=learning_rate,
optimizer_config=optimizer_config,

Copilot AI Nov 23, 2025

The learning rate scaling is not being applied correctly. The scaled learning_rate (line 224) is passed to SimpleTrainConfig, but since optimizer_config is also provided (line 230), the training code will use optimizer_config.learning_rate directly and ignore the scaled value passed to SimpleTrainConfig.

To fix this, you need to update the optimizer_config itself with the scaled learning rate using dataclasses.replace:

learning_rate = optimizer_config.learning_rate * (batch_size / baseline_batch_size)**0.5
optimizer_config = dataclasses.replace(optimizer_config, learning_rate=learning_rate)
train = SimpleTrainConfig(
    resource_config,
    train_batch_size=batch_size,
    num_train_steps=num_train_steps,
    learning_rate=learning_rate,  # This can stay for consistency but won't be used
    optimizer_config=optimizer_config,
)

@WhenWen
Contributor

WhenWen commented Nov 23, 2025

The results are now documented at #1565

@github-actions
Contributor

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

@github-actions github-actions bot added the stale label Dec 17, 2025
@dlwh
Member

dlwh commented Dec 17, 2025

I think we should merge this into a branch, tag it, and then reference it (and the blog post!) on our website. WDYT? We'd like to move away from storing every single thing in main forever (where we have to maintain it).

@dlwh
Member

dlwh commented Dec 17, 2025

great result btw!

@github-actions github-actions bot removed the stale label Dec 18, 2025
@github-actions
Contributor

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

@github-actions github-actions bot added the stale label Jan 10, 2026
@github-actions
Contributor

This pull request has been automatically closed due to inactivity.
If you would like to continue working on this, please reopen it or create a new PR.

@github-actions github-actions bot closed this Jan 18, 2026