speedrun submission: Add llama_50m_muon_1x - Muon optimizer at 1× Chinchilla scale #2185
Conversation
Pull request overview
This PR adds a baseline experiment validating the Muon optimizer on a 50M-parameter Llama model at 1× Chinchilla-optimal data scale (1B tokens). The run achieves 1.3989 BPB on the Paloma benchmark and serves as a preliminary test before investing in larger 4× data runs.
Key changes:
- New training script configuring Muon optimizer with specific hyperparameters (lr=0.020, adam_lr=0.004, momentum=0.95)
- Training configuration for 7,629 steps on H200 GPU (~20 minutes runtime)
- Results JSON documenting metrics and configuration for reproducibility
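For reviewers unfamiliar with Muon: the optimizer keeps a momentum buffer for each 2D weight matrix and orthogonalizes that buffer with a Newton–Schulz iteration before applying the update, while non-matrix parameters (embeddings, norms, the LM head) fall back to Adam, hence the separate adam_lr. The sketch below is illustrative only; the function names are hypothetical, the Newton–Schulz coefficients are the commonly cited ones, and this is not the code in train.py.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately map matrix g to the nearest semi-orthogonal matrix.

    Quintic Newton-Schulz iteration as used by Muon; the coefficients are
    the commonly cited ones and are illustrative here.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # normalize so the iteration converges
    transpose = x.shape[0] > x.shape[1]
    if transpose:                        # work with the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transpose else x

def muon_step(weight, grad, momentum_buf, lr=0.020, momentum=0.95):
    """One illustrative Muon update for a single 2D weight matrix.

    Non-matrix parameters would instead be updated with Adam at the
    separate adam_lr (0.004 in this PR).
    """
    momentum_buf = momentum * momentum_buf + grad        # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf)   # orthogonalized direction
    weight -= lr * update
    return weight, momentum_buf
```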
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| experiments/speedrun/llama_50m_muon_1x/train.py | Training script with Muon optimizer configuration and speedrun setup |
| experiments/speedrun/llama_50m_muon_1x/speedrun_results.json | Experiment results including 1.3989 BPB Paloma score and training metrics |
| experiments/speedrun/llama_50m_muon_1x/README.md | Documentation describing experiment rationale, hyperparameters, and expected results |
```python
from experiments.llama import llama_50m
from experiments.simple_train_config import SimpleTrainConfig
from marin.execution.executor import executor_main
from marin.resources import GpuConfig
```
Copilot AI (Dec 8, 2025):
The module `marin.resources` does not exist. `GpuConfig` should be imported from `fray.cluster` instead. Change to: `from fray.cluster import ResourceConfig`
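A minimal sketch of the change being suggested. Note that the comment names both GpuConfig and ResourceConfig, so the exact class to import from fray.cluster should be double-checked against the current API.

```python
# Import as written in this PR:
from marin.resources import GpuConfig

# Import as suggested by the review comment (exact class name, GpuConfig vs.
# ResourceConfig, should be confirmed against fray.cluster):
from fray.cluster import ResourceConfig
```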
| description="Phase 1: 50M Llama with Muon at 1× Chinchilla (1B tokens). Baseline validation.", | ||
| model_config=llama_50m, | ||
| train_config=SimpleTrainConfig( | ||
| GpuConfig(gpu_count=1, accelerator_type="H200"), |
Copilot AI (Dec 8, 2025):
`GpuConfig` constructor takes parameters `type` and `count`, not `gpu_count` and `accelerator_type`. Additionally, following the pattern in other speedrun experiments (e.g., `llama_50m_h200.py`), you should use `ResourceConfig.with_gpu("H200", count=1)` instead of directly instantiating `GpuConfig`.
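Applied to the snippet above, the suggestion would look roughly like the sketch below. `ResourceConfig.with_gpu("H200", count=1)` is quoted from the review comment and the llama_50m_h200.py pattern, not verified against the current fray API.

```python
from fray.cluster import ResourceConfig

train_config = SimpleTrainConfig(
    # Replaces GpuConfig(gpu_count=1, accelerator_type="H200") per the review comment.
    ResourceConfig.with_gpu("H200", count=1),
    # ... remaining SimpleTrainConfig arguments unchanged ...
)
```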
The name I used here is my GitHub username. Please use Zihao Yang in the leaderboard.
@dlwh Could you please review this? Thanks!
I'm gonna pawn this off on @Calvin-Xu or @Helw150 who have more context than me!
Overall looks good to me, compared to https://github.com/marin-community/marin/tree/main/experiments/speedrun/llama_50m_h200. I think Will's input will be good here, as he ran the optimizer scaling before we had the hackable transformer (would we want to switch to the hackable template here?)
I think this is good to merge as is, but we'll want @redagavin to update their author information to their preferred name/email first since we can't push to his fork and get lint passing (with
@Helw150 Thanks! Could you please tell me what I should do specifically? Do you want me to change the name in speedrun_results.json and train.py and then commit & push? Where do I put my email?
@Helw150 Just a reminder
Summary
Baseline validation of Muon optimizer on 50M parameter Llama model at 1× Chinchilla-optimal data scale.
Approach
Results
print_run_info() Output
Rationale
Tests whether the Muon optimizer beats Adam at the standard Chinchilla-optimal scale before investing in 4× data runs. The result of 1.3989 BPB validates proceeding with larger-scale experiments.
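For context, the 1× Chinchilla budget follows from the usual ~20-tokens-per-parameter rule of thumb; the arithmetic below is a back-of-the-envelope check, not something computed in the PR, and the implied per-step token count is an inference since the batch size is not stated here.

```python
# Back-of-the-envelope check of the "1x Chinchilla" budget (illustrative only).
params = 50e6                # 50M-parameter Llama
tokens_per_param = 20        # common Chinchilla-optimal rule of thumb
token_budget = params * tokens_per_param
print(token_budget)          # 1e9 -> the 1B-token budget in the PR title

steps = 7629                 # from the training configuration
print(token_budget / steps)  # ~131k tokens per step (batch size not stated in the PR)
```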
W&B Run
https://wandb.ai/marin-speedrun/marin-speedrun/runs/llama_50m_muon_1x-bd5fc4
Files
- `experiments/speedrun/llama_50m_muon_1x/train.py` - Training script
- `experiments/speedrun/llama_50m_muon_1x/speedrun_results.json` - Results
- `experiments/speedrun/llama_50m_muon_1x/README.md` - Approach description