Conversation

@redagavin

Summary

Baseline validation of the Muon optimizer on a 50M-parameter Llama model at 1× Chinchilla-optimal data scale.

Approach

  • Model: llama_50m (50.87M parameters, 4 layers, hidden_dim=192)
  • Optimizer: Muon with learning_rate=0.020, adam_lr=0.004, momentum=0.95
  • Training: 7,629 steps, ~1B tokens (1× Chinchilla: 20 tokens per parameter; step-count arithmetic sketched after this list)
  • Hardware: 1× H200 GPU
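
The step count follows directly from the token budget and batch shape. A quick sketch of that arithmetic, assuming the train_batch_size=128 and seq_len=1024 from the config below and a ~1B-token target; this is illustrative arithmetic, not marin code:

# Step-count check for this run (illustrative only).
params = 50.87e6                          # llama_50m parameter count
chinchilla_tokens = 20 * params           # ~1.02e9 tokens at 20 tokens/param

tokens_per_step = 128 * 1024              # batch_size * seq_len = 131,072
steps = 1_000_000_000 // tokens_per_step  # 7,629 steps
total_tokens = steps * tokens_per_step    # 999,948,288 tokens

print(steps, total_tokens)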

Results

Metric          Value
Paloma BPB      1.3989
Training Time   1,179 seconds (~19.65 min)
Total Tokens    999,948,288
Model FLOPs     1.67 × 10^17

print_run_info() Output

Full print_run_info() output:
2025-11-23 01:47:29,964	INFO speedrun.py:126 -- Speedrun Configuration:
2025-11-23 01:47:29,964	INFO speedrun.py:127 -- {
    "author": {
        "name": "redagavin",
        "affiliation": "Northeastern University",
        "url": "https://redagavin.github.io/"
    },
    "description": "Phase 1: 50M Llama with Muon at 1× Chinchilla (1B tokens). Baseline validation.",
    "model_config": {
        "cross_entropy_block_size": null,
        "seq_len": 1024,
        "hidden_dim": 192,
        "intermediate_dim": 448,
        "num_layers": 4,
        "num_heads": 2,
        "head_dim": null,
        "num_kv_heads": 2,
        "activation_function": "silu",
        "initializer_range": 0.02,
        "layer_norm_epsilon": 1e-05,
        "tie_word_embeddings": false,
        "hybrid_norm": false,
        "use_qk_norm": false,
        "input_embedding_norm": false,
        "upcast_attn": false,
        "attn_backend": null,
        "flash_attention_block_size": null,
        "gradient_checkpointing": true,
        "scan_layers": true,
        "use_bias": false,
        "use_layer_norm_weight": true,
        "rope": {
            "theta": 10000,
            "factor": 1.0
        },
        "reference_checkpoint": "NousResearch/Llama-2-7b-hf",
        "tokenizer": null
    },
    "train_config": {
        "train_batch_size": 128,
        "num_train_steps": 7629,
        "learning_rate": 0.02,
        "data_seed": null,
        "weight_decay": null,
        "beta1": null,
        "beta2": null,
        "epsilon": null,
        "max_grad_norm": null,
        "warmup": null,
        "decay": null,
        "rewarmup": null,
        "lr_schedule": null,
        "min_lr_ratio": null,
        "cycle_length": null,
        "z_loss_weight": null,
        "ema_beta": null,
        "skip_bad_steps": false,
        "steps_per_eval": 500,
        "steps_per_export": 10000,
        "steps_per_task_eval": null,
        "steps_per_hf_export": null,
        "per_device_eval_parallelism": null,
        "max_eval_batches": null,
        "initialize_from_checkpoint_path": null,
        "initialize_from_hf": null,
        "reset_data_loader_on_init": true,
        "allow_partial_checkpoint": false,
        "int8": false,
        "optimizer_config": {
            "learning_rate": 0.02,
            "weight_decay": 0.0,
            "min_lr_ratio": 0,
            "warmup": 0,
            "decay": 0.8,
            "rewarmup": 0.0,
            "cooldown": null,
            "cycle_length": null,
            "cycles": null,
            "lr_schedule": "linear",
            "haps": null,
            "weight_decay_modules": null,
            "default_weight_decay_mask": null,
            "lr": 0.02,
            "adam_lr": 0.004,
            "momentum": 0.95,
            "nesterov": true,
            "backend_steps": 5,
            "adam_weight_decay": null,
            "beta1": 0.8,
            "beta2": 0.98,
            "epsilon": 1e-15,
            "muon_epsilon": 1e-05,
            "max_grad_norm": 1,
            "use_kimi_scaling": false
        },
        "watch": {
            "watch_targets": [
                "grads",
                "params"
            ],
            "include_norms": true,
            "include_per_parameter_norms": true,
            "include_histograms": false,
            "split_scan_layers": true,
            "interval": 10
        }
    },
    "tokenized_dataset": "ExecutorStep(name='tokenized/subcache/fineweb-edu-10B', ...)",
    "resources": {
        "gpu_count": 1,
        "accelerator_type": "H200",
        "device_flops_override": null
    }
}
2025-11-23 01:47:29,964	INFO speedrun.py:163 -- The rough estimated compute (calculated as (total model FLOPs / Assumed MFU)) for your run is probably between:
      * 3.34e+17 FLOPs assuming an MFU of 0.5, and
      * 8.35e+17 FLOPs assuming an MFU of 0.2.

This is calculated based on assumed MFU values and can be used as a rough estimate to guide your config/training setup.
2025-11-23 01:47:29,964	INFO speedrun.py:131 -- Hardware and Model FLOPS Information:
2025-11-23 01:47:29,964	INFO speedrun.py:132 -- Number of devices: 1
2025-11-23 01:47:29,965	INFO speedrun.py:133 -- Number of chips: 1
2025-11-23 01:47:29,965	INFO speedrun.py:134 -- Device FLOPs: 9.90e+14 FLOP/s
2025-11-23 01:47:29,965	INFO speedrun.py:135 -- Total peak hardware FLOPs: 9.90e+14 FLOP/s
2025-11-23 01:47:29,965	INFO speedrun.py:136 -- Model FLOPs: 1.67e+17 FLOP
2025-11-23 01:47:29,965	INFO speedrun.py:140 -- Model size: 50.87 million parameters
2025-11-23 01:47:29,965	INFO speedrun.py:338 -- Running speedrun llama_50m_muon_1x
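
The compute-estimate range in the log reduces to a one-line calculation. A minimal sketch reproducing it from the reported model FLOPs, following the formula stated in the log (total model FLOPs / assumed MFU); illustrative, not the speedrun.py implementation:

# Reproduce the rough compute estimate from the log above.
model_flops = 1.67e17                     # reported model FLOPs for this run
for mfu in (0.5, 0.2):
    print(f"MFU {mfu}: ~{model_flops / mfu:.2e} FLOPs")
# -> ~3.34e+17 at MFU 0.5 and ~8.35e+17 at MFU 0.2, matching the log.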

Rationale

Tests whether the Muon optimizer beats Adam at the standard Chinchilla-optimal scale before investing in 4× data runs. The result of 1.3989 BPB validates proceeding with larger-scale experiments.
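
For context on what is being compared: Muon keeps a momentum buffer for each 2-D weight matrix and approximately orthogonalizes it with a few Newton-Schulz iterations before applying the update (the momentum=0.95 and backend_steps=5 values above), and setups like this one typically handle non-matrix parameters with Adam, hence the separate adam_lr. Below is a minimal NumPy sketch of the matrix update using the quintic coefficients from the public reference Muon implementation; it is an illustration under those assumptions, not marin's optimizer code, and it omits the Adam fallback, weight decay, and update scaling.

# Minimal NumPy sketch of one Muon step on a single 2-D weight matrix.
# Illustrative only; not marin's Muon implementation.
import numpy as np

def newton_schulz(G, steps=5, eps=1e-5):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.02, momentum=0.95, nesterov=True, ns_steps=5):
    buf[:] = momentum * buf + grad                   # momentum accumulation
    g = grad + momentum * buf if nesterov else buf   # Nesterov lookahead
    param -= lr * newton_schulz(g, steps=ns_steps)   # orthogonalized update
    return param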

W&B Run

https://wandb.ai/marin-speedrun/marin-speedrun/runs/llama_50m_muon_1x-bd5fc4

Files

  • experiments/speedrun/llama_50m_muon_1x/train.py - Training script
  • experiments/speedrun/llama_50m_muon_1x/speedrun_results.json - Results
  • experiments/speedrun/llama_50m_muon_1x/README.md - Approach description

Copilot AI review requested due to automatic review settings December 8, 2025 00:07
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a baseline experiment validating the Muon optimizer on a 50M parameter Llama model at 1× Chinchilla-optimal data scale (1B tokens). The experiment serves as a preliminary test before investing in larger 4× data runs, achieving 1.3989 BPB on the Paloma benchmark.

Key changes:

  • New training script configuring Muon optimizer with specific hyperparameters (lr=0.020, adam_lr=0.004, momentum=0.95)
  • Training configuration for 7,629 steps on H200 GPU (~20 minutes runtime)
  • Results JSON documenting metrics and configuration for reproducibility

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • experiments/speedrun/llama_50m_muon_1x/train.py - Training script with Muon optimizer configuration and speedrun setup
  • experiments/speedrun/llama_50m_muon_1x/speedrun_results.json - Experiment results including the 1.3989 BPB Paloma score and training metrics
  • experiments/speedrun/llama_50m_muon_1x/README.md - Documentation describing experiment rationale, hyperparameters, and expected results

from experiments.llama import llama_50m
from experiments.simple_train_config import SimpleTrainConfig
from marin.execution.executor import executor_main
from marin.resources import GpuConfig

Copilot AI Dec 8, 2025


The module marin.resources does not exist. GpuConfig should be imported from fray.cluster instead. Change to: from fray.cluster import ResourceConfig

description="Phase 1: 50M Llama with Muon at 1× Chinchilla (1B tokens). Baseline validation.",
model_config=llama_50m,
train_config=SimpleTrainConfig(
GpuConfig(gpu_count=1, accelerator_type="H200"),

Copilot AI Dec 8, 2025


GpuConfig constructor takes parameters type and count, not gpu_count and accelerator_type. Additionally, following the pattern in other speedrun experiments (e.g., llama_50m_h200.py), you should use ResourceConfig.with_gpu("H200", count=1) instead of directly instantiating GpuConfig.

redagavin changed the title from "speedrun: Add llama_50m_muon_1x - Muon optimizer at 1× Chinchilla scale" to "speedrun submission: Add llama_50m_muon_1x - Muon optimizer at 1× Chinchilla scale" Dec 8, 2025
@redagavin
Author

The name I used here is my GitHub username. Please use Zihao Yang on the leaderboard.

@redagavin
Author

@dlwh Could you please review this? Thanks!

@dlwh
Member

dlwh commented Dec 11, 2025

I'm gonna pawn this off on @Calvin-Xu or @Helw150 who have more context than me!

dlwh requested review from Calvin-Xu and Helw150 December 11, 2025 00:52
@Calvin-Xu
Member

Overall looks good to me, compared to https://github.com/marin-community/marin/tree/main/experiments/speedrun/llama_50m_h200.

I think Will's input will be good here, as he ran the optimizer scaling before we had the hackable transformer (would we want to switch to the hackable template here?)

@Helw150
Member

Helw150 commented Dec 27, 2025

I think this is good to merge as is, but we'll want @redagavin to update their author information to their preferred name/email first, since we can't push to their fork and get lint passing (with uv run infra/pre-commit.py).

@redagavin
Author

@Helw150 Thanks! Could you please tell me what I should do specifically? Do you want me to change the name in speedrun_results.json and train.py and then commit & push? Where do I put my email?

@redagavin
Author

@Helw150 Just a reminder
