Conversation

@redagavin

Summary

Baseline validation of the Muon optimizer on a 50M-parameter Llama model at 1× Chinchilla-optimal data scale.

Approach

  • Model: llama_50m (50.87M parameters, 4 layers, hidden_dim=192)
  • Optimizer: Muon with learning_rate=0.020, adam_lr=0.004, momentum=0.95
  • Training: 7,629 steps, ~1B tokens (1× Chinchilla: 20 tokens per parameter; step-count arithmetic sketched after this list)
  • Hardware: 1× H200 GPU
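
The step count follows directly from the token budget and batch shape. A quick sketch of that arithmetic, assuming the train_batch_size=128 and seq_len=1024 from the config below and a ~1B-token target; this is illustrative arithmetic, not marin code:

# Step-count check for this run (illustrative only).
params = 50.87e6                          # llama_50m parameter count
chinchilla_tokens = 20 * params           # ~1.02e9 tokens at 20 tokens/param

tokens_per_step = 128 * 1024              # batch_size * seq_len = 131,072
steps = 1_000_000_000 // tokens_per_step  # 7,629 steps
total_tokens = steps * tokens_per_step    # 999,948,288 tokens

print(steps, total_tokens)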

Results

Metric          Value
Paloma BPB      1.3989
Training Time   1,179 seconds (~19.65 min)
Total Tokens    999,948,288
Model FLOPs     1.67 × 10^17

print_run_info() Output

Full print_run_info() output:
2025-11-23 01:47:29,964	INFO speedrun.py:126 -- Speedrun Configuration:
2025-11-23 01:47:29,964	INFO speedrun.py:127 -- {
    "author": {
        "name": "redagavin",
        "affiliation": "Northeastern University",
        "url": "https://redagavin.github.io/"
    },
    "description": "Phase 1: 50M Llama with Muon at 1× Chinchilla (1B tokens). Baseline validation.",
    "model_config": {
        "cross_entropy_block_size": null,
        "seq_len": 1024,
        "hidden_dim": 192,
        "intermediate_dim": 448,
        "num_layers": 4,
        "num_heads": 2,
        "head_dim": null,
        "num_kv_heads": 2,
        "activation_function": "silu",
        "initializer_range": 0.02,
        "layer_norm_epsilon": 1e-05,
        "tie_word_embeddings": false,
        "hybrid_norm": false,
        "use_qk_norm": false,
        "input_embedding_norm": false,
        "upcast_attn": false,
        "attn_backend": null,
        "flash_attention_block_size": null,
        "gradient_checkpointing": true,
        "scan_layers": true,
        "use_bias": false,
        "use_layer_norm_weight": true,
        "rope": {
            "theta": 10000,
            "factor": 1.0
        },
        "reference_checkpoint": "NousResearch/Llama-2-7b-hf",
        "tokenizer": null
    },
    "train_config": {
        "train_batch_size": 128,
        "num_train_steps": 7629,
        "learning_rate": 0.02,
        "data_seed": null,
        "weight_decay": null,
        "beta1": null,
        "beta2": null,
        "epsilon": null,
        "max_grad_norm": null,
        "warmup": null,
        "decay": null,
        "rewarmup": null,
        "lr_schedule": null,
        "min_lr_ratio": null,
        "cycle_length": null,
        "z_loss_weight": null,
        "ema_beta": null,
        "skip_bad_steps": false,
        "steps_per_eval": 500,
        "steps_per_export": 10000,
        "steps_per_task_eval": null,
        "steps_per_hf_export": null,
        "per_device_eval_parallelism": null,
        "max_eval_batches": null,
        "initialize_from_checkpoint_path": null,
        "initialize_from_hf": null,
        "reset_data_loader_on_init": true,
        "allow_partial_checkpoint": false,
        "int8": false,
        "optimizer_config": {
            "learning_rate": 0.02,
            "weight_decay": 0.0,
            "min_lr_ratio": 0,
            "warmup": 0,
            "decay": 0.8,
            "rewarmup": 0.0,
            "cooldown": null,
            "cycle_length": null,
            "cycles": null,
            "lr_schedule": "linear",
            "haps": null,
            "weight_decay_modules": null,
            "default_weight_decay_mask": null,
            "lr": 0.02,
            "adam_lr": 0.004,
            "momentum": 0.95,
            "nesterov": true,
            "backend_steps": 5,
            "adam_weight_decay": null,
            "beta1": 0.8,
            "beta2": 0.98,
            "epsilon": 1e-15,
            "muon_epsilon": 1e-05,
            "max_grad_norm": 1,
            "use_kimi_scaling": false
        },
        "watch": {
            "watch_targets": [
                "grads",
                "params"
            ],
            "include_norms": true,
            "include_per_parameter_norms": true,
            "include_histograms": false,
            "split_scan_layers": true,
            "interval": 10
        }
    },
    "tokenized_dataset": "ExecutorStep(name='tokenized/subcache/fineweb-edu-10B', ...)",
    "resources": {
        "gpu_count": 1,
        "accelerator_type": "H200",
        "device_flops_override": null
    }
}
2025-11-23 01:47:29,964	INFO speedrun.py:163 -- The rough estimated compute (calculated as (total model FLOPs / Assumed MFU)) for your run is probably between:
      * 3.34e+17 FLOPs assuming an MFU of 0.5, and
      * 8.35e+17 FLOPs assuming an MFU of 0.2.

This is calculated based on assumed MFU values and can be used as a rough estimate to guide your config/training setup.
2025-11-23 01:47:29,964	INFO speedrun.py:131 -- Hardware and Model FLOPS Information:
2025-11-23 01:47:29,964	INFO speedrun.py:132 -- Number of devices: 1
2025-11-23 01:47:29,965	INFO speedrun.py:133 -- Number of chips: 1
2025-11-23 01:47:29,965	INFO speedrun.py:134 -- Device FLOPs: 9.90e+14 FLOP/s
2025-11-23 01:47:29,965	INFO speedrun.py:135 -- Total peak hardware FLOPs: 9.90e+14 FLOP/s
2025-11-23 01:47:29,965	INFO speedrun.py:136 -- Model FLOPs: 1.67e+17 FLOP
2025-11-23 01:47:29,965	INFO speedrun.py:140 -- Model size: 50.87 million parameters
2025-11-23 01:47:29,965	INFO speedrun.py:338 -- Running speedrun llama_50m_muon_1x
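
The compute-estimate range in the log reduces to a one-line calculation. A minimal sketch reproducing it from the reported model FLOPs, following the formula stated in the log (total model FLOPs / assumed MFU); illustrative, not the speedrun.py implementation:

# Reproduce the rough compute estimate from the log above.
model_flops = 1.67e17                     # reported model FLOPs for this run
for mfu in (0.5, 0.2):
    print(f"MFU {mfu}: ~{model_flops / mfu:.2e} FLOPs")
# -> ~3.34e+17 at MFU 0.5 and ~8.35e+17 at MFU 0.2, matching the log.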

Rationale

Tests whether the Muon optimizer beats Adam at the standard Chinchilla-optimal scale before investing in 4× data runs. The result of 1.3989 BPB validates proceeding with larger-scale experiments.
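
For context on what is being compared: Muon keeps a momentum buffer for each 2-D weight matrix and approximately orthogonalizes it with a few Newton-Schulz iterations before applying the update (the momentum=0.95 and backend_steps=5 values above), and setups like this one typically handle non-matrix parameters with Adam, hence the separate adam_lr. Below is a minimal NumPy sketch of the matrix update using the quintic coefficients from the public reference Muon implementation; it is an illustration under those assumptions, not marin's optimizer code, and it omits the Adam fallback, weight decay, and update scaling.

# Minimal NumPy sketch of one Muon step on a single 2-D weight matrix.
# Illustrative only; not marin's Muon implementation.
import numpy as np

def newton_schulz(G, steps=5, eps=1e-5):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.02, momentum=0.95, nesterov=True, ns_steps=5):
    buf[:] = momentum * buf + grad                   # momentum accumulation
    g = grad + momentum * buf if nesterov else buf   # Nesterov lookahead
    param -= lr * newton_schulz(g, steps=ns_steps)   # orthogonalized update
    return param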

W&B Run

https://wandb.ai/marin-speedrun/marin-speedrun/runs/llama_50m_muon_1x-bd5fc4

Files

  • experiments/speedrun/llama_50m_muon_1x/train.py - Training script
  • experiments/speedrun/llama_50m_muon_1x/speedrun_results.json - Results
  • experiments/speedrun/llama_50m_muon_1x/README.md - Approach description

Copilot AI review requested due to automatic review settings December 8, 2025 00:07
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a baseline experiment validating the Muon optimizer on a 50M parameter Llama model at 1× Chinchilla-optimal data scale (1B tokens). The experiment serves as a preliminary test before investing in larger 4× data runs, achieving 1.3989 BPB on the Paloma benchmark.

Key changes:

  • New training script configuring Muon optimizer with specific hyperparameters (lr=0.020, adam_lr=0.004, momentum=0.95)
  • Training configuration for 7,629 steps on H200 GPU (~20 minutes runtime)
  • Results JSON documenting metrics and configuration for reproducibility

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • experiments/speedrun/llama_50m_muon_1x/train.py - Training script with Muon optimizer configuration and speedrun setup
  • experiments/speedrun/llama_50m_muon_1x/speedrun_results.json - Experiment results including the 1.3989 BPB Paloma score and training metrics
  • experiments/speedrun/llama_50m_muon_1x/README.md - Documentation describing experiment rationale, hyperparameters, and expected results

from experiments.llama import llama_50m
from experiments.simple_train_config import SimpleTrainConfig
from marin.execution.executor import executor_main
from marin.resources import GpuConfig

Copilot AI Dec 8, 2025


The module marin.resources does not exist. GpuConfig should be imported from fray.cluster instead. Change to: from fray.cluster import ResourceConfig

description="Phase 1: 50M Llama with Muon at 1× Chinchilla (1B tokens). Baseline validation.",
model_config=llama_50m,
train_config=SimpleTrainConfig(
GpuConfig(gpu_count=1, accelerator_type="H200"),

Copilot AI Dec 8, 2025


GpuConfig constructor takes parameters type and count, not gpu_count and accelerator_type. Additionally, following the pattern in other speedrun experiments (e.g., llama_50m_h200.py), you should use ResourceConfig.with_gpu("H200", count=1) instead of directly instantiating GpuConfig.

redagavin changed the title from "speedrun: Add llama_50m_muon_1x - Muon optimizer at 1× Chinchilla scale" to "speedrun submission: Add llama_50m_muon_1x - Muon optimizer at 1× Chinchilla scale" Dec 8, 2025
@redagavin
Author

The name I used here is my GitHub username. Please use Zihao Yang on the leaderboard.

@redagavin
Author

@dlwh Could you please review this? Thanks!

@dlwh
Member

dlwh commented Dec 11, 2025

I'm gonna pawn this off on @Calvin-Xu or @Helw150 who have more context than me!

dlwh requested review from Calvin-Xu and Helw150 December 11, 2025 00:52
@Calvin-Xu
Member

Overall looks good to me, compared to https://github.com/marin-community/marin/tree/main/experiments/speedrun/llama_50m_h200.

I think Will's input will be good here, as he ran the optimizer scaling before we had the hackable transformer (would we want to switch to the hackable template here?)

@Helw150
Member

Helw150 commented Dec 27, 2025

I think this is good to merge as is, but we'll want @redagavin to update their author information to their preferred name/email first, since we can't push to their fork and get lint passing (with uv run infra/pre-commit.py).

@redagavin
Author

@Helw150 Thanks! Could you please tell me what I should do specifically? Do you want me to change the name in speedrun_results.json and train.py and then commit & push? Where do I put my email?

@redagavin
Author

@Helw150 Just a reminder
