Grug #2171 (Open)

Commits (85)
0f23fb2
grugpt!
dlwh Nov 10, 2025
0c530fd
attention in its own file
dlwh Nov 10, 2025
c397bce
project description
dlwh Nov 10, 2025
72d3a8a
tweak
dlwh Nov 10, 2025
629217a
basic data loading works
dlwh Nov 10, 2025
2222843
wip
dlwh Nov 11, 2025
6963ba4
Merge remote-tracking branch 'origin/main' into grug
dlwh Dec 11, 2025
797e42e
move grug into levanter so we can use it there
dlwh Dec 11, 2025
c3b0950
grug
dlwh Dec 11, 2025
7947ecf
grug wrapper
dlwh Dec 11, 2025
478dfba
structured attentionmask
dlwh Dec 12, 2025
796771a
comment
dlwh Dec 12, 2025
078d1b8
update principles
dlwh Dec 12, 2025
6551180
wip attention
dlwh Dec 17, 2025
d7889f8
wip
dlwh Dec 29, 2025
2557d42
Merge remote-tracking branch 'origin/main' into grug
dlwh Dec 30, 2025
8ec3b14
isolate grug, use ejkernel
dlwh Dec 30, 2025
291344c
more grug-y
dlwh Dec 30, 2025
7870cf4
make AttentionBackend simpler
dlwh Dec 30, 2025
6ac56b5
new grug_wrapper
dlwh Dec 30, 2025
b712cee
update status
dlwh Dec 30, 2025
1c1a387
simpler, update plan
dlwh Dec 31, 2025
7b4af70
fix tests
dlwh Dec 31, 2025
f824fee
use our axis conventions
dlwh Jan 2, 2026
abcd2d8
grugformer
dlwh Jan 2, 2026
9bc2778
grugformer tests, updated plan
dlwh Jan 3, 2026
a05b32e
fix mesh axis name
dlwh Jan 3, 2026
b7d1dfb
expose activations rather than logits
dlwh Jan 3, 2026
221b279
update for "tensor"
dlwh Jan 3, 2026
470df8e
update for "tensor"
dlwh Jan 3, 2026
9bc4c42
fused next token loss (grug-style)
dlwh Jan 3, 2026
8df219b
grugformer speedrun template
dlwh Jan 3, 2026
4c5bf31
need lm_head
dlwh Jan 3, 2026
11e7d37
Merge remote-tracking branch 'origin/main' into grugformer-speedrun
dlwh Jan 3, 2026
783e8e1
allow levanter to make explicit meshes (needed for grug)
dlwh Jan 5, 2026
36c83bb
update haliax slicing to work with explicit axes (needed for grug)
dlwh Jan 5, 2026
2c12105
fix deserialization of grug
dlwh Jan 6, 2026
091cf6c
allow levanter to make explicit meshes (needed for grug)
dlwh Jan 5, 2026
ba54b61
update haliax slicing to work with explicit axes (needed for grug)
dlwh Jan 5, 2026
be21d34
fix deserialization of grug
dlwh Jan 6, 2026
ba86bcc
haliax.take: only set out_sharding for explicit meshes
dlwh Jan 6, 2026
cae9a01
haliax.partitioning: add get_pspec_for_manual_mesh
dlwh Jan 6, 2026
ca5e168
tensorstore: use haliax mesh helper for concretizing shardings
dlwh Jan 6, 2026
8bf1ba9
Merge branch 'pr/explicit-mesh-and-take-outsharding' into grugformer-…
dlwh Jan 6, 2026
b914a47
Merge branch 'pr/tensorstore-concrete-mesh' into grugformer-speedrun
dlwh Jan 6, 2026
4639f58
grug: replicate vocab weights on TPU to avoid sflag OOM
dlwh Jan 6, 2026
04c9fc6
Merge remote-tracking branch 'origin/main' into grugformer-speedrun
dlwh Jan 7, 2026
a2826ba
Merge remote-tracking branch 'origin/main' into grug
dlwh Jan 7, 2026
89d659f
Merge branch 'grug' into grugformer-speedrun
dlwh Jan 7, 2026
ec9d055
wip
dlwh Jan 7, 2026
2470da6
wip
dlwh Jan 8, 2026
c002aaf
push compute_next_token_loss into LmHeadModel
dlwh Jan 8, 2026
c75cecb
loss: move compute_next_token_loss into LmHeadModel
dlwh Jan 8, 2026
a33eef6
examples: remove alpaca and gsm8k lora examples
dlwh Jan 8, 2026
ce5930b
wip
dlwh Jan 8, 2026
0a55d8d
minor
dlwh Jan 8, 2026
0725a59
Merge branch 'loss_in_model' into grugformer-speedrun
dlwh Jan 8, 2026
8358f3c
grug: add blockwise loss_fn and use it in wrapper
dlwh Jan 8, 2026
628bc8e
grug loss: keep sharding stable when padding vocab blocks
dlwh Jan 8, 2026
fdd6cb5
grug loss: specify gather out_sharding for label columns
dlwh Jan 8, 2026
c0b099d
grug: stop forcing output_proj sharding; keep vocab pspec
dlwh Jan 8, 2026
c1b08ac
grug loss: pass PartitionSpec to gather out_sharding
dlwh Jan 8, 2026
1c1f1f9
grug loss: default gather out_sharding from mesh when missing
dlwh Jan 8, 2026
27d5a7e
grug loss: clamp TPU vocab block size to avoid HBM OOM
dlwh Jan 8, 2026
49f48a4
grug: use tokamax linear softmax CE on TPU; add dependency
dlwh Jan 8, 2026
aed9877
grug: parse absl flags before calling tokamax
dlwh Jan 8, 2026
5f0658d
grug: note TODO for vendored CE kernel
dlwh Jan 8, 2026
77504e0
grug: handle tokamax mosaic batch-size multiple by splitting
dlwh Jan 9, 2026
4bc1c4a
hackable tweaks
dlwh Jan 9, 2026
1dd2918
Merge remote-tracking branch 'origin/main' into grugformer-speedrun
dlwh Jan 9, 2026
ac9b095
revert change
dlwh Jan 9, 2026
2280d0a
almost ready!
dlwh Jan 10, 2026
8df8892
Merge remote-tracking branch 'origin/main' into grugformer-speedrun
dlwh Jan 10, 2026
36b0fef
move tests, cleanup
dlwh Jan 10, 2026
6d16e75
a bit of cleanup
dlwh Jan 10, 2026
5b2f594
jaxtyping
dlwh Jan 10, 2026
4ac926f
cleanup
dlwh Jan 10, 2026
b321a33
Merge branch 'main' into grug
pc0618 Jan 10, 2026
0ee618f
Fix grug TPU splash sharding for tracers
pc0618 Jan 11, 2026
b9756b3
Refactor Grug init key splitting
pc0618 Jan 12, 2026
80a509d
Merge remote-tracking branch 'origin/main' into grug
dlwh Jan 12, 2026
df2464a
Merge remote-tracking branch 'origin/grug' into grug
dlwh Jan 12, 2026
ffaa4c6
fix tests?
dlwh Jan 12, 2026
f459801
fix attention perf
dlwh Jan 20, 2026
e85f680
woo
dlwh Jan 21, 2026
Files changed
221 changes: 221 additions & 0 deletions .agents/projects/grugformer.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion .github/workflows/levanter-tests.yaml
@@ -191,4 +191,4 @@ jobs:
-v /tmp/uv-cache:/tmp/uv-cache:rw \
-w /workspace \
$DOCKER_IMAGE \
bash -c "cp -a /workspace-src/. /workspace/ && cd /workspace && timeout --kill-after=5 --signal=TERM 890 uv run --package levanter --frozen --with 'jax[tpu]==$JAX_VERSION' pytest lib/levanter/tests -m 'not entry and not ray and not slow' -v --tb=short --log-cli-level=WARNING --durations=20"
bash -c "cp -a /workspace-src/. /workspace/ && cd /workspace && timeout --kill-after=5 --signal=TERM 890 uv run --package levanter --frozen --with 'jax[tpu]==$JAX_VERSION' pytest lib/levanter/tests -m 'not entry and not ray and not slow' -v --tb=short --log-cli-level=WARNING --durations=20"
1 change: 1 addition & 0 deletions AGENTS.md
@@ -14,6 +14,7 @@
- Begin with the agent-friendly recipes in `docs/recipes/`.
- The first step for dataset addition is schema inspection. See the [add_dataset.md](docs/recipes/add_dataset.md) recipe for details.
- You can help organize experiments using the [organize_experiments.md](docs/recipes/organize_experiments.md) recipe.
+- When making significant changes to Grug/Grugformer, follow [change_grug.md](docs/recipes/change_grug.md).
- Follow the rules and examples in each recipe to ensure compatibility and automation-friendliness.

## Shared Coding Practices
112 changes: 112 additions & 0 deletions docs/recipes/change_grug.md
@@ -0,0 +1,112 @@
# Recipe: Changing Grug (Experiment → Canonical)

Grug is meant to be “grug-simple” and easy to hack, but we still want a single, trustworthy “best guess” implementation in `levanter.grug`.

This recipe describes the workflow for:

1) trying changes safely in a speedrun experiment, and
2) upstreaming successful ideas into the canonical core (and cleaning up old experiments).

## Source Of Truth vs Experiments

- **Source of truth:** `lib/levanter/src/levanter/grug/`
- This is the “best guess” model. It should stay small, readable, and testable.
- **Evolving experiment:** `experiments/speedrun/grugformer_starter/grugformer_speedrun.py`
- This is the *living* entrypoint that is expected to evolve as we learn.
- **One-off experiments:** under `experiments/speedrun/…`
- These are snapshots / specialized edit surfaces (e.g. attention sinks).

We try not to let one-off scripts become the canonical implementation.

## When You Want To Try Something

### 1) Decide what you’re changing

Most changes fall into one of these buckets:

- **Attention** (masking semantics, kernels, sinks/aux, layout/sharding)
- **Block** (residual wiring, normalization order, pre/post-norm)
- **MLP** (activation, GLU variants, gating, dimension choices)
- **Loss** (large-vocab CE, z-loss, label smoothing, logit soft-cap)
- **Optimizer** (Adam, Muon, etc.)

Try to change **one bucket at a time**. The optimizer isn't currently addressed by Grug, but we'll get there.
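
As background for the **Loss** bucket: "large-vocab CE" means computing cross-entropy without ever materializing the full `[tokens, vocab]` logit array (the commit history adds a blockwise `loss_fn` for exactly this). A rough sketch of the idea, with hypothetical names and shapes:

```python
import jax.numpy as jnp


def blockwise_ce(acts, lm_head, labels, block=1024):
    # acts: [n, hidden]; lm_head: [hidden, vocab]; labels: [n].
    # Streams over vocab blocks with a running log-sum-exp, so only
    # [n, <=block] logits exist at any one time.
    n, vocab = acts.shape[0], lm_head.shape[1]
    m = jnp.full((n,), -jnp.inf)   # running max of logits
    s = jnp.zeros((n,))            # running sum of exp(logit - m)
    label_logit = jnp.zeros((n,))
    for start in range(0, vocab, block):
        logits = acts @ lm_head[:, start:start + block]
        width = logits.shape[1]
        new_m = jnp.maximum(m, logits.max(axis=1))
        s = s * jnp.exp(m - new_m) + jnp.exp(logits - new_m[:, None]).sum(axis=1)
        m = new_m
        in_block = (labels >= start) & (labels < start + width)
        idx = jnp.clip(labels - start, 0, width - 1)
        picked = jnp.take_along_axis(logits, idx[:, None], axis=1)[:, 0]
        label_logit = jnp.where(in_block, picked, label_logit)
    # CE = logsumexp(logits) - logit[label], averaged over tokens.
    return jnp.mean(m + jnp.log(s) - label_logit)
```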

### 2) Create an experiment entrypoint

Start from:

- `experiments/speedrun/grugformer_starter/grugformer_speedrun.py`

Recommended workflow:

1. Copy the file to a new experiment (or branch the starter if the change is small):
- Example: `experiments/speedrun/grugformer_<idea>/grugformer_<idea>.py`
2. Keep the edit surface explicit:
- If you’re changing attention, keep the change in one local `attention()` or `attn_fn` block (see the sketch after this list).
- If you’re changing the MLP, keep it local to an `mlp()` helper.
3. Avoid introducing new abstractions (this is a speedrun file; copy/paste is fine).
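
A hypothetical example of such a contained edit surface (plain JAX; `attn_fn` and its shapes are illustrative, not the actual speedrun API):

```python
import jax
import jax.numpy as jnp


def attn_fn(q, k, v, mask):
    # q, k, v: [batch, seq, heads, head_dim]; mask: [seq, seq] boolean.
    # This is the whole edit surface: swap this body to try a variant
    # (sinks, soft-capping, different scaling) without touching the rest.
    scores = jnp.einsum("bqhd,bkhd->bhqk", q, k) / jnp.sqrt(q.shape[-1])
    scores = jnp.where(mask, scores, -jnp.inf)
    return jnp.einsum("bhqk,bkhd->bqhd", jax.nn.softmax(scores, axis=-1), v)
```

Everything else in the file keeps calling `attn_fn`, so the diff against the starter stays roughly one function long.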

### 3) Register the experiment in the archive

Add an entry to:

- `docs/reports/grug-archive.md`

Record:
- the experiment path,
- the commit SHA (once known),
- what you changed and why,
- the intended “status” (`active`, `superseded`, `deleted`).

## When You Want To Adopt Something As Canonical

### 1) Port to `levanter.grug`

Move the change into one of the core files:

- `lib/levanter/src/levanter/grug/attention.py`
- `lib/levanter/src/levanter/grug/model.py`
- `lib/levanter/src/levanter/grug/loss.py`

Keep the “grug” style:
- top-level functions,
- small dataclasses only for parameter/state containers,
- explicit sharding when needed (and loud failures otherwise).
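
A hypothetical illustration of that style (not the actual `levanter.grug` code): a top-level function, a tiny frozen dataclass for parameters, and a loud sharding check instead of a silent reshard.

```python
from dataclasses import dataclass

import jax
import jax.numpy as jnp


@dataclass(frozen=True)
class MlpParams:
    w_in: jax.Array   # [hidden, mlp_dim]
    w_out: jax.Array  # [mlp_dim, hidden]


def mlp(params: MlpParams, x: jax.Array) -> jax.Array:
    # Top-level function, no module classes. (In the real code the params
    # container would also be registered as a pytree so it works under jit.)
    h = jax.nn.gelu(x @ params.w_in)
    return h @ params.w_out


def require_sharding(x: jax.Array, expected) -> None:
    # "Loud failures": refuse to continue if an array arrives with a
    # sharding we did not plan for, rather than silently resharding it.
    if x.sharding != expected:
        raise ValueError(f"unexpected sharding: {x.sharding} != {expected}")
```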

### 2) Update/extend tests

Add or adjust tests to lock the intended surface:

- `lib/levanter/tests/test_grugformer_core.py`
- `lib/levanter/tests/test_grugformer_model_loss.py`
- `lib/levanter/tests/test_grugformer_fused_loss.py`

The goal is:
- shapes don’t regress,
- `jit` still works,
- sharding doesn’t explode,
- mask semantics remain correct.
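
A minimal sketch of what such a test can look like (the attention helper and tolerances are illustrative assumptions, not copied from the test files):

```python
import jax
import jax.numpy as jnp


def _attn(q, k, v, mask):
    scores = jnp.einsum("bqhd,bkhd->bhqk", q, k) / jnp.sqrt(q.shape[-1])
    scores = jnp.where(mask, scores, -jnp.inf)
    return jnp.einsum("bhqk,bkhd->bqhd", jax.nn.softmax(scores, axis=-1), v)


def test_causal_mask_and_shapes():
    B, T, H, D = 2, 8, 4, 16
    q, k, v = jax.random.normal(jax.random.PRNGKey(0), (3, B, T, H, D))
    causal = jnp.tril(jnp.ones((T, T), dtype=bool))

    out = jax.jit(_attn)(q, k, v, causal)  # `jit` still works
    assert out.shape == (B, T, H, D)       # shapes don't regress

    # Mask semantics: position 0 attends only to itself, so its output
    # must equal v at position 0 regardless of later tokens.
    assert jnp.allclose(out[:, 0], v[:, 0], atol=1e-5)
```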

### 3) Clean up old experiments

After merging a canonical improvement:

- If an experiment is now redundant and not referenced, **delete it** and mark it `deleted` in `docs/reports/grug-archive.md`.
- If an experiment represents a meaningful historical run, keep it but mark it `superseded` and point to the canonical change (or the new experiment). Do this only if keeping it won't become a maintenance burden.

Prefer “archive entry + deletion” over keeping lots of old code in-tree.

### 4) Run repo checks

Before sending the PR:

```sh
uv run python infra/pre-commit.py --all-files
```

## Notes / Inspiration

This workflow is inspired by projects like `modded-nanogpt`: keep a small, readable core, iterate quickly via “hackable” entrypoints, and regularly upstream what works.

61 changes: 61 additions & 0 deletions docs/reports/grug-archive.md
@@ -0,0 +1,61 @@
# Grug Archive: Experiments and Snapshots

This file is a lightweight “paper trail” for Grug-related experiments, inspired by the idea of keeping a runnable history without letting a pile of one-off scripts become the de facto source of truth.

## Principles

- **`levanter.grug` is the source of truth.** Speedrun files are snapshots/entrypoints, not the canonical implementation.
- **Every experiment should be attributable to a commit.** If an experiment is removed or superseded, it should be clear what replaced it and why.
- **Prefer deletion over permanent snapshots.** If a script is dead, delete it and record the last known-good commit here.
- **Keep diffs small.** When an experiment is kept “alive”, update it to track the current core rather than forking the entire model.

## When Grug Core Changes

When a change in `levanter.grug` is likely to affect results, performance, or semantics:

1. Update the experiment(s) that should track “best guess”.
2. For experiments that no longer make sense:
- delete them, or
- mark them superseded and point to the replacement.
3. Update the corresponding entry in this archive (and any linked issue).

## Entry Template

Copy/paste this block for new experiments:

```text
### <experiment-id>
- Path: `<repo-relative-path>`
- Introduced: <commit-sha>
- Last known-good: <commit-sha>
- Status: active | superseded | deleted
- Purpose: <one line>
- Notes: <optional; what changed, how to reproduce, caveats>
- Superseded by: <experiment-id or commit-sha; optional>
- Issue: <url or issue id; optional>
```

## Experiments

### grugformer-attnsink
- Path: `experiments/speedrun/grugformer_attnsink/grugformer_attn_sink.py`
- Introduced: TBD
- Last known-good: TBD
- Status: active
- Purpose: “Hackable” Grug attention-sink variant; intended edit surface for sinks/aux.
- Notes: Keep this file short; copy/paste local modifications rather than growing new abstractions.

### grugformer-starter-speedrun
- Path: `experiments/speedrun/grugformer_starter/grugformer_speedrun.py`
- Introduced: TBD
- Last known-good: TBD
- Status: active
- Purpose: Minimal starter speedrun for Grug; convenient baseline for quick iteration.

### grugformer-vs-hackable-125m
- Path: `experiments/speedrun/grugformer_vs_hackable_125m/grugformer_vs_hackable_125m.py`
- Introduced: TBD
- Last known-good: TBD
- Status: active
- Purpose: Head-to-head comparison between Hackable Transformer and Grugformer (no sinks).

2 changes: 1 addition & 1 deletion experiments/simple_train_config.py
@@ -92,7 +92,7 @@ class SimpleTrainConfig:
"""Whether to run the JAX profiler during training."""
profiler_start_step: int = 5
"""Which step to start profiling."""
-profiler_num_steps: int = 100
+profiler_num_steps: int = 25
"""How many steps to profile for once started."""

explicit_mesh_axes: bool = False