Conversation

@dlwh (Member) commented Dec 5, 2025

This PR introduces Grugformer: a “grug-simple” JAX LM implementation that leans into explicit sharding and top-level functions rather than heavy abstractions. It adds a minimal core (levanter.grug) plus a small adapter (levanter.models.grug_wrapper) so it can run through the existing Levanter trainer pipeline, and it includes speedrun entrypoints + tests that lock down the intended “grug core surface”.

What’s Included

New Grug core (minimal, notebook-like)

  • New package: lib/levanter/src/levanter/grug/
    • attention.py: Grug-local AttentionMask spec + attention implementation (Splash attention on TPU; reference fallback elsewhere).
    • model.py: parameter dataclasses + init/forward/activations/loss functions.
    • loss.py: blockwise, large-vocab-friendly CE path that avoids materializing the full logits matrix (see the sketch after this list, and the note below on tradeoffs).
    • data.py, main.py: minimal training/data wiring to run in-repo.
  • Exported surface is intentionally small (functions + dataclasses; minimal mutation).
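
As a rough illustration of the blockwise CE idea: the sketch below is ours, not the actual loss.py code - the function name, argument layout, and the vocab-divisibility assumption are all for illustration only.

```python
import jax
import jax.numpy as jnp


def blockwise_cross_entropy(hidden, out_proj, labels, block_size=8192):
    """Mean CE over a large vocab without materializing [n_tokens, vocab] logits.

    hidden:   [n_tokens, d_model] final activations
    out_proj: [d_model, vocab] output projection
    labels:   [n_tokens] int32 targets
    Assumes vocab % block_size == 0 for simplicity.
    """
    n_tokens = hidden.shape[0]
    n_blocks = out_proj.shape[1] // block_size

    def scan_block(carry, i):
        running_max, sum_exp, label_logit = carry
        start = i * block_size
        block = jax.lax.dynamic_slice_in_dim(out_proj, start, block_size, axis=1)
        logits = hidden @ block  # [n_tokens, block_size]
        # Online logsumexp update, flash-attention style.
        new_max = jnp.maximum(running_max, logits.max(axis=-1))
        sum_exp = sum_exp * jnp.exp(running_max - new_max) + jnp.exp(
            logits - new_max[:, None]
        ).sum(axis=-1)
        # Grab the target logit if the label lands in this block.
        in_block = (labels >= start) & (labels < start + block_size)
        local = jnp.clip(labels - start, 0, block_size - 1)
        picked = jnp.take_along_axis(logits, local[:, None], axis=-1)[:, 0]
        label_logit = jnp.where(in_block, picked, label_logit)
        return (new_max, sum_exp, label_logit), None

    init = (
        jnp.full((n_tokens,), -jnp.inf),
        jnp.zeros((n_tokens,)),
        jnp.zeros((n_tokens,)),
    )
    (running_max, sum_exp, label_logit), _ = jax.lax.scan(
        scan_block, init, jnp.arange(n_blocks)
    )
    # log-softmax CE: logsumexp(logits) - logit[label]
    return jnp.mean(jnp.log(sum_exp) + running_max - label_logit)
```

Scanning over vocab blocks keeps peak memory at [n_tokens, block_size] instead of [n_tokens, vocab], at the cost of re-reading `hidden` once per block - which is exactly the speed/memory tradeoff noted below.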

Levanter adapter

  • lib/levanter/src/levanter/models/grug_wrapper.py: wraps grug core behind Levanter’s LmConfig/trainer expectations while keeping the core itself free of NamedArray-heavy abstractions.

Speedruns / templates

  • experiments/speedrun/grugformer_starter/grugformer_speedrun.py: a grug speedrun template for quick iteration.
  • experiments/speedrun/grugformer_attnsink/grugformer_attn_sink.py: “hackable” grug attention-sink variant (copy/paste edit surface).
  • experiments/speedrun/grugformer_vs_hackable_125m/grugformer_vs_hackable_125m.py: head-to-head comparison (Hackable Transformer vs Grugformer, no sinks). Hackable path runs without explicit mesh axes for now.

Tests (lock the “grug core surface”)

  • All Grug tests live under lib/levanter/tests/grug/:
    • test_grugformer_core.py: core API + mesh/sharding sanity.
    • test_grugformer_model_loss.py: loss correctness vs full logits on small shapes; wrapper plumbing.
    • test_grugformer_fused_loss.py: loss-related regression coverage.
    • test_grugformer_compilation.py: lowers/jit-traces model+loss under AbstractMesh (no concrete devices required).
    • test_grugformer.py: higher-level smoke coverage (tiny synthetic step).

Documentation

  • .agents/projects/grugformer.md: principles, intended edit surface, and follow-ups.
  • docs/recipes/change_grug.md: workflow for proposing changes (speedrun edit surface → adopt into canonical grug → archive old experiments).
  • docs/reports/grug-archive.md: lightweight “experiment archive log” placeholder so we have somewhere to record removals/renames as grug evolves.

Notable Design Choices / Current Constraints

  • Attention: the TPU path uses Splash attention directly; the GPU path uses the reference fallback for now (sketched below).
  • Loss: large-vocab CE is more painful than we’d like under explicit sharding; we currently use a blockwise, “flash-attention style” transform. The block-size knob is intentionally exposed; we’ve observed meaningful perf sensitivity and will likely revisit this with a better kernel later.
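
For reference, a minimal sketch of the attention dispatch described above. Names and signatures are illustrative, the Splash branch is elided (the real attention.py wires in the Pallas Splash kernel there), and the reference path assumes equal query and KV head counts:

```python
import jax
import jax.numpy as jnp


def reference_attention(q, k, v, *, is_causal: bool = True):
    # q, k, v: [batch, seq, heads, head_dim]; assumes num_q_heads == num_kv_heads.
    scale = q.shape[-1] ** -0.5
    scores = jnp.einsum("bqhd,bkhd->bhqk", q, k) * scale
    if is_causal:
        q_len, k_len = scores.shape[-2], scores.shape[-1]
        causal = jnp.tril(jnp.ones((q_len, k_len), dtype=bool))
        scores = jnp.where(causal, scores, -jnp.inf)
    return jnp.einsum("bhqk,bkhd->bqhd", jax.nn.softmax(scores, axis=-1), v)


def attention(q, k, v, *, is_causal: bool = True):
    if jax.default_backend() == "tpu":
        # The real attention.py calls the Pallas Splash kernel here; the
        # kernel plumbing is elided from this sketch.
        raise NotImplementedError("Splash path elided in sketch")
    return reference_attention(q, k, v, is_causal=is_causal)
```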

How To Try

  • Run the h2h speedrun:
    • python -m experiments.speedrun.grugformer_vs_hackable_125m.grugformer_vs_hackable_125m
    • Set SR_USE_TPU=1 to use the TPU preset.
  • Run tests:
    • uv run pytest lib/levanter/tests/grug -q

Follow-ups

  • Implement a faster large-vocab CE path that’s robust under explicit sharding (avoids the current speed/memory tradeoff).
  • Expand the speedrun “gauntlet” checks and add more minimal “edit points” for experiments.

@github-actions (Contributor) commented:

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

github-actions bot added the stale label Dec 29, 2025
@dlwh (Member, Author) commented Dec 29, 2025

bump

github-actions bot removed the stale label Dec 30, 2025
dlwh marked this pull request as ready for review January 10, 2026 07:12
Copilot AI review requested due to automatic review settings January 10, 2026 07:12
Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces Grugformer: a "grug-simple" JAX LM implementation emphasizing explicit sharding and top-level functions over heavy abstractions. The implementation provides a minimal core in levanter.grug, an adapter for integration with Levanter's trainer (levanter.models.grug_wrapper), speedrun entrypoints for experimentation, and comprehensive tests.

Changes:

  • Adds new levanter.grug package with model, attention, loss, data, and config modules
  • Provides GrugWrapper adapter to integrate with Levanter's LmHeadModel interface
  • Includes three speedrun experiments: starter template, attention sink variant, and head-to-head comparison with Hackable Transformer
  • Adds comprehensive test suite locking down the core API surface
  • Documents design principles, change workflow, and experiment archiving strategy

Reviewed changes

Copilot reviewed 24 out of 26 changed files in this pull request and generated 3 comments.

Summary per file:

| File | Description |
| --- | --- |
| uv.lock | Adds einops dependency and updates various package versions |
| lib/levanter/pyproject.toml | Adds einops to dependencies and a TPU test marker |
| lib/levanter/src/levanter/grug/*.py | Core grug implementation: model, attention, loss, data, config |
| lib/levanter/src/levanter/models/grug_wrapper.py | Adapter bridging grug core to Levanter's LmHeadModel API |
| lib/levanter/src/levanter/models/loss.py | Updates to use auto_sharded and zeros_like for better sharding |
| lib/levanter/tests/grug/*.py | Comprehensive test suite for grug core functionality |
| experiments/speedrun/grugformer_*/*.py | Three speedrun experiments showcasing grug usage |
| docs/*.md | Documentation for principles, workflow, and experiment archiving |


import equinox as eqx
import jax
import haliax as hax
Copilot AI commented Jan 10, 2026:

Module 'haliax' is imported with both 'import' and 'import from'.

# nodryrun

import dataclasses
Copilot AI commented Jan 10, 2026:

Module 'dataclasses' is imported with both 'import' and 'import from'.
Comment on lines +4 to +5
import dataclasses

Copilot AI commented Jan 10, 2026:

Module 'dataclasses' is imported with both 'import' and 'import from'.

Suggested change: drop the duplicate `import dataclasses` line.
@pc0618 (Contributor) commented Jan 11, 2026:

Pushed a fix for TPU Splash attention crashing during init on tracers (falling back to x.aval.sharding when x.sharding is unavailable) + removed an unsupported arg from the Grugformer speedrun wrapper. Commit: 0ee618f.

@pc0618 (Contributor) commented Jan 11, 2026:

Follow-up (previous comment had shell quoting issues): the fix uses x.sharding when available and falls back to x.aval.sharding for tracers during staging; it also stops passing tie_embeddings into GrugModelConfig (the flag is kept only for param counting). Commit: 0ee618f.
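
A minimal sketch of the described fallback; the helper name and exact control flow are illustrative, not the committed code:

```python
def _array_sharding(x):
    # Concrete jax.Arrays expose `.sharding` directly; tracers during jit
    # staging may not, but carry sharding on their abstract value (`.aval`)
    # under explicit sharding.
    sharding = getattr(x, "sharding", None)
    return sharding if sharding is not None else x.aval.sharding
```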

@pc0618 (Contributor) commented Jan 12, 2026:

Added an inline note + refactor in levanter/grug/model.py:init_parameters to use hierarchical key splitting instead of the (3 + 7 * num_layers) “magic number” math (more robust to future parameter additions). Commit: b9756b3. Also left an in-code TODO to add a brief explanation in the PR discussion later.
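
For flavor, the shape of that refactor might look like the following; parameter shapes and field names are placeholders, not the actual model.py code:

```python
import jax


def init_parameters(num_layers: int, d_model: int, vocab: int, key):
    # Split once at the top level, then derive a per-layer key via fold_in.
    # No need to precompute how many keys the whole model needs up front
    # (the old `3 + 7 * num_layers` arithmetic).
    embed_key, blocks_key, head_key = jax.random.split(key, 3)

    def init_block(block_key):
        # Each block splits its own key further for its sub-parameters, so
        # adding a parameter to a block never shifts other layers' keys.
        wq_key, wo_key = jax.random.split(block_key, 2)
        return {
            "wq": jax.random.normal(wq_key, (d_model, d_model)),
            "wo": jax.random.normal(wo_key, (d_model, d_model)),
        }

    return {
        "token_embed": jax.random.normal(embed_key, (vocab, d_model)),
        "blocks": [
            init_block(jax.random.fold_in(blocks_key, layer))
            for layer in range(num_layers)
        ],
        "output_proj": jax.random.normal(head_key, (d_model, vocab)),
    }
```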

@ravwojdyla (Contributor) left a comment:

This is awesome! I may be a little aggressive with the comments to delete "unused" logic/options or reduce the number of files - this is mostly in the spirit of karpathy-ish code 🙇 There are a couple of logic questions in here. Some nits as well, like the __all__, which I dislike.¹

Footnotes

  ¹ I prefer protected-ish `_`, but if marin has a policy on __all__ I'm happy to adjust.


### CLI Entrypoint

- `src/marin/grugpt/train.py` implements `def main(argv=None): ...` using `argparse` (no click). Steps:
ravwojdyla:

Outdated path? Probably lib/levanter/src/levanter/grug/main.py?

global_batch_size: int = 8


__all__ = ["GrugModelConfig", "GrugTrainingConfig"]
ravwojdyla:

Do we need the __all__?

Levanter integration adapters live under `levanter.models`.
"""

from .attention import apply_rotary_embedding, attention
ravwojdyla:

For the sake of simplicity - do we need explicit __all__ export? Could this be just a plain init file?



@dataclass(frozen=True)
class GrugTrainingConfig:
ravwojdyla:

To reduce the number of files - could this live in main.py (the only place, apart from tests, where it's used right now)?

if global_batch_size is not None and global_batch_size > 0:
while data_size > 1 and global_batch_size % data_size != 0:
data_size -= 1

ravwojdyla:

Should we log/print warning here if final data_size is less than len(devices)?
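
One way the suggested warning might look, as a patch-style fragment against the quoted loop (assuming data_size starts at len(jax.devices()); illustrative only):

```python
import logging

import jax

logger = logging.getLogger(__name__)

data_size = len(jax.devices())
if global_batch_size is not None and global_batch_size > 0:
    while data_size > 1 and global_batch_size % data_size != 0:
        data_size -= 1
    if data_size < len(jax.devices()):
        # Surface silent shrinkage of the data-parallel axis.
        logger.warning(
            "global_batch_size=%d is not divisible by %d devices; "
            "shrinking data-parallel axis to %d",
            global_batch_size, len(jax.devices()), data_size,
        )
```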

testpaths = ["tests", "experiments"]

# Make sure we timeout before CI kills us, and don't run TPU or slow tests by default
addopts = "--session-timeout=480 -m 'not tpu_ci and not slow'"
ravwojdyla:

Is this intentional?

runtime_dict = {
"working_dir": current_dir,
"config": {"setup_timeout_seconds": 1800},
"excludes": [".git", "tests/", "docs/", "**/*.pack", "lib/levanter/docs"],
ravwojdyla:

❓ what is the purpose of this change?

token_embed=token_embed,
output_proj=output_proj,
blocks=tuple(blocks),
final_norm=jnp.ones_like(final_norm),
ravwojdyla:

❓ isn't it already ones?

num_kv_heads: int = 16
head_dim: int | None = None
max_seq_len: int = 4096
dropout_rate: float = 0.0
ravwojdyla:

❓ where is this used?

Comment on lines +205 to +206
elif isinstance(mask, AttentionMask) and not mask.is_causal:
mask = dataclasses.replace(mask, is_causal=True)
ravwojdyla:

❓ what's the intention behind these 2 lines?

@ravwojdyla (Contributor) commented:

FYI when I run the starter speedrun (130M only) in us-central1 on TPU (v5p-8), I get OOM:

Total hbm usage >= 101.99G:
    reserved        263.00M
    program         101.73G
    arguments            0B

I can work around this but I wonder if that was supposed to work?


def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Run the Grug trainer.")
parser.add_argument("--cache-dir", type=str, default=None, help="Optional TreeCache directory for real data.")
ravwojdyla:

Could we make it simpler to run main.py in isolation, without depending on TreeCache or synthetic data? I.e. point it at a directory (object-store compatible) dump of some canonical dataset, e.g. OpenWebText, FineWeb, or even TinyStories?

@ravwojdyla mentioned this pull request Jan 16, 2026
@ravwojdyla (Contributor) left a comment:

Some more comments from experiments

num_kv_heads=self.num_kv_heads,
head_dim=self.head_dim,
max_seq_len=self.max_seq_len,
tie_embeddings=self.tie_embeddings,
ravwojdyla:

This tie_embeddings flag doesn't exist in GrugModelConfig (anymore?). I assume it can be completely removed from this experiment?

Comment on lines +111 to +113
# Grug core currently always has separate token embed + output projection; keep this knob
# for param counting / compatibility with other LmConfig-based scripts.
tie_embeddings: bool = False
ravwojdyla:

Wouldn't it be more intuitive to not expose this config and instead hard code logic in total_trainable_params? Otherwise it may seem like this flag does something, when it doesn't?
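
A sketch of that suggestion; the formula is purely illustrative (the real per-layer count depends on grug's actual block shapes), and the config field names are assumptions:

```python
def total_trainable_params(config) -> int:
    # Grug always keeps separate embedding and output matrices, so count
    # both unconditionally instead of branching on a tie_embeddings flag.
    d, v = config.hidden_dim, config.vocab_size
    embed_and_unembed = 2 * d * v
    per_layer = 12 * d * d  # placeholder for attention + MLP weights
    return embed_and_unembed + config.num_layers * per_layer + d  # + final norm
```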

num_kv_heads=self.num_kv_heads,
head_dim=self.head_dim,
max_seq_len=self.max_seq_len,
tie_embeddings=self.tie_embeddings,
ravwojdyla:

tie_embeddings doesn't exist in GrugModelConfig (see other comment)

labels = jnp.concatenate([token_ids[:, 1:], token_ids[:, :1] * 0], axis=1).astype(jnp.int32)
loss_weight = loss_weight.astype(loss_dtype)

# NOTE: `block_size=None` corresponds to a single full-vocab block. On the 125M speedrun,
ravwojdyla:

#2315 (comment) ptal, I can't reproduce this 🙏

ravwojdyla:

Btw, setting cross_entropy_block_size to, say, ~32k on v5p-8 OOMs in the 125M experiment.
