4 changes: 3 additions & 1 deletion .gitignore
@@ -8,4 +8,6 @@ data/manifest.json
data/docs_selected.jsonl
.mypy_cache/
.venv
logs/
logs/
plans/
.runpod_state/
193 changes: 193 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,193 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Parameter Golf is OpenAI's Model Craft Challenge: train the best language model that fits in a **16MB artifact** (code + compressed weights) in under **10 minutes on 8×H100s**, optimized for bits-per-byte (BPB) on FineWeb validation.

## Commands

### Training (multi-GPU)
```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Training (single GPU)
```bash
python train_gpt.py
```

### Download data
```bash
python data/cached_challenge_fineweb.py
```

All model hyperparameters are configured via environment variables (see `Hyperparameters` dataclass in train_gpt.py). Key ones:
- `DATA_PATH`, `TOKENIZER_PATH` — dataset/tokenizer locations
- `VOCAB_SIZE`, `NUM_LAYERS`, `MODEL_DIM`, `NUM_HEADS`, `NUM_KV_HEADS`, `MLP_MULT` — architecture
- `ITERATIONS`, `MAX_WALLCLOCK_SECONDS`, `TRAIN_BATCH_TOKENS`, `TRAIN_SEQ_LEN` — training budget
- `MATRIX_LR`, `SCALAR_LR`, `EMBED_LR`, `TIED_EMBED_LR`, `HEAD_LR` — per-group learning rates
- `TTT_ENABLED`, `TTT_OPTIMIZER` (adamw/muon/sgd), `TTT_EPOCHS`, `TTT_LR`, `TTT_COSINE` — test-time training
- `LEAKY_SLOPE` (0.0=ReLU², 0.5=LeakyReLU(0.5)²), `GPTQ_ENABLED` — activation & quantization
- `EMA_ENABLED`, `SWA_ENABLED`, `LATE_QAT`, `VALUE_RESIDUAL`, `GATED_ATTENTION`, `XSA_LAST_N`, `LN_SCALE`

There is no build system, test suite, or linter. The project is a single training script.
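The env-var pattern above is plain `os.environ` lookups feeding a dataclass. A minimal sketch (the field names are real knobs listed above, but the defaults and the class body are illustrative, not the actual `Hyperparameters` definition in train_gpt.py):

```python
import os
from dataclasses import dataclass, fields

@dataclass
class HyperparametersSketch:
    # Illustrative subset: the real Hyperparameters dataclass in
    # train_gpt.py defines many more fields and its own defaults.
    NUM_LAYERS: int = 10
    MODEL_DIM: int = 512
    MATRIX_LR: float = 0.01

    @classmethod
    def from_env(cls):
        # Each field can be overridden by an env var of the same name,
        # cast to the field's annotated type (int, float, ...).
        overrides = {
            f.name: f.type(os.environ[f.name])
            for f in fields(cls)
            if f.name in os.environ
        }
        return cls(**overrides)
```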

## Architecture

### train_gpt.py (~1487 lines, single-file constraint)

The entire model, training loop, data loading, evaluation, and serialization live in one file. The challenge rules require all code in `train_gpt.py` (hard limit: 1500 lines).

**Model (GPT class):** Transformer with RMSNorm, RoPE, Grouped Query Attention (GQA), ReLU²/LeakyReLU(0.5)² MLP (`LEAKY_SLOPE`), tied embeddings, logit softcapping, and skip connections between layers.

**Optimizer:** Muon (Newton-Schulz orthogonalization) for 2D matrix parameters; Adam for embeddings and scalar/control parameters. Separate learning rate groups for embeddings, matrices, scalars, and optional untied head.

**Data pipeline:** Binary shards (256-int header + uint16 tokens) → `TokenStream` → `DistributedTokenLoader` → sequential streaming batches. No random sampling.
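Reading one shard of that layout is a two-step `frombuffer`; a sketch (the header is described only as "256-int", so int32 is an assumption to check against the real writer):

```python
import numpy as np

def read_shard(path):
    """Read one binary shard: a 256-int header (int32 assumed here)
    followed by the payload as uint16 token ids."""
    with open(path, "rb") as f:
        header = np.frombuffer(f.read(256 * 4), dtype=np.int32)
        tokens = np.frombuffer(f.read(), dtype=np.uint16)
    return header, tokens
```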

**Evaluation:** Tokenizer-agnostic BPB metric computed via SentencePiece byte-accounting lookup tables, handling token boundaries and leading spaces correctly.
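The metric itself is simple once the byte accounting is done: summed negative log-likelihood in nats, converted to bits, divided by the number of bytes scored. In code:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """BPB: total NLL (nats) converted to bits, per byte of raw text.
    Tokenizer-agnostic because the denominator counts bytes, not tokens."""
    return total_nll_nats / (math.log(2) * total_bytes)
```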

**Serialization:** Mixed int5 (MLP) / int6 (attention) quantization with GPTQ-lite per-row clip search, FP16 passthrough for embeddings + control tensors, zstd-22 compression. 3% magnitude pruning before quantization. Final artifact must be ≤16,000,000 bytes.
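The per-row clip search can be sketched as a small grid search over clip fractions, keeping the clip that minimizes round-trip MSE (a simplified stand-in for the GPTQ-lite logic; the grid and bit width here are illustrative):

```python
import numpy as np

def best_clip(row, bits=5, grid=np.linspace(0.5, 1.0, 11)):
    """Per-row clip search: try clip = frac * max|row| for a few
    fractions and keep the one with the lowest quantization MSE."""
    qmax = 2 ** (bits - 1) - 1          # int5 -> [-15, 15]
    best_mse, best_c = np.inf, None
    for frac in grid:
        clip = frac * np.abs(row).max()
        scale = clip / qmax
        q = np.clip(np.round(row / scale), -qmax, qmax)
        mse = np.mean((q * scale - row) ** 2)
        if mse < best_mse:
            best_mse, best_c = mse, clip
    return best_c
```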

### train_gpt_mlx.py

MLX port for Apple Silicon development. Same architecture, different backend.

## Challenge Rules (key constraints)

- Artifact = `len(open("train_gpt.py").read().encode()) + len(compressed_model_bytes)` ≤ 16MB
- **Two separate 10-minute limits:**
- Training: ≤10 min wallclock on 8×H100s (`MAX_WALLCLOCK_SECONDS=600`)
- Evaluation (TTT + sliding window): ≤10 min additional, not counted against the training limit
- Total allowed: up to 20 min (10 train + 10 eval)
- Cannot access validation data during training (test-time training on already-evaluated tokens is allowed)
- TTT must be "score-first": evaluate tokens before training on them
- New SOTA requires ≥0.005 nats BPB improvement with p < 0.01 statistical significance
- Default config: 1024 vocab (SentencePiece BPE), 10 layers, 512 dim, 8 heads, 4 KV heads
- Current best: 1.1492 BPB (10L, VR+GA+XSA4+SWA+LateQAT, 15.3MB artifact)
- SOTA on GitHub (verified, rule-compliant): ~1.067 BPB (PR #462: SwiGLU + AdamW TTT 10ep)
- SOTA on GitHub (unverified/borderline): ~0.978 BPB (PR #517: 100ep Cosine TTT, violates eval time limit)
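The artifact-size rule in the first bullet can be checked in a few lines. The repo uses zstd level 22; stdlib `lzma` stands in here only so the sketch needs no third-party package:

```python
import lzma

LIMIT = 16_000_000

def artifact_bytes(code_path: str, weight_blob: bytes) -> int:
    """Artifact size = raw code bytes + compressed weight bytes.
    (The real pipeline compresses with zstd-22, not lzma.)"""
    with open(code_path, "rb") as f:
        code = f.read()
    return len(code) + len(lzma.compress(weight_blob, preset=9))
```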

## Records

Submissions live in `records/track_10min_16mb/` with each containing a `train_gpt.py`, `submission.json` (val_bpb, bytes_total, author), `train.log`, and `README.md` describing techniques used.

## RunPod

Authenticate `runpodctl` once with `runpodctl config --apiKey "$RUNPOD_API_KEY"`. SSH key: `/home/work/.ssh/id_ed25519`.

### Create H100 pod (parameter-golf template)
```bash
PUB_KEY=$(cat /home/work/.ssh/id_ed25519.pub)
runpodctl pod create \
--template-id y5cejece4j \
--gpu-id "NVIDIA H100 80GB HBM3" \
--gpu-count 1 \
--name "param-golf" \
--volume-in-gb 50 --container-disk-in-gb 50 \
--ports "8888/http,22/tcp" --ssh \
--env "{\"JUPYTER_PASSWORD\":\"parameter-golf\",\"PUBLIC_KEY\":\"$PUB_KEY\"}"
```

### SSH into pod
```bash
ssh -i /home/work/.ssh/id_ed25519 root@<IP> -p <PORT>
```

### List / stop / delete pods
```bash
runpodctl pod list
runpodctl pod stop <POD_ID>
runpodctl pod delete <POD_ID>
```

### Create spot (interruptible) H100 — $1.75/hr vs $2.69 on-demand
```bash
PUB_KEY=$(cat /home/work/.ssh/id_ed25519.pub)
curl -s -X POST https://api.runpod.io/graphql \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $RUNPOD_API_KEY" \
-d "{\"query\": \"mutation { podRentInterruptable(input: { name: \\\"param-golf-spot\\\", templateId: \\\"y5cejece4j\\\", gpuTypeId: \\\"NVIDIA H100 80GB HBM3\\\", gpuCount: 1, volumeInGb: 50, containerDiskInGb: 50, cloudType: SECURE, startSsh: true, ports: \\\"8888/http,22/tcp\\\", bidPerGpu: 1.75, env: [{key: \\\"JUPYTER_PASSWORD\\\", value: \\\"parameter-golf\\\"}, {key: \\\"PUBLIC_KEY\\\", value: \\\"$PUB_KEY\\\"}] }) { id costPerHr desiredStatus machine { gpuDisplayName location } } }\"}"
```

### Key info
- Template ID: `y5cejece4j` (runpod/parameter-golf:latest)
- H100 SXM GPU ID: `NVIDIA H100 80GB HBM3` (on-demand ~$2.69/hr, spot ~$1.75/hr)
- Image has Python 3.12, PyTorch 2.9.1, all deps pre-installed
- Data download: `python3 data/cached_challenge_fineweb.py --variant sp1024` (run on pod)
- Template doesn't auto-clone — run `git clone https://github.com/openai/parameter-golf.git` on pod
- Need `pip install --break-system-packages zstandard` on the pod

### Deployment script (`run_on_runpod.sh`)
```bash
./run_on_runpod.sh # Create spot pod, setup, train
./run_on_runpod.sh --status # Pod status + SSH command
./run_on_runpod.sh --logs # Tail training logs
./run_on_runpod.sh --results # Show key metrics
./run_on_runpod.sh --save-log <tag> # Save full log
./run_on_runpod.sh --upload # Upload train_gpt.py to pod
./run_on_runpod.sh --rerun # Re-launch training (upload code + restart)
./run_on_runpod.sh --prep-data [N] # Download N shards locally (once)
./run_on_runpod.sh --upload-data # Upload local data to pod
./run_on_runpod.sh --stop # Stop pod
./run_on_runpod.sh --delete # Delete pod
```

### Training env vars (inline)
Pass `KEY=VALUE` args directly — forwarded to training process:
```bash
./run_on_runpod.sh EMA_ENABLED=1 SWA_ENABLED=0
./run_on_runpod.sh --rerun TTT_ENABLED=1 TTT_OPTIMIZER=adamw TTT_EPOCHS=10
./run_on_runpod.sh --rerun NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=10240
```

### GPU config
```bash
GPU_COUNT=8 BID_PRICE=1.75 ./run_on_runpod.sh # 8xH100 spot ($14/hr)
GPU_COUNT=1 BID_PRICE=1.75 ./run_on_runpod.sh # 1xH100 spot ($1.75/hr)
GPU_ID="NVIDIA RTX PRO 4500 Blackwell" BID_PRICE=0.27 ./run_on_runpod.sh # cheap size test
```

### Local data (separate from repo)
Data lives at `$LOCAL_DATA_ROOT` (default: `~/dev/personal/parameter-golf-data/`).
```bash
./run_on_runpod.sh --prep-data 1 # Download 1 shard locally (quick iteration)
./run_on_runpod.sh --prep-data 80 # Download all 80 shards (full training)
```
When local data exists, `./run_on_runpod.sh` auto-detects and rsync's it to the pod instead of downloading from HuggingFace. Override path: `LOCAL_DATA_ROOT=/path/to/data ./run_on_runpod.sh`

### Fast experiment workflow (~30s between runs)
```bash
./run_on_runpod.sh --prep-data 1 # Once: download data locally
GPU_COUNT=1 ./run_on_runpod.sh # Create pod (auto-uploads local data)
./run_on_runpod.sh --save-log "baseline" # Save results
./run_on_runpod.sh --rerun EMA_ENABLED=1 # New experiment (uploads code, restarts)
./run_on_runpod.sh --save-log "ema" # Save results
./run_on_runpod.sh --delete # Clean up
```

### Logging
Save every training run's log after completion:
```bash
./run_on_runpod.sh --save-log "11L_VR1_GA1_prune3pct"
```
This saves to `logs/<timestamp>_<tag>.log` and `logs/<timestamp>_<tag>.summary` with key metrics extracted.

### Cost-saving tips
- **Always delete pods after saving logs/results** — `--save-log <tag>` then `--delete`
- **Use `--rerun` to iterate** — skips pod creation + data download, ~30s turnaround
- **Pre-download data locally** — `--prep-data 1` once, auto-uploaded to every pod
- **Test artifact size on cheap GPUs** — RTX PRO 4500 spot ($0.27/hr) before H100. Needs smaller batch:
`GPU_ID="NVIDIA RTX PRO 4500 Blackwell" BID_PRICE=0.27 ./run_on_runpod.sh TRAIN_BATCH_TOKENS=131072 TRAIN_SEQ_LEN=1024 EVAL_STRIDE=0 EMA_ENABLED=0`
- **Use `EVAL_STRIDE=0`** to skip sliding window eval on single GPU
- **Use `EMA_ENABLED=0`** on single GPU — EMA kills throughput (~32% slower)
- **Always `--stop` or `--delete` pods when done** — spot 8xH100 is $14/hr
- **Spot instances get preempted** — always use `nohup` and check pod status
- **TTT needs H100** — OOMs on 32GB GPUs. Only enable on H100+
- **TTT on single GPU is very slow** — use 8xH100 for TTT experiments
- **TTT has separate 10-min eval budget** — not counted in training time. ~20 epochs safe (~380s TTT + ~200s eval)
- **TTT adapts all params by default** — Muon for 2D + AdamW for 1D (when `TTT_OPTIMIZER=muon`)
- **TTT cosine LR enabled by default** (`TTT_COSINE=1`) — prevents overfitting at high epoch counts
- **Check pod status every 60s during experiments** — spot pods get preempted, don't waste money on dead pods
- **Save logs after EVERY experiment** before starting the next one — logs are lost when pod dies
1 change: 1 addition & 0 deletions notebooks/step1.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions notebooks/step1_5.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions notebooks/step2.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions notebooks/step3.ipynb

Large diffs are not rendered by default.

195 changes: 195 additions & 0 deletions notebooks/step3_1.ipynb

Large diffs are not rendered by default.

@@ -0,0 +1,96 @@
# Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT

**val_bpb = 1.0810** (3-seed mean, std 0.0002) | **~15.99 MB** | 8xH100 SXM

## 3-Seed Results

| Seed | Sliding BPB | **TTT BPB** | Artifact |
|------|-------------|-------------|----------|
| 42 | 1.0829 | **1.0808** | 15,991,930 |
| 314 | 1.0827 | **1.0810** | 15,992,919 |
| 999 | — | **1.0812** | 15,992,919 |
| **Mean** | | **1.0810** | |

Merged SOTA (PR #1019): **1.1147 BPB**. Delta: **-0.0337 BPB**. Clears the 0.005-nat threshold.

## Key Techniques

1. **SP8192 + GPTQ SDClip** — int6 matrices (k=12.85), int8 embeddings (k=20.0), zero selective pruning (PR #1394 @clarkkev)
2. **3-Layer Depth Recurrence** (layers 3,4,5, activate at frac=0.35) — 17 virtual layers from 11 physical (PR #1331 @dexhunter, PR #1437 @dexhunter)
3. **Parallel Residuals** (layers 7+) — GPT-J style, attention and MLP read from same input (PR #1412 @Robby955, PR #1204 @msisovic)
4. **QK-Gain 5.25** — learnable per-head query scaling, monotonic improvement from 4.0 to 5.25
5. **Legal Score-First TTT** — SGD (lr=0.005, momentum=0.9), 3 epochs per 32K-token chunk, cosine LR decay. Score-before-update ordering. (PR #549 @abaybektursun, PR #1413 @dexhunter)
6. **Tuned Hyperparameters** — WD=0.095, MLR=0.022, EMA=0.9965, warmdown=0.72 (PR #1445 @X-Abhishek-X)
7. **LZMA code wrapper** — ~16.6KB code, saves ~43KB vs uncompressed

## Architecture

11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9,10] (loops layers 3-5, activated at step ~2016). Parallel residuals from layer 7: attention and MLP operate on same pre-residual input. Skip gates (sigmoid-gated U-Net connections).
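The encoder/decoder layer schedule amounts to indexing a list of 11 physical blocks; the forward pass can be sketched as:

```python
# The 11 physical blocks are reused along a fixed schedule, yielding
# 17 virtual layer applications (layers 3-5 loop, per the text above).
ENCODER = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER = [5, 3, 4, 5, 6, 7, 8, 9, 10]

def recurrent_forward(x, blocks):
    """Apply physical blocks in schedule order; blocks 3-5 run 3x each."""
    for i in ENCODER + DECODER:
        x = blocks[i](x)
    return x
```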

## Training

MuonEq-R optimizer (row-normalized Muon, Newton-Schulz 5 steps), AdamW for embeddings/scalars. 4550 steps in 588s on 8xH100 SXM. Linear warmdown to LR=0 over final 72% of training. EMA decay 0.9965.
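The Newton-Schulz orthogonalization at the heart of Muon can be sketched with the classic cubic iteration (Muon proper uses a tuned quintic polynomial and only ~5 steps; this version simply runs longer to converge):

```python
import numpy as np

def newton_schulz_orth(G, steps=30):
    """Cubic Newton-Schulz iteration driving all singular values of G
    toward 1, i.e. projecting the gradient onto (near-)orthogonality."""
    X = G / np.linalg.norm(G)          # Frobenius norm: all sv <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```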

## Quantization

Full-Hessian GPTQ with SDClip: `clip = k * std(row)` gives a principled rate-distortion trade-off. int6 for attention/MLP matrices, int8 for token embeddings; byte-shuffle + Brotli-11 compression. No selective pruning is needed: the model fits natively under 16MB.
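A minimal sketch of the SDClip scheme, assuming the clip formula stated above and symmetric per-row scales (the full GPTQ Hessian machinery is omitted):

```python
import numpy as np

def quantize_sdclip(W, bits=6, k=12.85):
    """Per-row symmetric quantization with clip = k * std(row),
    as described above. Values beyond the clip saturate."""
    qmax = 2 ** (bits - 1) - 1                  # int6 -> [-31, 31]
    clip = k * W.std(axis=1, keepdims=True)
    scale = clip / qmax
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```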

## TTT (Test-Time Training)

Score-first, chunk-based SGD adaptation at eval time:
- Chunk val tokens into 32K-token chunks
- For each chunk: (1) score all sliding windows under `torch.no_grad()`, (2) train model on scored chunk tokens with SGD
- 3 epochs per chunk, cosine LR decay across chunks
- Gradient clipping at 1.0, distributed all-reduce for multi-GPU
- Total TTT eval time: ~370s (within 600s eval budget)
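The score-before-update ordering can be illustrated with a toy linear model in place of the transformer (chunk contents, loss, and hyperparameters here are illustrative, not the real TTT code):

```python
import numpy as np

def score_first_ttt(chunks, w, lr=0.005, momentum=0.9, epochs=3):
    """Toy score-first TTT: each chunk is fully scored under the
    *current* weights before any gradient step touches its tokens."""
    losses, vel = [], np.zeros_like(w)
    for x, y in chunks:                 # one chunk = (inputs, targets)
        # (1) score: loss under frozen weights (stand-in for BPB)
        losses.append(float(np.mean((x @ w - y) ** 2)))
        # (2) adapt: SGD with momentum on the already-scored chunk
        for _ in range(epochs):
            grad = 2 * x.T @ (x @ w - y) / len(y)
            vel = momentum * vel - lr * grad
            w = w + vel
    return np.array(losses), w
```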

## Compliance

Per Issue #1017 (Track B: legal eval-time adaptation):

- **Condition 1 (Causality):** Sliding-window eval is strictly causal. Each position scored from prefix tokens only.
- **Condition 2 (Normalized distribution):** Standard softmax over full vocab. No n-gram cache, no logit biasing.
- **Condition 3 (Score before update):** Each chunk fully scored under `torch.no_grad()` BEFORE any SGD update. Training only on already-scored tokens.
- **Condition 4 (Single pass):** Each token scored exactly once. No rescoring, no multi-pass selection.

Additional:
- No SLOT (standard or causal)
- No pre-quant TTT on val data (model quantized once during training, TTT adapts at eval time)
- No ETLB (eval-time logit bias)
- No n-gram cache or tilt
- All artifacts under 16,000,000 bytes on all 3 seeds
- Training under 600s on all 3 seeds (~588s actual)
- Eval (sliding + TTT) under 600s on all 3 seeds (~500s actual)

## Reproduction

```bash
pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

SEED=42 QK_GAIN_INIT=5.25 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

- **@clarkkev** — SP8192 + GPTQ Embeddings + SDClip + MuonEq-R + depth recurrence (PR #1394)
- **@dexhunter** — 3-layer depth recurrence (PR #1331, #1437), legal TTT on SP8192 (PR #1413)
- **@abaybektursun** — Score-first TTT framework (PR #549, merged precedent)
- **@Robby955** — Parallel residuals on SP8192 (PR #1412)
- **@msisovic** — Parallel residuals concept (PR #1204)
- **@X-Abhishek-X** — Hyperparameter tuning: WD=0.095, MLR=0.022, EMA=0.9965 (PR #1445, #1471)

## Acknowledgements

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod), which was instrumental in running the 160+ experiments across Steps 1-22 that led to this result.

## Included Files

- `README.md` (this file)
- `submission.json`
- `train_gpt.py`
- `train_seed42.log`
- `train_seed314.log`
- `train_seed999.log`
@@ -0,0 +1,36 @@
{
"author": "bigbag",
"github_id": "bigbag",
"name": "SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal Score-First TTT",
"date": "2026-04-09",
"track": "10min_16mb",
"val_bpb": 1.08100,
"val_bpb_std": 0.00020,
"seeds": [42, 314, 999],
"seed_results": {
"42": {"val_bpb": 1.08079, "artifact_bytes": 15991930},
"314": {"val_bpb": 1.08103, "artifact_bytes": 15992919},
"999": {"val_bpb": 1.08118, "artifact_bytes": 15992919}
},
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"technique_summary": "SP8192 + 3-Layer Depth Recurrence (L3-5) + Parallel Residuals (L7+) + QK-Gain 5.25 + EMA 0.9965 + WD 0.095 + Score-First TTT (SGD 3ep) + GPTQ SDClip + Brotli",
"compliance": {
"train_under_600s": true,
"artifact_under_16mb": true,
"eval_under_600s": true,
"no_slot": true,
"no_pre_quant_ttt": true,
"no_etlb": true,
"no_ngram_cache": true,
"score_first_ttt": true,
"three_seeds": true
},
"attribution": {
"sp8192_gptq_sdclip": "@clarkkev (PR #1394)",
"depth_recurrence": "@dexhunter (PR #1331, #1437)",
"parallel_residuals": "@Robby955 (PR #1412), @msisovic (PR #1204)",
"legal_ttt_framework": "@abaybektursun (PR #549), @dexhunter (PR #1413)",
"hyperparameter_tuning": "@X-Abhishek-X (PR #1445)"
}
}
