4 changes: 3 additions & 1 deletion .gitignore
@@ -8,4 +8,6 @@ data/manifest.json
data/docs_selected.jsonl
.mypy_cache/
.venv
logs/
logs/
plans/
.runpod_state/
193 changes: 193 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,193 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Parameter Golf is OpenAI's Model Craft Challenge: train the best language model that fits in a **16MB artifact** (code + compressed weights) in under **10 minutes on 8×H100s**, optimized for bits-per-byte (BPB) on FineWeb validation.

## Commands

### Training (multi-GPU)
```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Training (single GPU)
```bash
python train_gpt.py
```

### Download data
```bash
python data/cached_challenge_fineweb.py
```

All model hyperparameters are configured via environment variables (see `Hyperparameters` dataclass in train_gpt.py). Key ones:
- `DATA_PATH`, `TOKENIZER_PATH` — dataset/tokenizer locations
- `VOCAB_SIZE`, `NUM_LAYERS`, `MODEL_DIM`, `NUM_HEADS`, `NUM_KV_HEADS`, `MLP_MULT` — architecture
- `ITERATIONS`, `MAX_WALLCLOCK_SECONDS`, `TRAIN_BATCH_TOKENS`, `TRAIN_SEQ_LEN` — training budget
- `MATRIX_LR`, `SCALAR_LR`, `EMBED_LR`, `TIED_EMBED_LR`, `HEAD_LR` — per-group learning rates
- `TTT_ENABLED`, `TTT_OPTIMIZER` (adamw/muon/sgd), `TTT_EPOCHS`, `TTT_LR`, `TTT_COSINE` — test-time training
- `LEAKY_SLOPE` (0.0=ReLU², 0.5=LeakyReLU(0.5)²), `GPTQ_ENABLED` — activation & quantization
- `EMA_ENABLED`, `SWA_ENABLED`, `LATE_QAT`, `VALUE_RESIDUAL`, `GATED_ATTENTION`, `XSA_LAST_N`, `LN_SCALE`

There is no build system, test suite, or linter. The project is a single training script.
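The env-var pattern above is plain `os.environ` lookups feeding a dataclass. A minimal sketch (the field names are real knobs listed above, but the defaults and the class body are illustrative, not the actual `Hyperparameters` definition in train_gpt.py):

```python
import os
from dataclasses import dataclass, fields

@dataclass
class HyperparametersSketch:
    # Illustrative subset: the real Hyperparameters dataclass in
    # train_gpt.py defines many more fields and its own defaults.
    NUM_LAYERS: int = 10
    MODEL_DIM: int = 512
    MATRIX_LR: float = 0.01

    @classmethod
    def from_env(cls):
        # Each field can be overridden by an env var of the same name,
        # cast to the field's annotated type (int, float, ...).
        overrides = {
            f.name: f.type(os.environ[f.name])
            for f in fields(cls)
            if f.name in os.environ
        }
        return cls(**overrides)
```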

## Architecture

### train_gpt.py (~1487 lines, single-file constraint)

The entire model, training loop, data loading, evaluation, and serialization live in one file. The challenge rules require all code in `train_gpt.py` (hard limit: 1500 lines).

**Model (GPT class):** Transformer with RMSNorm, RoPE, Grouped Query Attention (GQA), ReLU²/LeakyReLU(0.5)² MLP (`LEAKY_SLOPE`), tied embeddings, logit softcapping, and skip connections between layers.

**Optimizer:** Muon (Newton-Schulz orthogonalization) for 2D matrix parameters; Adam for embeddings and scalar/control parameters. Separate learning rate groups for embeddings, matrices, scalars, and optional untied head.

**Data pipeline:** Binary shards (256-int header + uint16 tokens) → `TokenStream` → `DistributedTokenLoader` → sequential streaming batches. No random sampling.
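Reading one shard of that layout is a two-step `frombuffer`; a sketch (the header is described only as "256-int", so int32 is an assumption to check against the real writer):

```python
import numpy as np

def read_shard(path):
    """Read one binary shard: a 256-int header (int32 assumed here)
    followed by the payload as uint16 token ids."""
    with open(path, "rb") as f:
        header = np.frombuffer(f.read(256 * 4), dtype=np.int32)
        tokens = np.frombuffer(f.read(), dtype=np.uint16)
    return header, tokens
```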

**Evaluation:** Tokenizer-agnostic BPB metric computed via SentencePiece byte-accounting lookup tables, handling token boundaries and leading spaces correctly.
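The metric itself is simple once the byte accounting is done: summed negative log-likelihood in nats, converted to bits, divided by the number of bytes scored. In code:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """BPB: total NLL (nats) converted to bits, per byte of raw text.
    Tokenizer-agnostic because the denominator counts bytes, not tokens."""
    return total_nll_nats / (math.log(2) * total_bytes)
```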

**Serialization:** Mixed int5 (MLP) / int6 (attention) quantization with GPTQ-lite per-row clip search, FP16 passthrough for embeddings + control tensors, zstd-22 compression. 3% magnitude pruning before quantization. Final artifact must be ≤16,000,000 bytes.
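The per-row clip search can be sketched as a small grid search over clip fractions, keeping the clip that minimizes round-trip MSE (a simplified stand-in for the GPTQ-lite logic; the grid and bit width here are illustrative):

```python
import numpy as np

def best_clip(row, bits=5, grid=np.linspace(0.5, 1.0, 11)):
    """Per-row clip search: try clip = frac * max|row| for a few
    fractions and keep the one with the lowest quantization MSE."""
    qmax = 2 ** (bits - 1) - 1          # int5 -> [-15, 15]
    best_mse, best_c = np.inf, None
    for frac in grid:
        clip = frac * np.abs(row).max()
        scale = clip / qmax
        q = np.clip(np.round(row / scale), -qmax, qmax)
        mse = np.mean((q * scale - row) ** 2)
        if mse < best_mse:
            best_mse, best_c = mse, clip
    return best_c
```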

### train_gpt_mlx.py

MLX port for Apple Silicon development. Same architecture, different backend.

## Challenge Rules (key constraints)

- Artifact = `len(open("train_gpt.py").read().encode()) + len(compressed_model_bytes)` ≤ 16MB
- **Two separate 10-minute limits:**
- Training: ≤10 min wallclock on 8×H100s (`MAX_WALLCLOCK_SECONDS=600`)
- Evaluation (TTT + sliding window): ≤10 min additional, not counted against the training limit
- Total allowed: up to 20 min (10 train + 10 eval)
- Cannot access validation data during training (test-time training on already-evaluated tokens is allowed)
- TTT must be "score-first": evaluate tokens before training on them
- New SOTA requires ≥0.005 nats BPB improvement with p < 0.01 statistical significance
- Default config: 1024 vocab (SentencePiece BPE), 10 layers, 512 dim, 8 heads, 4 KV heads
- Current best: 1.1492 BPB (10L, VR+GA+XSA4+SWA+LateQAT, 15.3MB artifact)
- SOTA on GitHub (verified, rule-compliant): ~1.067 BPB (PR #462: SwiGLU + AdamW TTT 10ep)
- SOTA on GitHub (unverified/borderline): ~0.978 BPB (PR #517: 100ep Cosine TTT, violates eval time limit)
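The artifact-size rule in the first bullet can be checked in a few lines. The repo uses zstd level 22; stdlib `lzma` stands in here only so the sketch needs no third-party package:

```python
import lzma

LIMIT = 16_000_000

def artifact_bytes(code_path: str, weight_blob: bytes) -> int:
    """Artifact size = raw code bytes + compressed weight bytes.
    (The real pipeline compresses with zstd-22, not lzma.)"""
    with open(code_path, "rb") as f:
        code = f.read()
    return len(code) + len(lzma.compress(weight_blob, preset=9))
```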

## Records

Submissions live in `records/track_10min_16mb/` with each containing a `train_gpt.py`, `submission.json` (val_bpb, bytes_total, author), `train.log`, and `README.md` describing techniques used.

## RunPod

Authenticate `runpodctl` once with `runpodctl config --apiKey "$RUNPOD_API_KEY"`. SSH key: `/home/work/.ssh/id_ed25519`.

### Create H100 pod (parameter-golf template)
```bash
PUB_KEY=$(cat /home/work/.ssh/id_ed25519.pub)
runpodctl pod create \
--template-id y5cejece4j \
--gpu-id "NVIDIA H100 80GB HBM3" \
--gpu-count 1 \
--name "param-golf" \
--volume-in-gb 50 --container-disk-in-gb 50 \
--ports "8888/http,22/tcp" --ssh \
--env "{\"JUPYTER_PASSWORD\":\"parameter-golf\",\"PUBLIC_KEY\":\"$PUB_KEY\"}"
```

### SSH into pod
```bash
ssh -i /home/work/.ssh/id_ed25519 root@<IP> -p <PORT>
```

### List / stop / delete pods
```bash
runpodctl pod list
runpodctl pod stop <POD_ID>
runpodctl pod delete <POD_ID>
```

### Create spot (interruptible) H100 — $1.75/hr vs $2.69 on-demand
```bash
PUB_KEY=$(cat /home/work/.ssh/id_ed25519.pub)
curl -s -X POST https://api.runpod.io/graphql \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $RUNPOD_API_KEY" \
-d "{\"query\": \"mutation { podRentInterruptable(input: { name: \\\"param-golf-spot\\\", templateId: \\\"y5cejece4j\\\", gpuTypeId: \\\"NVIDIA H100 80GB HBM3\\\", gpuCount: 1, volumeInGb: 50, containerDiskInGb: 50, cloudType: SECURE, startSsh: true, ports: \\\"8888/http,22/tcp\\\", bidPerGpu: 1.75, env: [{key: \\\"JUPYTER_PASSWORD\\\", value: \\\"parameter-golf\\\"}, {key: \\\"PUBLIC_KEY\\\", value: \\\"$PUB_KEY\\\"}] }) { id costPerHr desiredStatus machine { gpuDisplayName location } } }\"}"
```

### Key info
- Template ID: `y5cejece4j` (runpod/parameter-golf:latest)
- H100 SXM GPU ID: `NVIDIA H100 80GB HBM3` (on-demand ~$2.69/hr, spot ~$1.75/hr)
- Image has Python 3.12, PyTorch 2.9.1, all deps pre-installed
- Data download: `python3 data/cached_challenge_fineweb.py --variant sp1024` (run on pod)
- Template doesn't auto-clone — run `git clone https://github.com/openai/parameter-golf.git` on pod
- Need `pip install --break-system-packages zstandard` on the pod

### Deployment script (`run_on_runpod.sh`)
```bash
./run_on_runpod.sh # Create spot pod, setup, train
./run_on_runpod.sh --status # Pod status + SSH command
./run_on_runpod.sh --logs # Tail training logs
./run_on_runpod.sh --results # Show key metrics
./run_on_runpod.sh --save-log <tag> # Save full log
./run_on_runpod.sh --upload # Upload train_gpt.py to pod
./run_on_runpod.sh --rerun # Re-launch training (upload code + restart)
./run_on_runpod.sh --prep-data [N] # Download N shards locally (once)
./run_on_runpod.sh --upload-data # Upload local data to pod
./run_on_runpod.sh --stop # Stop pod
./run_on_runpod.sh --delete # Delete pod
```

### Training env vars (inline)
Pass `KEY=VALUE` args directly — forwarded to training process:
```bash
./run_on_runpod.sh EMA_ENABLED=1 SWA_ENABLED=0
./run_on_runpod.sh --rerun TTT_ENABLED=1 TTT_OPTIMIZER=adamw TTT_EPOCHS=10
./run_on_runpod.sh --rerun NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=10240
```

### GPU config
```bash
GPU_COUNT=8 BID_PRICE=1.75 ./run_on_runpod.sh # 8xH100 spot ($14/hr)
GPU_COUNT=1 BID_PRICE=1.75 ./run_on_runpod.sh # 1xH100 spot ($1.75/hr)
GPU_ID="NVIDIA RTX PRO 4500 Blackwell" BID_PRICE=0.27 ./run_on_runpod.sh # cheap size test
```

### Local data (separate from repo)
Data lives at `$LOCAL_DATA_ROOT` (default: `~/dev/personal/parameter-golf-data/`).
```bash
./run_on_runpod.sh --prep-data 1 # Download 1 shard locally (quick iteration)
./run_on_runpod.sh --prep-data 80 # Download all 80 shards (full training)
```
When local data exists, `./run_on_runpod.sh` auto-detects and rsync's it to the pod instead of downloading from HuggingFace. Override path: `LOCAL_DATA_ROOT=/path/to/data ./run_on_runpod.sh`

### Fast experiment workflow (~30s between runs)
```bash
./run_on_runpod.sh --prep-data 1 # Once: download data locally
GPU_COUNT=1 ./run_on_runpod.sh # Create pod (auto-uploads local data)
./run_on_runpod.sh --save-log "baseline" # Save results
./run_on_runpod.sh --rerun EMA_ENABLED=1 # New experiment (uploads code, restarts)
./run_on_runpod.sh --save-log "ema" # Save results
./run_on_runpod.sh --delete # Clean up
```

### Logging
Save every training run's log after completion:
```bash
./run_on_runpod.sh --save-log "11L_VR1_GA1_prune3pct"
```
This saves to `logs/<timestamp>_<tag>.log` and `logs/<timestamp>_<tag>.summary` with key metrics extracted.

### Cost-saving tips
- **Always delete pods after saving logs/results** — `--save-log <tag>` then `--delete`
- **Use `--rerun` to iterate** — skips pod creation + data download, ~30s turnaround
- **Pre-download data locally** — `--prep-data 1` once, auto-uploaded to every pod
- **Test artifact size on cheap GPUs** — RTX PRO 4500 spot ($0.27/hr) before H100. Needs smaller batch:
`GPU_ID="NVIDIA RTX PRO 4500 Blackwell" BID_PRICE=0.27 ./run_on_runpod.sh TRAIN_BATCH_TOKENS=131072 TRAIN_SEQ_LEN=1024 EVAL_STRIDE=0 EMA_ENABLED=0`
- **Use `EVAL_STRIDE=0`** to skip sliding window eval on single GPU
- **Use `EMA_ENABLED=0`** on single GPU — EMA kills throughput (~32% slower)
- **Always `--stop` or `--delete` pods when done** — spot 8xH100 is $14/hr
- **Spot instances get preempted** — always use `nohup` and check pod status
- **TTT needs H100** — OOMs on 32GB GPUs. Only enable on H100+
- **TTT on single GPU is very slow** — use 8xH100 for TTT experiments
- **TTT has separate 10-min eval budget** — not counted in training time. ~20 epochs safe (~380s TTT + ~200s eval)
- **TTT adapts all params by default** — Muon for 2D + AdamW for 1D (when `TTT_OPTIMIZER=muon`)
- **TTT cosine LR enabled by default** (`TTT_COSINE=1`) — prevents overfitting at high epoch counts
- **Check pod status every 60s during experiments** — spot pods get preempted, don't waste money on dead pods
- **Save logs after EVERY experiment** before starting the next one — logs are lost when pod dies
1 change: 1 addition & 0 deletions notebooks/step1.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions notebooks/step1_5.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions notebooks/step2.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions notebooks/step3.ipynb

Large diffs are not rendered by default.

195 changes: 195 additions & 0 deletions notebooks/step3_1.ipynb

Large diffs are not rendered by default.

@@ -0,0 +1,96 @@
# Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT

**val_bpb = 1.0810** (3-seed mean, std 0.0002) | **~15.99 MB** | 8xH100 SXM

## 3-Seed Results

| Seed | Sliding BPB | **TTT BPB** | Artifact |
|------|-------------|-------------|----------|
| 42 | 1.0829 | **1.0808** | 15,991,930 |
| 314 | 1.0827 | **1.0810** | 15,992,919 |
| 999 | — | **1.0812** | 15,992,919 |
| **Mean** | | **1.0810** | |

Merged SOTA (PR #1019): **1.1147 BPB**. Delta: **-0.0337 BPB**. Clears the 0.005-nat threshold.

## Key Techniques

1. **SP8192 + GPTQ SDClip** — int6 matrices (k=12.85), int8 embeddings (k=20.0), zero selective pruning (PR #1394 @clarkkev)
2. **3-Layer Depth Recurrence** (layers 3,4,5, activate at frac=0.35) — 17 virtual layers from 11 physical (PR #1331 @dexhunter, PR #1437 @dexhunter)
3. **Parallel Residuals** (layers 7+) — GPT-J style, attention and MLP read from same input (PR #1412 @Robby955, PR #1204 @msisovic)
4. **QK-Gain 5.25** — learnable per-head query scaling, monotonic improvement from 4.0 to 5.25
5. **Legal Score-First TTT** — SGD (lr=0.005, momentum=0.9), 3 epochs per 32K-token chunk, cosine LR decay. Score-before-update ordering. (PR #549 @abaybektursun, PR #1413 @dexhunter)
6. **Tuned Hyperparameters** — WD=0.095, MLR=0.022, EMA=0.9965, warmdown=0.72 (PR #1445 @X-Abhishek-X)
7. **LZMA code wrapper** — ~16.6KB code, saves ~43KB vs uncompressed

## Architecture

11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9,10] (loops layers 3-5, activated at step ~2016). Parallel residuals from layer 7: attention and MLP operate on same pre-residual input. Skip gates (sigmoid-gated U-Net connections).
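The encoder/decoder layer schedule amounts to indexing a list of 11 physical blocks; the forward pass can be sketched as:

```python
# The 11 physical blocks are reused along a fixed schedule, yielding
# 17 virtual layer applications (layers 3-5 loop, per the text above).
ENCODER = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER = [5, 3, 4, 5, 6, 7, 8, 9, 10]

def recurrent_forward(x, blocks):
    """Apply physical blocks in schedule order; blocks 3-5 run 3x each."""
    for i in ENCODER + DECODER:
        x = blocks[i](x)
    return x
```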

## Training

MuonEq-R optimizer (row-normalized Muon, Newton-Schulz 5 steps), AdamW for embeddings/scalars. 4550 steps in 588s on 8xH100 SXM. Linear warmdown to LR=0 over final 72% of training. EMA decay 0.9965.
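The Newton-Schulz orthogonalization at the heart of Muon can be sketched with the classic cubic iteration (Muon proper uses a tuned quintic polynomial and only ~5 steps; this version simply runs longer to converge):

```python
import numpy as np

def newton_schulz_orth(G, steps=30):
    """Cubic Newton-Schulz iteration driving all singular values of G
    toward 1, i.e. projecting the gradient onto (near-)orthogonality."""
    X = G / np.linalg.norm(G)          # Frobenius norm: all sv <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```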

## Quantization

Full-Hessian GPTQ with SDClip: `clip = k * std(row)` gives a principled rate-distortion trade-off. int6 for attention/MLP matrices, int8 for token embeddings; byte-shuffle + Brotli-11 compression. No selective pruning is needed: the model fits natively under 16MB.
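A minimal sketch of the SDClip scheme, assuming the clip formula stated above and symmetric per-row scales (the full GPTQ Hessian machinery is omitted):

```python
import numpy as np

def quantize_sdclip(W, bits=6, k=12.85):
    """Per-row symmetric quantization with clip = k * std(row),
    as described above. Values beyond the clip saturate."""
    qmax = 2 ** (bits - 1) - 1                  # int6 -> [-31, 31]
    clip = k * W.std(axis=1, keepdims=True)
    scale = clip / qmax
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```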

## TTT (Test-Time Training)

Score-first, chunk-based SGD adaptation at eval time:
- Chunk val tokens into 32K-token chunks
- For each chunk: (1) score all sliding windows under `torch.no_grad()`, (2) train model on scored chunk tokens with SGD
- 3 epochs per chunk, cosine LR decay across chunks
- Gradient clipping at 1.0, distributed all-reduce for multi-GPU
- Total TTT eval time: ~370s (within 600s eval budget)
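The score-before-update ordering can be illustrated with a toy linear model in place of the transformer (chunk contents, loss, and hyperparameters here are illustrative, not the real TTT code):

```python
import numpy as np

def score_first_ttt(chunks, w, lr=0.005, momentum=0.9, epochs=3):
    """Toy score-first TTT: each chunk is fully scored under the
    *current* weights before any gradient step touches its tokens."""
    losses, vel = [], np.zeros_like(w)
    for x, y in chunks:                 # one chunk = (inputs, targets)
        # (1) score: loss under frozen weights (stand-in for BPB)
        losses.append(float(np.mean((x @ w - y) ** 2)))
        # (2) adapt: SGD with momentum on the already-scored chunk
        for _ in range(epochs):
            grad = 2 * x.T @ (x @ w - y) / len(y)
            vel = momentum * vel - lr * grad
            w = w + vel
    return np.array(losses), w
```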

## Compliance

Per Issue #1017 (Track B: legal eval-time adaptation):

- **Condition 1 (Causality):** Sliding-window eval is strictly causal. Each position scored from prefix tokens only.
- **Condition 2 (Normalized distribution):** Standard softmax over full vocab. No n-gram cache, no logit biasing.
- **Condition 3 (Score before update):** Each chunk fully scored under `torch.no_grad()` BEFORE any SGD update. Training only on already-scored tokens.
- **Condition 4 (Single pass):** Each token scored exactly once. No rescoring, no multi-pass selection.

Additional:
- No SLOT (standard or causal)
- No pre-quant TTT on val data (model quantized once during training, TTT adapts at eval time)
- No ETLB (eval-time logit bias)
- No n-gram cache or tilt
- All artifacts under 16,000,000 bytes on all 3 seeds
- Training under 600s on all 3 seeds (~588s actual)
- Eval (sliding + TTT) under 600s on all 3 seeds (~500s actual)

## Reproduction

```bash
pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

SEED=42 QK_GAIN_INIT=5.25 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

- **@clarkkev** — SP8192 + GPTQ Embeddings + SDClip + MuonEq-R + depth recurrence (PR #1394)
- **@dexhunter** — 3-layer depth recurrence (PR #1331, #1437), legal TTT on SP8192 (PR #1413)
- **@abaybektursun** — Score-first TTT framework (PR #549, merged precedent)
- **@Robby955** — Parallel residuals on SP8192 (PR #1412)
- **@msisovic** — Parallel residuals concept (PR #1204)
- **@X-Abhishek-X** — Hyperparameter tuning: WD=0.095, MLR=0.022, EMA=0.9965 (PR #1445, #1471)

## Acknowledgements

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod), which was instrumental in running the 160+ experiments across Steps 1-22 that led to this result.

## Included Files

- `README.md` (this file)
- `submission.json`
- `train_gpt.py`
- `train_seed42.log`
- `train_seed314.log`
- `train_seed999.log`
@@ -0,0 +1,36 @@
{
"author": "bigbag",
"github_id": "bigbag",
"name": "SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal Score-First TTT",
"date": "2026-04-09",
"track": "10min_16mb",
"val_bpb": 1.08100,
"val_bpb_std": 0.00020,
"seeds": [42, 314, 999],
"seed_results": {
"42": {"val_bpb": 1.08079, "artifact_bytes": 15991930},
"314": {"val_bpb": 1.08103, "artifact_bytes": 15992919},
"999": {"val_bpb": 1.08118, "artifact_bytes": 15992919}
},
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"technique_summary": "SP8192 + 3-Layer Depth Recurrence (L3-5) + Parallel Residuals (L7+) + QK-Gain 5.25 + EMA 0.9965 + WD 0.095 + Score-First TTT (SGD 3ep) + GPTQ SDClip + Brotli",
"compliance": {
"train_under_600s": true,
"artifact_under_16mb": true,
"eval_under_600s": true,
"no_slot": true,
"no_pre_quant_ttt": true,
"no_etlb": true,
"no_ngram_cache": true,
"score_first_ttt": true,
"three_seeds": true
},
"attribution": {
"sp8192_gptq_sdclip": "@clarkkev (PR #1394)",
"depth_recurrence": "@dexhunter (PR #1331, #1437)",
"parallel_residuals": "@Robby955 (PR #1412), @msisovic (PR #1204)",
"legal_ttt_framework": "@abaybektursun (PR #549), @dexhunter (PR #1413)",
"hyperparameter_tuning": "@X-Abhishek-X (PR #1445)"
}
}
