# Wider Loop + Per-Pass Embeddings + Tap-In V6 + Legal TTT

## Results

3-seed mean **+V6+TTT**: **1.078825** sliding-window BPB. All seeds sub-1.080. Beats merged SOTA (PR #1493, 1.0810) by 0.00218 BPB = 0.00562 nats (Welch t=5.52, df=2.4, p<0.01).

| Seed | Steps | Pre-quant BPB | Quant BPB | Raw SW BPB | + V6 + TTT | Artifact bytes |
|---|---|---|---|---|---|---|
| 1234 | 4726 | 1.085931 | 1.097365 | 1.080604 | **1.078086** | **15,974,583** |
| 42 | 4708 | 1.086812 | 1.098170 | 1.081516 | **1.079063** | 15,979,306 |
| 2025 | 4718 | 1.087013 | 1.098452 | 1.081802 | **1.079326** | 15,978,483 |
| **Mean** | 4717 | 1.086585 | 1.097996 | 1.081307 | **1.078825** | 15,977,457 |
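
A quick arithmetic check of the mean row (pure bookkeeping, not part of the submission):

```python
# Per-seed +V6+TTT BPB from the table above.
bpb = [1.078086, 1.079063, 1.079326]
mean = sum(bpb) / len(bpb)
print(round(mean, 6))  # 1.078825
```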

### Budget (recommended seed 1234)

| Component | Bytes |
|---|---|
| Produced model (.int6.ptz) | 15,974,583 |
| train_gpt.py (LZMA stub) | 25,346 |
| **Total** | **15,999,929** |
| **Headroom under 16 MB** | **+71** |

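The budget arithmetic, assuming "16 MB" means the decimal 16,000,000-byte limit (an assumption, but the only reading consistent with the +71 headroom):

```python
model_bytes = 15_974_583   # produced .int6.ptz
stub_bytes  = 25_346       # train_gpt.py LZMA stub
limit       = 16_000_000   # decimal 16 MB, implied by the +71 headroom

total = model_bytes + stub_bytes
print(total, limit - total)  # 15999929 71
```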
## Key Techniques

Builds on our previous PR ([#1420 — Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt](https://github.com/openai/parameter-golf/pull/1420)) by:

1. **Wider depth recurrence**: `LOOP_START=3` `LOOP_END=5` `NUM_LOOPS=2` (3 passes through 3 loop blocks instead of 4 passes through 2; 9 loop-block executions per forward).
2. **Per-pass loop embeddings**: 3 zero-init learned vectors (`nn.Embedding(3, 512)`), one added at the start of each pass.
3. **Tap-In V6 cross-window n-gram + cross-window lost-len rule** at eval time (C++ matcher, ~135 s on 8×H100).

   *Why "Tap-In"?* In golf, the tap-in is the tiny final stroke that rolls the ball the last inch into the hole after the big drive has done all the work. The model makes the big swing; Tap-In is the small eval-time nudge that finishes the putt.

   *Intuitively*: Tap-In is a document-local scribe. As the model predicts each token, the scribe scans backward through the same document for the exact phrase the model just generated and whispers what came after it last time. If the model is already considering that token, the scribe nudges its probability up a tiny bit; if the phrase fell out of the model's 2048-token attention window (think: a proper name introduced 3,000 tokens ago), the scribe is the only one who can recover it. Wrong whispers cost almost nothing because the nudge is small; right whispers — especially for forgotten long-range repetitions — cut multiple nats off the loss at that single position. It fires hundreds of thousands of times across the eval; each individual win is small, but they stack into a clean -0.001 BPB on top of the model.

4. **Legal Score-First TTT** (the PR #1413 recipe: `TTT_LR=0.005` `TTT_FREEZE_BLOCKS=0`) stacked on top of V6 in the SCORE phase.
5. **`HESSIAN_CLIP_LAMBDA=0`**: the 0.175 default in the #1420 code was a known-failed feature accidentally left enabled; pinning it to 0 improves BPB by 0.0006 and shrinks the model by ~40 KB.
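
Items 1-2 can be sketched as a small PyTorch module. Everything here is illustrative: the class and argument names are made up, and `LOOP_END` is treated as inclusive so that `LOOP_START=3 LOOP_END=5` covers the 3 loop blocks the text describes; the real wiring lives in `human_readable/train_gpt.py`.

```python
import torch
import torch.nn as nn

class LoopedBlocks(nn.Module):
    """Run blocks [loop_start, loop_end] (inclusive) for num_loops + 1 passes,
    adding a zero-init learned pass embedding at the start of each pass."""
    def __init__(self, blocks, loop_start=3, loop_end=5, num_loops=2, dim=512):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.loop_start, self.loop_end, self.num_loops = loop_start, loop_end, num_loops
        self.pass_emb = nn.Embedding(num_loops + 1, dim)  # one vector per pass
        nn.init.zeros_(self.pass_emb.weight)              # zero-init: a no-op at step 0

    def forward(self, x):
        i = 0
        while i < len(self.blocks):
            if i == self.loop_start:
                for p in range(self.num_loops + 1):        # 3 passes
                    x = x + self.pass_emb.weight[p]        # per-pass embedding
                    for j in range(self.loop_start, self.loop_end + 1):
                        x = self.blocks[j](x)              # 3 x 3 = 9 executions
                i = self.loop_end + 1
            else:
                x = self.blocks[i](x)
                i += 1
        return x
```

With 8 blocks this executes blocks 3, 4, 5 three times each (the 9 loop-block executions) and every other block once; because the embeddings start at zero, looping is initially invisible to the loss and the vectors differentiate the passes only as training pulls them apart.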

## What gets evaluated

The competition harness runs `torchrun --nproc_per_node=8 train_gpt.py`. This single file is the entire scored submission — it decompresses, trains, quantizes, and evaluates end-to-end. Everything else in this folder is for human review and reproducibility.

## Methodology — single pass, no double evaluation

**The headline number is from a single causal left-to-right pass through the val set** with Tap-In V6 + Legal TTT applied during scoring. There is no double pass, no second-pass rescoring, no information leak between runs.

The training and eval logs in this folder are intentionally split so each component's contribution can be attributed independently:

| File | What it is | Passes through val |
|---|---|---|
| `train_seed{42,1234,2025}.log` | Standard training run; the end-of-training eval inside the training script reports the **Raw SW BPB** column (no rule, no TTT). | 1 |
| `eval_v6_s42.log` | Re-loads the saved `.int6.ptz` and runs **Baseline + V6** (ablation: V6's isolated contribution). | 1 |
| `eval_v6_ttt_s{42,1234,2025}.log` | Re-loads the saved `.int6.ptz` and runs **Baseline + V6 + Legal TTT** — these are the headline numbers per seed. | 1 each |

Each eval is a fresh load of the same saved int6 model — no state carried between runs, no information leak from any earlier run into a later one. The leaderboard-scored number is the **+V6+TTT** column of the per-seed table above, produced by a single pass.
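
The score-then-train ordering that keeps the single pass honest can be sketched like this (hypothetical names; `model` is any callable returning a mean loss, and the real implementation lives in `human_readable/train_gpt.py`):

```python
import torch

def score_first_ttt(model, chunks, opt):
    """Score-first TTT sketch: each chunk is scored under torch.no_grad()
    BEFORE any optimizer.step() on it, so every scored token is predicted
    by weights trained only on the strict prefix of the val stream."""
    loss_sum, n_tok = 0.0, 0
    for ids, targets in chunks:                 # left-to-right, non-overlapping
        with torch.no_grad():                   # SCORE phase: no update yet
            loss_sum += model(ids, targets).item() * targets.numel()
            n_tok += targets.numel()
        model(ids, targets).backward()          # TRAIN phase: adapt on the chunk
        opt.step()                              # just scored; helps later chunks
        opt.zero_grad()
    return loss_sum / n_tok
```

The per-chunk mean is weighted by token count, mirroring how a `loss_sum` accumulator would be normalized at the end of the pass.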

## Legality

This submission is legal because every gain comes from the strict prefix and nothing else:

- **Score-first TTT** (`human_readable/train_gpt.py:1339-1524`) accumulates `loss_sum` under `torch.no_grad()` BEFORE `optimizer.step()` ever runs, and the chunk math is airtight: chunk `ci`'s training targets max out at global position `(ci+1)·32768`, while chunk `ci+1`'s scored targets start at `≥(ci+1)·32768+1` — a strict inequality, so no token is ever predicted by a model that has already been trained on it.
- **The Tap-In V6 C++ matcher** (`human_readable/tapin_cpp.py`) is byte-identical to the previously audited reference: within-window matches require `p+1 < t`, so `cont = ids[p+1]` is strict prefix; cross-window's `lost_len_at_t` upper bound resolves to `(ws+t)-window_size+1 < ws+t+1`; the linked-list `head/tail/fwd[tok]` update happens at `:240-249`, AFTER the score block at `:126-238`; and there is zero `is_bnd_[tok]`/`has_ls_[tok]` target-dependent gating anywhere — the Category 15 attack surface is structurally absent, not merely disabled.
- **Probability mixing**: `p_new(k)=(1-w)·p(k)` for `k≠tok` and `p_new(tok)=(1-w)·p(tok)+w` sums to exactly 1 by construction, so `F.cross_entropy` consumes a properly normalized distribution.
- **Single pass**: eval is one left-to-right sliding pass with non-overlapping 64-token scored ranges, so no position is ever rescored.
- **No val exposure in training**: GPTQ Hessians are collected from `train_loader.next_batch()` at `human_readable/train_gpt.py:944`, with zero val-data exposure during training.
- **Eval-time adaptation only**: the model is deserialized from `.int6.ptz` BEFORE TTT touches anything, so this is unambiguously eval-time adaptation, not pre-quant TTT.
- **Plausibility**: the 1.0788 BPB sits comfortably above the Shannon floor, in the range achievable by legitimate methods.

Every one of the four conditions in Issue #1017 is satisfied not by careful gating but by the structure of the code itself — there is no configuration of env vars or hyperparameters under which this submission could become illegal, because the illegal paths simply do not exist in the source.
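
The normalization claim for the probability mix is easy to check numerically (the helper name is illustrative):

```python
import torch

def tapin_mix(p, tok, w):
    """Tap-In probability mix from the legality argument:
    p_new(k) = (1-w) * p(k) for k != tok, p_new(tok) = (1-w) * p(tok) + w.
    Since the p(k) sum to 1, p_new sums to (1-w) + w = 1 by construction."""
    p_new = (1.0 - w) * p
    p_new[tok] = p_new[tok] + w
    return p_new
```

Because `w` only moves mass toward the matched continuation, a wrong match costs at most `-log(1-w)` nats at that position, while a right match on a low-probability token can save several nats.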

## Files

- `train_gpt.py` — **the scored artifact**. A self-contained LZMA stub that decompresses, builds the CUTLASS kernel, trains the model, then runs the V6 + TTT eval. Contains minified versions of all source files below.
- `human_readable/` — the unminified source code for review:
  - `train_gpt.py` — model, training loop, GPTQ, serialization, eval functions
  - `tapin_cpp.py` — C++ Tap-In matcher (single-file `load_inline`)
  - `_runner.py` — end-to-end orchestrator: train → monkey-patch MLP → install V6 → TTT eval
  - `cutlass_evt_fusion/` — fused MLP backward kernel from #1420

<details>
<summary><b>Reproduce — end-to-end on a fresh 8×H100 box</b></summary>

### 0. Hardware

- 8×H100 80GB SXM (Hopper, sm_90a). The CUTLASS EVT kernel and FA3 require Hopper.
- ECC OFF gives consistent results (the historical baseline ran ECC OFF).

### 1. Python + PyTorch + FA3

```bash
# Python 3.10 or 3.12
python3 -m venv venv && source venv/bin/activate

# PyTorch 2.9.1+cu128 (NOT 2.11 — see "PyTorch version" note in PR #1420)
pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128

# Flash Attention 3 prebuilt wheel (do NOT compile from source)
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291

# Other deps
pip install sentencepiece brotli numpy ninja
```

### 2. CUTLASS headers (one-time, system-wide)

```bash
sudo git clone --depth=1 https://github.com/NVIDIA/cutlass /opt/cutlass
```

### 3. Download the SP8192 dataset

The dataset and tokenizer are pre-built on HuggingFace under the parameter-golf data repo. Place them so the structure is:

```
~/data/
  datasets/fineweb10B_sp8192/
    fineweb_train_*.bin   (128 shards)
    fineweb_val_*.bin     (1 shard)
  tokenizers/
    fineweb_8192_bpe.model
```

Then `export DATA_DIR=~/data/`.

### 4. Run (train + V6 + TTT eval, end-to-end)

```bash
SEED=1234 NUM_LOOPS=2 LOOP_START=3 LOOP_END=5 ENABLE_LOOPING_AT=0.35 \
PARALLEL_RESIDUAL_START=7 HESSIAN_CLIP_LAMBDA=0 \
DATA_DIR=$DATA_DIR \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

`train_gpt.py` is self-contained — it decompresses the code, builds the CUTLASS kernel, trains the model (~10 min), then automatically runs the V6 + TTT eval (~7 min). No separate eval step is needed.

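The stub mechanism can be illustrated in miniature (the payload here is a stand-in; the real stub embeds the minified training code and kernel sources):

```python
import base64
import lzma

# Stand-in payload; the real stub carries ~25 KB of LZMA-compressed source.
PAYLOAD = base64.b85encode(lzma.compress(b"print('hello from payload')")).decode()

def run_stub():
    """Decompress the embedded source and exec it, as a self-contained stub would."""
    src = lzma.decompress(base64.b85decode(PAYLOAD)).decode()
    exec(compile(src, "<payload>", "exec"), {"__name__": "__main__"})
```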
### 5. Expected output

For `SEED=1234`:

```
=== V6 + TTT ===
  TTT: lr=0.005 epochs=3 chunk=32768 freeze=0
  val_loss: 2.784796 val_bpb: 1.078086 time: 389.8s
```

The headline number is **`val_bpb: 1.078086`**. To reproduce the 3-seed mean of **1.078825**, run with `SEED=42` and `SEED=2025` as well and average the three results.

### Troubleshooting

| Symptom | Fix |
|---|---|
| `val_bpb` ≈ 1.16 instead of 1.08 | `torch.compile` was stripped — verify `eval_val_sliding_ttt` has `logits_fn = torch.compile(model.forward_logits, dynamic=False, fullgraph=True)` |
| `val_bpb` ≈ 1.085 instead of 1.078 (no V6 effect) | `TAPIN_CPP=1 TAPIN_V4_ENABLED=1 TAPIN_V6_CROSS=1` env vars are set automatically by the stub; check `human_readable/_runner.py` if running manually |
| Training BPB is 0.001 worse than expected | Check that `HESSIAN_CLIP_LAMBDA=0` is set |
| `RuntimeError: Ninja is required` | `pip install ninja` |
| `RuntimeError: operator cutlass_evt::gemm_mul does not exist` | CUTLASS headers not found at `/opt/cutlass` (step 2) |
| `Inference tensors cannot be saved for backward` (during TTT) | The TTT SCORE phase must use `torch.no_grad()`, NOT `torch.inference_mode()` (this is correct in the shipped code) |

</details>