# Wider Loop + Per-Pass Embeddings + Tap-In V6 + Legal TTT

## Results

3-seed mean **+V6+TTT**: **1.078825** sliding-window BPB. All seeds sub-1.080. Beats merged SOTA (PR #1493, 1.0810) by 0.00218 BPB = 0.00562 nats (Welch t=5.52, df=2.4, p<0.01).

| Seed | Steps | Pre-quant BPB | Quant BPB | Raw SW BPB | + V6 + TTT | Artifact bytes |
|---|---|---|---|---|---|---|
| 1234 | 4726 | 1.085931 | 1.097365 | 1.080604 | **1.078086** | **15,974,583** |
| 42 | 4708 | 1.086812 | 1.098170 | 1.081516 | **1.079063** | 15,979,306 |
| 2025 | 4718 | 1.087013 | 1.098452 | 1.081802 | **1.079326** | 15,978,483 |
| **Mean** | 4717 | 1.086585 | 1.097996 | 1.081307 | **1.078825** | 15,977,457 |

### Budget (recommended seed 1234)

| Component | Bytes |
|---|---|
| Produced model (.int6.ptz) | 15,974,583 |
| train_gpt.py (LZMA stub) | 25,346 |
| **Total** | **15,999,929** |
| **Headroom under 16 MB** | **+71** |
## Key Techniques

Builds on our previous PR ([#1420 — Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt](https://github.com/openai/parameter-golf/pull/1420)) by:

1. **Wider depth recurrence**: `LOOP_START=3` `LOOP_END=5` `NUM_LOOPS=2` (3 passes through 3 loop blocks instead of 4 passes through 2), for 9 loop-block executions total.
2. **Per-pass loop embeddings**: 3 zero-init learned vectors (`nn.Embedding(3, 512)`), one fired at the start of each pass.

   <details>
   <summary><i>Per-Pass Embeddings on Wider Depth Recurrence — mechanistic analysis</i></summary>

   Depth recurrence reuses block weights across multiple passes, creating virtual depth without parameters. The cost: quantization error amplifies through reuse by $A(k) = (1 - \rho^k)/(1 - \rho) \approx 2\times$ at our contraction ratio $\rho \approx 0.63$.
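   A quick numeric check of the amplification factor (a standalone sketch; $k$ is taken here as the number of passes):

   ```python
   def amplification(rho: float, k: int) -> float:
       """Quantization-error amplification after k reused passes:
       A(k) = (1 - rho**k) / (1 - rho), a geometric-series partial sum."""
       return (1 - rho**k) / (1 - rho)

   # At the stated contraction ratio rho ~= 0.63 and k = 3 passes: ~2.03x
   print(round(amplification(0.63, 3), 2))
   ```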

   **Wider loop.** Looping blocks (3, 4, 5) × 3 passes instead of (4, 5) × 4 passes. Nearly the same compute (9 loop executions vs 8), but 3 distinct parameter sets instead of 2. Gives -0.0007 BPB at identical pre-quant loss — the improvement is entirely post-quantization.

   **Per-pass embeddings.** Three zero-init learned vectors ($e_i \in \mathbb{R}^{512}$, 1,536 params total) added to the residual before each pass. Combined with the wider topology: -0.00124 BPB (5-seed, $p < 0.003$). On the narrow topology: only -0.0005. The mechanism is strongly topology-dependent.

   **Where the gain lives.** The embeddings barely improve fp32 modeling. Nearly the entire gain comes from a collapsed quantization gap (0.0131 → 0.0114). The weights become more quantization-friendly, not more expressive.

   We traced this through per-matrix statistics → per-head decomposition → direct intervention. The weight-distribution signature localizes to two attention heads (K head 2, V head 1) in the loop blocks — but injecting bias directly at those heads recovers only ~50% of the gain via better modeling, while failing to reproduce the compression effect. The per-head signature is downstream of the mechanism, not its cause.

   The embedding mechanism has two separable effects: a modeling effect (K specialization in the newly added block 3, reproducible by a 192-param direct bias) and a compression effect (quant-gap collapse, not reproducible by any targeted head-level intervention we tested). The full residual-stream embedding constrains K from over-specializing and trades that headroom for compression-friendliness — direct bias takes the unconstrained modeling win but misses the compression side.

   </details>
3. **Tap-In V6 cross-window n-gram + cross-window lost-len rule** at eval time (C++ matcher, ~135 s on 8×H100).

   *Why "Tap-In"?* In golf, the tap-in is the tiny final stroke that rolls the ball the last inch into the hole after the big drive has done all the work. The model does the big swing; Tap-In is just the small eval-time nudge that finishes the putt.

   *Intuitively*: Tap-In is a document-local scribe. As the model predicts each token, the scribe scans backward through the same document for the exact phrase the model just generated and whispers what came after it last time. If the model is already considering that token, the scribe nudges its probability up a tiny bit; if the phrase fell out of the model's 2048-token attention window (think: a proper name introduced 3,000 tokens ago), the scribe is the only one who can recover it. Wrong whispers cost almost nothing because the nudge is small; right whispers — especially for forgotten long-range repetitions — cut multiple nats off the loss at that single position. It fires hundreds of thousands of times across the eval; each individual win is small, but they stack into a clean -0.001 BPB on top of the model.
4. **Legal Score-First TTT** (PR #1413 recipe: `TTT_LR=0.005` `TTT_FREEZE_BLOCKS=0`) stacked on top of V6 in the SCORE phase.
5. **`HESSIAN_CLIP_LAMBDA=0`**: the #1420 code default of 0.175 was a known-failed feature accidentally left as the default; pinning it to 0 improves BPB by 0.0006 and shrinks the model by ~40 KB.
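Items 1-2 can be sketched as a minimal module (hypothetical code under assumed names and shapes; the shipped `train_gpt.py` wires this into a full transformer):

```python
import torch
import torch.nn as nn

class LoopedStack(nn.Module):
    """Wider depth recurrence: reuse a few shared blocks for several passes,
    adding a zero-init learned per-pass embedding to the residual stream."""

    def __init__(self, blocks: nn.ModuleList, d_model: int = 512, passes: int = 3):
        super().__init__()
        self.blocks = blocks                  # shared loop blocks (e.g. blocks 3-5)
        self.passes = passes                  # NUM_LOOPS=2 -> 3 passes total
        self.pass_emb = nn.Embedding(passes, d_model)   # 3 x 512 = 1,536 params
        nn.init.zeros_(self.pass_emb.weight)  # zero-init: starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for p in range(self.passes):
            x = x + self.pass_emb.weight[p]   # fire the pass-p embedding once
            for blk in self.blocks:           # 3 blocks x 3 passes = 9 executions
                x = blk(x)
        return x
```

At zero init the embeddings leave the forward pass unchanged, so training starts from the plain looped model and only departs from it where the optimizer finds it useful.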

## What gets evaluated

The competition harness runs `torchrun --nproc_per_node=8 train_gpt.py`. This single file is the entire scored submission — it decompresses, trains, quantizes, and evaluates end-to-end. Everything else in this folder is for human review and reproducibility.

## Methodology — single pass, no double evaluation

**The headline number is from a single causal left-to-right pass through the val set** with Tap-In V6 + Legal TTT applied during scoring. There is no double pass, no second-pass rescoring, no information leak between runs.

The training and eval logs in this folder are intentionally split so each component's contribution can be attributed independently:

| File | What it is | Passes through val |
|---|---|---|
| `train_seed{42,1234,2025}.log` | Standard training run; the end-of-training eval inside the training script reports the **Raw SW BPB** column (no rule, no TTT). | 1 |
| `eval_v6_s42.log` | Re-loads the saved `.int6.ptz` and runs **Baseline + V6** (ablation: V6's isolated contribution). | 1 |
| `eval_v6_ttt_s{42,1234,2025}.log` | Re-loads the saved `.int6.ptz` and runs **Baseline + V6 + Legal TTT** — these are the headline numbers per seed. | 1 each |

Each eval is a fresh load of the same saved int6 model — no state carried between runs, no information leak from any earlier run into a later one. The leaderboard-scored number is the **+V6+TTT** column of the per-seed table above, produced by a single pass.
## Legality

This submission is legal because every gain comes from the strict prefix and nothing else:

- **Score-first TTT** (`human_readable/train_gpt.py:1339-1524`) accumulates `loss_sum` under `torch.no_grad()` BEFORE `optimizer.step()` ever runs, and the chunk math is airtight: chunk `ci`'s training targets max out at global position `(ci+1)·32768`, while chunk `ci+1`'s scored targets start at `≥ (ci+1)·32768+1` — strict inequality, so no token is ever predicted by a model that has already been trained on it.
- **The Tap-In V6 C++ matcher** (`human_readable/tapin_cpp.py`) is byte-identical to the previously audited reference: within-window matches require `p+1 < t` so `cont = ids[p+1]` is strict prefix; cross-window's `lost_len_at_t` upper bound resolves to `(ws+t)-window_size+1 < ws+t+1`; the linked-list `head/tail/fwd[tok]` update happens at `:240-249`, AFTER the score block at `:126-238`; and there is zero `is_bnd_[tok]`/`has_ls_[tok]` target-dependent gating anywhere — the Category 15 attack surface is structurally absent, not merely disabled.
- **The probability mixing** `p_new(k) = (1-w)·p(k)` for `k ≠ tok` and `p_new(tok) = (1-w)·p(tok) + w` sums to exactly 1 by construction, so `F.cross_entropy` consumes a proper normalized distribution.
- **Eval is one left-to-right sliding pass** with non-overlapping 64-token scored ranges, so no position is ever rescored.
- **GPTQ Hessians** are collected from `train_loader.next_batch()` at `human_readable/train_gpt.py:944`, with zero val-data exposure during training.
- **The model is deserialized from `.int6.ptz` BEFORE TTT touches anything**, so this is unambiguously eval-time adaptation, not pre-quant TTT.
- **The 1.0788 BPB sits comfortably above the Shannon floor**, in the range achievable by legitimate methods.

Every one of the four conditions in Issue #1017 is satisfied not by careful gating but by the structure of the code itself — there is no configuration of env vars or hyperparameters under which this submission could become illegal, because the illegal paths simply do not exist in the source.
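Both halves of Tap-In's behavior — the backward phrase scan and the normalization-preserving mixing — can be sanity-checked with a standalone Python sketch (illustrative helpers, not the shipped C++ matcher):

```python
def tap_in_suggest(ids, t, n=4):
    """Scan backward for the most recent earlier occurrence of the n-gram
    ending at position t-1 and return the token that followed it (the
    'whisper'), or None. The p + 1 < t bound keeps it strict-prefix."""
    if t < n + 1:
        return None
    phrase = ids[t - n:t]
    for p in range(t - 2, n - 2, -1):      # latest earlier match wins
        if ids[p - n + 1:p + 1] == phrase:
            return ids[p + 1]              # p + 1 < t: strict prefix only
    return None

def tap_in_mix(p, tok, w):
    """p_new(k) = (1-w)*p(k) for k != tok; p_new(tok) gets an extra +w.
    Sums to (1-w)*sum(p) + w = 1 whenever p is normalized."""
    q = [(1.0 - w) * pk for pk in p]
    q[tok] += w
    return q
```

In the real matcher the mixed distribution feeds `F.cross_entropy` directly, which is why the sum-to-1 property matters: the scored loss stays a proper cross-entropy at every position.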

## Files

- `train_gpt.py` — **the scored artifact**. Self-contained LZMA stub that decompresses, builds the CUTLASS kernel, trains the model, then runs V6 + TTT eval. Contains minified versions of all source files below.
- `human_readable/` — the unminified source code for review:
  - `train_gpt.py` — model, training loop, GPTQ, serialization, eval functions
  - `tapin_cpp.py` — C++ Tap-In matcher (single-file `load_inline`)
  - `_runner.py` — end-to-end orchestrator: train → monkey-patch MLP → install V6 → TTT eval
  - `cutlass_evt_fusion/` — fused MLP backward kernel from #1420
<details>
<summary><b>Reproduce — end-to-end on a fresh 8×H100 box</b></summary>

### 0. Hardware

- 8×H100 80GB SXM (Hopper, sm_90a). The CUTLASS EVT kernel and FA3 require Hopper.
- ECC OFF gives consistent results (the historical baseline ran ECC OFF).

### 1. Python + PyTorch + FA3

```bash
# Python 3.10 or 3.12
python3 -m venv venv && source venv/bin/activate

# PyTorch 2.9.1+cu128 (NOT 2.11 — see "PyTorch version" note in PR #1420)
pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128

# Flash Attention 3 prebuilt wheel (do NOT compile from source)
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291

# Other deps
pip install sentencepiece brotli numpy ninja
```

### 2. CUTLASS headers (one-time, system-wide)

```bash
sudo git clone --depth=1 https://github.com/NVIDIA/cutlass /opt/cutlass
```

### 3. Download the SP8192 dataset

The dataset and tokenizer are pre-built on HuggingFace under the parameter-golf data repo. Place them so the structure is:

```
~/data/
  datasets/fineweb10B_sp8192/
    fineweb_train_*.bin (128 shards)
    fineweb_val_*.bin (1 shard)
  tokenizers/
    fineweb_8192_bpe.model
```

Then `export DATA_DIR=~/data/`.

### 4. Run (train + V6 + TTT eval, end-to-end)

```bash
SEED=1234 NUM_LOOPS=2 LOOP_START=3 LOOP_END=5 ENABLE_LOOPING_AT=0.35 \
PARALLEL_RESIDUAL_START=7 HESSIAN_CLIP_LAMBDA=0 \
DATA_DIR=$DATA_DIR \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

`train_gpt.py` is self-contained — it decompresses the code, builds the CUTLASS kernel, trains the model (~10 min), then automatically runs V6 + TTT eval (~7 min). No separate eval step needed.

### 5. Expected output

For `SEED=1234`:

```
=== V6 + TTT ===
  TTT: lr=0.005 epochs=3 chunk=32768 freeze=0
  val_loss: 2.784796 val_bpb: 1.078086 time: 389.8s
```

The headline number is **`val_bpb: 1.078086`**. To reproduce the 3-seed mean of **1.078825**, also run with `SEED=42` and `SEED=2025` and average.
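As a consistency check on the reported pair of numbers (a sketch; the bytes-per-token figure is derived here from the log line, not read from the shipped code):

```python
import math

# The harness reports val_loss in nats/token and val_bpb in bits/byte.
# Their ratio implies the val set's average bytes per token.
val_loss, val_bpb = 2.784796, 1.078086
bits_per_token = val_loss / math.log(2)
bytes_per_token = bits_per_token / val_bpb
print(round(bytes_per_token, 2))   # ~3.73 bytes/token
```

The same factor reconciles the headline delta: 0.00218 BPB × ~3.73 bytes/token × ln 2 ≈ 0.0056 nats/token, matching the Results section.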

### Troubleshooting

| Symptom | Fix |
|---|---|
| `val_bpb` ≈ 1.16 instead of 1.08 | `torch.compile` was stripped — verify `eval_val_sliding_ttt` has `logits_fn = torch.compile(model.forward_logits, dynamic=False, fullgraph=True)` |
| `val_bpb` ≈ 1.085 instead of 1.078 (no V6 effect) | `TAPIN_CPP=1 TAPIN_V4_ENABLED=1 TAPIN_V6_CROSS=1` env vars are set automatically by the stub; check `human_readable/_runner.py` if running manually |
| Training BPB is 0.001 worse than expected | Check `HESSIAN_CLIP_LAMBDA=0` is set |
| `RuntimeError: Ninja is required` | `pip install ninja` |
| `RuntimeError: operator cutlass_evt::gemm_mul does not exist` | CUTLASS headers not found at `/opt/cutlass` (step 2) |
| `Inference tensors cannot be saved for backward` (during TTT) | The TTT SCORE phase must use `torch.no_grad()`, NOT `torch.inference_mode()` (this is correct in the shipped code) |
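The last row reflects general PyTorch semantics that can be reproduced in isolation (a standalone sketch, independent of the shipped code):

```python
import torch

x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    w_ok = torch.randn(3, 3)     # plain tensor: safe to reuse under autograd later
with torch.inference_mode():
    w_bad = torch.randn(3, 3)    # inference tensor: unusable in later autograd

(x @ w_ok).sum().backward()      # works: no_grad tensors are ordinary tensors

raised = False
try:
    (x @ w_bad).sum().backward() # autograd must save w_bad -> RuntimeError
except RuntimeError:
    raised = True
assert raised
```

This is why the SCORE phase wraps loss accumulation in `torch.no_grad()`: the same weights must still participate in autograd when `optimizer.step()` later runs.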
</details>