Commit bbe6602 ("update", parent 75700cb)

16 files changed: +3459 −0 lines changed
# Wider Loop + Per-Pass Embeddings + Tap-In V6 + Legal TTT
## Results
3-seed mean **+V6+TTT**: **1.078825** sliding-window BPB. All seeds sub-1.080. Beats merged SOTA (PR #1493, 1.0810) by 0.00218 BPB = 0.00562 nats (Welch t=5.52, df=2.4, p<0.01).
| Seed | Steps | Pre-quant BPB | Quant BPB | Raw SW BPB | + V6 + TTT | Artifact bytes |
|---|---|---|---|---|---|---|
| 1234 | 4726 | 1.085931 | 1.097365 | 1.080604 | **1.078086** | **15,974,583** |
| 42 | 4708 | 1.086812 | 1.098170 | 1.081516 | **1.079063** | 15,979,306 |
| 2025 | 4718 | 1.087013 | 1.098452 | 1.081802 | **1.079326** | 15,978,483 |
| **Mean** | 4717 | 1.086585 | 1.097996 | 1.081307 | **1.078825** | 15,977,457 |
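The headline arithmetic can be cross-checked from the numbers in this README alone. A small sanity sketch (the bytes-per-token ratio is inferred from the seed-1234 log pair `val_loss=2.784796` / `val_bpb=1.078086`; it is not stated explicitly anywhere in the source):

```python
import math

# Numbers copied from the table and logs above.
seeds = [1.078086, 1.079063, 1.079326]
mean_bpb = sum(seeds) / len(seeds)          # 3-seed mean

sota_bpb = 1.0810                           # merged SOTA (PR #1493)
delta_bpb = sota_bpb - mean_bpb             # improvement in bits/byte

# val_loss is nats/token and val_bpb is bits/byte, so bytes/token falls out:
bytes_per_token = 2.784796 / math.log(2) / 1.078086
# ... which converts the BPB delta into nats per token:
delta_nats_per_token = delta_bpb * bytes_per_token * math.log(2)

print(round(mean_bpb, 6), round(delta_bpb, 5), round(delta_nats_per_token, 5))
```

This reproduces the 1.078825 mean and shows the "0.00562 nats" figure is the per-token delta implied by the ~3.73 bytes/token ratio of the val set.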
### Budget (recommended seed 1234)
| Component | Bytes |
|---|---|
| Produced model (.int6.ptz) | 15,974,583 |
| train_gpt.py (LZMA stub) | 25,346 |
| **Total** | **15,999,929** |
| **Headroom under 16 MB** | **+71** |
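The budget sums can be verified directly; note the +71 headroom only works out if "16 MB" means the decimal 16,000,000 bytes (an inference from the table, not a statement in the source):

```python
# Budget arithmetic from the table above (seed 1234).
model_bytes = 15_974_583   # produced model (.int6.ptz)
stub_bytes = 25_346        # train_gpt.py LZMA stub
total = model_bytes + stub_bytes
headroom = 16_000_000 - total   # decimal 16 MB cap, inferred
print(total, headroom)
```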
## Key Techniques
Builds on our previous PR ([#1420 — Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt](https://github.com/openai/parameter-golf/pull/1420)) by:
1. **Wider depth recurrence**: `LOOP_START=3` `LOOP_END=5` `NUM_LOOPS=2` (3 passes through 3 loop blocks instead of 4 passes through 2). 9 loop block executions.
2. **Per-pass loop embeddings**: 3 zero-init learned vectors (`nn.Embedding(3, 512)`), one fired at the start of each pass.
3. **Tap-In V6 cross-window n-gram + cross-window lost-len rule** at eval time (C++ matcher, ~135s on 8×H100).
*Why "Tap-In"?* In golf, the tap-in is the tiny final stroke that rolls the ball the last inch into the hole after the big drive has done all the work. The model does the big swing; Tap-In is just the small eval-time nudge that finishes the putt.
*Intuitively*: Tap-In is a document-local scribe. As the model predicts each token, the scribe scans backward through the same document for the exact phrase the model just generated and whispers what came after it last time. If the model's already considering that token, the scribe nudges its probability up a tiny bit; if the phrase fell out of the model's 2048-token attention window (think: a proper name introduced 3000 tokens ago), the scribe is the only one who can recover it. Wrong whispers cost almost nothing because the nudge is small; right whispers — especially for forgotten long-range repetitions — cut multiple nats off the loss at that single position. It fires hundreds of thousands of times across the eval; each individual win is small but they stack into a clean -0.001 BPB on top of the model.
4. **Legal Score-First TTT** (PR #1413 recipe: `TTT_LR=0.005` `TTT_FREEZE_BLOCKS=0`) stacked on top of V6 in the SCORE phase.
5. **`HESSIAN_CLIP_LAMBDA=0`**: the #1420 code default of 0.175 was a known-failed feature accidentally left as the default; pinning it to 0 improves BPB by 0.0006 and shrinks the model by ~40 KB.
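As a concrete, entirely hypothetical sketch of items 1 and 2: the only facts taken from this PR are the hyperparameter values and the zero-init `nn.Embedding(3, 512)`. Block internals, class names, and the inclusive treatment of `LOOP_END` (needed to make the loop region span 3 blocks and give the claimed 9 executions) are assumptions:

```python
import torch
import torch.nn as nn

class CountBlock(nn.Module):
    """Stand-in transformer block that just counts how often it runs."""
    def __init__(self):
        super().__init__()
        self.calls = 0
    def forward(self, x):
        self.calls += 1
        return x

class LoopedStack(nn.Module):
    """Depth recurrence: blocks [loop_start, loop_end] run num_passes times,
    with a zero-init learned embedding added at the start of each pass."""
    def __init__(self, blocks, loop_start=3, loop_end=5, num_passes=3, dim=512):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.loop_start, self.loop_end = loop_start, loop_end
        self.num_passes = num_passes
        self.pass_emb = nn.Embedding(num_passes, dim)  # one vector per pass
        nn.init.zeros_(self.pass_emb.weight)           # zero-init, as in the PR

    def forward(self, x):
        for blk in self.blocks[: self.loop_start]:
            x = blk(x)
        for p in range(self.num_passes):
            x = x + self.pass_emb.weight[p]            # fire pass-p embedding
            for blk in self.blocks[self.loop_start : self.loop_end + 1]:
                x = blk(x)                             # 3 passes x 3 blocks = 9
        for blk in self.blocks[self.loop_end + 1 :]:
            x = blk(x)
        return x

blocks = [CountBlock() for _ in range(8)]
stack = LoopedStack(blocks)
out = stack(torch.zeros(1, 512))
loop_calls = sum(b.calls for b in blocks[3:6])
print(loop_calls)   # 9 loop-block executions
```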
## What gets evaluated
The competition harness runs `torchrun --nproc_per_node=8 train_gpt.py`. This single file is the entire scored submission — it decompresses, trains, quantizes, and evaluates end-to-end. Everything else in this folder is for human review and reproducibility.
## Methodology — single pass, no double evaluation
**The headline number is from a single causal left-to-right pass through the val set** with Tap-In V6 + Legal TTT applied during scoring. There is no double pass, no second-pass rescoring, no information leak between runs.
The training and eval logs in this folder are intentionally split so each component's contribution can be attributed independently:
| File | What it is | Passes through val |
|---|---|---|
| `train_seed{42,1234,2025}.log` | Standard training run; the end-of-training eval inside the training script reports the **Raw SW BPB** column (no rule, no TTT). | 1 |
| `eval_v6_s42.log` | Re-loads the saved `.int6.ptz` and runs **Baseline + V6** (ablation: V6's isolated contribution). | 1 |
| `eval_v6_ttt_s{42,1234,2025}.log` | Re-loads the saved `.int6.ptz` and runs **Baseline + V6 + Legal TTT** — these are the headline numbers per seed. | 1 each |
Each eval is a fresh load of the same saved int6 model — no state carried between runs, no information leak from any earlier run into a later one. The leaderboard-scored number is the **+V6+TTT** column of the per-seed table above, produced by a single pass.
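The "score before step" ordering that makes this a single legal pass can be reduced to a few lines. This is a toy sketch (function names and the placeholder chunks are hypothetical; only the 32,768-token chunk size and the ordering guarantee come from the text):

```python
CHUNK = 32_768   # tokens per TTT chunk, from the README

def ttt_eval(chunks, score_fn, train_fn):
    """Score-first TTT: each chunk is scored BEFORE the optimizer sees it,
    so every scored target is predicted from a strict prefix only."""
    events, loss_sum = [], 0.0
    for ci, chunk in enumerate(chunks):
        loss_sum += score_fn(chunk)   # under torch.no_grad() in the real code
        events.append(("score", ci))
        train_fn(chunk)               # adapt only on tokens already scored
        events.append(("train", ci))
    return loss_sum, events

loss, events = ttt_eval(chunks=[0, 1, 2],
                        score_fn=lambda c: 1.0,
                        train_fn=lambda c: None)
print(events)   # every ("score", ci) precedes ("train", ci)
```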
## Legality
This submission is legal because every gain comes from the strict prefix and nothing else:

- **Score-first TTT** (`human_readable/train_gpt.py:1339-1524`) accumulates `loss_sum` under `torch.no_grad()` BEFORE `optimizer.step()` ever runs, and the chunk math is airtight: chunk `ci`'s training targets max out at global position `(ci+1)·32768`, while chunk `ci+1`'s scored targets start at `≥(ci+1)·32768+1`. Strict inequality: no token is ever predicted by a model that has already been trained on it.
- **Tap-In V6 C++ matcher** (`human_readable/tapin_cpp.py`) is byte-identical to the previously audited reference. Within-window matches require `p+1 < t`, so `cont = ids[p+1]` is strict prefix; cross-window's `lost_len_at_t` upper bound resolves to `(ws+t)-window_size+1 < ws+t+1`; the linked-list `head/tail/fwd[tok]` update happens at `:240-249`, AFTER the score block at `:126-238`; and there is zero `is_bnd_[tok]`/`has_ls_[tok]` target-dependent gating anywhere. The Category 15 attack surface is structurally absent, not merely disabled.
- **Probability mixing**: `p_new(k)=(1-w)·p(k)` for `k≠tok` and `p_new(tok)=(1-w)·p(tok)+w` sums to exactly 1 by construction, so `F.cross_entropy` consumes a properly normalized distribution.
- **Single pass**: eval is one left-to-right sliding pass with non-overlapping 64-token scored ranges, so no position is ever rescored.
- **No val exposure**: GPTQ Hessians are collected from `train_loader.next_batch()` at `human_readable/train_gpt.py:944`, with zero val-data exposure during training.
- **Eval-time only**: the model is deserialized from `.int6.ptz` BEFORE TTT touches anything, so this is unambiguously eval-time adaptation, not pre-quant TTT.
- **Plausibility**: the 1.0788 BPB sits comfortably above the Shannon floor, in the range achievable by legitimate methods.

Every one of the four conditions in Issue #1017 is satisfied not by careful gating but by the structure of the code itself: there is no configuration of env vars or hyperparameters under which this submission could become illegal, because the illegal paths simply do not exist in the source.
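The mixing rule quoted above can be written out directly. This is an illustrative Python rendering, not the C++ matcher; `mix_w=0.02` is taken from the eval log in this folder, and the vocab size of 8192 matches the SP8192 tokenizer:

```python
import torch

def tapin_mix(probs: torch.Tensor, tok: int, w: float = 0.02) -> torch.Tensor:
    """p_new(k) = (1-w)*p(k) for k != tok; p_new(tok) = (1-w)*p(tok) + w."""
    out = probs * (1.0 - w)   # scale the whole distribution down by (1-w) ...
    out[tok] = out[tok] + w   # ... and put the freed mass w on the matched token
    return out

p = torch.softmax(torch.randn(8192), dim=0)
q = tapin_mix(p, tok=123)
# q sums to exactly 1 by construction, so -log q[target] is a valid
# cross-entropy term for a properly normalized distribution.
print(float(q.sum()))
```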
## Files
- `train_gpt.py` — **the scored artifact**. Self-contained LZMA stub that decompresses, builds the CUTLASS kernel, trains the model, then runs V6 + TTT eval. Contains minified versions of all source files below.
- `human_readable/` — the unminified source code for review:
- `train_gpt.py` — model, training loop, GPTQ, serialization, eval functions
- `tapin_cpp.py` — C++ Tap-In matcher (single-file `load_inline`)
- `_runner.py` — end-to-end orchestrator: train → monkey-patch MLP → install V6 → TTT eval
- `cutlass_evt_fusion/` — fused MLP backward kernel from #1420
<details>
<summary><b>Reproduce — end-to-end on a fresh 8×H100 box</b></summary>
### 0. Hardware
- 8×H100 80GB SXM (Hopper, sm_90a). The CUTLASS EVT kernel and FA3 require Hopper.
- ECC OFF gives consistent results (the historical baseline ran ECC OFF).
### 1. Python + PyTorch + FA3
```bash
# Python 3.10 or 3.12
python3 -m venv venv && source venv/bin/activate
# PyTorch 2.9.1+cu128 (NOT 2.11 — see "PyTorch version" note in PR #1420)
pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128
# Flash Attention 3 prebuilt wheel (do NOT compile from source)
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
# Other deps
pip install sentencepiece brotli numpy ninja
```
### 2. CUTLASS headers (one-time, system-wide)
```bash
sudo git clone --depth=1 https://github.com/NVIDIA/cutlass /opt/cutlass
```
### 3. Download the SP8192 dataset
The dataset and tokenizer are pre-built on HuggingFace under the parameter-golf data repo. Place them so the structure is:
```
~/data/
datasets/fineweb10B_sp8192/
fineweb_train_*.bin (128 shards)
fineweb_val_*.bin (1 shard)
tokenizers/
fineweb_8192_bpe.model
```
Then `export DATA_DIR=~/data/`.
### 4. Run (train + V6 + TTT eval, end-to-end)
```bash
SEED=1234 NUM_LOOPS=2 LOOP_START=3 LOOP_END=5 ENABLE_LOOPING_AT=0.35 \
PARALLEL_RESIDUAL_START=7 HESSIAN_CLIP_LAMBDA=0 \
DATA_DIR=$DATA_DIR \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
`train_gpt.py` is self-contained — it decompresses the code, builds the CUTLASS kernel, trains the model (~10 min), then automatically runs V6 + TTT eval (~7 min). No separate eval step needed.
### 5. Expected output
For `SEED=1234`:
```
=== V6 + TTT ===
TTT: lr=0.005 epochs=3 chunk=32768 freeze=0
val_loss: 2.784796 val_bpb: 1.078086 time: 389.8s
```
The headline number is **`val_bpb: 1.078086`**. To reproduce the 3-seed mean of **1.078825** run with `SEED=42` and `SEED=2025` and average.
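One internal consistency check worth running on these logs: `val_loss` (nats per token) and `val_bpb` (bits per byte) should be linked by a single fixed bytes-per-token ratio for the val set. A sketch using pairs copied from the logs in this folder (the ~3.727 bytes/token figure is an inference, not something stated in the source):

```python
import math

# (val_loss in nats/token, val_bpb in bits/byte) pairs from the logs:
pairs = [
    (2.784796, 1.078086),  # seed 1234, +V6+TTT
    (2.787999, 1.079326),  # seed 2025, +V6+TTT
    (2.793668, 1.081516),  # seed 42, raw sliding-window baseline
]
# bytes/token implied by each pair; a consistent val set gives one value
ratios = [loss / math.log(2) / bpb for loss, bpb in pairs]
print([round(r, 4) for r in ratios])
```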
### Troubleshooting
| Symptom | Fix |
|---|---|
| `val_bpb` ≈ 1.16 instead of 1.08 | `torch.compile` was stripped — verify `eval_val_sliding_ttt` has `logits_fn = torch.compile(model.forward_logits, dynamic=False, fullgraph=True)` |
| `val_bpb` ≈ 1.085 instead of 1.078 (no V6 effect) | `TAPIN_CPP=1 TAPIN_V4_ENABLED=1 TAPIN_V6_CROSS=1` env vars are set automatically by the stub; check `human_readable/_runner.py` if running manually |
| Training BPB is 0.001 worse than expected | Check `HESSIAN_CLIP_LAMBDA=0` is set |
| `RuntimeError: Ninja is required` | `pip install ninja` |
| `RuntimeError: operator cutlass_evt::gemm_mul does not exist` | CUTLASS headers not found at `/opt/cutlass` (step 2) |
| `Inference tensors cannot be saved for backward` (during TTT) | The TTT SCORE phase must use `torch.no_grad()`, NOT `torch.inference_mode()` (this is correct in the shipped code) |
</details>
---
W0410 00:23:52.565000 257236 torch/distributed/run.py:803]
W0410 00:23:52.565000 257236 torch/distributed/run.py:803] *****************************************
W0410 00:23:52.565000 257236 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0410 00:23:52.565000 257236 torch/distributed/run.py:803] *****************************************
Using real FA3
Using real FA3
Using real FA3
Using real FA3
Using real FA3
Using real FA3
Using real FA3
Using real FA3
/home/ubuntu/venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
warnings.warn( # warn only once
[rank0]:[W410 00:24:18.139368805 ProcessGroupNCCL.cpp:5072] Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
TAPIN_CPP=1: using C++ fast matcher for apply_tapin_rule_v5
Val tokens: 40,540,160
Model loaded: 35,946,072 params (int6 dequantized)
=== BASELINE (no Tap-In) ===
val_loss: 2.793668 val_bpb: 1.081516 time: 108.6s
=== WITH TAP-IN V5/V6 (probability mixing) ===
ent_min=0.0 top_k=1000 mix_w=0.02 min_match=3 max_match=100 bayes=False
val_loss: 2.790909 val_bpb: 1.080448 time: 134.6s
V5 STATS: {'fires': 642474, 'cross_fires': 69167, 'within_fires': 573307, 'cal_total_total': [0, 0, 0, 0, 0, 0], 'cal_hits_total': [0, 0, 0, 0, 0, 0], 'alpha_dump': []}
=== RESULT ===
Baseline BPB: 1.081516
Tap-In BPB: 1.080448
Delta: -0.001068
Time: baseline 109s, tap-in 135s
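The `fires` counters in the V5 STATS line above come from the matcher. A pure-Python caricature of the "scribe" described in the README (an illustrative simplification with hypothetical names; the real C++ code indexes differently and also handles cross-window state):

```python
def last_continuation(ids, t, min_match=3):
    """Find the most recent earlier occurrence of the min_match-token
    suffix ending at position t, and return the token that followed it.
    Only positions strictly before t are ever read: strict prefix."""
    if t < min_match:
        return None
    key = tuple(ids[t - min_match : t])
    # scan backward so the MOST RECENT earlier occurrence wins
    for p in range(t - min_match - 1, -1, -1):
        if tuple(ids[p : p + min_match]) == key:
            return ids[p + min_match]   # p + min_match <= t - 1 < t
    return None

# "alpha beta gamma X ... alpha beta gamma ?" -> the scribe whispers X
doc = [10, 11, 12, 99, 7, 10, 11, 12]
print(last_continuation(doc, t=len(doc)))   # 99
```

In the real pipeline the returned continuation token gets its probability nudged up by the mixing weight rather than being predicted outright.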
---
===== seed s1234 =====
ttt_sliding:done val_loss=2.784796 val_bpb=1.078086 elapsed=389.5s
val_loss: 2.784796 val_bpb: 1.078086 time: 389.8s
=== SUMMARY (s1234) ===
baseline (raw SW): 1.080604
+ V6 + TTT (now): 1.078086 (Δ -0.002518)
---
===== seed s2025 =====
ttt_sliding:done val_loss=2.787999 val_bpb=1.079326 elapsed=385.7s
val_loss: 2.787999 val_bpb: 1.079326 time: 386.1s
=== SUMMARY (s2025) ===
baseline (raw SW): 1.081802
+ V6 + TTT (now): 1.079326 (Δ -0.002476)
