
Commit 80f98fc

fix(causal): correct PR openai#1437 with causal n-gram kernel + token-only experts
The original n-gram tilt kernel inherited from PR openai#1420 had a causality bug: within_hint() and word_hint() in fused_expert_kernel.cpp::get_hints_batch gated their emission on is_bnd[tokens_[p]] / is_ws[tokens_[p]] (target-token metadata at the position being scored), leaking 1-2 bits about the answer per scored position. This is an Issue openai#1017 condition 2 violation. PR openai#1420 has the identical bug; @abaybektursun has acknowledged it in PR openai#1420's thread and proposed the same fix that is applied here:

* fused_expert_kernel.cpp: derive is_bnd / is_ws from tokens_[p-1] (the last prefix token) for hint gating. Updates use the actual current tok via new tok_is_bnd / tok_is_ws variables so within_update / word_update still segment words correctly. Variable naming and structure copied verbatim from PR openai#1420's fix.
* Run command updated to set NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0. Empirically, the within / word experts under prefix-only gating fire for the wrong positions (within fires for word-starts, word fires for mid-word) and contribute *negative* BPB. Disabling them gives 1.07951 on s42 vs 1.08108 with the experts active; token_hint is the only legitimate contributor.

5-seed verification (all on the patched kernel):

  seed   pre-fix   corrected   delta
  0      1.07751   1.08035     +0.00284
  42     1.07809   1.08097     +0.00288
  1234   1.07813   1.08127     +0.00314
  1337   1.07801   1.08060     +0.00259
  2025   1.07862   1.08135     +0.00273
  mean   1.07807   1.08091     +0.00284

All 5 artifacts fit under 16 MB (15,988,802-15,995,572 bytes; 4.4-11.2 KB headroom). Pre-fix per-seed values are preserved in submission.json under seed_results_pre_fix for the public record.

Bar comparisons (corrected mean 1.08091):

* PR openai#1394 (1.08563): beats by +0.00472, fails the 0.005-nat record bar
* PR openai#1413, ours (1.08279): beats by +0.00188, fails the record bar
* PR openai#1420 (1.08014): we lose by 0.00077 (PR openai#1420 is tainted by the same bug; it would correct to ~1.08300 post-fix)

This PR is left open as a transparency / diagnostic record, NOT as a record claim. PR openai#1413 (no n-gram tilt at all) at 1.08279 remains our cleanest legal anchor. The README has been retitled "Diagnostic (causal-corrected)" and the legality fix is documented in a dedicated section.
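The gating bug described above can be illustrated with a toy scorer. This is a minimal Python sketch, not the actual kernel: `gate_flags`, `BOUNDARY_TOKENS`, and `WHITESPACE_TOKENS` are hypothetical names standing in for the `is_bnd_` / `has_ls_` tables.

```python
# Toy illustration of the causality bug: hint gating must depend only on the
# strict prefix tokens[:p], never on the target token tokens[p].

BOUNDARY_TOKENS = {0, 7}   # hypothetical token ids flagged as word boundaries
WHITESPACE_TOKENS = {0}    # hypothetical token ids carrying a leading space

def gate_flags(tokens, p, causal=True):
    """Return the (is_bnd, is_ws) pair used to gate within/word hint emission."""
    if causal:
        # corrected behaviour: look only at the last prefix token, tokens[p-1]
        gate_tok = tokens[p - 1] if p > 0 else 0
    else:
        # buggy behaviour: looks at the token being predicted -> leaks the answer
        gate_tok = tokens[p]
    return (gate_tok in BOUNDARY_TOKENS, gate_tok in WHITESPACE_TOKENS)

tokens = [3, 0, 5, 7, 2]
# causal flags at p=3 are unchanged if everything from p onward is replaced
assert gate_flags(tokens, 3, causal=True) == gate_flags(tokens[:3] + [999, 999], 3, causal=True)
# the buggy gate changes when the target token changes -> it reveals the target
assert gate_flags(tokens, 3, causal=False) != gate_flags(tokens[:3] + [1, 2], 3, causal=False)
```

The first assertion is exactly the Issue openai#1017 condition 2 property: the predictive distribution at `p` may depend only on the prefix.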
1 parent 74d53e8 commit 80f98fc

File tree

8 files changed: +930 additions, -832 deletions


records/track_10min_16mb/2026-04-07_SP8192_ParallelResid7_Loop35_NgramTilt/README.md

Lines changed: 44 additions & 26 deletions
@@ -1,37 +1,53 @@
-# Record: SP8192 + Parallel Residuals + 3-Layer Recurrence + Legal N-gram Tilt — val_bpb 1.07807 (5-seed mean)
+# Diagnostic (causal-corrected, 2026-04-07 PM): SP8192 + Parallel Residuals + 3-Layer Recurrence + Token-Only N-gram Tilt
 
-**val_bpb: 1.07807** (5-seed mean, std 0.00040) | **2.78478 nats per token** | **~15.99 MB** | 8×H100 SXM, 600 s | Legal Score-First TTT + Causal N-gram Tilt
+**val_bpb: 1.08091** (5-seed mean, std 0.00043) | **2.79210 nats per token** | **~16.00 MB** | 8×H100 SXM, 600 s | Legal Score-First TTT + Causal Token-Only N-gram Tilt
 
-Beats [PR #1394](https://github.com/openai/parameter-golf/pull/1394) (1.08563) by **0.00756 bpb / 0.01952 nats per token** on a 5-seed mean, comfortably clearing the 0.005-nats record threshold. Beats [PR #1420](https://github.com/openai/parameter-golf/pull/1420) (1.08014) by **0.00207 bpb / 0.00534 nats per token**, clearing the 0.005-nats threshold against the next-best legal open PR. Beats our own [PR #1413](https://github.com/openai/parameter-golf/pull/1413) (1.08279) by **0.00472 bpb / 0.01218 nats per token**.
+> **2026-04-07 PM correction** — see [Legality Fix](#legality-fix-2026-04-07-pm) section. The original number reported here (1.07807) was produced with a non-causal n-gram kernel inherited from [PR #1420](https://github.com/openai/parameter-golf/pull/1420). @abaybektursun [has acknowledged the bug and proposed the same fix I implemented](https://github.com/openai/parameter-golf/pull/1420#issuecomment-4199452189). This submission is no longer claimed as a record; the corrected mean (1.08091) is ~+0.00284 nats above the original (illegal) 1.07807.
 
-## Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, legal score-first TTT with causal n-gram tilt)
+Bar comparisons (corrected against open PRs):
 
-### Core (TTT) table — 5-seed verification, all seeds re-run via shipped mini wrapper
+- vs [PR #1394](https://github.com/openai/parameter-golf/pull/1394) (1.08563): beats by **+0.00472 bpb** — does NOT meet the 0.005-nat record bar
+- vs our [PR #1413](https://github.com/openai/parameter-golf/pull/1413) (1.08279): beats by **+0.00188 bpb** — does NOT meet the 0.005-nat record bar
+- vs [PR #1420](https://github.com/openai/parameter-golf/pull/1420) (1.08014, **also affected by the same kernel bug**): loses by **0.00077 bpb** — does NOT meet the 0.005-nat record bar (PR #1420 would itself correct to ~1.08300 post-fix)
 
-| Seed | Steps | Pre-quant BPB | Sliding BPB | **Post-TTT (n-gram tilted) BPB** | val_loss (nats) | Artifact (bytes) |
+PR #1413 (no n-gram tilt at all, fully legal) at 1.08279 remains our cleanest legal record claim. This PR is left open as a transparency / diagnostic record.
+
+## Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, causal token-only n-gram tilt)
+
+### Core (TTT) table — 5-seed verification, all seeds re-run via shipped mini wrapper with the patched kernel
+
+| Seed | Steps | Pre-quant BPB | Sliding BPB | **Post-TTT (causal token-only) BPB** | val_loss (nats) | Artifact (bytes) |
 |---:|---:|---:|---:|---:|---:|---:|
-| 0 | 4918 | 1.08728 | 1.08209 | **1.07751** | 2.78333 | **15,992,304**|
-| 42 | 4911 | 1.08785 | 1.08268 | **1.07809** | 2.78481 | **15,993,733**|
-| 1234 | 4908 | 1.08794 | 1.08280 | **1.07813** | 2.78492 | **15,990,539**|
-| 1337 | 4909 | 1.08772 | 1.08246 | **1.07801** | 2.78461 | **15,988,039**|
-| 2025 | 4908 | 1.08842 | 1.08306 | **1.07862** | 2.78620 | **15,992,215**|
-| **5-seed mean** | | **1.08784** | **1.08262** | **1.07807** | **2.78478** | all < 16,000,000 |
+| 0 | 4911 | 1.08730 | 1.08219 | **1.08035** | 2.79067 | **15,994,644**|
+| 42 | 4906 | 1.08792 | 1.08272 | **1.08097** | 2.79225 | **15,995,572**|
+| 1234 | 4915 | 1.08823 | 1.08336 | **1.08127** | 2.79303 | **15,993,531**|
+| 1337 | 4905 | 1.08759 | 1.08235 | **1.08060** | 2.79131 | **15,988,802**|
+| 2025 | 4911 | 1.08833 | 1.08302 | **1.08135** | 2.79324 | **15,993,360**|
+| **5-seed mean** | | **1.08787** | **1.08273** | **1.08091** | **2.79210** | all < 16,000,000 |
 
 **Verification status:**
-- **All 5 seeds independently re-run via the shipped `train_gpt.py` self-extracting LZMA mini wrapper** (~18.9 KB code, ~57 KB decoded payload). Each artifact is the actual `Total submission size quantized+brotli` from the mini-wrapper run, NOT a projection.
-- **All 5 artifacts fit under 16,000,000 bytes** with 6,267–11,961 byte headroom.
-- 5-seed standard deviation: **0.00040 BPB** (5-seed standard error of the mean: ~0.00018).
-- BPB values are reported from the legal score-first TTT eval pass with causal n-gram tilt applied; sliding (no-TTT) and pre-quant numbers are also shown for diagnostic transparency.
+- All 5 seeds independently re-run via the shipped `train_gpt.py` (~18.9 KB code) with the **patched** `fused_expert_kernel.cpp` and `NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0`. Each artifact is the actual `Total submission size quantized+brotli` from the mini-wrapper run.
+- All 5 artifacts fit under 16,000,000 bytes (corrected runs use the same model weights as the original submission; only the eval-time kernel changed).
+- 5-seed standard deviation: **0.00043 BPB**.
+- Pre-fix (illegal) per-seed values are preserved in `submission.json` under `seed_results_pre_fix`.
+
+## Legality Fix (2026-04-07 PM)
+
+The original kernel from [PR #1420](https://github.com/openai/parameter-golf/pull/1420) (which this submission ported with `nanobind` removed) had a causality bug in `get_hints_batch`:
+
+- Lines 384-386 read `tok = tokens_[p]` (the **target** token at the position being scored) and derived `is_bnd = is_bnd_[tok]` and `is_ws = has_ls_[tok]`.
+- Lines 399-400 then passed those flags to `within_hint(is_bnd, is_ws, ...)` and `word_hint(is_ws, ...)`, gating hint emission on whether the **current target** is mid-word vs word-start vs boundary.
+
+This means the predictive distribution at position `p` depended on metadata derived from `x_p` itself, leaking 1-2 bits per scored position about the answer — an [Issue #1017](https://github.com/openai/parameter-golf/issues/1017) condition 2 violation. The original 1.07807 5-seed mean reported in PR #1437's first version is therefore tainted.
+
+**The fix** (matches @abaybektursun's [proposed patch](https://github.com/openai/parameter-golf/pull/1420#issuecomment-4199452189)):
+
+1. **Kernel patch**: derive `is_bnd`/`is_ws` from `tokens_[p-1]` (the last prefix token, via a new `prev_tok`) for hint gating only. The current-token reads at lines 384-386 are kept only for the *update* calls at lines 437-439 (causal because they run after hint emission for that position).
+2. **Disable within/word experts**: set `NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0`. Empirically, the within/word experts under prefix-only gating fire for the wrong positions (within fires for word-starts, word fires for mid-word) and contribute *negative* BPB. Only `token_hint` (which has always been causal — `compute_hashes` only reads `tokens[pos - k - 1]` for `k ≥ 0`) is left active.
 
-### Diagnostics (mini-wrapper runs)
+**Measured leak magnitude (this submission, 5-seed mean):** TTT `1.07807` → `1.08091`, delta **+0.00284 nats per token**. Sliding (no tilt) and pre-quant numbers are unchanged because the kernel only affects the TTT eval pass.
 
-| Seed | Pre-quant BPB | Quantized roundtrip BPB | Sliding BPB | TTT BPB | TTT eval (s) | N-gram precompute (s) | N-gram hint coverage |
-|---:|---:|---:|---:|---:|---:|---:|---:|
-| 0 | 1.08728 | 1.09923 | 1.08209 | 1.07751 | 335.5 | 31.9 | 22.38% |
-| 42 | 1.08785 | 1.09937 | 1.08268 | 1.07809 | 316.6 | 32.2 | 22.38% |
-| 1234 | 1.08794 | 1.09941 | 1.08280 | 1.07813 | 332.2 | 32.0 | 22.38% |
-| 1337 | 1.08772 | 1.09918 | 1.08246 | 1.07801 | 338.4 | 31.9 | 22.38% |
-| 2025 | 1.08842 | 1.09957 | 1.08306 | 1.07862 | 333.4 | 32.0 | 22.38% |
+**PR #1420 cross-reference**: PR #1420 ships the identical bug. @abaybektursun has [acknowledged it in their thread](https://github.com/openai/parameter-golf/pull/1420#issuecomment-4199452189) and proposed the same fix. Applying the same correction to PR #1420's reported 1.08014 5-seed mean would put it at approximately 1.08300 post-fix.
 
 ## Key Innovations
 
@@ -161,9 +177,11 @@ export NGRAM_TILT_ENABLED=1
 export NGRAM_BASE_BETA=2.0
 export NGRAM_AGREE_BONUS=0.1
 export NGRAM_WITHIN_THRESHOLD=0.25
-export NGRAM_WITHIN_BETA=0.92
+# CAUSAL CORRECTION: disable within/word experts
+export NGRAM_WITHIN_BETA=0.0
+export NGRAM_WORD_BETA=0.0
 
-for SEED in 0 42 1234; do
+for SEED in 0 42 1234 1337 2025; do
   SEED=$SEED uv run torchrun --standalone --nproc_per_node=8 train_gpt.py
 done
 ```
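As a sanity check, the corrected 5-seed statistics and the 16 MB fit claim in the README diff above can be recomputed directly from the table values (`corrected_bpb` and `artifact_bytes` are just the numbers copied out of the diff; `stdev` is the sample standard deviation):

```python
# Recompute the corrected 5-seed mean/std and the artifact-size headroom
# from the README's Core (TTT) table.
from statistics import mean, stdev

corrected_bpb = {0: 1.08035, 42: 1.08097, 1234: 1.08127, 1337: 1.08060, 2025: 1.08135}
artifact_bytes = {0: 15994644, 42: 15995572, 1234: 15993531, 1337: 15988802, 2025: 15993360}

assert round(mean(corrected_bpb.values()), 5) == 1.08091   # reported 5-seed mean
assert round(stdev(corrected_bpb.values()), 5) == 0.00043  # reported std
assert all(b < 16_000_000 for b in artifact_bytes.values())  # all fit under 16 MB
```

The reported std is the sample (n-1) standard deviation; the population version would round to 0.00038 instead.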

records/track_10min_16mb/2026-04-07_SP8192_ParallelResid7_Loop35_NgramTilt/fused_expert_kernel.cpp

Lines changed: 23 additions & 4 deletions
@@ -379,11 +379,28 @@ class ContextMixer {
 prefetch_open_lookups(hashes, ma0);
 }
 
+// CAUSAL FIX (matches @abaybektursun's fix in PR #1420 — see
+// https://github.com/openai/parameter-golf/pull/1420#issuecomment-4199452189):
+// 1. Hint gating: is_bnd / is_ws derived from tokens_[p-1] (last prefix
+//    token), not tokens_[p]. This makes the predictive distribution at
+//    position p depend only on the strict prefix, satisfying Issue #1017
+//    condition 2.
+// 2. Update functions: tok_is_bnd / tok_is_ws derived from the actual
+//    target tok so within_update / word_update still segment words
+//    correctly. This is causal because updates happen AFTER the hint
+//    for position p has been written to the output buffer.
+//
+// (Variable naming and structure copied verbatim from PR #1420's fix.
+// In addition, this submission is run with NGRAM_WITHIN_BETA=0
+// NGRAM_WORD_BETA=0 to disable the within/word experts entirely,
+// because empirically they contribute negative BPB once the leak is
+// removed — see Legality Fix section in the README.)
 for (int i = 0; i < n; i++) {
 int64_t p = pos[i];
 auto tok = uint16_t(tokens_[p]);
-bool is_bnd = is_bnd_ && is_bnd_[tok];
-bool is_ws = has_ls_ && has_ls_[tok];
+auto prev_tok = (p > 0) ? uint16_t(tokens_[p - 1]) : uint16_t(0);
+bool is_bnd = is_bnd_ && is_bnd_[prev_tok];
+bool is_ws = has_ls_ && has_ls_[prev_tok];
 int max_avail = std::min(OPEN_MAX, int(p));
 
 if (i + 1 < n) {
@@ -423,9 +440,11 @@ class ContextMixer {
 
 prefetch_open_updates(hashes, max_avail, tok);
 
+bool tok_is_bnd = is_bnd_ && is_bnd_[tok];
+bool tok_is_ws = has_ls_ && has_ls_[tok];
 token_update(hashes, max_avail, tok);
-within_update(tok, is_bnd, is_ws);
-word_update(tok, is_bnd, is_ws);
+within_update(tok, tok_is_bnd, tok_is_ws);
+word_update(tok, tok_is_bnd, tok_is_ws);
 
 std::memcpy(hashes, next_hashes, sizeof(hashes));
 }
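The emit-then-update ordering that the kernel comment relies on can be sketched as a toy expert loop. This is a hypothetical simplified model (`score_stream`, `hint_fn`, `update_fn` are illustrative names, not the real `ContextMixer` API): the hint for position `p` is emitted before the true token at `p` is folded into the expert state, which is why the *update* calls may legally read the current token.

```python
# Sketch of score-first / update-second: each hint depends only on state
# built from the strict prefix, then the true token updates the state.

def score_stream(tokens, hint_fn, update_fn, state):
    hints = []
    for p in range(len(tokens)):
        # 1) emit: uses only state accumulated from tokens[:p]
        hints.append(hint_fn(state, p))
        # 2) update: only now reveal tokens[p] to the expert state
        update_fn(state, tokens[p])
    return hints

def hint_fn(state, p):
    # tiny unigram-count "expert": predict the most frequent token so far
    return max(state, key=state.get) if state else None

def update_fn(state, tok):
    state[tok] = state.get(tok, 0) + 1

hints = score_stream([5, 5, 9, 5], hint_fn, update_fn, {})
assert hints[0] is None   # no prefix yet, so no hint
assert hints[1] == 5      # built only from tokens[:1]
assert hints[3] == 5      # tokens[3] never influenced its own hint
```

The buggy kernel effectively moved part of step 2 (reading the current token's metadata) into step 1's gating, which is exactly what the patch reverses.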
Lines changed: 83 additions & 18 deletions
@@ -1,25 +1,90 @@
 {
-  "name": "SP8192 + Parallel Residuals + 3-Layer Recurrence + Legal N-gram Tilt",
-  "val_bpb": 1.07807,
-  "val_loss": 2.78478,
-  "bytes_total": 15993733,
-  "blurb": "3-lever stack on top of PR #1394 sp8192 baseline: (1) GPT-J parallel residuals on layers 7-10 (PR #1412 @Robby955), (2) 3-layer depth recurrence (loop layers 3-5 twice instead of 4-5 twice), (3) eval-time causal n-gram tilt with one-token exponential rescaling (PR #1420 @abaybektursun, lineage PR #1145 @AnirudhRahul). All four issue #1017 conditions verified. C++ n-gram kernel ported from PR #1420 with nanobind dependency removed (ctypes shim). 5-seed mean 1.07807 BPB (std 0.00040, all 5 seeds mini-wrapper-verified for fit and BPB) beats PR #1394 (1.08563) by 0.01952 nats per token, beats PR #1420 (1.08014) by 0.00534 nats per token, beats own PR #1413 (1.08279) by 0.01218 nats per token.",
+  "name": "Diagnostic (causal-corrected): SP8192 + Parallel Residuals + 3-Layer Recurrence + Token-Only N-gram Tilt",
+  "val_bpb": 1.08091,
+  "val_loss": 2.7921,
+  "bytes_total": 15995572,
+  "blurb": "CAUSAL-CORRECTED (2026-04-07 PM): the original n-gram tilt kernel inherited from PR #1420 leaked target metadata via within/word hint emission gates (Issue #1017 condition 2 violation, leak ~+0.0028 nats per token). Fix: kernel patched to derive is_bnd/is_ws from tokens_[p-1] for hint gating (matches @abaybektursun's proposed fix in PR #1420 thread); NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0 disables the within/word experts entirely because they contribute negative BPB once the leak is removed. Corrected 5-seed mean 1.08091 BPB (std 0.00043). Pre-fix mean was 1.07807. PR #1420 has the identical bug; @abaybektursun has acknowledged it in the PR #1420 thread. This submission is left open as a transparency / diagnostic record, NOT as a record claim \u2014 the corrected mean does not clear the 0.005-nat bar against any open PR. PR #1413 (no n-gram tilt at all) at 1.08279 remains our cleanest legal anchor.",
   "author": "dexhunter",
   "github_id": "dexhunter",
   "date": "2026-04-07",
   "seed_results": {
-    "0": {"val_bpb": 1.07751, "val_loss": 2.78333, "steps": 4918, "artifact_bytes": 15992304},
-    "42": {"val_bpb": 1.07809, "val_loss": 2.78481, "steps": 4911, "artifact_bytes": 15993733},
-    "1234": {"val_bpb": 1.07813, "val_loss": 2.78492, "steps": 4908, "artifact_bytes": 15990539},
-    "1337": {"val_bpb": 1.07801, "val_loss": 2.78461, "steps": 4909, "artifact_bytes": 15988039},
-    "2025": {"val_bpb": 1.07862, "val_loss": 2.78620, "steps": 4908, "artifact_bytes": 15992215}
+    "0": {
+      "val_bpb": 1.08035,
+      "val_loss": 2.79067,
+      "steps": 4911,
+      "artifact_bytes": 15994644
+    },
+    "42": {
+      "val_bpb": 1.08097,
+      "val_loss": 2.79225,
+      "steps": 4906,
+      "artifact_bytes": 15995572
+    },
+    "1234": {
+      "val_bpb": 1.08127,
+      "val_loss": 2.79303,
+      "steps": 4915,
+      "artifact_bytes": 15993531
+    },
+    "1337": {
+      "val_bpb": 1.0806,
+      "val_loss": 2.79131,
+      "steps": 4905,
+      "artifact_bytes": 15988802
+    },
+    "2025": {
+      "val_bpb": 1.08135,
+      "val_loss": 2.79324,
+      "steps": 4911,
+      "artifact_bytes": 15993360
+    }
   },
   "lineage": [
-    "PR #1394 (clarkkev) — sp8192 base",
-    "PR #1413 (dexhunter) — sp8192 + QK5 + legal score-first TTT",
-    "PR #1412 (Robby955) — parallel residuals on layers 7-10",
-    "PR #1420 (abaybektursun) — n-gram tilt mechanism + C++ kernel",
-    "PR #1145 (AnirudhRahul) — original normalized causal n-gram cache pattern",
-    "PR #549 (abaybektursun, merged) — score-first TTT precedent"
-  ]
-}
+    "PR #1394 (clarkkev) \u2014 sp8192 base",
+    "PR #1413 (dexhunter) \u2014 sp8192 + QK5 + legal score-first TTT",
+    "PR #1412 (Robby955) \u2014 parallel residuals on layers 7-10",
+    "PR #1420 (abaybektursun) \u2014 n-gram tilt mechanism + C++ kernel",
+    "PR #1145 (AnirudhRahul) \u2014 original normalized causal n-gram cache pattern",
+    "PR #549 (abaybektursun, merged) \u2014 score-first TTT precedent"
+  ],
+  "seed_results_pre_fix": {
+    "0": {
+      "val_bpb": 1.07751,
+      "val_loss": 2.78333,
+      "steps": 4918,
+      "artifact_bytes": 15992304
+    },
+    "42": {
+      "val_bpb": 1.07809,
+      "val_loss": 2.78481,
+      "steps": 4911,
+      "artifact_bytes": 15993733
+    },
+    "1234": {
+      "val_bpb": 1.07813,
+      "val_loss": 2.78492,
+      "steps": 4908,
+      "artifact_bytes": 15990539
+    },
+    "1337": {
+      "val_bpb": 1.07801,
+      "val_loss": 2.78461,
+      "steps": 4909,
+      "artifact_bytes": 15988039
+    },
+    "2025": {
+      "val_bpb": 1.07862,
+      "val_loss": 2.7862,
+      "steps": 4908,
+      "artifact_bytes": 15992215
+    }
+  },
+  "correction_note": {
+    "date": "2026-04-07",
+    "issue": "Issue #1017 condition 2 (causality)",
+    "root_cause": "fused_expert_kernel.cpp::get_hints_batch read tokens_[p] (target token) and used is_bnd[tok]/is_ws[tok] to gate within_hint/word_hint emission, leaking 1-2 bits per scored position",
+    "fix": "kernel patched to derive is_bnd/is_ws from tokens_[p-1] for hint gates only; updates still use current tok (causal because they happen after hint emission). Additionally NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0 disables within/word experts (they contribute negative BPB once causal). Only token_hint contributes (already causal).",
+    "leak_magnitude_nats": 0.00284,
+    "shared_with": "PR #1420 (acknowledged by @abaybektursun in PR #1420 thread, fix proposal at https://github.com/openai/parameter-golf/pull/1420#issuecomment-4199452189)"
+  }
+}
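The per-seed leak magnitude can be cross-checked from the `seed_results` and `seed_results_pre_fix` blocks above; the dictionaries below are just the `val_bpb` values copied out of the diff, and the final assertion reproduces the stored `leak_magnitude_nats` value.

```python
# Cross-check: corrected seed_results minus preserved seed_results_pre_fix.
corrected = {"0": 1.08035, "42": 1.08097, "1234": 1.08127, "1337": 1.0806, "2025": 1.08135}
pre_fix   = {"0": 1.07751, "42": 1.07809, "1234": 1.07813, "1337": 1.07801, "2025": 1.07862}

deltas = {seed: round(corrected[seed] - pre_fix[seed], 5) for seed in corrected}
assert deltas == {"0": 0.00284, "42": 0.00288, "1234": 0.00314, "1337": 0.00259, "2025": 0.00273}

# mean delta matches the correction_note's leak_magnitude_nats field
assert round(sum(deltas.values()) / len(deltas), 5) == 0.00284
```

Every seed moved in the same direction by a similar amount (0.00259 to 0.00314), consistent with a systematic leak rather than seed noise.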
