
Record: Order-Adaptive Entropy Gating + BackoffNgramMixer (val_bpb=0.5466)#798

Open
travispchen wants to merge 1 commit into openai:main from travispchen:oaeg-backoff-ngram

Conversation

@travispchen

Order-Adaptive Entropy Gating + BackoffNgramMixer + Drift-Free TTT

val_bpb: 0.5466 (3-seed mean, std 0.0010) | ~15.99 MB | 8×H100 SXM

Adds order-adaptive entropy gating on top of PR #779's BackoffNgramMixer + Drift-Free TTT submission. Instead of using a single entropy center for all n-gram orders, each order gets its own threshold — higher orders are trusted at lower entropy, lower orders only kick in when the model is more uncertain.

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | Pre-TTT bpb | Post-TTT bpb | TTT gain | TTT time | Artifact |
|------|----------|-------|-------------|--------------|----------|----------|----------|
| 1337 | 99.3ms | 5,863 | 1.1279 | 0.5478 | -0.5801 | 607s | 15,995,959 |
| 42 | 98.3ms | 5,863 | 1.1362 | 0.5458 | -0.5904 | 606s | 15,979,251 |
| 2025 | 99.2ms | 5,869 | 1.1369 | 0.5463 | -0.5906 | 607s | 15,994,227 |
| Mean | 98.9ms | 5,865 | 1.1337 | 0.5466 (std 0.0010) | -0.5871 | ~607s | |

What Changed vs PR #779

PR #779 uses a single entropy_center=3.5 for all n-gram orders. We replace this with per-order entropy centers:

# PR #779 (single entropy center for all orders)
alpha = alpha_min + (alpha_max - alpha_min) * sigmoid(2.0 * (entropy - 3.5))

# This submission (per-order entropy centers)
ent_centers = {7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}
ent_center = ent_centers[matched_order]
alpha = alpha_min + (alpha_max - alpha_min) * sigmoid(2.0 * (entropy - ent_center))

Higher-order n-grams (7, 6, 5) are trusted at lower model entropy — when the model is fairly confident, the precise n-gram correction refines the prediction. Lower-order n-grams (4, 3, 2) only intervene at higher entropy — when the model is confused enough that even coarse statistics help.
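The gating above can be sketched as a small self-contained function. The entropy centers are the PR's values; `ALPHA_MIN`/`ALPHA_MAX` and the function names are assumptions for illustration, not the submission's actual constants:

```python
import math

# Per-order entropy centers from the PR; alpha bounds are assumed for this sketch.
ENT_CENTERS = {7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}
ALPHA_MIN, ALPHA_MAX = 0.0, 0.5

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def mixing_alpha(entropy: float, matched_order: int) -> float:
    """Weight on the n-gram correction, given the model's predictive entropy
    and the longest n-gram order that matched the current context."""
    center = ENT_CENTERS[matched_order]
    return ALPHA_MIN + (ALPHA_MAX - ALPHA_MIN) * sigmoid(2.0 * (entropy - center))
```

At a fixed entropy of 3.5, a matched 7-gram (center 3.0) gets a larger alpha than a matched 2-gram (center 4.5), which is exactly the "trust high orders at low entropy" behavior described above.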

This is an eval-time-only change. It modifies how existing n-gram statistics are combined with neural predictions, not when data enters the cache. The n-gram cache is still updated strictly AFTER scoring each batch (score-first).
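The score-first discipline can be made concrete with a minimal loop (all names here are hypothetical, not the PR's actual code): each batch is scored against the cache state produced by prior batches only, and its own tokens enter the cache strictly afterwards.

```python
def evaluate_score_first(batches, score_batch, update_cache):
    """Score each batch BEFORE its tokens enter the n-gram cache,
    so no batch can ever be predicted from its own (future) contents."""
    total, n = 0.0, 0
    for batch in batches:
        total += score_batch(batch)   # sees cache state from prior batches only
        n += 1
        update_cache(batch)           # batch enters the cache strictly after scoring
    return total / n
```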

Legality

  • Score-first: N-gram cache updated AFTER scoring each batch. No future tokens leak into predictions.
  • No oracle selection: Alpha depends only on model entropy and n-gram order, not on ground truth.
  • Artifact size: All seeds strictly under 16,000,000 bytes (max: 15,995,959).
  • Training time: Capped at 600s (10 min) on 8×H100 (actual: ~582s).
  • Eval time: TTT eval ≤607s on 8×H100.

Ablation

| Change | Post-TTT bpb | Delta |
|--------|--------------|-------|
| PR #779 baseline (single entropy center) | 0.6713 | |
| + Order-adaptive entropy gating | 0.5478 | -0.1235 |

Credits

  • BackoffNgramMixer + Drift-Free TTT + Base model: PR #779
  • Order-adaptive entropy gating: This submission

…5466, 3-seed mean)

Adds order-adaptive entropy gating on top of PR openai#779's BackoffNgramMixer + Drift-Free TTT.
Per-order entropy centers replace single threshold: higher n-gram orders trusted at lower entropy.
3-seed validation: 0.5478, 0.5458, 0.5463 (mean 0.5466, std 0.0010).
All artifacts strictly under 16,000,000 bytes.

Co-Authored-By: Travis Chen <travispchen@gmail.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 26, 2026
Add per-order entropy centers from PR openai#798 insight:
  order 7: center=3.0, order 6: 3.2, order 5: 3.5,
  order 4: 3.8, order 3: 4.2, order 2: 4.5
Higher orders trusted at lower entropy, lower orders only at high
uncertainty. Cubric multipliers applied on top.

Original X-WING (0.5644) untouched in concepts/xwing/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 26, 2026
PR openai#798's approach on our engine: per-order entropy centers
(7:3.0, 6:3.2, 5:3.5, 4:3.8, 3:4.2, 2:4.5) without cubric.
Testing if cubric was hurting when combined with per-order gating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Order-Adaptive Entropy Gating + BackoffNgramMixer

BPB: 0.5466 (3-seed, std 0.0010) | Seeds: 3 | Artifact: 15,995,959 bytes (seed 1337 max) | Compliance: FLAG — inherits PR #779's disallowed mechanism

What this does (factually): Adds per-order entropy thresholds on top of PR #779's BackoffNgramMixer. The base mechanism is unchanged — a hashed n-gram cache (orders 2–7, 4.19M buckets, 7 primes) mixed with neural logits via an entropy-gated alpha. The only delta vs PR #779 is that the sigmoid center is now a function of best_order instead of a fixed 3.5 (train_gpt.py:93).
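For readers unfamiliar with the base mechanism, a hashed n-gram cache of the kind described here can be sketched as follows. All names and values below are illustrative assumptions (except the ~4.19M bucket count and per-order primes, which follow the review's description), and the key is built from context tokens only; the compliance flag concerns a variant that also hashes the target token into the lookup key:

```python
NUM_BUCKETS = 1 << 22            # ~4.19M buckets, per the review's description
MASK = NUM_BUCKETS - 1
PRIMES = {2: 10007, 3: 10009, 4: 10037, 5: 10039, 6: 10061, 7: 10067}

counts = {}  # sparse table: bucket index -> {target token: count}

def ctx_hash(context, order):
    """Hash the last (order - 1) CONTEXT tokens into a bucket index.
    The target token must not participate in this key."""
    h = 0
    for tok in context[-(order - 1):]:
        h = (h * PRIMES[order] + tok) & MASK
    return h

def update(context, target, order):
    bucket = counts.setdefault(ctx_hash(context, order), {})
    bucket[target] = bucket.get(target, 0) + 1

def lookup(context, order):
    return counts.get(ctx_hash(context, order), {})
```

During scoring, `lookup` returns per-token counts for the matched context, which are normalized and blended with the neural logits via the entropy-gated alpha.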


Verdict: COMPLIANCE FLAG — inherits the disallowed PR #779 mechanism.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE (or NEEDS AUTHOR ACTION), same ruling as #779. The order-adaptive entropy gating is a thoughtful and well-ablated delta in its own right (a 0.1235 bpb improvement over the #779 baseline per the README ablation), and on a legal base-model stack I'd expect it to be a genuinely useful eval-time trick. If @travispchen wants to resubmit with the hashed n-gram cache removed — either replacing it with a reweighting over the full vocab (as @valerio-oai suggested on #779) or dropping the mixer entirely and keeping just the Drift-Free TTT + Polyak averaging — the entropy-gating idea itself should port cleanly.

@travispchen — please let me know if my reading of the code is wrong here, especially around the full_key lookup at train_gpt.py:119 and whether the per-order threshold change at L93 is meant to affect anything other than the mixing weight. First submission in this area is no small thing and the ablation work is clean; the only issue is that the base mechanism was ruled on the day after you opened this PR, so the timing is unlucky rather than anything about the care in your submission.


Reviewed by @MatoTeziTanka (The Agora). CPU gauntlet PASS (import/forward/artifact all green); flag is solely inherited-mechanism compliance per the #779 ruling. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA ab9681a7cb21df7b5994b5c8131dea4dc96a0684.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
