Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683)#779
deanbrr wants to merge 2 commits into openai:main
Conversation
awesome

Thank you. It's causing a big stir; some are calling it gaming.

It was definitely a gamer move, but I don't think it was gaming. This was my night of studying and testing....
611612e to bd5e1b9
…5466, 3-seed mean)
Adds order-adaptive entropy gating on top of PR openai#779's BackoffNgramMixer + Drift-Free TTT. Per-order entropy centers replace the single threshold: higher n-gram orders are trusted at lower entropy.
3-seed validation: 0.5478, 0.5458, 0.5463 (mean 0.5466, std 0.0010). All artifacts strictly under 16,000,000 bytes.
Co-Authored-By: Travis Chen <travispchen@gmail.com>
All GPUs iterate all chunks (4M tokens each), share full 32M cache. Replaces per-GPU partition that limited cache to 4M tokens/GPU.
Changes (eval_val_sliding only, no training changes):
- Add _bulk_cache_update: vectorized np.bincount (replaces np.add.at)
- Chunk-level iteration: windows interleaved rank::world_size per chunk
- Delete pre-fill loop (chunk-sync makes it unnecessary)
- Trim to orders 2-7 (was 2-28), per-order entropy centers for 2-7
- Upfront timing go/no-go: abort n-gram if est > 550s after 2 chunks
- Fix double-counting bug: all_reduce once at end, not per-chunk
- Default NGRAM_ALPHA=0.40, NGRAM_ENT_BASE=0.05, NGRAM_ENT_RANGE=0.55
Expected: 0.97 → ~0.30 BPB on 8xH100 (matching PR openai#779/809).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
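The `_bulk_cache_update` change above swaps a scalar scatter-add for a single histogram pass. A minimal sketch of that idea, assuming a flat per-order count array and pre-hashed bucket keys (the function name comes from the commit message; the array layout is illustrative):

```python
import numpy as np

def bulk_cache_update(cache: np.ndarray, keys: np.ndarray) -> None:
    """Add 1 to cache[k] for every k in `keys`, counting repeats.

    np.add.at(cache, keys, 1) gives the same result but does an
    unbuffered element-by-element scatter-add; np.bincount builds the
    whole histogram in one vectorized pass. Assumes `keys` are already
    masked into [0, len(cache)).
    """
    counts = np.bincount(keys, minlength=cache.shape[0])
    cache += counts.astype(cache.dtype)

# Toy usage: an 8-bucket cache and repeated hashed keys.
cache = np.zeros(8, dtype=np.int64)
bulk_cache_update(cache, np.array([1, 1, 3, 7, 1]))
# cache is now [0, 3, 0, 1, 0, 0, 0, 1]
```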
… + Backoff TTT
Replaces the heuristic entropy-adaptive alpha with a learned 7-expert gate (Linear 512→7) that routes between the neural model and n-gram orders 2-7. The gate is trained end-to-end during the main training loop using a frozen n-gram oracle pre-computed from training data (counted within wallclock).
3-seed results (8xH100 SXM, 600s):
seed 1337: val_bpb=0.1661 (15.74 MB)
seed 42: val_bpb=0.1663 (15.76 MB)
seed 2024: val_bpb=0.1666 (15.25 MB)
mean: val_bpb=0.1663 (std=0.0003)
Cleanup: removed dead code (adaptive LR, Polyak averaging, scalar mixer path, unused function params). Added detailed order-of-operations to README proving legality of the training and evaluation procedure.
Based on PR openai#779 (deanbrr) BackoffNgramMixer + DriftFreeTTT architecture.
Made-with: Cursor
Great submission — the entropy-adaptive alpha design is elegant, and the drift-free TTT configuration solving the late-chunk reversal problem is a solid engineering contribution. The ablation data is also really valuable for the community to understand what's driving improvements in this space.

One small thing I noticed while reviewing the eval loop: the cache update at the end of each chunk passes an end index one token past the last scored position. The practical impact is almost certainly negligible on the final BPB, and it's likely there to handle the boundary condition for forming complete n-grams at the chunk edge. Just flagging it since the compliance section states "counts from already-scored tokens only" — might be worth a quick check to confirm it's intentional.

Nice work overall.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
@MateoTeziTanka Good catch, thank you. The +1 was inherited from the base code and leaked one unscored token per chunk boundary into the n-gram counts. Fixed in c58742a.
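A toy before/after sketch of that off-by-one, assuming the cache is fed a slice of the chunk after scoring positions [s, e) (the helper name and flag are hypothetical, not the submission's actual code):

```python
def scored_slice(tokens, s, e, include_boundary_token=False):
    """Return the tokens eligible for n-gram counting after scoring [s, e).

    include_boundary_token=True reproduces the inherited +1 behavior,
    which lets tokens[e], a not-yet-scored token, into the counts at
    each chunk boundary; the fix counts scored tokens only.
    """
    end = e + 1 if include_boundary_token else e
    return tokens[s:end]
```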
@deanbrr Quick fix — nice. Glad it was useful. Clean submission overall.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
…ual hash tables, per-window score-first, entropy-adaptive alpha, tc>0 check)
Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches, which do not renormalize or reweight the LM's token distribution correctly, and which look ahead to the target token to mix probabilities and therefore leak eval tokens. Please refer to the long discussion about this under the issues tab for more details, and please submit more runs in the future!
Revisionist history in my opinion, and I don't think you are correct. That is not how entropy-estimated n-gram mixing works, and you are penalizing the fact that the dataset can be pseudo-memorized. You encouraged this and now have been influenced by the gang. I was awarded the first ML patent in the US, and while I respect different viewpoints, I disagree with this. I think you would be better off looking at token train/test overlap. My humble opinion.
…cluster + CT2038 gauntlet provisioned
Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf. Two cluster-level findings:
- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash ^ (target * primes[k])) & mask) — target token hashed into the eval-cache lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).
- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window delta+logit_bias optimized N steps against (per_token_nll * mask) where mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD; openai#1319/openai#1376 → CLOSE.
Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA megakernel triple loop).
Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs plus ~30 ngram-cache PRs.
Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores) as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0, deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments via gh api PATCH to add the rerun results. Coverage went from 9/20 to 14/20 fully gauntleted.
Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254 fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
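The ruled-illegal key pattern quoted above can be shown in a toy sketch, contrasted with a context-only key. The prime and mask values are placeholders and both function names are hypothetical; only the quoted key expression comes from the audit:

```python
PRIMES = [1000003, 1000033, 1000037]  # illustrative per-order primes
MASK = (1 << 22) - 1                  # ~4M buckets, illustrative

def ctx_only_key(ctx_hash: int) -> int:
    # Legal shape: the lookup key depends only on already-seen context.
    return ctx_hash & MASK

def leaky_key(ctx_hash: int, target: int, k: int) -> int:
    # Disallowed shape: the target token is folded into the lookup key,
    # so a cache hit at eval time implicitly depends on the very token
    # being predicted, i.e. the key looks ahead at the answer.
    return (ctx_hash ^ (target * PRIMES[k])) & MASK
```

The tell is that `leaky_key` changes when only the target changes, while a compliant key is a pure function of the context.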
Pre-answers the "where does the 0.0458 improvement come from" question using exact log excerpts from the three archived runs that produced submission.json:
seed 7: neural 1.1481 -> +mixer 0.3948 (delta 0.7533)
seed 1337: neural 1.1480 -> +mixer 0.3957 (delta 0.7523)
seed 2024: neural 1.1492 -> +mixer 0.3969 (delta 0.7523)
mean: neural 1.1484 -> +mixer 0.3958 (delta 0.7526)
Includes the mixer convergence curve for seed 7 (1.176 -> 0.395 as counts accumulate in strict score-first order) and positions the submission as an eval-stage refinement of already-merged openai#779 and openai#803 rather than a novel training method.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3-seed mean val_bpb: 0.6683 (std 0.0024), all artifacts under 16 MB, 8xH100 SXM, 600s training + 371s eval.
Results:
Seed 1337: 0.6663 BPB, 15.63 MB artifact
Seed 42: 0.6710 BPB, 15.78 MB artifact
Seed 2024: 0.6675 BPB, 15.48 MB artifact
Background:
I introduced the first n-gram eval cache in this competition (PR #659, val_bpb=1.0920, March 22 2026). That approach used a 5-gram cache with an oracle safety gate that was later ruled illegal by the organizers. This submission replaces the oracle gate with entropy-adaptive mixing and multi-order backoff, combined with a drift-free TTT configuration.
Technique:
Multi-order n-gram backoff (orders 2-7). Try highest order first, cascade down on miss. Each order uses 4M hash buckets. Counts accumulated from already-scored tokens only.
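A minimal sketch of that backoff cascade, assuming one cache per order keyed by hashed context (the dict-of-dicts layout and the name `backoff_lookup` are illustrative, not the submission's actual data structures):

```python
import numpy as np

ORDERS = range(7, 1, -1)   # try order 7 first, back off down to 2
NUM_BUCKETS = 1 << 22      # ~4M hash buckets per order, as described

def backoff_lookup(context, caches):
    """Return the highest-order n-gram distribution with any counts.

    caches[n] maps a hashed (n-1)-token context to a count vector over
    the vocabulary; a miss at order n cascades down to order n-1.
    """
    for n in ORDERS:
        if len(context) < n - 1:
            continue                    # not enough context for this order
        key = hash(tuple(context[-(n - 1):])) % NUM_BUCKETS
        counts = caches[n].get(key)
        if counts is not None and counts.sum() > 0:
            return counts / counts.sum()   # renormalize counts to probs
    return None                            # all orders miss: use the model alone
```

A hit at a high order wins outright; only a miss falls through, so sparse long contexts never dilute dense short-context statistics.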
Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)), where H is model entropy. High entropy trusts n-gram more, low entropy trusts the model. Depends only on the model's own output distribution, never on the true target. Mixed probability always applied, no oracle gate.
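The alpha formula above can be sketched directly (a minimal restatement of the stated formula; the mixing line in the comment assumes the usual convex combination of the two distributions):

```python
import math

def adaptive_alpha(H: float) -> float:
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)).

    H is the entropy of the model's own next-token distribution, so
    alpha never depends on the true target. High entropy (uncertain
    model) pushes alpha toward 0.60, trusting the n-gram cache more;
    low entropy keeps alpha near 0.05, trusting the model.
    """
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))

# mixed = (1 - alpha) * p_model + alpha * p_ngram   (always applied, no gate)
```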
Drift-free TTT: Q projections only (QTTT=1), eta=0.02, LR=3e-5, 1M token chunks, 1 epoch, no adaptive LR, no Polyak. Produces monotonic BPB improvement through all 60 chunks with no late-chunk reversal.
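A toy sketch of the Q-only restriction (QTTT=1) in that recipe. The parameter-naming convention and the function are hypothetical; the point is that plain SGD touches only Q-projection weights while everything else stays frozen:

```python
import numpy as np

def ttt_q_step(params: dict, grads: dict, lr: float = 3e-5) -> None:
    """Update only Q-projection weights; freeze everything else.

    Plain SGD, no adaptive LR, no Polyak averaging, matching the
    drift-free configuration described above.
    """
    for name, g in grads.items():
        if ".q_proj." in name:          # hypothetical naming convention
            params[name] -= lr * g
```

Restricting the update surface this way is what the PR credits for monotonic BPB improvement across all 60 chunks.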
Ablation (seed 1337):
Base model (no mixer, no TTT): 1.1363
TTT only (no mixer): 1.1369
Mixer only (no TTT): 0.6712
Full system: 0.6663
The BackoffNgramMixer contributes 99% of the improvement. It is a pure eval-time technique requiring no architectural changes or retraining.
Compliance:
Score-first TTT: each chunk scored under inference_mode before training on it. Backward-looking n-gram: counts from already-scored tokens only. No oracle selection. No training data access at eval (naive int5 quantization, no GPTQ). Token count verified: ratio_scored = 1.000000.
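The score-first ordering above can be sketched as a generic chunk loop (a hypothetical skeleton, with scoring, cache update, and TTT passed in as callables; the real code runs the scoring step under torch.inference_mode):

```python
def eval_chunks(chunks, score, update_counts, ttt_step):
    """Score-first order per chunk: a chunk is fully scored before its
    tokens enter the n-gram cache or the TTT update, so no token ever
    influences its own score."""
    total_nll, total_tokens = 0.0, 0
    for tokens in chunks:
        total_nll += score(tokens)       # 1. score the chunk first
        total_tokens += len(tokens)
        update_counts(tokens)            # 2. then add its tokens to the cache
        ttt_step(tokens)                 # 3. then TTT-train on it
    return total_nll / total_tokens
```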
Credits:
PR #700 RoyiRa (base architecture, TTT framework), PR #606 gowtham0992 (int5 + Soft-Round QAT), PR #727 Asukabot0 (backoff concept, entropy-adaptive alpha formula), PR #461 Christopher-Lee-McClendon (TTT recipe), PR #518 sofiabod (LeakyReLU, cosine TTT). Dean Barr (original n-gram eval cache concept first in competition PR #659, drift-free TTT discovery, BackoffNgramMixer implementation).