
Record: 0.6360 BPB - Depth Recurrence + Multi-Order N-gram Backoff#808

Open
Naazimsnh02 wants to merge 3 commits into openai:main from Naazimsnh02:ngram-depth-recurrence-0.6364

Conversation


@Naazimsnh02 Naazimsnh02 commented Mar 26, 2026

Summary

val_bpb: 0.6360 (seed 1337) | ~15.94 MB | 8×H100 SXM | 3 seeds

Adds multi-order n-gram backoff (orders 2-7) with entropy-adaptive alpha to the depth recurrence stack, achieving a new record.

Key contributions

  • Multi-order n-gram backoff (orders 2-7): Hash-table n-gram counting at eval time. Highest-order match first, cascade down on miss. Zero training cost — purely eval-time.
  • Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(2 * (H − 4.0)) — trusts n-gram more when the neural model is uncertain, model when confident.
  • Multi-GPU n-gram prefill: Each rank pre-populates its hash tables with all tokens scored by earlier ranks, fixing the table fragmentation problem on multi-GPU setups (without this, 8-GPU gets 0.87 BPB instead of 0.64).
  • Depth Recurrence: Repeating layers 4,5 for 13 virtual layers from 11 physical at zero parameter cost (carried over from previous submission).
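The backoff-plus-mixing recipe described above can be sketched in a few lines. This is a minimal reconstruction from the bullet descriptions, not the PR's actual code: the table layout (context tuple → count vector), the helper names, and the assumption that H is measured in bits are all mine.

```python
import numpy as np

def entropy_adaptive_alpha(p_model, eps=1e-12):
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)), with H in bits (assumption)."""
    H = -np.sum(p_model * np.log2(p_model + eps))
    return 0.05 + 0.55 / (1.0 + np.exp(-2.0 * (H - 4.0)))

def ngram_backoff_predict(context, tables, orders=range(7, 1, -1)):
    """Highest-order match first; cascade down an order on each miss.
    tables[n] maps an (n-1)-token context tuple to a count vector over the
    vocab (a hypothetical layout, not the PR's hash tables)."""
    for n in orders:
        counts = tables.get(n, {}).get(tuple(context[-(n - 1):]))
        if counts is not None and counts.sum() > 0:
            return counts / counts.sum()
    return None  # no n-gram evidence at any order

def mix(p_model, context, tables):
    """Blend neural and n-gram distributions with the entropy-adaptive weight."""
    p_ngram = ngram_backoff_predict(context, tables)
    if p_ngram is None:
        return p_model  # fall back to the neural model alone
    a = entropy_adaptive_alpha(p_model)
    return (1.0 - a) * p_model + a * p_ngram
```

Note how the alpha schedule behaves at the extremes: a confident model (H near 0) keeps alpha near the 0.05 floor, while a maximally uncertain one (H well above 4 bits) pushes it toward the 0.60 ceiling.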

Results

| Seed | BPB |
| ---- | ------ |
| 1337 | 0.6360 |
| 2025 | 0.6381 |
| 42   | 0.6387 |
| Mean | 0.6376 |

Built on PR #549 stack (LeakyReLU(0.5)², BigramHash(2048), XSA4, Partial RoPE, LN Scale, VE128, EMA+SWA, Parameter Banking + Parallel Muon, int6 GPTQ-lite + lzma).

Credits


MatoTeziTanka commented Mar 26, 2026

Interesting approach — the depth recurrence with layers 4,5 repeated for 13 virtual layers at zero parameter cost is creative, and the multi-GPU n-gram prefill fix (0.87 → 0.64 BPB without it) is a good catch.

You're at 2 seeds right now. The leaderboard requires 3-seed validation for record claims — one more run should close it out.


Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

@Naazimsnh02 Naazimsnh02 changed the title Record: 0.6364 BPB - Depth Recurrence + Multi-Order N-gram Backoff Record: 0.6360 BPB - Depth Recurrence + Multi-Order N-gram Backoff Mar 26, 2026
@MatoTeziTanka

Following up on this one with a new finding, since @valerio-oai ruled on the underlying n-gram mechanism after my first comment.

Compliance flag — same disallowed pattern as PR #779.

@valerio-oai disallowed PR #779 (deanbrr) on 2026-03-27 (comment 4145781641) specifically for "hashed n-gram caches, which do not renormalize correctly / correctly reweight the LM's token distribution, look ahead to the target token to mix probabilities and therefore leak eval tokens." The mechanism is spelled out in comment 4146407380: hashing the target token into the bucket key only reweights the correct token, and in the hash-collision limit drives P(correct) toward 1 regardless of the data — arbitrarily low BPB without real compression.

Looking at records/track_10min_16mb/2026-03-26_DepthRecurrence_NgramBackoff_0.6360/train_gpt.py, the multi-order n-gram backoff uses the same target-in-key hashing pattern in every code path that touches the full_tables:

  • L1110 (multi-GPU prefill update): full_key = ((ctx_hash ^ (tgt * ng_primes[cw % len(ng_primes)])) & ng_mask).astype(np.int64) → np.add.at(full_tables[oi], full_key, 1) at L1112
  • L1170 (eval-time score buildup): full_key = ((ctx_hash ^ (tgt_np * ng_primes[ctx_w % len(ng_primes)])) & ng_mask) → full_counts = full_tables[oi][full_key] at L1179
  • L1411 (rank-local score): same pattern, full_key = ((ctx_hash ^ (tgt_np * ng_primes[...])) & ng_mask) → looked up at L1419
  • L1461 (chunk-end update): same pattern, np.add.at(full_tables[oi], full_key, 1) at L1463

Each of these hashes the ground-truth tgt / tgt_np / tgt_tok into the bucket key, and the full_counts lookup at scoring time then reads the count of that exact target in that bucket. Under @valerio-oai's #779 ruling, that violates condition 1 of Issue #1017 (Rule 1: p_t may depend only on the artifact and x_1...x_{t-1}).
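The mechanism can be reproduced in a toy. Everything below is a reconstruction for illustration, assuming a single order, a made-up prime, and synthetic data; it only demonstrates why a target-in-key lookup always finds its own prefill entry:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 256
ng_mask = (1 << 20) - 1              # ~1M buckets, so cross-pair collisions are rare
PRIME = 1000003                      # stand-in for ng_primes[...]

def full_key(ctx_hash, tgt):
    # the flagged pattern: the ground-truth target goes into the bucket key
    return (ctx_hash ^ (tgt * PRIME)) & ng_mask

table = np.zeros(ng_mask + 1)
stream = rng.integers(0, VOCAB, size=5000)        # synthetic "eval" tokens
ctx_hashes = rng.integers(0, 1 << 30, size=5000)  # synthetic context hashes

# "prefill": insert every (context, target) pair that will later be scored
for h, t in zip(ctx_hashes, stream):
    table[full_key(int(h), int(t))] += 1

# scoring: the true target's bucket is always populated by its own prefill entry...
hits = sum(int(table[full_key(int(h), int(t))] >= 1)
           for h, t in zip(ctx_hashes, stream))
# ...while any other candidate token's bucket is almost always empty
wrong = sum(int(table[full_key(int(h), int((t + 1) % VOCAB))] >= 1)
            for h, t in zip(ctx_hashes, stream))
print(hits, wrong)   # hits == 5000 by construction; wrong stays near zero
```

The count vector is therefore peaked on the true token regardless of what the data looks like, which is exactly the "P(correct) driven toward 1" limit from the #779 ruling.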

The multi-GPU n-gram prefill mechanism the README describes (0.87 BPB without it, 0.64 with it) is also worth thinking about in this light: each rank pre-populates its hash tables with all tokens scored by earlier ranks, which means each rank's tables already contain the target tokens it is about to score. The 0.23 BPB gap between "prefill on" and "prefill off" is the size of the leak, not the size of a compression gain.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #779.

@Naazimsnh02, please let me know if I've misread the code, especially the full_key lookups at L1110/L1170/L1411/L1461, and whether there's a renormalization over the full vocabulary that I'm missing; if so, I'd want to retract. The depth-recurrence stack (layers 4–5 repeated for 13 virtual layers at zero parameter cost) and the multi-GPU prefill plumbing are both genuinely interesting engineering, separate from the n-gram question. If you wanted to resubmit on a legal base (either dropping the cache and reporting the pure-neural sliding-window number, or reworking the cache as a full-vocab reweighting per @valerio-oai's suggested path on #779), the depth-recurrence and multi-GPU work would carry over cleanly. The 3-seed update from the earlier ask is in place and looks consistent.
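For concreteness, here is one possible shape of the full-vocab reweighting path. This is a hypothetical sketch of my own, not @valerio-oai's specification: the names, bucket count, and smoothing are invented, and the point is only that the bucket key depends on the context alone.

```python
import numpy as np

VOCAB = 256
N_BUCKETS = 1 << 12
# counts indexed by (context bucket, token): the target never enters the key
table = np.zeros((N_BUCKETS, VOCAB))

def ctx_bucket(ctx_hash):
    return ctx_hash & (N_BUCKETS - 1)

def update(ctx_hash, observed_tok):
    # called only for tokens that have ALREADY been scored (x_1..x_{t-1})
    table[ctx_bucket(ctx_hash), observed_tok] += 1

def ngram_probs(ctx_hash, smoothing=0.5):
    # a proper distribution over the full vocab, renormalized at scoring time;
    # p_t depends only on the context hash and past updates, never on the
    # token currently being scored
    counts = table[ctx_bucket(ctx_hash)] + smoothing
    return counts / counts.sum()
```

The obvious cost is memory: the table is N_BUCKETS × VOCAB floats rather than a flat bucket array, which is presumably why per-(context, target) hashing was attractive in the first place.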


Reviewed by @MatoTeziTanka (The Agora). Static code review against train_gpt.py at SHA 6c1b833b. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>