
Record: 11L + Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.6672)#770

Open
minh-stakc wants to merge 2 commits into openai:main from minh-stakc:submission/ngram-cache-0.6672

Conversation

@minh-stakc

Summary

val_bpb: 0.6672 (seed 42) | 15.0 MB artifact | 1xB200 (HiPerGator)

Technique

Base 11L SOTA architecture with eval-time multi-order n-gram cache interpolation.

Key innovations

  1. Multi-order backoff (orders 2-7): Highest order first, cascade down on miss. Captures repeated document patterns outside the transformer's context window.

  2. Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)). When the model is uncertain (high entropy), trust n-gram statistics more; when confident, trust the LM.
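The two ideas above can be sketched together as follows. This is a minimal illustration, not the PR's actual hashed implementation: the dict-based tables, function names, and parameter defaults are invented for clarity; only the alpha formula and the highest-order-first cascade mirror the description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_alpha(entropy, lo=0.05, span=0.55, center=4.0, slope=2.0):
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)):
    # high-entropy (uncertain) model -> alpha approaches lo + span (trust n-grams);
    # low-entropy (confident) model  -> alpha approaches lo (trust the LM).
    return lo + span * sigmoid(slope * (entropy - center))

def backoff_lookup(tables, context, max_order=7, min_order=2):
    # Try the longest context suffix first; cascade down one order on each miss.
    # An order-k table is keyed on the previous k-1 tokens.
    for order in range(max_order, min_order - 1, -1):
        key = tuple(context[-(order - 1):])
        if key in tables[order]:
            return tables[order][key], order
    return None, None
```

At H = 4.0 the schedule sits at its midpoint, alpha = 0.05 + 0.55 * 0.5 = 0.325, and it saturates toward 0.6 as entropy grows.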

Compliance

  • Score-first, backward-looking: n-gram counts built from previously scored tokens only
  • No oracle selection: alpha depends on model entropy, never on ground-truth labels
  • Single blended prediction per token, no min(NLL)

Results

Metric                               Value
Pre-quant val_bpb                    1.1927
Post-quant roundtrip                 1.1577
Post n-gram sliding (stride 64)      0.6672
Artifact size                        15,025,238 bytes

Architecture

11L, 512d, 8H/4KV GQA, MLP 3x, XSA4, Partial RoPE, LN Scale, VE128, SmearGate, BigramHash(2048), EMA(0.997), Late QAT, OrthoInit. Int6+GPTQ-lite+3% pruning+zstd-22.

Reproduction

SEED=42 NGRAM_CACHE=1 NGRAM_ORDER=7 NGRAM_MIN_ORDER=2 \
NGRAM_ENTROPY=1 EVAL_STRIDE=64 PRUNE_PCT=0.03 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

Credits

PR #414 (signalrush), PR #315 (jfprincz), PR #702/#727 (lukacf)

Test plan

  • Artifact under 16MB (15.0 MB)
  • Score-first n-gram cache (backward-looking)
  • No min(NLL), no target-aware gating
  • Single seed included (additional seeds pending)

@MatoTeziTanka

Community Review — 11L + Multi-Order N-gram Backoff + Entropy-Adaptive Alpha

BPB: 0.6672 (seed 42, 3-seed mean 0.6678) | Seeds: 3 | Artifact: 15,025,238 bytes (93.9% budget) | Compliance: FLAG (n-gram cache)

What this does: Trains an 11L/512d/GQA model that converges to an honest post-quant val_bpb of 1.1577, then at eval time interpolates the neural prediction with a multi-order (2–7) hashed n-gram cache and reports 0.6672. The entire −0.49 BPB gap between 1.1577 and 0.6672 comes from the n-gram cache.

What I found in the code (SHA 7b786ba0b3dc778d02f179be697a020623a2da36, records/track_10min_16mb/2026-03-25_NgramCache_EntropyAdaptive_0.6672/train_gpt.py):

  • Line 967: the target token is hashed into the lookup key:
    tgt_np = val_np[jv].astype(np.uint64)
    full_key = ((ctx_hash ^ (tgt_np * ng_primes[ctx_w % len(ng_primes)])) & ng_mask).astype(np.int64)
    This is the same full_key = ctx_hash ^ (target * prime) pattern that @valerio-oai ruled disallowed on PR #779 (Record: BackoffNgramMixer + Drift-Free TTT, 3-seed mean val_bpb=0.6683).
  • Line 975: full_counts = full_tables[oi][full_key] — the per-token prediction at position t queries a bucket whose identity is determined by the gold token val_np[jv] at that same position t. Whether the bucket holds a nonzero count (from any prior increment at line 993 for any earlier occurrence of the same (context, target) pair anywhere in the corpus) directly separates "correct" from "incorrect" guesses for the current token.
  • Lines 988–993: the table update is deferred until after scoring ("Score-first: update tables AFTER scoring"), so the author reads this as backward-looking. But the leak is not about the timing of the current token's increment — it is about the fact that the key that gets read at line 975 is a function of the current gold token, so the cache's content at that key implicitly encodes "has this exact (context, target) been seen in the past." Per @valerio-oai's mechanism explanation in #779 comment 4146407380, that is oracle information about token t.
  • train_seed42.log lines 88 & 92: final_int6_roundtrip val_bpb: 1.1577; final_int6_sliding_window val_bpb: 0.6672. The neural artifact alone is 1.1577; the entire −0.4905 BPB improvement comes from the n-gram pass.
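The leak mechanism described above can be reproduced in a toy sketch. The prime, bucket count, and context hash here are made up for illustration; only the key construction mirrors the quoted target-in-key pattern:

```python
import numpy as np

BUCKETS = 1 << 12
PRIME = np.uint64(1000003)          # illustrative; not the PR's actual prime
MASK = np.uint64(BUCKETS - 1)
table = np.zeros(BUCKETS, dtype=np.int64)

def full_key(ctx_hash, token):
    # The token being scored is folded into the bucket index --
    # this is the flagged target-in-key construction.
    return int((np.uint64(ctx_hash) ^ (np.uint64(token) * PRIME)) & MASK)

ctx_hash = 12345
table[full_key(ctx_hash, 7)] += 1   # earlier occurrence of (context, target=7)

# Later, at a position whose gold token is 7 under the same context, the key
# that gets read is a function of that gold token, so the bucket count (1 vs 0)
# separates the correct token from other candidates -- even though the current
# position's own increment is deferred until after scoring.
assert table[full_key(ctx_hash, 7)] == 1
assert table[full_key(ctx_hash, 8)] == 0
```

Deferring the increment makes the cache score-first in time, but the read at position t still conditions on x_t through the key itself.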

Family-bug cluster: The README explicitly credits the n-gram cache and entropy-adaptive alpha to PR #702 / #727 (lukacf). #702 was created the same day (2026-03-25) and appears to be the parent; PRs #798, #808, and #825 are later siblings that were reviewed under the same ruling. #770 is an early sibling, not the root — but it uses the same target-in-key mechanism.

Questions/flags for @minh-stakc:

  1. Per @valerio-oai's ruling on PR #779 (Record: BackoffNgramMixer + Drift-Free TTT; comment 4145781641, 2026-03-27), hashed n-gram caches that hash the target token into the lookup key are disallowed for leaking eval tokens. The mechanism is explained in comment 4146407380. Can you confirm whether the line-967 full_key is the same construction covered by that ruling?
  2. Per Issue #1017 (A Field Guide to Valid Submissions), condition 1: "p_t may depend only on the artifact and x_1...x_{t−1}." Because full_key at position t is a function of val_np[jv] (the gold token x_t), the probability you mix in at line 986 depends on x_t, not only on x_1..x_{t−1}. Do you read condition 1 differently?
  3. Would you be open to porting the entropy-adaptive alpha idea and the multi-order backoff structure to a context-only key (drop tgt_np from full_key and use a full-vocab re-weighting from a single context table)? That would preserve the engineering contribution — the adaptive alpha schedule, the backoff cascade, the int6+GPTQ-lite+zstd-22 pipeline — while moving it onto a legal base.
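A context-only variant of the kind suggested in question 3 might look like the following. This is a hypothetical sketch, not the PR's code: the table shape, function names, and blending details are invented; the point is that the bucket index depends only on the context, and the full-vocabulary count vector is read before the gold token is used:

```python
import numpy as np

VOCAB, BUCKETS = 64, 1 << 12
table = np.zeros((BUCKETS, VOCAB), dtype=np.int64)  # per-context token counts

def ctx_key(ctx_hash):
    # Key depends on x_1..x_{t-1} only; the target never touches the index.
    return int(np.uint64(ctx_hash) & np.uint64(BUCKETS - 1))

def blended_probs(model_probs, ctx_hash, alpha):
    counts = table[ctx_key(ctx_hash)].astype(np.float64)
    total = counts.sum()
    if total == 0:
        return model_probs            # cache miss: fall back to the LM alone
    ngram_probs = counts / total
    p = (1.0 - alpha) * model_probs + alpha * ngram_probs
    return p / p.sum()

def update(ctx_hash, gold_token):
    # Score-first discipline: call only AFTER x_t has been scored.
    table[ctx_key(ctx_hash), gold_token] += 1
```

Since the whole distribution over candidates is produced before x_t is consulted, p_t is a function of the artifact and x_1..x_{t−1} only, which is what condition 1 of #1017 asks for.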

What I'm not flagging: The base model itself is sound. The CPU gauntlet passes cleanly — imports OK, 26,993,756 params, forward loss 6.9338, artifact 3.62 MB after int6+lzma on the gauntlet stub (22.6% of budget; the real 15.0 MB in the log includes GPTQ-lite calibration tensors the stub skips), estimated ~13k steps on 8xH100 in 10 min. The quantization stack (Int6 per-row + GPTQ-lite + 3% magnitude pruning + zstd-22), the OrthoInit, the Partial RoPE + XSA last-4 + VE128 architecture, the late-QAT schedule, and the 3-seed discipline are all real engineering worth preserving. This review is strictly about the n-gram cache mechanism, not about the base model.

Verdict: COMPLIANCE FLAG — family-bug n-gram cache (target-in-key). Base artifact BPB is 1.1577, not 0.6672.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE, pending author resubmission with a context-only n-gram cache or with the n-gram mixer removed (base 1.1577 BPB is still a legitimate non-record submission for the architecture/quantization work). The same ruling already applied to #779, #798, #808, and #825 should apply here; per the README credits, #702 / #727 (lukacf) is the upstream parent of this mechanism and should also be evaluated.


Reviewed by @MatoTeziTanka (The Agora). CPU gauntlet: PASS (import OK, forward loss 6.9338, 26.99M params, artifact 22.6% of budget on gauntlet stub). AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 7b786ba0b3dc778d02f179be697a020623a2da36.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>