
Record: 11L + Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.6672)#770

Open
minh-stakc wants to merge 2 commits into openai:main from minh-stakc:submission/ngram-cache-0.6672

Conversation

@minh-stakc

Summary

val_bpb: 0.6672 (seed 42) | 15.0 MB artifact | 1xB200 (HiPerGator)

Technique

Base 11L SOTA architecture with eval-time multi-order n-gram cache interpolation.

Key innovations

  1. Multi-order backoff (orders 2-7): Highest order first, cascade down on miss. Captures repeated document patterns outside the transformer's context window.

  2. Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)). When the model is uncertain (high entropy), trust n-gram statistics more; when confident, trust the LM.
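The two ideas above can be sketched together as follows. This is a minimal illustration, not the PR's actual hashed implementation: the dict-based tables, function names, and parameter defaults are invented for clarity; only the alpha formula and the highest-order-first cascade mirror the description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_alpha(entropy, lo=0.05, span=0.55, center=4.0, slope=2.0):
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)):
    # high-entropy (uncertain) model -> alpha approaches lo + span (trust n-grams);
    # low-entropy (confident) model  -> alpha approaches lo (trust the LM).
    return lo + span * sigmoid(slope * (entropy - center))

def backoff_lookup(tables, context, max_order=7, min_order=2):
    # Try the longest context suffix first; cascade down one order on each miss.
    # An order-k table is keyed on the previous k-1 tokens.
    for order in range(max_order, min_order - 1, -1):
        key = tuple(context[-(order - 1):])
        if key in tables[order]:
            return tables[order][key], order
    return None, None
```

At H = 4.0 the schedule sits at its midpoint, alpha = 0.05 + 0.55 * 0.5 = 0.325, and it saturates toward 0.6 as entropy grows.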

Compliance

  • Score-first, backward-looking: n-gram counts built from previously scored tokens only
  • No oracle selection: alpha depends on model entropy, never on ground-truth labels
  • Single blended prediction per token, no min(NLL)

Results

Metric                               Value
Pre-quant val_bpb                    1.1927
Post-quant roundtrip                 1.1577
Post n-gram sliding (stride 64)      0.6672
Artifact size                        15,025,238 bytes

Architecture

11L, 512d, 8H/4KV GQA, MLP 3x, XSA4, Partial RoPE, LN Scale, VE128, SmearGate, BigramHash(2048), EMA(0.997), Late QAT, OrthoInit. Int6+GPTQ-lite+3% pruning+zstd-22.

Reproduction

SEED=42 NGRAM_CACHE=1 NGRAM_ORDER=7 NGRAM_MIN_ORDER=2 \
NGRAM_ENTROPY=1 EVAL_STRIDE=64 PRUNE_PCT=0.03 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

Credits

PR #414 (signalrush), PR #315 (jfprincz), PR #702/#727 (lukacf)

Test plan

  • Artifact under 16MB (15.0 MB)
  • Score-first n-gram cache (backward-looking)
  • No min(NLL), no target-aware gating
  • Single seed included (additional seeds pending)

@MatoTeziTanka

Community Review — 11L + Multi-Order N-gram Backoff + Entropy-Adaptive Alpha

BPB: 0.6672 (seed 42, 3-seed mean 0.6678) | Seeds: 3 | Artifact: 15,025,238 bytes (93.9% budget) | Compliance: FLAG (n-gram cache)

What this does: Trains an 11L/512d/GQA model that converges to an honest post-quant val_bpb of 1.1577, then at eval time interpolates the neural prediction with a multi-order (2–7) hashed n-gram cache and reports 0.6672. The entire −0.49 BPB gap between 1.1577 and 0.6672 comes from the n-gram cache.

What I found in the code (SHA 7b786ba0b3dc778d02f179be697a020623a2da36, records/track_10min_16mb/2026-03-25_NgramCache_EntropyAdaptive_0.6672/train_gpt.py):

  • Line 967: the target token is hashed into the lookup key:
    tgt_np = val_np[jv].astype(np.uint64)
    full_key = ((ctx_hash ^ (tgt_np * ng_primes[ctx_w % len(ng_primes)])) & ng_mask).astype(np.int64)
    This is the same full_key = ctx_hash ^ (target * prime) pattern that @valerio-oai ruled disallowed on PR #779 (Record: BackoffNgramMixer + Drift-Free TTT, 3-seed mean val_bpb=0.6683).
  • Line 975: full_counts = full_tables[oi][full_key] — the per-token prediction at position t queries a bucket whose identity is determined by the gold token val_np[jv] at that same position t. Whether the bucket holds a nonzero count (from any prior increment at line 993 for any earlier occurrence of the same (context, target) pair anywhere in the corpus) directly separates "correct" from "incorrect" guesses for the current token.
  • Lines 988–993: the table update is deferred until after scoring ("Score-first: update tables AFTER scoring"), so the author reads this as backward-looking. But the leak is not about the timing of the current token's increment — it is about the fact that the key that gets read at line 975 is a function of the current gold token, so the cache's content at that key implicitly encodes "has this exact (context, target) been seen in the past." Per @valerio-oai's mechanism explanation in #779 comment 4146407380, that is oracle information about token t.
  • train_seed42.log lines 88 & 92: final_int6_roundtrip val_bpb: 1.1577; final_int6_sliding_window val_bpb: 0.6672. The neural artifact alone is 1.1577; the entire −0.4905 BPB improvement comes from the n-gram pass.
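The leak mechanism described above can be reproduced in a toy sketch. The prime, bucket count, and context hash here are made up for illustration; only the key construction mirrors the quoted target-in-key pattern:

```python
import numpy as np

BUCKETS = 1 << 12
PRIME = np.uint64(1000003)          # illustrative; not the PR's actual prime
MASK = np.uint64(BUCKETS - 1)
table = np.zeros(BUCKETS, dtype=np.int64)

def full_key(ctx_hash, token):
    # The token being scored is folded into the bucket index --
    # this is the flagged target-in-key construction.
    return int((np.uint64(ctx_hash) ^ (np.uint64(token) * PRIME)) & MASK)

ctx_hash = 12345
table[full_key(ctx_hash, 7)] += 1   # earlier occurrence of (context, target=7)

# Later, at a position whose gold token is 7 under the same context, the key
# that gets read is a function of that gold token, so the bucket count (1 vs 0)
# separates the correct token from other candidates -- even though the current
# position's own increment is deferred until after scoring.
assert table[full_key(ctx_hash, 7)] == 1
assert table[full_key(ctx_hash, 8)] == 0
```

Deferring the increment makes the cache score-first in time, but the read at position t still conditions on x_t through the key itself.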

Family-bug cluster: The README explicitly credits the n-gram cache and entropy-adaptive alpha to PR #702 / #727 (lukacf). #702 was created the same day (2026-03-25) and appears to be the parent; PRs #798, #808, and #825 are later siblings that were reviewed under the same ruling. #770 is an early sibling, not the root — but it uses the same target-in-key mechanism.

Questions/flags for @minh-stakc:

  1. Per @valerio-oai's ruling on PR #779 (Record: BackoffNgramMixer + Drift-Free TTT; comment 4145781641, 2026-03-27), hashed n-gram caches that hash the target token into the lookup key are disallowed for leaking eval tokens. The mechanism is explained in comment 4146407380. Can you confirm whether the line-967 full_key is the same construction covered by that ruling?
  2. Per Issue #1017 (A Field Guide to Valid Submissions), condition 1: "p_t may depend only on the artifact and x_1...x_{t−1}." Because full_key at position t is a function of val_np[jv] (the gold token x_t), the probability you mix in at line 986 depends on x_t, not only on x_1..x_{t−1}. Do you read condition 1 differently?
  3. Would you be open to porting the entropy-adaptive alpha idea and the multi-order backoff structure to a context-only key (drop tgt_np from full_key and use a full-vocab re-weighting from a single context table)? That would preserve the engineering contribution — the adaptive alpha schedule, the backoff cascade, the int6+GPTQ-lite+zstd-22 pipeline — while moving it onto a legal base.
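A context-only variant of the kind suggested in question 3 might look like the following. This is a hypothetical sketch, not the PR's code: the table shape, function names, and blending details are invented; the point is that the bucket index depends only on the context, and the full-vocabulary count vector is read before the gold token is used:

```python
import numpy as np

VOCAB, BUCKETS = 64, 1 << 12
table = np.zeros((BUCKETS, VOCAB), dtype=np.int64)  # per-context token counts

def ctx_key(ctx_hash):
    # Key depends on x_1..x_{t-1} only; the target never touches the index.
    return int(np.uint64(ctx_hash) & np.uint64(BUCKETS - 1))

def blended_probs(model_probs, ctx_hash, alpha):
    counts = table[ctx_key(ctx_hash)].astype(np.float64)
    total = counts.sum()
    if total == 0:
        return model_probs            # cache miss: fall back to the LM alone
    ngram_probs = counts / total
    p = (1.0 - alpha) * model_probs + alpha * ngram_probs
    return p / p.sum()

def update(ctx_hash, gold_token):
    # Score-first discipline: call only AFTER x_t has been scored.
    table[ctx_key(ctx_hash), gold_token] += 1
```

Since the whole distribution over candidates is produced before x_t is consulted, p_t is a function of the artifact and x_1..x_{t−1} only, which is what condition 1 of #1017 asks for.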

What I'm not flagging: The base model itself is sound. The CPU gauntlet passes cleanly — imports OK, 26,993,756 params, forward loss 6.9338, artifact 3.62 MB after int6+lzma on the gauntlet stub (22.6% of budget; the real 15.0 MB in the log includes GPTQ-lite calibration tensors the stub skips), estimated ~13k steps on 8xH100 in 10 min. The quantization stack (Int6 per-row + GPTQ-lite + 3% magnitude pruning + zstd-22), the OrthoInit, the Partial RoPE + XSA last-4 + VE128 architecture, the late-QAT schedule, and the 3-seed discipline are all real engineering worth preserving. This review is strictly about the n-gram cache mechanism, not about the base model.

Verdict: COMPLIANCE FLAG — family-bug n-gram cache (target-in-key). Base artifact BPB is 1.1577, not 0.6672.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE, pending author resubmission with a context-only n-gram cache or with the n-gram mixer removed (base 1.1577 BPB is still a legitimate non-record submission for the architecture/quantization work). The same ruling already applied to #779, #798, #808, and #825 should apply here; per the README credits, #702 / #727 (lukacf) is the upstream parent of this mechanism and should also be evaluated.


Reviewed by @MatoTeziTanka (The Agora). CPU gauntlet: PASS (import OK, forward loss 6.9338, 26.99M params, artifact 22.6% of budget on gauntlet stub). AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 7b786ba0b3dc778d02f179be697a020623a2da36.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>