Record: 11L + Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.6672)#770
minh-stakc wants to merge 2 commits into openai:main from
Conversation
Community Review — 11L + Multi-Order N-gram Backoff + Entropy-Adaptive Alpha

BPB: 0.6672 (seed 42, 3-seed mean 0.6678) | Seeds: 3 | Artifact: 15,025,238 bytes (93.9% budget) | Compliance: FLAG (n-gram cache)

What this does: Trains an 11L/512d/GQA model that converges to an honest post-quant val_bpb of 1.1577, then at eval time interpolates the neural prediction with a multi-order (2–7) hashed n-gram cache and reports 0.6672. The entire −0.49 BPB gap between 1.1577 and 0.6672 comes from the n-gram cache.

What I found in the code (SHA
Family-bug cluster: The README explicitly credits the n-gram cache and entropy-adaptive alpha to PR #702 / #727 (lukacf). #702 was created the same day (2026-03-25) and appears to be the parent; PRs #798, #808, and #825 are later siblings that were reviewed under the same ruling. #770 is an early sibling, not the root — but it uses the same target-in-key mechanism.

Questions/flags for @minh-stakc:
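To make the flagged mechanism concrete, here is a minimal sketch contrasting a target-in-key lookup with a context-only one. The constants and names (ctx_hash, primes, mask) mirror the pattern quoted in the cluster review; this is illustrative, not the PR's actual code.

```python
# Hypothetical sketch of the flagged lookup key vs. a compliant one.
PRIMES = [1000003, 10000019, 100000007]   # one prime per n-gram order k
MASK = (1 << 20) - 1                      # hash-table size: 2^20 slots

def flagged_key(ctx_hash: int, target: int, k: int) -> int:
    # ILLEGAL: the target (next) token is folded into the lookup key, so a
    # cache hit at eval time already encodes the answer being predicted.
    return (ctx_hash ^ (target * PRIMES[k])) & MASK

def context_only_key(ctx_hash: int, k: int) -> int:
    # COMPLIANT: the key depends only on the preceding context; next-token
    # statistics must come from the cached value, not from the key itself.
    return (ctx_hash ^ PRIMES[k]) & MASK
```

Under the flagged scheme, changing the candidate target changes the key, which is exactly the label leak the ruling on #779 describes; a context-only cache storing next-token counts avoids it.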
What I'm not flagging: The base model itself is sound. The CPU gauntlet passes cleanly — imports OK, 26,993,756 params, forward loss 6.9338, artifact 3.62 MB after int6+lzma on the gauntlet stub (22.6% of budget; the real 15.0 MB in the log includes GPTQ-lite calibration tensors the stub skips), estimated ~13k steps on 8xH100 in 10 min. The quantization stack (Int6 per-row + GPTQ-lite + 3% magnitude pruning + zstd-22), the OrthoInit, the Partial RoPE + XSA last-4 + VE128 architecture, the late-QAT schedule, and the 3-seed discipline are all real engineering worth preserving. This review is strictly about the n-gram cache mechanism, not about the base model.

Verdict: COMPLIANCE FLAG — family-bug n-gram cache (target-in-key). Base artifact BPB is 1.1577, not 0.6672.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE, pending author resubmission with a context-only n-gram cache or with the n-gram mixer removed (base 1.1577 BPB is still a legitimate non-record submission for the architecture/quantization work). The same ruling already applied to #779, #798, #808, and #825 should apply here; per the README credits, #702 / #727 (lukacf) is the upstream parent of this mechanism and should also be evaluated.

Reviewed by @MatoTeziTanka — The Agora.

CPU gauntlet: PASS (import OK, forward loss 6.9338, 26.99M params, artifact 22.6% of budget on gauntlet stub).

AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf. Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash ^ (target * primes[k])) & mask) — target token hashed into the eval-cache lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).
- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window delta+logit_bias optimized N steps against (per_token_nll * mask) where mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD; openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores) as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0, deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments via gh api PATCH to add the rerun results. Coverage went from 9/20 to 14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254 fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
val_bpb: 0.6672 (seed 42) | 15.0 MB artifact | 1xB200 (HiPerGator)
Technique
Base 11L SOTA architecture with eval-time multi-order n-gram cache interpolation.
Key innovations
Multi-order backoff (orders 2-7): Highest order first, cascade down on miss. Captures repeated document patterns outside the transformer's context window.
Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)). When the model is uncertain (high entropy), trust n-gram statistics more; when confident, trust the LM.

Compliance
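The two mechanisms listed under Key innovations can be sketched together as follows. This is a minimal illustration under assumed conventions (entropy H in bits, a miss represented as None); the function names are mine, not the PR's.

```python
import math

def shannon_entropy(probs):
    # Entropy H (in bits) of the LM's next-token distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def adaptive_alpha(H: float) -> float:
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)): near 0.05 when the LM is
    # confident (low entropy), rising toward 0.60 when it is uncertain.
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))

def mix(lm_probs, ngram_probs_by_order):
    # Multi-order backoff: try the highest order first, cascade down on a
    # miss (None), then interpolate the first hit with the LM prediction.
    alpha = adaptive_alpha(shannon_entropy(lm_probs))
    for order in sorted(ngram_probs_by_order, reverse=True):
        ng = ngram_probs_by_order[order]
        if ng is not None:
            return [(1 - alpha) * p + alpha * q for p, q in zip(lm_probs, ng)]
    return lm_probs  # every order missed: fall back to the pure LM
```

At H = 4.0 bits the sigmoid is at its midpoint, so alpha = 0.05 + 0.275 = 0.325; a sharply peaked LM distribution drives alpha toward the 0.05 floor.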
Results
Architecture
11L, 512d, 8H/4KV GQA, MLP 3x, XSA4, Partial RoPE, LN Scale, VE128, SmearGate, BigramHash(2048), EMA(0.997), Late QAT, OrthoInit. Int6+GPTQ-lite+3% pruning+zstd-22.
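For reference, a minimal sketch of just the per-row Int6 step of the quantization stack (the full pipeline described above also applies GPTQ-lite, 3% magnitude pruning, and zstd-22 compression, all omitted here; function names are mine):

```python
import numpy as np

def int6_quant_per_row(W: np.ndarray):
    # Symmetric per-row int6: each row gets its own scale so its largest
    # magnitude maps to the 6-bit signed extreme +/-31.
    scales = np.abs(W).max(axis=1, keepdims=True) / 31.0
    scales = np.where(scales == 0.0, 1.0, scales)   # guard all-zero rows
    q = np.clip(np.round(W / scales), -31, 31).astype(np.int8)
    return q, scales

def int6_dequant(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Reconstruct float weights; per-row error is bounded by scale/2.
    return q.astype(np.float32) * scales
```

Per-row (rather than per-tensor) scales keep the quantization error proportional to each row's own magnitude, which matters when row norms vary widely across a layer.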
Reproduction
Credits
PR #414 (signalrush), PR #315 (jfprincz), PR #702/#727 (lukacf)
Test plan