Record: 0.6360 BPB - Depth Recurrence + Multi-Order N-gram Backoff#808
Naazimsnh02 wants to merge 3 commits into openai:main from
Conversation
Interesting approach: the depth recurrence with layers 4,5 repeated for 13 virtual layers at zero parameter cost is creative, and the multi-GPU n-gram prefill fix (0.87 BPB without it vs. 0.64 with) is a good catch. You're at 2 seeds right now; the leaderboard requires 3-seed validation for record claims, so one more run should close it out. Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
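The depth-recurrence idea praised above (physical layers 4 and 5 replayed until 13 virtual layers have run, at zero extra parameter cost) can be sketched roughly as follows. This is a toy illustration under my own assumptions, not the PR's actual code: the block is a scalar residual update, and the layer names and schedule-building helper are invented for the example.

```python
import math
import random

class Block:
    """Toy stand-in for a transformer block: one scalar "parameter"."""
    def __init__(self, rng):
        self.gain = rng.uniform(0.5, 1.5)

    def __call__(self, x):
        return x + math.tanh(self.gain * x)   # residual update

def build_schedule(n_physical=6, recur_ids=(4, 5), virtual_depth=13):
    """Run physical layers 0..5 once, then replay layers 4 and 5 until
    13 virtual layers have executed: extra depth, zero extra params."""
    schedule = list(range(n_physical))
    i = 0
    while len(schedule) < virtual_depth:
        schedule.append(recur_ids[i % len(recur_ids)])
        i += 1
    return schedule

rng = random.Random(0)
blocks = [Block(rng) for _ in range(6)]   # only 6 parameter sets exist
schedule = build_schedule()               # but 13 layer applications run

x = 0.1
for layer_id in schedule:
    x = blocks[layer_id](x)

print(len(schedule), len(blocks))   # prints: 13 6
```

The point of the sketch is the accounting: the forward pass is 13 layers deep, but only 6 blocks' worth of parameters are ever allocated, which is why the technique costs nothing against a parameter budget.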
Following up on this one with a new finding, since @valerio-oai ruled on the underlying n-gram mechanism after my first comment.

Compliance flag: same disallowed pattern as PR #779. @valerio-oai disallowed PR #779 (deanbrr) on 2026-03-27 (comment 4145781641) specifically for "hashed n-gram caches, which do not renormalize correctly / correctly reweight the LM's token distribution, look ahead to the target token to mix probabilities and therefore leak eval tokens." The mechanism is spelled out in comment 4146407380: hashing the target token into the bucket key only reweights the correct token, and in the hash-collision limit drives P(correct) toward 1 regardless of the data, giving arbitrarily low BPB without real compression. Looking at the cache lookups in this PR, each of these hashes the ground-truth target token into the key in the same way.

The multi-GPU n-gram prefill mechanism the README describes (0.87 BPB without it → 0.64 with it) is also worth thinking about in this light: each rank pre-populates its hash tables with all tokens scored by earlier ranks, which means each rank's tables already contain the target tokens it is about to score. The 0.23 BPB gap between "prefill on" and "prefill off" is the size of the leak, not the size of a compression gain.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #779. @Naazimsnh02, please let me know if I've misread the code, especially the

Reviewed by @MatoTeziTanka — The Agora. Static code review against
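The leak described in the comment above can be demonstrated in a few lines. This is a toy sketch under my own assumptions (names, vocabulary size, and the cache layout are illustrative, not the PR's code); the key construction mirrors the shape of the pattern quoted in the #779 ruling, where the target token participates in the cache lookup key.

```python
primes = [1000003]
mask = (1 << 16) - 1

def bucket_key(ctx_hash, target):
    # The *target* token is hashed into the lookup key: this is the
    # disallowed shape, since the key now encodes the answer.
    return (ctx_hash ^ (target * primes[0])) & mask

# "Prefill": populate the cache from the eval sequence itself.
eval_tokens = [7, 42, 42, 7, 99]
cache = {}
for pos in range(1, len(eval_tokens)):
    ctx_hash = hash(tuple(eval_tokens[:pos])) & mask
    cache[bucket_key(ctx_hash, eval_tokens[pos])] = 1.0

# "Scoring" position 2: sweep all 256 candidate tokens. Modulo hash
# collisions, only the bucket keyed by the true target (42) exists in
# the cache, so any mixture that consults it upweights exactly the
# correct token -- low BPB without any real compression.
ctx_hash = hash(tuple(eval_tokens[:2])) & mask
hits = [t for t in range(256) if bucket_key(ctx_hash, t) in cache]
print(42 in hits)   # prints: True -- the ground-truth token always hits
```

Because the multiplier is odd, the 256 candidate keys are all distinct, so at most a handful of collision tokens can also hit; the guaranteed hit is always the ground-truth token, which is the whole problem.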
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf. Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash ^ (target * primes[k])) & mask) — the target token hashed into the eval-cache lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + the openai#764 follow-up. Upstream parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).
- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window delta+logit_bias optimized N steps against (per_token_nll * mask), where mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD; openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA megakernel triple loop).

Eval-budget gate (openai#915/openai#889, anthony-maio pair): clean ngram code, ~14.9 min ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores) as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0, deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments via gh api PATCH to add the rerun results. Coverage went from 9/20 to 14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254 fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
val_bpb: 0.6360 (seed 1337) | ~15.94 MB | 8×H100 SXM | 3 seeds
Adds multi-order n-gram backoff (orders 2-7) with entropy-adaptive alpha to the depth recurrence stack, achieving a new record.
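A minimal sketch of what multi-order backoff with entropy-adaptive mixing could look like, under my own assumptions: plain dict count tables stand in for whatever hashed tables the PR actually uses, and the helper names (train_tables, ngram_dist, mix) are invented for illustration. The alpha formula is the one given under Key contributions.

```python
import math
from collections import defaultdict

ORDERS = range(2, 8)   # n-gram orders 2..7, as in the summary above

def train_tables(tokens):
    """Count n-gram continuations for each order."""
    tables = {n: defaultdict(lambda: defaultdict(int)) for n in ORDERS}
    for n in ORDERS:
        for i in range(n - 1, len(tokens)):
            ctx = tuple(tokens[i - n + 1:i])
            tables[n][ctx][tokens[i]] += 1
    return tables

def ngram_dist(tables, context, vocab):
    """Back off from order 7 down to order 2: use the highest order
    whose context has actually been seen, else fall back to uniform."""
    for n in sorted(ORDERS, reverse=True):
        ctx = tuple(context[-(n - 1):])
        counts = tables[n].get(ctx)
        if counts:
            total = sum(counts.values())
            return [counts.get(t, 0) / total for t in range(vocab)]
    return [1.0 / vocab] * vocab

def mix(p_model, p_ngram):
    """Entropy-adaptive mixing: alpha = 0.05 + 0.55 * sigmoid(2*(H-4)),
    so a high-entropy (uncertain) neural model leans on the n-gram."""
    H = -sum(p * math.log2(p) for p in p_model if p > 0)
    alpha = 0.05 + 0.55 / (1 + math.exp(-2 * (H - 4.0)))
    return [(1 - alpha) * pm + alpha * pn
            for pm, pn in zip(p_model, p_ngram)]

tables = train_tables([1, 2, 3] * 4)           # tiny toy corpus
p_ngram = ngram_dist(tables, [1, 2], vocab=4)  # backs off to order 3
print(p_ngram)   # prints: [0.0, 0.0, 0.0, 1.0]
```

Note that this sketch builds its tables from training tokens only; per the compliance discussion in the conversation above, the target token must never enter the lookup key or the table contents at eval time.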
Key contributions
alpha = 0.05 + 0.55 * sigmoid(2 * (H − 4.0)) — trusts the n-gram more when the neural model is uncertain, and the neural model when confident.
Results
Built on PR #549 stack (LeakyReLU(0.5)², BigramHash(2048), XSA4, Partial RoPE, LN Scale, VE128, EMA+SWA, Parameter Banking + Parallel Muon, int6 GPTQ-lite + lzma).
Credits