Record: Order-Adaptive Entropy Gating + BackoffNgramMixer (val_bpb=0.5466)#798
travispchen wants to merge 1 commit into openai:main
Conversation
…5466, 3-seed mean) Adds order-adaptive entropy gating on top of PR openai#779's BackoffNgramMixer + Drift-Free TTT. Per-order entropy centers replace single threshold: higher n-gram orders trusted at lower entropy. 3-seed validation: 0.5478, 0.5458, 0.5463 (mean 0.5466, std 0.0010). All artifacts strictly under 16,000,000 bytes. Co-Authored-By: Travis Chen <travispchen@gmail.com>
Add per-order entropy centers from PR openai#798 insight: order 7: center=3.0, order 6: 3.2, order 5: 3.5, order 4: 3.8, order 3: 4.2, order 2: 4.5 Higher orders trusted at lower entropy, lower orders only at high uncertainty. Cubric multipliers applied on top. Original X-WING (0.5644) untouched in concepts/xwing/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#798's approach on our engine: per-order entropy centers (7:3.0, 6:3.2, 5:3.5, 4:3.8, 3:4.2, 2:4.5) without cubric. Testing if cubric was hurting when combined with per-order gating. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — Order-Adaptive Entropy Gating + BackoffNgramMixer

BPB: 0.5466 (3-seed, std 0.0010) | Seeds: 3 | Artifact: 15,995,959 bytes (seed 1337 max) | Compliance: FLAG — inherits PR #779's disallowed mechanism

What this does (factually): Adds per-order entropy thresholds on top of PR #779's BackoffNgramMixer.

What I found in the code:
Questions/flags:
Verdict: COMPLIANCE FLAG — inherits the disallowed PR #779 mechanism.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE (or NEEDS AUTHOR ACTION), same ruling as #779. The order-adaptive entropy gating is a thoughtful and well-ablated delta in its own right (+0.1235 bpb over the #779 baseline per the README ablation), and on a legal base-model stack I'd expect it to be a genuinely useful eval-time trick. If @travispchen wants to resubmit with the hashed n-gram cache removed — either replacing it with a reweighting over the full vocab (as @valerio-oai suggested on #779) or dropping the mixer entirely and keeping just the Drift-Free TTT + Polyak averaging — the entropy-gating idea itself should port cleanly.

@travispchen — please let me know if my reading of the code is wrong here, especially around the

Reviewed by @MatoTeziTanka — The Agora. CPU gauntlet PASS (import/forward/artifact all green); flag is solely inherited-mechanism compliance per the #779 ruling.

AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf. Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): `full_key = ((ctx_hash ^ (target * primes[k])) & mask)` — the target token is hashed into the eval-cache lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).
- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window delta+logit_bias optimized N steps against `(per_token_nll * mask)` where `mask` = scored positions `[s:wlen]`. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD; openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores) as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0, deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments via gh api PATCH to add the rerun results. Coverage went from 9/20 to 14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254 fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
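For readers skimming the cluster finding, here is a minimal sketch of why the quoted key construction leaks eval information. Only the `full_key` formula comes from the review; the prime constants and 24-bit mask width are placeholders chosen for illustration.

```python
# Placeholder constants -- the review quotes only the key formula,
# not the actual primes or mask used in the flagged PRs.
PRIMES = [1000003, 1000033, 1000037, 1000039, 1000081, 1000099]
MASK = (1 << 24) - 1

def full_key(ctx_hash: int, target: int, k: int) -> int:
    """The flagged pattern: the *target* token is mixed into the
    eval-cache lookup key alongside the context hash."""
    return (ctx_hash ^ (target * PRIMES[k])) & MASK
```

Because `target` enters the key, each candidate next token for the same context maps to its own cache slot, so a lookup hit at scoring time reveals that this exact (context, target) pair was seen before — which is the mechanism ruled illegal on #779.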
Order-Adaptive Entropy Gating + BackoffNgramMixer + Drift-Free TTT
val_bpb: 0.5466 (3-seed mean, std 0.0010) | ~15.99 MB | 8×H100 SXM
Adds order-adaptive entropy gating on top of PR #779's BackoffNgramMixer + Drift-Free TTT submission. Instead of using a single entropy center for all n-gram orders, each order gets its own threshold — higher orders are trusted at lower entropy, lower orders only kick in when the model is more uncertain.
Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
What Changed vs PR #779
PR #779 uses a single `entropy_center=3.5` for all n-gram orders. We replace this with per-order entropy centers:

- order 7: center=3.0
- order 6: 3.2
- order 5: 3.5
- order 4: 3.8
- order 3: 4.2
- order 2: 4.5

Higher-order n-grams (7, 6, 5) are trusted at lower model entropy — when the model is fairly confident, the precise n-gram correction refines the prediction. Lower-order n-grams (4, 3, 2) only intervene at higher entropy — when the model is confused enough that even coarse statistics help.
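A minimal sketch of the gating idea. The per-order centers are taken from the commit messages above; the Gaussian bump around each center and the `width` parameter are illustrative assumptions — the PR specifies only the centers, not the exact gating curve.

```python
import math

# Per-order entropy centers from the PR description (n-gram order -> center, nats).
ENTROPY_CENTERS = {7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}

def shannon_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def order_gates(probs, width=1.0):
    """Per-order gate weight in (0, 1] as a function of model entropy.

    Assumed form: each order's weight peaks when the model's entropy
    sits at that order's center, so low-center (high) orders dominate
    when the model is confident and high-center (low) orders dominate
    when it is uncertain.
    """
    h = shannon_entropy(probs)
    return {order: math.exp(-((h - center) / width) ** 2)
            for order, center in ENTROPY_CENTERS.items()}
```

With a peaked (confident) distribution the order-7 gate outweighs the order-2 gate; with a near-uniform distribution the ordering flips, matching the prose above.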
This is an eval-time-only change. It modifies how existing n-gram statistics are combined with neural predictions, not when data enters the cache. The n-gram cache is still updated strictly AFTER scoring each batch (score-first).
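The score-first ordering can be shown with a toy loop. The counting cache and all names here are invented for this sketch (not taken from the PR); the point is only that each batch is scored before its own tokens enter the cache.

```python
class NgramCache:
    """Toy cache: just counts how many tokens it has ingested."""
    def __init__(self):
        self.tokens_seen = 0

    def update(self, batch):
        self.tokens_seen += len(batch)

def score_first_eval(batches, cache):
    """Return, per batch, how many tokens were in the cache when that
    batch was scored. Score-first ordering means batch i can only ever
    see statistics from batches 0..i-1, never from itself."""
    seen_at_score = []
    for batch in batches:
        seen_at_score.append(cache.tokens_seen)  # "scoring" happens here
        cache.update(batch)                      # ingest only AFTER scoring
    return seen_at_score
```

Running this on three batches of sizes 3, 2, 1 records cache sizes 0, 3, 5 at scoring time — no batch's own tokens influence its score.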
Legality
Ablation
Credits