
Record: Order-Adaptive Entropy Gating + BackoffNgramMixer (val_bpb=0.5466)#798

Open
travispchen wants to merge 1 commit into openai:main from travispchen:oaeg-backoff-ngram

Conversation

@travispchen

Order-Adaptive Entropy Gating + BackoffNgramMixer + Drift-Free TTT

val_bpb: 0.5466 (3-seed mean, std 0.0010) | ~15.99 MB | 8×H100 SXM

Adds order-adaptive entropy gating on top of PR #779's BackoffNgramMixer + Drift-Free TTT submission. Instead of using a single entropy center for all n-gram orders, each order gets its own threshold — higher orders are trusted at lower entropy, lower orders only kick in when the model is more uncertain.

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | Pre-TTT bpb | Post-TTT bpb | TTT gain | TTT time | Artifact |
|------|----------|-------|-------------|--------------|----------|----------|----------|
| 1337 | 99.3ms | 5,863 | 1.1279 | 0.5478 | -0.5801 | 607s | 15,995,959 |
| 42 | 98.3ms | 5,863 | 1.1362 | 0.5458 | -0.5904 | 606s | 15,979,251 |
| 2025 | 99.2ms | 5,869 | 1.1369 | 0.5463 | -0.5906 | 607s | 15,994,227 |
| Mean | 98.9ms | 5,865 | 1.1337 | 0.5466 (std 0.0010) | -0.5871 | ~607s | |

What Changed vs PR #779

PR #779 uses a single entropy_center=3.5 for all n-gram orders. We replace this with per-order entropy centers:

# PR #779 (single entropy center for all orders)
alpha = alpha_min + (alpha_max - alpha_min) * sigmoid(2.0 * (entropy - 3.5))

# This submission (per-order entropy centers)
ent_centers = {7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}
ent_center = ent_centers[matched_order]
alpha = alpha_min + (alpha_max - alpha_min) * sigmoid(2.0 * (entropy - ent_center))

Higher-order n-grams (7, 6, 5) are trusted at lower model entropy — when the model is fairly confident, the precise n-gram correction refines the prediction. Lower-order n-grams (4, 3, 2) only intervene at higher entropy — when the model is confused enough that even coarse statistics help.
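The gating above can be sketched as a small self-contained function. The entropy centers are the PR's values; `ALPHA_MIN`/`ALPHA_MAX` and the function names are assumptions for illustration, not the submission's actual constants:

```python
import math

# Per-order entropy centers from the PR; alpha bounds are assumed for this sketch.
ENT_CENTERS = {7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}
ALPHA_MIN, ALPHA_MAX = 0.0, 0.5

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def mixing_alpha(entropy: float, matched_order: int) -> float:
    """Weight on the n-gram correction, given the model's predictive entropy
    and the longest n-gram order that matched the current context."""
    center = ENT_CENTERS[matched_order]
    return ALPHA_MIN + (ALPHA_MAX - ALPHA_MIN) * sigmoid(2.0 * (entropy - center))
```

At a fixed entropy of 3.5, a matched 7-gram (center 3.0) gets a larger alpha than a matched 2-gram (center 4.5), which is exactly the "trust high orders at low entropy" behavior described above.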

This is an eval-time-only change. It modifies how existing n-gram statistics are combined with neural predictions, not when data enters the cache. The n-gram cache is still updated strictly AFTER scoring each batch (score-first).
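The score-first discipline can be made concrete with a minimal loop (all names here are hypothetical, not the PR's actual code): each batch is scored against the cache state produced by prior batches only, and its own tokens enter the cache strictly afterwards.

```python
def evaluate_score_first(batches, score_batch, update_cache):
    """Score each batch BEFORE its tokens enter the n-gram cache,
    so no batch can ever be predicted from its own (future) contents."""
    total, n = 0.0, 0
    for batch in batches:
        total += score_batch(batch)   # sees cache state from prior batches only
        n += 1
        update_cache(batch)           # batch enters the cache strictly after scoring
    return total / n
```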

Legality

  • Score-first: N-gram cache updated AFTER scoring each batch. No future tokens leak into predictions.
  • No oracle selection: Alpha depends only on model entropy and n-gram order, not on ground truth.
  • Artifact size: All seeds strictly under 16,000,000 bytes (max: 15,995,959).
  • Training time: Capped at 600s (10 min) on 8×H100 (actual: ~582s).
  • Eval time: TTT eval ≤607s on 8×H100.

Ablation

| Change | Post-TTT bpb | Delta |
|--------|--------------|-------|
| PR #779 baseline (single entropy center) | 0.6713 | |
| + Order-adaptive entropy gating | 0.5478 | -0.1235 |

Credits

  • BackoffNgramMixer + Drift-Free TTT + Base model: PR #779
  • Order-adaptive entropy gating: This submission

…5466, 3-seed mean)

Adds order-adaptive entropy gating on top of PR openai#779's BackoffNgramMixer + Drift-Free TTT.
Per-order entropy centers replace single threshold: higher n-gram orders trusted at lower entropy.
3-seed validation: 0.5478, 0.5458, 0.5463 (mean 0.5466, std 0.0010).
All artifacts strictly under 16,000,000 bytes.

Co-Authored-By: Travis Chen <travispchen@gmail.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 26, 2026
Add per-order entropy centers from PR openai#798 insight:
  order 7: center=3.0, order 6: 3.2, order 5: 3.5,
  order 4: 3.8, order 3: 4.2, order 2: 4.5
Higher orders trusted at lower entropy, lower orders only at high
uncertainty. Cubric multipliers applied on top.

Original X-WING (0.5644) untouched in concepts/xwing/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 26, 2026
PR openai#798's approach on our engine: per-order entropy centers
(7:3.0, 6:3.2, 5:3.5, 4:3.8, 3:4.2, 2:4.5) without cubric.
Testing if cubric was hurting when combined with per-order gating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Order-Adaptive Entropy Gating + BackoffNgramMixer

BPB: 0.5466 (3-seed, std 0.0010) | Seeds: 3 | Artifact: 15,995,959 bytes (seed 1337 max) | Compliance: FLAG — inherits PR #779's disallowed mechanism

What this does (factually): Adds per-order entropy thresholds on top of PR #779's BackoffNgramMixer. The base mechanism is unchanged — a hashed n-gram cache (orders 2–7, 4.19M buckets, 7 primes) mixed with neural logits via an entropy-gated alpha. The only delta vs PR #779 is that the sigmoid center is now a function of best_order instead of a fixed 3.5 (train_gpt.py:93).
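For readers unfamiliar with the base mechanism, a hashed n-gram cache of the kind described here can be sketched as follows. All names and values below are illustrative assumptions (except the ~4.19M bucket count and per-order primes, which follow the review's description), and the key is built from context tokens only; the compliance flag concerns a variant that also hashes the target token into the lookup key:

```python
NUM_BUCKETS = 1 << 22            # ~4.19M buckets, per the review's description
MASK = NUM_BUCKETS - 1
PRIMES = {2: 10007, 3: 10009, 4: 10037, 5: 10039, 6: 10061, 7: 10067}

counts = {}  # sparse table: bucket index -> {target token: count}

def ctx_hash(context, order):
    """Hash the last (order - 1) CONTEXT tokens into a bucket index.
    The target token must not participate in this key."""
    h = 0
    for tok in context[-(order - 1):]:
        h = (h * PRIMES[order] + tok) & MASK
    return h

def update(context, target, order):
    bucket = counts.setdefault(ctx_hash(context, order), {})
    bucket[target] = bucket.get(target, 0) + 1

def lookup(context, order):
    return counts.get(ctx_hash(context, order), {})
```

During scoring, `lookup` returns per-token counts for the matched context, which are normalized and blended with the neural logits via the entropy-gated alpha.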


Verdict: COMPLIANCE FLAG — inherits the disallowed PR #779 mechanism.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE (or NEEDS AUTHOR ACTION), same ruling as #779. The order-adaptive entropy gating is a thoughtful and well-ablated delta in its own right (a 0.1235 bpb improvement over the #779 baseline per the README ablation), and on a legal base-model stack I'd expect it to be a genuinely useful eval-time trick. If @travispchen wants to resubmit with the hashed n-gram cache removed — either replacing it with a reweighting over the full vocab (as @valerio-oai suggested on #779) or dropping the mixer entirely and keeping just the Drift-Free TTT + Polyak averaging — the entropy-gating idea itself should port cleanly.

@travispchen — please let me know if my reading of the code is wrong here, especially around the full_key lookup at train_gpt.py:119 and whether the per-order threshold change at L93 is meant to affect anything other than the mixing weight. First submission in this area is no small thing and the ablation work is clean; the only issue is that the base mechanism was ruled on the day after you opened this PR, so the timing is unlucky rather than anything about the care in your submission.


Reviewed by @MatoTeziTanka (The Agora). CPU gauntlet PASS (import/forward/artifact all green); flag is solely inherited-mechanism compliance per the #779 ruling. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA ab9681a7cb21df7b5994b5c8131dea4dc96a0684.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
