
Podracing II: Electric Bugaloo — 0.9625 BPB (3-seed mean, all sub-0.964)#753

Open
newjordan wants to merge 8 commits into openai:main from newjordan:submission/podracing-ii

Conversation


@newjordan newjordan commented Mar 25, 2026

podracing

Results

| Seed | Sliding BPB | 7-gram Backoff BPB | Artifact |
|------|-------------|--------------------|----------|
| 42   | 1.1210      | 0.9631             | 15.59 MB |
| 2045 | 1.1196      | 0.9620             | 15.71 MB |
| 7    | 1.1202      | 0.9624             | 15.59 MB |
| Mean | 1.1203      | 0.9625             |          |

Progression

| PR | Mean BPB | Notes |
|----|----------|-------|
| #190 | | The Stinky Frost Recipe |
| #390, #401 | 1.1295, 1.1243 | Sponge Bath TTT + EMA/SWA/QAT |
| #445 | 1.1236 | Late Training Replay + GPTQ-lite |
| #498, #499 | 1.1378 | The Frugendorff |
| #508, #578 | 1.1215 | GPTQ + Early QAT + Legal TTT |
| #533, #577 | 1.1207 | GPTQ + Short TTT |
| #587 | 1.1208 | XSA + quantization tuning |
| #656 | 1.1195 | Three Breadsticks |
| #706 | 1.0461 | Podracing I (fixed 5-gram) |
| #753 | 0.9625 | Podracing II (backoff + adaptive) |

What Changed vs Podracing I

Two eval-time improvements, no training changes:

  1. Multi-order backoff (2-7): longest context first, cascade on miss
  2. Entropy-adaptive alpha: trust n-gram more when model is uncertain
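A minimal sketch of the two mechanisms together. The alpha bounds [0.05, 0.60] and the linear probability mix come from the code; the sigmoid center/scale and the function names here are illustrative assumptions (a derived PR mentions entropy_center=3.0), not the actual implementation:

```python
import numpy as np

# Alpha bounds are from the PR; the sigmoid center/scale are ASSUMED values.
ALPHA_MIN, ALPHA_MAX = 0.05, 0.60
ENT_CENTER, ENT_SCALE = 3.0, 1.0

def entropy_adaptive_alpha(logits: np.ndarray) -> np.ndarray:
    """Per-token mixing weight from model entropy only -- no target access."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    ent = -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)  # nats
    gate = 1.0 / (1.0 + np.exp(-(ent - ENT_CENTER) / ENT_SCALE))
    return ALPHA_MIN + (ALPHA_MAX - ALPHA_MIN) * gate

def mix(p_model: np.ndarray, p_ngram: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Linear mix in probability space: trust the n-gram more when uncertain."""
    return (1.0 - alpha) * p_model + alpha * p_ngram
```

A peaked (confident) distribution yields alpha near the 0.05 floor; a near-uniform one pushes it toward the 0.60 ceiling.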

Compliance

  • Score-first, backward-looking cache
  • Alpha from model entropy only — no target access
  • GPTQ calibration inside training phase
  • Training logs + submission.json included

Credits

Reproduce

```shell
SEED=2045 MLP_ACT=leaky_relu_sq MLP_LEAKY_SLOPE=0.5 XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=1536 ROPE_DIMS=24 NGRAM_EVAL_ORDER=7 NGRAM_EVAL_ADAPTIVE=1 TTT_EVAL_ENABLED=0 torchrun --nproc_per_node=8 train_gpt.py
```

8xH100 SXM, 600s training + ~140s eval.

Multi-order backoff (2-7) + entropy-adaptive alpha on 11L/512d U-Net.
All 3 seeds sub-1.0. GPTQ calibration inside training phase.

Seeds: 42=0.9631, 2045=0.9620, 7=0.9624, mean=0.9625

Credits: @deanbrr openai#659, @Asukabot0 openai#727, @signalrush openai#414

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@newjordan newjordan force-pushed the submission/podracing-ii branch from f9f804a to ed062df Compare March 25, 2026 18:04
@newjordan newjordan changed the title Podracing II: Electric Bugaloo — 0.9620 BPB (best seed), mean 0.9823 Podracing II: Electric Bugaloo — 0.9625 BPB (3-seed mean, all sub-0.964) Mar 25, 2026
Octavian and others added 7 commits March 25, 2026 13:24
ZERO changes to model, training loop, optimizer, compile, or anything
outside the eval function. The C-step is pure numpy on CPU.

Patch adds:
- 5 env vars (CUBRIC_CADENCE, COUNT_DECAY, BOOST/PRUNE/REWEIGHT)
- _cubric_c_step() function (numpy, CPU-only)
- Buffering + firing logic inside eval_val_sliding_hashed_ngram
- Training path is byte-identical to train_gpt.py

Usage: CUBRIC_CADENCE=4 to enable, CUBRIC_CADENCE=0 (default) = off

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests order (8,9), buckets (8M,16M), min_count (1,3), alpha range,
entropy sigmoid params. All eval-time, no training changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No more copies. Cubric env vars + C-step function + eval wiring added
directly to the production script. CUBRIC_CADENCE=0 (default) = off,
identical to original. Run script points to real train_gpt.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0.9625 mean BPB. Backoff 2-7 + entropy-adaptive alpha.
Three identical copies for safety.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pure deletion — 166 lines of dead code removed, zero functional change.
TTT eval was gated behind `if args.ttt_eval_enabled:` which was always False.
The function `eval_val_sliding_ttt` and all TTT parameter parsing removed.
N-gram backoff eval, GPTQ, and all scoring paths unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SOTA untouched. Each test is a separate copy:
- train_gpt_baseline.py (clean SOTA copy, control)
- train_gpt_cadence4.py (SOTA + cubric C-step, cadence=4)
- train_gpt_cadence10.py (SOTA + cubric C-step, cadence=10)

Each has its own run script. HYPOTHESES.md documents everything.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
travispchen added a commit to travispchen/parameter-golf that referenced this pull request Mar 25, 2026
…ed mean)

N-gram7 BPB: 0.9370 (±0.0003) across seeds 1337/42/2025
Sliding BPB: 1.1222 (±0.0003)
Artifact: ~15.9 MB (within 16MB cap)
Training: 600s on 8xH100

Key innovation: order-adaptive entropy gating assigns different
entropy thresholds per n-gram order. High-order matches (7-gram)
trusted at moderate model confidence; low-order matches (2-gram)
only trusted when model is very uncertain.

Built on PR openai#753 (Podracing II) with XSA extended to all 11 layers
and entropy_center=3.0.

Co-Authored-By: Travis Chen <travispchen@gmail.com>
ahmettrkck added a commit to ahmettrkck/parameter-golf that referenced this pull request Mar 25, 2026
Logistic domain mixing was wrong for target-probability mixing.
PR openai#753 uses linear: p_mixed = (1-a)*p_neural + a*p_ngram.
Keep CTW-inspired depth-adaptive alpha boost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 25, 2026
Per-order adaptive alpha scaling on legal score-first 7-gram backoff.
Tracks per-order beat rate on already-scored tokens, suppresses noisy
low orders (2-3 → 0.3x alpha), boosts accurate high orders (5-7 → 2.0x).

Results (seeds 2045/43/300):
  Sliding BPB (no n-gram): 1.1198 mean
  Cubric n-gram BPB: 0.9362 mean (0.9357/0.9362/0.9365)
  Artifact: 15.59 MB (int6+zstd)

0.026 BPB improvement over Podracing II (openai#753, 0.9625).
Original contribution: per-order adaptive alpha scaling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 26, 2026
Per-order adaptive alpha scaling on score-first 7-gram backoff.
Orders 2-3 suppressed to 0.3x, orders 5-7 boosted to 2.0x.
0.026 BPB improvement over PR openai#753 (0.9625).

Pending: multi-seed verification + zstd compression check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 26, 2026
Per-order adaptive alpha scaling on score-first 7-gram backoff.
Seeds 2045=0.9357, 43=0.9362, 300=0.9365. Mean=0.9362.
0.026 BPB improvement over PR openai#753 (0.9625).

Logs, submission.json, README included.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 26, 2026
…ual hash tables, per-window score-first, entropy-adaptive alpha, tc>0 check)
@MatoTeziTanka

Community Review — Podracing II: Electric Bugaloo (n-gram backoff + entropy-adaptive alpha)

BPB: 0.9625 (3-seed mean) | Seeds: 3 (42, 2045, 7) | Artifact: ~15.59-15.71 MB | Compliance: FLAG (open question on hashed n-gram family-bug)

What this does: Eval-time hybrid: at each scored token, mix the model probability with a backward-looking hashed n-gram cache estimate, using a per-token alpha derived from model entropy. The cache is multi-order (orders min_order..max_order, default 2..7) with longest-context-first backoff. No training-side changes vs predecessors. Score-first TTT (eval_val_sliding_ttt) is also wired in but the headline submission script disables it (TTT_EVAL_ENABLED=0).

What I found in the code (head SHA 8a59150):

  • eval_val_sliding_hashed_ngram at train_gpt.py:970-1192 is the hot path. Per token:
    • Score is computed at L1107-1115 using ctx_tables[n][ctx_key] and full_tables[n][full_key].
    • The cache is updated at L1126-1140 after the segment has been scored, via np.add.at(ctx_tables[n], ctx_key, 1) and np.add.at(full_tables[n], full_key, 1). So per-token temporal ordering looks score-before-update.
    • Cache tables are allocated once per eval call (L1020-1021) and persist across all sliding windows in that rank. With stride overlap, only the new stride suffix of each window is scored (L1067 s = max(wlen - stride, 0)), so each token is scored exactly once globally before its np.add.at update.
  • The n-gram probability is min(full_counts, ctx_counts) / max(ctx_counts, 1) (L1111). Both full_key and ctx_key are hashed to buckets = 4_194_304 (L1022, default NGRAM_EVAL_BUCKETS=4_194_304).
  • The mix is p = (1 - a) * p_model + a * p_ngram per token (L1121), with a from a sigmoid on per-token entropy bounded to [alpha_min=0.05, alpha_max=0.60] (L1075-1083). Alpha uses only logits_f[i, s:wlen], no y_batch access — entropy is computed from model output only.
  • Submission run: NGRAM_EVAL_ORDER=7 NGRAM_EVAL_ADAPTIVE=1 TTT_EVAL_ENABLED=0 per the PR body's reproduce block, so the TTT eval path is off in the headline numbers.
  • CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK, Hyperparameters and GPT resolve, code size 106,176 bytes, smoke PASS.
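The score-then-update flow those findings describe can be sketched as a toy reconstruction. Bucket size, the hash, and every name below are stand-ins for illustration, not the actual `eval_val_sliding_hashed_ngram` code:

```python
import numpy as np

# Toy multi-order backward-looking cache: score first, update after.
BUCKETS = 1 << 16            # the PR defaults to 4_194_304
MIN_ORDER, MAX_ORDER = 2, 7
MIN_COUNT = 2

ctx_tables = {n: np.zeros(BUCKETS, np.int64) for n in range(MIN_ORDER, MAX_ORDER + 1)}
full_tables = {n: np.zeros(BUCKETS, np.int64) for n in range(MIN_ORDER, MAX_ORDER + 1)}

def _bucket(tokens) -> int:
    """Stand-in rolling hash into the bucketed table."""
    h = 0
    for t in tokens:
        h = (h * 1000003 + int(t) + 1) & 0xFFFFFFFF
    return h % BUCKETS

def ngram_prob(hist, target):
    """Longest-context-first backoff: first order with enough context mass wins."""
    for n in range(MAX_ORDER, MIN_ORDER - 1, -1):
        ctx = hist[-(n - 1):]
        if len(ctx) < n - 1:
            continue
        c = ctx_tables[n][_bucket(ctx)]
        if c >= MIN_COUNT:
            f = full_tables[n][_bucket(list(ctx) + [target])]
            return min(f, c) / max(c, 1), n
    return None, 0

def update(hist, target):
    """Called only AFTER the token was scored (backward-looking legality)."""
    for n in range(MIN_ORDER, MAX_ORDER + 1):
        ctx = hist[-(n - 1):]
        if len(ctx) < n - 1:
            continue
        ctx_tables[n][_bucket(ctx)] += 1
        full_tables[n][_bucket(list(ctx) + [target])] += 1
```

Feeding a repeating stream token-by-token (score, then update, then append to the history) makes the order-7 lookup start firing once each 6-token context has been seen MIN_COUNT times.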

Questions / flags:

  1. Hashed n-gram family-bug exposure (the big one). full_key = (ctx_hash ^ (target * prime[ctx_width % len(primes)])) & mask (L1105 / L1138) collides families of distinct (context, target) pairs into the same bucket as soon as buckets (~4.2M) is smaller than the number of distinct full n-grams. With orders 2..7 and ~10M val tokens written into a single 4M-entry table per order, full_counts for any queried (ctx, tgt) is the sum of counts of every (ctx', tgt') that hashed to that bucket, not the true count of (ctx, tgt).

     The min(full_counts, ctx_counts) clamp at L1111 caps the inflation but does not eliminate it: whenever collisions push full_counts to ctx_counts or above, the estimator returns ctx_counts/ctx_counts = 1.0, i.e. probability 1 on the target, a free 0 nats on collided tokens. Mixed with alpha_max=0.60, that is a substantial artificial advantage.

     Has the author measured the collision rate (e.g. the fraction of scored tokens where full_counts >= ctx_counts) and the BPB delta against either (a) a non-hashed dict-backed cache or (b) a much larger buckets setting (e.g. 64M)? Ideally the same 3 seeds with NGRAM_EVAL_BUCKETS swept upward: if BPB stays at 0.9625, the family-bug worry is empirically ruled out; if it drifts up, the headline number is partially attributable to hash collisions rather than n-gram statistics. Per the Illegal submissions megathread (#677; valerio-oai, 2026-03-27), eval caches must be backward-looking with no oracle selection. The temporal ordering here is fine, but this is a separate "is the estimator computing what it claims" concern.

  2. min(full, ctx) clamp semantics. Even ignoring collisions, on cold/rare contexts where full_counts > ctx_counts, the clamp returns p = 1.0. With min_count=2 and a fresh table early in the eval, how often does a 7-gram lookup hit has_data (ctx_counts >= 2) but with full_counts >= ctx_counts? A short log of (matched_at_order_n, p==1.0_count) per order would settle whether order-7 hits are mostly legitimate repeats or mostly collision-driven 1.0s.

  3. Cross-order double-counting on cache write. On the update path (L1126-1140) every order from min_order..max_order is updated with its own (ctx_key, full_key), but the score path (L1092-1115) stops at the first order that produces has_data and marks ng_matched. That's fine for scoring, but it means the longer-order tables grow against the same token whether or not it was scored from them — subsequent windows then see densified high-order tables that may be more collision-prone. Same ask as (1): collision-rate telemetry would resolve it.

  4. N-gram cache attribution. The PR body credits @deanbrr (#659, the 5-gram eval cache record: val_bpb 1.0920, 3-seed mean) for the n-gram eval cache and @Asukabot0 (#727, first legal sub-1.0 BPB: multi-order n-gram backoff + entropy-adaptive alpha, val_bpb 0.9674, 3-seed) for backoff + adaptive alpha. Good. The compliance question above isn't unique to this PR; it applies to the whole hashed-cache family lineage and is worth raising upstream rather than treating any single PR as the source.

  5. TTT path is dead code in the submission. eval_val_sliding_ttt (L1194-1334) implements the score-then-train-then-EMA-load pattern that looks score-first per chunk, but TTT_EVAL_ENABLED=0 in the reproduce block means the headline 0.9625 doesn't depend on it. Mentioning this only so reviewers don't audit a path that isn't in the run.

  6. 3-seed discipline acknowledged. Three seeds with every individual result sub-0.9640 is exactly the standard that #129 (proposals for new rules to handle the flood of submissions) and the community ask for. The variance is small enough that the result isn't seed-cherry-picked.
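The saturation mechanism flagged in (1) and (2) is easy to demonstrate in isolation. The bucket count and the counts below are fabricated to force a collision; this illustrates the failure mode, not measured behavior of the PR:

```python
import numpy as np

# Two tiny hashed count tables, shaped like the estimator under review.
BUCKETS = 8
ctx_table = np.zeros(BUCKETS, np.int64)
full_table = np.zeros(BUCKETS, np.int64)

ctx_key, full_key = 3, 5   # arbitrary bucket indices for the demo

ctx_table[ctx_key] = 4     # the queried context truly occurred 4 times
full_table[full_key] = 9   # but 9 UNRELATED (ctx', tgt') writes share this
                           # bucket; the true (ctx, tgt) count could even be 0

# min(full, ctx) / max(ctx, 1): the clamp caps inflation at ctx_counts, so a
# collided bucket saturates at probability 1.0 for the queried target.
p = min(full_table[full_key], ctx_table[ctx_key]) / max(ctx_table[ctx_key], 1)
print(p)  # 1.0
```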

Verdict: NEEDS CLARIFICATION — the score-before-update temporal legality looks clean and the entropy-adaptive alpha touches no targets, but the hashed full_key estimator has an open empirical question (family collisions inflating full_counts, then min(full, ctx) saturating at p=1.0) that materially affects whether 0.9625 reflects n-gram statistics vs. hash-table artifacts. Not a "close" — a "show the collision-rate / large-bucket sweep" ask.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:

  • HOLD pending (a) author-supplied collision telemetry (matched-fraction per order, fraction of p==1.0 outcomes per order), and (b) a NGRAM_EVAL_BUCKETS sweep (e.g. 4M / 16M / 64M) on at least one seed showing BPB stability. If BPB is stable across bucket sizes, this is MERGE — sub-1.0 would be a real milestone for the family. If BPB walks up with bigger buckets, the result is a hash-collision artifact and should be relabeled or withdrawn.
  • This also surfaces a category-level ask that should be raised on the Illegal submissions megathread (#677): hashed n-gram caches need a community-agreed validity check (collision rate or large-bucket parity) to be evaluated as a class.
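The requested sweep could look like the following dry-run sketch. Only NGRAM_EVAL_BUCKETS and the flags from the PR's reproduce block are taken from the source; the single-seed choice and the echo-only dry run are illustrative (drop the `echo` to actually launch the runs):

```shell
# Print the sweep commands: one seed, three bucket sizes (4M / 16M / 64M).
for BUCKETS in 4194304 16777216 67108864; do
  echo SEED=2045 NGRAM_EVAL_ORDER=7 NGRAM_EVAL_ADAPTIVE=1 TTT_EVAL_ENABLED=0 \
       NGRAM_EVAL_BUCKETS=$BUCKETS \
       torchrun --nproc_per_node=8 train_gpt.py
done
```

If BPB is stable across the three bucket sizes, collision inflation is empirically ruled out; if it walks up with larger tables, the headline number is partly a hash artifact.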

Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_OK, HAS_HYPERPARAMETERS=True, HAS_GPT=True, model_dim=512, num_heads=8, num_layers=11, vocab=1024, train_seq_len=2048, code_bytes=106176, SMOKE_TEST_PASS. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 8a59150 (refs/pull/753/head).

