
Podracing II: Electric Bugaloo — 0.9625 BPB (3-seed mean, all sub-0.964)#753

Open
newjordan wants to merge 8 commits into openai:main from newjordan:submission/podracing-ii

Conversation


@newjordan newjordan commented Mar 25, 2026

podracing

Results

| Seed | Sliding BPB | 7-gram Backoff BPB | Artifact |
|------|-------------|--------------------|----------|
| 42   | 1.1210      | 0.9631             | 15.59 MB |
| 2045 | 1.1196      | 0.9620             | 15.71 MB |
| 7    | 1.1202      | 0.9624             | 15.59 MB |
| Mean | 1.1203      | 0.9625             |          |

Progression

| PR | Mean BPB | Notes |
|----|----------|-------|
| #190 | | The Stinky Frost Recipe |
| #390, #401 | 1.1295, 1.1243 | Sponge Bath TTT + EMA/SWA/QAT |
| #445 | 1.1236 | Late Training Replay + GPTQ-lite |
| #498, #499 | 1.1378 | The Frugendorff |
| #508, #578 | 1.1215 | GPTQ + Early QAT + Legal TTT |
| #533, #577 | 1.1207 | GPTQ + Short TTT |
| #587 | 1.1208 | XSA + quantization tuning |
| #656 | 1.1195 | Three Breadsticks |
| #706 | 1.0461 | Podracing I (fixed 5-gram) |
| #753 | 0.9625 | Podracing II (backoff + adaptive) |

What Changed vs Podracing I

Two eval-time improvements, no training changes:

  1. Multi-order backoff (2-7): longest context first, cascade on miss
  2. Entropy-adaptive alpha: trust n-gram more when model is uncertain
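A minimal sketch of the two mechanisms together. The alpha bounds [0.05, 0.60] and the linear probability mix come from the code; the sigmoid center/scale and the function names here are illustrative assumptions (a derived PR mentions entropy_center=3.0), not the actual implementation:

```python
import numpy as np

# Alpha bounds are from the PR; the sigmoid center/scale are ASSUMED values.
ALPHA_MIN, ALPHA_MAX = 0.05, 0.60
ENT_CENTER, ENT_SCALE = 3.0, 1.0

def entropy_adaptive_alpha(logits: np.ndarray) -> np.ndarray:
    """Per-token mixing weight from model entropy only -- no target access."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    ent = -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)  # nats
    gate = 1.0 / (1.0 + np.exp(-(ent - ENT_CENTER) / ENT_SCALE))
    return ALPHA_MIN + (ALPHA_MAX - ALPHA_MIN) * gate

def mix(p_model: np.ndarray, p_ngram: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Linear mix in probability space: trust the n-gram more when uncertain."""
    return (1.0 - alpha) * p_model + alpha * p_ngram
```

A peaked (confident) distribution yields alpha near the 0.05 floor; a near-uniform one pushes it toward the 0.60 ceiling.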

Compliance

  • Score-first, backward-looking cache
  • Alpha from model entropy only — no target access
  • GPTQ calibration inside training phase
  • Training logs + submission.json included

Credits

Reproduce

```shell
SEED=2045 MLP_ACT=leaky_relu_sq MLP_LEAKY_SLOPE=0.5 XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=1536 ROPE_DIMS=24 NGRAM_EVAL_ORDER=7 NGRAM_EVAL_ADAPTIVE=1 TTT_EVAL_ENABLED=0 torchrun --nproc_per_node=8 train_gpt.py
```

8xH100 SXM, 600s training + ~140s eval.

Multi-order backoff (2-7) + entropy-adaptive alpha on 11L/512d U-Net.
All 3 seeds sub-1.0. GPTQ calibration inside training phase.

Seeds: 42=0.9631, 2045=0.9620, 7=0.9624, mean=0.9625

Credits: @deanbrr openai#659, @Asukabot0 openai#727, @signalrush openai#414

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@newjordan newjordan force-pushed the submission/podracing-ii branch from f9f804a to ed062df Compare March 25, 2026 18:04
@newjordan newjordan changed the title Podracing II: Electric Bugaloo — 0.9620 BPB (best seed), mean 0.9823 Podracing II: Electric Bugaloo — 0.9625 BPB (3-seed mean, all sub-0.964) Mar 25, 2026
Octavian and others added 7 commits March 25, 2026 13:24
ZERO changes to model, training loop, optimizer, compile, or anything
outside the eval function. The C-step is pure numpy on CPU.

Patch adds:
- 5 env vars (CUBRIC_CADENCE, COUNT_DECAY, BOOST/PRUNE/REWEIGHT)
- _cubric_c_step() function (numpy, CPU-only)
- Buffering + firing logic inside eval_val_sliding_hashed_ngram
- Training path is byte-identical to train_gpt.py

Usage: CUBRIC_CADENCE=4 to enable, CUBRIC_CADENCE=0 (default) = off

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests order (8,9), buckets (8M,16M), min_count (1,3), alpha range,
entropy sigmoid params. All eval-time, no training changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No more copies. Cubric env vars + C-step function + eval wiring added
directly to the production script. CUBRIC_CADENCE=0 (default) = off,
identical to original. Run script points to real train_gpt.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0.9625 mean BPB. Backoff 2-7 + entropy-adaptive alpha.
Three identical copies for safety.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pure deletion — 166 lines of dead code removed, zero functional change.
TTT eval was gated behind `if args.ttt_eval_enabled:` which was always False.
The function `eval_val_sliding_ttt` and all TTT parameter parsing removed.
N-gram backoff eval, GPTQ, and all scoring paths unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SOTA untouched. Each test is a separate copy:
- train_gpt_baseline.py (clean SOTA copy, control)
- train_gpt_cadence4.py (SOTA + cubric C-step, cadence=4)
- train_gpt_cadence10.py (SOTA + cubric C-step, cadence=10)

Each has its own run script. HYPOTHESES.md documents everything.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
travispchen added a commit to travispchen/parameter-golf that referenced this pull request Mar 25, 2026
…ed mean)

N-gram7 BPB: 0.9370 (±0.0003) across seeds 1337/42/2025
Sliding BPB: 1.1222 (±0.0003)
Artifact: ~15.9 MB (within 16MB cap)
Training: 600s on 8xH100

Key innovation: order-adaptive entropy gating assigns different
entropy thresholds per n-gram order. High-order matches (7-gram)
trusted at moderate model confidence; low-order matches (2-gram)
only trusted when model is very uncertain.

Built on PR openai#753 (Podracing II) with XSA extended to all 11 layers
and entropy_center=3.0.

Co-Authored-By: Travis Chen <travispchen@gmail.com>
ahmettrkck added a commit to ahmettrkck/parameter-golf that referenced this pull request Mar 25, 2026
Logistic domain mixing was wrong for target-probability mixing.
PR openai#753 uses linear: p_mixed = (1-a)*p_neural + a*p_ngram.
Keep CTW-inspired depth-adaptive alpha boost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 25, 2026
Per-order adaptive alpha scaling on legal score-first 7-gram backoff.
Tracks per-order beat rate on already-scored tokens, suppresses noisy
low orders (2-3 → 0.3x alpha), boosts accurate high orders (5-7 → 2.0x).

Results (seeds 2045/43/300):
  Sliding BPB (no n-gram): 1.1198 mean
  Cubric n-gram BPB: 0.9362 mean (0.9357/0.9362/0.9365)
  Artifact: 15.59 MB (int6+zstd)

0.026 BPB improvement over Podracing II (openai#753, 0.9625).
Original contribution: per-order adaptive alpha scaling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 26, 2026
Per-order adaptive alpha scaling on score-first 7-gram backoff.
Orders 2-3 suppressed to 0.3x, orders 5-7 boosted to 2.0x.
0.026 BPB improvement over PR openai#753 (0.9625).

Pending: multi-seed verification + zstd compression check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 26, 2026
Per-order adaptive alpha scaling on score-first 7-gram backoff.
Seeds 2045=0.9357, 43=0.9362, 300=0.9365. Mean=0.9362.
0.026 BPB improvement over PR openai#753 (0.9625).

Logs, submission.json, README included.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 26, 2026
…ual hash tables, per-window score-first, entropy-adaptive alpha, tc>0 check)
@MatoTeziTanka

Community Review — Podracing II: Electric Bugaloo (n-gram backoff + entropy-adaptive alpha)

BPB: 0.9625 (3-seed mean) | Seeds: 3 (42, 2045, 7) | Artifact: ~15.59-15.71 MB | Compliance: FLAG (open question on hashed n-gram family-bug)

What this does: Eval-time hybrid: at each scored token, mix the model probability with a backward-looking hashed n-gram cache estimate, using a per-token alpha derived from model entropy. The cache is multi-order (orders min_order..max_order, default 2..7) with longest-context-first backoff. No training-side changes vs predecessors. Score-first TTT (eval_val_sliding_ttt) is also wired in but the headline submission script disables it (TTT_EVAL_ENABLED=0).

What I found in the code (head SHA 8a59150):

  • eval_val_sliding_hashed_ngram at train_gpt.py:970-1192 is the hot path. Per token:
    • Score is computed at L1107-1115 using ctx_tables[n][ctx_key] and full_tables[n][full_key].
    • The cache is updated at L1126-1140 after the segment has been scored, via np.add.at(ctx_tables[n], ctx_key, 1) and np.add.at(full_tables[n], full_key, 1). So per-token temporal ordering looks score-before-update.
    • Cache tables are allocated once per eval call (L1020-1021) and persist across all sliding windows in that rank. With stride overlap, only the new stride suffix of each window is scored (L1067 s = max(wlen - stride, 0)), so each token is scored exactly once globally before its np.add.at update.
  • The n-gram probability is min(full_counts, ctx_counts) / max(ctx_counts, 1) (L1111). Both full_key and ctx_key are hashed to buckets = 4_194_304 (L1022, default NGRAM_EVAL_BUCKETS=4_194_304).
  • The mix is p = (1 - a) * p_model + a * p_ngram per token (L1121), with a from a sigmoid on per-token entropy bounded to [alpha_min=0.05, alpha_max=0.60] (L1075-1083). Alpha uses only logits_f[i, s:wlen], no y_batch access — entropy is computed from model output only.
  • Submission run: NGRAM_EVAL_ORDER=7 NGRAM_EVAL_ADAPTIVE=1 TTT_EVAL_ENABLED=0 per the PR body's reproduce block, so the TTT eval path is off in the headline numbers.
  • CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK, Hyperparameters and GPT resolve, code size 106,176 bytes, smoke PASS.
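The score-then-update flow those findings describe can be sketched as a toy reconstruction. Bucket size, the hash, and every name below are stand-ins for illustration, not the actual `eval_val_sliding_hashed_ngram` code:

```python
import numpy as np

# Toy multi-order backward-looking cache: score first, update after.
BUCKETS = 1 << 16            # the PR defaults to 4_194_304
MIN_ORDER, MAX_ORDER = 2, 7
MIN_COUNT = 2

ctx_tables = {n: np.zeros(BUCKETS, np.int64) for n in range(MIN_ORDER, MAX_ORDER + 1)}
full_tables = {n: np.zeros(BUCKETS, np.int64) for n in range(MIN_ORDER, MAX_ORDER + 1)}

def _bucket(tokens) -> int:
    """Stand-in rolling hash into the bucketed table."""
    h = 0
    for t in tokens:
        h = (h * 1000003 + int(t) + 1) & 0xFFFFFFFF
    return h % BUCKETS

def ngram_prob(hist, target):
    """Longest-context-first backoff: first order with enough context mass wins."""
    for n in range(MAX_ORDER, MIN_ORDER - 1, -1):
        ctx = hist[-(n - 1):]
        if len(ctx) < n - 1:
            continue
        c = ctx_tables[n][_bucket(ctx)]
        if c >= MIN_COUNT:
            f = full_tables[n][_bucket(list(ctx) + [target])]
            return min(f, c) / max(c, 1), n
    return None, 0

def update(hist, target):
    """Called only AFTER the token was scored (backward-looking legality)."""
    for n in range(MIN_ORDER, MAX_ORDER + 1):
        ctx = hist[-(n - 1):]
        if len(ctx) < n - 1:
            continue
        ctx_tables[n][_bucket(ctx)] += 1
        full_tables[n][_bucket(list(ctx) + [target])] += 1
```

Feeding a repeating stream token-by-token (score, then update, then append to the history) makes the order-7 lookup start firing once each 6-token context has been seen MIN_COUNT times.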

Questions / flags:

  1. Hashed n-gram family-bug exposure (the big one). full_key = (ctx_hash ^ (target * prime[ctx_width % len(primes)])) & mask (L1105 / L1138) collides families of distinct (context, target) pairs into the same bucket as soon as buckets (~4.2M) is smaller than the number of distinct full n-grams. With orders 2..7 and ~10M val tokens written into a single 4M-entry table per order, full_counts for any queried (ctx, tgt) is the sum of counts of every (ctx', tgt') that hashed to that bucket, not the true count of (ctx, tgt).

     The min(full_counts, ctx_counts) clamp at L1111 caps the inflation but does not eliminate it: whenever collisions push full_counts to ctx_counts or above, the estimator returns ctx_counts/ctx_counts = 1.0, i.e. probability 1 on the target, a free 0 nats on collided tokens. Mixed with alpha_max=0.60, that is a substantial artificial advantage.

     Has the author measured the collision rate (e.g. the fraction of scored tokens where full_counts >= ctx_counts) and the BPB delta against either (a) a non-hashed dict-backed cache or (b) a much larger buckets setting (e.g. 64M)? Ideally the same 3 seeds with NGRAM_EVAL_BUCKETS swept upward: if BPB stays at 0.9625, the family-bug worry is empirically ruled out; if it drifts up, the headline number is partially attributable to hash collisions rather than n-gram statistics. Per the Illegal submissions megathread (#677; valerio-oai, 2026-03-27), eval caches must be backward-looking with no oracle selection. The temporal ordering here is fine, but this is a separate "is the estimator computing what it claims" concern.

  2. min(full, ctx) clamp semantics. Even ignoring collisions, on cold/rare contexts where full_counts > ctx_counts, the clamp returns p = 1.0. With min_count=2 and a fresh table early in the eval, how often does a 7-gram lookup hit has_data (ctx_counts >= 2) but with full_counts >= ctx_counts? A short log of (matched_at_order_n, p==1.0_count) per order would settle whether order-7 hits are mostly legitimate repeats or mostly collision-driven 1.0s.

  3. Cross-order double-counting on cache write. On the update path (L1126-1140) every order from min_order..max_order is updated with its own (ctx_key, full_key), but the score path (L1092-1115) stops at the first order that produces has_data and marks ng_matched. That's fine for scoring, but it means the longer-order tables grow against the same token whether or not it was scored from them — subsequent windows then see densified high-order tables that may be more collision-prone. Same ask as (1): collision-rate telemetry would resolve it.

  4. N-gram cache attribution. The PR body credits @deanbrr (#659, the 5-gram eval cache record: val_bpb 1.0920, 3-seed mean) for the n-gram eval cache and @Asukabot0 (#727, first legal sub-1.0 BPB: multi-order n-gram backoff + entropy-adaptive alpha, val_bpb 0.9674, 3-seed) for backoff + adaptive alpha. Good. The compliance question above isn't unique to this PR; it applies to the whole hashed-cache family lineage and is worth raising upstream rather than treating any single PR as the source.

  5. TTT path is dead code in the submission. eval_val_sliding_ttt (L1194-1334) implements the score-then-train-then-EMA-load pattern that looks score-first per chunk, but TTT_EVAL_ENABLED=0 in the reproduce block means the headline 0.9625 doesn't depend on it. Mentioning this only so reviewers don't audit a path that isn't in the run.

  6. 3-seed discipline acknowledged. Three seeds with every individual result sub-0.9640 is exactly the standard that #129 (proposals for new rules to handle the flood of submissions) and the community ask for. The variance is small enough that the result isn't seed-cherry-picked.
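The saturation mechanism flagged in (1) and (2) is easy to demonstrate in isolation. The bucket count and the counts below are fabricated to force a collision; this illustrates the failure mode, not measured behavior of the PR:

```python
import numpy as np

# Two tiny hashed count tables, shaped like the estimator under review.
BUCKETS = 8
ctx_table = np.zeros(BUCKETS, np.int64)
full_table = np.zeros(BUCKETS, np.int64)

ctx_key, full_key = 3, 5   # arbitrary bucket indices for the demo

ctx_table[ctx_key] = 4     # the queried context truly occurred 4 times
full_table[full_key] = 9   # but 9 UNRELATED (ctx', tgt') writes share this
                           # bucket; the true (ctx, tgt) count could even be 0

# min(full, ctx) / max(ctx, 1): the clamp caps inflation at ctx_counts, so a
# collided bucket saturates at probability 1.0 for the queried target.
p = min(full_table[full_key], ctx_table[ctx_key]) / max(ctx_table[ctx_key], 1)
print(p)  # 1.0
```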

Verdict: NEEDS CLARIFICATION — the score-before-update temporal legality looks clean and the entropy-adaptive alpha touches no targets, but the hashed full_key estimator has an open empirical question (family collisions inflating full_counts, then min(full, ctx) saturating at p=1.0) that materially affects whether 0.9625 reflects n-gram statistics vs. hash-table artifacts. Not a "close" — a "show the collision-rate / large-bucket sweep" ask.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:

  • HOLD pending (a) author-supplied collision telemetry (matched-fraction per order, fraction of p==1.0 outcomes per order), and (b) a NGRAM_EVAL_BUCKETS sweep (e.g. 4M / 16M / 64M) on at least one seed showing BPB stability. If BPB is stable across bucket sizes, this is MERGE — sub-1.0 would be a real milestone for the family. If BPB walks up with bigger buckets, the result is a hash-collision artifact and should be relabeled or withdrawn.
  • This also surfaces a category-level ask that should be raised on the Illegal submissions megathread (#677): hashed n-gram caches need a community-agreed validity check (collision rate or large-bucket parity) to be evaluated as a class.
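The requested sweep could look like the following dry-run sketch. Only NGRAM_EVAL_BUCKETS and the flags from the PR's reproduce block are taken from the source; the single-seed choice and the echo-only dry run are illustrative (drop the `echo` to actually launch the runs):

```shell
# Print the sweep commands: one seed, three bucket sizes (4M / 16M / 64M).
for BUCKETS in 4194304 16777216 67108864; do
  echo SEED=2045 NGRAM_EVAL_ORDER=7 NGRAM_EVAL_ADAPTIVE=1 TTT_EVAL_ENABLED=0 \
       NGRAM_EVAL_BUCKETS=$BUCKETS \
       torchrun --nproc_per_node=8 train_gpt.py
done
```

If BPB is stable across the three bucket sizes, collision inflation is empirically ruled out; if it walks up with larger tables, the headline number is partly a hash artifact.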

Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_OK, HAS_HYPERPARAMETERS=True, HAS_GPT=True, model_dim=512, num_heads=8, num_layers=11, vocab=1024, train_seq_len=2048, code_bytes=106176, SMOKE_TEST_PASS. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 8a59150 (refs/pull/753/head).

