Podracing II: Electric Bugaloo — 0.9625 BPB (3-seed mean, all sub-0.964)#753
newjordan wants to merge 8 commits into openai:main from
Conversation
Multi-order backoff (2-7) + entropy-adaptive alpha on 11L/512d U-Net. All 3 seeds sub-1.0. GPTQ calibration inside training phase.

Seeds: 42=0.9631, 2045=0.9620, 7=0.9624, mean=0.9625

Credits: @deanbrr openai#659, @Asukabot0 openai#727, @signalrush openai#414

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
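A minimal sketch of the multi-order backoff idea named above (orders 7 down to 2, longest context first). The table layout and `min_count` handling here are assumptions for illustration, not the PR's actual data structures:

```python
def backoff_prob(tables, context, byte, min_count=1):
    """tables[order] maps a context tuple to {byte: count} -- illustrative only."""
    for order in range(7, 1, -1):               # longest context first: 7 -> 2
        ctx = tuple(context[-(order - 1):])
        bucket = tables.get(order, {}).get(ctx)
        if bucket:
            total = sum(bucket.values())
            if total >= min_count:              # enough evidence at this order
                return bucket.get(byte, 0) / total, order
    return None, 0                              # no usable order: model-only token
```

Returning the order that fired is what lets a per-order alpha policy (as in later commits) weight the estimate.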
Force-pushed from f9f804a to ed062df
ZERO changes to model, training loop, optimizer, compile, or anything outside the eval function. The C-step is pure numpy on CPU.

Patch adds:
- 5 env vars (CUBRIC_CADENCE, COUNT_DECAY, BOOST/PRUNE/REWEIGHT)
- _cubric_c_step() function (numpy, CPU-only)
- Buffering + firing logic inside eval_val_sliding_hashed_ngram
- Training path is byte-identical to train_gpt.py

Usage: CUBRIC_CADENCE=4 to enable; CUBRIC_CADENCE=0 (default) = off.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests order (8, 9), buckets (8M, 16M), min_count (1, 3), alpha range, and entropy sigmoid params. All eval-time, no training changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No more copies. Cubric env vars + C-step function + eval wiring added directly to the production script. CUBRIC_CADENCE=0 (default) = off, identical to original. Run script points to real train_gpt.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0.9625 mean BPB. Backoff 2-7 + entropy-adaptive alpha. Three identical copies for safety.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pure deletion — 166 lines of dead code removed, zero functional change. TTT eval was gated behind `if args.ttt_eval_enabled:`, which was always False. The function `eval_val_sliding_ttt` and all TTT parameter parsing are removed. N-gram backoff eval, GPTQ, and all scoring paths unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SOTA untouched. Each test is a separate copy:
- train_gpt_baseline.py (clean SOTA copy, control)
- train_gpt_cadence4.py (SOTA + cubric C-step, cadence=4)
- train_gpt_cadence10.py (SOTA + cubric C-step, cadence=10)

Each has its own run script. HYPOTHESES.md documents everything.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ed mean)

N-gram7 BPB: 0.9370 (±0.0003) across seeds 1337/42/2025. Sliding BPB: 1.1222 (±0.0003). Artifact: ~15.9 MB (within 16MB cap). Training: 600s on 8xH100.

Key innovation: order-adaptive entropy gating assigns a different entropy threshold to each n-gram order. High-order matches (7-gram) are trusted at moderate model confidence; low-order matches (2-gram) are trusted only when the model is very uncertain.

Built on PR openai#753 (Podracing II) with XSA extended to all 11 layers and entropy_center=3.0.

Co-Authored-By: Travis Chen <travispchen@gmail.com>
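The order-adaptive gating described above can be sketched as a per-order threshold lookup; the specific threshold values here are illustrative assumptions, not the PR's tuned numbers:

```python
# Per-order entropy thresholds (bits). Lower threshold = trusted earlier:
# a 7-gram fires at moderate model uncertainty, a 2-gram only when the
# model is very uncertain. Values are assumed for illustration.
ENTROPY_THRESHOLD = {2: 5.0, 3: 4.5, 4: 4.0, 5: 3.5, 6: 3.0, 7: 2.5}

def gate(order: int, model_entropy_bits: float) -> bool:
    # Trust this order's n-gram estimate only if the model's entropy
    # at this token clears the order's threshold.
    return model_entropy_bits >= ENTROPY_THRESHOLD[order]
```

At a model entropy of 3.0 bits, orders 6-7 fire while orders 2-5 stay gated off.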
Logistic-domain mixing was wrong for target-probability mixing. PR openai#753 uses linear mixing: p_mixed = (1-a)*p_neural + a*p_ngram. The CTW-inspired depth-adaptive alpha boost is kept.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
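A minimal sketch of the linear mixing rule stated above, paired with an entropy-driven sigmoid alpha (the review mentions entropy sigmoid params and entropy_center=3.0 elsewhere in the thread; the `scale` and `alpha_max` values here are assumptions):

```python
import numpy as np

def entropy_alpha(p_neural, center=3.0, scale=1.0, alpha_max=0.5):
    """Map model entropy (bits) to a mixing weight via a sigmoid.
    center/scale/alpha_max are illustrative, not the PR's tuned values."""
    h = -np.sum(p_neural * np.log2(np.maximum(p_neural, 1e-12)))
    return alpha_max / (1.0 + np.exp(-(h - center) / scale))

def mix(p_neural, p_ngram, **kw):
    a = entropy_alpha(p_neural, **kw)
    # Linear mixing in probability space keeps the result a valid
    # distribution whenever both inputs are distributions.
    return (1.0 - a) * p_neural + a * p_ngram
```

Linear mixing guarantees the output still sums to 1, which is the property that logistic-domain mixing broke for target probabilities.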
Per-order adaptive alpha scaling on legal score-first 7-gram backoff. Tracks per-order beat rate on already-scored tokens, suppresses noisy low orders (2-3 → 0.3x alpha), boosts accurate high orders (5-7 → 2.0x).

Results (seeds 2045/43/300):
- Sliding BPB (no n-gram): 1.1198 mean
- Cubric n-gram BPB: 0.9362 mean (0.9357/0.9362/0.9365)
- Artifact: 15.59 MB (int6+zstd)

0.026 BPB improvement over Podracing II (openai#753, 0.9625). Original contribution: per-order adaptive alpha scaling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
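The beat-rate tracking above can be sketched as follows. The 0.3x/2.0x multipliers follow the commit text; the beat-rate thresholds and the update rule itself are assumptions:

```python
class OrderAlphaScaler:
    """Track, per n-gram order, how often the cache beats the model
    on already-scored tokens, and scale alpha accordingly."""

    def __init__(self, orders=range(2, 8)):
        self.beats = {o: 0 for o in orders}   # times the n-gram won
        self.total = {o: 0 for o in orders}   # times the order fired

    def observe(self, order, ngram_logprob, model_logprob):
        # Only uses tokens that were already scored, so no look-ahead.
        self.total[order] += 1
        if ngram_logprob > model_logprob:
            self.beats[order] += 1

    def scale(self, order):
        if self.total[order] == 0:
            return 1.0                        # no evidence yet: neutral
        rate = self.beats[order] / self.total[order]
        # Suppress noisy orders, boost reliably accurate ones
        # (thresholds 0.4/0.6 are assumed for illustration).
        return 0.3 if rate < 0.4 else (2.0 if rate > 0.6 else 1.0)
```

Because the statistics come only from tokens that have already been scored, the scaling stays within the score-before-update legality the review discusses.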
Per-order adaptive alpha scaling on score-first 7-gram backoff. Orders 2-3 suppressed to 0.3x, orders 5-7 boosted to 2.0x. 0.026 BPB improvement over PR openai#753 (0.9625). Pending: multi-seed verification + zstd compression check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per-order adaptive alpha scaling on score-first 7-gram backoff. Seeds 2045=0.9357, 43=0.9362, 300=0.9365; mean=0.9362. 0.026 BPB improvement over PR openai#753 (0.9625). Logs, submission.json, README included.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ual hash tables, per-window score-first, entropy-adaptive alpha, tc>0 check)
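The score-first ordering and the tc>0 check named in the commit title can be sketched in a few lines: at each position, the count tables are read to score the current token first, and only then updated with it, so no count ever reflects the token being scored. This is an illustrative single-order version, not the PR's hashed multi-table implementation:

```python
from collections import defaultdict

def score_first_pass(tokens, order=3):
    counts = defaultdict(int)       # (context, token) -> count
    totals = defaultdict(int)       # context -> total count
    probs = []
    for i, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, i - order + 1):i])
        tc = totals[ctx]
        # tc > 0 check: only emit an estimate when the context has been seen
        probs.append(counts[(ctx, tok)] / tc if tc > 0 else None)
        counts[(ctx, tok)] += 1     # update AFTER scoring -> temporally legal
        totals[ctx] += 1
    return probs
```

The first occurrence of any context yields `None` (model-only token), which is exactly the behavior that makes the scheme backward-looking.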
Community Review — Podracing II: Electric Bugaloo (n-gram backoff + entropy-adaptive alpha)

BPB: 0.9625 (3-seed mean) | Seeds: 3 (42, 2045, 7) | Artifact: ~15.59-15.71 MB | Compliance: FLAG (open question on hashed n-gram family-bug)

What this does: Eval-time hybrid: at each scored token, mix the model probability with a backward-looking hashed n-gram cache estimate, using a per-token alpha derived from model entropy. The cache is multi-order (orders

What I found in the code (head SHA
Questions / flags:
Verdict: NEEDS CLARIFICATION — the score-before-update temporal legality looks clean and the entropy-adaptive alpha touches no targets, but the hashed

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:
Reviewed by @MatoTeziTanka — The Agora.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_OK, HAS_HYPERPARAMETERS=True, HAS_GPT=True, model_dim=512, num_heads=8, num_layers=11, vocab=1024, train_seq_len=2048, code_bytes=106176, SMOKE_TEST_PASS.

AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA
Results
Progression
What Changed vs Podracing I
Two eval-time improvements, no training changes:
Compliance
Credits
Reproduce
8xH100 SXM, 600s training + ~140s eval.