Record: Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt; val_bpb 1.08309 (5-seed mean) #1420
Conversation
Force-pushed from 14879d0 to d581795
…m tilt, SP8192 primary path

- PR openai#771 confirmed CLOSED/REJECTED (train-then-score AdamW TTT)
- PR openai#727 confirmed CLOSED (illegal n-gram hash cache)
- Merged SOTA unchanged at 1.1147
- New primary target: PR openai#1420 (abaybektursun, 1.08014): SP8192 + Triple Loop (3×, 17 virtual layers) + N-gram Tilt (legal, properly normalized, -0.0029 bpb) + Fused Kernels (+127 steps)
- PR openai#1413 (1.08279): confirms legal score-first TTT adds -0.003 bpb
- ETLB (-0.0019 bpb) noted as unruled — await @valerio-oai
- Strategy updated to v10.0: SP8192 + Triple Loop replaces SP4096 + 2×

https://claude.ai/code/session_01TbdBLJPXpbK5wGHpLAQ9x4
Force-pushed from 635dd75 to accb40b
- Add ngram_tilt_enabled and tilt hyperparameters to Hyperparameters
- Add build_ngram_extension(): cmake-based C++ build for fused_expert_ext
- Add precompute_ngram_hints(): rank-0 computes, broadcasts to all ranks
- Integrate tilt into eval_val_sliding_ttt scoring loop:
  * Tilt applied AFTER TTT scoring (same sliding window)
  * TTT gradient uses ORIGINAL NLL (not tilted)
  * Tilted NLL accumulated for final score
- Track both base and tilted BPB for delta reporting
- Copy fused_expert_blend.cpp to repo root for C++ build
Post-Quantization Compression: Eight Negative Results

@clarkkev established in PR #1394 that compressed model size is governed by Shannon entropy, not hardware bitwidth. This note documents eight attempts to improve the compression pipeline beyond SDClip + GPTQ + Brotli. I raided the toolkits of crystallography (E8 lattice sphere packing), particle physics (

E8 Lattice Vector Quantization

The E8 lattice achieves optimal sphere packing in 8 dimensions (Viazovska, 2016), with normalized second moment 14% below the cubic lattice. I implemented D8 nearest-point rounding (the integer sublattice: all coordinates with even sum) and measured MSE on Gaussian-distributed weights. D8 increased MSE by 8.37%. The constraint removes half the codewords from the integer grid without adding new ones. The VQ advantage requires dense index-based encoding, not per-coordinate int8 storage. Abandoned.

Entropy Equalization

Interpretability analysis revealed 80x variation in per-matrix quantization sensitivity. I derived the optimal bit allocation via Lagrange multipliers on the rate-distortion model, yielding

Controlled A/B (same Hessians, 5 seeds): -0.004 BPB. End-to-end training: +0.002 BPB. The A/B test isolated the clip-allocation effect by holding GPTQ randomness constant. In practice, GPTQ stochasticity (~0.002 BPB from calibration sampling and floating-point non-determinism) exceeds the signal. The improvement is real but unmeasurable.

Sign-Flip Gauge

MLP hidden neurons admit a

Scale Discretization

53K per-row float16 scales contribute ~100KB of mantissa entropy. I snapped them to a log-lattice before GPTQ, expecting the solver to absorb the <0.8% perturbation. Artifact grew by 31KB. The discretization destroyed the smooth mantissa gradient that Brotli was already exploiting: Shannon entropy decreased, but Kolmogorov complexity increased.

ZigZag Encoding

Two's complement maps

Matrix Transposition

Column-major storage compressed 13KB smaller than row-major on the same quantized tensors because input-feature correlations dominate output-neuron correlations. Combined with stratigraphic dict ordering (grouping same-type matrices for inter-layer LZ77 matches): -16KB offline. End-to-end: +37KB. GPTQ output varies ~30-40KB across runs, overwhelming the signal.

Permutation Sort

MLP hidden-dimension permutation symmetry (

The Noise Floor

Every experiment followed the same pattern: positive in controlled settings, neutral or negative end-to-end. The root cause is a GPTQ noise floor of ~0.002 BPB and ~30-40KB in artifact size, arising from Hessian estimation variance and floating-point non-determinism. Any compression-side optimization below this floor is unmeasurable in practice. Brotli quality=11 is empirically at the byte-level compression frontier for this data. Six distinct byte-manipulation strategies (ZigZag, transposition, bit masking, scale discretization, dict reordering, permutation sort) all failed to improve on it.
Experimental Attempts with Negative Results

1. Isospectral Conjugation (Failed: OOM Error)
2. Skip-Gate Variance Normalization (Failed: Redundant)
I have also been playing with E8 lattice VQ over the past few days. Despite not managing to break the frontier, it is the best of the VQ methods I've tried, and it certainly beat various learned/shared codebook strategies.
@Eppie Curious to know what you think about the ngram tilt |
@mtybadger Cool! We need more fun ideas to help beat this SOTA; let me know if you get any more ideas or results.
@abaybektursun ngram tilt looks cool! From what I can see, it appears to be fully online: basically, it's a different approach to mixing the prediction from the trained model and the various order-N contexts, with some fixed confidence/count thresholds/priorities. Perhaps "mixing" is the wrong word, since it is focused on narrowing the model's probability distribution by assigning extra probability to the token predicted by the n-gram model (as far as I can understand it). Also very cool to see the fused CUDA kernels. Great work!
…am Tilt — val_bpb 1.07800 (3-seed mean)

3-lever stack on top of PR openai#1394 sp8192 baseline:
- Parallel Residuals on layers 7-10 (PR openai#1412 by @Robby955)
- 3-layer depth recurrence (LOOP_START=3 LOOP_END=5, extends PR openai#1394's 2-layer recurrence)
- Eval-time causal n-gram tilt (PR openai#1420 by @abaybektursun, lineage PR openai#1145 by @AnirudhRahul)

Plus our existing PR openai#1413 stack: QK_GAIN_INIT=5, score-first legal TTT (LR=0.005, epochs=3).

Results (3-seed mean, 8xH100 SXM):
- val_bpb 1.07800 (std 0.00053)
- val_loss 2.78457 nats per token
- Beats PR openai#1394 (1.08563) by 0.01971 nats per token
- Beats PR openai#1420 (1.08014) by 0.00553 nats per token
- Beats own PR openai#1413 (1.08279) by 0.01237 nats per token

All four issue openai#1017 conditions verified for the n-gram tilt path: prefix-only hash construction, full-vocab renormalized one-token tilt, score-before-update ordering inside the C++ kernel, single left-to-right pass.

C++ n-gram kernel ported from PR openai#1420 with the nanobind dependency removed (extern "C" shim + ctypes loader, single g++ -shared invocation at runtime).

5-seed re-verification via the shipped mini wrapper is in progress; this PR will be updated with the final 5-seed mean once s1337 and s2025 land.
Adds s1337 (1.07801) and s2025 (1.07862) via the shipped mini wrapper. The 5-seed mean is +0.00013 worse than the initial 3-seed mean (1.07800), which is well within the std (~0.00046). Margins vs the legal open chronology are unchanged in direction:

- vs PR openai#1394 (1.08563): -0.01938 nats per token (margin +0.01438 over the 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00520 nats per token (margin +0.00020 over the 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01205 nats per token

3 of 5 seeds (s42, s1337, s2025) are now mini-wrapper-verified for fit; s0 and s1234 mini-wrapper re-runs are still in progress.
All 5 seeds (s0, s42, s1234, s1337, s2025) re-run via the shipped mini wrapper. The mean improves slightly from the prior mixed-source 1.07813 to 1.07807 because s1234 produced a noticeably lower TTT under the mini wrapper (1.07813 mini vs 1.07848 raw, -0.00035 — within float64 reordering noise but the largest single-seed drift in the verification set).

All 5 artifact sizes are direct from the mini-wrapper runs (NOT projections):
- s0: 15,992,304 bytes (7,696 byte headroom)
- s42: 15,993,733 bytes (6,267 byte headroom)
- s1234: 15,990,539 bytes (9,461 byte headroom)
- s1337: 15,988,039 bytes (11,961 byte headroom)
- s2025: 15,992,215 bytes (7,785 byte headroom)

Margins vs the legal open chronology:
- vs PR openai#1394 (1.08563): -0.01952 nats per token (margin +0.01452 over the 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00534 nats per token (margin +0.00034 over the 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01218 nats per token

All four issue openai#1017 conditions remain verified for the n-gram tilt path.
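The headroom figures are internally consistent with a budget of exactly 16,000,000 bytes (an inference from the arithmetic above, since the track is described as 16 MB); a quick check:

```python
# Budget inferred from the reported sizes + headroom (size + headroom = 16,000,000).
BUDGET = 16_000_000
sizes = {"s0": 15_992_304, "s42": 15_993_733, "s1234": 15_990_539,
         "s1337": 15_988_039, "s2025": 15_992_215}

headroom = {seed: BUDGET - size for seed, size in sizes.items()}
assert headroom == {"s0": 7_696, "s42": 6_267, "s1234": 9_461,
                    "s1337": 11_961, "s2025": 7_785}  # matches the reported values
assert all(size < BUDGET for size in sizes.values())  # every artifact fits
```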
Thanks for the detailed legality note. For readers who have not followed #1017 closely, my understanding is that the main checklist is roughly: causal (prefix-only) dependence, a fully renormalized distribution, score-before-update ordering, and a single left-to-right pass.
The score-before-update part looks fine to me. The part I am still unsure about is how Condition 1 holds in the actual implementation. Could you clarify whether that is the intended reading? If the claim is that this is still prefix-only under #1017, a short explanation in the PR body (or an inline code comment) would help reviewers connect the implementation to the four conditions.
Mined the top 20 open PRs at openai/parameter-golf and found that PARALLEL RESIDUALS (compute attn + mlp in parallel from the same pre-norm input) is in 3 of the top 6 recent records:

- PR openai#1437: SP8192 + Parallel Residuals + 3L Recurrence — val_bpb 1.07800
- PR openai#1420: Triple Loop + Parallel Residuals + N-gram Tilt — val_bpb 1.08014
- PR openai#1425: PROTEUS Parallel Residuals + INT5/INT6

We never tried it. Patch 13 adds USE_PARALLEL_RESIDUALS=1, which switches Block.forward from serial (x = x + attn(x); x = x + mlp(x)) to parallel (x = x + attn(LN(x)) + mlp(LN(x))). Idempotent; anchors on the first 3 lines of Block.forward, which are invariant under Patch 11 (smear gate).

Also discovered LESSONS.md §29 ("depth recurrence is DEAD under GPTQ") is contradicted by 5 of the top 10 recent records — they use depth recurrence + mixed-precision INT5/INT6 instead of pure int6 GPTQ. Worth re-investigating in a future research fire.

experiments.json — 4 new PR_* configs:
- PR0: parallel residuals alone (no n-gram, isolated effect)
- PR1: parallel + leaky_relu + full n-gram (current best stack + new trick)
- PR2: parallel + smear + leaky + full n-gram (max stack)
- PR3: PR1 with seed=42 for noise check

RESEARCH_LOG.md — full record of the research fire findings + the queue of techniques to investigate in future fires (n-gram tilt, depth recurrence, MuonEq-R, PartialRoPE+FA3, SwiGLU, codebooks).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
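The serial-to-parallel rewiring described in Patch 13 is small enough to sketch with toy sublayers (NumPy stand-ins for attention/MLP, not the real Block.forward; the LN placement follows the parallel form quoted above):

```python
import numpy as np

rng = np.random.default_rng(0)
W_attn = rng.normal(size=(8, 8)) * 0.1  # toy stand-in for the attention sublayer
W_mlp = rng.normal(size=(8, 8)) * 0.1   # toy stand-in for the MLP sublayer

def ln(x):
    # Layer norm without learned affine, enough for the wiring comparison.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

attn = lambda h: h @ W_attn
mlp = lambda h: h @ W_mlp

def serial_block(x):
    # Serial residuals: MLP sees the attention output.
    x = x + attn(ln(x))
    x = x + mlp(ln(x))
    return x

def parallel_block(x):
    # Parallel residuals (GPT-J style): one pre-norm read feeds both branches,
    # outputs summed into the same residual stream.
    h = ln(x)
    return x + attn(h) + mlp(h)

x = rng.normal(size=(4, 8))
assert serial_block(x).shape == parallel_block(x).shape == (4, 8)
```

The parallel form also lets the two branch GEMMs overlap, which is where the reported extra training steps come from.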
records/track_10min_16mb/2026-04-06_TripleLoop_FusedKernels_Ngram/ngram/fused_expert_blend.cpp
Subagent A (BPE-8192 trainer): the exact tokenizer is already on disk at data/tokenizers/fineweb_8192_bpe.model (370,908 bytes, the literal file behind LESSONS.md §18c -0.129 BPB Mac win). Just needs scp to pod.

Subagent B (closed/merged PR audit): top 8 merged records analyzed. Frequency table reveals 5+ convergent techniques we DON'T have:
- SmearGate in 6/8 (75%)
- zstd-22 in 5/8 (62%)
- EMA 0.997 in 4+/8
- Partial RoPE in 2+/8
- XSA in 1/8 (PR openai#1019 = literal openai#1 record at 1.11473)
- AR Self-Gen GPTQ in 1/8 (also PR openai#1019)

Subagent C (N-gram Tilt): FOUND the definition. It's a multiplicative single-token exponential boost from a causal eval-time n-gram cache:
p_tilt(t) = p_model(t) · exp(β · [t==hint]) / Z
Z = 1 + p_model(hint) · (exp(β) - 1)
Used by PRs openai#1437, openai#1420, openai#1430. Bespoke to parameter-golf, not in any published paper. Delta: -0.0029 to -0.0055 BPB.

Subagent D (TTT researcher): full ~80-line Score-First TTT sketch provided. Pattern: score chunk in inference_mode, train on chunk with SGD, move on. PR openai#461 framework. Cost ~410s on 8xH100. ~-0.0025 BPB.

Subagent E (records miner): top 5 records analyzed; EMA + XSA + Parallel Muon are convergent best practices. We have leaky_relu and that's all from the comp's stack. 8-action priority list compiled. Highest-EV next: scp BPE-8192, implement EMA, XSA, Partial RoPE, LN Scale.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dexhunter I can't secure 8xH100 right now; can you test this fix on your PR if you have access?

1. Hint gating (lines 400-409): is_bnd/is_ws now derived from tokens_[p-1] instead of tokens_[p]
   - Fixes the Rule 1 causal violation @Gusanidas identified
   - The probability distribution at position p no longer depends on x_p
2. Update functions (lines 448-456): new tok_is_bnd/tok_is_ws flags derived from the actual target tok
   - Ensures within_update() and word_update() still correctly track word boundaries using the real token
   - Without this, the state machine would segment words incorrectly, corrupting future hints
Force-pushed from 7a70e6d to 5e2eff8
…A captured

Subagent extracted the canonical formula from PR openai#1420 (the source for PR openai#1437 and the entire Legal N-gram Tilt family):

p_tilt(x_t) = p_model(x_t) * exp(beta * 1[x_t == hint]) / Z
Z = 1 + p_model(hint) * (exp(beta) - 1)

Verified legal under issue openai#1017's four conditions (causal, normalized, score-before-update, single-pass). Genuinely different from EM-INF (last fire's PASS) — multiplicative reweighting using an external signal, not entropy sharpening.

DEFERRED code patch despite high confidence because:
1. Eval-only metric — our loop measures train_loss with SKIP_FINAL_EVAL=1
2. Subagent's "50 LOC sketch" has an O(L^2) forward-pass bug; the real impl is 150+ LOC
3. Modifying the eval pipeline risks breaking the FINAL int8_zlib_roundtrip path

Marked HIGH PRIORITY for the next H100 escalation cycle. Estimated +0.0015-0.0030 BPB at our SP-1024 vocab size — same order as the largest single-technique gains.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
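The tilt formula is easy to sanity-check numerically; a minimal sketch (NumPy, with a toy distribution and hypothetical hint/β values — not the competition kernel):

```python
import numpy as np

def ngram_tilt(p_model: np.ndarray, hint: int, beta: float) -> np.ndarray:
    """Multiplicative one-token exponential boost with closed-form renormalization:

        p_tilt(t) = p_model(t) * exp(beta * [t == hint]) / Z
        Z         = 1 + p_model(hint) * (exp(beta) - 1)
    """
    Z = 1.0 + p_model[hint] * (np.exp(beta) - 1.0)
    p_tilt = p_model.copy()
    p_tilt[hint] *= np.exp(beta)
    return p_tilt / Z

# Toy distribution over a 5-token vocab; hint and beta are illustrative.
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
q = ngram_tilt(p, hint=2, beta=1.5)

assert np.isclose(q.sum(), 1.0)                 # Z renormalizes exactly
assert q[2] > p[2]                              # hinted token gains mass
assert all(q[i] < p[i] for i in [0, 1, 3, 4])   # every other token shrinks
```

The closed-form Z is what makes this cheap: only the hinted entry changes numerator, so no full-vocab softmax re-pass is needed.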
@abaybektursun yep, I am running a fixed version right now; for 5 seeds it will take about 85 min. (My agents) will report once we have more results.
@abaybektursun Confirming I tested the exact fix you described in the comment above, on my fork of the kernel for PR #1437. Results below — your fix is structurally correct, and we converged on essentially the same patch independently.

Summary: the causal patch is correct, BUT the within/word experts contribute essentially nothing once the leak is removed. The cleanest legal mode is to also set NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0.

My fix vs your proposal: identical structure.

```cpp
// Inside get_hints_batch, for each position p:
auto tok = uint16_t(tokens_[p]);        // for updates only (causal — runs after hint emit)
bool is_bnd = is_bnd_ && is_bnd_[tok];  // for updates only
bool is_ws  = has_ls_ && has_ls_[tok];  // for updates only

// CAUSAL FIX: hint gating must use prefix-only metadata (tokens_[p-1]).
bool prev_is_bnd = false, prev_is_ws = false;
if (p > 0) {
    uint16_t prev_tok = uint16_t(tokens_[p - 1]);
    prev_is_bnd = is_bnd_ && is_bnd_[prev_tok];
    prev_is_ws  = has_ls_ && has_ls_[prev_tok];
}

// ... compute hints with prefix-only gates ...
within_hint(prev_is_bnd, prev_is_ws, ...);
word_hint(prev_is_ws, ...);

// Updates still use current tok (causal because they run after the hint is locked):
token_update(hashes, max_avail, tok);
within_update(tok, is_bnd, is_ws);
word_update(tok, is_bnd, is_ws);
```

This matches your description exactly: prefix-only gating for hints, target-token flags for the updates.

Measurements (s42, sp8192 + par7+loop35+ngram + score-first TTT, 8×H100 SXM, 600 s):
Reading: the leak was worth ~+0.0027-0.0029 BPB of TTT score, consistent across the stacks I tested (with/without VR + Hessian SDClip). The structural fix recovers all of the causality, but the within/word experts then contribute negative BPB compared to leaving them at
For PR #1420: applying the same correction to your reported

For PR #1437: I'm updating my submission to ship the corrected (token-only) version with a transparency note. The corrected 5-seed mean estimate is ~1.08095 (vs the originally reported 1.07807). PR #1437 will no longer claim a record under the corrected number, but the public record will be honest about the bug and the correction. I'll ping back here once the seed sweep finishes (~30 min). Thanks again for catching this and for the quick fix proposal — collaboration was painless.
@dexhunter Sweet, thanks! Yes, can you please run my latest code as well? I pushed the fix.
…-only experts

The original n-gram tilt kernel inherited from PR openai#1420 had a causality bug: within_hint() and word_hint() in fused_expert_kernel.cpp::get_hints_batch gated their emission on is_bnd[tokens_[p]] / is_ws[tokens_[p]] (target token metadata at the position being scored), leaking 1-2 bits about the answer per scored position. This is an Issue openai#1017 condition 1 (causal dependence) violation. PR openai#1420 has the identical bug. @abaybektursun has acknowledged it in PR openai#1420's thread and proposed the same fix that's applied here:

* fused_expert_kernel.cpp: derive is_bnd / is_ws from tokens_[p-1] (last prefix token) for hint gating. Updates use the actual current tok via new tok_is_bnd / tok_is_ws variables so within_update / word_update still segment words correctly. Variable naming and structure copied verbatim from PR openai#1420's fix.
* Run command updated to set NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0. Empirically the within / word experts under prefix-only gating fire for the wrong positions (within fires for word-starts, word fires for mid-word) and contribute *negative* BPB. Disabling them gives 1.07951 on s42 vs 1.08108 with the experts active — token_hint is the only legitimate contributor.

5-seed verification (all on the patched kernel):

seed   pre-fix   corrected   delta
0      1.07751   1.08035     +0.00284
42     1.07809   1.08097     +0.00288
1234   1.07813   1.08127     +0.00314
1337   1.07801   1.08060     +0.00259
2025   1.07862   1.08135     +0.00273
mean   1.07807   1.08091     +0.00284

All 5 artifacts fit under 16 MB (15,988,802 - 15,995,572 bytes; 4.4-11.2 KB headroom). Pre-fix per-seed values preserved in submission.json under seed_results_pre_fix for the public record.
Bar comparisons (corrected mean 1.08091):
- PR openai#1394 (1.08563): beats by +0.00472, fails the 0.005 nat record bar
- PR openai#1413 ours (1.08279): beats by +0.00188, fails the record bar
- PR openai#1420 (1.08014): we lose by 0.00077 (PR openai#1420 is also tainted by the same bug; it would correct to ~1.08300 post-fix)

This PR is left open as a transparency / diagnostic record, NOT as a record claim. PR openai#1413 (no n-gram tilt at all) at 1.08279 remains our cleanest legal anchor. The README has been retitled "Diagnostic (causal-corrected)" and the legality fix is documented in a dedicated section.
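The per-seed table reduces to the quoted means and deltas by straightforward averaging; a quick check in Python:

```python
# Per-seed val_bpb from the 5-seed verification table above.
pre_fix   = {0: 1.07751, 42: 1.07809, 1234: 1.07813, 1337: 1.07801, 2025: 1.07862}
corrected = {0: 1.08035, 42: 1.08097, 1234: 1.08127, 1337: 1.08060, 2025: 1.08135}

mean = lambda d: round(sum(d.values()) / len(d), 5)
assert mean(pre_fix) == 1.07807    # pre-fix mean as reported
assert mean(corrected) == 1.08091  # corrected mean as reported

# Per-seed leak size (the delta column), e.g. the largest one:
assert round(corrected[1234] - pre_fix[1234], 5) == 0.00314
```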
…ikely illegal), merged SOTA unchanged

- PR openai#1430 (renqianluo, Apr 7): claims 0.39642 bpb via per-sample SLOT + n-gram order-22 hash + TTT. Flagged likely illegal: n-gram hash cache matches closed openai#727/openai#741 pattern; SLOT unruled (Issue openai#140). No organizer reviews yet.
- Merged SOTA unchanged at 1.1147 (PR openai#1019)
- Issue openai#140: no new rulings on SLOT, causal SLOT, or ETLB
- Legal path unchanged: PR openai#1420 stack (SP8192 + Triple Loop + N-gram Tilt + Legal TTT) targeting ~1.075–1.077
- No new breakthrough papers beyond existing tracking

https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
…ctions

- N-gram Tilt bug: PR openai#1420 kernel is non-causal; PR openai#1437 (dexhunter) found/fixed it (pre-fix 1.07807 → post-fix 1.08091). Updated primary reference to the PR openai#1437 kernel.
- PR openai#1423 flagged illegal (pre-quant TTT, same as openai#1351/openai#1408/openai#1416)
- Added full PR openai#1421–1444 scan results
- Updated best open legal PR: ~1.08091 (PR openai#1437), not 1.08014 (openai#1420)
- Session 8 lessons learned added to CLAUDE.md

https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
…led, 2 new PRs validate deferred specs

Patches 15/16/21 still uncontested in 150+ open + 10 closed PRs (5 audits in a row). Strong evidence of true novelty. PR #1430 still OPEN, 0 comments, no comp-owner activity since creation. Increasingly likely to be reverted or outlawed.

NEW PRs validate two of our deferred H100 escalation specs:
- PR #1445 (1.0889): "Depth Recurrence + EMA 0.9965" → validates the Patch 17 EMA spec
- PR #1446 (1.0960): "int6 GPTQ + lzma" → validates the Patch 23 INT6 GPTQ-Lite spec

Combined with PR #1437/#1420 already validating the Patch 23 N-gram Tilt, the 3-spec H100 escalation bundle (EMA + Tilt + INT6 GPTQ) is now triple-confirmed by independent comp PRs.

Spend ~$3.00/$36 (8% utilization). Pod healthy at 6h uptime.

Reminder: depth recurrence is back on the table — 5+ records use it now. LESSONS.md §29 needs another update from "stale" to "real direction".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deep review of train_gpt.py reveals that ttt_adapt_adamw() trains on val data for 10 full epochs (TTT_EPOCHS=10, TTT_ENABLED=1 by default) before quantization. This is the same pre-quantization TTT violation as PRs openai#1423 and openai#1416 — the artifact encodes information from the entire validation set, violating strict causal dependence. The ~0.04-0.05 BPB improvement from dTTT is entirely attributable to fitting the test set. Best verified-valid score updated to 1.0801 BPB (PR openai#1420).

https://claude.ai/code/session_017F8GGeKA7MhUoQdqMGcTpg
…reshold

The previous "Diagnostic" framing was based on a unit error: I compared val_bpb deltas as if they were nats-per-token deltas, missing the factor of ~2.583 (this submission's val_loss / val_bpb ratio, i.e. ln 2 × the mean bytes per token of the sp8192 val set). With the correct units, the causal-corrected 5-seed mean (1.08091 BPB, 2.79210 nats/token) clears the 0.005-nat record bar against PR openai#1394:

- vs PR openai#1394 (1.08563): +0.01219 nats per token ✅ 2.4× the bar
- vs PR openai#1019 (1.11473): +0.08736 nats per token ✅ comfortably
- vs PR openai#1413 (ours): +0.00486 nats per token — essentially tied
- vs PR openai#1420 (1.08014): -0.00199 nats — but PR openai#1420 has the same kernel bug; its corrected ~1.08298 yields +0.00535 nats ✅

Title reverted from "Diagnostic (causal-corrected)" to "Record". The legality-fix section is preserved (the kernel patch is still a real correctness fix matching @abaybektursun's proposed patch in PR openai#1420). The leak magnitude in the legality-fix section now correctly states "+0.00284 BPB ≈ +0.00734 nats per token" instead of just BPB. Pre-fix per-seed values are still preserved in submission.json under seed_results_pre_fix for the public record.
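The unit conversion is reproducible from the numbers quoted above; a small sketch:

```python
# BPB deltas convert to nats-per-token deltas via the val_loss / val_bpb ratio.
val_loss_nats = 2.79210  # nats per token, this submission
val_bpb = 1.08091        # bits per byte, this submission
factor = val_loss_nats / val_bpb  # ≈ 2.583 nats-per-token per unit of BPB

delta_bpb = 1.08563 - 1.08091     # BPB gap vs PR openai#1394
delta_nats = delta_bpb * factor   # the quantity the 0.005-nat bar is measured in

print(round(factor, 3), round(delta_nats, 5))
# The leak magnitude converts the same way: 0.00284 BPB * factor ≈ 0.00734 nats.
```

With these inputs, factor ≈ 2.583 and delta_nats ≈ 0.01219, matching the bar comparison above.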
Force-pushed from 5e2eff8 to f265f65
…bpb 1.08014

5-seed mean 1.08014 BPB (std=0.0004), best seed 1.07971.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from f265f65 to d2bda6f
Documents the Rule 1 causal violation in PR openai#1420's n-gram tilt code, including why it was hard to spot, a concrete detection checklist, and the fix pattern of separating prefix-only flags from target flags. https://claude.ai/code/session_017F8GGeKA7MhUoQdqMGcTpg
Thesis: the speed path is the most underutilized section of openai/parameter-golf. The quality path has 170+ PRs; the speed path has maybe 30 and 2-3 genuine novelties. Our 13x per-GPU gap vs comp records is almost entirely soft — most of it collapses under free wins + comp ports.

Findings:

TIER 0 FREE WINS (before any kernel work) — ~3x speedup, 2-3 days total:
- Shot 0a: drop grad_accum_steps 8→1. The single biggest easy win hiding in plain sight. We're paying 8x kernel-launch overhead because grad_accum was inherited from an 8xGPU distributed config. 5 LOC, 30-50% speedup.
- Shot 0b: eval batched + streaming KV cache. Current sliding-window eval is 625K sequential forwards at B=1 stride=64. 97% of each window's context is shared with the previous. Streaming KV (StreamingLLM, arXiv 2309.17453) gives 5-15x eval speedup, saving 3-5 min of the 600s budget.
- Shot 0c: SkyLadder progressive seq_len 256→2048 (NeurIPS 2025, arXiv 2503.15450). 22% throughput + 1-3.7% quality. Already in Mac SETUP §35 backlog, never shipped.
- Shot 0d: train-data GPTQ calibration (PR openai#1219, comp-organizer-approved). Replaces 220s AR self-gen with 14s. +2000 extra training steps.
- Free: TORCHINDUCTOR_MIX_ORDER_REDUCTION=0 + torch 2.9.1 pin. +8.8% step time.

TIER 2 COMP-PORT WINS we missed in the original Phase 2 plan:
- Shot 9: FA3 varlen + window + mixed seq_len across GPUs (PR openai#1212 holds the fastest step on the leaderboard at 69.6 ms/step)
- Shot 10: Parameter Banking + Parallel Muon (PR openai#399): 66 nn.Linear → 4 contiguous 3D banks → Newton-Schulz becomes one bmm → optimizer time 19.7 ms → 1.3 ms (15x). World-novel, NOT in modded-nanogpt.
- Shot 11: CUTLASS EVT backward with the novel `post=0.5·act_grad·pre` identity (PRs openai#1105, openai#1420). The identity itself looks world-novel.
- Shots 13-14: eval-path wins (Triton KV-cache backend, fused softcap+CE megakernel). Combined eval speedup ~5x on top of Shot 0b.

TIER 3 BIG DREAMS (world-first opportunities):
- Megadream 1: **Training megakernel** (fwd+bwd+optim in a single persistent SM kernel). HazyResearch / Mirage / MegaQwen have inference megakernels; nobody has built one for TRAINING. 1.3us × ~600 launches per step = 16% of our step budget is pure launch overhead. 5-7 days, 500-1500 LOC, ThunderKittens templates. Potential PhD-defensible mini-paper.
- Megadream 2: **Streaming KV sliding-window eval** (our Shot 0b, also novel)
- Megadream 3: **Fuzzy LR bandit per microbatch** — the user's "dial-in" hint operationalized. Thompson sampling from {0.5x, 1x, 2x} × base_lr. 80 LOC.
- Megadream 4: **CPU n-gram precompute thread** — the user's "CPU while GPU" hint operationalized. A background thread pre-computes n-gram hash tensors. 50 LOC.
- Megadream 5: **GPU-resident successive halving** — the user's "GPU tests" hint operationalized. Run 4 replicas × 100 steps inside the 600s budget, pick the winner, continue. Online hyperband. 200 LOC.
- Megadream 6: **AOTInductor precompile + binary ship** — kill the 5+ min compile cold-start permanently.

Stacked expected impact:
- Phase 1 (now): 180 steps / 600s, val_bpb ~1.4-1.6
- +Tier 0 free wins: ~540 steps, val_bpb ~1.25-1.35
- +Tier 1 kernel work: ~2000 steps, val_bpb ~1.15-1.22
- +Tier 2 comp ports: ~4000 steps, val_bpb ~1.10-1.15
- +Tier 3 Megadream 1 (training megakernel): ~8000 steps, val_bpb ~1.08-1.12
- +Tier 3 all: ~10000 steps, val_bpb ~1.06-1.10 (**ahead of comp on 1xH100**)

10000 steps on 1xH100 = 4x more per-GPU training than the comp's 20000 on 8xH100. That's where val_bpb drops BELOW comp records.

Key finding: the eval path holds the biggest speed wins currently, not training. Our sliding-window eval eats 10-15 min of the 600s budget. Tier 0b + Tier 2 Shots 13-14 save 5-8 min per eval pass, more than any single training-side patch would buy at our current rate.

Source reports: /tmp/phase2_comp_speed_audit.md (22 PRs surveyed), /tmp/phase2_world_speed_research.md (12 research areas surveyed).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. phase2/metrics.py (~330 lines)
- Structured JSONL telemetry for per-step timing + GPU/CPU/RAM utilization
- mark(event, **extra) for phase-level events (setup done, train started, etc)
- step(step, ms, train_loss, tok_per_sec, prefetch_queue_depth, ...) hot-path
- Best-effort nvidia-smi + /proc/meminfo + torch.cuda.memory_allocated readers
so the helper has no new deps (no pynvml, no psutil)
- Sparse nvidia-smi sampling (every 50 steps by default) to avoid per-step cost
- print_summary() for end-of-run table, compare_runs() for before/after
- Smoke test in __main__ passes
- Used by every subsequent Phase 2 shot so we can measure the speedup and
verify the val_bpb invariant
2. submission/run.sh: free env var wins (Tier 0, zero risk, zero LOC)
- TORCHINDUCTOR_MIX_ORDER_REDUCTION=0 — disables the Inductor pass that
@abaybektursun fixed upstream in pytorch#179494 / #179422 specifically for
this comp's shape. Per PR openai#1420 this gives +5.93 ms/step (+8.8%).
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,garbage_collection_threshold:0.8
reduces fragmentation and avoids cudaMalloc stalls during training.
Respects user override if already set.
- Honest note on grad_accum: the research agent's "drop 8 → 1 for 30-50% win"
claim is wrong. Peak activation memory at grad_accum=1 microbatch=384 seqs
is ~448 GB (vs our current 56 GB at microbatch=48), blows H100 80GB 8×.
We KEEP grad_accum=8 for world_size=1 at the current TRAIN_BATCH_TOKENS.
Documented in the script so future-me doesn't fall for it again.
Next: data prefetch thread + pinned RAM (Task 9), then compile cache warmup
(Task 10), then CPU n-gram precompute (Task 11), then wire phase2/bootstrap
(Task 12).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
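The helper's shape is easy to sketch. This is a minimal hypothetical version following the commit message's mark()/step() naming, not the actual phase2/metrics.py (which additionally samples nvidia-smi and /proc/meminfo):

```python
import json
import os
import tempfile
import time

class StepMetrics:
    """Minimal JSONL telemetry logger: one JSON object per line, append-only,
    so a crashed run still leaves parseable telemetry behind."""

    def __init__(self, path: str):
        # Line-buffered text append: each record hits disk at the newline.
        self.f = open(path, "a", buffering=1)

    def mark(self, event: str, **extra):
        # Phase-level events (setup done, train started, ...).
        self._write({"kind": "mark", "event": event, "t": time.time(), **extra})

    def step(self, step: int, ms: float, **extra):
        # Hot-path per-step record; extras carry train_loss, tok_per_sec, etc.
        self._write({"kind": "step", "step": step, "ms": ms, "t": time.time(), **extra})

    def _write(self, obj: dict):
        self.f.write(json.dumps(obj) + "\n")

# Smoke test, mirroring the __main__ check described above.
path = os.path.join(tempfile.mkdtemp(), "run.jsonl")
m = StepMetrics(path)
m.mark("setup_done")
m.step(1, 51.3, train_loss=3.2)

rows = [json.loads(line) for line in open(path)]
assert [r["kind"] for r in rows] == ["mark", "step"]
```

JSONL keeps the hot path to a single dict-serialize and write, which is why sparse nvidia-smi sampling (rather than per-step) is the only other knob needed to keep overhead negligible.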
Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt
val_bpb: 1.08309 (5-seed mean, std=0.00044)
Changes
- **One extra loop pass through layers 4-5.** PR #1394 ("Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563, 5-seed mean") passes through layers 4-5 three times total (NUM_LOOPS=2, giving 15 virtual layers from 11 physical). I add a fourth pass (NUM_LOOPS=3), giving 17 virtual layers. The encoder becomes `[0,1,2,3,4,5,4,5]` and the decoder `[4,5,4,5,6,7,8,9,10]`. It costs about 200 training steps, but the extra depth more than compensates. Quadruple looping (19 virtual) was worse because the step count drops too far.

- **Activate looping earlier (0.35 instead of 0.50).** At 0.50, half the training budget runs without the looped layers doing anything. I swept `{0.30, 0.35, 0.40, 0.50}` on seed 1234. 0.35 won, though 0.40 was close. Below 0.35 the model doesn't get enough non-looped warmup and quality degrades.

- **Fused MLP kernels (Triton TMA forward + CUTLASS EVT backward).** This took the most engineering effort and gave the most BPB back. The forward fuses `leaky_relu(fc(x), 0.5).square()` into a single Triton TMA kernel so the 403MB intermediate never hits HBM. The backward fuses `(grad_out @ proj.weight) * act_grad` into a CUTLASS 3.x Epilogue Visitor Tree, running the elementwise multiply while tiles are still in registers. Together: ~10% higher throughput, +127 training steps in the same 600s. I initially tried wrapping the entire MLP in a custom `autograd.Function`, but that killed `torch.compile`'s cross-layer fusions and made everything 2.7x slower. The trick was to fuse surgically, just the forward activation and one backward GEMM, and let the compiler handle the rest. Details in Appendix A.1–A.3.

- **Parallel residuals for layers 7-10.** GPT-J style (Wang & Komatsuzaki, 2021): attention and MLP both read from the same pre-residual input, outputs summed in parallel. I expected this to mostly help quantization (less interference between attention and MLP during GPTQ calibration), and it did tighten the gap slightly. The bigger surprise was +68 training steps from the faster forward pass. I also tried Hessian-Aware SDClip from PR #1412 alongside this, but it made things worse with triple looping. It probably needs its own λ tuning for the deeper architecture.

- **Eval-time n-gram tilt (causality-fixed).** The original submission had a causality violation in the within-word and word-start hint channels: the `is_bnd`/`is_ws` flags were derived from `tokens_[p]` (the target token being predicted), which made the hint-gating decision depend on the target. This was caught by @Gusanidas in review. The fix splits the flags into two sets: prefix-derived flags (`tokens_[p-1]`) for hint gating, and target-derived flags (`tokens_[p]`) for post-scoring state updates. However, the within-word and word-start channels cannot produce useful hints without target-dependent gating — they either fire too broadly or at the wrong positions. After testing all causal alternatives (prev_tok gating, state-based gating, disabling channels), the winning configuration uses token_hint only (orders 8-16), which was always fully causal. The remaining token_hint channel provides a consistent -0.00014 BPB across all seeds. The improvement is real but small — most of the original -0.0029 delta came from the (now-removed) target-dependent gating in within/word channels. Full details in Appendix A.4.

N-gram legality (#1017 conditions)
Update (post-review fix): The original submission had a Rule 1 violation in the within-word and word-start hint channels. The `is_bnd`/`is_ws` flags used to gate hint generation were derived from `tokens_[p]` (the target), making the decision of whether to produce a hint depend on the token being predicted. This was caught by @Gusanidas. The fix removes the within-word and word-start channels from hint output entirely — they cannot produce useful hints without target-dependent gating. Only the `token_hint` channel (orders 8–16) remains, which was always fully causal. The n-gram delta dropped from -0.0029 to -0.00014 BPB.

Audited against the four conditions proposed in #1017 for eval-time adaptation:
Condition 1, Causal dependence (`p_t` depends only on artifact + `x_1...x_{t-1}`): `compute_hashes` reads `tokens[pos - k - 1]` for k=0,1,..., all strictly before position `pos`. `token_hint` looks up hash tables containing only entries inserted by prior iterations. The target token `tokens[pos]` is read only for the post-scoring update phase.

Condition 2, Full normalized distribution: The tilted distribution is `p_tilt(t) = p_model(t) · exp(β · 1[t==hint]) / Z` where `Z = 1 + p_model(hint) · (exp(β) - 1)`. Proper probability distribution over the full vocabulary.

Condition 3, Score-before-update: Hint and beta are written to the output arrays before `token_update` inserts `tokens[pos]` into the tables.

Condition 4, Single left-to-right pass: `get_hints_batch` processes positions sequentially. The sliding window scores each token exactly once.

Double-buffered async data prefetch. Background thread + pinned memory + separate CUDA stream. I built this to work around the virtualized disk I/O on cloud H100 instances (see below), but it ended up helping in every setting I tested.
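Conditions 1 and 3 together amount to a strict lookup-then-update discipline. A minimal sketch of that pattern (a hypothetical simplification for illustration; the real `token_hint` channel uses hashed contexts of orders 8-16 in an open-addressing C++ table):

```python
def score_with_hints(tokens, order=2):
    """Online n-gram with strict lookup-then-update discipline:
    the hint at position t is read from a table built only from
    x_1..x_{t-1}; the target x_t is inserted only after scoring."""
    table = {}
    hints = []
    for pos in range(len(tokens)):
        ctx = tuple(tokens[max(0, pos - order):pos])
        hints.append(table.get(ctx))   # lookup: context strictly before pos
        table[ctx] = tokens[pos]       # update: after the hint is emitted
    return hints

hints = score_with_hints([1, 2, 3, 1, 2, 3, 1, 2, 3])
assert hints[2] is None   # first occurrence of context (1, 2): no hint yet
assert hints[5] == 3      # context (1, 2) was seen before and predicted 3
```

Because the table insert happens after the hint is read out, the predicted distribution at position t never depends on the token being scored.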
PyTorch 2.9.1 instead of 2.11. See below.
What the model looks like inside
I ran per-matrix rate-distortion, recurrence error amplification, and skip gate analysis on the trained model. Three things stood out:
Loop layers are 2.2x more sensitive to quantization than non-loop layers. Blocks 4 and 5 get reused across passes, so rounding error in those weights compounds. The single most sensitive matrix in the entire network (block 4's value projection) has 80x the BPB-per-byte cost of the least sensitive. This suggests mixed-precision quantization (more bits for loop layers) is the biggest remaining opportunity.
The third loop pass contributes 63% of what the second does. I measured a contraction ratio of 0.634 across passes: each loop iteration changes the representation by ~63% as much as the previous one did. A hypothetical 4th pass would add only 0.634³ ≈ 25% new information, which matches the empirical finding that quadruple looping hurts. The 3rd pass at 63% is clearly worth the step cost; the 4th at 25% is not.
All 8 skip connections are load-bearing. Gates are 0.61-0.70 (sigmoid), meaning roughly 35% encoder / 65% decoder blend. The first loop pass's skip connections (skips 2,3) have the highest weight norms (21.9, 19.5 vs 2.8-13.8 for others), so the first encoder pass through layers 4-5 is the most important information source for the decoder.
What the progress looks like: three models on the same prompt (temp=0.8)
Prompt (50 tokens): "Insurance Company Declares Living Man Dead George Johannesen is very much alive. Which is why it was so surpr"
Ground truth: ising when the Canadian man received a letter addressed "To the Estate of George Johannesen." Even more surprising is that it came from his insurance company, who should really be on top of such things...
#1019 drifts into incoherence ("Rachel Drobles... techniques of the car industry... Lyon Man is dead"). #1105 stays on topic but loops on "Living Man is the only insurance company." This model picks up the actual narrative thread ("the Canadian man received a letter"), invents plausible biographical details, and maintains coherence throughout. All three are wrong about what happens next, but the errors become progressively more plausible.
Debugging the platform
This was the hardest submission I've worked on. Most of the time went to infrastructure, not the model.
Virtualized disks tank throughput. The cloud H100 instances I rented use virtio block devices. The coprime-stride data loader from #726 does random reads across 143 shards, which is fine on bare metal but brutal on a virtual disk. That's what led me to build the async prefetch. It turned out to help everywhere, not just on virtualized storage.
PyTorch 2.9.1 vs 2.11: a full day lost. I could not reproduce results from other submissions. Training the same architecture with the same seed gave 0.0042 BPB worse results on torch 2.11. (I initially measured a 0.015 gap, which turned out to be a wrong model file on the server. The real gap, once I controlled for that, was 0.0042.) I swapped Triton versions, disabled autocast, forced cuBLAS backends, diffed Inductor-generated kernels. The root cause was two independent issues:
Autocast backward changed in PR pytorch#165068 (landed Dec 2025, present in 2.11, absent from 2.9.1). Two lines in `cached_cast()` add an `AutoGradMode enable_grad(true)` guard on weight casts, inserting extra `ToCopyBackward` nodes into the autograd graph. This changes floating-point accumulation order by 1 ULP of bf16 (7.15e-7) in saved activations, which compounds over 5000 momentum steps into +60KB of weight entropy. The model goes from fitting at 16.00MB (no pruning) to 16.06MB (5.4% pruning needed). I verified eval is version-invariant to 0.00003 BPB; the entire gap comes from training.

Inductor over-fusion in backward codegen: Inductor 2.11's `mix_order_reduction` fuses `_fused_rms_norm_backward` into adjacent kernels, producing fewer but larger Triton kernels (65 functions / 11,855 lines vs 71 / 11,292 in 2.9.1). The fatter kernels hit register pressure and cost +5.93ms per backward pass (+8.8%). In a 600s budget, that's ~57 lost training steps. I submitted a fix that disables `mix_order_reduction` by default (aligning open-source with fbcode, where it was already off): pytorch/pytorch#179494.

Separately, our fused CUTLASS kernel crashed on torch 2.11 because Triton 3.6.0's `TensorDescriptor.from_tensor()` tries to access `.data_ptr()` on FakeTensors during `torch.compile` tracing. I traced that through Inductor's `FallbackKernel` codegen and submitted a second fix: pytorch/pytorch#179422. Two PyTorch PRs from a golf competition.

In time-budgeted competitions, the platform is the model. A 6ms/step Inductor regression can cost as much BPB as most algorithmic innovations.
How this submission came together
The first few days were mostly wasted. I tried improving the architecture directly: 12 layers, SwiGLU, mixed int5/int8 per layer. Nothing worked. The model was 930KB over the 16MB budget and MLP weights alone were 69% of the compressed artifact. Brotli-11 was already within 1-2% of Shannon entropy. There was nowhere to go.
Worse: a new optimizer schedule I'd been developing (Mixed NS5, a convergent Newton-Schulz coefficient ramp) changed the weight distribution enough that the model no longer fit in the 16MB budget. It was 930KB over, and aggressive pruning to fit destroyed the quality gains.
Then I lost a full day to PyTorch version divergence (described above). Besides the upstream fix, the useful thing that came out of it was a proof that compressed model size is a chaotic function of training hyperparameters: 1 ULP of bf16 rounding (7.15e-7) in a saved activation compounds over 5000 momentum steps into 60KB swings in Brotli output. I also proved that L2 weight decay is scale-invariant under max-abs quantization: `Q(γW) = Q(W)`. All the per-bank WD tuning I'd been doing was chasing noise.

Once I stopped trying to control compression through training and focused on what was actually deterministic (GPTQ deadzone for size, n-gram tilt for eval), things moved fast. Clean reproduction of the baseline. Pivot to SP8192 + SDClip. Triple looping. Fused kernels. Parallel residuals. Each gain was small but they stacked: 45 experiments, five seeds, 1.08014 BPB.
What didn't work
Innovations that worked on earlier models but not here
Mixed NS5 coefficient schedule. On our SP4608 model this was worth -0.0066 BPB for free: use the standard Muon polynomial `(3.4445, -4.775, 2.0315)` to ramp singular values toward 1, then switch to the convergent polynomial `(1.875, -1.25, 0.375)`, which has `p(1)=1, p'(1)=0`, to lock them in. The split adapts per bank, using aspect ratio as a proxy for condition number. On the SP8192 architecture the coefficient schedule produced weight distributions that were hostile to Brotli compression: the model was 500KB over budget and needed 46% pruning.

EC-GPTQ (entropy-constrained rounding). Inside the GPTQ inner loop, I added an element-wise deadzone, `dz = λ · d / s²`, where d is the Hessian diagonal and s is the scale. Borderline ±1 values get rounded to 0 when the GPTQ error-compensation cost is small. On the SP4096 architecture this achieved 10x better rate-distortion than uniform deadzoning (0.5×10⁻⁵ BPB/KB vs 6.8×10⁻⁵). On the SP8192 + SDClip architecture it was harmful: SDClip's `c = k·σ` already controls entropy per row, and adding EC-GPTQ on top just introduced extra quantization damage for no compression benefit.

Per-bank weight decay tuning. MLP is 69% of the compressed model. I tried giving MLP slightly lower WD (0.07 vs 0.09) to improve quality, offset by higher attention WD. Even ±0.005 from the baseline was catastrophic: lower MLP WD means larger MLP weights, which Brotli can't compress cheaply, so the artifact blows up.
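The fixed-point claims for the two NS5 polynomials above can be checked directly. Newton-Schulz iterations act on singular values via the odd quintic p(x) = a·x + b·x³ + c·x⁵:

```python
def p(x, coeffs):
    a, b, c = coeffs
    return a * x + b * x**3 + c * x**5

def dp(x, coeffs):
    a, b, c = coeffs
    return a + 3 * b * x**2 + 5 * c * x**4

muon = (3.4445, -4.775, 2.0315)   # standard Muon: fast ramp toward 1
conv = (1.875, -1.25, 0.375)      # convergent: locks singular values at 1

assert p(1.0, conv) == 1.0        # p(1) = 1: x = 1 is a fixed point
assert dp(1.0, conv) == 0.0       # p'(1) = 0: the fixed point is attractive
assert abs(p(1.0, muon) - 1.0) > 0.1   # 1 is not a fixed point of the Muon polynomial
```

That is the whole rationale for the switch: the Muon coefficients move singular values toward 1 quickly but oscillate around it, while the convergent coefficients hold them there once close.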
L2 weight decay as a compression lever. I proved mathematically that L2 WD is scale-invariant under max-abs quantization: `Q(γW) = round(γW / (max|γW|/31)) = round(W / (max|W|/31)) = Q(W)`. Multiplying all weights by a constant changes nothing about the quantized integers. This was useful to understand (it meant all the WD-based compression tuning I'd been doing was chasing noise), but it also closed a door.

Code size
All code ships as part of the artifact: `train_gpt.py`, the CUTLASS EVT source, and the n-gram C++ source. For a competition run, these would be bundled into a single LZMA-compressed blob. `train_gpt.py` is minified with `python-minifier` (annotations, pass statements, and docstrings removed; variable names preserved). `submission.py` (143 bytes) is the entry point: it decompresses `train_gpt.py.lzma` and executes it. For a competition run, `torchrun` would invoke `submission.py` instead of `train_gpt.py`. Total code cost: 19,811 bytes. All 5 seeds fit under 16MB with 1.8-9.9KB headroom. The unminified `train_gpt.py` (64KB) is included in the PR for readability.

Requirements

`pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291`

Credits
Full component lineage: every piece traced to its origin PR
This competition is deeply collaborative. Nearly every component traces through multiple contributors. I've tried to credit the earliest PR that introduced each technique, but many were refined across several submissions.
Appendix
A.0 Ablation: fused 5-seed without parallel residuals
5-seed results: fused kernels + triple loop + n-gram, no parallel residuals
5-seed mean: 1.08080 BPB (std=0.00064). Seed 1234 n-gram was run in terminal (1.08007), not logged to file.
Adding parallel residuals (layers 7+) improves seed 1234 from 1.08007 to 1.07971 (-0.00036), primarily from +68 extra training steps due to the faster parallel forward pass. Full parallel-residuals 5-seed results are in the main table above (mean 1.08014).
A.1 Fused MLP Kernels: Design & Implementation
These kernels were first developed for PR #1105 on the SP4608 architecture. This submission ports them to the SP8192 + triple-loop architecture and integrates the CUTLASS EVT backward with `torch.compile`'s tracing.

Forward (Triton TMA): fuses F.linear + LeakyReLU(0.5) + square
Fuses `F.linear(x, up_w) -> LeakyReLU(0.5) -> square` into a single kernel. The 403MB intermediate never touches HBM.

Uses Triton's Tensor Memory Accelerator (TMA) descriptors for H100-native global-to-shared memory loads. Block sizes `128x256x64` with 8 warps, 4 pipeline stages. The kernel performs the GEMM accumulation in FP32, then applies the activation and squaring inline before writing back to BF16.

The interleaved write pattern splits the accumulator into two halves via `tl.reshape + tl.permute + tl.split`, writing the activation gradient and the post-activation to separate output buffers in a single pass.

Backward (CUTLASS EVT): fuses (go @ down_w.T) * act_grad
Fuses `(go @ down_w.T) * act_grad` into a single CUTLASS 3.x kernel via Epilogue Visitor Tree. The elementwise multiply runs in the GEMM epilogue while tiles are still in registers, eliminating one 403MB write + read per layer.

I store the activation gradient in the forward pass instead of the pre-activation. This removes all branching from the backward: the identity `post = 0.5 * act_grad * pre` holds for both signs of the pre-activation.

This reduces the CUTLASS EVT epilogue to a trivial 3-node tree: `Sm90EVT<multiplies, AccFetch, AuxLoad>`.

Why surgical fusion, not a full-MLP autograd.Function
`torch.compile`'s cross-layer fusions (RMSNorm backward, residual adds, RoPE backward) account for ~21.6% of step time. Wrapping the full MLP backward in an `autograd.Function` makes it opaque to Inductor, so everything runs in eager mode, a net 2.7x slowdown (I hit this in my #670). So I fuse only the forward activation and one backward GEMM + pointwise, preserving the compiler's scope over everything else.
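The branch-free identity `post = 0.5 * act_grad * pre` from the backward design can be verified numerically; a minimal NumPy sketch with LeakyReLU slope 0.5:

```python
import numpy as np

pre = np.random.default_rng(0).standard_normal(1 << 12)
slope = np.where(pre > 0, 1.0, 0.5)   # LeakyReLU(0.5) slope per element
lrelu = slope * pre                    # leaky_relu(pre, 0.5)
post = lrelu ** 2                      # forward output: leaky_relu(...)**2
act_grad = 2.0 * lrelu * slope         # d(post)/d(pre), saved by the forward

# the backward recovers post from the saved act_grad with no sign checks
assert np.allclose(post, 0.5 * act_grad * pre)
```

For x > 0 both sides equal x²; for x < 0 both sides equal (0.5x)², which is why storing `act_grad` instead of the pre-activation removes all branching from the epilogue.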
Per-layer timing and end-to-end
End-to-end (35 steps, seed=42, 2xH100):
On 8xH100: unfused 4553 steps → fused 4680 steps in 588s (+127 steps, +2.8%).
A.3 Step-Time Profile
Where all 313ms goes (2xH100, Nsight Systems)
A.4 N-Gram Tilt
The n-gram system was originally developed in PR #1105 for SP4608 models. This submission ports it to SP8192. Source code: `ngram/fused_expert_blend.cpp` (C++ open-addressing hash, nanobind FFI) and `ngram/eval_ngram.py` (tilt math + sliding window). Eval time on 8xH100: ~90s.

Post-review causality fix
The original submission had three hint channels: `token_hint` (orders 8–16), `within_hint` (within-word BPE completion), and `word_hint` (word-start prediction). @Gusanidas identified that `within_hint` and `word_hint` used `is_bnd`/`is_ws` flags derived from `tokens_[p]` (the target token) to gate whether a hint was produced — a Rule 1 violation.

What was invalid: The gating decision "should I produce a hint at this position?" depended on whether the target token was a word boundary or had a leading space. This meant the probability distribution P(x_t | x_1...x_{t-1}) changed depending on the value of x_t itself.
What was tried to salvage within/word channels:

- `is_bnd`/`is_ws` from `tokens_[p-1]` (prefix): semantically inverted, delta = +0.00033 (harmful)
- `within_len_state` only: fires too broadly, delta = +0.00120 (harmful)

Conclusion: The within/word channels' -0.0025 BPB contribution came entirely from target-dependent gating. Without it, they add noise. Only `token_hint` (orders 8–16) produces a legitimate improvement. The fix removes within/word from hint output while keeping their state updates (dead code, no effect).

Parameter sweep (token_hint only, 4M token subset, 8 GPUs in parallel):
Full-val delta with best params (beta=1.5): consistent -0.00014 BPB across all 5 seeds. The improvement is real but small.
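The tilt normalization itself (Condition 2 in the legality audit) can be checked numerically. A sketch of the single-hint exponential tilt, not the C++ implementation:

```python
import numpy as np

def tilt(p_model, hint, beta):
    """p_tilt(t) = p_model(t) * exp(beta * 1[t == hint]) / Z,
    with Z = 1 + p_model[hint] * (exp(beta) - 1)."""
    Z = 1.0 + p_model[hint] * (np.exp(beta) - 1.0)
    boost = np.where(np.arange(len(p_model)) == hint, np.exp(beta), 1.0)
    return p_model * boost / Z

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(64))      # stand-in for the model's softmax output
q = tilt(p, hint=7, beta=1.5)

assert abs(q.sum() - 1.0) < 1e-9    # proper distribution over the full vocab
assert q[7] > p[7]                  # the hinted token gains mass
mask = np.arange(64) != 7
assert np.all(q[mask] < p[mask])    # every other token is renormalized down
```

Because Z sums the tilted masses exactly, only the hinted token's probability needs to be touched explicitly; every other probability is divided by the same Z, so the output is a valid distribution for any β.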
Causality proof (token_hint channel)
The surviving `token_hint` channel is a textbook online n-gram with strict lookup-then-update discipline: `p_t` depends only on artifact + `x_1...x_{t-1}`, and table lookups complete before any `x_t`-dependent update.

A.5 Data Prefetch
Double-buffered async prefetch
Background thread prepares next batch in pinned memory while GPU trains. Separate CUDA stream for H2D overlap.
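A minimal CPU-only sketch of the double-buffering pattern (the real loader additionally pins the host buffer and issues the H2D copy on a dedicated CUDA stream; class and method names here are illustrative):

```python
import queue
import threading

class DoubleBufferedLoader:
    """Background thread keeps up to two batches staged while the
    training loop drains them, overlapping data prep with compute."""

    def __init__(self, batch_iter, depth=2):
        self._q = queue.Queue(maxsize=depth)  # double buffer: two slots
        self._thread = threading.Thread(
            target=self._fill, args=(batch_iter,), daemon=True)
        self._thread.start()

    def _fill(self, batch_iter):
        for batch in batch_iter:
            self._q.put(batch)   # blocks while both slots are full
        self._q.put(None)        # sentinel: no more data

    def __iter__(self):
        while (batch := self._q.get()) is not None:
            yield batch

batches = list(DoubleBufferedLoader(iter(range(8))))
assert batches == list(range(8))   # order preserved, nothing dropped
```

The bounded queue is what makes it double-buffered rather than an unbounded readahead: the producer stalls once two batches are staged, so host memory stays constant while disk reads still overlap GPU compute.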
On the PR #1334 architecture: +39 steps, +0.7% throughput. The extra steps landed in a worse compression region (+40KB), so the net effect was actually harmful for that architecture. On PR #1394's `ShuffledSequenceLoader` with memmap, the data pipeline is already efficient enough that prefetch isn't the bottleneck.

A.6 ETLB (Eval-Time Logit Bias)
Algorithm and results
From PR #1399. Learns a vocab-sized bias vector via SGD on already-scored context tokens, carried across sliding windows; scoring uses `logits + bias`.

Result (seed 1234, double-loop config on torch 2.11): n-gram only 1.08152 → ETLB + n-gram 1.08132 (-0.00020). Not re-tested on the final triple-loop fused config.
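The ETLB update is plain SGD on the cross-entropy of tokens the window has already scored. A hedged NumPy sketch (the PR #1399 implementation, learning rate, and schedule may differ):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def etlb_step(bias, logits, target, lr=0.1):
    # gradient of -log softmax(logits + bias)[target] w.r.t. bias is (p - onehot)
    p = softmax(logits + bias)
    p[target] -= 1.0
    return bias - lr * p

rng = np.random.default_rng(0)
V = 16
logits, bias = rng.standard_normal(V), np.zeros(V)
nll_before = -np.log(softmax(logits + bias)[3])
bias = etlb_step(bias, logits, target=3)
nll_after = -np.log(softmax(logits + bias)[3])
assert nll_after < nll_before   # bias shifts mass toward recently scored tokens
```

Since the update only ever uses tokens that have already been scored, carrying the bias into the next window stays on the legal side of the score-before-update rule.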
Rejected: takes 615s, doesn't fit in 600s eval budget.
A.7 Setup & Reproduction
Full build instructions