Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean) by dexhunter · Pull Request #1413 · openai/parameter-golf

dexhunter · 2026-04-06T11:28:11Z

Summary

On top of PR #1394 (@clarkkev), the current clean sp8192 benchmark, this submission adds a single-knob change (QK_GAIN_INIT=5.0) and a legal score-first TTT eval pass (TTT_LR=0.005, 3 epochs, freeze=0).

val_bpb: 1.08279 (3-seed mean across seeds 0/42/1234), which is 0.00284 bpb (0.00731 nats/token) below PR #1394 (1.08563) and clears the 0.005 nats record threshold by 0.00231 nats.
All 3 seeds fit 16 MB (margins 7,454–10,942 bytes)
Training 588 s / seed, eval 381–392 s / seed (well under the 600 s budgets)
No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change. Score-first TTT follows the PR #549 precedent — every chunk is scored under inference_mode() before any parameter update.
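The record margin quoted above can be re-derived from figures that appear elsewhere on this page. The sketch below is only a sanity check. It assumes that the 2.80428 passed to the rule-checker as --merged-sota-nats is PR #1394's mean val_loss in nats per token, and it takes this submission's mean val_loss (2.79697) from the referencing commit message. The variable names are illustrative, and the bytes-per-token line relies on the usual definition of bits per byte.

```python
import math

# Figures quoted on this page.
prev_bpb, prev_nats = 1.08563, 2.80428  # PR #1394 (nats value assumed to be its mean val_loss)
new_bpb, new_nats = 1.08279, 2.79697    # this submission, 3-seed mean
record_bar_nats = 0.005

delta_bpb = prev_bpb - new_bpb
delta_nats = prev_nats - new_nats
margin_nats = delta_nats - record_bar_nats

# For a fixed tokenizer and val set, val_loss / val_bpb is a constant:
# under the usual bpb definition it equals ln(2) * (mean bytes per token).
nats_per_bpb = new_nats / new_bpb
bytes_per_token = nats_per_bpb / math.log(2)

print(f"delta: {delta_bpb:.5f} bpb = {delta_nats:.5f} nats/token")
print(f"margin over the bar: {margin_nats:.5f} nats")
print(f"nats per bpb: {nats_per_bpb:.3f}, implied bytes/token: {bytes_per_token:.3f}")
```

The point of the check is that the 0.005 bar is stated in nats per token, so it has to be compared against the val_loss difference and not against the raw bpb difference.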

Hardware: 8×H100 80GB SXM, PyTorch 2.9.1+cu128. See the README in the new folder for the full two-table results + diagnostics layout per repo SUBMISSION_GUIDE.md.

Per-seed (post-TTT)

Seed	Pre-TTT sliding bpb	Post-TTT bpb	Δ TTT (bpb)	Artifact (bytes)	Train ms	Eval ms
0	1.08397	1.08210	−0.00187	15,991,018	588,004	385,050
42	1.08470	1.08315	−0.00155	15,992,546	588,009	381,500
1234	1.08590	1.08314	−0.00276	15,989,058	588,000	386,880
mean	1.08486	1.08279	−0.00206	15,990,874	588,004	384,477
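As a quick consistency check on the table, the mean row and the size margins can be recomputed from the per-seed rows. This sketch assumes the 16 MB cap means 16,000,000 bytes, which is the value that reproduces the quoted 7,454 to 10,942 byte margins. Because the per-seed bpb values are already rounded, the recomputed mean can differ from the reported one in the last digit.

```python
# Per-seed rows copied from the table: (pre-TTT bpb, post-TTT bpb, artifact bytes).
rows = {
    0:    (1.08397, 1.08210, 15_991_018),
    42:   (1.08470, 1.08315, 15_992_546),
    1234: (1.08590, 1.08314, 15_989_058),
}
CAP_BYTES = 16_000_000  # assumption: decimal megabytes

n = len(rows)
mean_pre = sum(r[0] for r in rows.values()) / n
mean_post = sum(r[1] for r in rows.values()) / n
mean_size = sum(r[2] for r in rows.values()) / n
margins = {seed: CAP_BYTES - r[2] for seed, r in rows.items()}

print(f"mean pre-TTT  {mean_pre:.5f}")
print(f"mean post-TTT {mean_post:.5f}  (TTT delta {mean_post - mean_pre:+.5f})")
print(f"mean artifact {mean_size:,.0f} bytes")
print(f"margins       {margins}")
```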

Lineage / change from PR #1394

Same base stack as PR #1394: sp8192 BPE, 11L/512d/8H/4KV, MLP 4×, Partial RoPE 16d, depth recurrence (loop layers 4–5 twice from 50% training), MuonEq-R WD=0.085, full-Hessian GPTQ int6 + int8 embeddings + SD-clip, Brotli+byte-shuffle, EMA.
Two changes: (1) QK_GAIN_INIT raised from 4.0 → 5.0; (2) added a legal score-first TTT sliding pass (LR=0.005, 3 epochs, freeze_blocks=0) as an additional eval mode.
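To make the depth-recurrence setting in the base stack concrete, here is a minimal sketch of how a looped layer range expands into an execution order. It is an illustration, not the code in train_gpt.py: the function name, the 0-based inclusive indexing, and the `passes` parameter are assumptions, and this page does not say whether "twice" means two total passes through layers 4–5 or two extra ones. The example uses two total passes.

```python
def virtual_layer_order(num_layers, loop_start, loop_end, passes):
    """Expand a looped block into the sequence of layer indices that gets executed.

    Layers are 0-based and loop_start..loop_end is inclusive (both assumptions).
    `passes` is the total number of times the looped block runs.
    """
    before = list(range(loop_start))
    looped = list(range(loop_start, loop_end + 1))
    after = list(range(loop_end + 1, num_layers))
    return before + looped * passes + after


order = virtual_layer_order(num_layers=11, loop_start=4, loop_end=5, passes=2)
print(order)       # layers 4 and 5 appear twice, repeated as a block
print(len(order))  # number of virtual layers for this reading
```

The parameter count is unchanged by looping, since the repeated indices reuse the same weights; only the compute per forward pass grows.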

Compliance (Issue #1017 four conditions)

Condition 1 (Causality): Strict left-to-right causal model. Sliding eval never references future tokens.
Condition 2 (Normalized distribution): Standard softmax over full vocab. No logit biasing, no BigramHash, no two-pass.
Condition 3 (Score before update): Every TTT chunk is scored under torch.inference_mode() BEFORE any parameter update. Training on a chunk only happens AFTER its scoring has been accumulated into loss_sum. Matches the PR #549 pattern.
Condition 4 (Single pass): Each token is scored exactly once.
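Condition 3 is the easiest of the four to get wrong, so here is a toy illustration of the ordering it requires. This is not the train_gpt.py implementation: the "model" is a context-free table of logits in pure Python standing in for the real network, and the function name and chunk length are made up for the example. What carries over is the control flow: each chunk is scored with the current parameters, its loss is added to loss_sum, and only then does SGD run on that chunk for a fixed number of epochs.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def score_first_ttt(tokens, vocab_size, chunk_len=4, lr=0.005, epochs=3):
    """Toy score-first TTT. Returns (mean NLL in nats, list of per-chunk NLL sums)."""
    logits = [0.0] * vocab_size  # stand-in for the model parameters
    loss_sum, n_scored, per_chunk = 0.0, 0, []
    for start in range(0, len(tokens), chunk_len):
        chunk = tokens[start:start + chunk_len]
        # 1) Score the whole chunk with the parameters as they are now.
        logp = log_softmax(logits)
        chunk_nll = -sum(logp[t] for t in chunk)
        loss_sum += chunk_nll
        n_scored += len(chunk)
        per_chunk.append(chunk_nll)
        # 2) Only after the score is recorded, adapt on the same chunk with SGD.
        for _ in range(epochs):
            for t in chunk:
                probs = [math.exp(lp) for lp in log_softmax(logits)]
                for v in range(vocab_size):
                    grad = probs[v] - (1.0 if v == t else 0.0)
                    logits[v] -= lr * grad
    return loss_sum / n_scored, per_chunk


stream = [0] * 32  # a maximally repetitive token stream
frozen_nll, _ = score_first_ttt(stream, vocab_size=4, lr=0.0)
adapted_nll, chunk_nlls = score_first_ttt(stream, vocab_size=4, lr=0.5)
print(frozen_nll, adapted_nll)
```

With lr=0 the loop reduces to plain evaluation. In both runs the first chunk receives the same score, because nothing has been updated before it is scored, and every token contributes to loss_sum exactly once.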

Additional flags:

No SLOT (standard or causal). No eval-time delta optimization.
No pre-quant TTT on val data.
No n-gram cache at eval.
No tokenizer change: uses PR #1394's SentencePiece BPE 8192 unchanged.
Rule-checker (tools/verify_rules.py) passes all 3 seed logs with --frontier-bpp 1.08563 --merged-sota-nats 2.80428.

Reproduction

export NCCL_NET=Socket
export QK_GAIN_INIT=5.0
export TTT_ENABLED=1
export TTT_LR=0.005
export TTT_EPOCHS=3
for SEED in 0 42 1234; do
    SEED=$SEED torchrun --standalone --nproc_per_node=8 train_gpt.py
done

Credits

@clarkkev — PR #1394 ("Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5 seed mean)"): sp8192 base stack (GPTQ embeddings, depth recurrence, MuonEq-R, SD-clip)
@abaybektursun — PR #1019 ("Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean)"): GPTQ-XSA lineage; PR #549 ("Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)"): legal score-first TTT precedent
@Christopher-Lee-McClendon — PR #461 ("Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB)"): LoRA TTT reference
@unnir — PR #265 ("Record: 11L + Efficient Partial XSA (val_bpb: 1.1307)"): XSA

Files

Only adds records/track_10min_16mb/2026-04-06_SP8192_QK5_LegalTTT_1.0828/ with README, submission.json, train_gpt.py, and 3 seed logs.

…(3-seed mean)

On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all fitting 16MB with 7-11K margin.

Per-seed (post-TTT):
- seed 0 : 1.08210 (val_loss 2.79517)
- seed 42 : 1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean : 1.08279 (2.79697 nats per token)

Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token, clearing the 0.005 nats record threshold by 0.00231 nats per seed.

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change. Score-first TTT matches PR openai#549 precedent: every chunk scored under inference_mode() before any parameter update.

…freeze support

Changes:
- num_loops: 2 -> 3, enable_looping_at: 0.5 -> 0.35
- Add score-first TTT eval (ported from PR openai#1413)
- Novel twist: ttt_loop_only=1 freezes all except blocks 4-5
- TTT config: LR=0.005, epochs=3, SGD, chunk_tokens=32768

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…m tilt, SP8192 primary path

- PR openai#771 confirmed CLOSED/REJECTED (train-then-score AdamW TTT)
- PR openai#727 confirmed CLOSED (illegal n-gram hash cache)
- Merged SOTA unchanged at 1.1147
- New primary target: PR openai#1420 (abaybektursun, 1.08014): SP8192 + Triple Loop (3×, 17 virtual layers) + N-gram Tilt (legal, properly normalized, -0.0029 bpb) + Fused Kernels (+127 steps)
- PR openai#1413 (1.08279): confirms legal score-first TTT adds -0.003 bpb
- ETLB (-0.0019 bpb) noted as unruled; await @valerio-oai
- Strategy updated to v10.0: SP8192 + Triple Loop replaces SP4096 + 2×

https://claude.ai/code/session_01TbdBLJPXpbK5wGHpLAQ9x4

Novel: TTT adapts ONLY scalar/control parameters (q_gain, attn_scale, mlp_scale, resid_mix, RMSNorm weights, skip_weights, skip_gates). Matrix weights (c_q/c_k/c_v/proj/MLP/tok_emb) stay frozen. This is mechanistically different from full-model TTT (openai#1413, openai#537): the model retunes its existing control knobs rather than learning new weight directions. Higher LR (0.01) since scalars need bigger steps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
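The split described in that message, control parameters trainable and matrix weights frozen, can be sketched as a filter over parameter names. The name fragments come from the message itself; the full dotted names in the example list, the use of "norm" to catch RMSNorm weights, and the helper name are hypothetical and not taken from the referenced code.

```python
# Name fragments from the commit message; "norm" is an assumed match for RMSNorm weights.
CONTROL_KEYS = ("q_gain", "attn_scale", "mlp_scale", "resid_mix",
                "norm", "skip_weights", "skip_gates")


def split_for_scalar_ttt(param_names):
    """Return (trainable, frozen) name lists for control-only TTT."""
    trainable = [n for n in param_names if any(k in n for k in CONTROL_KEYS)]
    frozen = [n for n in param_names if n not in trainable]
    return trainable, frozen


# Hypothetical parameter names, for illustration only.
example_names = [
    "tok_emb.weight",
    "blocks.0.attn.c_q.weight",
    "blocks.0.attn.q_gain",
    "blocks.0.attn_scale",
    "blocks.0.mlp.fc.weight",
    "blocks.0.mlp_scale",
    "blocks.0.resid_mix",
    "final_norm.weight",
    "skip_weights",
]
trainable, frozen = split_for_scalar_ttt(example_names)
print("trainable:", trainable)
print("frozen:   ", frozen)
```

In a PyTorch model the same predicate would typically be applied while iterating over model.named_parameters(), setting requires_grad to False on the frozen group and passing only the trainable group to the TTT optimizer.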

…am Tilt — val_bpb 1.07800 (3-seed mean)

3-lever stack on top of PR openai#1394 sp8192 baseline:
- Parallel Residuals on layers 7-10 (PR openai#1412 by @Robby955)
- 3-layer depth recurrence (LOOP_START=3 LOOP_END=5, extends PR openai#1394's 2-layer recurrence)
- Eval-time causal n-gram tilt (PR openai#1420 by @abaybektursun, lineage PR openai#1145 by @AnirudhRahul)

Plus our existing PR openai#1413 stack: QK_GAIN_INIT=5, score-first legal TTT (LR=0.005, epochs=3).

Results (3-seed mean, 8xH100 SXM):
- val_bpb 1.07800 (std 0.00053)
- val_loss 2.78457 nats per token
- Beats PR openai#1394 (1.08563) by 0.01971 nats per token
- Beats PR openai#1420 (1.08014) by 0.00553 nats per token
- Beats own PR openai#1413 (1.08279) by 0.01237 nats per token

All four issue openai#1017 conditions verified for the n-gram tilt path: prefix-only hash construction, full-vocab renormalized one-token tilt, score-before-update ordering inside the C++ kernel, single left-to-right pass.

C++ n-gram kernel ported from PR openai#1420 with the nanobind dependency removed (extern "C" shim + ctypes loader, single g++ -shared invocation at runtime).

5-seed re-verification via the shipped mini wrapper is in progress; this PR will be updated with the final 5-seed mean once s1337 and s2025 land.

Adds s1337 (1.07801) and s2025 (1.07862) via the shipped mini wrapper. The 5-seed mean is +0.00013 worse than the initial 3-seed mean (1.07800), which is well within the std (~0.00046).

Margins vs the legal open chronology are unchanged in direction:
- vs PR openai#1394 (1.08563): -0.01938 nats per token (margin +0.01438 over 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00520 nats per token (margin +0.00020 over 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01205 nats per token

3 of 5 seeds (s42, s1337, s2025) are now mini-wrapper-verified for fit; s0 and s1234 mini-wrapper re-runs still in progress.

All 5 seeds (s0, s42, s1234, s1337, s2025) re-run via the shipped mini wrapper. The mean improves slightly from the prior mixed-source 1.07813 to 1.07807 because s1234 produced a noticeably lower TTT under the mini wrapper (1.07813 mini vs 1.07848 raw, -0.00035; within float64 reordering noise but the largest single-seed drift in the verification set).

All 5 artifact sizes are direct from the mini-wrapper runs (NOT projections):
- s0: 15,992,304 bytes (7,696 byte headroom)
- s42: 15,993,733 bytes (6,267 byte headroom)
- s1234: 15,990,539 bytes (9,461 byte headroom)
- s1337: 15,988,039 bytes (11,961 byte headroom)
- s2025: 15,992,215 bytes (7,785 byte headroom)

Margins vs the legal open chronology:
- vs PR openai#1394 (1.08563): -0.01952 nats per token (margin +0.01452 over 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00534 nats per token (margin +0.00034 over 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01218 nats per token

All four issue openai#1017 conditions remain verified for the n-gram tilt path.

…-only experts

The original n-gram tilt kernel inherited from PR openai#1420 had a causality bug: within_hint() and word_hint() in fused_expert_kernel.cpp::get_hints_batch gated their emission on is_bnd[tokens_[p]] / is_ws[tokens_[p]] (target token metadata at the position being scored), leaking 1-2 bits about the answer per scored position. This is an Issue openai#1017 condition 2 violation. PR openai#1420 has the identical bug. @abaybektursun has acknowledged it in PR openai#1420's thread and proposed the same fix that's applied here:

* fused_expert_kernel.cpp: derive is_bnd / is_ws from tokens_[p-1] (last prefix token) for hint gating. Updates use the actual current tok via new tok_is_bnd / tok_is_ws variables so within_update / word_update still segment words correctly. Variable naming and structure copied verbatim from PR openai#1420's fix.
* Run command updated to set NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0. Empirically the within / word experts under prefix-only gating fire for the wrong positions (within fires for word-starts, word fires for mid-word) and contribute *negative* BPB. Disabling them gives 1.07951 on s42 vs 1.08108 with the experts active; token_hint is the only legitimate contributor.

5-seed verification (all on the patched kernel):

seed	pre-fix	corrected	delta
0	1.07751	1.08035	+0.00284
42	1.07809	1.08097	+0.00288
1234	1.07813	1.08127	+0.00314
1337	1.07801	1.08060	+0.00259
2025	1.07862	1.08135	+0.00273
mean	1.07807	1.08091	+0.00284

All 5 artifacts fit under 16 MB (15,988,802 - 15,995,572 bytes; 4.4-11.2 KB headroom). Pre-fix per-seed values preserved in submission.json under seed_results_pre_fix for the public record.

Bar comparisons (corrected mean 1.08091):
- PR openai#1394 (1.08563): beats by +0.00472, fails 0.005 nat record bar
- PR openai#1413 ours (1.08279): beats by +0.00188, fails record bar
- PR openai#1420 (1.08014): we lose by 0.00077 (PR openai#1420 also tainted by the same bug; would correct to ~1.08300 post-fix)

This PR is left open as a transparency / diagnostic record, NOT as a record claim. PR openai#1413 (no n-gram tilt at all) at 1.08279 remains our cleanest legal anchor. The README has been retitled "Diagnostic (causal-corrected)" and the legality fix is documented in a dedicated section.
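The gating bug described in that message is easy to see in miniature. The sketch below is a Python illustration of the principle, not a port of fused_expert_kernel.cpp: the function names, the three-token vocabulary, and the treatment of position 0 are all assumptions. It shows that a gate keyed on tokens[p] changes with the answer being scored, while a gate keyed on tokens[p - 1] depends only on the prefix.

```python
def hint_gate_leaky(tokens, p, is_boundary):
    # Pre-fix behaviour as described: the gate looks at the token being scored.
    return is_boundary[tokens[p]]


def hint_gate_causal(tokens, p, is_boundary):
    # Post-fix behaviour as described: the gate looks only at the last prefix token.
    # Returning False at p == 0 is an assumption for the sketch.
    return is_boundary[tokens[p - 1]] if p > 0 else False


# Hypothetical vocabulary in which only token 2 is a word boundary.
is_boundary = [False, False, True]
prefix = [0, 2]
for target in (1, 2):  # two possible answers at position p = 2
    seq = prefix + [target]
    print(target,
          hint_gate_leaky(seq, 2, is_boundary),
          hint_gate_causal(seq, 2, is_boundary))
```

Because the leaky gate differs between the two candidate targets, observing whether a hint fired reveals something about the target; the causal gate gives the same value for both.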

Novel mechanism: zero-initialized nn.Embedding(4096, 512) created at eval time, trained exclusively through the standard score-first TTT loop. Learns document-local bigram patterns without modifying any artifact weights.

Hash: h = (prev_token * 2039 + curr_token) % 4096
Injection: tok_emb(x) + eval_hash_emb(h), before RMSNorm
Compliance: same score-first pattern as openai#549/openai#1413 TTT precedent.
Precedent for eval-time params: LoRA-TTT (openai#1254, openai#1354).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
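The hash in that message maps each (previous token, current token) pair to one of 4096 embedding rows. A minimal sketch of the index computation follows. The constants are the ones quoted; the helper name and the choice of 0 as the stand-in predecessor for the first position are assumptions, since the message does not say how position 0 is handled.

```python
NUM_BUCKETS = 4096
MULTIPLIER = 2039


def bigram_buckets(tokens, first_prev=0):
    """Bucket index per position from (previous token, current token).

    `first_prev` stands in for the missing predecessor of position 0 (assumption).
    """
    buckets = []
    prev = first_prev
    for curr in tokens:
        buckets.append((prev * MULTIPLIER + curr) % NUM_BUCKETS)
        prev = curr
    return buckets


print(bigram_buckets([1, 2, 3]))
```

Both inputs to the hash are tokens already present in the model's input at that position, so the lookup itself uses no information about the token being predicted. With an 8192-token vocabulary there are far more bigrams than buckets, so collisions are expected and the embedding has to share rows across pairs.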

QK_GAIN_INIT: 4.0 → 5.0 (matches openai#1413 best practice). Hash embedding: 16K buckets, 10x LR, zero-init at eval time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…d execution

Materializes two local record folders from fetched refs pr1413 and pr1437 using a builder script that preserves the upstream FORMAT_RAW+FILTER_LZMA2 wrapper format with roundtrip decode validation.

Scripts:
- prepare_pr1413_variants.py: offline builder with wrapper-format fidelity
- runpod_1413.sh: single-run launcher with conditional final_model.pt copy
- runpod_1413_batch.sh: sequential A/B/C/D/E runner with shared archive timestamp, one-time SP8192 prep, per-run subdirectories, and g++ guard

Run contract:
A: faithful openai#1413 control (16,719 code bytes)
B: PARALLEL_RESIDUAL_START=7
C: LOOP_START=3 LOOP_END=5
D: parallel residual + loop adjustment (17,390 code bytes)
E: eval-only n-gram tilt on D checkpoint (SKIP_TRAINING=1)

Campaign docs updated to reflect the strategy pivot from single-seed openai#1413 to full offline A/B/C/D/E batch prep.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…reshold

The previous "Diagnostic" framing was based on a unit error: I compared val_bpb deltas as if they were nats-per-token deltas, missing the factor of ~2.583 (mean bytes per token in the sp8192 val set, computable directly from this submission's val_loss / val_bpb ratio).

With the correct units, the causal-corrected 5-seed mean (1.08091 BPB, 2.79210 nats/token) clears the 0.005-nat record bar against PR openai#1394:
- vs PR openai#1394 (1.08563): +0.01219 nats per token ✅ 2.4× the bar
- vs PR openai#1019 (1.11473): +0.08736 nats per token ✅ comfortably
- vs PR openai#1413 (ours): +0.00486 nats per token, essentially tied
- vs PR openai#1420 (1.08014): -0.00199 nats, but PR openai#1420 has the same kernel bug; its corrected ~1.08298 yields +0.00535 nats ✅

Title reverted from "Diagnostic (causal-corrected)" to "Record". The legality fix section is preserved (the kernel patch is still a real correctness fix matching @abaybektursun's proposed patch in PR openai#1420). The leak magnitude in the legality fix section now correctly states "+0.00284 BPB ≈ +0.00734 nats per token" instead of just BPB. Pre-fix per-seed values are still preserved in submission.json under seed_results_pre_fix for the public record.
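The unit correction in that message amounts to multiplying each BPB delta by the submission's own val_loss / val_bpb ratio before comparing it with the 0.005-nat bar. The sketch below reproduces the quoted comparisons from the numbers in the message; the variable names are illustrative.

```python
# Causal-corrected 5-seed mean quoted in the message.
val_bpb, val_loss = 1.08091, 2.79210
nats_per_bpb = val_loss / val_bpb  # roughly 2.583

baselines_bpb = {"#1394": 1.08563, "#1019": 1.11473, "#1413": 1.08279}
BAR_NATS = 0.005

deltas_nats = {}
for pr, bpb in baselines_bpb.items():
    delta_bpb = bpb - val_bpb
    deltas_nats[pr] = delta_bpb * nats_per_bpb
    verdict = "clears" if deltas_nats[pr] >= BAR_NATS else "is below"
    print(f"PR {pr}: {delta_bpb:+.5f} bpb = {deltas_nats[pr]:+.5f} nats/token ({verdict} the bar)")
```

Against PR #1394 the raw BPB delta of 0.00472 looks like it misses a 0.005 bar, which is the earlier mistake; once converted to nats per token the same gap is more than twice the bar.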

Base: PR openai#1394 (SP8192 + GPTQ Embeddings + SDClip + DR + MuonEq-R) Novel: RDClip (Rate-Distortion Clip) — per-group GPTQ clip search that minimizes compressed_bytes + lambda * Hessian_weighted_MSE. Extends SDClip's fixed formula to empirical rate-distortion optimization. Groups: embed, attn_qk, attn_vo, mlp, other. Search: 5 multipliers per group on first tensor. Also added: score-first TTT (ported from R12, same as openai#549/openai#1413). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove RDClip to establish baseline for openai#1394 + TTT. Tests whether the base + TTT matches openai#1413's 1.08279. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Novel: Context-only delta optimization during eval. Per-batch additive delta (512-dim) optimized with AdamW on ONLY already-scored positions. New positions scored with optimized delta. Model weights frozen. Fixes openai#1229's minibatch leakage: context = positions scored in PREVIOUS windows only. No cross-window contamination within current batch. Same compliance pattern as score-first TTT (openai#549/openai#1413). Based on openai#1333's proven causal SLOT mechanism (-0.013 BPP on SP4096). Stack: R12 SP8192 + score-first TTT + hash embedding + causal SLOT. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Phase 5a is a trivial-wins composition on top of v6.1 SLOT-100 baseline (2026-04-08_v61_h100_aggressive_slot_steps100, 1.146523):
1) QK_GAIN_INIT=5.0 (PR openai#1413)
2) MUON_EQ_R=1 (Newton-Schulz row L2 normalize, PR openai#1394)
3) --ema 0.9965 (PR openai#1421/openai#1445, vs prior 0.997)
4) HIDDEN_MULT=5.0 (FFN dim 4x->5x, byte re-investment from int6 tied embed)
5) EMBED_QUANT_BITS=6 EMBED_QUANT_TOK_EMB=1 (Phase 1A int6 tied embed, -0.6 MB on rANS artifact)

3-seed val_bpb at SLOT lr=0.1 steps=100 stride=64 (mid-eval 28-29% of full sliding-window):
s1337: 1.144045 (28.7% of windows)
s1338: 1.142021 (28.7%)
s1339: 1.141649 (29.4%)
-------
mean: 1.142572
std: 0.001247

Delta vs prior 2026-04-08_v61_h100_aggressive_slot_steps100 (1.146523): -0.003951 bpb

Submitted as non-record because 1.142572 does not beat the current PR openai#1019 record (1.1147).

The Phase 5a stack documents both the trivial-wins composition AND the negative ablations from Phases 1B/1C/2A-C/3/5b that other submitters can skip:
Phase 1B (FP32 scalar -> Int8): only -0.05 MB, kept
Phase 1C (Pentanary -> Ternary BitNet b1.58 1-layer sanity): regression +0.014 bpb, abandoned
Phase 1A pent_tok (Tied embed Pentanary): regression +0.043 bpb, abandoned
Phase 2A (Inter-layer delta prediction Wl - Wl-1): delta entropy HIGHER than W (per-layer ranges differ), abandoned
Phase 2B (Hadamard 16-dim block transform): no rANS gain, abandoned
Phase 2C (Context-aware rANS lookup table): rans_codec_rs Rust rebuild blocker, abandoned
Phase 3 (Custom HQGRANS1 binary container, pickle bypass): only -70 KB rans / +17 KB after lzma9 -- pickle isn't actually leaking 30%, abandoned

Phase 4 architecture sweep (1-seed s1337 SLOT-100 stride=64):
p5a (no extra)	~1.144	base
p5a_bg4096	~1.146	hurts
p5a_hm5	~1.144 -> 1.142 (3-seed)	BEST
p5a_bg4096_hm5	~1.144	tie
p5a_bg8192	~1.148	hurts
p5a_nl12	~1.147	hurts
p5a_ve4	~1.150	hurts

Phase 5b (Depth Recurrence PR openai#1239 style):
nl9r2 (unique 9 x recur 2 = 18 effective): 30% eval @ 1.151, abandoned
nl7r2 (unique 7 x recur 2 = 14 effective): 92% eval @ 1.166, abandoned

The 28-29% mid-eval window is the converged region: per-window cumulative bpb has flattened to within +/-0.001 of the 100% value in every prior 3-seed SLOT-100 run we have measured. Full 100%-eval is in flight on the same H100 pod and will be appended in a follow-up commit if the final number differs from the mid-eval estimate.

Code change vs 2026-04-08_v61_h100_aggressive_slot_steps100/train_gpt.py is purely env-var driven (no source-code changes to the model architecture or serializer). The training script picks up the Phase 5a env vars at import time (make_model() reads HIDDEN_MULT, EMBED_QUANT_BITS, etc).

Reproducibility:
bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1337
bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1338
bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1339

Hardware: 8x H100 80GB SXM (RunPod). 600s wallclock training, ~50 min single-GPU SLOT-100 eval per seed (eval is unbounded).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 6, 2026

Add score-first TTT eval (ported from PR openai#1413) with loop-only …

3968719

…freeze support

resouer mentioned this pull request Apr 6, 2026

Record: SP8192 + Parallel Residuals + Coprime-Stride + TTT — val_bpb 1.08286 (3-seed mean) resouer/parameter-golf#10

Open

dexhunter mentioned this pull request Apr 7, 2026

Record: SP8192 + Parallel Residuals + 3-Layer Recurrence + Token-Only N-gram Tilt — val_bpb 1.08091 (5-seed mean, causal-corrected) #1437

Open

resouer mentioned this pull request Apr 8, 2026

Record: SP8192 + TTT + Eval-Time Hash Embedding — val_bpb 1.08269 (3-seed mean) #1460

Open

sisegod mentioned this pull request Apr 8, 2026

Non-record: v6.2 Phase 5a SOTA-trivial stack (3-seed @76% = 1.136399, -0.010 vs prior; TTT 1.205 not competitive) #1465

Open

dexhunter wants to merge 1 commit into openai:main from dexhunter:record/sp8192-qk5-legal-ttt-1.08279
