Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)#1413
Open
dexhunter wants to merge 1 commit into openai:main from
Conversation
…(3-seed mean)

On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all fitting 16 MB with 7-11 KB margin.

Per-seed (post-TTT):
- seed 0:    1.08210 (val_loss 2.79517)
- seed 42:   1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean:      1.08279 (2.79697 nats per token)

Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token, clearing the 0.005-nat record threshold by 0.00231 nats per seed. No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change. Score-first TTT matches the PR openai#549 precedent: every chunk is scored under inference_mode() before any parameter update.
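The score-first ordering can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not the submission's train_gpt.py: the function name, chunk iterable, and loss bookkeeping are hypothetical; the SGD optimizer and LR/epoch defaults follow the flags quoted above.

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=0.005, epochs=3):
    # Score-first legal TTT: each chunk is scored under inference_mode()
    # BEFORE any parameter update, so no score ever reflects training
    # on that chunk's own targets.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_sum, tok_sum = 0.0, 0
    for inputs, targets in chunks:
        with torch.inference_mode():          # 1) score with frozen weights
            logits = model(inputs)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.view(-1))
        loss_sum += loss.item() * targets.numel()
        tok_sum += targets.numel()
        for _ in range(epochs):               # 2) only now adapt on the chunk
            opt.zero_grad()
            logits = model(inputs)
            F.cross_entropy(logits.view(-1, logits.size(-1)),
                            targets.view(-1)).backward()
            opt.step()
    return loss_sum / tok_sum                 # mean nats per token
```

Because scoring is accumulated before the update, later chunks benefit from adaptation on earlier chunks while each chunk's own score stays untouched.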
resouer pushed a commit to resouer/parameter-golf that referenced this pull request (Apr 6, 2026)
Changes:
- num_loops: 2 -> 3; enable_looping_at: 0.5 -> 0.35
- Add score-first TTT eval (ported from PR openai#1413)
- Novel twist: ttt_loop_only=1 freezes all except blocks 4-5
- TTT config: LR=0.005, epochs=3, SGD, chunk_tokens=32768

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request (Apr 6, 2026)
…m tilt, SP8192 primary path

- PR openai#771 confirmed CLOSED/REJECTED (train-then-score AdamW TTT)
- PR openai#727 confirmed CLOSED (illegal n-gram hash cache)
- Merged SOTA unchanged at 1.1147
- New primary target: PR openai#1420 (@abaybektursun, 1.08014): SP8192 + Triple Loop (3×, 17 virtual layers) + N-gram Tilt (legal, properly normalized, -0.0029 bpb) + Fused Kernels (+127 steps)
- PR openai#1413 (1.08279): confirms legal score-first TTT adds -0.003 bpb
- ETLB (-0.0019 bpb) noted as unruled — await @valerio-oai
- Strategy updated to v10.0: SP8192 + Triple Loop replaces SP4096 + 2×

https://claude.ai/code/session_01TbdBLJPXpbK5wGHpLAQ9x4
resouer pushed a commit to resouer/parameter-golf that referenced this pull request (Apr 7, 2026)
Novel: TTT adapts ONLY scalar/control parameters (q_gain, attn_scale, mlp_scale, resid_mix, RMSNorm weights, skip_weights, skip_gates). Matrix weights (c_q/c_k/c_v/proj/MLP/tok_emb) stay frozen. This is mechanistically different from full-model TTT (openai#1413, openai#537): the model retunes its existing control knobs rather than learning new weight directions. Higher LR (0.01) since scalars need bigger steps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
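A minimal sketch of this scalar-only freezing. The real selection is presumably by parameter name (q_gain, attn_scale, etc.); the dimensionality heuristic below, and the function name, are assumptions used purely for illustration.

```python
import torch

def freeze_all_but_scalars(model):
    # Keep only scalar/vector "control knob" parameters trainable
    # (gains, scales, norm weights, gates); freeze all matrix weights.
    trainable = []
    for name, p in model.named_parameters():
        is_control = p.dim() <= 1     # heuristic stand-in for name matching
        p.requires_grad_(is_control)
        if is_control:
            trainable.append(p)
    return trainable
```

A TTT optimizer would then be built only over the returned list, e.g. `torch.optim.SGD(trainable, lr=0.01)`, matching the higher scalar LR noted above.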
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request (Apr 7, 2026)
…am Tilt — val_bpb 1.07800 (3-seed mean)

3-lever stack on top of the PR openai#1394 sp8192 baseline:
- Parallel Residuals on layers 7-10 (PR openai#1412 by @Robby955)
- 3-layer depth recurrence (LOOP_START=3, LOOP_END=5; extends PR openai#1394's 2-layer recurrence)
- Eval-time causal n-gram tilt (PR openai#1420 by @abaybektursun; lineage PR openai#1145 by @AnirudhRahul)

Plus our existing PR openai#1413 stack: QK_GAIN_INIT=5, score-first legal TTT (LR=0.005, epochs=3).

Results (3-seed mean, 8xH100 SXM):
- val_bpb 1.07800 (std 0.00053)
- val_loss 2.78457 nats per token
- Beats PR openai#1394 (1.08563) by 0.01971 nats per token
- Beats PR openai#1420 (1.08014) by 0.00553 nats per token
- Beats own PR openai#1413 (1.08279) by 0.01237 nats per token

All four Issue openai#1017 conditions verified for the n-gram tilt path: prefix-only hash construction, full-vocab renormalized one-token tilt, score-before-update ordering inside the C++ kernel, single left-to-right pass.

C++ n-gram kernel ported from PR openai#1420 with the nanobind dependency removed (extern "C" shim + ctypes loader, single g++ -shared invocation at runtime).

5-seed re-verification via the shipped mini wrapper is in progress; this PR will be updated with the final 5-seed mean once s1337 and s2025 land.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request (Apr 7, 2026)
Adds s1337 (1.07801) and s2025 (1.07862) via the shipped mini wrapper. The 5-seed mean is +0.00013 worse than the initial 3-seed mean (1.07800), well within the std (~0.00046). Margins vs the legal open chronology are unchanged in direction:
- vs PR openai#1394 (1.08563): -0.01938 nats per token (margin +0.01438 over the 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00520 nats per token (margin +0.00020 over the 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01205 nats per token

3 of 5 seeds (s42, s1337, s2025) are now mini-wrapper-verified for fit; s0 and s1234 mini-wrapper re-runs are still in progress.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request (Apr 7, 2026)
All 5 seeds (s0, s42, s1234, s1337, s2025) re-run via the shipped mini wrapper. The mean improves slightly from the prior mixed-source 1.07813 to 1.07807 because s1234 produced a noticeably lower TTT value under the mini wrapper (1.07813 mini vs 1.07848 raw, -0.00035 — within float64 reordering noise, but the largest single-seed drift in the verification set).

All 5 artifact sizes are direct from the mini-wrapper runs (NOT projections):
- s0:    15,992,304 bytes (7,696-byte headroom)
- s42:   15,993,733 bytes (6,267-byte headroom)
- s1234: 15,990,539 bytes (9,461-byte headroom)
- s1337: 15,988,039 bytes (11,961-byte headroom)
- s2025: 15,992,215 bytes (7,785-byte headroom)

Margins vs the legal open chronology:
- vs PR openai#1394 (1.08563): -0.01952 nats per token (margin +0.01452 over the 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00534 nats per token (margin +0.00034 over the 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01218 nats per token

All four Issue openai#1017 conditions remain verified for the n-gram tilt path.
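The quoted size/headroom pairs imply a 16,000,000-byte cap (decimal 16 MB); that constant is inferred from the numbers above, not quoted anywhere. A quick consistency check:

```python
# Artifact sizes quoted above, checked against a 16,000,000-byte cap
# (the cap value is inferred from the size + headroom pairs).
LIMIT = 16_000_000

sizes = {"s0": 15_992_304, "s42": 15_993_733, "s1234": 15_990_539,
         "s1337": 15_988_039, "s2025": 15_992_215}

headroom = {seed: LIMIT - size for seed, size in sizes.items()}
assert all(h > 0 for h in headroom.values())   # every artifact fits
print(headroom["s1337"])                       # 11961, matching the list above
```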
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request (Apr 7, 2026)
…-only experts

The original n-gram tilt kernel inherited from PR openai#1420 had a causality bug: within_hint() and word_hint() in fused_expert_kernel.cpp::get_hints_batch gated their emission on is_bnd[tokens_[p]] / is_ws[tokens_[p]] (target-token metadata at the position being scored), leaking 1-2 bits about the answer per scored position. This violates Issue openai#1017 condition 2. PR openai#1420 has the identical bug; @abaybektursun has acknowledged it in PR openai#1420's thread and proposed the same fix that is applied here:

* fused_expert_kernel.cpp: derive is_bnd / is_ws from tokens_[p-1] (the last prefix token) for hint gating. Updates use the actual current token via new tok_is_bnd / tok_is_ws variables so within_update / word_update still segment words correctly. Variable naming and structure copied verbatim from PR openai#1420's fix.
* Run command updated to set NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0. Empirically, the within / word experts under prefix-only gating fire for the wrong positions (within fires on word starts, word fires mid-word) and contribute *negative* BPB. Disabling them gives 1.07951 on s42 vs 1.08108 with the experts active — token_hint is the only legitimate contributor.

5-seed verification (all on the patched kernel):

seed   pre-fix   corrected   delta
0      1.07751   1.08035     +0.00284
42     1.07809   1.08097     +0.00288
1234   1.07813   1.08127     +0.00314
1337   1.07801   1.08060     +0.00259
2025   1.07862   1.08135     +0.00273
mean   1.07807   1.08091     +0.00284

All 5 artifacts fit under 16 MB (15,988,802 - 15,995,572 bytes; 4.4-11.2 KB headroom). Pre-fix per-seed values are preserved in submission.json under seed_results_pre_fix for the public record.

Bar comparisons (corrected mean 1.08091):
- PR openai#1394 (1.08563): beats by +0.00472, fails the 0.005-nat record bar
- PR openai#1413, ours (1.08279): beats by +0.00188, fails the record bar
- PR openai#1420 (1.08014): we lose by 0.00077 (PR openai#1420 is also tainted by the same bug; it would correct to ~1.08300 post-fix)

This PR is left open as a transparency / diagnostic record, NOT as a record claim. PR openai#1413 (no n-gram tilt at all) at 1.08279 remains our cleanest legal anchor. The README has been retitled "Diagnostic (causal-corrected)" and the legality fix is documented in a dedicated section.
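The before/after gating logic can be illustrated with a toy Python port of the C++ hint gate. This is a hypothetical simplification of fused_expert_kernel.cpp (the real code operates on token-metadata arrays inside get_hints_batch); the function names here are invented.

```python
def gate_buggy(tokens, p, is_bnd):
    # BUGGY: reads metadata of the token being scored at position p,
    # leaking information about the answer into the hint.
    return bool(is_bnd[tokens[p]])

def gate_fixed(tokens, p, is_bnd):
    # FIXED (prefix-only): the gate depends only on tokens[p-1],
    # the last token the model is allowed to see when scoring p.
    return p > 0 and bool(is_bnd[tokens[p - 1]])
```

With a boundary token at position 1, the buggy gate fires at p=1 (it peeks at the target) while the fixed gate fires only at p=2, once the boundary is part of the prefix.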
resouer pushed a commit to resouer/parameter-golf that referenced this pull request (Apr 7, 2026)
Novel mechanism: a zero-initialized nn.Embedding(4096, 512) created at eval time, trained exclusively through the standard score-first TTT loop. It learns document-local bigram patterns without modifying any artifact weights.

Hash: h = (prev_token * 2039 + curr_token) % 4096
Injection: tok_emb(x) + eval_hash_emb(h), before RMSNorm

Compliance: same score-first pattern as the openai#549/openai#1413 TTT precedent. Precedent for eval-time params: LoRA-TTT (openai#1254, openai#1354).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
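A sketch of the mechanism under the stated hash and shapes. The class name is illustrative, and the handling of position 0 (hashing with prev=0) is an assumption the commit does not specify.

```python
import torch
import torch.nn as nn

class EvalHashEmbedding(nn.Module):
    # Zero-initialized at eval time: contributes nothing until the
    # score-first TTT loop trains it on document-local bigram statistics.
    def __init__(self, buckets=4096, dim=512):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, dim)
        nn.init.zeros_(self.emb.weight)

    def forward(self, tokens):
        # h = (prev_token * 2039 + curr_token) % buckets
        prev = torch.roll(tokens, 1, dims=-1)
        prev[..., 0] = 0                  # assumed: no previous token at pos 0
        h = (prev * 2039 + tokens) % self.buckets
        return self.emb(h)               # added to tok_emb(x), before RMSNorm
```

Because only `prev` and the current token feed the hash, the injection stays causal, and zero-init guarantees the artifact's scores are unchanged before TTT begins.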
resouer pushed a commit to resouer/parameter-golf that referenced this pull request (Apr 7, 2026)
QK_GAIN_INIT: 4.0 → 5.0 (matches openai#1413 best practice). Hash embedding: 16K buckets, 10x LR, zero-init at eval time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
amrayach added a commit to amrayach/parameter-golf that referenced this pull request (Apr 7, 2026)
…d execution

Materializes two local record folders from fetched refs pr1413 and pr1437 using a builder script that preserves the upstream FORMAT_RAW+FILTER_LZMA2 wrapper format with roundtrip decode validation.

Scripts:
- prepare_pr1413_variants.py: offline builder with wrapper-format fidelity
- runpod_1413.sh: single-run launcher with conditional final_model.pt copy
- runpod_1413_batch.sh: sequential A/B/C/D/E runner with shared archive timestamp, one-time SP8192 prep, per-run subdirectories, and g++ guard

Run contract:
- A: faithful openai#1413 control (16,719 code bytes)
- B: PARALLEL_RESIDUAL_START=7
- C: LOOP_START=3 LOOP_END=5
- D: parallel residual + loop adjustment (17,390 code bytes)
- E: eval-only n-gram tilt on D checkpoint (SKIP_TRAINING=1)

Campaign docs updated to reflect the strategy pivot from single-seed openai#1413 to full offline A/B/C/D/E batch prep.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request (Apr 7, 2026)
…reshold

The previous "Diagnostic" framing was based on a unit error: I compared val_bpb deltas as if they were nats-per-token deltas, missing the factor of ~2.583 (≈ ln 2 × the sp8192 val set's mean bytes per token, computable directly from this submission's val_loss / val_bpb ratio). With the correct units, the causal-corrected 5-seed mean (1.08091 BPB, 2.79210 nats/token) clears the 0.005-nat record bar against PR openai#1394:

- vs PR openai#1394 (1.08563): +0.01219 nats per token ✅ 2.4× the bar
- vs PR openai#1019 (1.11473): +0.08736 nats per token ✅ comfortably
- vs PR openai#1413 (ours): +0.00486 nats per token — essentially tied
- vs PR openai#1420 (1.08014): -0.00199 nats — but PR openai#1420 has the same kernel bug; its corrected ~1.08298 yields +0.00535 nats ✅

Title reverted from "Diagnostic (causal-corrected)" to "Record". The legality fix section is preserved (the kernel patch is still a real correctness fix matching @abaybektursun's proposed patch in PR openai#1420). The leak magnitude in the legality fix section now correctly states "+0.00284 BPB ≈ +0.00734 nats per token" instead of just BPB. Pre-fix per-seed values are still preserved in submission.json under seed_results_pre_fix for the public record.
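The unit conversion is worth making explicit. Figures are taken from this commit; the decomposition of the ~2.583 factor assumes the usual convention that val_bpb is bits per byte and val_loss is nats per token.

```python
import math

val_bpb  = 1.08091    # causal-corrected 5-seed mean, bits per byte
val_nats = 2.79210    # same mean in nats per token

ratio = val_nats / val_bpb             # nats-per-token per unit of bpb
print(round(ratio, 3))                 # 2.583

# Under bpb = bits per byte, the factor is ln(2) * mean bytes per token:
print(round(ratio / math.log(2), 3))   # ~3.727 bytes/token on the val set

# Record-bar check vs PR openai#1394 (1.08563 bpb):
print(round((1.08563 - val_bpb) * ratio, 5))   # 0.01219 nats, > 0.005 bar
```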
resouer pushed a commit to resouer/parameter-golf that referenced this pull request (Apr 7, 2026)
Base: PR openai#1394 (SP8192 + GPTQ Embeddings + SDClip + DR + MuonEq-R)

Novel: RDClip (Rate-Distortion Clip) — per-group GPTQ clip search that minimizes compressed_bytes + lambda * Hessian_weighted_MSE. Extends SDClip's fixed formula to empirical rate-distortion optimization. Groups: embed, attn_qk, attn_vo, mlp, other. Search: 5 multipliers per group on the first tensor.

Also added: score-first TTT (ported from R12, same as openai#549/openai#1413).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
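The rate-distortion objective can be sketched as follows. This is a stand-in using symmetric int8 quantization and LZMA in place of the submission's GPTQ/rANS pipeline; the function name, multiplier grid, and lambda value are hypothetical.

```python
import lzma
import numpy as np

def rdclip_search(w, hess_diag, lam=1e-4,
                  multipliers=(0.6, 0.8, 1.0, 1.2, 1.4)):
    # For each candidate clip multiplier, score
    #   compressed_bytes + lam * Hessian-weighted MSE
    # and return the multiplier with the lowest score.
    best = None
    for m in multipliers:
        clip = m * np.abs(w).max()
        scale = clip / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        deq = q.astype(np.float32) * scale
        rate = len(lzma.compress(q.tobytes()))            # compressed bytes
        dist = float(np.sum(hess_diag * (w - deq) ** 2))  # weighted MSE
        score = rate + lam * dist
        if best is None or score < best[0]:
            best = (score, m)
    return best[1]
```

Tighter clips cost distortion but shrink the coded range (fewer bytes); the search trades the two empirically per group instead of using a fixed clip formula.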
resouer pushed a commit to resouer/parameter-golf that referenced this pull request (Apr 8, 2026)
Remove RDClip to establish a baseline for openai#1394 + TTT. Tests whether the base + TTT matches openai#1413's 1.08279.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer pushed a commit to resouer/parameter-golf that referenced this pull request (Apr 8, 2026)
Novel: Context-only delta optimization during eval. A per-batch additive delta (512-dim) is optimized with AdamW on ONLY already-scored positions; new positions are then scored with the optimized delta. Model weights stay frozen.

Fixes openai#1229's minibatch leakage: context = positions scored in PREVIOUS windows only, with no cross-window contamination within the current batch. Same compliance pattern as score-first TTT (openai#549/openai#1413). Based on openai#1333's proven causal SLOT mechanism (-0.013 BPP on SP4096).

Stack: R12 SP8192 + score-first TTT + hash embedding + causal SLOT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sisegod added a commit to sisegod/parameter-golf that referenced this pull request (Apr 8, 2026)
Phase 5a is a trivial-wins composition on top of the v6.1 SLOT-100 baseline (2026-04-08_v61_h100_aggressive_slot_steps100, 1.146523):

1) QK_GAIN_INIT=5.0 (PR openai#1413)
2) MUON_EQ_R=1 (Newton-Schulz row L2 normalize, PR openai#1394)
3) --ema 0.9965 (PR openai#1421/openai#1445, vs prior 0.997)
4) HIDDEN_MULT=5.0 (FFN dim 4x -> 5x, byte re-investment from int6 tied embed)
5) EMBED_QUANT_BITS=6 EMBED_QUANT_TOK_EMB=1 (Phase 1A int6 tied embed, -0.6 MB on rANS artifact)

3-seed val_bpb at SLOT lr=0.1 steps=100 stride=64 (mid-eval 28-29% of the full sliding window):

s1337: 1.144045 (28.7% of windows)
s1338: 1.142021 (28.7%)
s1339: 1.141649 (29.4%)
mean:  1.142572
std:   0.001247

Delta vs prior 2026-04-08_v61_h100_aggressive_slot_steps100 (1.146523): -0.003951 bpb.

Submitted as non-record because 1.142572 does not beat the current PR openai#1019 record (1.1147). The Phase 5a stack documents both the trivial-wins composition AND the negative ablations from Phases 1B/1C/2A-C/3/5b that other submitters can skip:

- Phase 1B (FP32 scalar -> Int8): only -0.05 MB, kept
- Phase 1C (Pentanary -> Ternary BitNet b1.58 1-layer sanity): regression +0.014 bpb, abandoned
- Phase 1A pent_tok (tied-embed Pentanary): regression +0.043 bpb, abandoned
- Phase 2A (inter-layer delta prediction Wl - Wl-1): delta entropy HIGHER than W (per-layer ranges differ), abandoned
- Phase 2B (Hadamard 16-dim block transform): no rANS gain, abandoned
- Phase 2C (context-aware rANS lookup table): rans_codec_rs Rust rebuild blocker, abandoned
- Phase 3 (custom HQGRANS1 binary container, pickle bypass): only -70 KB rANS / +17 KB after lzma9 — pickle isn't actually leaking 30%, abandoned

Phase 4 architecture sweep (1-seed s1337 SLOT-100 stride=64):

p5a (no extra)   ~1.144  base
p5a_bg4096       ~1.146  hurts
p5a_hm5          ~1.144 -> 1.142 (3-seed)  BEST
p5a_bg4096_hm5   ~1.144  tie
p5a_bg8192       ~1.148  hurts
p5a_nl12         ~1.147  hurts
p5a_ve4          ~1.150  hurts

Phase 5b (Depth Recurrence, PR openai#1239 style):
- nl9r2 (unique 9 x recur 2 = 18 effective): 30% eval @ 1.151, abandoned
- nl7r2 (unique 7 x recur 2 = 14 effective): 92% eval @ 1.166, abandoned

The 28-29% mid-eval window is the converged region: per-window cumulative bpb has flattened to within +/-0.001 of the 100% value in every prior 3-seed SLOT-100 run we have measured. Full 100% eval is in flight on the same H100 pod and will be appended in a follow-up commit if the final number differs from the mid-eval estimate.

The code change vs 2026-04-08_v61_h100_aggressive_slot_steps100/train_gpt.py is purely env-var driven (no source-code changes to the model architecture or serializer). The training script picks up the Phase 5a env vars at import time (make_model() reads HIDDEN_MULT, EMBED_QUANT_BITS, etc.).

Reproducibility:
bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1337
bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1338
bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1339

Hardware: 8x H100 80GB SXM (RunPod). 600 s wallclock training, ~50 min single-GPU SLOT-100 eval per seed (eval is unbounded).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
On top of PR #1394 (@clarkkev) — the current clean sp8192 benchmark — this submission adds a single-knob QK_GAIN_INIT=5.0 and a legal score-first TTT eval pass (TTT_LR=0.005, 3 epochs, freeze=0); every chunk is scored under inference_mode() before any parameter update. Hardware: 8×H100 80GB SXM, PyTorch 2.9.1+cu128. See the README in the new folder for the full two-table results + diagnostics layout per the repo's SUBMISSION_GUIDE.md.

Per-seed (post-TTT)

- seed 0:    1.08210 (val_loss 2.79517)
- seed 42:   1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean:      1.08279 (2.79697 nats per token)
Lineage / change from PR #1394

(1) QK_GAIN_INIT raised from 4.0 → 5.0; (2) added a legal score-first TTT sliding pass (LR=0.005, 3 epochs, freeze_blocks=0) as an additional eval mode.

Compliance (Issue #1017 four conditions)
Every chunk is scored under torch.inference_mode() BEFORE any parameter update. Training on a chunk only happens AFTER its scoring has been accumulated into loss_sum. Matches the pattern of PR #549 (Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194, 3-seed mean).

Additional flags: --frontier-bpp 1.08563 --merged-sota-nats 2.80428.

Reproduction
Credits
Files
Only adds records/track_10min_16mb/2026-04-06_SP8192_QK5_LegalTTT_1.0828/ with README, submission.json, train_gpt.py, and 3 seed logs.