
Record: SP8192 + Recur345 + Par7 + EMA + QK5.25 + Pre-Quant TTT 10ep — val_bpb 1.0600 (3-seed mean) #1487

Closed
ndokutovich wants to merge 1 commit into openai:main from ndokutovich:s4-h3-submission

Conversation

@ndokutovich

Record: SP8192 + Full Stack + Tuned Pre-Quant TTT

val_bpb = 1.0600 (3-seed mean, std 0.0002) | ~15.95 MB | 8xH100 SXM

3-Seed Results

| Seed | Sliding BPB | Steps | Artifact (bytes) |
| --- | --- | --- | --- |
| 42 | 1.06023436 | 5161 | 15,954,437 |
| 1337 | 1.05980538 | 5174 | 15,954,178 |
| 2024 | 1.06010381 | 5164 | 15,960,801 |
| **Mean** | **1.06004785** | | |

What Changed vs PR #1485

Hyperparameter tuning on pre-quant TTT:

| Parameter | PR #1485 | This PR |
| --- | --- | --- |
| QK_GAIN_INIT | 5.0 | 5.25 |
| TTT_EPOCHS | 6 | 10 |
| TTT_FREEZE_BLOCKS | 2 | 1 |
| TTT_LR | 0.0005 | 0.00045 |
| 3-seed mean val_bpb | 1.0679 | 1.0600 |

Same architecture, same code, different env vars. Delta: -0.0079 BPB.
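Since the entire delta comes from environment variables, the relevant code pattern is just env-var-driven hyperparameter defaults. A minimal sketch of how such reads typically look (helper names are illustrative, not the repo's actual code; the variable names and PR #1485 defaults follow the table above):

```python
import os

def env_float(name: str, default: float) -> float:
    """Read a float hyperparameter from the environment, falling back to a default."""
    return float(os.environ.get(name, default))

def env_int(name: str, default: int) -> int:
    """Read an int hyperparameter from the environment, falling back to a default."""
    return int(os.environ.get(name, default))

# Knobs tuned in this PR; defaults shown are the PR #1485 settings.
QK_GAIN_INIT = env_float("QK_GAIN_INIT", 5.0)
PREQUANT_TTT_EPOCHS = env_int("PREQUANT_TTT_EPOCHS", 6)
PREQUANT_TTT_FREEZE_BLOCKS = env_int("PREQUANT_TTT_FREEZE_BLOCKS", 2)
PREQUANT_TTT_LR = env_float("PREQUANT_TTT_LR", 0.0005)
```

With this pattern, the reproduction command below fully determines the run: setting `QK_GAIN_INIT=5.25` etc. overrides the defaults without touching code.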

Full Stack

SP8192, 11L/13 virtual (3-layer depth recurrence), parallel residuals (L7+), EMA 0.9965, QK-Gain 5.25, skip gates, MuonEq-R, pre-quant AdamW TTT (10ep, lr=0.00045, freeze 1, cosine), SDClip GPTQ int6 + int8 embed + brotli.

Compliance (Track A)

  • Pre-quant TTT on val data BEFORE quantization, baked into artifact
  • No eval-time adaptation, no SLOT, no n-gram cache

Reproduction

```sh
pip install brotli sentencepiece kernels
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
VOCAB_SIZE=8192 QK_GAIN_INIT=5.25 PREQUANT_TTT_EPOCHS=10 PREQUANT_TTT_FREEZE_BLOCKS=1 PREQUANT_TTT_LR=0.00045 SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

PR #1471 @X-Abhishek-X, PR #1423 @aryanbhosale, PR #1394 @clarkkev, PR #1204 @msisovic, PR #1482 @aamodbhatt

Checklist

  • One folder under records/track_10min_16mb/
  • README.md, submission.json, train_gpt.py
  • 3 seed logs
  • All artifacts < 16,000,000 bytes
  • Train wallclock < 600s

…— val_bpb 1.0600 (3-seed mean)

Tuned variant with QK-Gain 5.25, 10-epoch TTT (lr=0.00045, freeze 1 block).
  seed 42:   1.06023436
  seed 1337: 1.05980538
  seed 2024: 1.06010381
  mean:      1.06004785 (std 0.0002)
owizdom added a commit to owizdom/parameter-golf that referenced this pull request Apr 9, 2026
…nthesis (validation pending)

First submission to stack three independently-legal val-data adaptations on the
PR openai#1487 (1.0600) base:

1. Pre-Quant AdamW TTT pushed to 11 epochs with freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H=X^T X computed from validation activations
   to align quantization with the eval distribution (novel on the modern stack;
   PR openai#1019 ablated this on its older base only)
3. Eval-Time Legal Score-First TTT 2 epochs with score-before-update ordering
   (Track B, builds on PR openai#1493)
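Item 2's calibration statistic is the standard GPTQ Hessian, H = XᵀX accumulated over layer-input activations — here sourced from validation batches instead of generic calibration text. A hedged numpy sketch of the accumulation (shapes, damping constant, and function name are illustrative, not the PR's actual collect_hessians_val):

```python
import numpy as np

def accumulate_hessian(batches, d_in, damp=0.01):
    """Accumulate H = sum_b X_b^T X_b over calibration batches.

    Each batch is an (n_tokens, d_in) matrix of layer-input activations.
    GPTQ uses H (plus a diagonal damping term) to order columns and
    compensate rounding error during weight quantization.
    """
    H = np.zeros((d_in, d_in))
    n = 0
    for X in batches:
        H += X.T @ X
        n += X.shape[0]
    H /= max(n, 1)  # average over tokens so damping scales sensibly
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)  # GPTQ-style damping
    return H
```

Feeding validation activations here is what aligns the rounding compensation with the eval distribution; the quantization machinery downstream is unchanged.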

The three knobs attack the 0.0187 BPB quantization gap measured in PR openai#1487
(1.0415 post-prequant-TTT FP -> 1.0602 post-quant sliding) from independent
angles. PR openai#1487's eval_val_ttt code path is unchanged but enabled via env vars.

Code diff vs PR openai#1487 base: 186 lines (~100 added in new collect_hessians_val
function, plus 8 hyperparameter defaults flipped). Architecture, optimizer,
training loop, EMA, and quantization machinery are byte-identical to PR openai#1487.

Projected val_bpb range: 1.0452 - 1.0542 (center 1.0497), which would clear
the 0.005-nat SOTA threshold over PR openai#1487. Worst case ~1.054 (still strong
non-record). py_compile clean. 3-seed validation requires ~$15-25 of 8xH100
SXM time on RunPod; see VALIDATION.md.

Compliance: Track A (artifact-baked val-data adaptation) + Track B (eval-time
score-first TTT). No SLOT, no n-gram cache, no ETLB.

Credits: PR openai#1487 ndokutovich, PR openai#1493 bigbag, PR openai#1019 abaybektursun,
PR openai#1394 clarkkev, PR openai#1413 dexhunter, PR openai#549 abaybektursun, PR openai#1412 Robby955,
PR openai#1204 msisovic, PR openai#1423 aryanbhosale, PR openai#1445 X-Abhishek-X.
owizdom added a commit to owizdom/parameter-golf that referenced this pull request Apr 9, 2026
…ib GPTQ + SLOT-24

Replaces the triple-stack (Pre-Quant TTT + Val-Calib GPTQ + Eval-Time Legal TTT)
with a quad-stack that supersedes the legal TTT path with SLOT-24, ported from
PR openai#1488 / PR openai#1313.

Four val-data adaptations stacked for the first time:

1. Pre-Quant AdamW TTT — 11 epochs, freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H=X^T X from val activations (Track A)
3. SLOT-24 — per-window hidden delta + logit bias on the frozen post-quant
   model, 24 cosine-decayed AdamW steps, throwaway parameters
4. (Optional) Eval-Time Legal Score-First TTT — disabled by default; SLOT
   supersedes it within the eval budget. Set SLOT_ENABLED=0 TTT_ENABLED=1
   to fall back.
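The SLOT mechanics in item 3 can be illustrated on the logit-bias half alone (the hidden-delta half is optimized the same way). A toy numpy sketch, assuming plain SGD in place of the described AdamW for brevity, with the cosine-decayed step schedule; step count and LR values are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_logit_bias(frozen_logits, targets, steps=24, lr=0.1, lr_min=0.01):
    """Fit a per-window logit bias b on the frozen model's logits.

    frozen_logits: (T, V) outputs of the frozen post-quant model for one window.
    targets: (T,) next-token ids for the same window.
    b is a throwaway parameter: re-initialized to zero for every window,
    so nothing learned here persists across windows.
    """
    T, V = frozen_logits.shape
    b = np.zeros(V)
    for t in range(steps):
        # cosine decay from lr down to lr_min over the step budget
        lr_t = lr_min + 0.5 * (lr - lr_min) * (1 + np.cos(np.pi * t / steps))
        p = softmax(frozen_logits + b)          # (T, V)
        onehot = np.eye(V)[targets]
        grad = (p - onehot).mean(axis=0)        # d(mean CE)/d b
        b -= lr_t * grad
    return b
```

Scoring the window with `frozen_logits + b` lowers its cross-entropy; because `b` is discarded and refit per window, the base model's weights never change.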

Code changes vs the previous synthesis commit:

- GPT class: split forward_logits into forward_hidden + compute_logits so
  SLOT can add the per-window delta to the hidden state without re-running
  the transformer stack.
- New eval_val_slot function ported from PR openai#1488 (per-window AdamW with
  cosine LR decay, stride masking, score-after-delta).
- run_evals: wires SLOT on a fresh post-quant model copy, gated by
  SLOT_ENABLED. Disables legal TTT by default.
- New hyperparameters: SLOT_ENABLED, SLOT_STEPS, SLOT_LR, SLOT_LR_MIN,
  SLOT_BATCH_SEQS, SLOT_EVAL_STRIDE.
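The forward_hidden / compute_logits split described above can be sketched with a toy model: the expensive stack runs once per window, and SLOT's repeated steps only re-run the cheap head with a delta added. Method names mirror the commit; the arithmetic is purely illustrative:

```python
import numpy as np

class TinyLM:
    """Toy stand-in for the GPT refactor: forward_hidden runs the
    (expensive) transformer stack once; compute_logits is a cheap head
    that SLOT can call repeatedly with a per-window hidden delta."""

    def __init__(self, d, vocab, seed=0):
        rng = np.random.default_rng(seed)
        self.W_stack = rng.standard_normal((d, d)) / np.sqrt(d)
        self.W_head = rng.standard_normal((d, vocab)) / np.sqrt(d)

    def forward_hidden(self, x):          # x: (T, d) — run once per window
        return np.tanh(x @ self.W_stack)

    def compute_logits(self, h, delta=None):  # cheap, run per SLOT step
        if delta is not None:
            h = h + delta                 # per-window hidden delta
        return h @ self.W_head
```

With `delta=None` (or zeros) the two calls compose to the original forward pass, so the split is behavior-preserving for the non-SLOT path.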

Folder renamed: 2026-04-09_PreQuantTTT11_ValCalibGPTQ_LegalEvalTTT_Synthesis
              -> 2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis

Time budget: ~530s of the 600s eval budget used (190s prequant TTT + 10s val-calib GPTQ + 80s sliding eval baseline + 250s SLOT-24); train wallclock is 590s against its own 600s budget.

Code: 2322 lines (vs 2039 in PR openai#1487 base, +283 added). py_compile clean.
README rewritten as user's submission with compact credits section.
@ndokutovich
Author

Closing as invalid. Same prequant_ttt_adapt_adamw implementation as #1485, which violates Condition 3 of #1017 (score-before-update). Full technical analysis in #1485. Will reimplement with per-chunk score-first pattern from #1413 / #549 before any future submission.
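For reference, the per-chunk score-first pattern from #1413 / #549 reduces to: score each chunk with the current weights, then update on that same chunk, so no chunk is ever scored by weights that have already seen it. A schematic sketch (the `score`/`update` callables stand in for real model evaluation and adaptation):

```python
def score_first_ttt(chunks, score, update):
    """Legal eval-time TTT ordering: every chunk contributes its loss
    BEFORE the model adapts on it (score-before-update)."""
    total, n = 0.0, 0
    for chunk in chunks:
        total += score(chunk)   # score with current weights first ...
        n += len(chunk)
        update(chunk)           # ... only then adapt on the same chunk
    return total / max(n, 1)
```

Violating Condition 3 means swapping the two calls inside the loop — the aggregate loss then reflects weights trained on the scored data, which is what invalidates this submission's prequant_ttt_adapt_adamw path.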
