
Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)#1413

Open
dexhunter wants to merge 1 commit into openai:main from
dexhunter:record/sp8192-qk5-legal-ttt-1.08279

Conversation

@dexhunter

Summary

On top of PR #1394 (@clarkkev), the current clean sp8192 benchmark, this submission adds a single knob (QK_GAIN_INIT=5.0) and a legal score-first TTT eval pass (TTT_LR=0.005, 3 epochs, freeze=0).
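QK_GAIN_INIT is not defined in this PR's text, so the sketch below is one plausible reading, not the submission's code: a learnable multiplier on the query-key attention logits, initialized to 5.0 and trained thereafter. The function name and toy vectors are illustrative only.

```python
import math

def qk_logits(q, k, gain=5.0):
    """Scaled dot-product attention logit with a QK gain.

    Hypothetical reading of QK_GAIN_INIT: the usual dot(q, k) / sqrt(d)
    logit is multiplied by a gain initialized to 5.0 (a learnable scalar
    in the real model; a plain float here).
    """
    d = len(q)
    dot = sum(qi * ki for qi, ki in zip(q, k))
    return gain * dot / math.sqrt(d)

# Toy 4-dim vectors: at init the gain sharpens the logit fivefold.
q = [1.0, 0.0, 1.0, 0.0]
k = [1.0, 1.0, 0.0, 0.0]
print(qk_logits(q, k))            # 5.0 * 1.0 / 2 = 2.5
print(qk_logits(q, k, gain=1.0))  # baseline scaling:  0.5
```

A larger initial gain makes the pre-softmax logits sharper at the start of training; whether that is the intent here is an assumption.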

  • val_bpb: 1.08279 (3-seed mean, seeds 0/42/1234), 0.00731 nats/token below PR #1394 (1.08563), clearing the 0.005-nat record threshold by 0.00231 nats.
  • All 3 seeds fit 16 MB (margins 7,454–10,942 bytes)
  • Training 588 s / seed, eval 381–392 s / seed (well under the 600 s budgets)
  • No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change. Score-first TTT follows the PR #549 precedent — every chunk is scored under inference_mode() before any parameter update.

Hardware: 8×H100 80GB SXM, PyTorch 2.9.1+cu128. See the README in the new folder for the full two-table results and diagnostics layout per the repo's SUBMISSION_GUIDE.md.

Per-seed (post-TTT)

| Seed | Pre-TTT sliding bpb | Post-TTT bpb | Δ TTT (bpb) | Artifact (bytes) | Train (ms) | Eval (ms) |
|------|---------------------|--------------|-------------|------------------|------------|-----------|
| 0    | 1.08397 | 1.08210 | −0.00187 | 15,991,018 | 588,004 | 385,050 |
| 42   | 1.08470 | 1.08315 | −0.00155 | 15,992,546 | 588,009 | 381,500 |
| 1234 | 1.08590 | 1.08314 | −0.00276 | 15,989,058 | 588,000 | 386,880 |
| mean | 1.08486 | 1.08279 | −0.00206 | 15,990,874 | 588,004 | 384,477 |

Lineage / change from PR #1394

Compliance (Issue #1017 four conditions)

  • Condition 1 (Causality): Strict left-to-right causal model. Sliding eval never references future tokens.
  • Condition 2 (Normalized distribution): Standard softmax over full vocab. No logit biasing, no BigramHash, no two-pass.
  • Condition 3 (Score before update): Every TTT chunk is scored under torch.inference_mode() BEFORE any parameter update. Training on a chunk happens only AFTER its score has been accumulated into loss_sum, matching the PR #549 pattern.
  • Condition 4 (Single pass): Each token is scored exactly once.
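The score-before-update ordering in Conditions 3 and 4 can be sketched as a plain loop. This is not the submission's eval code: `score` and `update` are hypothetical stand-ins for the frozen inference_mode() scoring pass and the TTT optimizer step.

```python
def score_first_ttt(chunks, score, update, epochs=3):
    """Sketch of the legal score-first TTT ordering.

    Each chunk is scored with frozen parameters FIRST and its loss is
    accumulated; only then is the model adapted on that same chunk, so no
    token is ever scored by parameters that already trained on it.
    """
    loss_sum = 0.0
    log = []  # event trace, to make the ordering inspectable
    for chunk in chunks:
        loss_sum += score(chunk)        # Condition 3: score before update
        log.append(("score", chunk))    # Condition 4: scored exactly once
        for _ in range(epochs):         # then TTT_EPOCHS updates on it
            update(chunk)
            log.append(("update", chunk))
    return loss_sum, log

# Two toy chunks, two epochs: every "score" event strictly precedes that
# chunk's "update" events in the trace.
loss, log = score_first_ttt([0, 1], score=lambda c: 1.0,
                            update=lambda c: None, epochs=2)
print(loss)            # 2.0
print(log[0], log[3])  # ('score', 0) ('score', 1)
```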

Additional flags:

Reproduction

```shell
# Single new training knob vs PR #1394 (QK_GAIN_INIT) plus the TTT flags.
export NCCL_NET=Socket     # run NCCL over sockets
export QK_GAIN_INIT=5.0    # the one new knob
export TTT_ENABLED=1       # enable the legal score-first TTT eval pass
export TTT_LR=0.005
export TTT_EPOCHS=3
for SEED in 0 42 1234; do
    # pass the seed into the training process environment per run
    SEED=$SEED torchrun --standalone --nproc_per_node=8 train_gpt.py
done
```

Credits

Files

Only adds records/track_10min_16mb/2026-04-06_SP8192_QK5_LegalTTT_1.0828/ with README, submission.json, train_gpt.py, and 3 seed logs.

…(3-seed mean)

On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal
score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the
clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all
fitting 16MB with 7-11K margin.

Per-seed (post-TTT):
- seed 0   : 1.08210 (val_loss 2.79517)
- seed 42  : 1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean     : 1.08279 (2.79697 nats per token)

Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token,
clearing the 0.005 nats record threshold by 0.00231 nats per seed.
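The bpb-to-nats conversion above can be sanity-checked from the reported means alone. The nats-per-bpb ratio below is derived from the quoted val_loss and val_bpb, not taken from the submission's code; the small gap to the quoted 0.00731 is presumably rounding of the reported means.

```python
# Assumption: val_loss is mean nats/token and val_bpb is bits per byte,
# so their ratio (ln 2 times bytes-per-token) converts bpb deltas to nats.
nats_per_bpb = 2.79697 / 1.08279   # reported post-TTT means

delta_bpb = 1.08563 - 1.08279      # improvement vs PR #1394
delta_nats = delta_bpb * nats_per_bpb

print(round(delta_bpb, 5))         # 0.00284
print(round(delta_nats, 5))        # 0.00734, close to the quoted 0.00731
```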

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change.
Score-first TTT matches PR openai#549 precedent: every chunk scored under
inference_mode() before any parameter update.
