
Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)#1413

Open
dexhunter wants to merge 1 commit into openai:main from
dexhunter:record/sp8192-qk5-legal-ttt-1.08279

Conversation

@dexhunter

Summary

On top of PR #1394 (@clarkkev), the current clean sp8192 benchmark, this submission adds a single knob (QK_GAIN_INIT=5.0) and a legal score-first TTT eval pass (TTT_LR=0.005, 3 epochs, freeze=0).
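QK_GAIN_INIT is not defined in this PR's text, so the sketch below is one plausible reading, not the submission's code: a learnable multiplier on the query-key attention logits, initialized to 5.0 and trained thereafter. The function name and toy vectors are illustrative only.

```python
import math

def qk_logits(q, k, gain=5.0):
    """Scaled dot-product attention logit with a QK gain.

    Hypothetical reading of QK_GAIN_INIT: the usual dot(q, k) / sqrt(d)
    logit is multiplied by a gain initialized to 5.0 (a learnable scalar
    in the real model; a plain float here).
    """
    d = len(q)
    dot = sum(qi * ki for qi, ki in zip(q, k))
    return gain * dot / math.sqrt(d)

# Toy 4-dim vectors: at init the gain sharpens the logit fivefold.
q = [1.0, 0.0, 1.0, 0.0]
k = [1.0, 1.0, 0.0, 0.0]
print(qk_logits(q, k))            # 5.0 * 1.0 / 2 = 2.5
print(qk_logits(q, k, gain=1.0))  # baseline scaling:  0.5
```

A larger initial gain makes the pre-softmax logits sharper at the start of training; whether that is the intent here is an assumption.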

  • val_bpb: 1.08279 (3-seed mean, seeds 0/42/1234), 0.00731 nats/token below PR #1394 (1.08563), clearing the 0.005-nat record threshold by 0.00231 nats.
  • All 3 seeds fit 16 MB (margins 7,454–10,942 bytes)
  • Training 588 s / seed, eval 381–392 s / seed (well under the 600 s budgets)
  • No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change. Score-first TTT follows the PR #549 precedent — every chunk is scored under inference_mode() before any parameter update.

Hardware: 8×H100 80GB SXM, PyTorch 2.9.1+cu128. See the README in the new folder for the full two-table results and diagnostics layout per the repo's SUBMISSION_GUIDE.md.

Per-seed (post-TTT)

| Seed | Pre-TTT sliding bpb | Post-TTT bpb | Δ TTT (bpb) | Artifact (bytes) | Train (ms) | Eval (ms) |
|------|---------------------|--------------|-------------|------------------|------------|-----------|
| 0    | 1.08397 | 1.08210 | −0.00187 | 15,991,018 | 588,004 | 385,050 |
| 42   | 1.08470 | 1.08315 | −0.00155 | 15,992,546 | 588,009 | 381,500 |
| 1234 | 1.08590 | 1.08314 | −0.00276 | 15,989,058 | 588,000 | 386,880 |
| mean | 1.08486 | 1.08279 | −0.00206 | 15,990,874 | 588,004 | 384,477 |

Lineage / change from PR #1394

Compliance (Issue #1017 four conditions)

  • Condition 1 (Causality): Strict left-to-right causal model. Sliding eval never references future tokens.
  • Condition 2 (Normalized distribution): Standard softmax over full vocab. No logit biasing, no BigramHash, no two-pass.
  • Condition 3 (Score before update): Every TTT chunk is scored under torch.inference_mode() BEFORE any parameter update. Training on a chunk happens only AFTER its score has been accumulated into loss_sum, matching the PR #549 pattern.
  • Condition 4 (Single pass): Each token is scored exactly once.
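The score-before-update ordering in Conditions 3 and 4 can be sketched as a plain loop. This is not the submission's eval code: `score` and `update` are hypothetical stand-ins for the frozen inference_mode() scoring pass and the TTT optimizer step.

```python
def score_first_ttt(chunks, score, update, epochs=3):
    """Sketch of the legal score-first TTT ordering.

    Each chunk is scored with frozen parameters FIRST and its loss is
    accumulated; only then is the model adapted on that same chunk, so no
    token is ever scored by parameters that already trained on it.
    """
    loss_sum = 0.0
    log = []  # event trace, to make the ordering inspectable
    for chunk in chunks:
        loss_sum += score(chunk)        # Condition 3: score before update
        log.append(("score", chunk))    # Condition 4: scored exactly once
        for _ in range(epochs):         # then TTT_EPOCHS updates on it
            update(chunk)
            log.append(("update", chunk))
    return loss_sum, log

# Two toy chunks, two epochs: every "score" event strictly precedes that
# chunk's "update" events in the trace.
loss, log = score_first_ttt([0, 1], score=lambda c: 1.0,
                            update=lambda c: None, epochs=2)
print(loss)            # 2.0
print(log[0], log[3])  # ('score', 0) ('score', 1)
```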

Additional flags:

Reproduction

```shell
# Single new training knob vs PR #1394 (QK_GAIN_INIT) plus the TTT flags.
export NCCL_NET=Socket     # run NCCL over sockets
export QK_GAIN_INIT=5.0    # the one new knob
export TTT_ENABLED=1       # enable the legal score-first TTT eval pass
export TTT_LR=0.005
export TTT_EPOCHS=3
for SEED in 0 42 1234; do
    # pass the seed into the training process environment per run
    SEED=$SEED torchrun --standalone --nproc_per_node=8 train_gpt.py
done
```

Credits

Files

Only adds records/track_10min_16mb/2026-04-06_SP8192_QK5_LegalTTT_1.0828/ with README, submission.json, train_gpt.py, and 3 seed logs.

…(3-seed mean)

On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal
score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the
clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all
fitting 16MB with 7-11K margin.

Per-seed (post-TTT):
- seed 0   : 1.08210 (val_loss 2.79517)
- seed 42  : 1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean     : 1.08279 (2.79697 nats per token)

Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token,
clearing the 0.005 nats record threshold by 0.00231 nats per seed.
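The bpb-to-nats conversion above can be sanity-checked from the reported means alone. The nats-per-bpb ratio below is derived from the quoted val_loss and val_bpb, not taken from the submission's code; the small gap to the quoted 0.00731 is presumably rounding of the reported means.

```python
# Assumption: val_loss is mean nats/token and val_bpb is bits per byte,
# so their ratio (ln 2 times bytes-per-token) converts bpb deltas to nats.
nats_per_bpb = 2.79697 / 1.08279   # reported post-TTT means

delta_bpb = 1.08563 - 1.08279      # improvement vs PR #1394
delta_nats = delta_bpb * nats_per_bpb

print(round(delta_bpb, 5))         # 0.00284
print(round(delta_nats, 5))        # 0.00734, close to the quoted 0.00731
```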

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change.
Score-first TTT matches PR openai#549 precedent: every chunk scored under
inference_mode() before any parameter update.
