
Record: VarLen Attention + Triton Fused MLP + Doc-TTT + Warmdown 0.75 + Chunk 48 — val_bpb 1.07406 (3-seed mean)#1560

Open
dexhunter wants to merge 2 commits into openai:main from dexhunter:dexhunter/varlen-ttt-warmdown75-chunk48

Conversation

@dexhunter
Contributor

Summary

val_bpb = 1.07406 (3-seed mean, std 0.00132) | 2.77441 nats | ~15.99 MB | 8xH100 SXM, 600s

| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | TTT Gain | TTT Time | Artifact (bytes) |
|------|-------|-------------|--------------|----------|----------|------------------|
| 42   | 4918  | 1.08400     | 1.07352      | -0.01048 | 213s     | 15,994,146       |
| 0    | 4900  | 1.08363     | 1.07310      | -0.01053 | 221s     | 15,997,570       |
| 1337 | 4908  | 1.08619     | 1.07556      | -0.01063 | 219s     | 15,988,610       |
| Mean | 4909  | 1.08461     | 1.07406      | -0.01055 | 218s     | 15,993,442       |

Merged SOTA (PR #1493 by @bigbag): 1.0810 BPB (2.78932 nats). Delta: -0.01491 nats, clearing the 0.005-nat improvement bar by 3.0x.

Key Innovation

Warmdown fraction and TTT chunk size tuning on PR #1530's VarLen + Triton fused MLP + doc-TTT stack:

  • warmdown_frac = 0.75 (vs 0.72 default) — extends cosine decay, letting the model settle into a lower-loss basin before quantization
  • TTT_CHUNK_SIZE = 48 (vs 32 default) — larger chunks provide more context per TTT gradient step, improving LoRA adaptation
  • Muon momentum 0.97 — shorter memory horizon tracks the loss surface better during extended warmdown
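For reference, a minimal sketch of the kind of LR schedule the warmdown fraction controls: constant LR, then cosine decay over the final `warmdown_frac` of training. The function name and exact shape are illustrative assumptions, not the speedrun's actual implementation:

```python
import math

def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.75) -> float:
    """Illustrative schedule (hypothetical helper, not the repo's code):
    hold the LR flat, then cosine-decay it over the last `warmdown_frac`
    of training. warmdown_frac = 0.75 starts the decay earlier than the
    0.72 default, giving a longer, gentler descent into the final basin.
    """
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0  # flat phase
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # 1.0 -> 0.0
```

With ~4909 total steps and `warmdown_frac = 0.75`, decay would begin around step 1227 instead of step 1374 under the 0.72 default.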

Rule Compliance (Issue #1017)

  • Condition 1 (Causality): VarLen attention with per-document cu_seqlens, strict causal masking
  • Condition 2 (Normalized): Standard softmax over full vocabulary
  • Condition 3 (Score before update): TTT chunks scored under torch.no_grad() BEFORE LoRA gradient update
  • Condition 4 (Single pass): Each token scored exactly once
  • No SLOT, no pre-quant TTT, no n-gram cache
  • All artifacts < 16 MB, train < 600s, eval < 225s
  • Compile warmup uses random tokens (not val data)
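The score-before-update ordering (Conditions 3 and 4) can be sketched as a plain loop; all names here are hypothetical stand-ins for the actual TTT code, and the real run would score under `torch.no_grad()`:

```python
def make_chunks(tokens, chunk_size=48):
    """Split the token stream into fixed-size chunks (48 per this PR)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def ttt_eval(chunks, score, adapt):
    """Score-then-adapt loop (hypothetical sketch). Each chunk is scored
    with the weights as they stand BEFORE its own test-time update
    (Condition 3), and every chunk -- hence every token -- is scored
    exactly once (Condition 4)."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score(chunk)  # frozen-weight scoring; no_grad in practice
        adapt(chunk)                # LoRA gradient step happens AFTER scoring
    return total_loss
```

The larger chunk size trades update frequency for context: each LoRA step sees 48 tokens of evidence instead of 32, at the cost of fewer adaptation steps per document.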

Test Plan

  • 3-seed verification (seeds 42, 0, 1337)
  • All artifacts under 16,000,000 bytes
  • Train under 600s on all seeds (~587s)
  • Eval under 225s on all seeds

Credits

Pavel Liashkov and others added 2 commits April 11, 2026 19:48
…Legal TTT — val_bpb 1.07785 (3-seed mean)

3-seed mean: 1.07785 (std 0.00047), seeds 42/314/999
All artifacts under 16MB, training under 600s, eval under 600s
Improved parallel residuals (cross-lane routing), Muon 0.97, MATRIX_LR=0.03
Score-first TTT (SGD 3ep), no SLOT, no pre-quant TTT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… (3-seed mean)

PR openai#1530 v2 base + warmdown_frac=0.75 + TTT_CHUNK_SIZE=48 + Muon 0.97.
3-seed mean: 1.07406 (std 0.00132), 2.77441 nats.
Delta vs merged SOTA (openai#1493): -0.01491 nats (clears 0.005 bar by 3.0x).
All artifacts < 16 MB, train < 600s, eval < 225s.
