11L Int5-MLP + TTT-SGD + SmearGate + SWA (1.1455 BPB)#264
stukenov wants to merge 1 commit into openai:main
Conversation
11-layer model with mixed int5/int6 quantization, full-model SGD test-time training, SmearGate, BigramHash, SWA, and OrthoInit. Single-seed result (seed=1337): val_bpb=1.1455. Artifact: 15.94 MB (under the 16 MB limit). 3-seed validation in progress.
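The int5 quantizer itself isn't shown in this excerpt; as a point of reference, here is a minimal numpy sketch of symmetric per-tensor int5 fake quantization (the function name and per-tensor scaling are assumptions, not the PR's code):

```python
import numpy as np

def fake_quant_int5(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor fake quantization to the int5 range [-16, 15].

    The scale maps the largest |weight| to +/-15 so the grid is symmetric;
    values are rounded to the nearest of the 32 representable levels and
    then dequantized back to float.
    """
    qmax = 15
    scale = max(float(np.abs(w).max()), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -16, 15)  # integer codes
    return q * scale                            # dequantized weights
```

With 5 bits the grid has 32 levels, so the worst-case round-off for in-range values is half a quantization step.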
ttt with int5 mlp is a sick combo, how long does your ttt take during eval? tryna figure out if 3 epochs is worth it over 2
@notapplica i need runpod credits for the 3-seed validation
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264), MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048), SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer. Single-seed result (seed=1337), ~8903 steps on 8xH100.
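The EMA weight averaging stacked in from PR openai#287 is referenced but not inlined here; the standard form it presumably follows is a decayed shadow copy of the weights (class name and decay value are hypothetical, shown with numpy arrays standing in for parameter tensors):

```python
import numpy as np

class EMAWeights:
    """Exponential moving average of model weights.

    Keeps a shadow copy of every parameter and blends in the live
    weights after each optimizer step: shadow = d*shadow + (1-d)*param.
    The shadow copy is what gets evaluated/exported.
    """
    def __init__(self, params: dict, decay: float = 0.999):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params: dict) -> None:
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v
```

Evaluating the shadow weights instead of the raw ones typically smooths out late-training noise at zero extra training cost.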
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442) flagged as potentially invalid for adapting on eval tokens BEFORE scoring them. Added correct score-then-adapt protocol with implementation guide. https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
Community Review — 11L Int5-MLP + TTT-SGD + SmearGate + SWA (1.1455 BPB)
BPB: 1.1455 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on
What I found in the code (head SHA
At line 830 the pre-quant TTT function takes
Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.
Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not
CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.12s, dim=512, layers=11, vocab=1024, code=56124 B, SMOKE_TEST_PASS
Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of
Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based
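For reference, the score-then-adapt protocol the review demands can be sketched in a few lines: each eval token contributes to the loss before any gradient step sees it, in a single pass, so no token's score benefits from having been trained on. The toy unigram model below is purely illustrative and is not the PR's adapter:

```python
import numpy as np

class ToyUnigram:
    """Tiny adaptable model: a learned categorical over the vocab."""
    def __init__(self, vocab: int):
        self.logits = np.zeros(vocab)

    def _probs(self) -> np.ndarray:
        p = np.exp(self.logits - self.logits.max())
        return p / p.sum()

    def log_prob(self, target: int) -> float:
        return float(np.log(self._probs()[target]))

    def sgd_step(self, target: int, lr: float) -> None:
        grad = self._probs()
        grad[target] -= 1.0          # d(NLL)/d(logits) for softmax
        self.logits -= lr * grad

def score_then_adapt(model: ToyUnigram, tokens: list) -> float:
    """Legal TTT eval loop: score each token BEFORE taking a gradient
    step on it. Single pass, no multi-epoch re-scoring."""
    nll_bits = 0.0
    for tok in tokens:
        nll_bits -= model.log_prob(tok) / np.log(2)  # score first
        model.sgd_step(tok, lr=0.5)                  # adapt afterwards
    return nll_bits / len(tokens)                    # bits per token
```

The flagged pattern inverts this order (train for several epochs, then score), which lets every token be scored by a model that has already seen it.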
Summary
Techniques
Architecture
11L / 512d / 8h / 4kv (GQA) / MLP 3x / relu^2 / 2048 seq_len / 26.67M params
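The 8h/4kv GQA shape above can be sketched in numpy: 8 query heads attend causally while each pair of them shares one of the 4 K/V heads. Weight shapes and the per-head loop are illustrative, not the PR's kernel:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, wo, n_heads=8, n_kv=4):
    """Causal grouped-query attention for the stated 8h/4kv config.

    x: (T, d) activations with d = 512, head_dim = d // n_heads = 64.
    K/V are projected to n_kv * head_dim = 256 dims; query head h
    reads K/V head h // (n_heads // n_kv).
    """
    T, d = x.shape
    hd = d // n_heads
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv, hd)
    v = (x @ wv).reshape(T, n_kv, hd)
    group = n_heads // n_kv                     # 2 query heads per kv head
    mask = np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
    out = np.empty_like(q)
    for h in range(n_heads):
        kk, vv = k[:, h // group], v[:, h // group]
        att = q[:, h] @ kk.T / np.sqrt(hd) + mask
        att -= att.max(axis=-1, keepdims=True)
        w = np.exp(att)
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ vv
    return out.reshape(T, d) @ wo
```

Halving the K/V heads (4kv vs 8h) halves the K/V projection parameters and cache, which is where most of the GQA savings come from at this scale.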
Results
Trained on 8xH100 SXM, 600s wallclock, 115ms/step.
Test plan