11L Int5-MLP + TTT-SGD + SmearGate + SWA (1.1455 BPB) by stukenov · Pull Request #264 · openai/parameter-golf

stukenov · 2026-03-20T20:09:28Z

Summary

val_bpb: 1.1455 (seed 1337, single seed — 3-seed validation in progress)
Artifact: 15.94 MB (int5-MLP + int6-attn + zstd-22)

Techniques

Technique	Source	Impact
11 layers (vs 9 baseline)	Funded by int5 savings	More model capacity
Int5 MLP [-16,15] + Int6 attention [-32,31]	Inspired by #180	Saves ~1.9MB, funds 11th layer
Full-model SGD TTT (2 epochs)	Inspired by #152	~0.005 BPB at eval time
SmearGate + BigramHash	From #102/#135	Bigram context injection
SWA (30 checkpoints)	From #89	Better generalization
OrthoInit + muP scaling	From #162	Stable training
Muon WD=0.04	From #60	Quantization-friendly weights
Sliding window eval stride=64	From #50	Full context per token

Architecture

11L / 512d / 8h / 4kv (GQA) / MLP 3x / relu^2 / 2048 seq_len / 26.67M params

Results

Stage	val_bpb
End of training (5197 steps)	1.1583
Post int5/int6 quant + sliding window	1.1507
Post TTT (2 epochs SGD, lr=0.002)	1.1455

Trained on 8xH100 SXM, 600s wallclock, 115ms/step.

Test plan

Single seed run (1337): 1.1455 BPB
3-seed validation (1337, 42, 2025) — in progress
Artifact under 16MB: 15.99 MB total
Training under 10 min: 600s on 8xH100
Eval under 10 min: ~696s (TTT 422s + sliding eval 273s)

11-layer model with mixed int5/int6 quantization, full-model SGD test-time training, SmearGate, BigramHash, SWA, and OrthoInit. Single seed result (1337): val_bpb=1.1455 Artifact: 15.94 MB (under 16MB limit) 3-seed validation in progress.

mohosy · 2026-03-21T00:01:01Z

ttt with int5 mlp is a sick combo, how long does your ttt take during eval? tryna figure out if 3 epochs is worth it over 2

stukenov · 2026-03-21T06:35:29Z

@notapplica i am need runpod credits for 3-seed validation

Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264), MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048), SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer. Single-seed result (seed=1337), ~8903 steps on 8xH100.

Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442) flagged as potentially invalid for adapting on eval tokens BEFORE scoring them. Added correct score-then-adapt protocol with implementation guide. https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y

MatoTeziTanka · 2026-04-11T20:08:32Z

Community Review — 11L Int5-MLP + TTT-SGD + SmearGate + SWA (1.1455 BPB)

BPB: 1.1455 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA ab0777880a9b, file records/track_10min_16mb/2026-03-21_11L_Int5MLP_TTT_SmearGate/train_gpt.py):

At line 830 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

ttt_adapt(model, device, val_tokens, seq_len, lr, momentum, epochs, batch_size, log_fn) — for epoch in range(epochs), loss.backward() without prior no_grad score pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is on the function signature itself — the argument tensor passed in.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.12s, dim=512, layers=11, vocab=1024, code=56124 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.12s, dim=512, layers=11, vocab=1024, code=56124 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

notapplica mentioned this pull request Mar 20, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

HyperPotatoNeo mentioned this pull request Mar 21, 2026

11L + XSA4 + EMA(0.997) + seq2048 + Int5-MLP + MuonWD=0.04 + LateK-FP16 | val_bpb=1.1361 #372

Closed

leloykun mentioned this pull request Mar 22, 2026

Invalid submissions due to information leakage during TTT #402

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

11L Int5-MLP + TTT-SGD + SmearGate + SWA (1.1455 BPB)#264

11L Int5-MLP + TTT-SGD + SmearGate + SWA (1.1455 BPB)#264
stukenov wants to merge 1 commit intoopenai:mainfrom
stukenov:submission/11L-int5-ttt

stukenov commented Mar 20, 2026

Uh oh!

mohosy commented Mar 21, 2026

Uh oh!

stukenov commented Mar 21, 2026

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

stukenov commented Mar 20, 2026

Summary

Techniques

Architecture

Results

Test plan

Uh oh!

mohosy commented Mar 21, 2026

Uh oh!

stukenov commented Mar 21, 2026

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Community Review — 11L Int5-MLP + TTT-SGD + SmearGate + SWA (1.1455 BPB)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants