Skip to content

11L Int5-MLP + TTT-SGD + SmearGate + SWA (1.1455 BPB)#264

Open
stukenov wants to merge 1 commit intoopenai:mainfrom
stukenov:submission/11L-int5-ttt
Open

11L Int5-MLP + TTT-SGD + SmearGate + SWA (1.1455 BPB)#264
stukenov wants to merge 1 commit intoopenai:mainfrom
stukenov:submission/11L-int5-ttt

Conversation

@stukenov
Copy link
Copy Markdown

Summary

  • val_bpb: 1.1455 (seed 1337, single seed — 3-seed validation in progress)
  • Artifact: 15.94 MB (int5-MLP + int6-attn + zstd-22)

Techniques

Technique Source Impact
11 layers (vs 9 baseline) Funded by int5 savings More model capacity
Int5 MLP [-16,15] + Int6 attention [-32,31] Inspired by #180 Saves ~1.9MB, funds 11th layer
Full-model SGD TTT (2 epochs) Inspired by #152 ~0.005 BPB at eval time
SmearGate + BigramHash From #102/#135 Bigram context injection
SWA (30 checkpoints) From #89 Better generalization
OrthoInit + muP scaling From #162 Stable training
Muon WD=0.04 From #60 Quantization-friendly weights
Sliding window eval stride=64 From #50 Full context per token

Architecture

11L / 512d / 8h / 4kv (GQA) / MLP 3x / relu^2 / 2048 seq_len / 26.67M params

Results

Stage val_bpb
End of training (5197 steps) 1.1583
Post int5/int6 quant + sliding window 1.1507
Post TTT (2 epochs SGD, lr=0.002) 1.1455

Trained on 8xH100 SXM, 600s wallclock, 115ms/step.

Test plan

  • Single seed run (1337): 1.1455 BPB
  • 3-seed validation (1337, 42, 2025) — in progress
  • Artifact under 16MB: 15.99 MB total
  • Training under 10 min: 600s on 8xH100
  • Eval under 10 min: ~696s (TTT 422s + sliding eval 273s)

11-layer model with mixed int5/int6 quantization, full-model SGD
test-time training, SmearGate, BigramHash, SWA, and OrthoInit.

Single seed result (1337): val_bpb=1.1455
Artifact: 15.94 MB (under 16MB limit)
3-seed validation in progress.
@mohosy
Copy link
Copy Markdown

mohosy commented Mar 21, 2026

ttt with int5 mlp is a sick combo, how long does your ttt take during eval? tryna figure out if 3 epochs is worth it over 2

@stukenov
Copy link
Copy Markdown
Author

@notapplica i am need runpod credits for 3-seed validation

HyperPotatoNeo added a commit to HyperPotatoNeo/parameter-golf that referenced this pull request Mar 21, 2026
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264),
MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048),
SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer.
Single-seed result (seed=1337), ~8903 steps on 8xH100.
ThomAub pushed a commit to ThomAub/parameter-golf that referenced this pull request Mar 22, 2026
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442)
flagged as potentially invalid for adapting on eval tokens BEFORE scoring them.
Added correct score-then-adapt protocol with implementation guide.

https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
@MatoTeziTanka
Copy link
Copy Markdown

Community Review — 11L Int5-MLP + TTT-SGD + SmearGate + SWA (1.1455 BPB)

BPB: 1.1455 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA ab0777880a9b, file records/track_10min_16mb/2026-03-21_11L_Int5MLP_TTT_SmearGate/train_gpt.py):

At line 830 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

ttt_adapt(model, device, val_tokens, seq_len, lr, momentum, epochs, batch_size, log_fn) — for epoch in range(epochs), loss.backward() without prior no_grad score pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is on the function signature itself — the argument tensor passed in.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.12s, dim=512, layers=11, vocab=1024, code=56124 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.


Reviewed by @MatoTeziTankaThe Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.12s, dim=512, layers=11, vocab=1024, code=56124 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants