
Record: 11L + Partial XSA + TTT + BatchOpt (val_bpb=1.1354) #290

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:record-xsa-ttt-submission

Conversation

@ibarrajo

Summary

  • val_bpb: 1.1354 (sliding window, stride=64)
  • 15.85 MB artifact (int6 + zstd-22, under 16MB)
  • 8xH100 SXM, 8,945 steps in 600s + 132s eval
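The headline number is a sliding-window val_bpb with stride=64. As a torch-free illustration (an assumption about this PR's eval scheme, not its actual code), one common version slides the context window forward by `stride` tokens and scores only the newest `stride` tokens of each window:

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    # Hypothetical sketch of stride-64 sliding-window evaluation:
    # each span scores `stride` fresh tokens with up to `window`
    # preceding tokens available as context.
    spans = []
    pos = 0
    while pos + stride <= n_tokens:
        ctx_start = max(0, pos + stride - window)
        # (context start, score start, score end)
        spans.append((ctx_start, pos, pos + stride))
        pos += stride
    return spans

spans = sliding_eval_spans(256, window=128, stride=64)
```

With small toy sizes this yields four spans, the last being `(128, 192, 256)`; every token after the first span is predicted with near-full context, which is why sliding bpb (1.1354) beats the standard roundtrip bpb (1.1583).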

Approach

Four improvements stacked on the PR #198 base:

  1. Partial XSA (last 3 layers): efficient GQA-aware self-attention debiasing (PR #265, val_bpb 1.1307; arXiv:2603.09078)
  2. TTT (3-epoch full-model SGD, first 2 blocks frozen): eval-time adaptation (PR #254, val_bpb 1.1303)
  3. Batch=524K: 22% more gradient updates (finding from PR #236, val_bpb 1.1400)
  4. RoPE base 50K: extended positional encoding (PR #206, mean val_bpb 1.1507)
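The four stacked changes can be summarized as a settings sketch. These key names are illustrative only, not the actual train_gpt.py flags:

```python
# Hypothetical configuration summarizing the four stacked changes;
# actual flag names and layer indexing in train_gpt.py may differ.
config = {
    "n_layers": 11,
    "xsa_layers": [8, 9, 10],     # Partial XSA on the last 3 of 11 layers
    "ttt_epochs": 3,              # eval-time SGD over 3 epochs
    "ttt_freeze_blocks": 2,       # first 2 blocks frozen during TTT
    "batch_tokens": 524_288,      # 524K-token batches (~22% more updates)
    "rope_base": 50_000,          # extended RoPE base frequency
}
assert len(config["xsa_layers"]) == 3
```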

Key Metrics

| Metric | Value |
| --- | --- |
| Sliding val_bpb (stride=64) | 1.1354 |
| Standard roundtrip val_bpb | 1.1583 |
| Artifact size | 15,851,371 bytes |
| Training steps | 8,945 |
| TTT time | 50 s |
| Eval time | 80 s |
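A quick arithmetic check of the size budget (assuming either reading of "16MB", decimal or binary; the limit's exact definition is not stated in this PR):

```python
ARTIFACT_BYTES = 15_851_371        # from the metrics table
LIMIT_MB = 16_000_000              # 16 MB, decimal (stricter reading)
LIMIT_MIB = 16 * 1024 * 1024       # 16 MiB = 16,777,216 bytes

# The artifact fits under both interpretations of the limit.
assert ARTIFACT_BYTES < LIMIT_MB < LIMIT_MIB
headroom = LIMIT_MB - ARTIFACT_BYTES
print(f"{ARTIFACT_BYTES / 1e6:.2f} MB, headroom {headroom} bytes")
```

Even under the stricter decimal reading there are ~148K bytes of headroom left for the int6 + zstd-22 artifact.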

Note

Uses the PyTorch SDPA fallback (FA3 is not in the RunPod image; see #280). With FA3, expect roughly 600 more training steps and slightly better BPB.
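The fallback logic amounts to a capability probe at startup. A minimal sketch, where the module name `flash_attn_interface` is an assumption about how an FA3 wheel would be packaged, not something confirmed by this PR:

```python
import importlib.util

def pick_attention_backend():
    # Prefer FlashAttention 3 when its kernel package is importable;
    # otherwise fall back to PyTorch's built-in SDPA, as this run did.
    # "flash_attn_interface" is a hypothetical FA3 module name.
    if importlib.util.find_spec("flash_attn_interface") is not None:
        return "fa3"
    return "sdpa"

backend = pick_attention_backend()
```

Probing with `find_spec` rather than a bare `import` avoids paying the import cost (or an ImportError traceback) on images like the RunPod one where the package is absent.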

Test plan

  • Artifact under 16MB (15.85MB)
  • Trains in 600s on 8xH100
  • Eval completes in <600s (132s total)
  • Post-quant roundtrip verified
  • train_gpt.py runs from records/ folder
  • Train log included
  • Multi-seed validation (budget constrained; single seed only)

🤖 Generated with Claude Code

Stacks Partial XSA (last 3 layers), TTT (3-epoch SGD), batch=524K,
and RoPE50K on the PR openai#198 base. 8,945 steps on 8xH100 in 600s.
15.85MB artifact (int6+zstd-22).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: 11L + Partial XSA + TTT + BatchOpt (val_bpb=1.1354)

BPB: 1.1354 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA 84e656ec981f, file records/track_10min_16mb/2026-03-20_XSA_TTT_BatchOpt_AlexIbarra/train_gpt.py):

At line 371 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

`ttt_adapt(args, base_model, device, val_tokens, rank, world_size, log_fn)`: inside `for epoch in range(args.ttt_epochs)`, it calls `loss.backward()` with no prior `no_grad` scoring pass.

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539; see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is visible in the function signature itself: which argument tensor is passed in.
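To make the distinction concrete, here is a torch-free sketch of the score-first-per-chunk discipline the rulings require. All function names here are stand-ins, not the actual train_gpt.py or #1416/#1423 code; `score` stands for a frozen `no_grad` forward pass and `adapt` for `loss.backward()`/`optimizer.step()`:

```python
def score(model_state, chunk):
    # Stand-in for a no_grad forward pass returning this chunk's loss.
    return sum(chunk) / (len(chunk) * model_state["scale"])

def adapt(model_state, chunk):
    # Stand-in for loss.backward()/optimizer.step() on the adapter.
    model_state["scale"] *= 1.01
    return model_state

def ttt_score_first(model_state, token_chunks):
    # Legal pattern: chunk i is scored BEFORE the adapter trains on it,
    # so no reported loss comes from weights that already saw those
    # tokens. The flagged pattern inverts this: train for N epochs,
    # then score only on the final pass.
    losses = []
    for chunk in token_chunks:
        losses.append(score(model_state, chunk))  # frozen score first
        model_state = adapt(model_state, chunk)   # then adapt
    return losses

losses = ttt_score_first({"scale": 1.0}, [[1, 2, 3], [4, 5, 6]])
```

The first chunk's loss (2.0 in this toy) is computed with the untouched model; only subsequent chunks see adapted weights, and even then only weights adapted on *earlier* chunks.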

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=76548 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

