
Record: 11L + Partial XSA + TTT + BatchOpt (val_bpb=1.1354) #290

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:record-xsa-ttt-submission

Conversation

@ibarrajo

Summary

  • val_bpb: 1.1354 (sliding window, stride=64)
  • 15.85 MB artifact (int6 + zstd-22, under 16MB)
  • 8xH100 SXM, 8,945 steps in 600s + 132s eval
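The headline number is a sliding-window val_bpb with stride=64. As a torch-free illustration (an assumption about this PR's eval scheme, not its actual code), one common version slides the context window forward by `stride` tokens and scores only the newest `stride` tokens of each window:

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    # Hypothetical sketch of stride-64 sliding-window evaluation:
    # each span scores `stride` fresh tokens with up to `window`
    # preceding tokens available as context.
    spans = []
    pos = 0
    while pos + stride <= n_tokens:
        ctx_start = max(0, pos + stride - window)
        # (context start, score start, score end)
        spans.append((ctx_start, pos, pos + stride))
        pos += stride
    return spans

spans = sliding_eval_spans(256, window=128, stride=64)
```

With small toy sizes this yields four spans, the last being `(128, 192, 256)`; every token after the first span is predicted with near-full context, which is why sliding bpb (1.1354) beats the standard roundtrip bpb (1.1583).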

Approach

Four improvements stacked on the PR #198 base:

  1. Partial XSA (last 3 layers): efficient GQA-aware self-attention debiasing (PR #265, val_bpb 1.1307; arXiv:2603.09078)
  2. TTT (3-epoch full-model SGD, first 2 blocks frozen): eval-time adaptation (PR #254, val_bpb 1.1303)
  3. Batch=524K: 22% more gradient updates (finding from PR #236, val_bpb 1.1400)
  4. RoPE base 50K: extended positional encoding (PR #206, mean val_bpb 1.1507)
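The four stacked changes can be summarized as a settings sketch. These key names are illustrative only, not the actual train_gpt.py flags:

```python
# Hypothetical configuration summarizing the four stacked changes;
# actual flag names and layer indexing in train_gpt.py may differ.
config = {
    "n_layers": 11,
    "xsa_layers": [8, 9, 10],     # Partial XSA on the last 3 of 11 layers
    "ttt_epochs": 3,              # eval-time SGD over 3 epochs
    "ttt_freeze_blocks": 2,       # first 2 blocks frozen during TTT
    "batch_tokens": 524_288,      # 524K-token batches (~22% more updates)
    "rope_base": 50_000,          # extended RoPE base frequency
}
assert len(config["xsa_layers"]) == 3
```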

Key Metrics

| Metric | Value |
| --- | --- |
| Sliding val_bpb (stride=64) | 1.1354 |
| Standard roundtrip val_bpb | 1.1583 |
| Artifact size | 15,851,371 bytes |
| Training steps | 8,945 |
| TTT time | 50 s |
| Eval time | 80 s |
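A quick arithmetic check of the size budget (assuming either reading of "16MB", decimal or binary; the limit's exact definition is not stated in this PR):

```python
ARTIFACT_BYTES = 15_851_371        # from the metrics table
LIMIT_MB = 16_000_000              # 16 MB, decimal (stricter reading)
LIMIT_MIB = 16 * 1024 * 1024       # 16 MiB = 16,777,216 bytes

# The artifact fits under both interpretations of the limit.
assert ARTIFACT_BYTES < LIMIT_MB < LIMIT_MIB
headroom = LIMIT_MB - ARTIFACT_BYTES
print(f"{ARTIFACT_BYTES / 1e6:.2f} MB, headroom {headroom} bytes")
```

Even under the stricter decimal reading there are ~148K bytes of headroom left for the int6 + zstd-22 artifact.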

Note

Uses the PyTorch SDPA fallback (FA3 is not in the RunPod image; see #280). With FA3, expect roughly 600 more training steps and slightly better BPB.
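The fallback logic amounts to a capability probe at startup. A minimal sketch, where the module name `flash_attn_interface` is an assumption about how an FA3 wheel would be packaged, not something confirmed by this PR:

```python
import importlib.util

def pick_attention_backend():
    # Prefer FlashAttention 3 when its kernel package is importable;
    # otherwise fall back to PyTorch's built-in SDPA, as this run did.
    # "flash_attn_interface" is a hypothetical FA3 module name.
    if importlib.util.find_spec("flash_attn_interface") is not None:
        return "fa3"
    return "sdpa"

backend = pick_attention_backend()
```

Probing with `find_spec` rather than a bare `import` avoids paying the import cost (or an ImportError traceback) on images like the RunPod one where the package is absent.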

Test plan

  • Artifact under 16MB (15.85MB)
  • Trains in 600s on 8xH100
  • Eval completes in <600s (132s total)
  • Post-quant roundtrip verified
  • train_gpt.py runs from records/ folder
  • Train log included
  • Multi-seed validation (budget constrained; single seed only)

🤖 Generated with Claude Code

Stacks Partial XSA (last 3 layers), TTT (3-epoch SGD), batch=524K,
and RoPE50K on the PR openai#198 base. 8,945 steps on 8xH100 in 600s.
15.85MB artifact (int6+zstd-22).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: 11L + Partial XSA + TTT + BatchOpt (val_bpb=1.1354)

BPB: 1.1354 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA 84e656ec981f, file records/track_10min_16mb/2026-03-20_XSA_TTT_BatchOpt_AlexIbarra/train_gpt.py):

At line 371 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

`ttt_adapt(args, base_model, device, val_tokens, rank, world_size, log_fn)`: inside `for epoch in range(args.ttt_epochs)`, it calls `loss.backward()` with no prior `no_grad` scoring pass.

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539; see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is visible in the function signature itself: which argument tensor is passed in.
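To make the distinction concrete, here is a torch-free sketch of the score-first-per-chunk discipline the rulings require. All function names here are stand-ins, not the actual train_gpt.py or #1416/#1423 code; `score` stands for a frozen `no_grad` forward pass and `adapt` for `loss.backward()`/`optimizer.step()`:

```python
def score(model_state, chunk):
    # Stand-in for a no_grad forward pass returning this chunk's loss.
    return sum(chunk) / (len(chunk) * model_state["scale"])

def adapt(model_state, chunk):
    # Stand-in for loss.backward()/optimizer.step() on the adapter.
    model_state["scale"] *= 1.01
    return model_state

def ttt_score_first(model_state, token_chunks):
    # Legal pattern: chunk i is scored BEFORE the adapter trains on it,
    # so no reported loss comes from weights that already saw those
    # tokens. The flagged pattern inverts this: train for N epochs,
    # then score only on the final pass.
    losses = []
    for chunk in token_chunks:
        losses.append(score(model_state, chunk))  # frozen score first
        model_state = adapt(model_state, chunk)   # then adapt
    return losses

losses = ttt_score_first({"scale": 1.0}, [[1, 2, 3], [4, 5, 6]])
```

The first chunk's loss (2.0 in this toy) is computed with the untouched model; only subsequent chunks see adapted weights, and even then only weights adapted on *earlier* chunks.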

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=76548 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

