Record: 11L Tight SWA + VE128 + XSA4 + TTT (3-seed mean val_bpb=1.1299) by kasimte · Pull Request #455 · openai/parameter-golf

kasimte · 2026-03-22T19:44:12Z

Summary

NEW SOTA: 1.1299 val_bpb (3-seed mean), beating current record of 1.1428 by 0.0129 nats
Built on PR #374 by @unnir (v38: Tight SWA + VE128 + XSA4, 1.1246 single-seed) with added test-time training (TTT)
All 3 seeds verified: artifacts < 16MB, training < 10min on 8xH100s

Results

Seed	Sliding Window BPB	Artifact Size
1337	1.1291	15,787,610
7	1.1309	15,659,426
99	1.1296	15,688,657
Mean	1.1299	15,711,898

Key techniques

Tight SWA: 12 checkpoints from last ~600 steps (scale<0.2), zero SWA penalty
Test-Time Training: 3 epochs SGD on already-evaluated val tokens (~51s)
Late QAT: STE int6 fake-quant during warmdown (scale<0.1)
Sliding window eval: stride=64, context=2048 (~100s)
11L, 512-dim, GQA 8/4, XSA last 4 layers, Partial RoPE, Shared VE128, SmearGate+BigramHash

Test plan

3 seeds verified (1337, 7, 99) — all beat SOTA by ≥0.012
All artifacts < 16,000,000 bytes
train_gpt.py compiles (ast.parse passes)
Script runs from within records folder (logs confirm path)
PR only adds files to one new folder

@unnir

Beats SOTA (1.1428) by 0.0129 nats across 3 seeds (1337, 7, 99). Built on PR openai#374 by @unnir with added test-time training.

mohosy · 2026-03-23T00:23:11Z

tight swa with ve128 is clean, the 3 seed mean at 1.1299 is solid. how many checkpoints are you averaging and whats the interval

MatoTeziTanka · 2026-04-11T20:07:01Z

Community Review — Record: 11L Tight SWA + VE128 + XSA4 + TTT (3-seed mean val_bpb=1.1299)

BPB: 1.1299 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA 6b36249645c1, file records/track_10min_16mb/2026-03-22_TightSWA_VE128_TTT/train_gpt.py):

At line 995 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

ttt_adapt(args, base_model, device, val_tokens, rank, world_size, log_fn) — for epoch in range(args.ttt_epochs), loss.backward() without prior no_grad score pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is on the function signature itself — the argument tensor passed in.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=69485 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=69485 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Record: 11L Tight SWA + VE128 + XSA4 + TTT (3-seed mean val_bpb=1.1299)

6b36249

Beats SOTA (1.1428) by 0.0129 nats across 3 seeds (1337, 7, 99). Built on PR openai#374 by @unnir with added test-time training.

notapplica mentioned this pull request Mar 22, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

Christopher-Lee-McClendon mentioned this pull request Mar 23, 2026

Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds) #537

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 11L Tight SWA + VE128 + XSA4 + TTT (3-seed mean val_bpb=1.1299)#455

Record: 11L Tight SWA + VE128 + XSA4 + TTT (3-seed mean val_bpb=1.1299)#455
kasimte wants to merge 1 commit intoopenai:mainfrom
kasimte:submission/TightSWA-VE128-TTT

kasimte commented Mar 22, 2026

Uh oh!

mohosy commented Mar 23, 2026

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

kasimte commented Mar 22, 2026

Summary

Results

Key techniques

Test plan

Uh oh!

mohosy commented Mar 23, 2026

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Community Review — Record: 11L Tight SWA + VE128 + XSA4 + TTT (3-seed mean val_bpb=1.1299)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants