Non-record: 10L Int5-MLP + TTT + Backout Connection (val_bpb=1.1574 on 8xH100 SXM) #366
shivnarainms22 wants to merge 2 commits into openai:main
Community Review — Non-record: 10L Int5-MLP + TTT + Backout Connection (val_bpb=1.1574 on 8xH100 SXM)

BPB: 1.1574 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on

What I found in the code (head SHA ):

At line 864 the pre-quant TTT function takes

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that got PR #1376 (stukenov) closed, a ruling subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=10, vocab=1024, code=59331 B, SMOKE_TEST_PASS.

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based
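For concreteness, a minimal sketch (PyTorch) of the shape the legal pattern takes. Every name below is illustrative and assumed rather than taken from this PR's code, and the model is assumed to map a 1-D token tensor directly to per-position logits.

```python
import math
import torch
import torch.nn.functional as F

def legal_pre_quant_ttt(model, train_slice, val_tokens, block_size, lr=1e-4, epochs=2):
    """Sketch of the legal pattern (the PR #1416 / #1423 lineage): the adapter is
    trained only on a held-out TRAIN slice, then val is scored in a single pass.
    The flagged pattern instead loops multiple epochs over the val tokens and
    scores only on the final pass, so every scored token was already trained on.
    """
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)

    # Adapter training: only ever sees training data, never the val tokens.
    for _ in range(epochs):
        for start in range(0, len(train_slice) - block_size, block_size):
            x = train_slice[start : start + block_size]
            y = train_slice[start + 1 : start + block_size + 1]
            loss = F.cross_entropy(model(x.unsqueeze(0)).squeeze(0), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Scoring: one forward-only pass over val, no updates interleaved.
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for start in range(0, len(val_tokens) - block_size, block_size):
            x = val_tokens[start : start + block_size]
            y = val_tokens[start + 1 : start + block_size + 1]
            logits = model(x.unsqueeze(0)).squeeze(0)
            total_nll += F.cross_entropy(logits, y, reduction="sum").item()
            total_tokens += y.numel()

    # nats -> bits per token; the byte-count normalization used for bpb is omitted here.
    return total_nll / total_tokens / math.log(2)
```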
Summary
Non-record submission combining two techniques on top of thwu1's #1 record base (1.1428 bpb):
- Backout Connection: subtracts `lambda * h_mid` from the final representation before RMSNorm. Adds exactly 1 scalar parameter at zero computational cost (see the first sketch below).
- TTT after the quantization roundtrip. First 2 blocks frozen. Adapts the quantized model to recover from quantization degradation (see the second sketch below).
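A minimal sketch of the backout connection as described above. The module and attribute names are illustrative, not this PR's code; which middle block `h_mid` is taken from, and how the scalar is initialized, are assumptions here.

```python
import torch
import torch.nn as nn

class BackoutFinalNorm(nn.Module):
    """Sketch: subtract a learned multiple of a mid-stack hidden state from the
    final representation just before the model's existing final RMSNorm.
    Adds exactly one scalar parameter (lmbda)."""

    def __init__(self, final_norm: nn.Module):
        super().__init__()
        self.final_norm = final_norm                # the model's existing RMSNorm
        self.lmbda = nn.Parameter(torch.zeros(()))  # single learnable scalar (assumed init 0)

    def forward(self, h_final: torch.Tensor, h_mid: torch.Tensor) -> torch.Tensor:
        # h_final: residual stream after the last transformer block
        # h_mid:   residual stream captured after some middle block
        return self.final_norm(h_final - self.lmbda * h_mid)
```

Initializing the scalar at zero would make the connection a no-op at the start of training; whether this PR does so is not stated. And a sketch of the freezing described in the second item, assuming the transformer blocks are exposed as `model.blocks` (an illustrative attribute name):

```python
def freeze_first_blocks(model: nn.Module, n_frozen: int = 2):
    """Freeze the first n_frozen blocks so that post-quantization TTT only
    updates the later layers; returns the parameters left trainable."""
    for block in model.blocks[:n_frozen]:           # assumes a `blocks` ModuleList
        for p in block.parameters():
            p.requires_grad_(False)
    return [p for p in model.parameters() if p.requires_grad]
```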
Results
Scores reflect undertraining on 1xGPU (~869 steps vs. ~7000+ on 8xH100). All components were verified working end-to-end: training, SWA, mixed int5/int6 quantization, zstd-22 compression, TTT, and sliding-window eval.
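For readers unfamiliar with the quantization side, a minimal sketch of what an int5 weight roundtrip can look like (symmetric, per-tensor). This PR's actual mixed int5/int6 scheme and the zstd-22 packaging are not reproduced here; everything below is an illustrative assumption.

```python
import torch

def int5_roundtrip(w: torch.Tensor, bits: int = 5) -> torch.Tensor:
    """Symmetric per-tensor quantize/dequantize at the given bit width
    (levels in [-15, 15] for bits=5); returns the dequantized weights."""
    qmax = 2 ** (bits - 1) - 1                      # 15 for int5
    scale = w.abs().max().clamp(min=1e-8) / qmax    # one scale for the whole tensor
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q.float() * scale
```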
Architecture
Note
8xH100 SXM results pending compute availability. Will update this PR with full results once obtained.