
Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)#450

Open
zachgoldfine44 wants to merge 1 commit into openai:main from zachgoldfine44:submission/12L-catalytic-bigbigram-swa

Conversation

@zachgoldfine44

12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT

val_bpb: 1.14662 (mean of 3 seeds, sliding window stride=64, post int6+zstd quantization roundtrip)

3-Seed Results

| Seed | val_bpb | artifact_bytes | valid |
|------|---------|----------------|-------|
| 1337 | 1.14749 | 14,014,540 | yes |
| 42 | 1.14575 | 14,104,510 | yes |
| 7 | 1.14662 | 14,385,363 | yes |
| Mean | 1.14662 | | |
| Std | 0.00071 | | |
  • Training: 600s, ~5,370 steps, ~112 ms/step on 8xH100 SXM
  • Eval: ~120s (20s roundtrip + 98s sliding window stride=64)
  • No TTT

Key Innovations

  1. Catalytic Residual Connections (novel): Replace x + f(x) with x + c * f(x), where c is a learned per-dimension vector. -0.024 bpb at zero computational overhead (~11K extra params).

  2. 12 Layers: The standard stack uses 10-11 layers, leaving significant parameter-budget headroom. 12L is the depth sweet spot (-0.023 bpb vs 11L).

  3. BigramHash(10240): Larger bigram vocabulary (-0.070 bpb vs BigramHash(2048)).

  4. Late QAT (threshold=0.25): Straight-through-estimator (STE) int6 quantization applied in the final portion of training.

  5. SWA: Stochastic weight averaging from last 20% of warmdown.
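The catalytic residual in item 1 is simple enough to sketch directly. This is a minimal PyTorch illustration of the described idea, not the PR's actual `train_gpt.py` code; the class and variable names are hypothetical, and initializing the scale at 1.0 (so training starts from the standard residual) is an assumption.

```python
import torch
import torch.nn as nn

class CatalyticResidual(nn.Module):
    """Residual connection x + c * f(x) with a learned per-dimension scale c.

    `f` is any sublayer (attention or MLP); `dim` is the model width.
    Assumption: c is initialized to ones, so the block starts out
    identical to a standard residual x + f(x).
    """
    def __init__(self, f: nn.Module, dim: int):
        super().__init__()
        self.f = f
        self.c = nn.Parameter(torch.ones(dim))  # per-dimension "catalyst"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.c * self.f(x)

# usage: wrap an MLP sublayer of width 512
mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
block = CatalyticResidual(mlp, dim=512)
y = block(torch.randn(4, 16, 512))
```

The parameter cost is one `dim`-sized vector per wrapped sublayer, which is how the total extra-parameter count stays in the low tens of thousands.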

Run Command

pip install sentencepiece zstandard
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
torchrun --standalone --nproc_per_node=8 train_gpt.py

All parameters set as defaults. No env vars needed.
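For reference, the stochastic weight averaging in item 5 can be sketched with `torch.optim.swa_utils`. This is a hedged sketch under stated assumptions: `model`, `total_steps`, and the commented-out `train_step` are placeholders, not names from the PR's `train_gpt.py`, and the 80% cutoff here stands in for "last 20% of warmdown".

```python
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(8, 8)        # placeholder for the real model
swa_model = AveragedModel(model)     # maintains a running average of weights
total_steps = 100
swa_start = int(0.8 * total_steps)   # begin averaging in the last 20% of steps

for step in range(total_steps):
    # train_step(model)              # regular optimizer update would go here
    if step >= swa_start:
        swa_model.update_parameters(model)  # fold current weights into the average
```

At eval time the averaged weights in `swa_model.module` would be the ones quantized and shipped.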

Built on the PR #180 standard stack by @thwu1.

Full logs for all 3 seeds included.
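The Late QAT step (item 4 above) presumably relies on fake quantization with a straight-through estimator. A minimal sketch of that pattern, assuming symmetric per-tensor int6 scaling; the PR's actual scheme (per-channel scales, clipping, how threshold=0.25 triggers) may differ:

```python
import torch

class Int6STE(torch.autograd.Function):
    """Fake int6 quantization with a straight-through estimator.

    Forward: symmetric per-tensor quantization to 6-bit levels (-32..31)
    followed by immediate dequantization, so the model trains against
    the quantization error. Backward: gradients pass through unchanged.
    """
    @staticmethod
    def forward(ctx, w: torch.Tensor) -> torch.Tensor:
        scale = w.abs().max() / 31.0
        q = torch.clamp(torch.round(w / scale), -32, 31)
        return q * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through: identity gradient

w = torch.randn(64, 64, requires_grad=True)
w_q = Int6STE.apply(w)
w_q.sum().backward()
```

Because the backward pass is the identity, the optimizer keeps updating full-precision weights while the loss sees quantized ones, which is what makes the final int6+zstd roundtrip nearly lossless.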

…T (val_bpb=1.1466, mean 3 seeds)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joshuaswarren added a commit to joshuaswarren/parameter-golf that referenced this pull request Mar 22, 2026
…+ Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690)

First submission combining 6 independently-proven architecture improvements:
- Catalytic Residuals (PR openai#450, -0.024 bpb)
- Value Residual/ResFormer (PR openai#413, -0.015 bpb)
- Gated Attention (PR openai#413, -0.003 bpb)
- BigramHash(10240) (PR openai#450, -0.070 bpb vs 2048)
- 12 Layers (-0.023 bpb vs 11L)
- 3x MLP

8xH100 SXM: 6981 steps, 85.78ms/step, 15.3MB artifact (int6+zstd)
@mohosy

mohosy commented Mar 23, 2026

catalytic residuals are lowkey genius, learning a per-dim scale on the residual for basically free params. 12 layers is interesting too, most people stopped at 11. did you have any issues with artifact size going that deep?

@MatoTeziTanka

Community Review — Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)

BPB: 1.1466 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA b3547e3f9d3c, file records/track_10min_16mb/2026-03-22_12L_CatalyticResiduals_BigBigram/train_gpt.py):

At line 1189 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

ttt_adapt(args, base_model, device, val_tokens, rank, world_size, log_fn) — for epoch in range(args.ttt_epochs), loss.backward() without prior no_grad score pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is visible in the function signature itself: the argument tensor passed in.
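The score-first-per-chunk discipline described above can be sketched as follows. Hypothetical names throughout (`score_first_ttt` is not the PR's actual `ttt_adapt` signature); the point is only the ordering: each chunk is scored under `no_grad` before the adapter ever trains on it.

```python
import torch

def score_first_ttt(model, loss_fn, chunks, optimizer):
    """Score each chunk BEFORE adapting on it, so no token's reported
    score benefits from the model having already trained on that token."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        with torch.no_grad():                 # 1) score first, frozen model
            total_loss += loss_fn(model, chunk).item() * chunk.numel()
            total_tokens += chunk.numel()
        loss = loss_fn(model, chunk)          # 2) only then adapt on it
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return total_loss / total_tokens

# toy usage
model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
chunks = [torch.randn(8, 4) for _ in range(3)]
loss_fn = lambda m, c: m(c).pow(2).mean()
avg_score = score_first_ttt(model, loss_fn, chunks, opt)
```

A multi-epoch loop that only scores on the final pass inverts this ordering, which is exactly the invalid pattern flagged here.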

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=12, vocab=1024, code=80952 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
