
Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)#450

Open
zachgoldfine44 wants to merge 1 commit into openai:main from zachgoldfine44:submission/12L-catalytic-bigbigram-swa

Conversation

@zachgoldfine44

12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT

val_bpb: 1.14662 (mean of 3 seeds, sliding window stride=64, post int6+zstd quantization roundtrip)

3-Seed Results

| Seed | val_bpb | artifact_bytes | valid |
|------|---------|----------------|-------|
| 1337 | 1.14749 | 14,014,540 | yes |
| 42 | 1.14575 | 14,104,510 | yes |
| 7 | 1.14662 | 14,385,363 | yes |
| Mean | 1.14662 | | |
| Std | 0.00071 | | |
  • Training: 600s, ~5,370 steps, ~112 ms/step on 8xH100 SXM
  • Eval: ~120s (20s roundtrip + 98s sliding window stride=64)
  • No TTT

Key Innovations

  1. Catalytic Residual Connections (novel): Replace x + f(x) with x + c * f(x), where c is a learned per-dimension vector. -0.024 bpb at zero computational overhead (~11K extra params).

  2. 12 Layers: The standard stack uses 10-11 layers, leaving significant parameter-budget headroom. 12L is the depth sweet spot (-0.023 bpb vs 11L).

  3. BigramHash(10240): Larger bigram vocabulary (-0.070 bpb vs BigramHash(2048)).

  4. Late QAT (threshold=0.25): Straight-through-estimator (STE) int6 quantization applied in the final portion of training.

  5. SWA: Stochastic weight averaging from last 20% of warmdown.
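The catalytic residual in item 1 is simple enough to sketch directly. This is a minimal PyTorch illustration of the described idea, not the PR's actual `train_gpt.py` code; the class and variable names are hypothetical, and initializing the scale at 1.0 (so training starts from the standard residual) is an assumption.

```python
import torch
import torch.nn as nn

class CatalyticResidual(nn.Module):
    """Residual connection x + c * f(x) with a learned per-dimension scale c.

    `f` is any sublayer (attention or MLP); `dim` is the model width.
    Assumption: c is initialized to ones, so the block starts out
    identical to a standard residual x + f(x).
    """
    def __init__(self, f: nn.Module, dim: int):
        super().__init__()
        self.f = f
        self.c = nn.Parameter(torch.ones(dim))  # per-dimension "catalyst"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.c * self.f(x)

# usage: wrap an MLP sublayer of width 512
mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
block = CatalyticResidual(mlp, dim=512)
y = block(torch.randn(4, 16, 512))
```

The parameter cost is one `dim`-sized vector per wrapped sublayer, which is how the total extra-parameter count stays in the low tens of thousands.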

Run Command

pip install sentencepiece zstandard
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
torchrun --standalone --nproc_per_node=8 train_gpt.py

All parameters set as defaults. No env vars needed.
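For reference, the stochastic weight averaging in item 5 can be sketched with `torch.optim.swa_utils`. This is a hedged sketch under stated assumptions: `model`, `total_steps`, and the commented-out `train_step` are placeholders, not names from the PR's `train_gpt.py`, and the 80% cutoff here stands in for "last 20% of warmdown".

```python
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(8, 8)        # placeholder for the real model
swa_model = AveragedModel(model)     # maintains a running average of weights
total_steps = 100
swa_start = int(0.8 * total_steps)   # begin averaging in the last 20% of steps

for step in range(total_steps):
    # train_step(model)              # regular optimizer update would go here
    if step >= swa_start:
        swa_model.update_parameters(model)  # fold current weights into the average
```

At eval time the averaged weights in `swa_model.module` would be the ones quantized and shipped.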

Built on the PR #180 standard stack by @thwu1.

Full logs for all 3 seeds included.
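The Late QAT step (item 4 above) presumably relies on fake quantization with a straight-through estimator. A minimal sketch of that pattern, assuming symmetric per-tensor int6 scaling; the PR's actual scheme (per-channel scales, clipping, how threshold=0.25 triggers) may differ:

```python
import torch

class Int6STE(torch.autograd.Function):
    """Fake int6 quantization with a straight-through estimator.

    Forward: symmetric per-tensor quantization to 6-bit levels (-32..31)
    followed by immediate dequantization, so the model trains against
    the quantization error. Backward: gradients pass through unchanged.
    """
    @staticmethod
    def forward(ctx, w: torch.Tensor) -> torch.Tensor:
        scale = w.abs().max() / 31.0
        q = torch.clamp(torch.round(w / scale), -32, 31)
        return q * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through: identity gradient

w = torch.randn(64, 64, requires_grad=True)
w_q = Int6STE.apply(w)
w_q.sum().backward()
```

Because the backward pass is the identity, the optimizer keeps updating full-precision weights while the loss sees quantized ones, which is what makes the final int6+zstd roundtrip nearly lossless.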

…T (val_bpb=1.1466, mean 3 seeds)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joshuaswarren added a commit to joshuaswarren/parameter-golf that referenced this pull request Mar 22, 2026
…+ Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690)

First submission combining 6 independently-proven architecture improvements:
- Catalytic Residuals (PR openai#450, -0.024 bpb)
- Value Residual/ResFormer (PR openai#413, -0.015 bpb)
- Gated Attention (PR openai#413, -0.003 bpb)
- BigramHash(10240) (PR openai#450, -0.070 bpb vs 2048)
- 12 Layers (-0.023 bpb vs 11L)
- 3x MLP

8xH100 SXM: 6981 steps, 85.78ms/step, 15.3MB artifact (int6+zstd)
@mohosy

mohosy commented Mar 23, 2026

catalytic residuals are lowkey genius, learning a per-dim scale on the residual for basically free params. 12 layers is interesting too, most people stopped at 11. did you have any issues with artifact size going that deep?

@MatoTeziTanka

Community Review — Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)

BPB: 1.1466 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA b3547e3f9d3c, file records/track_10min_16mb/2026-03-22_12L_CatalyticResiduals_BigBigram/train_gpt.py):

At line 1189 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

ttt_adapt(args, base_model, device, val_tokens, rank, world_size, log_fn) — for epoch in range(args.ttt_epochs), loss.backward() without prior no_grad score pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is visible in the function signature itself: the argument tensor passed in.
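The score-first-per-chunk discipline described above can be sketched as follows. Hypothetical names throughout (`score_first_ttt` is not the PR's actual `ttt_adapt` signature); the point is only the ordering: each chunk is scored under `no_grad` before the adapter ever trains on it.

```python
import torch

def score_first_ttt(model, loss_fn, chunks, optimizer):
    """Score each chunk BEFORE adapting on it, so no token's reported
    score benefits from the model having already trained on that token."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        with torch.no_grad():                 # 1) score first, frozen model
            total_loss += loss_fn(model, chunk).item() * chunk.numel()
            total_tokens += chunk.numel()
        loss = loss_fn(model, chunk)          # 2) only then adapt on it
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return total_loss / total_tokens

# toy usage
model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
chunks = [torch.randn(8, 4) for _ in range(3)]
loss_fn = lambda m, c: m(c).pow(2).mean()
avg_score = score_first_ttt(model, loss_fn, chunks, opt)
```

A multi-epoch loop that only scores on the final pass inverts this ordering, which is exactly the invalid pattern flagged here.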

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=12, vocab=1024, code=80952 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
