Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)#450
…T (val_bpb=1.1466, mean 3 seeds) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…+ Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690)

First submission combining 6 independently-proven architecture improvements:
- Catalytic Residuals (PR openai#450, -0.024 bpb)
- Value Residual/ResFormer (PR openai#413, -0.015 bpb)
- Gated Attention (PR openai#413, -0.003 bpb)
- BigramHash(10240) (PR openai#450, -0.070 bpb vs 2048)
- 12 Layers (-0.023 bpb vs 11L)
- 3x MLP

8xH100 SXM: 6981 steps, 85.78ms/step, 15.3MB artifact (int6+zstd)
catalytic residuals are lowkey genius, learning a per-dim scale on the residual for basically free params. 12 layers is interesting too, most people stopped at 11. did you have any issues with artifact size going that deep?
Community Review — Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)

BPB: 1.1466 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on …

What I found in the code (head SHA …): at line 1189 the pre-quant TTT function takes …

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not …).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=12, vocab=1024, code=80952 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of …

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based …
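The review's scoring-order rule can be illustrated with a toy adapter. Everything here is illustrative, not the repo's TTT code: a logistic head stands in for the adapter, and the task is synthetic. The legal loop scores each token before adapting on it; the flagged multi-epoch pattern trains on the whole stream first and scores only on the final pass, so every score benefits from having already trained on that token.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim = 64, 8
tokens = rng.normal(size=(T, dim))
true_w = rng.normal(size=dim)
targets = (tokens @ true_w > 0).astype(float)  # toy prediction targets

def token_loss(x, y, w):
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def sgd_step(x, y, w, lr=0.1):
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return w - lr * (p - y) * x

# Legal TTT: each token is scored BEFORE the adapter trains on it.
w = np.zeros(dim)
legal_total = 0.0
for t in range(T):
    legal_total += token_loss(tokens[t], targets[t], w)  # score first
    w = sgd_step(tokens[t], targets[t], w)               # then adapt

# Flagged pattern: multiple epochs over the eval stream, scoring only on the
# final pass, so every score comes from a model already trained on that token.
w = np.zeros(dim)
for _ in range(3):
    for t in range(T):
        w = sgd_step(tokens[t], targets[t], w)
flagged_total = sum(token_loss(tokens[t], targets[t], w) for t in range(T))
```

On this toy stream the flagged pattern reports a lower total loss for the same data and adapter, which is exactly the leakage the compliance rule forbids.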
12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT
val_bpb: 1.14662 (mean of 3 seeds, sliding window stride=64, post int6+zstd quantization roundtrip)
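Since the headline number is measured after an int6+zstd quantization roundtrip, here is a hedged sketch of what such a roundtrip can look like. Symmetric per-tensor scaling and the function names are assumptions, not this PR's actual pipeline, and stdlib `zlib` stands in for zstd.

```python
import numpy as np
import zlib  # stand-in for zstd so the sketch is stdlib-only

def quant_int6(w):
    # Symmetric per-tensor int6: integer levels in [-31, 31].
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequant(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
q, s = quant_int6(w)
w_hat = dequant(q, s)

# Round-to-nearest bounds the roundtrip error by half a quantization step.
assert np.abs(w - w_hat).max() <= s / 2 + 1e-7

# int6 codes stored as int8 compress well: the top two bits of every byte
# are pure sign-extension, and the level distribution is far from uniform.
blob = zlib.compress(q.tobytes())
```

Evaluating bpb on the dequantized weights (rather than the float checkpoint) is what makes the reported number an honest post-artifact measurement.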
3-Seed Results
Key Innovations
Catalytic Residual Connections (novel): Replace `x + f(x)` with `x + c * f(x)`, where `c` is a learned per-dimension vector. -0.024 bpb at zero computational overhead (~11K extra params).
12 Layers: Standard stack uses 10-11 layers, leaving significant budget headroom. 12L is the depth sweet spot (-0.023 bpb vs 11L).
BigramHash(10240): Larger bigram vocabulary (-0.070 bpb vs BigramHash(2048)).
Late QAT (threshold=0.25): STE int6 quantization in the final portion of training.
SWA: Stochastic weight averaging from last 20% of warmdown.
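The catalytic residual above is small enough to sketch in full. This is a framework-neutral NumPy sketch, not the repo's implementation: `f` and its weight are placeholders for any sublayer, and initializing `c` to ones (so training starts from the plain residual) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

def f(x, W):
    # Placeholder sublayer; stands in for an attention or MLP block.
    return np.tanh(x @ W)

W = rng.normal(scale=0.02, size=(dim, dim))
x = rng.normal(size=(4, dim))  # (batch, dim) activations

# Catalytic residual: a learned per-dimension scale c on the sublayer output.
# Assumed init to ones, so at init it equals the standard residual.
c = np.ones(dim)

y_plain = x + f(x, W)      # standard residual
y_cat = x + c * f(x, W)    # catalytic residual (c broadcasts over the batch)
```

The cost is one extra `dim`-sized vector per block (12 blocks x ~dim params is where the ~11K figure comes from), and the elementwise multiply is negligible next to the matmuls.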
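The BigramHash component can be illustrated with a toy bucketing function. The mixing constants, names, and sequence here are hypothetical, not the repo's actual hash; the point is only that consecutive token pairs are reduced to ids in a fixed-size table, which then indexes an extra embedding added to the model input.

```python
VOCAB = 10240  # bigram table size used in this submission

def bigram_bucket(prev_tok, tok, table_size=VOCAB):
    # Illustrative integer mixing hash: combine the two token ids,
    # scramble the bits, and reduce modulo the table size.
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 2654435761) & 0xFFFFFFFF
    return h % table_size

# Per-position bucket ids for a toy token sequence; an embedding table of
# shape (table_size, dim) would be indexed with these.
seq = [5, 17, 17, 901]
buckets = [bigram_bucket(a, b) for a, b in zip(seq[:-1], seq[1:])]
```

A larger table (10240 vs 2048) means fewer hash collisions between distinct bigrams, which is the plausible mechanism behind the -0.070 bpb gap reported above.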
Run Command
All parameters set as defaults. No env vars needed.
Built on PR #180 standard stack by @thwu1.
Full logs for all 3 seeds included.