Record: 11L Int6 QAT + SmearGate + OrthoInit + SWA + TTT (val_bpb=1.1478)#150
Record: 11L Int6 QAT + SmearGate + OrthoInit + SWA + TTT (val_bpb=1.1478)#150yahya010 wants to merge 7 commits intoopenai:mainfrom
Conversation
10L int6 STE QAT + BigramHash bigram embedding + zstd-22, MLP 1344, Muon 0.99, sliding window stride=64. 3-seed mean 1.1593 BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
12 techniques stacked: 10L, STE int6 QAT, full int6+zstd-22, MLP 1344, BigramHash, fp16 tied embedding, Muon 0.99 WD=0.02, seq2048, grad clip 0.3, warmdown 3000, sliding window stride=64. 3 seeds: 1.1572, 1.1581, 1.1578 (mean 1.1577, std 0.00047) t=-245.7, p << 0.01 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…478) 11 techniques stacked: 11 layers, MLP 3x, STE int6 QAT, SmearGate, BigramHash(2048), OrthoInit+muP, SWA(8 snapshots), TTT(SGD 3 epochs), NTK-RoPE base=50000, Muon WD=0.04, sliding window stride=64. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FA3 (flash_attn_func) compiles with fullgraph=True, giving 112ms/step vs 116ms with SDPA. 5,352 steps, sliding window 1.1454. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Disabled SWA (was corrupting QAT quant robustness) and unfreeze all blocks during TTT. Quant gap reduced from 0.0103 to 0.0083. 5,506 steps at 109ms/step, sliding window 1.1414.
Drop QAT, use WD=0.04 + SWA for quant robustness (leader's approach). SWA every 50 steps when scale<0.5, averaging 29 snapshots. 5,626 steps at 107ms/step, sliding window 1.1393.
v21: 11L + no-QAT + SWA + TTT + SmearGate + OrthoInit (1.1393 BPB) v24: PR openai#338 SOTA stack (partial RoPE, LN scale, late QAT, XSA4, EMA) run_modal.py: Modal cloud runner for 8xH100 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — Record: 11L Int6 QAT + SmearGate + OrthoInit + SWA + TTT (val_bpb=1.1478)Compliance: NEEDS AUTHOR ACTION — What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with: A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:
Recommendation: Could you run Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step. Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — ModuleNotFoundError: No module named 'flash_attn'. Classification via |
Retraction — this IMPORT_FAIL was a flash_attn stub gap in my runnerSorry @yahya010, this one's on me. My CPU smoke runner already ships a stub for On the real eval image (8×H100 SXM Python 3.10), Your PR is not broken. I'm retracting the IMPORT_FAIL classification and adding a Again — sorry for the noise. |
Community Review — Record: 11L Int6 QAT + SmearGate + OrthoInit + SWA + TTT (val_bpb=1.1478)BPB: 1.1478 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on What I found in the code (head SHA At line 388 the pre-quant TTT function takes Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster. Contrast with the legal Pre-Quant TTT pattern (e.g. PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=64336 B, SMOKE_TEST_PASS Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=64336 B, SMOKE_TEST_PASS. Classification via deterministic AST-based |
|
Correction to the review above — I cited "PR #1416 / PR #1423 lineage" as the "legal Pre-Quant TTT pattern" that trains on a held-out slice of training data with score-first-per-chunk discipline. That citation is wrong. The CLOSE verdict on your PR still stands (your What I got wrong: I pointed at #1416 and #1423 as the legal contrast pattern. At their current heads, both actually have the same pattern your PR has — The actual legal reference is PR #1413 (dexhunter) — the current leaderboard entry at val_bpb 1.0828 (SP8192 + QK-Gain 5 + Legal Score-First TTT). I decompressed its lzma shim and verified the per-chunk pattern: for each chunk Practical implication for a resubmission: to match #1413's legal shape, the TTT function would:
Apologies for the wrong citation. The verdict is unchanged — CLOSE under the #1376 ruling — only the legal reference I pointed you at needed fixing. Correction by @MatoTeziTanka — The Agora. The CLOSE verdict stands; only the legal reference citation was wrong. |
Summary
Full-stack submission: val_bpb = 1.1478 (seed 1337, sliding window stride=64)
12 techniques stacked:
Results
Timing Budget
Requires:
pip install zstandard