Non-record: 10L Int5-MLP + TTT + Backout Connection (val_bpb=1.1574 on 8xH100 SXM) #366
shivnarainms22 wants to merge 2 commits into openai:main
Community Review — Non-record: 10L Int5-MLP + TTT + Backout Connection (val_bpb=1.1574 on 8xH100 SXM)

BPB: 1.1574 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on

What I found in the code (head SHA ):

At line 864 the pre-quant TTT function takes

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that got PR #1376 (stukenov) closed, a ruling subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=10, vocab=1024, code=59331 B, SMOKE_TEST_PASS.

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based
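For concreteness, a minimal sketch (PyTorch) of the shape the legal pattern takes. Every name below is illustrative and assumed rather than taken from this PR's code, and the model is assumed to map a 1-D token tensor directly to per-position logits.

```python
import math
import torch
import torch.nn.functional as F

def legal_pre_quant_ttt(model, train_slice, val_tokens, block_size, lr=1e-4, epochs=2):
    """Sketch of the legal pattern (the PR #1416 / #1423 lineage): the adapter is
    trained only on a held-out TRAIN slice, then val is scored in a single pass.
    The flagged pattern instead loops multiple epochs over the val tokens and
    scores only on the final pass, so every scored token was already trained on.
    """
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)

    # Adapter training: only ever sees training data, never the val tokens.
    for _ in range(epochs):
        for start in range(0, len(train_slice) - block_size, block_size):
            x = train_slice[start : start + block_size]
            y = train_slice[start + 1 : start + block_size + 1]
            loss = F.cross_entropy(model(x.unsqueeze(0)).squeeze(0), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Scoring: one forward-only pass over val, no updates interleaved.
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for start in range(0, len(val_tokens) - block_size, block_size):
            x = val_tokens[start : start + block_size]
            y = val_tokens[start + 1 : start + block_size + 1]
            logits = model(x.unsqueeze(0)).squeeze(0)
            total_nll += F.cross_entropy(logits, y, reduction="sum").item()
            total_tokens += y.numel()

    # nats -> bits per token; the byte-count normalization used for bpb is omitted here.
    return total_nll / total_tokens / math.log(2)
```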
Summary
Non-record submission combining two techniques on top of thwu1's #1 record base (1.1428 bpb):
- Backout Connection: subtracts `lambda * h_mid` from the final representation before RMSNorm. Adds exactly 1 scalar parameter at zero computational cost (see the first sketch below).
- TTT after the quantization roundtrip. First 2 blocks frozen. Adapts the quantized model to recover from quantization degradation (see the second sketch below).
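A minimal sketch of the backout connection as described above. The module and attribute names are illustrative, not this PR's code; which middle block `h_mid` is taken from, and how the scalar is initialized, are assumptions here.

```python
import torch
import torch.nn as nn

class BackoutFinalNorm(nn.Module):
    """Sketch: subtract a learned multiple of a mid-stack hidden state from the
    final representation just before the model's existing final RMSNorm.
    Adds exactly one scalar parameter (lmbda)."""

    def __init__(self, final_norm: nn.Module):
        super().__init__()
        self.final_norm = final_norm                # the model's existing RMSNorm
        self.lmbda = nn.Parameter(torch.zeros(()))  # single learnable scalar (assumed init 0)

    def forward(self, h_final: torch.Tensor, h_mid: torch.Tensor) -> torch.Tensor:
        # h_final: residual stream after the last transformer block
        # h_mid:   residual stream captured after some middle block
        return self.final_norm(h_final - self.lmbda * h_mid)
```

Initializing the scalar at zero would make the connection a no-op at the start of training; whether this PR does so is not stated. And a sketch of the freezing described in the second item, assuming the transformer blocks are exposed as `model.blocks` (an illustrative attribute name):

```python
def freeze_first_blocks(model: nn.Module, n_frozen: int = 2):
    """Freeze the first n_frozen blocks so that post-quantization TTT only
    updates the later layers; returns the parameters left trainable."""
    for block in model.blocks[:n_frozen]:           # assumes a `blocks` ModuleList
        for p in block.parameters():
            p.requires_grad_(False)
    return [p for p in model.parameters() if p.requires_grad]
```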
Results
Scores reflect undertraining on 1xGPU (~869 steps vs. ~7000+ on 8xH100). All components were verified working end-to-end: training, SWA, mixed int5/int6 quantization, zstd-22 compression, TTT, and sliding-window eval.
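For readers unfamiliar with the quantization side, a minimal sketch of what an int5 weight roundtrip can look like (symmetric, per-tensor). This PR's actual mixed int5/int6 scheme and the zstd-22 packaging are not reproduced here; everything below is an illustrative assumption.

```python
import torch

def int5_roundtrip(w: torch.Tensor, bits: int = 5) -> torch.Tensor:
    """Symmetric per-tensor quantize/dequantize at the given bit width
    (levels in [-15, 15] for bits=5); returns the dequantized weights."""
    qmax = 2 ** (bits - 1) - 1                      # 15 for int5
    scale = w.abs().max().clamp(min=1e-8) / qmax    # one scale for the whole tensor
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q.float() * scale
```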
Architecture
Note
8xH100 SXM results pending compute availability. Will update this PR with full results once obtained.