Record: Dynamic Eval + TTT on SOTA Pipeline (val_bpb=1.1364) #397
translatingthename wants to merge 1 commit into openai:main from
Community Review — Record: Dynamic Eval + TTT on SOTA Pipeline (val_bpb=1.1364)
BPB: 1.1364 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on the validation data.
What I found in the code (head SHA): at line 1198 the pre-quant TTT function takes the validation data as its training input. Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern of closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.
Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not the validation data).
CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=77946 B, SMOKE_TEST_PASS.
Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of the validation data would be a reasonable candidate for re-review.
Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based analysis.
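The ordering distinction the review turns on can be sketched with a toy count-based model. All names below are illustrative, not the PR's actual code: the flagged pattern adapts on the validation tokens before scoring them, while the legal pattern adapts only on a held-out training slice and scores validation tokens without prior exposure.

```python
import math
from collections import Counter

def bpb(model: Counter, total: int, stream: bytes, alpha: float = 1.0) -> float:
    """Bits per byte of `stream` under a Laplace-smoothed byte-frequency model."""
    bits = sum(-math.log2((model[b] + alpha) / (total + alpha * 256)) for b in stream)
    return bits / len(stream)

def fit(model: Counter, stream: bytes) -> None:
    model.update(stream)  # "training" = accumulating byte counts

# FLAGGED pattern (as described in the review): the adapter trains
# multi-epoch on the very tokens it will later score.
def ttt_invalid(train: bytes, val: bytes, epochs: int = 3) -> float:
    m = Counter()
    fit(m, train)
    for _ in range(epochs):
        fit(m, val)  # sees validation tokens before scoring them
    return bpb(m, sum(m.values()), val)

# LEGAL pattern: the adapter trains only on a held-out slice of
# training data; validation tokens are never trained on.
def ttt_legal(train: bytes, held_out: bytes, val: bytes) -> float:
    m = Counter()
    fit(m, train)
    fit(m, held_out)
    return bpb(m, sum(m.values()), val)
```

With a validation stream disjoint from the training bytes, the flagged ordering reports an artificially low bpb, which is exactly why the ruling hinges on whether each token is scored before the adapter trains on it.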
Summary
Dynamic evaluation (Krause et al., ICML 2018) applied to the SOTA pipeline without modifying training. The model takes periodic SGD gradient steps during sliding window scoring, adapting to local text distribution. 2.0% consistent bpb improvement at zero artifact cost.
3-seed mean: 1.1371 (seeds 42, 7, 2024). Best seed: 1.1364. Merged SOTA: 1.1428.
Results (3-seed, 8xH100 SXM, SDPA backend)
Novel Contribution: Dynamic Evaluation
After TTT adaptation, we score the validation stream using sliding windows (stride=64). Between batches of scored windows, we take an SGD gradient step (lr=0.001) on the model weights. The model adapts to the local distribution as it scores. TTT adapts weights before scoring; dynamic eval adapts during scoring. The two are complementary.
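The score-then-adapt loop above can be illustrated with a minimal prequential sketch, using an online byte-count model as a stand-in for SGD steps on transformer weights (the function and its parameters are hypothetical, not the pipeline's actual code). The key invariant is the same: every symbol is scored under the current model before the model adapts to it.

```python
import math
from collections import Counter

def dynamic_eval_bpb(stream: bytes, alpha: float = 1.0) -> float:
    """Score `stream` with an online Laplace-smoothed byte-frequency
    model, updating counts only AFTER each byte has been scored."""
    counts: Counter = Counter()
    total = 0
    bits = 0.0
    for b in stream:
        # score first: probability under the model as it stands now
        p = (counts[b] + alpha) / (total + alpha * 256)
        bits += -math.log2(p)
        # then adapt: this update never leaks into the score above
        counts[b] += 1
        total += 1
    return bits / len(stream)
```

On a highly repetitive stream the model adapts as it scores and the bpb falls well below the 8-bit uniform baseline, mirroring how dynamic evaluation exploits local distributional structure.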
Attribution
Built on PR #315 (jfprincz): XSA, EMA, Partial RoPE, LN Scale, Late QAT.
PR #338 (alertcat): TTT integration.
SmearGate/BigramHash/OrthoInit originally by unnir.
Reference: Krause et al., "Dynamic Evaluation of Neural Sequence Models," ICML 2018.
See records/track_10min_16mb/2026-03-22_DynamicEval_TTT_11L/README.md for full details, ablation, what didn't work, and reproduction instructions.