Non-record submission: mean-delta warm start + depth recurrence for SLOT (0.8503 BPB) #1368
JKSNS wants to merge 1 commit into openai:main
| LS value | val_bpb@200 | Delta vs baseline |
|----------|-------------|-------------------|
| 0.00     | 1.737       | —                 |
| 0.01     | 1.787       | +0.050 (worse)    |
| 0.05     | 1.960       | +0.223 (worse)    |
| 0.10     | 2.146       | +0.409 (worse)    |

Root cause: at ~1800 training steps the model needs sharp gradients, and label smoothing reduces gradient magnitude, slowing convergence. This confirms the finding in PR openai#1368 that label smoothing causes "short-horizon degradation."

Also:

- MoD (Mixture of Depths) with per-token routing: 732 ms/step (2.2x slower than baseline); routing overhead dominates. KILLED.
- Embedding mixup for LMs: theoretically dubious (discrete sequences aren't naturally mixable like images). Not tested: high risk, low expected value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
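The gradient-magnitude claim above can be checked in isolation: for softmax cross-entropy, the gradient with respect to the correct-class logit is p_correct minus its target, and label smoothing lowers that target from 1.0 to 1 - eps + eps/V. A minimal sketch (the logits and eps values here are illustrative, not taken from the ablation):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    Z = sum(exps)
    return [e / Z for e in exps]

def correct_class_grad(logits, correct=0, eps=0.0):
    """d(CE)/d(logit_correct) = p_correct - target_correct,
    where label smoothing sets target = (1 - eps) + eps / V."""
    V = len(logits)
    p = softmax(logits)[correct]
    target = (1.0 - eps) + eps / V
    return p - target

logits = [2.0, 0.0, 0.0]                        # model already fairly confident
g_hard = correct_class_grad(logits, eps=0.0)    # target 1.0
g_smooth = correct_class_grad(logits, eps=0.1)  # target ~0.933
# |g_smooth| < |g_hard|: smoothing shrinks the update on the correct class
```

At short horizons, where most tokens are still far from converged, this systematically smaller push on the correct class is consistent with the slower val_bpb drop in the table.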
Community Review: Non-record submission: mean-delta warm start + depth recurrence for SLOT (0.8503 BPB)

BPB: 0.8503 | Compliance: FLAG. Pre-Quant TTT runs multi-epoch on …

What I found in the code (head SHA …): at line 1111 the pre-quant TTT function takes …

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern of closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539; see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data, not …

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.07s, dim=512, layers=11, vocab=1024, code=72328 B, SMOKE_TEST_PASS.

Verdict: COMPLIANCE FLAG, same pattern as the closed Pre-Quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of … would resolve the flag.

Reviewed by @MatoTeziTanka, The Agora. Classification via deterministic AST-based …
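The validity criterion the review cites (each token scored before the adapter trains on it) can be illustrated with a toy harness; `LeakProbe`, `valid_ttt`, and `invalid_ttt` are hypothetical stand-ins for exposition, not code from the repository:

```python
class LeakProbe:
    """Toy adapter that records what it has trained on; score() returns
    1.0 when a chunk was already seen in training (i.e., leakage), else 0.0."""
    def __init__(self):
        self.seen = set()
    def train(self, chunk):
        self.seen.add(chunk)
    def score(self, chunk):
        return 1.0 if chunk in self.seen else 0.0

def valid_ttt(chunks, adapter):
    # prequential order: evaluate each chunk BEFORE adapting on it
    scores = []
    for c in chunks:
        scores.append(adapter.score(c))  # score first...
        adapter.train(c)                 # ...then train on what was just scored
    return sum(scores) / len(scores)

def invalid_ttt(chunks, adapter, epochs=2):
    # multi-epoch on the eval data, scoring only on the final pass
    for _ in range(epochs):
        for c in chunks:
            adapter.train(c)
    return sum(adapter.score(c) for c in chunks) / len(chunks)

leak_valid = valid_ttt(["a", "b", "c"], LeakProbe())      # no chunk scored after training on it
leak_invalid = invalid_ttt(["a", "b", "c"], LeakProbe())  # every scored chunk was trained on
```

Under this probe the prequential order shows zero leakage while the multi-epoch order shows total leakage, which is the distinction the ruling in Issue #677 draws.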
0.8503 BPB on seed 1337 using 8× H100 SXM, with a 13.3 MB artifact, 600s training, and 535s evaluation.
This run targets better SLOT efficiency and model quality through a warmer SLOT initialization, lightweight effective-depth expansion, and short-horizon training cleanup:
Mean-delta SLOT warm start
Instead of resetting deltas to zero each batch, the mean of the previous batch’s converged deltas is carried forward with alpha = 0.9. This exploits systematic model bias as a free initialization signal, giving SLOT a head start on the global correction component so the full 32 AdamW steps can focus on per-sample refinement. Local validation showed gains ranging from 0.16% to 0.68%, increasing with model bias strength.
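One way to realize the carry-forward is an exponential moving average of batch-mean converged deltas; the class below is a hypothetical sketch of that reading (the blend formula and API are my assumptions, only alpha = 0.9 comes from the description):

```python
class MeanDeltaWarmStart:
    """Sketch: seed each batch's SLOT deltas from an EMA of previous
    batches' mean converged deltas instead of zeros (assumed EMA form)."""
    def __init__(self, dim, alpha=0.9):
        self.alpha = alpha          # carry-over strength from the PR (0.9)
        self.carry = [0.0] * dim    # running estimate of the systematic bias

    def init_deltas(self, batch_size):
        # every sample starts from the shared global correction, so the
        # 32 AdamW steps can focus on per-sample refinement
        return [list(self.carry) for _ in range(batch_size)]

    def update(self, converged):
        # mean over the batch's converged deltas, then EMA blend
        dim = len(self.carry)
        mean = [sum(d[j] for d in converged) / len(converged) for j in range(dim)]
        self.carry = [self.alpha * c + (1.0 - self.alpha) * m
                      for c, m in zip(self.carry, mean)]

ws = MeanDeltaWarmStart(dim=2)
ws.update([[1.0, 1.0], [3.0, 3.0]])    # batch mean is [2.0, 2.0]
starts = ws.init_deltas(batch_size=2)  # each start: 0.9*0 + 0.1*2.0 = 0.2 per dim
```

The stronger the model's systematic bias, the more of each sample's correction this shared warm start absorbs, which matches the observation that the gain grows with bias strength.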
Depth recurrence with `iter_embed` and `iter_gate`

Layers 4 and 5 are executed twice per forward pass, yielding 13 virtual layers from 11 physical layers at zero parameter cost. Learned per-iteration conditioning, with gate initialization at -2.0, differentiates the repeated passes.
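A minimal sketch of the recurrence; `layer4`/`layer5` are stand-in callables and the sigmoid-gated residual blend is my assumed form of the gating, while the two passes, the per-iteration embedding, and the -2.0 gate init come from the description:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

GATE_INIT = -2.0  # sigmoid(-2.0) ~= 0.12: the repeated pass starts near-transparent

def depth_recurrence(x, layer4, layer5, iter_embed, iter_gate, n_iter=2):
    """Run layers 4-5 n_iter times; iter_embed[i] conditions pass i and
    iter_gate[i] blends that pass's output back into the residual stream."""
    for i in range(n_iter):
        h = [xj + iter_embed[i] for xj in x]  # learned per-iteration conditioning
        h = layer5(layer4(h))
        g = sigmoid(iter_gate[i])             # gated residual (assumed form)
        x = [(1.0 - g) * xj + g * hj for xj, hj in zip(x, h)]
    return x

# with identity layers the block reduces to a gated mix that leaves x unchanged
out = depth_recurrence([1.0, -1.0],
                       layer4=lambda h: h, layer5=lambda h: h,
                       iter_embed=[0.0, 0.0],
                       iter_gate=[GATE_INIT, GATE_INIT])
```

Because the gate opens from ~0.12, training can gradually decide how much the second pass contributes, which is why the repeat adds effective depth at zero parameter cost beyond the small embedding and gate vectors.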
Label smoothing short-horizon degradation
Identified and documented a short-training failure mode of label smoothing that caused a 0.07 BPB regression at 5,600 steps. A corrected follow-up run reached val_bpb 1.2022 at step 4000, versus 1.6725 in this run, projecting roughly 0.77 to 0.78 BPB before compute interruption.
Also investigated and ruled out: FOMAML meta-SLOT training, persistent optimizer state across batches, MuonEq-R, QK-Gain 5.0, and WD 0.09. Full details and ablations are in the README.