
Non-record submission: mean-delta warm start + depth recurrence for SLOT (0.8503 BPB)#1368

Open
JKSNS wants to merge 1 commit into openai:main from JKSNS:record/mean-delta-slot-warmstart

Conversation


@JKSNS JKSNS commented Apr 5, 2026

0.8503 BPB on seed 1337 using 8× H100 SXM, with a 13.3 MB artifact, 600s training, and 535s evaluation.

This run targets better SLOT efficiency and model quality through a warmer SLOT initialization, lightweight effective-depth expansion, and short-horizon training cleanup:

  1. Mean-delta SLOT warm start
    Instead of resetting deltas to zero each batch, the mean of the previous batch’s converged deltas is carried forward with alpha = 0.9. This exploits systematic model bias as a free initialization signal, giving SLOT a head start on the global correction component so the full 32 AdamW steps can focus on per-sample refinement. Local validation showed gains ranging from 0.16% to 0.68%, increasing with model bias strength.

  2. Depth recurrence with iter_embed and iter_gate
    Layers 4 and 5 are executed twice per forward pass, yielding 13 virtual layers from 11 physical layers at zero parameter cost. Learned per-iteration conditioning, with gate initialization at -2.0, differentiates the repeated passes.

  3. Label smoothing short-horizon degradation
    Identified and documented a short-training failure mode from label smoothing, which caused a 0.07 BPB regression at 5,600 steps. A corrected follow-up run reached val_bpb 1.2022 at step 4000, versus 1.6725 in this run, projecting roughly 0.77 to 0.78 BPB with SLOT before the compute interruption.
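The mean-delta warm start in item 1 can be sketched as follows. This is a minimal illustration of one plausible reading of the description (carry forward an EMA of each batch's mean converged delta with alpha = 0.9); the class and method names are assumptions, not the PR's actual identifiers.

```python
import torch

ALPHA = 0.9  # EMA coefficient from the PR description

class MeanDeltaWarmStart:
    """Warm-start SLOT deltas at a running mean instead of zero."""

    def __init__(self, delta_shape):
        self.running_mean = torch.zeros(delta_shape)

    def init_deltas(self, batch_size):
        # Every sample's delta starts at the running mean (the systematic
        # "global correction"), so the per-sample AdamW steps only need to
        # learn the residual refinement.
        return (self.running_mean.unsqueeze(0)
                .repeat(batch_size, *[1] * self.running_mean.dim())
                .clone())

    def update(self, converged_deltas):
        # converged_deltas: (batch, *delta_shape) after the 32 AdamW steps.
        batch_mean = converged_deltas.mean(dim=0)
        self.running_mean = ALPHA * self.running_mean + (1 - ALPHA) * batch_mean
```

Whether alpha weights the old running mean (as here) or the new batch mean is not pinned down by the description; the README presumably disambiguates.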

Also investigated and ruled out: FOMAML meta-SLOT training, persistent optimizer state across batches, MuonEq-R, QK-Gain 5.0, and WD 0.09. Full details and ablations are in the README.
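The depth recurrence in item 2 might look something like the sketch below. The layer indices (4 and 5), the 2x iteration count, and the -2.0 gate initialization come from the description; the module structure, `iter_embed`/`iter_gate` shapes, and gating formula are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Run selected layers twice per forward pass at zero extra parameter
    cost beyond a small per-iteration embedding and gate."""

    def __init__(self, layers, dim, recur_idx=(4, 5), n_iters=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.recur_idx = set(recur_idx)
        self.n_iters = n_iters
        # Learned per-iteration conditioning (iter_embed), zero-initialized.
        self.iter_embed = nn.Parameter(torch.zeros(n_iters, dim))
        # Scalar gate per iteration; sigmoid(-2.0) ~= 0.12, so the repeated
        # pass starts as a small correction on top of the first pass.
        self.iter_gate = nn.Parameter(torch.full((n_iters,), -2.0))

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i in self.recur_idx:
                for t in range(1, self.n_iters):
                    # Repeated pass sees iteration conditioning so it can
                    # behave differently from the first pass.
                    y = layer(x + self.iter_embed[t])
                    x = x + torch.sigmoid(self.iter_gate[t]) * (y - x)
        return x
```

With 11 physical layers and two of them iterated twice, the forward pass traverses 13 virtual layers, matching the numbers in the description.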

@JKSNS JKSNS changed the title non-record submission: mean-delta warm start + depth recurrence for SLOT (0.8503 BPB) Non-record submission: mean-delta warm start + depth recurrence for SLOT (0.8503 BPB) Apr 5, 2026
yuyeon added a commit to yuyeon/parameter-golf that referenced this pull request Apr 5, 2026
| LS Value | val_bpb@200 | Delta vs baseline |
|----------|-------------|-------------------|
| 0.00     | 1.737       | —                 |
| 0.01     | 1.787       | +0.050 (worse)    |
| 0.05     | 1.960       | +0.223 (worse)    |
| 0.10     | 2.146       | +0.409 (worse)    |

Root cause: at ~1800 training steps, the model needs sharp gradients.
Label smoothing reduces gradient magnitude, slowing convergence.
Confirms PR openai#1368 finding: label smoothing causes "short-horizon degradation."

Also: MoD (Mixture of Depths) with per-token routing: 732ms/step (2.2x slower
than baseline), routing overhead dominates. KILLED.

Embedding mixup for LMs: theoretically dubious (discrete sequences aren't
naturally mixable like images). Not tested — high risk, low expected value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record submission: mean-delta warm start + depth recurrence for SLOT (0.8503 BPB)

BPB: 0.8503 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA 571a7eed08df, file records/track_10min_16mb/2026-04-04_MeanDelta_SLOT_WarmStart_DepthRecurrence/train_gpt.py):

At line 1111 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

ttt_adapt_adamw(args, base_model, device, val_tokens, rank, world_size, log0) — for epoch in range(args.ttt_epochs), loss.backward() without prior no_grad score pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is on the function signature itself — the argument tensor passed in.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.07s, dim=512, layers=11, vocab=1024, code=72328 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
