Non-record submission: mean-delta warm start + depth recurrence for SLOT (0.8503 BPB) #1368
JKSNS wants to merge 1 commit into openai:main
| LS value | val_bpb@200 | Delta vs baseline |
|----------|-------------|-------------------|
| 0.00     | 1.737       | —                 |
| 0.01     | 1.787       | +0.050 (worse)    |
| 0.05     | 1.960       | +0.223 (worse)    |
| 0.10     | 2.146       | +0.409 (worse)    |

Root cause: at ~1800 training steps the model needs sharp gradients, and label smoothing reduces gradient magnitude, slowing convergence. This confirms the finding in PR openai#1368 that label smoothing causes "short-horizon degradation."

Also:

- MoD (Mixture of Depths) with per-token routing: 732 ms/step (2.2x slower than baseline); routing overhead dominates. KILLED.
- Embedding mixup for LMs: theoretically dubious (discrete sequences aren't naturally mixable like images). Not tested: high risk, low expected value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
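The gradient-magnitude claim above can be checked in isolation: for softmax cross-entropy, the gradient with respect to the correct-class logit is p_correct minus its target, and label smoothing lowers that target from 1.0 to 1 - eps + eps/V. A minimal sketch (the logits and eps values here are illustrative, not taken from the ablation):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    Z = sum(exps)
    return [e / Z for e in exps]

def correct_class_grad(logits, correct=0, eps=0.0):
    """d(CE)/d(logit_correct) = p_correct - target_correct,
    where label smoothing sets target = (1 - eps) + eps / V."""
    V = len(logits)
    p = softmax(logits)[correct]
    target = (1.0 - eps) + eps / V
    return p - target

logits = [2.0, 0.0, 0.0]                        # model already fairly confident
g_hard = correct_class_grad(logits, eps=0.0)    # target 1.0
g_smooth = correct_class_grad(logits, eps=0.1)  # target ~0.933
# |g_smooth| < |g_hard|: smoothing shrinks the update on the correct class
```

At short horizons, where most tokens are still far from converged, this systematically smaller push on the correct class is consistent with the slower val_bpb drop in the table.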
Community Review: Non-record submission: mean-delta warm start + depth recurrence for SLOT (0.8503 BPB)

BPB: 0.8503 | Compliance: FLAG. Pre-Quant TTT runs multi-epoch on …

What I found in the code (head SHA …): at line 1111 the pre-quant TTT function takes …

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern of closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539; see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data, not …

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.07s, dim=512, layers=11, vocab=1024, code=72328 B, SMOKE_TEST_PASS.

Verdict: COMPLIANCE FLAG, same pattern as the closed Pre-Quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of … would resolve the flag.

Reviewed by @MatoTeziTanka, The Agora. Classification via deterministic AST-based …
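The validity criterion the review cites (each token scored before the adapter trains on it) can be illustrated with a toy harness; `LeakProbe`, `valid_ttt`, and `invalid_ttt` are hypothetical stand-ins for exposition, not code from the repository:

```python
class LeakProbe:
    """Toy adapter that records what it has trained on; score() returns
    1.0 when a chunk was already seen in training (i.e., leakage), else 0.0."""
    def __init__(self):
        self.seen = set()
    def train(self, chunk):
        self.seen.add(chunk)
    def score(self, chunk):
        return 1.0 if chunk in self.seen else 0.0

def valid_ttt(chunks, adapter):
    # prequential order: evaluate each chunk BEFORE adapting on it
    scores = []
    for c in chunks:
        scores.append(adapter.score(c))  # score first...
        adapter.train(c)                 # ...then train on what was just scored
    return sum(scores) / len(scores)

def invalid_ttt(chunks, adapter, epochs=2):
    # multi-epoch on the eval data, scoring only on the final pass
    for _ in range(epochs):
        for c in chunks:
            adapter.train(c)
    return sum(adapter.score(c) for c in chunks) / len(chunks)

leak_valid = valid_ttt(["a", "b", "c"], LeakProbe())      # no chunk scored after training on it
leak_invalid = invalid_ttt(["a", "b", "c"], LeakProbe())  # every scored chunk was trained on
```

Under this probe the prequential order shows zero leakage while the multi-epoch order shows total leakage, which is the distinction the ruling in Issue #677 draws.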
0.8503 BPB on seed 1337 using 8× H100 SXM, with a 13.3 MB artifact, 600s training, and 535s evaluation.
This run targets better SLOT efficiency and model quality through a warmer SLOT initialization, lightweight effective-depth expansion, and short-horizon training cleanup:
Mean-delta SLOT warm start
Instead of resetting deltas to zero each batch, the mean of the previous batch’s converged deltas is carried forward with alpha = 0.9. This exploits systematic model bias as a free initialization signal, giving SLOT a head start on the global correction component so the full 32 AdamW steps can focus on per-sample refinement. Local validation showed gains ranging from 0.16% to 0.68%, increasing with model bias strength.
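One way to realize the carry-forward is an exponential moving average of batch-mean converged deltas; the class below is a hypothetical sketch of that reading (the blend formula and API are my assumptions, only alpha = 0.9 comes from the description):

```python
class MeanDeltaWarmStart:
    """Sketch: seed each batch's SLOT deltas from an EMA of previous
    batches' mean converged deltas instead of zeros (assumed EMA form)."""
    def __init__(self, dim, alpha=0.9):
        self.alpha = alpha          # carry-over strength from the PR (0.9)
        self.carry = [0.0] * dim    # running estimate of the systematic bias

    def init_deltas(self, batch_size):
        # every sample starts from the shared global correction, so the
        # 32 AdamW steps can focus on per-sample refinement
        return [list(self.carry) for _ in range(batch_size)]

    def update(self, converged):
        # mean over the batch's converged deltas, then EMA blend
        dim = len(self.carry)
        mean = [sum(d[j] for d in converged) / len(converged) for j in range(dim)]
        self.carry = [self.alpha * c + (1.0 - self.alpha) * m
                      for c, m in zip(self.carry, mean)]

ws = MeanDeltaWarmStart(dim=2)
ws.update([[1.0, 1.0], [3.0, 3.0]])    # batch mean is [2.0, 2.0]
starts = ws.init_deltas(batch_size=2)  # each start: 0.9*0 + 0.1*2.0 = 0.2 per dim
```

The stronger the model's systematic bias, the more of each sample's correction this shared warm start absorbs, which matches the observation that the gain grows with bias strength.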
Depth recurrence with `iter_embed` and `iter_gate`

Layers 4 and 5 are executed twice per forward pass, yielding 13 virtual layers from 11 physical layers at zero parameter cost. Learned per-iteration conditioning, with gate initialization at -2.0, differentiates the repeated passes.
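A minimal sketch of the recurrence; `layer4`/`layer5` are stand-in callables and the sigmoid-gated residual blend is my assumed form of the gating, while the two passes, the per-iteration embedding, and the -2.0 gate init come from the description:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

GATE_INIT = -2.0  # sigmoid(-2.0) ~= 0.12: the repeated pass starts near-transparent

def depth_recurrence(x, layer4, layer5, iter_embed, iter_gate, n_iter=2):
    """Run layers 4-5 n_iter times; iter_embed[i] conditions pass i and
    iter_gate[i] blends that pass's output back into the residual stream."""
    for i in range(n_iter):
        h = [xj + iter_embed[i] for xj in x]  # learned per-iteration conditioning
        h = layer5(layer4(h))
        g = sigmoid(iter_gate[i])             # gated residual (assumed form)
        x = [(1.0 - g) * xj + g * hj for xj, hj in zip(x, h)]
    return x

# with identity layers the block reduces to a gated mix that leaves x unchanged
out = depth_recurrence([1.0, -1.0],
                       layer4=lambda h: h, layer5=lambda h: h,
                       iter_embed=[0.0, 0.0],
                       iter_gate=[GATE_INIT, GATE_INIT])
```

Because the gate opens from ~0.12, training can gradually decide how much the second pass contributes, which is why the repeat adds effective depth at zero parameter cost beyond the small embedding and gate vectors.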
Label smoothing short-horizon degradation
Identified and documented a short-training failure mode of label smoothing that caused a 0.07 BPB regression at 5,600 steps. A corrected follow-up run reached val_bpb 1.2022 at step 4000, versus 1.6725 in this run, projecting roughly 0.77 to 0.78 BPB before compute interruption.
Also investigated and ruled out: FOMAML meta-SLOT training, persistent optimizer state across batches, MuonEq-R, QK-Gain 5.0, and WD 0.09. Full details and ablations are in the README.