
Record: SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT-16 — val_bpb 1.0766 (3-seed mean)#1333

Open
aryanbhosale wants to merge 1 commit into openai:main from aryanbhosale:submission/sp4096-causal-slot

Conversation

@aryanbhosale
Contributor

Record: SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT-16

val_bpb = 1.0766 (3-seed mean, std 0.0004) | ~16.00 MB | 8×H100 SXM

3-Seed Results

| Seed | Sliding BPB | Causal SLOT BPB | SLOT gain | Artifact (bytes) |
| --- | --- | --- | --- | --- |
| 42 | 1.0893 | 1.0762 | -0.0131 | 15,999,461 |
| 314 | 1.0897 | 1.0766 | -0.0131 | 15,997,932 |
| 999 | 1.0897 | 1.0770 | -0.0127 | 15,994,941 |
| Mean | | 1.0766 | -0.0130 | |

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0381 BPB.

Training (6 techniques)

  1. 4096-Vocab + MLP 4x + WD 0.090 — PR #1218 @clarkkev, PR #1285 @dexhunter
  2. Depth Recurrence (layers 4, 5) — PR #1204 @msisovic, PR #1260 @dexhunter
  3. Parallel Residuals (from layer 7) — PR #1204 @msisovic, PR #1289 @MatoTeziTanka
  4. MuonEq-R — arXiv:2603.28254, PR #1260 @dexhunter
  5. QK-Gain 5.0 — PR #1217 @bigbag
  6. Full GPTQ int6 + Brotli + LZMA Compressed Wrapper
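Techniques 2 and 3 change only the forward pass, and can be sketched in a few lines. The following is an illustrative numpy sketch, not the repo's code: `attn` and `mlp` are stand-ins for the real sublayers, and only the layer indices (recurrence over layers 4–5, parallel residuals from layer 7) follow the settings above.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, D = 12, 64
RECUR_LAYERS = (4, 5)          # RECUR_LAYERS=4,5: run this pair a second time
PARALLEL_START = 7             # PARALLEL_START_LAYER=7

# Stand-ins for per-layer attention / MLP sublayers (illustrative only).
Wa = rng.normal(0, 0.02, (N_LAYERS, D, D))
Wm = rng.normal(0, 0.02, (N_LAYERS, D, D))
attn = lambda i, x: np.tanh(x @ Wa[i])
mlp = lambda i, x: np.tanh(x @ Wm[i])

def block(i, x):
    if i >= PARALLEL_START:
        # Parallel residuals: attention and MLP read the SAME input and
        # their outputs are summed into a single residual update.
        return x + attn(i, x) + mlp(i, x)
    # Standard sequential residual block.
    x = x + attn(i, x)
    return x + mlp(i, x)

def forward(x):
    for i in range(N_LAYERS):
        x = block(i, x)
        if i == RECUR_LAYERS[-1]:
            # Depth recurrence: reuse layers 4 and 5 with shared weights,
            # adding depth at zero parameter cost.
            for j in RECUR_LAYERS:
                x = block(j, x)
    return x

out = forward(rng.normal(0, 1.0, (8, D)))
```

Weight sharing is what makes depth recurrence free under the 16 MB artifact budget: the recurred layers contribute compute but no extra parameters.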

Evaluation: Causal SLOT (context-only delta optimization)

Per-batch additive delta (dim=512) optimized with AdamW (lr=0.008, 16 steps) on context-only positions. Only already-scored tokens contribute to the optimization loss. Delta re-initialized per batch. Model weights completely frozen.

Provably causal: delta depends only on x_1,...,x_{t-64} (all previously scored). New positions scored with adapted delta but excluded from optimization. Same causal guarantee as score-first TTT but via delta optimization instead of weight updates.
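The loop above can be sketched end-to-end. This is a minimal numpy illustration of the mechanism as described, not the submission's code: a frozen linear head stands in for the model, plain gradient descent stands in for AdamW, and all dimensions and names are stand-ins except `SLOT_STEPS=16`, `SLOT_LR=0.008`, and the 512-dim delta.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V = 128, 512, 4096           # window length, hidden dim, vocab size
CONTEXT = T - 64                   # positions already scored in prior windows

W = rng.normal(0, 0.02, (D, V))    # frozen "model" head (stand-in)
h = rng.normal(0, 1.0, (T, D))     # hidden states for the current window
y = rng.integers(0, V, T)          # target tokens

def ce_and_grad(delta, idx):
    """Cross-entropy over positions idx, and its gradient w.r.t. delta."""
    logits = (h[idx] + delta) @ W                  # additive delta, shared across positions
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(len(idx)), y[idx]]).mean()
    p[np.arange(len(idx)), y[idx]] -= 1.0          # softmax minus one-hot
    grad = (p / len(idx)) @ W.T                    # chain rule back to delta
    return loss, grad.sum(axis=0)

ctx = np.arange(CONTEXT)           # optimization loss sees ONLY these positions
new = np.arange(CONTEXT, T)        # scored with the delta, never optimized on

delta = np.zeros(D)                # re-initialized per batch
loss0, _ = ce_and_grad(delta, ctx)
for _ in range(16):                # SLOT_STEPS=16
    _, g = ce_and_grad(delta, ctx)
    delta -= 0.008 * g             # SLOT_LR=0.008 (plain GD instead of AdamW)
loss1, _ = ce_and_grad(delta, ctx)

new_loss, _ = ce_and_grad(delta, new)  # score new tokens with the adapted delta
```

Note that `W` and `h` are never updated, only `delta`, which mirrors the frozen-weights constraint above.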

Source: arXiv:2505.12392v2, PR #1306 @resouer (causal variant), PR #1176 @bigbag.

Compliance

  • Condition 1: delta from context-only (already scored) positions. No future token access.
  • Condition 2: standard softmax over full 4096-token vocab
  • Condition 3: new tokens scored AFTER delta optimization on context
  • Condition 4: single left-to-right pass, no rescoring
  • Total eval: ~520s (within 600s budget)

Reproduction

```bash
pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
SLOT_ENABLED=1 SLOT_LR=0.008 SLOT_STEPS=16 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee, PR #1306 @resouer, PR #1176 @bigbag

@aryanbhosale
Contributor Author

Self-assessment: Causal SLOT legality

I want to be transparent about the legality of this submission and invite community review.

This submission uses Causal SLOT — a context-only variant of SLOT where the delta vector is optimized on already-scored positions only. Standard SLOT was proven to violate Condition 1 by PR #1240 (100% violation rate). Causal SLOT restricts optimization to context-only positions, which should fix the causal violation.

Why I think it's legal

The delta at position t depends only on tokens x_1,...,x_{t-64} (all scored in prior windows). The gradient of the context-only loss w.r.t. delta flows only through context positions — new positions contribute zero gradient because of the context mask. compute_logits is position-independent (RMSNorm normalizes per-feature, linear projection is per-position), so no cross-position leakage in the gradient.

This is the same causal guarantee as score-first TTT: adapt on scored tokens, apply to future predictions. The only difference is delta optimization (512 dims) vs weight updates (34M params).
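The zero-gradient claim can be checked numerically. The toy example below (stand-in dimensions, evaluated at delta = 0) computes each position's contribution to d(context-loss)/d(delta) and shows that masked "new" positions contribute exactly zero, because the masked loss never reads them.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, V = 8, 16, 32
ctx_mask = np.arange(T) < T - 2          # last 2 positions are "new"

W = rng.normal(0, 0.1, (D, V))           # frozen stand-in head
h = rng.normal(0, 1.0, (T, D))
y = rng.integers(0, V, T)

# Per-position contribution to d(context-loss)/d(delta) at delta = 0:
# grad_t = W @ (softmax_t - onehot_t) / n_ctx if t is context, else 0.
logits = h @ W
p = np.exp(logits - logits.max(1, keepdims=True))
p /= p.sum(1, keepdims=True)
p[np.arange(T), y] -= 1.0                # softmax minus one-hot
per_pos = (p @ W.T) * ctx_mask[:, None] / ctx_mask.sum()

grad = per_pos.sum(0)                    # total gradient w.r.t. the shared delta
```

The mask zeroes the new positions' rows before they ever enter the sum, so the optimized delta is a function of context positions only.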

Why it might not be legal

  1. No explicit ruling from @0hq or @valerio-oai on causal SLOT specifically
  2. PR #1240 ("Does SLOT violate causal dependence?", empirical test + question) raised concerns about SLOT broadly, though its test targeted the non-causal variant
  3. The delta is a "free parameter" optimized per-batch on val data — purists may object even if it's backward-looking
  4. The -0.013 BPB gain is large enough to look suspicious

Request

@0hq @valerio-oai — could you weigh in on whether context-only SLOT (where optimization loss uses ONLY already-scored positions) satisfies the four conditions from Issue #1017? Several submissions use this approach (PRs #1306, #1322, #1324, and now this one).

I also have a fully legal Track A submission at PR #1334 (1.0897 BPB, no SLOT, no TTT, no eval-time adaptation) as a fallback.

@newjordan

> I also have a fully legal Track A submission at PR #1334 (1.0897 BPB, no SLOT, no TTT, no eval-time adaptation) as a fallback.

Did the 4096 vocab get approved? I remember custom token sets needed approval before adoption, was there any official movement on it, or did everyone just adopt it right away?

@aryanbhosale
Contributor Author

@newjordan Good question. The sp4096 tokenizer wasn't individually "approved" — it was introduced by @clarkkev in PR #1218 with data hosted on their HF repo (kevclark/parameter-golf). The README rule (criterion 2) says: "If changes are made to the tokenizer or dataset, prove with certainty that the val_bpb is correctly calculated. Submissions that edit the tokenizer will be examined much more carefully."

I didn't create a new tokenizer — I'm using the same sp4096 SentencePiece BPE model from @clarkkev's export, same as PRs #1218, #1285, #1287, #1291, and several others. The byte-accounting uses the standard build_sentencepiece_luts() function from the codebase, which correctly handles leading-space bytes, byte-fallback tokens, and boundary tokens for any SentencePiece vocab. It's not a custom tokenizer like Scylla (TokenMonster) — it's just a larger standard SentencePiece BPE trained on the same FineWeb docs.

That said, I don't think there's been an explicit "sp4096 is approved" statement from the maintainers. It's been adopted organically by ~8 PRs at this point. The competition description does say "tokenizer-agnostic" and encourages "novel tokenizers" as a valid approach.
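The byte-accounting in question reduces to one formula: val_bpb = (total NLL in nats) / (ln 2 x total bytes), with per-token byte counts taken from a lookup table such as the one `build_sentencepiece_luts()` produces. A hypothetical tiny LUT makes the arithmetic concrete; the byte counts below are made up for illustration.

```python
import numpy as np

# Hypothetical per-token byte counts. In the real code these come from
# build_sentencepiece_luts(), which handles leading-space bytes,
# byte-fallback tokens, and boundary tokens.
byte_lut = np.array([4, 3, 5, 4])        # bytes contributed by each token

# Per-token negative log-likelihoods in NATS. A uniform distribution
# over a 4096-token vocab gives ln(4096) nats per token.
nll_nats = np.full(4, np.log(4096.0))

val_bpb = nll_nats.sum() / (np.log(2.0) * byte_lut.sum())
# ln(4096)/ln(2) = 12 bits per token; 4 tokens -> 48 bits over 16 bytes
# -> 3.0 bits per byte
```

Because the denominator counts raw bytes, the metric is comparable across tokenizers of any vocab size, which is what makes the sp4096 swap legal under the "tokenizer-agnostic" framing.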

@newjordan

> I didn't create a new tokenizer — I'm using the same sp4096 SentencePiece BPE model from @clarkkev's export, same as PRs #1218, #1285, #1287, #1291, and several others. […]

Hope it goes through! Looks great and is looking to be strong. I've got a couple tricks to pull on the 11x but it's a hell of a time trying to keep up with the bob on the old engine.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 7, 2026
…two-track strategy

Critical findings from Issue openai#140 full thread analysis:
- Issue openai#140 CLOSED by @notapplica on Apr 6
- @valerio-oai NEVER commented in Issue openai#140; all rulings via PRs + Issue openai#677
- SLOT has never been officially banned: 9 open record PRs use SLOT variants
- PR openai#1333 (aryanbhosale, Causal SLOT-16): 1.0766 BPB — new best open record
- PR openai#1229 (scored-position SLOT): 0.9300 BPB — open, no rejection
- Strategy: Track A (safe: PR openai#1437 stack + TTT → ~1.078) + Track B (Causal SLOT-16 → ~1.076)
- SLOT status in CLAUDE.md updated from BLOCKED to DE FACTO IN USE

https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
Novel: Context-only delta optimization during eval. Per-batch additive
delta (512-dim) optimized with AdamW on ONLY already-scored positions.
New positions scored with optimized delta. Model weights frozen.

Fixes openai#1229's minibatch leakage: context = positions scored in PREVIOUS
windows only. No cross-window contamination within current batch.

Same compliance pattern as score-first TTT (openai#549/openai#1413).
Based on openai#1333's proven causal SLOT mechanism (-0.013 BPB on SP4096).

Stack: R12 SP8192 + score-first TTT + hash embedding + causal SLOT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
Base: PR openai#1333 (SP4096 + DR + PR + Causal SLOT-16, 1.0766 BPB)
Hardcoded: SLOT_ENABLED=1, SLOT_STEPS=16, SLOT_LR=0.008

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
The torch.compile conflict was because SLOT ran after the model was
compiled for forward_logits. Now SLOT runs FIRST on the uncompiled
model, then standard evals run after with fresh compilation.

SLOT uses torch.compile(forward_hidden, fullgraph=True) on the
uncompiled model — same pattern as openai#1333's original working code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
Novel mechanism: per-window ephemeral hash table (128 buckets × 512d)
indexed by prefix bigram hash, co-optimized alongside SLOT's global
delta in the same AdamW loop. Adds position-specific hidden corrections
on top of SLOT's window-global delta.

No PR combines a hashed hidden residual with causal SLOT.
Nearest: openai#1333 (global delta only), openai#1460 (input-space hash, no SLOT).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
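The indexing scheme in that commit can be sketched briefly. This is an illustrative guess at the shape of the mechanism, not the commit's code: the table sizes (128 buckets x 512d) follow the commit message, but the hash mixing function is a placeholder the commit does not specify.

```python
import numpy as np

N_BUCKETS, D = 128, 512
table = np.zeros((N_BUCKETS, D))   # ephemeral: re-initialized per window

def bucket(prev2, prev1):
    # Placeholder prefix-bigram hash; the actual mixing function used
    # in the commit is not specified.
    return (prev2 * 1000003 + prev1) % N_BUCKETS

def correction(tokens, t, delta):
    # Position-specific correction added on top of SLOT's window-global
    # delta; the table rows would be co-optimized with delta by AdamW.
    return delta + table[bucket(tokens[t - 2], tokens[t - 1])]

tokens = [5, 17, 9, 17, 9]
delta = np.zeros(D)
c = correction(tokens, 4, delta)
```

Because the bucket index depends only on the two preceding tokens, the correction inherits the same backward-looking property as the global delta.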