Record: SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT-16 — val_bpb 1.0766 (3-seed mean)#1333
Conversation
SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT — val_bpb 1.0766 (3-seed mean). 8 train-time techniques + causal context-only SLOT at eval. 3-seed mean: 1.0766 BPB, delta −0.0381 vs merged SOTA.
Self-assessment: Causal SLOT legality

I want to be transparent about the legality of this submission and invite community review. This submission uses Causal SLOT — a context-only variant of SLOT in which the delta vector is optimized on already-scored positions only. Standard SLOT was shown by PR #1240 to violate Condition 1 (100% violation rate). Causal SLOT restricts optimization to context-only positions, which should fix the causal violation.

Why I think it's legal

The delta at position t depends only on tokens x_1, ..., x_{t-64} (all scored in prior windows). The gradient of the context-only loss w.r.t. delta flows only through context positions — new positions contribute zero gradient because of the context mask. This is the same causal guarantee as score-first TTT: adapt on scored tokens, apply to future predictions. The only difference is delta optimization (512 dims) vs. weight updates (34M params).

Why it might not be legal
Request

@0hq @valerio-oai — could you weigh in on whether context-only SLOT (where the optimization loss uses ONLY already-scored positions) satisfies the four conditions from Issue #1017? Several submissions use this approach (PRs #1306, #1322, #1324, and now this one). I also have a fully legal Track A submission at PR #1334 (1.0897 BPB, no SLOT, no TTT, no eval-time adaptation) as a fallback.
> I also have a fully legal Track A submission at PR #1334 (1.0897 BPB, no SLOT, no TTT, no eval-time adaptation) as a fallback.

Did the 4096 vocab get approved? I remember custom token sets needed approval before adoption — was there any official movement on it, or did everyone just adopt it right away?
@newjordan Good question. The sp4096 tokenizer wasn't individually "approved" — it was introduced by @clarkkev in PR #1218 with data hosted on their HF repo (kevclark/parameter-golf). The README rule (criterion 2) says: "If changes are made to the tokenizer or dataset, prove with certainty that the val_bpb is correctly calculated. Submissions that edit the tokenizer will be examined much more carefully."

I didn't create a new tokenizer — I'm using the same sp4096 SentencePiece BPE model from @clarkkev's export, same as PRs #1218, #1285, #1287, #1291, and several others. The byte-accounting uses the standard …

That said, I don't think there's been an explicit "sp4096 is approved" statement from the maintainers. It's been adopted organically by ~8 PRs at this point. The competition description does say "tokenizer-agnostic" and encourages "novel tokenizers" as a valid approach.
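On the byte-accounting point: bits-per-byte is tokenizer-agnostic by construction, because the summed cross-entropy is normalized by raw UTF-8 byte count rather than token count. A minimal sketch of that accounting (the function name is illustrative, not from the repo):

```python
import math

def bits_per_byte(total_nll_nats: float, total_utf8_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits-per-byte.
    Normalizing by raw byte count, not token count, keeps the metric
    comparable across tokenizers with different vocab sizes."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)

# toy example: per-byte NLL of ln(2) nats over 1000 bytes is exactly 1 bit/byte
print(bits_per_byte(1000 * math.log(2), 1000))  # → 1.0
```

Any change in tokenization only redistributes where the NLL is incurred; the denominator stays the corpus byte count.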
Hope it goes through! Looks great and is shaping up to be a strong result. I've got a couple of tricks to pull on the 11x, but it's a hell of a time trying to keep up with the bob on the old engine.
…two-track strategy

Critical findings from Issue openai#140 full thread analysis:
- Issue openai#140 CLOSED by @notapplica on Apr 6
- @valerio-oai NEVER commented in Issue openai#140; all rulings via PRs + Issue openai#677
- SLOT has never been officially banned: 9 open record PRs use SLOT variants
- PR openai#1333 (aryanbhosale, Causal SLOT-16): 1.0766 BPB — new best open record
- PR openai#1229 (scored-position SLOT): 0.9300 BPB — open, no rejection
- Strategy: Track A (safe: PR openai#1437 stack + TTT → ~1.078) + Track B (Causal SLOT-16 → ~1.076)
- SLOT status in CLAUDE.md updated from BLOCKED to DE FACTO IN USE

https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
Novel: context-only delta optimization during eval. Per-batch additive delta (512-dim) optimized with AdamW on ONLY already-scored positions. New positions are scored with the optimized delta. Model weights frozen.

Fixes openai#1229's minibatch leakage: context = positions scored in PREVIOUS windows only; no cross-window contamination within the current batch. Same compliance pattern as score-first TTT (openai#549/openai#1413). Based on openai#1333's proven causal SLOT mechanism (−0.013 BPB on SP4096).

Stack: R12 SP8192 + score-first TTT + hash embedding + causal SLOT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
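The no-leakage rule — context means positions scored in previous windows only, never the current batch — amounts to a simple cutoff mask. A sketch under that assumption (names are illustrative, not from the commit):

```python
def previous_windows_mask(seq_len: int, window: int, current_window_idx: int) -> list[bool]:
    """True only at positions scored in PREVIOUS windows, so delta
    optimization never sees any position from the current window."""
    cutoff = current_window_idx * window  # first position of the current window
    return [i < cutoff for i in range(seq_len)]

# window 2 of a 256-position sequence with 64-position windows:
mask = previous_windows_mask(256, 64, 2)
print(sum(mask))  # → 128
```

Window 0 gets an all-False mask, so the very first window is scored with the unadapted model — the price of keeping the optimization strictly behind the scoring frontier.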
Base: PR openai#1333 (SP4096 + DR + PR + Causal SLOT-16, 1.0766 BPB)
Hardcoded: SLOT_ENABLED=1, SLOT_STEPS=16, SLOT_LR=0.008

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The torch.compile conflict arose because SLOT ran after the model had been compiled for forward_logits. Now SLOT runs FIRST on the uncompiled model, then the standard evals run afterward with fresh compilation. SLOT uses torch.compile(forward_hidden, fullgraph=True) on the uncompiled model — the same pattern as openai#1333's original working code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Novel mechanism: per-window ephemeral hash table (128 buckets × 512d) indexed by prefix bigram hash, co-optimized alongside SLOT's global delta in the same AdamW loop. Adds position-specific hidden corrections on top of SLOT's window-global delta. No PR combines a hashed hidden residual with causal SLOT. Nearest: openai#1333 (global delta only), openai#1460 (input-space hash, no SLOT). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
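A minimal sketch of the bucket indexing for the ephemeral table (the mixing constant and function name are assumptions, not taken from the commit):

```python
def prefix_bigram_bucket(tok_prev2: int, tok_prev1: int, n_buckets: int = 128) -> int:
    """Hash the two preceding token ids into one of n_buckets slots.
    Each eval window gets a fresh n_buckets x 512 table of correction
    vectors, co-optimized with the global SLOT delta in the same AdamW
    loop and discarded when the window ends."""
    h = (tok_prev2 * 0x9E3779B1) ^ tok_prev1  # Fibonacci-style integer mixing (assumed)
    return h % n_buckets

bucket = prefix_bigram_bucket(17, 42)
assert 0 <= bucket < 128
```

Because the bucket depends only on the two tokens preceding a position, the lookup inherits the same causal structure as the global delta: it is a function of context, not of the token being scored.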
Record: SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT-16
val_bpb = 1.0766 (3-seed mean, std 0.0004) | ~16.00 MB | 8×H100 SXM
3-Seed Results
Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0381 BPB.
Training (6 techniques)
Evaluation: Causal SLOT (context-only delta optimization)
Per-batch additive delta (dim=512) optimized with AdamW (lr=0.008, 16 steps) on context-only positions. Only already-scored tokens contribute to the optimization loss. Delta re-initialized per batch. Model weights completely frozen.
Provably causal: delta depends only on x_1,...,x_{t-64} (all previously scored). New positions scored with adapted delta but excluded from optimization. Same causal guarantee as score-first TTT but via delta optimization instead of weight updates.
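The mechanism above can be sketched end to end — a toy linear readout stands in for the frozen model, and all names and shapes are illustrative, not the submission's actual code:

```python
import torch
import torch.nn.functional as F

def optimize_causal_delta(hidden, w_out, targets, ctx_mask, steps=16, lr=0.008):
    """Context-only SLOT: optimize an additive hidden-state delta with
    AdamW, using ONLY already-scored (context) positions in the loss.
    Masked-out new positions contribute zero gradient, so delta depends
    solely on previously scored tokens. Model weights stay frozen."""
    delta = torch.zeros(hidden.shape[-1], requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = (hidden + delta) @ w_out                    # frozen readout
        loss = F.cross_entropy(logits[ctx_mask], targets[ctx_mask])
        loss.backward()
        opt.step()
    return delta.detach()

# causality check: editing a NON-context target leaves delta unchanged
torch.manual_seed(0)
hidden = torch.randn(8, 16)
w_out = torch.randn(16, 32)
targets = torch.randint(0, 32, (8,))
ctx = torch.tensor([True] * 4 + [False] * 4)
d1 = optimize_causal_delta(hidden, w_out, targets, ctx, steps=4, lr=0.01)
altered = targets.clone()
altered[6] = (altered[6] + 1) % 32                           # non-context position
d2 = optimize_causal_delta(hidden, w_out, altered, ctx, steps=4, lr=0.01)
print(torch.allclose(d1, d2))  # → True
```

The final check is the whole legality argument in miniature: perturbing a to-be-scored position has no effect on the optimized delta, because the context mask zeroes its contribution to the loss.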
Source: arXiv:2505.12392v2, PR #1306 @resouer (causal variant), PR #1176 @bigbag.
Compliance
Reproduction
Credits
PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee, PR #1306 @resouer, PR #1176 @bigbag