
Record: SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT-16 — val_bpb 1.0766 (3-seed mean)#1333

Open
aryanbhosale wants to merge 1 commit into openai:main from aryanbhosale:submission/sp4096-causal-slot

Conversation

@aryanbhosale
Contributor

Record: SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT-16

val_bpb = 1.0766 (3-seed mean, std 0.0004) | ~16.00 MB | 8×H100 SXM

3-Seed Results

| Seed | Sliding BPB | Causal SLOT BPB | SLOT gain | Artifact (bytes) |
| --- | --- | --- | --- | --- |
| 42 | 1.0893 | 1.0762 | -0.0131 | 15,999,461 |
| 314 | 1.0897 | 1.0766 | -0.0131 | 15,997,932 |
| 999 | 1.0897 | 1.0770 | -0.0127 | 15,994,941 |
| Mean | | 1.0766 | -0.0130 | |

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0381 BPB.

Training (6 techniques)

  1. 4096-Vocab + MLP 4x + WD 0.090 — PR #1218 @clarkkev, PR #1285 @dexhunter
  2. Depth Recurrence (layers 4, 5) — PR #1204 @msisovic, PR #1260 @dexhunter
  3. Parallel Residuals (from layer 7) — PR #1204 @msisovic, PR #1289 @MatoTeziTanka
  4. MuonEq-R — arXiv:2603.28254, PR #1260 @dexhunter
  5. QK-Gain 5.0 — PR #1217 @bigbag
  6. Full GPTQ int6 + Brotli + LZMA Compressed Wrapper
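Techniques 2 and 3 change only the forward pass, and can be sketched in a few lines. The following is an illustrative numpy sketch, not the repo's code: `attn` and `mlp` are stand-ins for the real sublayers, and only the layer indices (recurrence over layers 4–5, parallel residuals from layer 7) follow the settings above.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, D = 12, 64
RECUR_LAYERS = (4, 5)          # RECUR_LAYERS=4,5: run this pair a second time
PARALLEL_START = 7             # PARALLEL_START_LAYER=7

# Stand-ins for per-layer attention / MLP sublayers (illustrative only).
Wa = rng.normal(0, 0.02, (N_LAYERS, D, D))
Wm = rng.normal(0, 0.02, (N_LAYERS, D, D))
attn = lambda i, x: np.tanh(x @ Wa[i])
mlp = lambda i, x: np.tanh(x @ Wm[i])

def block(i, x):
    if i >= PARALLEL_START:
        # Parallel residuals: attention and MLP read the SAME input and
        # their outputs are summed into a single residual update.
        return x + attn(i, x) + mlp(i, x)
    # Standard sequential residual block.
    x = x + attn(i, x)
    return x + mlp(i, x)

def forward(x):
    for i in range(N_LAYERS):
        x = block(i, x)
        if i == RECUR_LAYERS[-1]:
            # Depth recurrence: reuse layers 4 and 5 with shared weights,
            # adding depth at zero parameter cost.
            for j in RECUR_LAYERS:
                x = block(j, x)
    return x

out = forward(rng.normal(0, 1.0, (8, D)))
```

Weight sharing is what makes depth recurrence free under the 16 MB artifact budget: the recurred layers contribute compute but no extra parameters.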

Evaluation: Causal SLOT (context-only delta optimization)

Per-batch additive delta (dim=512) optimized with AdamW (lr=0.008, 16 steps) on context-only positions. Only already-scored tokens contribute to the optimization loss. Delta re-initialized per batch. Model weights completely frozen.

Provably causal: delta depends only on x_1,...,x_{t-64} (all previously scored). New positions scored with adapted delta but excluded from optimization. Same causal guarantee as score-first TTT but via delta optimization instead of weight updates.
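The loop above can be sketched end-to-end. This is a minimal numpy illustration of the mechanism as described, not the submission's code: a frozen linear head stands in for the model, plain gradient descent stands in for AdamW, and all dimensions and names are stand-ins except `SLOT_STEPS=16`, `SLOT_LR=0.008`, and the 512-dim delta.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V = 128, 512, 4096           # window length, hidden dim, vocab size
CONTEXT = T - 64                   # positions already scored in prior windows

W = rng.normal(0, 0.02, (D, V))    # frozen "model" head (stand-in)
h = rng.normal(0, 1.0, (T, D))     # hidden states for the current window
y = rng.integers(0, V, T)          # target tokens

def ce_and_grad(delta, idx):
    """Cross-entropy over positions idx, and its gradient w.r.t. delta."""
    logits = (h[idx] + delta) @ W                  # additive delta, shared across positions
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(len(idx)), y[idx]]).mean()
    p[np.arange(len(idx)), y[idx]] -= 1.0          # softmax minus one-hot
    grad = (p / len(idx)) @ W.T                    # chain rule back to delta
    return loss, grad.sum(axis=0)

ctx = np.arange(CONTEXT)           # optimization loss sees ONLY these positions
new = np.arange(CONTEXT, T)        # scored with the delta, never optimized on

delta = np.zeros(D)                # re-initialized per batch
loss0, _ = ce_and_grad(delta, ctx)
for _ in range(16):                # SLOT_STEPS=16
    _, g = ce_and_grad(delta, ctx)
    delta -= 0.008 * g             # SLOT_LR=0.008 (plain GD instead of AdamW)
loss1, _ = ce_and_grad(delta, ctx)

new_loss, _ = ce_and_grad(delta, new)  # score new tokens with the adapted delta
```

Note that `W` and `h` are never updated, only `delta`, which mirrors the frozen-weights constraint above.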

Source: arXiv:2505.12392v2, PR #1306 @resouer (causal variant), PR #1176 @bigbag.

Compliance

  • Condition 1: delta from context-only (already scored) positions. No future token access.
  • Condition 2: standard softmax over full 4096-token vocab
  • Condition 3: new tokens scored AFTER delta optimization on context
  • Condition 4: single left-to-right pass, no rescoring
  • Total eval: ~520s (within 600s budget)

Reproduction

```bash
pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
SLOT_ENABLED=1 SLOT_LR=0.008 SLOT_STEPS=16 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee, PR #1306 @resouer, PR #1176 @bigbag

@aryanbhosale
Contributor Author

Self-assessment: Causal SLOT legality

I want to be transparent about the legality of this submission and invite community review.

This submission uses Causal SLOT — a context-only variant of SLOT where the delta vector is optimized on already-scored positions only. Standard SLOT was proven to violate Condition 1 by PR #1240 (100% violation rate). Causal SLOT restricts optimization to context-only positions, which should fix the causal violation.

Why I think it's legal

The delta at position t depends only on tokens x_1,...,x_{t-64} (all scored in prior windows). The gradient of the context-only loss w.r.t. delta flows only through context positions — new positions contribute zero gradient because of the context mask. compute_logits is position-independent (RMSNorm normalizes per-feature, linear projection is per-position), so no cross-position leakage in the gradient.

This is the same causal guarantee as score-first TTT: adapt on scored tokens, apply to future predictions. The only difference is delta optimization (512 dims) vs weight updates (34M params).
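The zero-gradient claim can be checked numerically. The toy example below (stand-in dimensions, evaluated at delta = 0) computes each position's contribution to d(context-loss)/d(delta) and shows that masked "new" positions contribute exactly zero, because the masked loss never reads them.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, V = 8, 16, 32
ctx_mask = np.arange(T) < T - 2          # last 2 positions are "new"

W = rng.normal(0, 0.1, (D, V))           # frozen stand-in head
h = rng.normal(0, 1.0, (T, D))
y = rng.integers(0, V, T)

# Per-position contribution to d(context-loss)/d(delta) at delta = 0:
# grad_t = W @ (softmax_t - onehot_t) / n_ctx if t is context, else 0.
logits = h @ W
p = np.exp(logits - logits.max(1, keepdims=True))
p /= p.sum(1, keepdims=True)
p[np.arange(T), y] -= 1.0                # softmax minus one-hot
per_pos = (p @ W.T) * ctx_mask[:, None] / ctx_mask.sum()

grad = per_pos.sum(0)                    # total gradient w.r.t. the shared delta
```

The mask zeroes the new positions' rows before they ever enter the sum, so the optimized delta is a function of context positions only.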

Why it might not be legal

  1. No explicit ruling from @0hq or @valerio-oai on causal SLOT specifically
  2. PR #1240 ("Does SLOT violate causal dependence?", empirical test + question) raised concerns about SLOT broadly, though its test targeted the non-causal variant
  3. The delta is a "free parameter" optimized per-batch on val data — purists may object even if it's backward-looking
  4. The -0.013 BPB gain is large enough to look suspicious

Request

@0hq @valerio-oai — could you weigh in on whether context-only SLOT (where optimization loss uses ONLY already-scored positions) satisfies the four conditions from Issue #1017? Several submissions use this approach (PRs #1306, #1322, #1324, and now this one).

I also have a fully legal Track A submission at PR #1334 (1.0897 BPB, no SLOT, no TTT, no eval-time adaptation) as a fallback.

@newjordan

> I also have a fully legal Track A submission at PR #1334 (1.0897 BPB, no SLOT, no TTT, no eval-time adaptation) as a fallback.

Did the 4096 vocab get approved? I remember custom token sets needed approval before adoption, was there any official movement on it, or did everyone just adopt it right away?

@aryanbhosale
Contributor Author

@newjordan Good question. The sp4096 tokenizer wasn't individually "approved" — it was introduced by @clarkkev in PR #1218 with data hosted on their HF repo (kevclark/parameter-golf). The README rule (criterion 2) says: "If changes are made to the tokenizer or dataset, prove with certainty that the val_bpb is correctly calculated. Submissions that edit the tokenizer will be examined much more carefully."

I didn't create a new tokenizer — I'm using the same sp4096 SentencePiece BPE model from @clarkkev's export, same as PRs #1218, #1285, #1287, #1291, and several others. The byte-accounting uses the standard build_sentencepiece_luts() function from the codebase, which correctly handles leading-space bytes, byte-fallback tokens, and boundary tokens for any SentencePiece vocab. It's not a custom tokenizer like Scylla (TokenMonster) — it's just a larger standard SentencePiece BPE trained on the same FineWeb docs.

That said, I don't think there's been an explicit "sp4096 is approved" statement from the maintainers. It's been adopted organically by ~8 PRs at this point. The competition description does say "tokenizer-agnostic" and encourages "novel tokenizers" as a valid approach.
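The byte-accounting in question reduces to one formula: val_bpb = (total NLL in nats) / (ln 2 x total bytes), with per-token byte counts taken from a lookup table such as the one `build_sentencepiece_luts()` produces. A hypothetical tiny LUT makes the arithmetic concrete; the byte counts below are made up for illustration.

```python
import numpy as np

# Hypothetical per-token byte counts. In the real code these come from
# build_sentencepiece_luts(), which handles leading-space bytes,
# byte-fallback tokens, and boundary tokens.
byte_lut = np.array([4, 3, 5, 4])        # bytes contributed by each token

# Per-token negative log-likelihoods in NATS. A uniform distribution
# over a 4096-token vocab gives ln(4096) nats per token.
nll_nats = np.full(4, np.log(4096.0))

val_bpb = nll_nats.sum() / (np.log(2.0) * byte_lut.sum())
# ln(4096)/ln(2) = 12 bits per token; 4 tokens -> 48 bits over 16 bytes
# -> 3.0 bits per byte
```

Because the denominator counts raw bytes, the metric is comparable across tokenizers of any vocab size, which is what makes the sp4096 swap legal under the "tokenizer-agnostic" framing.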

@newjordan

> I didn't create a new tokenizer — I'm using the same sp4096 SentencePiece BPE model from @clarkkev's export, same as PRs #1218, #1285, #1287, #1291, and several others. […]

Hope it goes through! Looks great and is looking to be strong. I've got a couple tricks to pull on the 11x but it's a hell of a time trying to keep up with the bob on the old engine.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 7, 2026
…two-track strategy

Critical findings from Issue openai#140 full thread analysis:
- Issue openai#140 CLOSED by @notapplica on Apr 6
- @valerio-oai NEVER commented in Issue openai#140; all rulings via PRs + Issue openai#677
- SLOT has never been officially banned: 9 open record PRs use SLOT variants
- PR openai#1333 (aryanbhosale, Causal SLOT-16): 1.0766 BPB — new best open record
- PR openai#1229 (scored-position SLOT): 0.9300 BPB — open, no rejection
- Strategy: Track A (safe: PR openai#1437 stack + TTT → ~1.078) + Track B (Causal SLOT-16 → ~1.076)
- SLOT status in CLAUDE.md updated from BLOCKED to DE FACTO IN USE

https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
Novel: Context-only delta optimization during eval. Per-batch additive
delta (512-dim) optimized with AdamW on ONLY already-scored positions.
New positions scored with optimized delta. Model weights frozen.

Fixes openai#1229's minibatch leakage: context = positions scored in PREVIOUS
windows only. No cross-window contamination within current batch.

Same compliance pattern as score-first TTT (openai#549/openai#1413).
Based on openai#1333's proven causal SLOT mechanism (-0.013 BPB on SP4096).

Stack: R12 SP8192 + score-first TTT + hash embedding + causal SLOT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
Base: PR openai#1333 (SP4096 + DR + PR + Causal SLOT-16, 1.0766 BPB)
Hardcoded: SLOT_ENABLED=1, SLOT_STEPS=16, SLOT_LR=0.008

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
The torch.compile conflict was because SLOT ran after the model was
compiled for forward_logits. Now SLOT runs FIRST on the uncompiled
model, then standard evals run after with fresh compilation.

SLOT uses torch.compile(forward_hidden, fullgraph=True) on the
uncompiled model — same pattern as openai#1333's original working code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
Novel mechanism: per-window ephemeral hash table (128 buckets × 512d)
indexed by prefix bigram hash, co-optimized alongside SLOT's global
delta in the same AdamW loop. Adds position-specific hidden corrections
on top of SLOT's window-global delta.

No PR combines a hashed hidden residual with causal SLOT.
Nearest: openai#1333 (global delta only), openai#1460 (input-space hash, no SLOT).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
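The indexing scheme in that commit can be sketched briefly. This is an illustrative guess at the shape of the mechanism, not the commit's code: the table sizes (128 buckets x 512d) follow the commit message, but the hash mixing function is a placeholder the commit does not specify.

```python
import numpy as np

N_BUCKETS, D = 128, 512
table = np.zeros((N_BUCKETS, D))   # ephemeral: re-initialized per window

def bucket(prev2, prev1):
    # Placeholder prefix-bigram hash; the actual mixing function used
    # in the commit is not specified.
    return (prev2 * 1000003 + prev1) % N_BUCKETS

def correction(tokens, t, delta):
    # Position-specific correction added on top of SLOT's window-global
    # delta; the table rows would be co-optimized with delta by AdamW.
    return delta + table[bucket(tokens[t - 2], tokens[t - 1])]

tokens = [5, 17, 9, 17, 9]
delta = np.zeros(D)
c = correction(tokens, 4, delta)
```

Because the bucket index depends only on the two preceding tokens, the correction inherits the same backward-looking property as the global delta.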