Record: SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT-16 — val_bpb 1.0766 (3-seed mean) #1333

Open

aryanbhosale wants to merge 1 commit into openai:main from aryanbhosale:submission/sp4096-causal-slot
Conversation

@aryanbhosale

Record: SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT-16

val_bpb = 1.0766 (3-seed mean, std 0.0004) | ~16.00 MB | 8×H100 SXM

3-Seed Results

| Seed | Sliding BPB | Causal SLOT BPB | SLOT gain | Artifact (bytes) |
|------|-------------|-----------------|-----------|------------------|
| 42   | 1.0893      | 1.0762          | -0.0131   | 15,999,461       |
| 314  | 1.0897      | 1.0766          | -0.0131   | 15,997,932       |
| 999  | 1.0897      | 1.0770          | -0.0127   | 15,994,941       |
| Mean |             | 1.0766          | -0.0130   |                  |

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0381 BPB.

Training (6 techniques)

  1. 4096-Vocab + MLP 4x + WD 0.090 — PR #1218 @clarkkev, PR #1285 @dexhunter
  2. Depth Recurrence (layers 4,5) — PR #1204 @msisovic, PR #1260 @dexhunter
  3. Parallel Residuals (from layer 7) — PR #1204 @msisovic, PR #1289 @MatoTeziTanka
  4. MuonEq-R — arXiv:2603.28254, PR #1260 @dexhunter
  5. QK-Gain 5.0 — PR #1217 @bigbag
  6. Full GPTQ int6 + Brotli + LZMA compressed wrapper
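For readers unfamiliar with techniques 2 and 3, here is a dependency-free toy sketch of what the settings `RECUR_LAYERS=4,5` and `PARALLEL_START_LAYER=7` imply. All names are illustrative, not taken from `train_gpt.py`:

```python
# Hypothetical sketch of depth recurrence and parallel residuals.

def layer_schedule(n_layers, recur_layers, n_recur=1):
    """Return the sequence of layer indices executed in one forward pass.

    Depth recurrence re-runs `recur_layers` `n_recur` extra times with the
    same weights, so effective depth grows at zero parameter cost.
    """
    schedule = []
    for i in range(n_layers):
        schedule.append(i)
        if i == recur_layers[-1]:
            for _ in range(n_recur):
                schedule.extend(recur_layers)
    return schedule

def parallel_block(x, attn, mlp):
    """Parallel residual: attention and MLP both read the same input and
    their outputs are summed, instead of the sequential
    x -> x + attn(x) -> (...) + mlp(...) layout."""
    return x + attn(x) + mlp(x)

print(layer_schedule(12, (4, 5)))
# -> [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10, 11]  (layers 4,5 run twice)
```

The actual implementations in the cited PRs may differ in where the extra recurrence starts (`RECUR_START_STEP=3000` suggests it is enabled partway through training).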

Evaluation: Causal SLOT (context-only delta optimization)

Per-batch additive delta (dim=512) optimized with AdamW (lr=0.008, 16 steps) on context-only positions. Only already-scored tokens contribute to the optimization loss. Delta re-initialized per batch. Model weights completely frozen.

Provably causal: delta depends only on x_1,...,x_{t-64} (all previously scored). New positions scored with adapted delta but excluded from optimization. Same causal guarantee as score-first TTT but via delta optimization instead of weight updates.
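A dependency-free toy sketch of this loop, with plain SGD standing in for AdamW, a scalar standing in for the 512-dim delta, and squared error standing in for the language-model loss. Only the structure is the point: the loss sums over context positions only, and new positions are scored with the adapted delta but excluded from optimization.

```python
# Toy Causal SLOT loop (illustrative names, not from the real codebase).

def causal_slot_step(hidden, targets, n_context, lr=0.1, steps=16):
    """Optimize an additive delta on already-scored positions, then score
    the remaining positions with it. hidden[t] + delta stands in for the
    frozen model's prediction at position t."""
    delta = 0.0  # re-initialized per batch
    for _ in range(steps):
        # gradient of sum_{t < n_context} (hidden[t] + delta - targets[t])^2
        grad = sum(2 * (hidden[t] + delta - targets[t]) for t in range(n_context))
        delta -= lr * grad / n_context
    # new positions: scored with the adapted delta, excluded from the loss
    new_scores = [hidden[t] + delta for t in range(n_context, len(hidden))]
    return delta, new_scores

delta, scores = causal_slot_step([0.0] * 4, [1.0, 1.0, 1.0, 9.0], n_context=3)
# delta converges toward the context targets (1.0); the wild future
# target 9.0 has no influence on the optimization.
```

In the submission itself the model weights stay frozen throughout; only the 512-dim delta is updated.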

Source: arXiv:2505.12392v2, PR #1306 @resouer (causal variant), PR #1176 @bigbag.

Compliance

  • Condition 1: delta from context-only (already scored) positions. No future token access.
  • Condition 2: standard softmax over full 4096-token vocab
  • Condition 3: new tokens scored AFTER delta optimization on context
  • Condition 4: single left-to-right pass, no rescoring
  • Total eval: ~520s (within 600s budget)

Reproduction

pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
SLOT_ENABLED=1 SLOT_LR=0.008 SLOT_STEPS=16 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee, PR #1306 @resouer, PR #1176 @bigbag

Commit: …ausal SLOT — val_bpb 1.0766 (3-seed mean)

8 train-time techniques + causal context-only SLOT at eval.
3-seed mean: 1.0766 BPB, delta -0.0381 vs merged SOTA.
@aryanbhosale
Author

Self-assessment: Causal SLOT legality

I want to be transparent about the legality of this submission and invite community review.

This submission uses Causal SLOT — a context-only variant of SLOT where the delta vector is optimized on already-scored positions only. Standard SLOT was proven to violate Condition 1 by PR #1240 (100% violation rate). Causal SLOT restricts optimization to context-only positions, which should fix the causal violation.

Why I think it's legal

The delta at position t depends only on tokens x_1,...,x_{t-64} (all scored in prior windows). The gradient of the context-only loss w.r.t. delta flows only through context positions; new positions contribute zero gradient because of the context mask. compute_logits mixes no information across positions (RMSNorm normalizes each position over its own features, and the output projection is applied per position), so there is no cross-position leakage in the gradient.

This is the same causal guarantee as score-first TTT: adapt on scored tokens, apply to future predictions. The only difference is delta optimization (512 dims) vs weight updates (34M params).
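The no-leakage claim can be checked mechanically: a loss restricted to context positions is literally a function that never reads future targets, so perturbing them changes neither the loss nor the optimized delta. A toy, pure-Python stand-in for that gradcheck (illustrative names only):

```python
# Finite check of the no-leakage argument: the context-only loss is
# unchanged when a to-be-scored (non-context) target is perturbed.

def context_loss(delta, hidden, targets, n_context):
    """Squared-error loss restricted to already-scored positions."""
    return sum((hidden[t] + delta - targets[t]) ** 2 for t in range(n_context))

hidden = [0.1, -0.2, 0.3, 0.0, 0.5]
targets_a = [1.0, 0.5, -0.3, 100.0, -100.0]  # wild "future" targets
targets_b = [1.0, 0.5, -0.3, 0.0, 0.0]       # same context, tame future
for delta in (-1.0, 0.0, 0.7):
    assert context_loss(delta, hidden, targets_a, 3) == \
           context_loss(delta, hidden, targets_b, 3)
```

For the real submission the equivalent check would be a gradcheck on the actual SLOT loss, confirming zero gradient from non-context positions.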

Why it might not be legal

  1. No explicit ruling from @0hq or @valerio-oai on causal SLOT specifically
  2. PR #1240 ("Does SLOT violate causal dependence?") raised concerns about SLOT broadly, though specifically about the non-causal variant
  3. The delta is a "free parameter" optimized per-batch on val data — purists may object even if it's backward-looking
  4. The -0.013 BPB gain is large enough to look suspicious

Request

@0hq @valerio-oai — could you weigh in on whether context-only SLOT (where optimization loss uses ONLY already-scored positions) satisfies the four conditions from Issue #1017? Several submissions use this approach (PRs #1306, #1322, #1324, and now this one).

I also have a fully legal Track A submission at PR #1334 (1.0897 BPB, no SLOT, no TTT, no eval-time adaptation) as a fallback.

@newjordan

> I also have a fully legal Track A submission at PR #1334 (1.0897 BPB, no SLOT, no TTT, no eval-time adaptation) as a fallback.

Did the 4096 vocab get approved? I remember that custom token sets needed approval before adoption. Was there any official movement on it, or did everyone just adopt it right away?

@aryanbhosale
Author

@newjordan Good question. The sp4096 tokenizer wasn't individually "approved" — it was introduced by @clarkkev in PR #1218 with data hosted on their HF repo (kevclark/parameter-golf). The README rule (criterion 2) says: "If changes are made to the tokenizer or dataset, prove with certainty that the val_bpb is correctly calculated. Submissions that edit the tokenizer will be examined much more carefully."

I didn't create a new tokenizer — I'm using the same sp4096 SentencePiece BPE model from @clarkkev's export, same as PRs #1218, #1285, #1287, #1291, and several others. The byte-accounting uses the standard build_sentencepiece_luts() function from the codebase, which correctly handles leading-space bytes, byte-fallback tokens, and boundary tokens for any SentencePiece vocab. It's not a custom tokenizer like Scylla (TokenMonster) — it's just a larger standard SentencePiece BPE trained on the same FineWeb docs.

That said, I don't think there's been an explicit "sp4096 is approved" statement from the maintainers. It's been adopted organically by ~8 PRs at this point. The competition description does say "tokenizer-agnostic" and encourages "novel tokenizers" as a valid approach.
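For anyone auditing criterion 2, the core of the byte accounting is simple even though the LUT construction is not: total NLL in nats over all tokens, converted to bits and divided by the UTF-8 byte count of the decoded text. A hedged sketch (the per-token byte counts would come from the real `build_sentencepiece_luts()`; here they are supplied directly):

```python
import math

def val_bpb(nll_nats, token_bytes):
    """Bits-per-byte: sum of per-token NLL (nats) / (ln 2 * total bytes)."""
    return sum(nll_nats) / (math.log(2) * sum(token_bytes))

# e.g. 3 tokens, each with NLL of ln(2) nats (= 1 bit), each decoding
# to 2 UTF-8 bytes: 3 bits over 6 bytes.
print(val_bpb([math.log(2)] * 3, [2, 2, 2]))  # -> 0.5
```

Any tokenizer change only moves the `token_bytes` side of this ratio, which is why correct byte counts (leading spaces, byte-fallback, boundary tokens) are the whole audit.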

@newjordan

> That said, I don't think there's been an explicit "sp4096 is approved" statement from the maintainers. It's been adopted organically by ~8 PRs at this point.

Hope it goes through! Looks great and is shaping up to be strong. I've got a couple of tricks to pull on the 11x, but it's a hell of a time trying to keep up with the bob on the old engine.
