
Non-record: Does SLOT violate causal dependence? (empirical test + question) #1240

Open
andrewbaggio1 wants to merge 2 commits into openai:main from
andrewbaggio1:nonrecord/slot-causal-dependence-analysis

@andrewbaggio1

What this is

I wrote a script with a coding agent that tests whether SLOT violates causal
dependence. I think the answer is yes.

My Test

  • Take a window of tokens, run SLOT, and record the NLL at each scored position.
  • Flip a single target token somewhere in the window and re-run SLOT.
  • If the NLL changes at other positions, those predictions depended on the token
    we flipped.

Without SLOT, changing a target has no effect on other positions because the
model is causal. With SLOT, every single scored position is affected, and I
found a 100% violation rate across 240 tested pairs.
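
The flip test above can be sketched on a toy problem. This is not the script from this PR. It is a minimal illustration under loud assumptions: the causal model is replaced by fixed random hidden states and a random output matrix, the SLOT step is a single shared `delta` added to every hidden state, and plain gradient descent stands in for AdamW. All names (`nll_per_position`, `fit_delta`, `flip_test`) are hypothetical.

```python
import numpy as np

def nll_per_position(hidden, w_out, targets, delta):
    """Per-position NLL of `targets` under logits = (hidden + delta) @ w_out."""
    logits = (hidden + delta) @ w_out
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

def fit_delta(hidden, w_out, targets, steps=8, lr=0.5):
    """Toy stand-in for SLOT: fit one shared delta on the window's targets.

    Assumption: plain gradient descent on the mean NLL, not AdamW.
    """
    delta = np.zeros(hidden.shape[1])
    onehot = np.eye(w_out.shape[1])[targets]
    for _ in range(steps):
        logits = (hidden + delta) @ w_out
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        # d(mean NLL)/d(delta) = mean over positions of (probs - onehot) @ w_out.T
        grad = ((probs - onehot) @ w_out.T).mean(axis=0)
        delta = delta - lr * grad
    return delta

def flip_test(seed=0, n_pos=16, dim=8, vocab=32, flip_pos=5):
    """Flip one target and measure the NLL change at every OTHER position."""
    rng = np.random.default_rng(seed)
    hidden = rng.normal(size=(n_pos, dim))   # stand-in for causal model outputs
    w_out = rng.normal(size=(dim, vocab))
    targets = rng.integers(0, vocab, size=n_pos)
    flipped = targets.copy()
    flipped[flip_pos] = (targets[flip_pos] + 1) % vocab
    others = np.arange(n_pos) != flip_pos

    # Baseline: no adaptation, so the targets never feed back into the logits.
    zero = np.zeros(dim)
    base_diff = np.abs(nll_per_position(hidden, w_out, targets, zero)
                       - nll_per_position(hidden, w_out, flipped, zero))[others]

    # Toy SLOT: delta is fit on the targets, then every position is scored.
    d_orig = fit_delta(hidden, w_out, targets)
    d_flip = fit_delta(hidden, w_out, flipped)
    slot_diff = np.abs(nll_per_position(hidden, w_out, targets, d_orig)
                       - nll_per_position(hidden, w_out, flipped, d_flip))[others]

    return base_diff.max(), (slot_diff > 1e-9).mean()

if __name__ == "__main__":
    base_max, slot_frac = flip_test()
    print("max NLL change without adaptation:", base_max)
    print("fraction of other positions changed with toy SLOT:", slot_frac)
```

In this toy, the hidden states are held fixed when the target is flipped, so any change at other positions can only come through the fitted `delta`. That isolates the same dependence the test is looking for, but it says nothing about the size of the effect in the real setup.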

I also checked self-prediction: is the log-probability of x_{t+1} higher when
x_{t+1} is actually in the optimization targets than when it's swapped for a
random token? Yes, by 0.24 nats (shared delta) and 0.73 nats (per-sample delta
+ logit bias).

I'm flagging my own submission (PR #1209).

I looked at PR #1229, which pushes SLOT to its logical extreme: per-sample
delta, per-sample logit bias, and scored-position masking. It gets 0.9300 BPB,
which is the best on the non-verified leaderboard and 0.19 below merged SOTA.
In defense of it, the mechanism is the same as shared-delta SLOT, just with
more parameters to memorize the evaluation targets.

Counterargument

@AnubhavBharadwaaj correctly points out that in a stride=64 sliding window,
1984 of the 2048 tokens in each window are already-scored context, so ~97% of
the gradient comes from known tokens. I think this is a fair point: the leakage
in shared-delta SLOT is small. But it's not zero, and "a little bit of future
information" is still future information.
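
The ~97% figure is just the window arithmetic, sketched below. It assumes each 2048-token window newly scores only its last `stride` tokens and that every position contributes equally to the loss. The function name is hypothetical.

```python
def known_context_fraction(window: int = 2048, stride: int = 64) -> float:
    """Fraction of a sliding window that was already scored by earlier windows."""
    newly_scored = stride                    # tokens scored for the first time
    already_scored = window - newly_scored   # 2048 - 64 = 1984
    return already_scored / window

print(known_context_fraction())  # 1984 / 2048 = 0.96875
```

Under those assumptions the remaining 64/2048, about 3%, is the share of the loss that comes from not-yet-scored targets.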

Reproducing

# No GPU needed. ~30 seconds on CPU.
python prove_slot_causal_violation.py

Request for a decision

@0hq @valerio-oai SLOT has been debated across PRs #1084, #1128, #1172,
#1176, #1209 without a ruling. I'd really appreciate you weighing in!

The full writeup is in the README, which was generated by Claude Code.

andrewbaggio1 and others added 2 commits March 31, 2026 23:48
Combines Full Hessian GPTQ, legal score-first chunked TTT (3 epochs),
and SLOT delta optimization (8 AdamW steps per batch). All eval-time
techniques are single-pass, score-before-update compliant.

3-seed mean: 1.1064 +/- 0.0004 BPB on 8xH100 SXM.
Beats verified SOTA (openai#1019, 1.1147) by 0.0083 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Empirical test showing SLOT violates causal dependence (Issue openai#1017 Rule 1).
Includes reproducible proof script, output log, and full writeup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>