Question
Several recent submissions use a context-only variant of SLOT where the delta vector is optimized using only already-scored positions (not the new positions being evaluated). Is this legal under the four conditions from Issue #1017?
What standard SLOT does (proven illegal by PR #1240)
Optimizes delta using loss from ALL positions in the window, including new (unscored) tokens. PR #1240 showed this violates Condition 1 with a 100% violation rate — flipping a target token changes predictions at other positions.
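The leak described above can be reproduced on a toy problem. The sketch below is not the code from PR #1240: it assumes a linear-only head (no RMSNorm), plain gradient descent instead of AdamW, and a hypothetical 6-token window with made-up hidden states. The names `predict` and `fit_delta_all` are illustrative only.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, h, delta):
    """Class probabilities from a linear head applied to h + delta."""
    x = [a + b for a, b in zip(h, delta)]
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def fit_delta_all(W, H, labels, steps=30, lr=0.5):
    """Standard-SLOT stand-in: descend the mean loss over ALL positions."""
    delta = [0.0] * len(H[0])
    for _ in range(steps):
        grad = [0.0] * len(delta)
        for h, y in zip(H, labels):
            p = predict(W, h, delta)
            p[y] -= 1.0  # d(cross-entropy)/d(logits) = softmax - onehot
            for k, row in enumerate(W):
                for d in range(len(delta)):
                    grad[d] += p[k] * row[d] / len(H)
        delta = [dv - lr * g for dv, g in zip(delta, grad)]
    return delta

# Toy head (3 classes, dim 2) and a 6-token window; all values are made up.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
H = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1], [0.0, 0.6], [0.3, 0.3], [-0.2, -0.5]]
labels = [0, 1, 2, 1, 0, 2]   # positions 4 and 5 play the role of "new" tokens
flipped = labels[:5] + [0]    # flip only the target at position 5

p_before = predict(W, H[4], fit_delta_all(W, H, labels))
p_after = predict(W, H[4], fit_delta_all(W, H, flipped))
print(p_before, p_after)
```

Because position 5 contributes its own term to the gradient, changing its target changes delta, and with it the prediction at position 4.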
What context-only (causal) SLOT does
For each sliding window (2048 tokens, stride=64):
Compute frozen hidden states H under no_grad
Context positions = 0..1983 (already scored in prior windows), new positions = 1984..2047
Optimize delta (dim=512) using AdamW on context-only positions' loss for N steps
Score new positions with H + delta
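The steps above can be sketched end to end on a toy stream. This is a minimal illustration under loud assumptions, not any submission's implementation: the head is linear-only (the real `compute_logits` also applies RMSNorm), plain gradient descent replaces AdamW, hidden states are fixed per position rather than recomputed per window, the window is 8 tokens with stride 2 instead of 2048 and 64, and the first window is scored without adaptation because it has no scored context yet. All function names are hypothetical.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def head(W, x):
    """Linear-only stand-in for compute_logits."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def nll(W, h, delta, y):
    """Negative log-likelihood of label y at one position, scored with h + delta."""
    x = [a + b for a, b in zip(h, delta)]
    return -math.log(softmax(head(W, x))[y])

def fit_delta(W, H, labels, ctx, steps=20, lr=0.5):
    """Descend the mean cross-entropy over context positions only."""
    ctx = list(ctx)
    delta = [0.0] * len(H[0])
    for _ in range(steps):
        grad = [0.0] * len(delta)
        for t in ctx:
            x = [a + b for a, b in zip(H[t], delta)]
            p = softmax(head(W, x))
            p[labels[t]] -= 1.0  # softmax - onehot
            for k, row in enumerate(W):
                for d in range(len(delta)):
                    grad[d] += p[k] * row[d] / len(ctx)
        delta = [dv - lr * g for dv, g in zip(delta, grad)]
    return delta

def score_stream(W, H, labels, window=8, stride=2):
    zero = [0.0] * len(H[0])
    # Assumption: the first window has no scored context, so it is scored unadapted.
    losses = {t: nll(W, H[t], zero, labels[t]) for t in range(window)}
    start = stride
    while start + window <= len(H):
        split = start + window - stride                        # first new position
        delta = fit_delta(W, H, labels, range(start, split))   # context only
        for t in range(split, start + window):                 # score new positions
            losses[t] = nll(W, H[t], delta, labels[t])
        start += stride
    return losses

rng = random.Random(0)
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
H = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(14)]
labels = [rng.randrange(3) for _ in range(14)]
losses = score_stream(W, H, labels)
print(sum(losses.values()) / len(losses))
```

In this toy, changing the label of the final token leaves every other token's score untouched, since that label never enters a delta fit before the token is scored.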
The delta gradient depends only on context hidden states and context labels. Because compute_logits is applied independently at each position (RMSNorm + linear), the loss is a sum of per-position terms, so positions excluded from the loss (the new ones) contribute nothing to the gradient.
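The independence claim can be written out explicitly. As a sketch, assume the delta enters before the output head, so position i has logits z_i(δ) = W·RMSNorm(h_i + δ), and assume the context loss is a mean of per-position cross-entropies over the context set C. The symbols W, C, J_i, p_i and e_{y_i} are notation introduced here for illustration, not taken from any submission.

```latex
\mathcal{L}_C(\delta) = \frac{1}{|C|} \sum_{i \in C} \mathrm{CE}\bigl(z_i(\delta),\, y_i\bigr),
\qquad z_i(\delta) = W\,\mathrm{RMSNorm}(h_i + \delta)

\nabla_\delta \mathcal{L}_C
  = \frac{1}{|C|} \sum_{i \in C} J_i(\delta)^{\top} W^{\top} \bigl(p_i - e_{y_i}\bigr),
\qquad p_i = \mathrm{softmax}\bigl(z_i(\delta)\bigr),
\quad J_i(\delta) = \left.\frac{\partial\,\mathrm{RMSNorm}(u)}{\partial u}\right|_{u = h_i + \delta}
```

Every term is indexed by i in C, so no label y_j or hidden state h_j from a new position j appears. The argument also relies on the frozen model being causal, so that h_i at a context position does not itself depend on the new tokens.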
Arguments for legality
Same principle as TTT: adapt on scored tokens, apply to future predictions. @0hq confirmed TTT is legal: "You're allowed to use any preceding tokens from the evaluation set that you've already been tested on in any way you'd like."
Condition 1 satisfied: delta depends only on x_1,...,x_{t-64} (all previously scored). No future token access.
Condition 3 satisfied: new tokens scored AFTER delta optimization on context.
Model weights frozen — only a per-batch 512-dim delta vector is optimized.
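The ordering claims for Conditions 1 and 3 reduce to index bookkeeping, which can be checked mechanically. The sketch below walks the sliding-window schedule and verifies that every context position was scored in an earlier window and that each new position is scored exactly once. It assumes the first window is scored in full without adaptation (the issue does not say how the first window is handled), and `check_schedule` is a hypothetical helper name.

```python
def check_schedule(total_len, window=2048, stride=64):
    """Return the number of adapted windows; raise if the schedule is not causal."""
    # Assumption: the first window is scored in full, with no delta fit.
    scored = set(range(window))
    start, adapted = stride, 0
    while start + window <= total_len:
        split = start + window - stride        # window-relative index 1984 by default
        context = range(start, split)
        new = range(split, start + window)
        # Delta is fit first, so every context label must already have been scored.
        missing = [t for t in context if t not in scored]
        if missing:
            raise ValueError(f"context token {missing[0]} not yet scored")
        # Only afterwards are the new positions scored, each exactly once.
        if any(t in scored for t in new):
            raise ValueError("a new position was already scored")
        scored.update(new)
        start += stride
        adapted += 1
    return adapted

print(check_schedule(2048 + 10 * 64))  # 10 adapted windows
```

This only checks which labels are visible at fit time. It says nothing about whether sharing a fitted delta between context and new positions is acceptable, which is the open question for maintainers.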
Arguments against legality
No explicit ruling on SLOT variants from maintainers
The delta is shared across all positions — it implicitly carries information from context to new positions (but TTT weight updates do the same thing)
Large BPP gains (-0.01 to -0.03) from what is essentially fitting a parameter to val data at eval time
Request
@0hq @valerio-oai: a ruling on context-only SLOT would help the community. If it's illegal, ~10+ open PRs need to be flagged. If it's legal, participants can use it with confidence.
Key question: Is the distinction between "standard SLOT" (optimize on all positions) and "causal SLOT" (optimize only on already-scored positions) meaningful for legality?