Legality question: Is context-only (causal) SLOT legal? #1336

@aryanbhosale

Question

Several recent submissions use a context-only variant of SLOT where the delta vector is optimized using only already-scored positions (not the new positions being evaluated). Is this legal under the four conditions from Issue #1017?

What standard SLOT does (proven illegal by PR #1240)

Optimizes delta using loss from ALL positions in the window, including new (unscored) tokens. PR #1240 showed this violates Condition 1 with a 100% violation rate — flipping a target token changes predictions at other positions.

What context-only (causal) SLOT does

For each sliding window (2048 tokens, stride=64):

  1. Compute frozen hidden states H under no_grad
  2. Context positions = 0..1983 (already scored in prior windows), new positions = 1984..2047
  3. Optimize delta (dim=512) using AdamW on context-only positions' loss for N steps
  4. Score new positions with H + delta
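The index split in step 2 can be sketched as follows. This is a minimal illustration, not code from any submission: `split_window` and its parameter names are hypothetical, and it assumes positions are absolute token indices with the last `stride` positions of each window being the new ones. How the very first window is handled (where nothing has been scored yet) is not described in this issue and is not modelled here.

```python
def split_window(window_start, window_len=2048, stride=64):
    """Return (context, new) position ranges for one sliding window.

    Hypothetical helper: the last `stride` positions are the ones being
    scored for the first time; everything before them is context.
    """
    window_end = window_start + window_len
    first_new = window_end - stride
    context = range(window_start, first_new)
    new = range(first_new, window_end)
    return context, new


# Window starting at 0: context is 0..1983, new is 1984..2047.
context, new = split_window(0)
print(context[0], context[-1], new[0], new[-1])

# Sliding by one stride: the previous window's new positions are now context.
next_context, next_new = split_window(64)
print(set(new) <= set(next_context))
```

With these defaults each window has 1984 context positions and 64 new ones, and every position is new in exactly one window.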

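The delta gradient depends only on context hidden states and context labels. Because compute_logits (RMSNorm + linear) is applied independently at each position, the loss at a context position does not depend on the hidden state or label of any new position, so no information from new positions enters the gradient.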
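The dependence claim can be checked on a toy model. The sketch below is an assumption-laden stand-in, not the submissions' code: it uses pure Python, a plain linear readout without RMSNorm, ordinary gradient descent instead of AdamW, and random fixed hidden states. `fit_delta`, `logits`, and all sizes are hypothetical. It flips one target among the new positions and compares the fitted delta when the loss covers context positions only versus all positions. Hidden states are held fixed across the flip, which matches a causal model only for the context positions.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(h, delta, W):
    # Per-position readout: each position's logits depend only on its own h.
    x = [hi + di for hi, di in zip(h, delta)]
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fit_delta(H, labels, positions, W, steps=20, lr=0.1):
    """Fit one shared delta by gradient descent on mean cross-entropy
    over `positions` only (toy stand-in for the AdamW loop)."""
    dim = len(H[0])
    delta = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for p in positions:
            probs = softmax(logits(H[p], delta, W))
            for k, row in enumerate(W):
                err = probs[k] - (1.0 if k == labels[p] else 0.0)
                for j in range(dim):
                    grad[j] += err * row[j]
        delta = [d - lr * g / len(positions) for d, g in zip(delta, grad)]
    return delta


random.seed(0)
dim, vocab, n_ctx, n_new = 4, 5, 12, 4
W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
H = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_ctx + n_new)]
labels = [random.randrange(vocab) for _ in range(n_ctx + n_new)]
context = range(n_ctx)
everything = range(n_ctx + n_new)

# Flip one target token among the new (not yet scored) positions.
flipped = list(labels)
flipped[-1] = (labels[-1] + 1) % vocab

causal_same = fit_delta(H, labels, context, W) == fit_delta(H, flipped, context, W)
standard_same = fit_delta(H, labels, everything, W) == fit_delta(H, flipped, everything, W)
print(causal_same, standard_same)  # True False
```

In this toy setting the context-only delta is bit-for-bit unchanged by the flip, while the all-positions delta moves, which is the distinction the issue asks the maintainers to rule on. It does not by itself settle whether the real pipeline's hidden states or labels leak across the context/new boundary.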

Submissions using this approach

Arguments for legality

  • Same principle as TTT: adapt on scored tokens, apply to future predictions. @0hq confirmed TTT is legal: "You're allowed to use any preceding tokens from the evaluation set that you've already been tested on in any way you'd like."
  • Condition 1 satisfied: delta depends only on x_1,...,x_{t-64} (all previously scored). No future token access.
  • Condition 3 satisfied: new tokens scored AFTER delta optimization on context.
  • Model weights frozen — only a per-batch 512-dim delta vector is optimized.

Arguments against legality

  • No explicit ruling on SLOT variants from maintainers
  • PR #1240 ("Non-record: Does SLOT violate causal dependence? (empirical test + question)") raised concerns about SLOT broadly, though its test targeted the non-causal variant
  • The delta is shared across all positions — it implicitly carries information from context to new positions (but TTT weight updates do the same thing)
  • Large BPP gains (-0.01 to -0.03) from what is essentially fitting a parameter to val data at eval time

Request

@0hq @valerio-oai — a ruling on context-only SLOT would help the community. If it's illegal, ~10+ open PRs need to be flagged. If it's legal, participants can use it with confidence.

Key question: Is the distinction between "standard SLOT" (optimize on all positions) and "causal SLOT" (optimize only on already-scored positions) meaningful for legality?
