Legality question: Is context-only (causal) SLOT legal? #1336

## Question

Several recent submissions use a **context-only variant of SLOT** where the delta vector is optimized using only already-scored positions (not the new positions being evaluated). Is this legal under the four conditions from Issue #1017?

## What standard SLOT does (proven illegal by PR #1240)

Standard SLOT optimizes delta using the loss from ALL positions in the window, including the new (unscored) tokens. PR #1240 showed that this violates Condition 1 with a 100% violation rate: flipping a target token changes the predictions at other positions.
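
The flip test described above can be reproduced on a toy problem. This is a minimal sketch, not the code from PR #1240: the head is a bare linear layer (no RMSNorm), plain gradient descent stands in for AdamW, and `predict`, `fit_delta_all`, and all numeric values are hypothetical names and data chosen for illustration.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, h, delta):
    # Linear head applied to one position's hidden state plus the shared delta.
    x = [a + b for a, b in zip(h, delta)]
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def fit_delta_all(W, H, labels, steps=10, lr=0.5):
    """Standard (non-causal) variant: fit delta on the loss of EVERY position."""
    delta = [0.0] * len(H[0])
    for _ in range(steps):
        grad = [0.0] * len(delta)
        for h, y in zip(H, labels):
            p = predict(W, h, delta)
            p[y] -= 1.0  # gradient of cross-entropy w.r.t. the logits
            for v, row in enumerate(W):
                for d, w in enumerate(row):
                    grad[d] += p[v] * w / len(H)
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta

# Toy window: 4 positions, hidden dim 2, vocab 3 (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
H = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.3, 0.3]]
labels = [0, 1, 2, 0]
flipped = [0, 1, 2, 1]  # only the target at position 3 differs

p_before = predict(W, H[2], fit_delta_all(W, H, labels))
p_after = predict(W, H[2], fit_delta_all(W, H, flipped))
max_shift = max(abs(a - b) for a, b in zip(p_before, p_after))
print(f"largest change in the prediction at position 2: {max_shift:.4f}")
```

Because the flipped target at position 3 enters the loss that delta is fitted on, the prediction at position 2 moves even though its own input and target are untouched.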

## What context-only (causal) SLOT does

For each sliding window (2048 tokens, stride=64):
1. Compute frozen hidden states H under `no_grad`
2. Context positions = 0..1983 (already scored in prior windows), new positions = 1984..2047
3. Optimize delta (dim=512) using AdamW on **context-only positions' loss** for N steps
4. Score new positions with H + delta
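
The index bookkeeping in step 2 can be checked with a short sketch. The helper names `split_window` and `windows` are hypothetical, and the handling of the first window (no prior context, so every position counts as new) is an assumption; the description above does not say how submissions treat it.

```python
WINDOW, STRIDE = 2048, 64

def split_window(window=WINDOW, stride=STRIDE):
    """Return (context, new) position ranges inside a single window."""
    boundary = window - stride
    return range(0, boundary), range(boundary, window)

def windows(total_tokens, window=WINDOW, stride=STRIDE):
    """Yield (start, context, new) with absolute token positions.

    Assumption: the first window has nothing scored before it, so its
    context is empty and all of its positions are treated as new.
    """
    start = 0
    while start + window <= total_tokens:
        boundary = start + window - stride
        if start == 0:
            yield start, range(0, 0), range(0, window)
        else:
            yield start, range(start, boundary), range(boundary, start + window)
        start += stride

context, new = split_window()
print(context[0], context[-1], new[0], new[-1])
```

Under this layout, every context position of a window was a new position of some earlier window, which is the property the legality argument relies on.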

The gradient with respect to delta depends only on the context hidden states and context labels. Because `compute_logits` is applied independently at each position (RMSNorm + linear), the new positions contribute nothing to that gradient.
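
The four steps can be sketched end to end on a toy window. This is an illustration under stated assumptions, not any submission's code: the head is a bare linear layer (RMSNorm omitted), plain gradient descent replaces AdamW, the window is 16 tokens with stride 4 instead of 2048/64, and `fit_delta` and `score` are hypothetical helpers.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, h, delta):
    # Per-position head: each position sees only its own hidden state plus delta.
    x = [hi + di for hi, di in zip(h, delta)]
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fit_delta(W, H, labels, positions, steps=50, lr=0.05):
    """Fit delta by gradient descent on the cross-entropy of `positions` only."""
    dim = len(H[0])
    delta = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for t in positions:
            p = softmax(logits(W, H[t], delta))
            p[labels[t]] -= 1.0  # gradient of cross-entropy w.r.t. the logits
            for v, row in enumerate(W):
                for d in range(dim):
                    grad[d] += p[v] * row[d] / len(positions)
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta

def score(W, H, labels, positions, delta):
    """Total negative log-likelihood (nats) over `positions`."""
    return -sum(math.log(softmax(logits(W, H[t], delta))[labels[t]])
                for t in positions)

random.seed(0)
VOCAB, DIM, WINDOW, STRIDE = 5, 4, 16, 4
W = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
H = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(WINDOW)]  # frozen
labels = [random.randrange(VOCAB) for _ in range(WINDOW)]

context = list(range(WINDOW - STRIDE))        # already scored
new = list(range(WINDOW - STRIDE, WINDOW))    # scored in this window

delta = fit_delta(W, H, labels, context)      # step 3: context-only loss
nll_new = score(W, H, labels, new, delta)     # step 4: score with H + delta
print(f"NLL on new positions: {nll_new:.4f}")
```

Since `fit_delta` never reads `labels[t]` for `t` in `new`, changing any new-position target leaves `delta` bit-for-bit identical, which is the behaviour the flip test looks for.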

## Submissions using this approach

- PR #1306 (@resouer): 1.0846 BPB, "Causal SLOT + Pre-quant TTT"
- PR #1322 (@newjordan): 1.0854 BPB, "SLOT-32"
- PR #1324 (@yahya010): 0.8275 BPB, "SLOT-28"
- PR #1333 (@aryanbhosale): 1.0766 BPB, "Causal SLOT-16"

## Arguments for legality

- **Same principle as TTT**: adapt on scored tokens, apply to future predictions. @0hq confirmed TTT is legal: "You're allowed to use any preceding tokens from the evaluation set that you've already been tested on in any way you'd like."
- **Condition 1 satisfied**: delta depends only on x_1,...,x_{t-64} (all previously scored). No future token access.
- **Condition 3 satisfied**: new tokens scored AFTER delta optimization on context.
- **Model weights frozen** — only a per-batch 512-dim delta vector is optimized.
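
The Condition 1 and Condition 3 arguments can be stated compactly. The notation is an assumption introduced here for illustration and is not taken from Issue #1017: $C_w$ and $N_w$ are the context and new positions of window $w$, $h_i$ is the frozen hidden state at position $i$, $y_i$ is its target token, and $f$ is the per-position `compute_logits` head.

$$
\delta_w \approx \arg\min_{\delta} \frac{1}{|C_w|} \sum_{i \in C_w} -\log \operatorname{softmax}\big(f(h_i + \delta)\big)_{y_i},
\qquad
\mathrm{score}_j = -\log \operatorname{softmax}\big(f(h_j + \delta_w)\big)_{y_j} \quad (j \in N_w)
$$

Under this reading, $\delta_w$ is a function of $\{(h_i, y_i) : i \in C_w\}$ only, so changing $y_j$ for any $j \in N_w$ changes neither $\delta_w$ nor the prediction at any other new position. The $\approx$ reflects that delta comes from N optimizer steps rather than an exact minimizer.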

## Arguments against legality

- No explicit ruling on SLOT variants from maintainers
- PR #1240 raised concerns about SLOT broadly (though specifically about the non-causal variant)
- The delta is shared across all positions — it implicitly carries information from context to new positions (but TTT weight updates do the same thing)
- Large BPB gains (-0.01 to -0.03) from what is essentially fitting a parameter to val data at eval time

## Request

@0hq @valerio-oai — a ruling on context-only SLOT would help the community. If it's illegal, ~10+ open PRs need to be flagged. If it's legal, participants can use it with confidence.

Key question: **Is the distinction between "standard SLOT" (optimize on all positions) and "causal SLOT" (optimize only on already-scored positions) meaningful for legality?**
