
Record: SLOT + QK-Gain 4.0 + XSA-11 — val_bpb 0.9462 (3-seed mean)#1303

Open
anthony-maio wants to merge 7 commits into openai:main from anthony-maio:submission/slot-qkgain-ttt-frontier

Conversation

@anthony-maio

Summary

  • val_bpb: 0.9462 (3-seed mean, std 0.0030)
  • Artifact: 15.7-15.8 MB (all seeds < 16MB)
  • Training: 600s on 8xH100 SXM | Eval: ~384s (sliding + SLOT)

3-Seed Results

| Seed | Sliding BPB | + SLOT BPB | Artifact (bytes) |
|------|-------------|------------|------------------|
| 1337 | 1.1222 | 0.9493 | 15,742,066 |
| 42 | 1.1209 | 0.9433 | 15,827,886 |
| 2024 | 1.1216 | 0.9458 | 15,757,370 |
| Mean | 1.1216 | 0.9462 +/- 0.0030 | |

Beats merged SOTA (1.1147, PR #1019) by 0.169 BPB (33x the 0.005-nat threshold, p << 0.01).

Techniques

Built on the PR #175 VRL + LeakyReLU² + lzma base; the added techniques are itemized under Credits.

Compliance

  • Score-first SLOT (frozen model, torch.no_grad() hidden states)
  • No n-gram cache, no two-pass rescoring, no eval-time GPTQ
  • Self-contained (no network calls, no env overrides required)
  • All seeds within time and size budgets
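For readers unfamiliar with the score-first SLOT mechanic listed above, here is a minimal toy sketch of eval-time adaptation of a per-sample hidden delta against a frozen model. The linear head, shapes, and hyperparameters are illustrative stand-ins, not the submission's actual forward_hidden/compute_logits implementation; which positions the adaptation loss may legally include is exactly the compliance question discussed later in this thread.

```python
import torch
import torch.nn.functional as F

# Toy sketch of SLOT-style eval-time adaptation: optimize a per-sample delta
# added to frozen hidden states, leaving all model weights untouched.
# (Hypothetical stand-in model; the real submission differs.)
torch.manual_seed(0)
d_model, vocab = 8, 16
head = torch.nn.Linear(d_model, vocab)
for p in head.parameters():
    p.requires_grad_(False)  # frozen model: only the delta is optimized

hidden = torch.randn(1, 5, d_model)        # stand-in for cached hidden states
targets = torch.randint(0, vocab, (1, 5))  # tokens used for adaptation

delta = torch.zeros(1, 1, d_model, requires_grad=True)  # per-sample delta
opt = torch.optim.SGD([delta], lr=0.1)

for _ in range(16):  # "SLOT-16": 16 inner optimization steps
    logits = head(hidden + delta)  # logits from shifted hiddens
    loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    adapted = F.cross_entropy(head(hidden + delta).view(-1, vocab), targets.view(-1))
    baseline = F.cross_entropy(head(hidden).view(-1, vocab), targets.view(-1))
print(adapted.item() < baseline.item())  # → True: delta lowers loss on the adaptation tokens
```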

Reproduction

```
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Training: ~600s. Eval: ~384s. Total: ~16 min.

Credits

Integrates four proven post-March-25 techniques:

- QK-Gain 4.0 (PR openai#1125 sweep)
- XSA all 11 layers (PR openai#1176)
- SLOT per-sample delta + logit bias with scored-position masking (PR openai#1229)
- forward_hidden/compute_logits refactor for SLOT compatibility

3-seed results (1337: 0.9493, 42: 0.9433, 2024: 0.9458). Sliding-window baseline: 1.1216; SLOT-16 improvement: -0.175 BPB. All artifacts under the 16 MB cap. Eval time ~384s.
Copilot AI review requested due to automatic review settings April 3, 2026 14:38
Contributor

Copilot AI left a comment


Pull request overview

Adds a new 10min/16mb record submission directory implementing SLOT-16 evaluation on top of the existing VRL + LeakyReLU² + XSA-all + QK-gain=4.0 stack, along with reproducibility artifacts.

Changes:

  • Introduces a full training/eval script (train_gpt.py) including sliding-window scoring and SLOT (per-sample hidden delta + logit bias) eval-time optimization.
  • Adds record metadata (submission.json) and documentation (README.md) describing the run and results.
  • Adds three seed training logs capturing training, quantization, sliding-window, and SLOT metrics.
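The sliding-window scoring mentioned above can be sketched in miniature: slide overlapping windows over the token stream and score only the final `stride` positions of each window, so every scored token is predicted with long left context. The embedding + linear "model", shapes, and stride here are illustrative assumptions, not the values used in train_gpt.py, and the final figure is bits per scored token (real bpb would divide by bytes, not tokens).

```python
import math
import torch
import torch.nn.functional as F

# Toy sliding-window evaluation: overlapping windows, scoring only the last
# `stride` positions of each. (Hypothetical stand-in model and sizes.)
torch.manual_seed(0)
vocab, d, seq, stride = 16, 8, 8, 4
emb = torch.nn.Embedding(vocab, d)
head = torch.nn.Linear(d, vocab)
tokens = torch.randint(0, vocab, (64,))

total_nll, scored = 0.0, 0
with torch.no_grad():
    for start in range(0, tokens.numel() - seq, stride):
        x = tokens[start:start + seq]            # input window
        y = tokens[start + 1:start + seq + 1]    # next-token targets
        logits = head(emb(x))
        # score only the last `stride` positions; earlier positions serve
        # as context (a full implementation also scores the first window)
        nll = F.cross_entropy(logits[-stride:], y[-stride:], reduction="sum")
        total_nll += nll.item()
        scored += stride

bits_per_token = total_nll / scored / math.log(2)  # nats -> bits
print(scored)  # → 56
```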

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
|------|-------------|
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/train_gpt.py | End-to-end training + int6+lzma export + sliding-window eval + SLOT eval implementation. |
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/submission.json | Submission metadata and aggregated 3-seed results. |
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/README.md | Human-readable summary of techniques, results, and reproduction command. |
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/train_seed42.log | Seed 42 run log for reproducibility/audit. |
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/train_seed2024.log | Seed 2024 run log for reproducibility/audit. |
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/train_seed1337.log | Seed 1337 run log for reproducibility/audit. |


```python
import sys
import time
import uuid
import zlib
```

Copilot AI Apr 3, 2026

zlib is imported but never used in this script. Removing unused imports helps keep the submission minimal and avoids confusion about which compressor/format is actually in use (the script uses lzma everywhere).

Suggested change:

```diff
-import zlib
```
Comment on lines +103 to +107

```python
ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "1")))
ttt_lr = float(os.environ.get("TTT_LR", 0.002))
ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3))
ttt_muon = bool(int(os.environ.get("TTT_MUON", "1")))
ttt_ns_steps = int(os.environ.get("TTT_NS_STEPS", 5))
```

Copilot AI Apr 3, 2026

TTT_* hyperparameters are defined but not used anywhere in this submission (no TTT code path references them). Consider removing them (or wiring them up) so the configuration surface matches the implemented behavior and avoids readers assuming TTT is active.

Suggested change:

```diff
-ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "1")))
-ttt_lr = float(os.environ.get("TTT_LR", 0.002))
-ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3))
-ttt_muon = bool(int(os.environ.get("TTT_MUON", "1")))
-ttt_ns_steps = int(os.environ.get("TTT_NS_STEPS", 5))
```
Comment on lines +108 to +109

```python
gptq_calib_batches = int(os.environ.get("GPTQ_CALIB_BATCHES", 256))
gptq_block_size = int(os.environ.get("GPTQ_BLOCK_SIZE", 64))
```

Copilot AI Apr 3, 2026

GPTQ_* hyperparameters are defined but not used anywhere in this submission (no GPTQ calibration/quantization code path references them). Consider removing them (or wiring them up) so the script reflects the actual quantization approach used.

Suggested change:

```diff
-gptq_calib_batches = int(os.environ.get("GPTQ_CALIB_BATCHES", 256))
-gptq_block_size = int(os.environ.get("GPTQ_BLOCK_SIZE", 64))
```
Comment on lines +289 to +292

```python
    if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS):
        return t.float().contiguous()
    if t.dtype in {torch.float32, torch.bfloat16}:
        passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.")
```

Copilot AI Apr 3, 2026

keep_float_tensor is defined but never used by the quantization pipeline (mixed_quantize_int6 does not call it). If the keep-float patterns are intended to affect quantization, route the state_dict handling through this helper; otherwise consider removing the unused function/constants to keep the submission minimal.

Suggested change:

```diff
-    if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS):
-        return t.float().contiguous()
-    if t.dtype in {torch.float32, torch.bfloat16}:
-        passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.")
+    _ = passthrough_orig_dtypes
+    if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS):
+        return t.float().contiguous()
+    if t.dtype in {torch.float32, torch.bfloat16}:
```
Comment on lines +881 to +889

```python
xb = torch.zeros(bsz, seq_s, dtype=torch.int64, device=device)
yb = torch.zeros(bsz, seq_s, dtype=torch.int64, device=device)
wlens = []
for i, ws in enumerate(bws):
    wend = min(ws + seq_s, total_tok)
    wlen = wend - ws
    wlens.append(wlen)
    xb[i, :wlen] = val_tokens[ws:wend]
    yb[i, :wlen] = val_tokens[ws + 1:wend + 1]
```

Copilot AI Apr 3, 2026

In eval_val_slot, val_tokens is a CPU tensor but is copied into GPU tensors via per-sample slice assignments. This creates many small H2D transfers and can significantly slow SLOT eval. Prefer building each window as a contiguous slice and using a single .to(device, non_blocking=...) (similar to eval_val_sliding), or build the whole batch on CPU and transfer once per batch.

Suggested change:

```diff
-xb = torch.zeros(bsz, seq_s, dtype=torch.int64, device=device)
-yb = torch.zeros(bsz, seq_s, dtype=torch.int64, device=device)
-wlens = []
-for i, ws in enumerate(bws):
-    wend = min(ws + seq_s, total_tok)
-    wlen = wend - ws
-    wlens.append(wlen)
-    xb[i, :wlen] = val_tokens[ws:wend]
-    yb[i, :wlen] = val_tokens[ws + 1:wend + 1]
+xb_cpu = torch.zeros(bsz, seq_s, dtype=val_tokens.dtype, device=val_tokens.device)
+yb_cpu = torch.zeros(bsz, seq_s, dtype=val_tokens.dtype, device=val_tokens.device)
+wlens = []
+for i, ws in enumerate(bws):
+    wend = min(ws + seq_s, total_tok)
+    wlen = wend - ws
+    wlens.append(wlen)
+    xb_cpu[i, :wlen] = val_tokens[ws:wend]
+    yb_cpu[i, :wlen] = val_tokens[ws + 1:wend + 1]
+xb = xb_cpu.to(device=device, non_blocking=True)
+yb = yb_cpu.to(device=device, non_blocking=True)
```
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 3, 2026
…nai#1303 at 0.9462

- logs/daily_research.md: full daily report; PR openai#771 rejected confirmed,
  n-gram PRs status, leaderboard unchanged (1.1147), headline PR openai#1303
  (0.9462 bpb, legality unconfirmed), PR openai#1306 Causal SLOT (-0.009) +
  Pre-quant TTT (-0.022), new paper scan (LaCT, pQuant, SLOT paper)
- CLAUDE.md v7.1: updated key reference PRs (openai#1303, openai#1306), corrected SLOT
  technique table (standard SLOT disputed, Causal SLOT lower-risk alternative,
  Pre-quant TTT novel entry)

https://claude.ai/code/session_01AUKKvYMVeeWQzfTKocVaJZ
GitGeeks added a commit to GitGeeks/parameter-golf that referenced this pull request Apr 3, 2026
- train_gpt_sota_slot.py: SOTA PR openai#1303 baseline (SLOT + XSA-11 + QK-Gain 4.0 + VRL)
- train_gpt_slot_recurrence.py: SOTA + partial depth recurrence with per-iteration conditioning
  RECUR_LAYERS=4,5 RECUR_START_STEP=3000 activates recurrence
  Default (no env vars) = exact SOTA behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 3, 2026
SLOT hyperparameter sweep found steps=24, LR=0.012, stride=96 dramatically
improves over PR openai#1303's SLOT-16 (0.9462 -> 0.8637). Same architecture,
same training — only eval-time SLOT parameters changed.

3-seed: 1337=0.8683, 42=0.8582, 2024=0.8647. All artifacts under 16MB.
GitGeeks added a commit to GitGeeks/parameter-golf that referenced this pull request Apr 3, 2026
Adds RECUR_LAYERS=4,5 support with per-iteration conditioning
(iter_embed + iter_gate) on repeated layers. Delayed activation
via RECUR_START_STEP. Compatible with SLOT eval.
@MatoTeziTanka

MatoTeziTanka commented Apr 3, 2026

Hey — took a careful look at this. QK-Gain 4.0 and XSA-11 are solid, well-documented work. The main thing worth examining is the SLOT implementation.

The good news: SLOT as published (Hu et al., arXiv:2505.12392) optimizes the learned delta on context/prompt tokens only, then evaluates on new tokens. That's clean — the optimization doesn't see the tokens being scored.

The concern: The community has already identified that some Parameter Golf SLOT implementations diverge from the paper. PR #1240 ran a direct test on the per-sample + logit_bias variant and found a 100% causal violation rate — flipping a target token changes NLL at other scored positions. @abaybektursun also found and removed a shared-delta variant from their own submission for the same reason (discussed in issue #140).

The key question for your submission: does your SLOT optimization loss mask include the scored positions (the last stride tokens in each window)? If yes, the optimization sees the same tokens it then evaluates on — that's the non-causal variant. PR #1306 (Causal SLOT) restricts optimization to context-only positions and reports ~1.085 BPB, which is still a strong record result.

From an information-theoretic perspective, the prequential principle (Dawid, 1984) requires that p_t is determined before seeing x_t. If the SLOT optimization computes loss over x_t and adjusts delta/bias accordingly, then p_t was influenced by x_t — even though base model weights didn't change.

Suggestion: Check whether switching to context-only optimization (the Causal SLOT approach) preserves your result. If it does, you're on solid ground regardless of any future ruling. If the BPB jumps from ~0.95 to ~1.08, that delta was coming from the causal violation, and the ~1.08 result is still a legitimate record with your QK-Gain + XSA-11 stack.
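The flip-one-token probe described above can be sketched with a toy model. Everything here (the linear head, the SLOT-like inner loop, shapes, hyperparameters) is a hypothetical stand-in, not PR #1240's actual harness: a fixed (causal) scorer leaves NLL at other positions untouched, while an eval-time optimizer whose loss sees the targets leaks the flip into them.

```python
import torch
import torch.nn.functional as F

# Toy causal-violation probe: flip one target token and check whether NLL
# at OTHER scored positions changes. (Illustrative stand-in model.)
torch.manual_seed(0)
vocab, seq, d = 16, 6, 8
hidden = torch.randn(seq, d)
head = torch.nn.Linear(d, vocab)
for p in head.parameters():
    p.requires_grad_(False)  # frozen model, as in the SLOT setup

def nll(logits, targets):
    return F.cross_entropy(logits, targets, reduction="none")

def score_causal(targets):
    # predictive distribution fixed before seeing the targets
    with torch.no_grad():
        return nll(head(hidden), targets)

def score_slot_like(targets, steps=8, lr=0.1):
    # non-causal variant: the delta is optimized on the tokens being scored
    delta = torch.zeros(1, d, requires_grad=True)
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        loss = nll(head(hidden + delta), targets).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return nll(head(hidden + delta), targets)

targets = torch.randint(0, vocab, (seq,))
flipped = targets.clone()
flipped[3] = (flipped[3] + 1) % vocab  # flip a single target token
others = [i for i in range(seq) if i != 3]

causal_ok = torch.allclose(score_causal(targets)[others],
                           score_causal(flipped)[others])
slot_ok = torch.allclose(score_slot_like(targets)[others],
                         score_slot_like(flipped)[others])
print(causal_ok, slot_ok)  # → True False: the flip leaks into other positions
```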

Either way, the QK-Gain systematic sweep (PR #1125) is genuinely valuable community work. Nice contribution.

@MatoTeziTanka | Agora

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
