Record: SLOT + QK-Gain 4.0 + XSA-11 — val_bpb 0.9462 (3-seed mean)#1303
anthony-maio wants to merge 7 commits into openai:main
Conversation
Integrates four proven post-March-25 techniques:
- QK-Gain 4.0 (PR openai#1125 sweep)
- XSA on all 11 layers (PR openai#1176)
- SLOT per-sample delta + logit bias with scored-position masking (PR openai#1229)
- forward_hidden/compute_logits refactor for SLOT compatibility
3-seed results: 1337 = 0.9493, 42 = 0.9433, 2024 = 0.9458. Sliding-window baseline: 1.1216. SLOT-16 improvement: -0.175 BPB. All artifacts under the 16 MB cap. Eval time ~384 s.
Pull request overview
Adds a new 10min/16mb record submission directory implementing SLOT-16 evaluation on top of the existing VRL + LeakyReLU² + XSA-all + QK-gain=4.0 stack, along with reproducibility artifacts.
Changes:
- Introduces a full training/eval script (train_gpt.py) including sliding-window scoring and SLOT (per-sample hidden delta + logit bias) eval-time optimization.
- Adds record metadata (submission.json) and documentation (README.md) describing the run and results.
- Adds three seed training logs capturing training, quantization, sliding-window, and SLOT metrics.
Reviewed changes
Copilot reviewed 3 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/train_gpt.py | End-to-end training + int6+lzma export + sliding-window eval + SLOT eval implementation. |
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/submission.json | Submission metadata and aggregated 3-seed results. |
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/README.md | Human-readable summary of techniques, results, and reproduction command. |
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/train_seed42.log | Seed 42 run log for reproducibility/audit. |
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/train_seed2024.log | Seed 2024 run log for reproducibility/audit. |
| records/track_10min_16mb/2026-04-02_SLOT_QKGain4_XSA11_TTT/train_seed1337.log | Seed 1337 run log for reproducibility/audit. |
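The train_gpt.py row mentions an int6+lzma export under a 16 MB cap. A minimal sketch of such a cap-enforcing export, with a made-up serialization layout (the real format lives in the submission's script), might look like:

```python
import io
import lzma

import torch

CAP_BYTES = 16 * 1024 * 1024  # the track's 16 MB artifact cap


def export_lzma(state_dict: dict, path: str) -> int:
    """Serialize a state dict, lzma-compress it, and enforce the size cap."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    blob = lzma.compress(buf.getvalue(), preset=9)
    if len(blob) > CAP_BYTES:
        raise ValueError(f"artifact is {len(blob)} bytes, exceeds the 16 MB cap")
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)
```

The check runs at export time rather than at submission time, so an oversized artifact fails fast instead of after a full training run.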
```python
import sys
import time
import uuid
import zlib
```
zlib is imported but never used in this script. Removing unused imports helps keep the submission minimal and avoids confusion about which compressor/format is actually in use (the script uses lzma everywhere).
Suggested change:
```diff
-import zlib
```
```python
ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "1")))
ttt_lr = float(os.environ.get("TTT_LR", 0.002))
ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3))
ttt_muon = bool(int(os.environ.get("TTT_MUON", "1")))
ttt_ns_steps = int(os.environ.get("TTT_NS_STEPS", 5))
```
TTT_* hyperparameters are defined but not used anywhere in this submission (no TTT code path references them). Consider removing them (or wiring them up) so the configuration surface matches the implemented behavior and avoids readers assuming TTT is active.
Suggested change:
```diff
-ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "1")))
-ttt_lr = float(os.environ.get("TTT_LR", 0.002))
-ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3))
-ttt_muon = bool(int(os.environ.get("TTT_MUON", "1")))
-ttt_ns_steps = int(os.environ.get("TTT_NS_STEPS", 5))
```
```python
gptq_calib_batches = int(os.environ.get("GPTQ_CALIB_BATCHES", 256))
gptq_block_size = int(os.environ.get("GPTQ_BLOCK_SIZE", 64))
```
GPTQ_* hyperparameters are defined but not used anywhere in this submission (no GPTQ calibration/quantization code path references them). Consider removing them (or wiring them up) so the script reflects the actual quantization approach used.
Suggested change:
```diff
-gptq_calib_batches = int(os.environ.get("GPTQ_CALIB_BATCHES", 256))
-gptq_block_size = int(os.environ.get("GPTQ_BLOCK_SIZE", 64))
```
```python
if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS):
    return t.float().contiguous()
if t.dtype in {torch.float32, torch.bfloat16}:
    passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.")
```
keep_float_tensor is defined but never used by the quantization pipeline (mixed_quantize_int6 does not call it). If the keep-float patterns are intended to affect quantization, route the state_dict handling through this helper; otherwise consider removing the unused function/constants to keep the submission minimal.
Suggested change:
```diff
+_ = passthrough_orig_dtypes
 if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS):
     return t.float().contiguous()
 if t.dtype in {torch.float32, torch.bfloat16}:
-    passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.")
```
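If the keep-float patterns are meant to take effect, the state_dict walk can be routed through the helper. A hypothetical sketch (the pattern list, int6 packing, and function names below are illustrative, not the submission's code):

```python
import torch

# Hypothetical pattern list; the real one lives in the submission's script.
KEEP_FLOAT_PATTERNS = ("norm", "bias")


def keep_float_tensor(name: str, t: torch.Tensor, patterns=KEEP_FLOAT_PATTERNS):
    """Return an fp32 copy for sensitivity-listed tensors, else None."""
    if any(p in name for p in patterns):
        return t.float().contiguous()
    return None


def mixed_quantize_int6(state_dict: dict) -> dict:
    out = {}
    for name, t in state_dict.items():
        kept = keep_float_tensor(name, t)
        if kept is not None:
            out[name] = kept  # sensitive parameter stays fp32
            continue
        # Symmetric int6 quantization, values stored in int8 containers.
        scale = t.abs().max().clamp(min=1e-8) / 31.0
        q = (t / scale).round().clamp(-32, 31).to(torch.int8)
        out[name] = (q, scale)
    return out
```

The point is simply that the helper sits on the quantization path, so the keep-float constants and the exported artifact can no longer disagree.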
```python
xb = torch.zeros(bsz, seq_s, dtype=torch.int64, device=device)
yb = torch.zeros(bsz, seq_s, dtype=torch.int64, device=device)
wlens = []
for i, ws in enumerate(bws):
    wend = min(ws + seq_s, total_tok)
    wlen = wend - ws
    wlens.append(wlen)
    xb[i, :wlen] = val_tokens[ws:wend]
    yb[i, :wlen] = val_tokens[ws + 1:wend + 1]
```
In eval_val_slot, val_tokens is a CPU tensor but is copied into GPU tensors via per-sample slice assignments. This creates many small H2D transfers and can significantly slow SLOT eval. Prefer building each window as a contiguous slice and using a single .to(device, non_blocking=...) (similar to eval_val_sliding), or build the whole batch on CPU and transfer once per batch.
Suggested change:
```diff
-xb = torch.zeros(bsz, seq_s, dtype=torch.int64, device=device)
-yb = torch.zeros(bsz, seq_s, dtype=torch.int64, device=device)
+xb_cpu = torch.zeros(bsz, seq_s, dtype=val_tokens.dtype, device=val_tokens.device)
+yb_cpu = torch.zeros(bsz, seq_s, dtype=val_tokens.dtype, device=val_tokens.device)
 wlens = []
 for i, ws in enumerate(bws):
     wend = min(ws + seq_s, total_tok)
     wlen = wend - ws
     wlens.append(wlen)
-    xb[i, :wlen] = val_tokens[ws:wend]
-    yb[i, :wlen] = val_tokens[ws + 1:wend + 1]
+    xb_cpu[i, :wlen] = val_tokens[ws:wend]
+    yb_cpu[i, :wlen] = val_tokens[ws + 1:wend + 1]
+xb = xb_cpu.to(device=device, non_blocking=True)
+yb = yb_cpu.to(device=device, non_blocking=True)
```
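The two batching strategies can be checked for equivalence with a self-contained sketch. The shapes and window starts below are made up; on a CUDA machine the second variant issues one host-to-device copy per batch instead of one per row:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
val_tokens = torch.randint(0, 50257, (100_000,), dtype=torch.int64)  # CPU tensor
bsz, seq_s = 32, 1024
total_tok = val_tokens.numel() - 1
bws = [i * seq_s for i in range(bsz)]  # illustrative window starts

# Variant A: per-sample slice assignment into device tensors (many small copies).
xb_a = torch.zeros(bsz, seq_s, dtype=torch.int64, device=device)
for i, ws in enumerate(bws):
    wlen = min(ws + seq_s, total_tok) - ws
    xb_a[i, :wlen] = val_tokens[ws:ws + wlen]

# Variant B: build the batch on CPU, then transfer once.
xb_cpu = torch.zeros(bsz, seq_s, dtype=torch.int64)
for i, ws in enumerate(bws):
    wlen = min(ws + seq_s, total_tok) - ws
    xb_cpu[i, :wlen] = val_tokens[ws:ws + wlen]
xb_b = xb_cpu.to(device=device, non_blocking=True)

assert torch.equal(xb_a, xb_b)
```

For `non_blocking=True` to actually overlap with compute, the CPU staging tensor would also need to be pinned (`pin_memory=True`); without it the copy is still correct, just synchronous.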
…nai#1303 at 0.9462
- logs/daily_research.md: full daily report; PR openai#771 rejection confirmed, n-gram PRs status, leaderboard unchanged (1.1147), headline PR openai#1303 (0.9462 bpb, legality unconfirmed), PR openai#1306 Causal SLOT (-0.009) + pre-quant TTT (-0.022), new paper scan (LaCT, pQuant, SLOT paper)
- CLAUDE.md v7.1: updated key reference PRs (openai#1303, openai#1306); corrected SLOT technique table (standard SLOT disputed, Causal SLOT lower-risk alternative, pre-quant TTT novel entry)
https://claude.ai/code/session_01AUKKvYMVeeWQzfTKocVaJZ
- train_gpt_sota_slot.py: SOTA PR openai#1303 baseline (SLOT + XSA-11 + QK-Gain 4.0 + VRL)
- train_gpt_slot_recurrence.py: SOTA + partial depth recurrence with per-iteration conditioning; RECUR_LAYERS=4,5 RECUR_START_STEP=3000 activates recurrence; default (no env vars) = exact SOTA behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A SLOT hyperparameter sweep found that steps=24, LR=0.012, stride=96 dramatically improves over PR openai#1303's SLOT-16 (0.9462 -> 0.8637). Same architecture, same training; only the eval-time SLOT parameters changed. 3-seed: 1337 = 0.8683, 42 = 0.8582, 2024 = 0.8647. All artifacts under 16 MB.
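A sweep like this can be organized as a plain grid over the eval-time knobs. The `eval_bpb` callable below is a stand-in objective, not the real evaluator, and the grids are illustrative:

```python
import itertools


def slot_sweep(eval_bpb,
               steps_grid=(8, 16, 24),
               lr_grid=(0.004, 0.012),
               stride_grid=(64, 96)):
    """Exhaustive grid over SLOT eval-time hyperparameters.

    Training is untouched; only the eval-time configuration varies.
    Returns (best_bpb, (steps, lr, stride)).
    """
    results = [
        (eval_bpb(steps, lr, stride), (steps, lr, stride))
        for steps, lr, stride in itertools.product(steps_grid, lr_grid, stride_grid)
    ]
    return min(results)


# Stand-in objective whose minimum sits at the reported optimum:
best_bpb, best_cfg = slot_sweep(
    lambda s, lr, st: abs(s - 24) * 0.01 + abs(lr - 0.012) + abs(st - 96) * 0.001
)
```

Because the sweep only touches eval-time parameters, each grid point reuses the same trained checkpoint, so the cost is one SLOT eval per configuration rather than one training run.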
Adds RECUR_LAYERS=4,5 support with per-iteration conditioning (iter_embed + iter_gate) on repeated layers. Delayed activation via RECUR_START_STEP. Compatible with SLOT eval.
Hey — took a careful look at this. QK-Gain 4.0 and XSA-11 are solid, well-documented work. The main thing worth examining is the SLOT implementation.

The good news: SLOT as published (Hu et al., arXiv:2505.12392) optimizes the learned delta on context/prompt tokens only, then evaluates on new tokens. That's clean — the optimization doesn't see the tokens being scored.

The concern: the community has already identified that some Parameter Golf SLOT implementations diverge from the paper. PR #1240 ran a direct test on the per-sample + logit_bias variant and found a 100% causal violation rate — flipping a target token changes NLL at other scored positions. @abaybektursun also found and removed a shared-delta variant from their own submission for the same reason (discussed in issue #140).

The key question for your submission: does your SLOT optimization loss mask include the scored positions (the last stride tokens in each window)? If yes, the optimization sees the same tokens it then evaluates on — that's the non-causal variant. PR #1306 (Causal SLOT) restricts optimization to context-only positions and reports ~1.085 BPB, which is still a strong record result.

From an information-theoretic perspective, the prequential principle (Dawid, 1984) requires that p_t is determined before seeing x_t. If the SLOT optimization computes loss over x_t and adjusts delta/bias accordingly, then p_t was influenced by x_t — even though the base model weights didn't change.

Suggestion: check whether switching to context-only optimization (the Causal SLOT approach) preserves your result. If it does, you're on solid ground regardless of any future ruling. If the BPB jumps from ~0.95 to ~1.08, that delta was coming from the causal violation, and the ~1.08 result is still a legitimate record with your QK-Gain + XSA-11 stack.

Either way, the QK-Gain systematic sweep (PR #1125) is genuinely valuable community work. Nice contribution.
— @MatoTeziTanka | Agora
Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
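The flip test described in that review (PR #1240's probe) can be reproduced on a toy stand-in. Everything below, including the tiny linear "model", the dimensions, and the step counts, is hypothetical, but it shows why a loss mask that includes scored positions makes p_t depend on x_t while a context-only mask does not:

```python
import torch

torch.manual_seed(0)
V, T = 11, 8  # toy vocabulary size and sequence length
emb = torch.randn(V, 16)
W = torch.randn(16, V) * 0.1


def nll_per_pos(tokens, bias):
    # Toy "model": next-token logits from embeddings plus a per-sample logit bias.
    logits = emb[tokens[:-1]] @ W + bias
    logp = torch.log_softmax(logits, dim=-1)
    return -logp[torch.arange(T - 1), tokens[1:]]


def slot_bias(tokens, include_scored, steps=20, lr=0.5):
    # Optimize the logit bias; the loss mask either includes the scored
    # (later) positions or is restricted to context positions only.
    bias = torch.zeros(V, requires_grad=True)
    loss_slice = slice(None) if include_scored else slice(0, T // 2)
    for _ in range(steps):
        loss = nll_per_pos(tokens, bias)[loss_slice].mean()
        loss.backward()
        with torch.no_grad():
            bias -= lr * bias.grad
        bias.grad = None
    return bias.detach()


seq = torch.randint(0, V, (T,))
seq_flip = seq.clone()
seq_flip[-1] = (seq_flip[-1] + 1) % V  # flip only the final target token

# Non-causal variant: the bias saw the flipped target, so NLL shifts at OTHER positions.
b1 = slot_bias(seq, include_scored=True)
b2 = slot_bias(seq_flip, include_scored=True)
noncausal_shift = (nll_per_pos(seq, b1)[:-1] - nll_per_pos(seq, b2)[:-1]).abs().max().item()

# Causal variant: the flipped token never enters the loss, so nothing shifts.
c1 = slot_bias(seq, include_scored=False)
c2 = slot_bias(seq_flip, include_scored=False)
causal_shift = (nll_per_pos(seq, c1) - nll_per_pos(seq, c2)).abs().max().item()
```

The non-causal shift is nonzero because the optimized bias carried information about the flipped token back into predictions at other positions; the context-only variant is bit-identical across the flip.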
Summary
3-Seed Results
Beats merged SOTA (1.1147, PR #1019) by 0.169 BPB (33x the 0.005-nat threshold, p << 0.01).
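As a quick arithmetic check on those numbers (taken verbatim from this PR's text):

```python
# Headline margin over merged SOTA, per the figures quoted above.
merged_sota, this_pr, threshold = 1.1147, 0.9462, 0.005
margin = merged_sota - this_pr
ratio = margin / threshold
print(round(margin, 4), round(ratio, 1))  # 0.1685 33.7
```

0.1685 BPB rounds to the quoted 0.169, and the ratio to the 0.005 threshold is ~33.7, matching the "33x" claim.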
Techniques
Built on the PR #175 VRL + LeakyReLU² + lzma base with:
Compliance
- `torch.no_grad()` (hidden states)

Reproduction
Training: ~600s. Eval: ~384s. Total: ~16 min.
Credits