Record: 11L LeakyReLU² + XSA-all + QK-Gain 4.0 + Full GPTQ + SLOT — val_bpb 0.9354 (3-seed mean) by xexyz · Pull Request #1263 · openai/parameter-golf

xexyz · 2026-04-02T21:14:24Z

Summary

val_bpb: 0.9354 (3-seed mean, std 0.0032)
Artifact: ~15.8 MB (all seeds < 16MB)
Training: 600s on 8xH100 SXM | Eval: ~311s (SLOT) + ~120s (sliding) = ~431s total

Architecture

11L, dim=512, 8 heads, 4 KV heads (GQA)
LeakyReLU(0.5)² MLP with 3x expansion
SmearGate + BigramHash embedding augmentation
XSA (cross-sequence attention) on all 11 layers
QK-Gain init = 4.0
~27M parameters

Training

Muon + Adam optimizers, EMA (0.997) + Tight SWA
Late QAT + Full GPTQ int6 + zstd-22
~5250 steps at 114ms/step

Evaluation — SLOT

Based on arXiv:2505.12392v2:

Extract frozen hidden states from last layer under torch.no_grad()
Optimize per-sample delta [bsz, 1, 512] + logit bias [bsz, 1, 1024] via 16 AdamW steps, cosine LR (0.008 → 0.0008)
Scored-position mask: only last stride tokens per non-first window contribute to SLOT loss
Model weights completely frozen — only delta and logit_bias optimized
Standard autoregressive cross-entropy loss preserves causality

3-Seed Results

Seed	Sliding BPB	SLOT BPB	Artifact
1337	1.1264	0.9349	15,890,549
42	1.1264	0.9325	15,830,408
7	1.1261	0.9388	15,810,068
Mean	1.1263	0.9354

Beats merged SOTA (1.1147) by 0.179 BPB. Clears 0.005 nats threshold by 36x.

Compliance

❌ No n-gram cache
❌ No two-pass rescoring
❌ No eval-time access to training data
❌ No oracle/hindsight selection
✅ Score-first SLOT (frozen model, torch.no_grad hidden states)
✅ Self-contained (zero env var overrides required beyond seed)
✅ All seeds within time and size budgets

Reproduction

SEED=1337 GPTQ_CALIB_BATCHES=32 SLOT_ENABLED=1 SLOT_STEPS=16 \
SLOT_LR=0.008 SLOT_LR_MIN=0.0008 \
torchrun --nproc_per_node=8 train_gpt.py

Credits

SLOT mechanism: arXiv:2505.12392v2
Per-sample delta + logit bias approach inspired by PR Record: Scored-Position SLOT + Per-Sample Delta + GPTQ (val_bpb: 0.9300) #1229 (@resouer)
QK-Gain 4.0 validated by PR Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090 #1125, PR Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0914 (3-seed mean) #1176 (@bigbag)
Base architecture builds on merged SOTA PR Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean) #1019 (@abaybektursun)

…9354 BPB) 3-seed mean: 1337→0.9349, 42→0.9325, 7→0.9388 Sliding baseline: 1.1263 BPB mean SLOT improvement: -0.191 BPB SLOT: per-sample delta [bsz,1,512] + logit bias [bsz,1,1024], 16 AdamW steps, cosine LR 0.008→0.0008, scored-position mask. Model weights frozen during SLOT. ~311s eval time on 8xH100.

…optimization Splits forward_logits into forward_hidden + compute_logits for SLOT. Adds eval_val_sliding_slot: 16 AdamW steps optimizing delta [bsz,1,512] + logit_bias [bsz,1,1024] per batch. Cosine LR 0.008→0.0008. Scored-position mask: only last stride tokens per window. Model weights completely frozen. Expected: 1.12 sliding → ~0.93 with SLOT (based on PRs openai#1229/openai#1263). Enable: SLOT_ENABLED=1 XSA_LAST_N=11 QK_GAIN_INIT=4.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Approaches revamped (old eval-only approaches removed): - 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors) - 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability) - 03: SVD + Quantized Factors (13 layers via spectral compression) - 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation) - 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min) Unmerged PR research saved to unmerged_runs/: - PR openai#1263: SLOT (0.9354 BPB, legality contested) - PR openai#1246: Trinity Ternary (0.9650 BPB) - PR openai#1241: MDLM Diffusion (0.9901 BPB) - PR openai#1252: WARP (1.0713 BPP) - PR openai#1257: Complement Training (1.0855 BPB) - PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB) - PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB) - PR openai#1254: XSA + LoRA TTT (1.1070 BPB) Key finding: without eval tricks, frontier is ~1.09 BPB (PR openai#1260) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Start from current SOTA (11L XSA-all + GPTQ + SLOT) and add Progressive Residual Warmup. Deeper layers warm up 200+200*l steps. Tuned for 8xH100 (~5000+ steps). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 11L LeakyReLU² + XSA-all + QK-Gain 4.0 + Full GPTQ + SLOT — val_bpb 0.9354 (3-seed mean)#1263

Record: 11L LeakyReLU² + XSA-all + QK-Gain 4.0 + Full GPTQ + SLOT — val_bpb 0.9354 (3-seed mean)#1263
xexyz wants to merge 1 commit intoopenai:mainfrom
xexyz:xexyz/slot-0.9354

xexyz commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xexyz commented Apr 2, 2026

Summary

Architecture

Training

Evaluation — SLOT

3-Seed Results

Compliance

Reproduction

Credits

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant