Skip to content

Record: 11L LeakyReLU² + XSA-all + QK-Gain 4.0 + Full GPTQ + SLOT — val_bpb 0.9354 (3-seed mean)#1263

Open
xexyz wants to merge 1 commit intoopenai:mainfrom
xexyz:xexyz/slot-0.9354
Open

Record: 11L LeakyReLU² + XSA-all + QK-Gain 4.0 + Full GPTQ + SLOT — val_bpb 0.9354 (3-seed mean)#1263
xexyz wants to merge 1 commit intoopenai:mainfrom
xexyz:xexyz/slot-0.9354

Conversation

@xexyz
Copy link
Copy Markdown

@xexyz xexyz commented Apr 2, 2026

Summary

  • val_bpb: 0.9354 (3-seed mean, std 0.0032)
  • Artifact: ~15.8 MB (all seeds < 16MB)
  • Training: 600s on 8xH100 SXM | Eval: ~311s (SLOT) + ~120s (sliding) = ~431s total

Architecture

  • 11L, dim=512, 8 heads, 4 KV heads (GQA)
  • LeakyReLU(0.5)² MLP with 3x expansion
  • SmearGate + BigramHash embedding augmentation
  • XSA (cross-sequence attention) on all 11 layers
  • QK-Gain init = 4.0
  • ~27M parameters

Training

  • Muon + Adam optimizers, EMA (0.997) + Tight SWA
  • Late QAT + Full GPTQ int6 + zstd-22
  • ~5250 steps at 114ms/step

Evaluation — SLOT

Based on arXiv:2505.12392v2:

  1. Extract frozen hidden states from last layer under torch.no_grad()
  2. Optimize per-sample delta [bsz, 1, 512] + logit bias [bsz, 1, 1024] via 16 AdamW steps, cosine LR (0.008 → 0.0008)
  3. Scored-position mask: only last stride tokens per non-first window contribute to SLOT loss
  4. Model weights completely frozen — only delta and logit_bias optimized
  5. Standard autoregressive cross-entropy loss preserves causality

3-Seed Results

Seed Sliding BPB SLOT BPB Artifact
1337 1.1264 0.9349 15,890,549
42 1.1264 0.9325 15,830,408
7 1.1261 0.9388 15,810,068
Mean 1.1263 0.9354

Beats merged SOTA (1.1147) by 0.179 BPB. Clears 0.005 nats threshold by 36x.

Compliance

  • ❌ No n-gram cache
  • ❌ No two-pass rescoring
  • ❌ No eval-time access to training data
  • ❌ No oracle/hindsight selection
  • ✅ Score-first SLOT (frozen model, torch.no_grad hidden states)
  • ✅ Self-contained (zero env var overrides required beyond seed)
  • ✅ All seeds within time and size budgets

Reproduction

SEED=1337 GPTQ_CALIB_BATCHES=32 SLOT_ENABLED=1 SLOT_STEPS=16 \
SLOT_LR=0.008 SLOT_LR_MIN=0.0008 \
torchrun --nproc_per_node=8 train_gpt.py

Credits

…9354 BPB)

3-seed mean: 1337→0.9349, 42→0.9325, 7→0.9388
Sliding baseline: 1.1263 BPB mean
SLOT improvement: -0.191 BPB

SLOT: per-sample delta [bsz,1,512] + logit bias [bsz,1,1024],
16 AdamW steps, cosine LR 0.008→0.0008, scored-position mask.
Model weights frozen during SLOT. ~311s eval time on 8xH100.
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Apr 2, 2026
…optimization

Splits forward_logits into forward_hidden + compute_logits for SLOT.
Adds eval_val_sliding_slot: 16 AdamW steps optimizing delta [bsz,1,512]
+ logit_bias [bsz,1,1024] per batch. Cosine LR 0.008→0.0008.
Scored-position mask: only last stride tokens per window.
Model weights completely frozen.

Expected: 1.12 sliding → ~0.93 with SLOT (based on PRs openai#1229/openai#1263).
Enable: SLOT_ENABLED=1 XSA_LAST_N=11 QK_GAIN_INIT=4.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HateBunnyPlzzz added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 2, 2026
Approaches revamped (old eval-only approaches removed):
- 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:
- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPP)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, frontier is ~1.09 BPB (PR openai#1260)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Apr 3, 2026
Start from current SOTA (11L XSA-all + GPTQ + SLOT) and add
Progressive Residual Warmup. Deeper layers warm up 200+200*l steps.
Tuned for 8xH100 (~5000+ steps).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant