Record: Causal SLOT + Pre-quant TTT — val_bpb 1.0846 (3-seed mean)#1306

Open
resouer wants to merge 1 commit into openai:main from resouer:submission/causal-slot-1.0846

@resouer commented Apr 3, 2026

Summary

3-seed mean val_bpb: 1.0846 (std 0.0007) | ~15.95 MB | 8xH100 SXM | ~551s eval

Merged SOTA (PR #1019, 3-seed mean): 1.88218 nats. This run: 1.83126 nats. Delta: -0.051 nats. Clears the 0.005-nat threshold.

Results (3-seed)

| Seed | Sliding BPB | + Causal SLOT BPB | val_loss (nats) | Artifact (bytes) |
|---|---|---|---|---|
| 1337 | 1.0966 | 1.0841 | 1.8304 | 15,952,885 |
| 42 | 1.0969 | 1.0843 | 1.8308 | 15,968,373 |
| 2025 | 1.0972 | 1.0854 | 1.8326 | 15,938,173 |
| Mean | 1.0969 | 1.0846 | 1.8313 | |
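
The summary figures can be re-derived from the per-seed rows. The snippet below is a small check, not part of the submission: it recomputes the 3-seed mean and standard deviation and compares the mean val_loss against the merged-SOTA value and the 0.005-nat threshold quoted in the summary. The variable names are illustrative, and the inputs are the rounded table values, so the last digit of the nats mean can differ slightly from the 1.83126 reported from unrounded logs.

```python
import statistics

# Per-seed values copied from the results table (rounded to 4 decimals).
causal_slot_bpb = [1.0841, 1.0843, 1.0854]  # seeds 1337, 42, 2025
val_loss_nats = [1.8304, 1.8308, 1.8326]

mean_bpb = statistics.mean(causal_slot_bpb)
std_bpb = statistics.stdev(causal_slot_bpb)  # sample std (n - 1 denominator)
mean_nats = statistics.mean(val_loss_nats)

sota_nats = 1.88218  # merged SOTA (PR #1019), as quoted in the summary
threshold = 0.005    # required improvement in nats, as quoted in the summary
delta_nats = mean_nats - sota_nats

print(f"mean val_bpb = {mean_bpb:.4f}, std = {std_bpb:.4f}")
print(f"delta = {delta_nats:.3f} nats, clears threshold: {delta_nats < -threshold}")
```

The reported std of 0.0007 matches the sample standard deviation (n - 1 denominator) rather than the population one.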

Changes from Merged SOTA (PR #1019)

1. Causal SLOT — provably causal eval-time delta optimization (Novel)

Standard SLOT (PR #1172, #1176, #1229) optimizes delta using the loss from all positions in the window, including future ones. PR #1240 proved that this violates causal dependence (100% violation rate). Causal SLOT restricts the optimization to context-only positions, that is, tokens already scored in previous windows. It is therefore provably causal: P(x_{t+1}) depends only on x_1,...,x_t. Delta: -0.009 BPB, ~300s eval time.
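
The context-only restriction can be illustrated with a sliding-window mask. This is a minimal sketch and not the code in train_gpt.py: the function name `causal_slot_windows` and the window and stride values are hypothetical, and it assumes the evaluator advances a fixed-length window by a fixed stride so that each window's leading positions were scored by an earlier window. Only those positions would feed the loss that optimizes delta; the remaining positions are scored afterwards.

```python
from typing import Iterator, List, Tuple


def causal_slot_windows(
    seq_len: int, window: int, stride: int
) -> Iterator[Tuple[int, List[bool], List[bool]]]:
    """Yield (start, context_mask, score_mask) for each sliding window.

    context_mask marks positions already scored by an earlier window; only
    these may contribute to the loss used to optimize delta.
    score_mask marks positions scored for the first time in this window.
    """
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window so no position is skipped")
    scored_upto = 0  # every position < scored_upto has already been scored
    start = 0
    while scored_upto < seq_len:
        end = min(start + window, seq_len)
        context_mask = [pos < scored_upto for pos in range(start, end)]
        score_mask = [not is_ctx for is_ctx in context_mask]
        yield start, context_mask, score_mask
        scored_upto = end
        start += stride


# Hypothetical sizes, chosen only to keep the printout short.
windows = list(causal_slot_windows(seq_len=16, window=8, stride=4))
for start, context_mask, score_mask in windows:
    ctx = [start + i for i, m in enumerate(context_mask) if m]
    new = [start + i for i, m in enumerate(score_mask) if m]
    print(f"window@{start}: optimize delta on {ctx}, then score {new}")
```

In this sketch the first window has no context positions, so nothing is available to optimize delta there. In every later window, each position used for optimization precedes each position being scored.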

2. Pre-quant AdamW TTT (6 epochs)

Runs AdamW test-time training (TTT) on the full-precision EMA weights before GPTQ quantization. Post-quant SGD TTT fails on GPTQ stacks (25 failures per PR #756). Pre-quant TTT instead adapts the weights first, and the adapted weights then quantize better. Delta: -0.022 BPB, 111s.
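
The ordering (adapt first, quantize second) can be shown on a toy problem. Everything below is an assumption made for illustration: the loss is a simple quadratic, AdamW is written out by hand for a list of scalars, and round-to-nearest onto a uniform grid stands in for GPTQ. None of the function names come from the submission.

```python
import math


def adamw_step(w, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update (decoupled weight decay); m and v are updated in place."""
    new_w = []
    for i, (wi, gi) in enumerate(zip(w, grad)):
        m[i] = b1 * m[i] + (1 - b1) * gi
        v[i] = b2 * v[i] + (1 - b2) * gi * gi
        m_hat = m[i] / (1 - b1 ** t)  # bias-corrected first moment
        v_hat = v[i] / (1 - b2 ** t)  # bias-corrected second moment
        new_w.append(wi - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * wi))
    return new_w


def toy_loss(w, target):
    """Quadratic stand-in for the language-model loss."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def pre_quant_ttt(w, batches, epochs=6):
    """Adapt full-precision weights with AdamW for `epochs` passes over `batches`."""
    w = list(w)
    m, v, t = [0.0] * len(w), [0.0] * len(w), 0
    for _ in range(epochs):
        for target in batches:
            t += 1
            grad = [2.0 * (wi - ti) for wi, ti in zip(w, target)]
            w = adamw_step(w, grad, m, v, t)
    return w


def quantize_rtn(w, step=0.25):
    """Round-to-nearest onto a uniform grid (a simple stand-in for GPTQ)."""
    return [round(wi / step) * step for wi in w]


target = [0.8, -0.45]
batches = [target] * 10
fp_weights = [0.0, 0.0]  # stands in for the full-precision EMA weights
adapted = pre_quant_ttt(fp_weights, batches, epochs=6)
artifact = quantize_rtn(adapted)  # quantize only after adaptation
print("adapted:", adapted, "quantized:", artifact)
```

The point of the sketch is the pipeline order only: the optimizer never sees quantized weights, so it avoids the post-quant adaptation step that the PR reports as failing on GPTQ stacks.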

3. Coprime-stride multi-shard data loader

Weighted random shard sampling with coprime stride patterns for batch diversity. Delta: -0.003 BPB.
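
One way to read this description is sketched below. It is not the loader from PR #1184 or from this submission: the function names are hypothetical, shards are weighted by their size, and each draw returns a single sequence index. The property it relies on is that stepping through n items with a stride coprime to n visits every item exactly once before repeating.

```python
import math
import random
from typing import List, Tuple


def pick_coprime_stride(n: int, rng: random.Random) -> int:
    """Draw a stride in [1, n) with gcd(stride, n) == 1."""
    if n <= 2:
        return 1
    while True:
        stride = rng.randrange(1, n)
        if math.gcd(stride, n) == 1:
            return stride


def coprime_order(n: int, rng: random.Random) -> List[int]:
    """Visit 0..n-1 exactly once using a random start and a coprime stride."""
    stride = pick_coprime_stride(n, rng)
    start = rng.randrange(n)
    return [(start + k * stride) % n for k in range(n)]


def sample_batches(
    shard_sizes: List[int], num_draws: int, seed: int = 0
) -> List[Tuple[int, int]]:
    """Return (shard_id, sequence_index) pairs, picking shards by size weight."""
    rng = random.Random(seed)
    orders = [coprime_order(n, rng) for n in shard_sizes]
    cursors = [0] * len(shard_sizes)
    draws = []
    for _ in range(num_draws):
        shard = rng.choices(range(len(shard_sizes)), weights=shard_sizes, k=1)[0]
        index = orders[shard][cursors[shard] % shard_sizes[shard]]
        cursors[shard] += 1
        draws.append((shard, index))
    return draws


# Hypothetical shard sizes (number of sequences per shard).
shard_sizes = [12, 30, 7]
draws = sample_batches(shard_sizes, num_draws=8, seed=0)
print(draws)
```

Compared with reading each shard sequentially, consecutive draws from the same shard land far apart in this sketch, which is one plausible source of the batch diversity mentioned above.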

Reproduction

torchrun --standalone --nproc_per_node=8 train_gpt.py

No env vars needed. FA3 required (see requirements.txt).

Credits

Base: PR #1019 (@abaybektursun). SLOT concept: arXiv:2505.12392v2, PR #1176 (@bigbag). Coprime-stride loader: PR #1184 (@icryo). Pre-quant TTT concept: PR #1006. Causal SLOT: novel (this submission).

Generated with Claude Code

3-seed mean 1.0846 (std 0.0007). Beats merged SOTA (1.1147) by 0.030.

Novel: provably causal eval-time delta optimization (causal SLOT).
Unlike standard SLOT (PR openai#1240 proved 100% causal violation), delta
is optimized using only backward-looking loss from already-scored
positions. Combined with 6-epoch pre-quant AdamW TTT and
coprime-stride multi-shard data loading.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@resouer force-pushed the submission/causal-slot-1.0846 branch from 8930d5a to d43a0f3 on April 3, 2026 16:34
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 3, 2026
…nai#1303 at 0.9462

- logs/daily_research.md: full daily report; PR openai#771 rejected confirmed,
  n-gram PRs status, leaderboard unchanged (1.1147), headline PR openai#1303
  (0.9462 bpb, legality unconfirmed), PR openai#1306 Causal SLOT (-0.009) +
  Pre-quant TTT (-0.022), new paper scan (LaCT, pQuant, SLOT paper)
- CLAUDE.md v7.1: updated key reference PRs (openai#1303, openai#1306), corrected SLOT
  technique table (standard SLOT disputed, Causal SLOT lower-risk alternative,
  Pre-quant TTT novel entry)

https://claude.ai/code/session_01AUKKvYMVeeWQzfTKocVaJZ
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 4, 2026
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 4, 2026