Record: L-BFGS Causal SLOT — val_bpb 1.0046 (3-seed mean) by resouer · Pull Request #7 · resouer/parameter-golf

resouer · 2026-04-04T15:32:38Z

Summary

3-seed mean val_bpb: 1.0046 (std 0.0003) | ~15.8 MB | 8xH100 SXM | ~556s SLOT eval

Merged SOTA (PR openai#1019, 3-seed mean): 1.88218 nats. This run: 1.69620 nats. Delta: -0.186 nats. Clears the 0.005-nat threshold.
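
The threshold check above is simple arithmetic on the two 3-seed means. A minimal sketch, using only the figures quoted in this summary (the variable names are illustrative, not taken from the repository):

```python
# Figures copied from the summary; names are hypothetical.
merged_sota_nats = 1.88218   # PR openai#1019, 3-seed mean
this_run_nats = 1.69620      # this PR, 3-seed mean
threshold_nats = 0.005       # minimum improvement quoted in the summary

delta_nats = this_run_nats - merged_sota_nats
clears_threshold = -delta_nats >= threshold_nats

print(f"delta = {delta_nats:.3f} nats, clears threshold: {clears_threshold}")
```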

Results (3-seed)

Seed	Sliding BPB	+ Causal SLOT BPB	val_loss (nats)	Artifact (bytes)
1337	1.0925	1.0043	1.6957	15,803,625
42	1.0925	1.0048	1.6965	15,808,775
2025	1.0925	1.0047	1.6964	15,794,277
Mean	1.0925	1.0046	1.6962
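
The mean and standard deviation in the table can be recomputed from the per-seed rows. The sketch below also derives the bytes-per-token ratio implied by each row, assuming the usual conversion bpb = (nats / ln 2) / bytes_per_token. That conversion is an assumption about how val_bpb is computed, not something stated in this PR.

```python
import math
import statistics

# Per-seed values copied from the results table.
slot_bpb = {1337: 1.0043, 42: 1.0048, 2025: 1.0047}
val_loss_nats = {1337: 1.6957, 42: 1.6965, 2025: 1.6964}

mean_bpb = statistics.mean(list(slot_bpb.values()))
std_bpb = statistics.stdev(list(slot_bpb.values()))  # sample std (n - 1)

# Assumed conversion: bits per token divided by bytes per token gives bits per byte.
implied_bytes_per_token = {
    seed: (val_loss_nats[seed] / math.log(2)) / slot_bpb[seed]
    for seed in slot_bpb
}

print(f"mean = {mean_bpb:.4f}, std = {std_bpb:.4f}")
for seed, ratio in implied_bytes_per_token.items():
    print(f"seed {seed}: implied bytes/token = {ratio:.3f}")
```

Under that assumption the three seeds imply nearly the same ratio, which is what one would expect if all runs share a tokenizer and validation set.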

Changes from Merged SOTA (PR openai#1019)

1. L-BFGS Causal SLOT in Logit Space (Novel)

Standard SLOT optimizes its delta using the loss from all positions in the window, including future ones; PR openai#1240 reported a 100% causal-violation rate for that setup. Our causal SLOT restricts the optimization loss to context positions that have already been scored. The delta is optimized with L-BFGS in logit space (max_iter=25, history=20, focal loss on the last 128 tokens, warm start, delta clamped to +/-5). Delta: -0.087 BPB, ~556s eval.

Nearest PR: openai#1318 (L-BFGS logit SLOT, non-causal). The difference here is the causal constraint on the optimization: the loss comes from context positions only.
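
The causal constraint can be illustrated on a toy window. The sketch below is not the PR's implementation: it uses plain gradient descent as a stand-in for L-BFGS, a 4-token vocabulary instead of 1024, and omits the focal loss and warm start. All function names are hypothetical. What it keeps is the core idea: a vocab-sized logit offset is fitted on already-scored positions only, clamped to +/-5, and then applied to the new positions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over the full (toy) vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logits, targets, delta, positions):
    """Mean cross-entropy in nats over `positions`, with delta added to the logits."""
    losses = []
    for t in positions:
        probs = softmax([l + d for l, d in zip(logits[t], delta)])
        losses.append(-math.log(probs[targets[t]]))
    return sum(losses) / len(losses)

def fit_causal_delta(logits, targets, n_context, steps=25, lr=1.0, clamp=5.0):
    """Fit one vocab-sized logit offset using only positions [0, n_context)."""
    vocab = len(logits[0])
    delta = [0.0] * vocab
    for _ in range(steps):
        grad = [0.0] * vocab
        for t in range(n_context):  # causal: positions >= n_context are never read
            probs = softmax([l + d for l, d in zip(logits[t], delta)])
            for v in range(vocab):
                onehot = 1.0 if v == targets[t] else 0.0
                grad[v] += (probs[v] - onehot) / n_context
        # Plain gradient step (stand-in for L-BFGS), then clamp to [-clamp, +clamp].
        delta = [max(-clamp, min(clamp, d - lr * g)) for d, g in zip(delta, grad)]
    return delta

# Toy window: 6 positions, 4-token vocabulary, frozen-model logits all zero.
logits = [[0.0] * 4 for _ in range(6)]
targets = [2, 2, 2, 1, 2, 0]
n_context = 4  # positions 0..3 are assumed to have been scored by earlier windows
new_positions = range(n_context, len(targets))

delta = fit_causal_delta(logits, targets, n_context)
context_before = mean_nll(logits, targets, [0.0] * 4, range(n_context))
context_after = mean_nll(logits, targets, delta, range(n_context))
new_loss = mean_nll(logits, targets, delta, new_positions)

# Changing the not-yet-scored targets must leave the fitted delta unchanged.
flipped_targets = targets[:n_context] + [3, 3]
delta_flipped = fit_causal_delta(logits, flipped_targets, n_context)
print("delta unaffected by future targets:", delta == delta_flipped)
```

In a PyTorch setting the gradient loop would presumably be replaced by `torch.optim.LBFGS` with `max_iter` and `history_size` set as listed above; the check at the end is a simplified analogue of testing that future tokens cannot influence the delta.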

2. Pre-quant AdamW TTT (6 epochs)

AdamW test-time training (TTT) on the full-precision EMA weights, run before GPTQ quantization. Delta: -0.022 BPB, 110s.

3. Coprime-stride multi-shard data loader

Weighted random shard sampling with a coprime stride. Delta: -0.003 BPB.
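
One way to read "coprime stride" is sketched below: stepping through n items with a stride that shares no factor with n visits every item exactly once, in a scrambled order, without storing a shuffled index. The weighting by shard size and the helper names are assumptions for illustration; the PR does not spell out either.

```python
import math
import random

def coprime_stride(n, rng):
    """Pick a stride in [1, n) with gcd(stride, n) == 1 (stride 1 always qualifies)."""
    if n == 1:
        return 1
    while True:
        stride = rng.randrange(1, n)
        if math.gcd(stride, n) == 1:
            return stride

def coprime_order(n, rng):
    """Visit indices 0..n-1 exactly once using a random start and coprime stride."""
    stride = coprime_stride(n, rng)
    start = rng.randrange(n)
    return [(start + i * stride) % n for i in range(n)]

def sample_shard(shard_sizes, rng):
    """Weighted random shard choice; weighting by shard size is an assumption."""
    return rng.choices(range(len(shard_sizes)), weights=shard_sizes, k=1)[0]

rng = random.Random(0)
shard_sizes = [1000, 250, 4000]          # hypothetical sequence counts per shard
shard = sample_shard(shard_sizes, rng)
order = coprime_order(10, rng)           # order in which 10 sequences would be read
print("shard:", shard, "order:", order)
```

Because the stride is coprime to n, the map i -> (start + i * stride) mod n is a permutation, so no sequence is skipped or repeated within one pass.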

4. Config (QK_GAIN=5.0, WARMDOWN=4000, GPTQ damp=0.005)

Delta: about -0.003 BPB combined.

Compliance

Satisfies all four NoesisGenesis conditions (Issue openai#677):

p_t depends only on artifact and prefix x_1...x_{t-1} — causal SLOT uses only already-scored positions
Full softmax over full 1024-token vocabulary
Score-before-update — current tokens don't influence their own scores
Single left-to-right sliding-window pass

Model weights are never modified during eval. Only a per-window throwaway delta (1024 floats) is optimized and then discarded.
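
The score-before-update ordering can be shown with window bookkeeping alone. A minimal sketch, assuming a hypothetical window length and stride (the PR text does not state them) and a hypothetical helper `sliding_windows`: each window splits into context positions that earlier windows already scored, which are the only ones a delta may be fitted on, and new positions that are scored afterwards.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (start, context_end, end) triples for one left-to-right pass.

    The window covers [start, end). Positions [start, context_end) were already
    scored by earlier windows; positions [context_end, end) are scored now.
    """
    windows = []
    scored_upto = 0
    while scored_upto < n_tokens:
        if scored_upto == 0:
            end = min(n_tokens, window)                 # first window: no scored context yet
        else:
            end = min(n_tokens, scored_upto + stride)
        start = max(0, end - window)
        windows.append((start, scored_upto, end))
        scored_upto = end
    return windows

windows = sliding_windows(n_tokens=10, window=4, stride=2)
times_scored = [0] * 10
for start, context_end, end in windows:
    # 1) a delta would be fitted here, reading only positions start..context_end-1
    # 2) the new positions are then scored with that delta
    for t in range(context_end, end):
        times_scored[t] += 1
    # 3) the delta is dropped before the next window

print(windows)
print("every token scored once:", all(c == 1 for c in times_scored))
```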

Reproduction

pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

Base: PR openai#1019 (@abaybektursun). Pre-quant TTT: PR openai#1006. Coprime loader: PR openai#1184 (@icryo). L-BFGS SLOT concept: PR openai#1318. Causal SLOT: our PR openai#1306.

3-seed mean 1.0046 (std 0.0003). Beats merged SOTA (1.1147) by 0.110. Novel: L-BFGS causal SLOT — optimizer (L-BFGS), space (logit), and constraint (causal, context-only positions). Passes flip test (PR openai#1240). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

resouer closed this Apr 4, 2026

resouer wants to merge 1 commit into main from submission/lbfgs-causal-slot