
V20: Cascaded 2-Phase L-BFGS Causal SLOT (1.00497 BPB, 3-seed) #1372

Closed
Bortlesboat wants to merge 1 commit into openai:main from Bortlesboat:v20-cascaded-lbfgs

Conversation

@Bortlesboat

Summary

3-seed mean: 1.00497477 BPB (1.69685330 nats)

Beats the merged SOTA, PR #1019 (1.11473509 BPB), by 0.18532523 nats, 37.1x the required 0.005-nat threshold (Welch t = -139.79, df = 2.29, p << 0.001).

The Stack

This submission layers one new eval-time optimization technique on top of the existing SOTA stack:

| Component | Source |
| --- | --- |
| 11L backbone + SP1024 + XSA-all + BigramHash(3072,112) | PR #1019 (@abaybektursun) |
| Full Hessian GPTQ int6 + brotli+lzma + Coprime loader + QK_GAIN=5.0 | PR #1019 |
| L-BFGS Causal SLOT eval loop | PR #1350 (@resouer) |
| Discriminative per-block pre-quant TTT | PR #1351 (@resouer) |
| Cascaded 2-Phase L-BFGS (new) | This PR |

Cascaded 2-Phase L-BFGS

The PR #1350 L-BFGS Causal SLOT runs a single 25-iteration pass per window with history_size=20. We split this budget:

  • Phase 1: 5 iters, history=10, uniform loss on 128-token focal window → finds coarse descent direction
  • Phase 2: 18 iters, history=20, uniform loss, fresh L-BFGS instance with reset history → refines

Total `5*10 + 18*20 = 410` "history-iters" vs the baseline's `25*20 = 500` — ~18% less L-BFGS work and ~13% faster eval (487 s vs 560 s) at equivalent quality.

Why reset history between phases: per the Codex gpt-5.4 review, if Phase 2 changed the objective, the prior curvature pairs would approximate the wrong Hessian. We warm-start the delta tensor across batches (as in PR #1350) but reset the L-BFGS memory between phases.
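A minimal sketch of the two-phase loop, assuming the PR #1350 setup (a warm-started logit-bias `delta` of shape `[1, 1, V]` optimized with `torch.optim.LBFGS` against frozen logits); `masked_ce`, `base_logits`, and the function name are illustrative, not the actual train_gpt.py code:

```python
import torch
import torch.nn.functional as F

def masked_ce(logits, targets, mask):
    # Cross-entropy averaged over the positions selected by opt_mask.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)),
                         targets.view(-1), reduction="none")
    return (ce * mask.view(-1).float()).sum() / mask.sum().clamp_min(1)

def cascaded_slot_step(model, delta, inputs, targets, opt_mask):
    # delta: [1, 1, V] logit bias with requires_grad=True, warm-started across
    # batches; only delta is optimized, the model stays frozen.
    with torch.no_grad():
        base_logits = model(inputs)                      # [1, T, V]

    def make_closure(opt):
        def closure():
            opt.zero_grad()
            loss = masked_ce(base_logits + delta, targets, opt_mask)
            loss.backward()
            return loss
        return closure

    # Phase 1: coarse descent direction (5 iters, history_size=10).
    phase1 = torch.optim.LBFGS([delta], lr=1.0, max_iter=5, history_size=10)
    phase1.step(make_closure(phase1))

    # Phase 2: refinement (18 iters, history_size=20). A fresh optimizer
    # instance discards Phase-1 curvature pairs; the delta tensor carries over.
    phase2 = torch.optim.LBFGS([delta], lr=1.0, max_iter=18, history_size=20)
    phase2.step(make_closure(phase2))
    return delta
```

Creating a new `LBFGS` instance for Phase 2 is the simplest way to drop the stored curvature pairs while keeping the warm-started `delta` itself.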

Causality: opt_mask is strictly `[focal_start, s)` where `s = max(wl - slot_stride, 0)` — only already-scored positions. Same guarantee as PR #1350.
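A sketch of that mask, using the names from the line above (`focal_start`, `wl`, `slot_stride`); `seq_len` and the helper function itself are illustrative:

```python
import torch

def build_opt_mask(seq_len, focal_start, wl, slot_stride, device=None):
    # Optimize only over the half-open range [focal_start, s) with
    # s = max(wl - slot_stride, 0): positions already scored in earlier strides.
    s = max(wl - slot_stride, 0)
    opt_mask = torch.zeros(seq_len, dtype=torch.bool, device=device)
    if s > focal_start:
        opt_mask[focal_start:s] = True
    return opt_mask
```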

Results

| Seed | val_bpb | val_loss | steps | artifact (bytes) |
| --- | --- | --- | --- | --- |
| 1337 | 1.00407045 | 1.69532641 | 6123 | 15,882,862 |
| 42 | 1.00648098 | 1.69939647 | 6122 | 15,832,250 |
| 999 | 1.00437287 | 1.69583703 | 6120 | 15,846,954 |
| Mean | 1.00497477 | 1.69685330 | 6121.67 | 15,854,022 |
| Std | 0.00131315 | 0.00221720 | | |

All seeds hit the 600 s wallclock cap. All artifacts are well under 16 MB.

Ablation note

train_gpt.py also implements an importance-weighted CE mixture (`V20_GRAD_WEIGHTED=1`) with `w_t ∝ (1 - p_target)`. Tested with α=0.5 on seed 1337, it scored 1.00725 BPB (~0.003 BPB worse than uniform loss), matching Codex's prediction that the cascaded structure and importance weighting overlap rather than compound. This path is disabled by default.
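A minimal sketch of the disabled path, assuming the mixture takes the form `(1 - α) · uniform CE + α · weighted CE` with per-token weights `w_t ∝ (1 - p_target)`; the exact mixture and normalization in train_gpt.py may differ:

```python
import torch
import torch.nn.functional as F

def grad_weighted_ce(logits, targets, alpha=0.5):
    # Per-token cross-entropy; p_target = model probability of the true token.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)),
                         targets.view(-1), reduction="none")
    p_target = torch.exp(-ce).detach()
    w = 1.0 - p_target                       # w_t ∝ (1 - p_target)
    w = w / w.sum().clamp_min(1e-8)          # normalize weights to sum to 1
    return (1.0 - alpha) * ce.mean() + alpha * (w * ce).sum()
```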

Concurrent PRs

This submission shares lineage with the open PRs #1350 (L-BFGS Causal SLOT) and #1351 (Discriminative TTT). If PR #1350 merges first, this submission becomes a technique-focused contribution documenting Cascaded L-BFGS as an equivalent-quality, faster variant.

Environment

  • 8xH100 80GB HBM3 SXM (RunPod)
  • PyTorch 2.4.1+cu124, CUDA 12.4, FlashAttention 3.0.0 (Hopper kernels from source)

@Bortlesboat
Author

Closing this PR. After reviewing @resouer's closing comment on #1350, I confirm our submission inherits both compliance violations:

  1. Pre-quant TTT on val_tokens (train_gpt.py:1107-1167, Discriminative TTT block) — trains on validation tokens for 8 epochs before quantization and scoring, violating score-before-update.

  2. Minibatch SLOT leakage (L-BFGS Causal SLOT loop) — optimizing 32 overlapping windows per batch against a shared [1,1,V] delta leaks gradient information from later windows into the earlier windows being scored.
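Schematically, the leak pattern looks like the following; the tensors and sizes are illustrative stand-ins, not the actual train_gpt.py loop:

```python
import torch
import torch.nn.functional as F

B, T, V = 32, 128, 4096          # windows per batch, window length, small vocab for illustration
logits = torch.randn(B, T, V)    # stand-in for the frozen per-window model logits
targets = torch.randint(0, V, (B, T))
delta = torch.zeros(1, 1, V, requires_grad=True)        # ONE delta shared by all 32 windows
opt = torch.optim.LBFGS([delta], max_iter=25, history_size=20)

def closure():
    opt.zero_grad()
    # The loss aggregates all 32 windows at once, so the gradient on the shared
    # delta mixes information from later windows into earlier ones.
    loss = F.cross_entropy((logits + delta).view(-1, V), targets.view(-1))
    loss.backward()
    return loss

opt.step(closure)

# Every window is then scored with the same optimized delta, so an early
# window's score already reflects tokens from later windows in the batch.
with torch.no_grad():
    per_token_ce = F.cross_entropy((logits + delta).view(-1, V),
                                   targets.view(-1), reduction="none").view(B, T)
```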

Thanks @resouer for the clear writeup and @ClassicLarry for flagging. Self-closing to keep the record straight. Will be more careful about causality in any future submission.

@Bortlesboat closed this on Apr 5, 2026