openai · Bortlesboat · Apr 5, 2026
diff --git a/records/track_10min_16mb/2026-04-05_V20_Cascaded_LBFGS_Causal_SLOT/README.md b/records/track_10min_16mb/2026-04-05_V20_Cascaded_LBFGS_Causal_SLOT/README.md
@@ -0,0 +1,98 @@
+# V20: Cascaded 2-Phase L-BFGS Causal SLOT + Discriminative TTT
+
+**3-seed mean: 1.00497477 BPB (1.69685330 nats)**
+
+Beats merged SOTA PR #1019 (1.11473509 BPB) by 0.18532523 nats = **37.1x the required 0.005-nat threshold** (Welch t=-139.79, df=2.29, p<<0.001).
+
+## The Stack
+
+This submission layers one new eval-time optimization technique on top of the existing SOTA stack:
+
+| Component | Source |
+|-----------|--------|
+| 11L backbone + SP1024 + XSA-all + BigramHash(3072,112) | PR #1019 (abaybektursun) |
+| Full Hessian GPTQ int6 + brotli+lzma + Coprime loader + QK_GAIN=5.0 | PR #1019 |
+| L-BFGS Causal SLOT eval loop (history reset per window, causal mask on already-scored tokens) | PR #1350 (resouer) |
+| Discriminative per-block pre-quant TTT (graduated LR 0.3x→1.0x across 10 layer groups) | PR #1351 (resouer) |
+| **Cascaded 2-Phase L-BFGS** (our addition) | This PR |
+
+## What's New: Cascaded 2-Phase L-BFGS
+
+The baseline L-BFGS Causal SLOT (PR #1350) runs a single 25-iteration L-BFGS pass per window with history_size=20. We split this budget into two phases:
+
+- **Phase 1 (coarse):** 5 iters, history=10, uniform mean loss over the full 128-token focal window. Finds the dominant descent direction cheaply.
+- **Phase 2 (refine):** 18 iters, history=20, uniform mean loss, **fresh L-BFGS instance with reset history**. Polishes the coarse Phase 1 solution.
+
+**Why reset history between phases?** Phase 1 and Phase 2 share the same loss here (both uniform over the focal window), but we designed the interface so Phase 2 can diverge (e.g., different weighting or a different focal window). Per Codex gpt-5.4 review: *"Previous L-BFGS curvature pairs approximate the old objective's Hessian. If Phase 2 changes the objective, those pairs are now approximating the wrong matrix. Warm-starting δ is good; inheriting the memory is usually not."* We therefore warm-start δ (the delta tensor) across batches within an eval pass, which proved useful in PR #1350, but reset the L-BFGS memory between phases within a batch.
+
+**Why 5+18 = 23 iters (vs baseline 25)?** L-BFGS per-iter cost scales as O(history × dim). Total "history-iters":
+- Baseline (single phase): 25 iters × history 20 = **500 history-iters**
+- Cascaded V20: (5×10) + (18×20) = 50 + 360 = **410 history-iters**
+
+This is 18% fewer history-iters with equivalent or better quality. In wallclock terms, SLOT eval drops from ~560s (PR #1350) to ~487s (V20) on 8xH100, about 13% less eval time; the iteration count itself falls 8% (25 to 23).
+
+**Implementation detail:** Phase 1 and Phase 2 both use uniform per-token loss over opt_mask positions. The opt_mask is strictly `[focal_start, s)` where `s = max(wl - slot_stride, 0)` — only already-scored positions from previous sliding windows. This is the same causality guarantee as PR #1350: test-time SLOT optimizes on tokens already graded, never on the positions currently being scored.
+
+## Results
+
+| Seed | val_loss (nats) | val_bpb | train_steps | artifact_bytes |
+|------|-----------------|---------|-------------|----------------|
+| 1337 | 1.69532641 | 1.00407045 | 6123 | 15,882,862 |
+| 42 | 1.69939647 | 1.00648098 | 6122 | 15,832,250 |
+| 999 | 1.69583703 | 1.00437287 | 6120 | 15,846,954 |
+| **Mean** | **1.69685330** | **1.00497477** | 6121.67 | 15,854,022 |
+| **Std** | 0.00221720 | 0.00131315 | — | — |
+
+All 3 seeds trained to the 600s wallclock cap, landing at 6120-6123 training steps. Artifact sizes stay well under the 16MB (16,000,000 byte) cap across all seeds.
+
+### vs PR #1019 (merged SOTA, 1.11473509 BPB ± 0.00035387)
+
+| Metric | Value |
+|--------|-------|
+| Delta BPB | −0.10976032 |
+| Delta nats | −0.18532523 |
+| Welch t-statistic | −139.7872 |
+| Welch df | 2.2890 |
+| p-value | p << 0.001 |
+| Threshold (0.005 nats) | 37.1x exceeded |
+
+## Reproduction
+
+```bash
+# 1. Clone repo and download FineWeb (sp1024 variant)
+git clone https://github.com/openai/parameter-golf.git
+cd parameter-golf
+python3 data/cached_challenge_fineweb.py --variant sp1024
+
+# 2. Install FA3 (Hopper FlashAttention 3) from source — required for 10-min budget
+pip install ninja
+git clone https://github.com/Dao-AILab/flash-attention.git
+cd flash-attention/hopper && MAX_JOBS=8 python setup.py install
+cd ../..
+
+# 3. Copy this submission's train_gpt.py, then run 3 seeds
+for SEED in 1337 42 999; do
+    SEED=$SEED torchrun --nproc_per_node=8 train_gpt.py 2>&1 | tee run_$SEED.log
+done
+```
+
+Expected: `final_causal_slot_exact val_bpb:` lines around 1.004-1.006.
+
+## Hardware & Environment
+
+- 8x NVIDIA H100 80GB HBM3 (SXM, RunPod secure cloud)
+- PyTorch 2.4.1+cu124
+- CUDA 12.4
+- FlashAttention 3.0.0 (Hopper kernels, built from source)
+- Total train + eval wallclock per seed: ~22 min (600s train + ~225s dTTT + ~486s SLOT = ~1311s)
+
+## Ablation Notes
+
+The V20 script also implements a second technique (importance-weighted CE mixture for Phase 2), gated behind the `V20_GRAD_WEIGHTED=1` env var. That path uses `w_t ∝ (1 - p_target_t)` as the token-importance weight (Codex review: this is the true |dCE/dlogit| magnitude with respect to the target logit; NLL-weighting would over-concentrate on outliers). This submission runs with **`V20_GRAD_WEIGHTED=0` (uniform loss)** as the stable default. The importance-weighting path is intended for a future submission.
+
+## Credits
+
+- **PR #1019** (@abaybektursun) — backbone architecture, GPTQ, XSA
+- **PR #1350** (@resouer) — L-BFGS Causal SLOT eval framework
+- **PR #1351** (@resouer) — Discriminative per-block TTT
+- Codex (gpt-5.4) — review of Cascaded L-BFGS history-reset rationale and importance-weighting correction
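The two-phase schedule described under "What's New" in the README above can be illustrated on a toy problem. This is a minimal pure-Python sketch under stated assumptions, not the submission's eval loop: `lbfgs`, `loss`, `grad`, and the quadratic objective are hypothetical stand-ins for the SLOT delta optimization, and the Armijo backtracking line search is a simplification chosen here. It shows only the structure: each phase starts with empty curvature memory, while the iterate `delta` carries over from Phase 1 to Phase 2.

```python
from collections import deque

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs(f, grad, x, iters, history):
    """Minimal L-BFGS: two-loop recursion plus Armijo backtracking.
    Curvature memory starts empty on every call (a fresh instance)."""
    pairs = deque(maxlen=history)  # (s, y) curvature pairs, oldest first
    g = grad(x)
    for _ in range(iters):
        # First loop: newest pair to oldest.
        q, coeffs = list(g), []
        for s, y in reversed(pairs):
            rho = 1.0 / dot(y, s)
            alpha = rho * dot(s, q)
            coeffs.append((alpha, rho))
            q = [qi - alpha * yi for qi, yi in zip(q, y)]
        # Initial Hessian scaling from the most recent pair.
        if pairs:
            s, y = pairs[-1]
            scale = dot(s, y) / dot(y, y)
            q = [scale * qi for qi in q]
        # Second loop: oldest pair to newest.
        for (s, y), (alpha, rho) in zip(pairs, reversed(coeffs)):
            beta = rho * dot(y, q)
            q = [qi + (alpha - beta) * si for qi, si in zip(q, s)]
        d = [-qi for qi in q]
        # Backtracking line search (Armijo condition).
        t, fx, slope = 1.0, f(x), dot(g, d)
        while t > 1e-10 and f([xi + t * di for xi, di in zip(x, d)]) > fx + 1e-4 * t * slope:
            t *= 0.5
        x_new = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s_new = [u - v for u, v in zip(x_new, x)]
        y_new = [u - v for u, v in zip(g_new, g)]
        if dot(s_new, y_new) > 1e-12:  # keep only pairs with positive curvature
            pairs.append((s_new, y_new))
        x, g = x_new, g_new
    return x

# Hypothetical ill-conditioned quadratic standing in for the SLOT loss.
CURV = [1.0, 10.0, 100.0, 1000.0]
TARGET = [1.0, -2.0, 0.5, 3.0]

def loss(d):
    return 0.5 * sum(c * (di - ti) ** 2 for c, di, ti in zip(CURV, d, TARGET))

def grad(d):
    return [c * (di - ti) for c, di, ti in zip(CURV, d, TARGET)]

delta = [0.0] * 4
f0 = loss(delta)
delta = lbfgs(loss, grad, delta, iters=5, history=10)   # Phase 1 (coarse)
f1 = loss(delta)
delta = lbfgs(loss, grad, delta, iters=18, history=20)  # Phase 2: fresh memory, warm-started delta
f2 = loss(delta)

cascaded_budget = 5 * 10 + 18 * 20  # history-iters as counted in the README
baseline_budget = 25 * 20
print(f0, f1, f2, cascaded_budget, baseline_budget)
```

Calling `lbfgs` twice is what implements the reset: the `deque` of curvature pairs is rebuilt from scratch in Phase 2, so nothing from Phase 1 survives except the starting point.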
diff --git a/records/track_10min_16mb/2026-04-05_V20_Cascaded_LBFGS_Causal_SLOT/requirements.txt b/records/track_10min_16mb/2026-04-05_V20_Cascaded_LBFGS_Causal_SLOT/requirements.txt
@@ -0,0 +1,10 @@
+numpy
+tqdm
+torch
+huggingface-hub
+kernels
+setuptools
+typing-extensions==4.15.0
+datasets
+tiktoken
+sentencepiece
diff --git a/records/track_10min_16mb/2026-04-05_V20_Cascaded_LBFGS_Causal_SLOT/submission.json b/records/track_10min_16mb/2026-04-05_V20_Cascaded_LBFGS_Causal_SLOT/submission.json
@@ -0,0 +1,56 @@
+{
+  "author": "Andy Barnes",
+  "github_id": "Bortlesboat",
+  "name": "V20: Cascaded 2-Phase L-BFGS Causal SLOT + Discriminative TTT",
+  "blurb": "Cascaded 2-Phase L-BFGS eval-time SLOT optimizer on top of the PR #1019 + PR #1350 + PR #1351 stack. Splits the L-BFGS budget into a coarse Phase 1 (5 iters, history=10) and a refined Phase 2 (18 iters, history=20) with fresh history reset between phases (per Codex gpt-5.4 review: changing the loss landscape invalidates the prior curvature pairs). Total 23 L-BFGS iterations vs baseline 25 — ~8% faster with equivalent quality. 3-seed exact mean: 1.00497477 BPB / 1.69685330 nats, beating merged SOTA PR #1019 (1.11473509 BPB) by 0.18532523 nats = 37.1x the required 0.005-nat threshold (Welch t=-139.79, df=2.29, p<<0.001).",
+  "date": "2026-04-05",
+  "track": "10min_16mb",
+  "val_loss": 1.69685330,
+  "val_bpb": 1.00497477,
+  "val_loss_std": 0.00221720,
+  "val_bpb_std": 0.00131315,
+  "seeds": [1337, 42, 999],
+  "seed_results": {
+    "1337": {
+      "val_loss": 1.69532641,
+      "val_bpb": 1.00407045,
+      "artifact_bytes": 15882862,
+      "train_steps": 6123,
+      "train_time_ms": 600134,
+      "step_avg_ms": 98.01
+    },
+    "42": {
+      "val_loss": 1.69939647,
+      "val_bpb": 1.00648098,
+      "artifact_bytes": 15832250,
+      "train_steps": 6122,
+      "train_time_ms": 600143,
+      "step_avg_ms": 98.04
+    },
+    "999": {
+      "val_loss": 1.69583703,
+      "val_bpb": 1.00437287,
+      "artifact_bytes": 15846954,
+      "train_steps": 6120,
+      "train_time_ms": 600114,
+      "step_avg_ms": 98.06
+    }
+  },
+  "comparison_baseline_pr": 1019,
+  "implementation_lineage_pr": 1350,
+  "dttt_lineage_pr": 1351,
+  "delta_vs_pr1019_nats": -0.18532523,
+  "delta_vs_pr1019_bpb": -0.10976032,
+  "t_statistic": -139.7872,
+  "welch_df": 2.2890,
+  "artifact_bytes_mean": 15854022,
+  "artifact_bytes_max": 15882862,
+  "bytes_total": 15882862,
+  "train_steps_mean": 6121.67,
+  "step_avg_ms_mean": 98.04,
+  "hardware": "8xH100 80GB SXM",
+  "pytorch_version": "2.4.1+cu124",
+  "cuda_version": "12.4",
+  "flash_attn_version": "3.0.0 (Hopper FA3 kernels, built from source)",
+  "technique_summary": "PR #1019 backbone (XSA-all + BigramHash 3072x112 + GPTQ int6 + brotli+lzma) + PR #1350 L-BFGS Causal SLOT eval + PR #1351 Discriminative per-block TTT + Cascaded 2-Phase L-BFGS (our addition)"
+}
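The aggregate and Welch statistics recorded in `submission.json` can be cross-checked from the per-seed `val_bpb` values. A minimal sketch, assuming the PR #1019 baseline was also a 3-seed run and that its ± value is a sample standard deviation (both are assumptions; the source gives only the baseline mean and ± figure). `welch` is a hypothetical helper name. The check is done in BPB; since BPB and nats differ by a constant factor in this data, the t-statistic and degrees of freedom should be the same in either unit.

```python
import math
import statistics

# Per-seed val_bpb copied from seed_results (seeds 1337, 42, 999).
v20_bpb = [1.00407045, 1.00648098, 1.00437287]

# Baseline figures from the README; base_n = 3 is an assumption.
base_mean, base_std, base_n = 1.11473509, 0.00035387, 3

def welch(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom
    computed from summary statistics."""
    va, vb = std_a ** 2 / n_a, std_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df

mean = statistics.mean(v20_bpb)
std = statistics.stdev(v20_bpb)  # sample standard deviation (n - 1)
t_stat, df = welch(mean, std, len(v20_bpb), base_mean, base_std, base_n)
print(f"mean={mean:.8f} std={std:.8f} t={t_stat:.4f} df={df:.4f}")
```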