openai · Its-Just-Crump · Apr 5, 2026
diff --git a/...ds/track_10min_16mb/2026-04-05_SP4096_DepthRecur_ParallelResid_Brotli/README.md b/...ds/track_10min_16mb/2026-04-05_SP4096_DepthRecur_ParallelResid_Brotli/README.md
@@ -0,0 +1,140 @@
+# Record: SP4096 + Depth Recurrence + Parallel Residuals + QK-Gain + Brotli
+
+**val_bpb: 1.1020** (3-seed mean, std 0.0011) | **~15.88 MB max** | 8xH100 SXM, 600s | No TTT
+
+**Improvement over current merged SOTA ([PR #1019](https://github.com/openai/parameter-golf/pull/1019), 1.1147 BPB):** -0.0127 BPB / -0.0088 nats per byte, i.e. the BPB delta times ln(2) (Welch t=-18.37, df=2.38, p<0.001)
+
+## Results
+
+| Seed | Steps | ms/step | **Sliding Window BPB** | Model Bytes | Total Artifact |
+|------|-------|---------|------------------------|-------------|----------------|
+| 42   | 5,733 | 104.67  | **1.10327**            | 15,748,095  | 15,824,545     |
+| 314  | 5,945 | 100.94  | **1.10181**            | 15,792,991  | 15,869,441     |
+| 999  | 5,936 | 101.10  | **1.10102**            | 15,799,271  | 15,875,721     |
+| **Mean** | |        | **1.10203**            |             |                |
+
+Spread across seeds: 0.0023 BPB (very tight). All 3 seeds fit under 16MB with >=124KB margin.
+
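The summary figures above can be re-derived from the per-seed rows. The following is a minimal sketch, assuming the 16MB cap means 16,000,000 bytes (an assumption, though it is the reading under which the stated >=124KB margin comes out):

```python
import statistics

# Per-seed values copied from the results table above.
bpb = {42: 1.10327, 314: 1.10181, 999: 1.10102}
artifact_bytes = {42: 15_824_545, 314: 15_869_441, 999: 15_875_721}

# Assumption: the 16MB cap is decimal, i.e. 16,000,000 bytes.
SIZE_CAP_BYTES = 16_000_000

mean_bpb = statistics.mean(bpb.values())
std_bpb = statistics.stdev(bpb.values())          # sample std (n - 1)
spread = max(bpb.values()) - min(bpb.values())    # max minus min across seeds
min_margin = SIZE_CAP_BYTES - max(artifact_bytes.values())

print(f"mean={mean_bpb:.5f} std={std_bpb:.4f} spread={spread:.4f}")
print(f"smallest margin under the cap: {min_margin} bytes")
```

The standard deviation here is the sample (n - 1) estimate, which is the one that matches the 0.0011 quoted in the headline.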
+## Tokenizer Change: BPB Correctness Proof
+
+This submission uses a SentencePiece 4096 BPE tokenizer (`fineweb_4096_bpe.model`) instead of the baseline SP1024. Per competition rules, we provide detailed proof that val_bpb is correctly calculated.
+
+**How BPB is computed in this script:**
+
+The `val_bpb` metric is computed by the same `sliding_window_bpb()` function used by all submissions in this repo. The function:
+
+1. Evaluates cross-entropy loss in nats per token over the full validation set using a sliding window (stride=64)
+2. Counts the total number of bytes in the validation text by summing `token_byte_lengths[token_id]` for each token
+3. Computes `BPB = total_nats / (total_bytes * ln(2))`
+
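The three steps above can be sketched in a few lines. This is an illustration of the formula only, not the repo's `sliding_window_bpb()`; the function name `bits_per_byte` and the toy vocabulary are hypothetical, and the sliding-window evaluation that produces the per-token losses is left out.

```python
import math

def bits_per_byte(token_nats, token_ids, token_byte_lengths):
    """BPB = total_nats / (total_bytes * ln 2)."""
    total_nats = sum(token_nats)                                  # step 1: summed loss in nats
    total_bytes = sum(token_byte_lengths[t] for t in token_ids)   # step 2: bytes covered
    return total_nats / (total_bytes * math.log(2))               # step 3: nats -> bits, per byte

# Toy example with a hypothetical 3-token vocabulary (byte length per token id).
token_byte_lengths = {0: 3, 1: 1, 2: 4}
token_ids = [0, 1, 2, 0]
token_nats = [2.0, 0.5, 3.0, 2.1]   # per-token cross-entropy in nats

toy_bpb = bits_per_byte(token_nats, token_ids, token_byte_lengths)
print(f"toy BPB: {toy_bpb:.4f}")
```

A quick sanity check on the units: a model that spends exactly ln(2) nats on every 1-byte token scores 1.0 BPB.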
+The `token_byte_lengths` lookup table is built by `build_sentencepiece_luts()`, which inspects each token's UTF-8 byte length via `sp.id_to_piece(token_id)`. This is independent of vocabulary size — a token that represents "the" is 3 bytes whether the vocab is 1024 or 4096.
+
+**Key invariant:** The total byte count of the validation set is identical regardless of tokenizer, because every tokenizer produces a lossless segmentation of the same byte sequence. Whether the text is split into more tokens (SP1024) or fewer tokens (SP4096), the byte total is the same. Therefore BPB is a fair cross-tokenizer comparison.
+
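The invariant can be illustrated with a toy example. Fixed-length chunking stands in for a real tokenizer here (an assumption made purely for illustration; SentencePiece segments differently), but the point carries over: any lossless segmentation of the same bytes sums to the same byte count.

```python
text = "the cat sat on the mat"
data = text.encode("utf-8")

def segment(data, piece_len):
    """Split bytes into consecutive pieces of at most piece_len bytes (lossless)."""
    return [data[i:i + piece_len] for i in range(0, len(data), piece_len)]

fine = segment(data, 1)     # many short pieces, like a small vocabulary
coarse = segment(data, 4)   # fewer long pieces, like a larger vocabulary

fine_bytes = sum(len(p) for p in fine)
coarse_bytes = sum(len(p) for p in coarse)

print(len(fine), "pieces ->", fine_bytes, "bytes")
print(len(coarse), "pieces ->", coarse_bytes, "bytes")
```

The piece counts differ, so per-token loss is not comparable across the two, while the byte denominator is unchanged.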
+**Verification from logs:** The validation set has `tokens:45508608` SP4096 tokens. At ~3.32 bytes/token average, this covers the same ~151M byte validation set used by SP1024 submissions. The per-token cross-entropy is higher with SP4096 (2.54 nats vs 1.88 nats) because each token covers more bytes, but the per-byte rate (BPB) is directly comparable. For SP1024, 1.88 nats/token at 1.1147 BPB implies ~2.43 bytes/token, or ~62M tokens over the same bytes.
+
+---
+
+## What Changed vs PR #1019
+
+This submission replaces the SP1024 + BigramHash + LZMA stack with an SP4096-native architecture that gets more capacity from the larger vocabulary and recurrent/parallel techniques instead of explicit bigram features.
+
+### 1. SP4096 Tokenizer + MLP 4x (from SP1024 + MLP 3x)
+
+Switching to a 4096-token SentencePiece vocabulary with 4x MLP multiplier increases model capacity from ~27M to 34.4M parameters. The larger vocabulary captures more subword patterns natively, eliminating the need for BigramHash (which compresses 3.4x worse per parameter with SP4096).
+
+### 2. Depth Recurrence (Layers 4-5 from Step 3000)
+
+After step 3000, layers 4 and 5 are re-executed, effectively giving the model 13 logical layers for the cost of 11 layers' parameters. This adds zero parameters — it's purely a compute-time technique that trades ~10% wall-clock time for improved representation depth. Source: [PR #1260](https://github.com/openai/parameter-golf/pull/1260) ablation, estimated -0.0035 BPB.
+
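One way to picture the recurrence is as a change in the order in which layer indices are executed. The sketch below is an assumption about that order (the recurred block is replayed immediately after its first pass, and the switch happens at `step >= 3000`); the README does not spell out either detail, and `layer_schedule` is a hypothetical helper, not a function from `train_gpt.py`.

```python
def layer_schedule(step, num_layers=11, recur_layers=(4, 5), recur_start_step=3000):
    """Return the order in which layer indices run at a given training step.

    Assumption: the recurred layers are replayed right after their first pass.
    """
    order = list(range(num_layers))
    if step < recur_start_step or not recur_layers:
        return order
    insert_at = order.index(max(recur_layers)) + 1
    return order[:insert_at] + list(recur_layers) + order[insert_at:]

early = layer_schedule(step=0)
late = layer_schedule(step=3000)

print(len(early), early)
print(len(late), late)
```

Under this reading, the late schedule has 13 entries but still only 11 distinct indices, which is why the technique adds compute but no parameters.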
+### 3. Parallel Residuals (Layer 7+)
+
+From layer 7 onward, the MLP and attention outputs are merged through a learned `lane_merge` scalar and `resid_mix_mlp` vector per layer (~20KB raw, ~3-5KB compressed). This allows the model to balance attention vs MLP contributions dynamically. Source: [PR #1289](https://github.com/openai/parameter-golf/pull/1289), estimated -0.0035 BPB.
+
+### 4. QK-Gain 5.0
+
+Initializes query and key projections with 5x scale, sharpening attention from the start of training without any parameter cost. Source: [PR #1217](https://github.com/openai/parameter-golf/pull/1217) (45 experiments), estimated -0.001 BPB.
+
+### 5. MuonEq-R Optimizer
+
+Normalizes each row by its norm before the Newton-Schulz iteration in Muon. It takes ~15 lines of code, has zero parameter cost, and gives a minor but consistent improvement. Source: [PR #1334](https://github.com/openai/parameter-golf/pull/1334).
+
+### 6. ADAM_WD=0.090 + GPTQ Tuning
+
+Increased Adam weight decay from 0.02 to 0.090 (matching Muon WD). GPTQ calibration increased from 64 to 128 AR self-generated sequences for denser Hessian estimates with the larger SP4096 model. Dampening factor tuned to 0.01.
+
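For readers unfamiliar with the damping term, here is a minimal sketch. It assumes `GPTQ_DAMP` follows the convention of the reference GPTQ implementation, where a fraction of the mean Hessian diagonal is added to every diagonal entry before inversion; whether this script does exactly that is not confirmed by the README. `damp_hessian` is a hypothetical name.

```python
def damp_hessian(hessian, damp=0.01):
    """Add damp * mean(diag(H)) to each diagonal entry of a square matrix.

    Assumption: this mirrors the reference GPTQ damping convention.
    """
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damped = [row[:] for row in hessian]   # copy so the input is untouched
    for i in range(n):
        damped[i][i] += damp * mean_diag
    return damped

# A singular 2x2 toy Hessian (second row is half the first).
H = [[4.0, 2.0], [2.0, 1.0]]
Hd = damp_hessian(H, damp=0.01)

det_before = H[0][0] * H[1][1] - H[0][1] * H[1][0]
det_after = Hd[0][0] * Hd[1][1] - Hd[0][1] * Hd[1][0]
print(det_before, det_after)
```

The toy matrix is singular before damping and invertible after, which is the purpose of the term: it keeps the inverse well-defined when calibration data leaves some directions poorly covered.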
+### 7. Brotli Compression (from LZMA)
+
+SP4096 int6 weights compress better under Brotli than LZMA. This switch adds to the size headroom freed by removing BigramHash.
+
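A comparison of this kind can be set up as below. This is a sketch on synthetic data only: the payload is random Gaussian values clipped to the int6 range and stored one per byte, which is an assumption and not the submission's actual weight packing, so the output says nothing about which codec wins on the real model. `brotli` is the third-party package from this record's `requirements.txt`; the sketch skips it if it is not installed.

```python
import lzma
import random

try:
    import brotli  # third-party; listed in this record's requirements.txt
except ImportError:
    brotli = None

def fake_int6_payload(n=1 << 16, seed=0):
    """Synthetic stand-in for int6 weights: clipped Gaussian, one value per byte."""
    rng = random.Random(seed)
    vals = (max(-32, min(31, round(rng.gauss(0.0, 6.0)))) for _ in range(n))
    return bytes(v & 0xFF for v in vals)   # two's-complement byte per value

def compressed_sizes(payload):
    """Compressed size in bytes under each available codec."""
    sizes = {"raw": len(payload), "lzma": len(lzma.compress(payload, preset=9))}
    if brotli is not None:
        sizes["brotli"] = len(brotli.compress(payload, quality=11))
    return sizes

payload = fake_int6_payload()
sizes = compressed_sizes(payload)
print(sizes)
```

To reproduce the claim itself, the same two calls would need to be run on the serialized quantized weights from an actual training run.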
+### Dropped vs PR #1019
+
+| Removed | Reason |
+|---------|--------|
+| BigramHash 3072x112 | Compresses 3.4x worse per param with SP4096, net size-negative |
+| TrigramHash | Same compression issue with SP4096 |
+| LZMA preset=9 | Brotli compresses SP4096 int6 weights better |
+| TTT | Neutral or negative on this stack (25 failed attempts, [PR #756](https://github.com/openai/parameter-golf/pull/756)) |
+
+---
+
+## Architecture
+
+| Component | Setting | Source |
+|-----------|---------|--------|
+| Layers | 11 (512d, 8 GQA heads, 4 KV heads) | Baseline |
+| MLP | **4x** (2048) with LeakyReLU(0.5)^2 | [#493](https://github.com/openai/parameter-golf/pull/493) @parinzee |
+| Tokenizer | **SentencePiece 4096** | [#1334](https://github.com/openai/parameter-golf/pull/1334) |
+| Attention | XSA on all 11 layers | [#478](https://github.com/openai/parameter-golf/pull/478) @gowtham0992 |
+| Depth Recurrence | **Layers 4-5 from step 3000** | [#1260](https://github.com/openai/parameter-golf/pull/1260) |
+| Parallel Residuals | **Layer 7+ with learned merge** | [#1289](https://github.com/openai/parameter-golf/pull/1289) |
+| QK-Gain | **5.0** | [#1217](https://github.com/openai/parameter-golf/pull/1217) |
+| Optimizer | Parallel Muon + **MuonEq-R** + Parameter Banking | [#399](https://github.com/openai/parameter-golf/pull/399), [#1334](https://github.com/openai/parameter-golf/pull/1334) |
+| RoPE | Partial (16/64 dims) | [#315](https://github.com/openai/parameter-golf/pull/315) @jfprincz |
+| LN Scale | 1/sqrt(layer+1) | [#315](https://github.com/openai/parameter-golf/pull/315) @jfprincz |
+| VE128 | Layers 9-10 | [#374](https://github.com/openai/parameter-golf/pull/374) @unnir |
+| SmearGate | Position-mixing gate | [#65](https://github.com/openai/parameter-golf/pull/65) @aquariouseworkman |
+| U-Net skips | Encoder-decoder connections | [#289](https://github.com/openai/parameter-golf/pull/289) |
+| Weight avg | EMA(0.997) + Tight SWA(every 50) | [#401](https://github.com/openai/parameter-golf/pull/401) @newjordan |
+| Quantization | Full Hessian GPTQ int6 (AR self-gen, **128 batch**) | [#535](https://github.com/openai/parameter-golf/pull/535) @raahilshah |
+| Compression | **Brotli** | **This work** |
+| Warmdown | 4000 iterations | [#364](https://github.com/openai/parameter-golf/pull/364) @shikhar1729 |
+| Late QAT | STE at LR scale < 0.15 | [#286](https://github.com/openai/parameter-golf/pull/286) @chris-buckley |
+| Selective pruning | +/-1 values by reconstruction error | [#609](https://github.com/openai/parameter-golf/pull/609) @saml212 |
+| Flash Attention 3 | Hopper warp-specialized kernels | [#122](https://github.com/openai/parameter-golf/pull/122) @mtybadger |
+
+## Requirements
+
+**Flash Attention 3 (Hopper) is required.**
+
+```bash
+pip install --break-system-packages flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
+pip install sentencepiece zstandard brotli
+python3 -c "from flash_attn_interface import flash_attn_func; import sentencepiece, zstandard, brotli; print('deps OK')"
+```
+
+## Run Command
+
+```bash
+VOCAB_SIZE=4096 MLP_MULT=4.0 QK_GAIN_INIT=5.0 MUON_EQ_R=1 \
+RECUR_LAYERS="4,5" RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
+MUON_WD=0.090 ADAM_WD=0.090 WARMDOWN_ITERS=4000 \
+GPTQ_CALIB_BATCHES=128 GPTQ_DAMP=0.01 \
+BIGRAM_VOCAB_SIZE=0 TRIGRAM=0 TARGET_MB=15.9 SEED=42 \
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+## Lineage
+
+```
+PR #1019 (Merged SOTA, 1.1147 BPB) -- SP1024 + BigramHash + LZMA
+    +-- This work replaces with:
+        +-- SP4096 + MLP 4x (native vocabulary capacity, no bigram needed)
+        +-- Depth recurrence layers 4-5 from step 3000 (from #1260)
+        +-- Parallel residuals layer 7+ with learned merge (from #1289)
+        +-- QK-Gain 5.0 (from #1217)
+        +-- MuonEq-R optimizer (from #1334)
+        +-- ADAM_WD=0.090, GPTQ 128-batch calibration, damp=0.01
+        +-- Brotli compression (better for SP4096 int6)
+        +-- Guided by 37 GPU runs (~$266) and PR #670 negative results
+```
diff --git a/records/track_10min_16mb/2026-04-05_SP4096_DepthRecur_ParallelResid_Brotli/requirements.txt b/records/track_10min_16mb/2026-04-05_SP4096_DepthRecur_ParallelResid_Brotli/requirements.txt
@@ -0,0 +1,4 @@
+flash_attn_3
+sentencepiece
+zstandard
+brotli
diff --git a/records/track_10min_16mb/2026-04-05_SP4096_DepthRecur_ParallelResid_Brotli/run_pgolf.sh b/records/track_10min_16mb/2026-04-05_SP4096_DepthRecur_ParallelResid_Brotli/run_pgolf.sh
@@ -0,0 +1,66 @@
+#!/bin/bash
+# Parameter Golf — Phase 1b: SP4096 + Depth Recurrence + Parallel Residuals + QK-Gain + Brotli
+# 3-seed validated 2026-04-05: mean 1.1020 BPB (seeds 42, 314, 999)
+# Usage: bash run_pgolf.sh [seed]
+
+SEED="${1:-42}"
+
+# === SP4096 ARCHITECTURE ===
+export VOCAB_SIZE=4096
+export MLP_MULT=4.0
+export XSA_LAST_N=11
+
+# === TECHNIQUES FROM PRs #1217, #1334, #1260, #1289 ===
+export QK_GAIN_INIT=5.0
+export MUON_EQ_R=1
+export RECUR_LAYERS="4,5"
+export RECUR_START_STEP=3000
+export PARALLEL_START_LAYER=7
+
+# === OPTIMIZER ===
+export MATRIX_LR=0.02
+export SCALAR_LR=0.02
+export TIED_EMBED_LR=0.03
+export MUON_WD=0.090
+export ADAM_WD=0.090
+export WARMDOWN_ITERS=4000
+
+# === GPTQ (AR self-gen, Brotli) ===
+export GPTQ_CALIB_BATCHES=128
+export GPTQ_DAMP=0.01
+export LATE_QAT_THRESHOLD=0.15
+
+# === DISABLED (don't fit SP4096 or confirmed dead) ===
+export BIGRAM_VOCAB_SIZE=0
+export BIGRAM_DIM=0
+export TRIGRAM=0
+export HADAMARD_ROTATION=0
+export SOFT_ROUND_QAT=0
+export PREQUANT_TTT=0
+export MIXED_BITWIDTH=0
+export NGRAM_ENABLED=0
+export TTT_ENABLED=0
+
+# === DATA PATHS (SP4096) ===
+export DATA_PATH=./data/datasets/fineweb10B_sp4096
+export TOKENIZER_PATH=./data/tokenizers/fineweb_4096_bpe.model
+
+# === PROVEN STACK ===
+export SWA_ENABLED=1
+export VE_ENABLED=1
+export LN_SCALE=1
+export TARGET_MB=15.9
+export SEED=$SEED
+
+# === FULL RUN (10 min wallclock) ===
+export ITERATIONS=20000
+export MAX_WALLCLOCK_SECONDS=600
+export VAL_LOSS_EVERY=4000
+
+echo "=== Phase 1b: SP4096 + DepthRecur + ParallelResid + QK-Gain + Brotli ==="
+echo "Seed: $SEED | Vocab: $VOCAB_SIZE | MLP: ${MLP_MULT}x | ADAM_WD: $ADAM_WD"
+echo "DepthRecur: layers $RECUR_LAYERS from step $RECUR_START_STEP"
+echo "ParallelResid: from layer $PARALLEL_START_LAYER | QK-Gain: $QK_GAIN_INIT"
+echo ""
+
+torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee "run_phase1b_seed${SEED}_$(date +%Y%m%d_%H%M%S).log"
diff --git a/records/track_10min_16mb/2026-04-05_SP4096_DepthRecur_ParallelResid_Brotli/submission.json b/records/track_10min_16mb/2026-04-05_SP4096_DepthRecur_ParallelResid_Brotli/submission.json
@@ -0,0 +1,52 @@
+{
+  "author": "Its-Just-Crump",
+  "github_id": "Its-Just-Crump",
+  "name": "SP4096 + Depth Recurrence + Parallel Residuals + QK-Gain + AR Self-Gen GPTQ + Brotli",
+  "blurb": "11L/512d GQA with SP4096 tokenizer, MLP 4x (34.4M params), depth recurrence (layers 4-5 from step 3000), parallel residuals (layer 7+), QK-Gain 5.0, MuonEq-R, Full Hessian GPTQ with AR self-gen calibration (128 seqs x 2048 tokens), Brotli compression. No BigramHash. 3-seed exact mean: 1.10203473 BPB, beating current SOTA 1.11473509 BPB by 0.01270 BPB / 0.00880 nats (Welch t=-18.37, df=2.38, p<0.001).",
+  "date": "2026-04-05",
+  "track": "10min_16mb",
+  "val_loss": 2.53578534,
+  "val_bpb": 1.10203473,
+  "val_loss_std": 0.00263202,
+  "val_bpb_std": 0.00114386,
+  "seeds": [42, 314, 999],
+  "seed_results": {
+    "42": {
+      "val_loss": 2.53863636,
+      "val_bpb": 1.10327377,
+      "artifact_bytes": 15824545,
+      "steps": 5733,
+      "step_avg_ms": 104.67
+    },
+    "314": {
+      "val_loss": 2.53527157,
+      "val_bpb": 1.10181145,
+      "artifact_bytes": 15869441,
+      "steps": 5945,
+      "step_avg_ms": 100.94
+    },
+    "999": {
+      "val_loss": 2.53344809,
+      "val_bpb": 1.10101898,
+      "artifact_bytes": 15875721,
+      "steps": 5936,
+      "step_avg_ms": 101.10
+    }
+  },
+  "comparison_baseline_pr": 1019,
+  "delta_vs_baseline_bpb": -0.01270036,
+  "delta_vs_baseline_nats": -0.00880322,
+  "t_statistic": -18.3720,
+  "welch_df": 2.3794,
+  "artifact_bytes_mean": 15856569,
+  "artifact_bytes_max": 15875721,
+  "bytes_total": 15875721,
+  "train_steps_mean": 5871.33,
+  "step_avg_ms_mean": 102.24,
+  "hardware": "8xH100 80GB SXM",
+  "pytorch_version": "2.9.1+cu128",
+  "cuda_version": "12.8",
+  "flash_attn_version": "2.8.3 (FA3 Hopper kernels)",
+  "calibration": "AR self-generated (128 seqs x 2048 tokens, temp=0.8, no external data)",
+  "technique_summary": "SP4096 + MLP4x + Depth Recurrence + Parallel Residuals + QK-Gain 5.0 + MuonEq-R + AR Self-Gen GPTQ (128 batch) + Brotli"
+}