openai · xiehuanyi · Apr 10, 2026 · Apr 11, 2026
diff --git a/.../track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/README.md b/.../track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/README.md
@@ -0,0 +1,134 @@
+# Non-record: SP8192 + SOTA recipe on 1xA100 — 1.0704 BPB (TTT) / 1.0727 (sliding)
+
+**Author:** Huanyi Xie (`xiehuanyi`)
+**Date:** 2026-04-11
+**Track:** `non_record_16mb`
+**Result:** **val_bpb = 1.07034733** (int6 GPTQ + Brotli + sliding window eval s64 + Legal Score-First TTT)
+
+## TL;DR
+
+This runs the **exact PR #1493 SOTA recipe** (SP8192 + 3-layer recurrence + parallel residuals + QK-gain 5.25 + legal score-first TTT + MuonEq-R + SDClip GPTQ + Brotli + byte shuffle) on **1 × A100 80GB for 4 hours** instead of the required 8 × H100 for 10 minutes. The compute budget is roughly equivalent (~80 H100-minute-equivalent), but because it wasn't actually run on the required hardware, this is a non-record submission.
+
+**Headline result:**
+- **TTT BPB: 1.07035** (beats upstream main-leaderboard TTT SOTA 1.0810 by 0.01065)
+- **Sliding BPB: 1.07266** (beats upstream main-leaderboard sliding SOTA 1.0827 by 0.01004)
+- **Total submission size: 16,019,227 bytes** (under 16 MiB = 16,777,216)
+
+## Why non-record
+
+This submission does **not** qualify for `track_10min_16mb` because:
+1. Ran on **1×A100 for 4h (14,400s)** instead of **8×H100 for 10 min**
+2. A100 doesn't support FlashAttention-3 (Hopper-only); uses PyTorch SDP with the flash backend as a fallback
+3. Never verified on actual 8×H100 hardware
+
+Rough compute comparison:
+- H100 BF16: ~990 TFLOPS × 8 × 10 min = 80 H100-minutes
+- A100 BF16: ~312 TFLOPS × 1 × 240 min = 240 A100-minutes; at 990/312 ≈ 3.17 A100-minutes per H100-minute, that is ~76 H100-minute-equivalent, close to the H100 budget in raw throughput but without the FA3 speedup.
+
+So this submission is **compute-equivalent** to the main-leaderboard budget, just not on the required hardware.
+
+## What's in the recipe
+
+The training script is a minor adaptation of the PR #1493 script (decompressed from its LZMA+base85 wrapper) with three changes:
+
+1. **FA3 → FA2/SDP fallback**: On A100, FlashAttention-3 is unavailable, so the attention wrapper falls through to PyTorch's `scaled_dot_product_attention` with the flash backend. A manual GQA head-repeat is added for the SDP path, since SDP by default expects matching query and KV head counts and the model uses `num_heads != num_kv_heads`.
+2. **Python 3.9 compatibility**: Removed `zip(strict=True)` and nested f-string quotes.
+3. **`GRAD_ACCUM_STEPS` env override**: Added so the script can be run with arbitrary grad-accumulation on single-GPU setups (default still `8 // world_size`).
+
+Everything else is exactly as in the SOTA submission:
+- **SP8192** tokenizer (retokenized FineWeb 10B with an 8192-vocab SentencePiece BPE model borrowed from the 74M_Ternary record)
+- **11L × 512d × 8H / 4KV GQA**, MLP 4×, LeakyReLU(0.5)²
+- **Depth Recurrence**: runs physical layers 3-5 two extra times (11 + 2 × 3 = 17 virtual layers from 11 physical), activated at `frac=0.35` of training
+- **Parallel Residuals** from layer 7+ (last 4 layers only, GPT-J style)
+- **QK-Gain init = 5.25** (per-head learnable query scaling, non-default SOTA setting)
+- **Skip Gates** (sigmoid-gated U-Net skip connections)
+- **MuonEq-R**: row-normalized Muon, Newton-Schulz 5 steps (plus AdamW for embeddings/scalars/head)
+- **Partial RoPE (16/64)** + LN Scale
+- **EMA decay 0.9965** with warmdown fraction 0.72
+- **MUON_WD = 0.095, ADAM_WD = 0.02, EMBED_WD = 0.085, MATRIX_LR = 0.022**
+- **GPTQ with SDClip**: int6 attention/MLP (k=12.85), int8 embeddings (k=20.0), block size 128
+- **Brotli-11 + byte shuffle** compression
+- **Legal Score-First TTT**: SGD lr=0.005 momentum=0.9, 3 epochs per 32K-token chunk, cosine LR decay, score-before-update ordering
+
+## Numbers (seed 1337)
+
+| Metric | Value |
+|---|---|
+| Pre-quantization post-EMA BF16 | 1.07610 |
+| Int6 quantized (no sliding) | 1.08950 |
+| **Int6 + Sliding Window s64** | **1.07266** |
+| **Int6 + Sliding + Legal TTT** | **1.07035** ← reported |
+| Steps trained | 6371 / 20000 (wallclock capped) |
+| Step avg | ~2260 ms (on 1×A100, SDP backend) |
+| Peak GPU memory | 41.8 GiB |
+| Model params | 35,944,536 |
+| Artifact bytes (int6 + brotli) | 15,970,123 |
+| Code bytes (uncompressed) | 49,104 |
+| **Total submission bytes** | **16,019,227** |
+
+## Comparison vs. upstream records
+
+| Submission | Sliding BPB | TTT BPB |
+|---|---|---|
+| **This (exp62, 1xA100 4h)** | **1.07266** | **1.07035** |
+| PR #1493 SOTA (8xH100 10min) | 1.0827 | 1.0810 |
+| PR #1477 (SP8192 + ParResid + TTT) | ~1.082 | 1.0822 |
+| PR #1413 (SP8192 + QK5 + TTT) | ~1.084 | 1.0828 |
+| PR #1412 (SP8192 + ParResid + SDClip) | ~1.086 | 1.0835 |
+| PR #1394 (SP8192 + GPTQ Emb + SDClip) | ~1.088 | 1.0856 |
+
+Delta vs. PR #1493 SOTA: **-0.01004 sliding, -0.01065 TTT**.
+
+## Comparison with exp60 / exp61 (same training config, different QK_gain)
+
+Three runs were made with identical seeds/hyperparams except `QK_GAIN_INIT` (exp61 additionally set the TTT flag at train time, which left all three numbers unchanged):
+
+| Run | QK_GAIN | Int6 | Sliding | TTT |
+|---|---|---|---|---|
+| exp60 | 5.0 (SOTA default) | 1.09031 | 1.07345 | 1.07137 |
+| exp61 | 5.0 + TTT flag at train | 1.09031 | 1.07345 | 1.07137 |
+| **exp62** | **5.25** | **1.08950** | **1.07266** | **1.07035** |
+
+QK_GAIN_INIT=5.25 (the SOTA record's exact value, non-default) improves all three eval phases over 5.0, by 0.0008 BPB on int6 and sliding and 0.0010 BPB on TTT, in line with the SOTA submission's observation of a "monotonic improvement from 4.0 to 5.25".
+
+## Reproduction
+
+```bash
+pip install brotli sentencepiece
+# A100: flash_attn (FA2) optional, falls back to SDP if not installed
+# pip install flash-attn --no-build-isolation
+
+# 1. Download docs and retokenize with SP8192 (one-time, ~2h on CPU)
+python data/download_hf_docs_and_tokenize.py \
+  --repo-id willdepueoai/parameter-golf \
+  --remote-root datasets \
+  --output-root data \
+  --tokenizer-config data/tokenizer_specs_sp8192.json \
+  --skip-byte \
+  --reuse-sp-model 8192=<path_to_fineweb_8192_bpe.model>
+
+# 2. Train (4h on 1x A100 80GB)
+DATA_DIR=./data/ \
+SEED=1337 \
+VOCAB_SIZE=8192 \
+MAX_WALLCLOCK_SECONDS=14400 \
+QK_GAIN_INIT=5.25 \
+TTT_ENABLED=1 \
+torchrun --standalone --nproc_per_node=1 train_gpt.py
+```
+
+## Caveats
+
+- **Single seed (1337) only.** A 3-seed mean (e.g. 42, 314, 999) has not been run. The main-leaderboard SOTA reports 3-seed mean/std; this submission is single-seed for time reasons.
+- **Non-record hardware.** Not verified on 8×H100; used 4h on 1×A100.
+- Two runs (exp60 and exp62, the run reported here) crashed with SIGSEGV at the end of their in-script eval pipelines (a torch.compile recompile issue when creating a fresh GPT instance for eval after training). The saved quantized artifacts were then evaluated successfully via a standalone `eval_only.py` script, and the reported numbers come from that standalone eval.
+- The `grad_accum=2` variant (exp63/64) OOM'd: the SOTA model with MLP 4× + depth recurrence has a per-micro-batch footprint larger than the simpler v2_full_stack model from earlier rounds.
+
+## Files
+
+- `README.md` (this file)
+- `submission.json`
+- `train_gpt.py` — A100-adapted SOTA script (FA3→SDP fallback, Python 3.9 compat, GRAD_ACCUM_STEPS env override)
+- `final_model.int6.ptz` — 15.97 MB int6+brotli quantized artifact
+- `train_seed1337.log` — full training log
+- `eval_seed1337.log` — standalone eval log (sliding + TTT numbers)
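The compute comparison in the README's "Why non-record" section can be reproduced with a few lines of arithmetic. The sketch below is a rough check and not part of the submission: it assumes the peak dense BF16 figures quoted there (~990 TFLOPS for H100, ~312 TFLOPS for A100) and ignores utilization, memory bandwidth, and the FA3 speedup. The function name `h100_minute_equivalent` is made up for illustration.

```python
# Peak dense BF16 throughput as quoted in the README (assumed, not measured).
H100_TFLOPS = 990.0
A100_TFLOPS = 312.0


def h100_minute_equivalent(gpu_tflops: float, num_gpus: int, minutes: float) -> float:
    """Convert a GPU-time budget into H100-minutes by peak-FLOPS ratio."""
    return gpu_tflops / H100_TFLOPS * num_gpus * minutes


# Record track: 8 x H100 for 10 minutes.
record_budget = h100_minute_equivalent(H100_TFLOPS, num_gpus=8, minutes=10)
# This run: 1 x A100 for 240 minutes.
this_run = h100_minute_equivalent(A100_TFLOPS, num_gpus=1, minutes=240)

print(f"record track: {record_budget:.1f} H100-minutes")
print(f"this run:     {this_run:.1f} H100-minute-equivalent")
print(f"ratio:        {this_run / record_budget:.2f}")
```

Under these assumptions the 4-hour A100 run lands slightly below the record-track budget, which is the sense in which the README calls the two roughly equivalent.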
diff --git a/...ds/track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/eval_seed1337.log b/...ds/track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/eval_seed1337.log
@@ -0,0 +1,5 @@
+quantized val_loss:2.81379763 val_bpb:1.08949971 eval_time:101404ms
+quantized_sliding_window val_loss:2.77029639 val_bpb:1.07265608 eval_time:1580310ms
+ttt:start chunks=1238 ttt_lr=0.005 ttt_epochs=3
+quantized_ttt val_loss:2.76433372 val_bpb:1.07034733 eval_time:2775717ms
+EXIT_CODE=0
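The metric lines in this log are regular enough to parse mechanically. The sketch below is illustrative only: the regular expression and the helper names `parse_eval_log` and `implied_bytes_per_token` are assumptions, and it further assumes that `val_loss` is the mean per-token cross-entropy in nats, so that `val_bpb = (val_loss / ln 2) * tokens / bytes`. Under that assumption the loss-to-BPB ratio should be the same for all three eval phases.

```python
import math
import re

# Metric lines copied from eval_seed1337.log.
LOG = """\
quantized val_loss:2.81379763 val_bpb:1.08949971 eval_time:101404ms
quantized_sliding_window val_loss:2.77029639 val_bpb:1.07265608 eval_time:1580310ms
quantized_ttt val_loss:2.76433372 val_bpb:1.07034733 eval_time:2775717ms
"""

# Hypothetical pattern matching the line layout seen above.
PATTERN = re.compile(r"^(\S+) val_loss:([\d.]+) val_bpb:([\d.]+) eval_time:(\d+)ms$")


def parse_eval_log(text):
    """Return {phase: (val_loss, val_bpb, eval_seconds)} for each metric line."""
    results = {}
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            phase, loss, bpb, ms = match.groups()
            results[phase] = (float(loss), float(bpb), int(ms) / 1000.0)
    return results


def implied_bytes_per_token(val_loss, val_bpb):
    """Assumes bpb = (loss_nats / ln 2) * tokens / bytes."""
    return val_loss / (math.log(2) * val_bpb)


metrics = parse_eval_log(LOG)
for phase, (loss, bpb, seconds) in metrics.items():
    ratio = implied_bytes_per_token(loss, bpb)
    print(f"{phase}: bpb={bpb:.5f} bytes/token={ratio:.4f} eval={seconds / 60:.1f} min")
```

If the assumption holds, all three phases imply the same bytes-per-token figure, which is a quick sanity check that the sliding-window and TTT numbers were scored over the same validation bytes as the plain quantized eval.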
diff --git a/...track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/final_model.int6.ptz b/...track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/final_model.int6.ptz
diff --git a/...rds/track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/requirements.txt b/...rds/track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/requirements.txt
@@ -0,0 +1,4 @@
+sentencepiece
+brotli
+# A100: FA2 optional, falls back to PyTorch SDP (flash backend) if unavailable.
+# pip install flash-attn --no-build-isolation
diff --git a/records/track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/submission.json b/records/track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/submission.json
@@ -0,0 +1,80 @@
+{
+  "author": "Huanyi Xie",
+  "github_id": "xiehuanyi",
+  "name": "SP8192 + 3L Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT \u2014 1.07035 BPB (1xA100, non-record)",
+  "blurb": "Reproduces the PR #1493 SOTA 1.0810 recipe (SP8192 + 3-layer recurrence + parallel residuals layer 7+ + QK-Gain 5.25 + Legal Score-First TTT + MuonEq-R + SDClip GPTQ + Brotli) on 1xA100 for 4h instead of 8xH100 for 10 min. Beats the upstream main-leaderboard SOTA by 0.0107 BPB on TTT and 0.0100 BPB on sliding window. Non-record because it ran on 1xA100 (not required 8xH100), but compute-equivalent at ~80 H100-minute budget.",
+  "date": "2026-04-11",
+  "track": "non_record_16mb",
+  "val_loss": 2.76433372,
+  "val_bpb": 1.07034733,
+  "seeds": [1337],
+  "seed_results": {
+    "1337": {
+      "val_loss_pre_quant": 2.77918548,
+      "val_bpb_pre_quant": 1.07609792,
+      "val_loss_quantized": 2.81379763,
+      "val_bpb_quantized": 1.08949971,
+      "val_loss_sliding": 2.77029639,
+      "val_bpb_sliding": 1.07265608,
+      "val_loss_ttt": 2.76433372,
+      "val_bpb_ttt": 1.07034733,
+      "artifact_bytes": 15970123,
+      "code_bytes": 49104,
+      "total_submission_bytes": 16019227,
+      "steps": 6371,
+      "step_avg_ms": 2259.7
+    }
+  },
+  "artifact_bytes_max": 15970123,
+  "bytes_total": 16019227,
+  "train_steps_mean": 6371,
+  "hardware": "1x NVIDIA A100 80GB SXM4 (IBEX cluster, KAUST)",
+  "pytorch_version": "2.8.0+cu128",
+  "cuda_version": "12.8",
+  "python_version": "3.9.18",
+  "attn_backend": "PyTorch SDP (flash backend; FA3 unavailable on A100)",
+  "max_wallclock_seconds": 14400,
+  "train_seq_len": 2048,
+  "eval_seq_len": 2048,
+  "train_batch_tokens": 786432,
+  "num_layers": 11,
+  "model_dim": 512,
+  "mlp_mult": 4.0,
+  "vocab_size": 8192,
+  "num_heads": 8,
+  "num_kv_heads": 4,
+  "qk_gain_init": 5.25,
+  "num_loops": 2,
+  "loop_start": 3,
+  "loop_end": 5,
+  "enable_looping_at": 0.35,
+  "parallel_residual_start": 7,
+  "skip_gates_enabled": true,
+  "muon_row_normalize": true,
+  "muon_wd": 0.095,
+  "adam_wd": 0.02,
+  "embed_wd": 0.085,
+  "matrix_lr": 0.022,
+  "ema_decay": 0.9965,
+  "warmdown_frac": 0.72,
+  "ln_scale": true,
+  "rope_dims": 16,
+  "compressor": "brotli",
+  "matrix_bits": 6,
+  "embed_bits": 8,
+  "matrix_clip_sigmas": 12.85,
+  "embed_clip_sigmas": 20.0,
+  "ttt_enabled": true,
+  "ttt_lr": 0.005,
+  "ttt_epochs": 3,
+  "ttt_momentum": 0.9,
+  "ttt_chunk_tokens": 32768,
+  "model_params": 35944536,
+  "upstream_sota_ref": {
+    "pr": 1493,
+    "val_bpb_sliding": 1.0827,
+    "val_bpb_ttt": 1.0810,
+    "delta_vs_us_sliding": -0.01004,
+    "delta_vs_us_ttt": -0.01065
+  }
+}
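Several fields in `submission.json` are derived from others, so they can be cross-checked. The sketch below is a hypothetical consistency check, not part of the submission tooling: the dictionary is a flattened copy of the relevant values, the 16 MiB limit is the one stated in the README, and `check_submission` is a name invented here.

```python
# Values copied from submission.json (seed_results["1337"] flattened for brevity).
submission = {
    "artifact_bytes": 15970123,
    "code_bytes": 49104,
    "total_submission_bytes": 16019227,
    "val_bpb_sliding": 1.07265608,
    "val_bpb_ttt": 1.07034733,
    "upstream_sota_ref": {
        "val_bpb_sliding": 1.0827,
        "val_bpb_ttt": 1.0810,
        "delta_vs_us_sliding": -0.01004,
        "delta_vs_us_ttt": -0.01065,
    },
}

SIZE_LIMIT_BYTES = 16 * 1024 * 1024  # 16 MiB, the limit stated in the README


def check_submission(sub):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    total = sub["artifact_bytes"] + sub["code_bytes"]
    if total != sub["total_submission_bytes"]:
        problems.append(f"byte total mismatch: {total}")
    if total >= SIZE_LIMIT_BYTES:
        problems.append(f"over size limit by {total - SIZE_LIMIT_BYTES} bytes")
    ref = sub["upstream_sota_ref"]
    for key in ("sliding", "ttt"):
        delta = sub[f"val_bpb_{key}"] - ref[f"val_bpb_{key}"]
        # The recorded deltas are rounded to 5 decimals, so allow half a unit.
        if abs(delta - ref[f"delta_vs_us_{key}"]) > 5e-6:
            problems.append(f"{key} delta mismatch: {delta:.5f}")
    return problems


print(check_submission(submission) or "submission.json fields are consistent")
```

A check like this could run before opening a PR so that the byte totals and deltas quoted in the README and the JSON cannot drift apart.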