openai · OnlyJundong · Apr 6, 2026 · Apr 6, 2026 · Apr 6, 2026
diff --git a/..._16mb/2026-04-06_ExtendedCompute_20K_6hours_4xA100MIG_ScalingAnalysis/README.md b/..._16mb/2026-04-06_ExtendedCompute_20K_6hours_4xA100MIG_ScalingAnalysis/README.md
@@ -0,0 +1,163 @@
+# Extended Compute Scaling Analysis (20K Steps, ~6 Hours)
+
+**val_bpb: 1.0960** (20K steps, 3-seed mean, std 0.0003) | **~15.05 MB** | 4×A100 MIG (Unlimited Compute Track)
+
+## Summary
+
+This submission is a **non-record submission**. It studies how ([PR #549](https://github.com/openai/parameter-golf/pull/549) by @abaybektursun) scales under extended compute, removing the 10-minute wall-clock constraint. The same architecture and code are trained for 20K steps (~6 hours) on 4×A100 MIG instances (approximately 10× slower per step than 8×H100 SXM).
+
+Key findings:
+- **20K steps achieves 1.0960 BPB post-TTT** (3-seed mean)
+- **Artifact size balloons mid-training** (peaking at ~17.2MB around step 10K–15K) but **recovers to ~15.05MB** after warmdown completes — warmdown smooths weight entropy and restores compressibility
+- **TTT gains scale with base model quality**: TTT provides -0.006 BPB on the 20K model 
+
+## Results
+
+### 20K steps, ~6 hours (4×A100 MIG, 3-seed comparison)
+
+| Seed | step_avg | steps | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | Artifact |
+|------|----------|-------|-------------|-----------------|----------|----------|
+| 1337 | 828.7ms | 20,000 | 1.1018 | **1.0957** | -0.0061 | 15,077,933 |
+| 42 | 828.8ms | 20,000 | 1.1020 | **1.0962** | -0.0058 | 15,137,145 |
+| 2024 | 839.8ms | 20,000 | 1.1017 | **1.0962** | -0.0055 | 14,942,394 |
+| **Mean** | **832.4ms** | **20,000** | **1.1018** | **1.0960 (std 0.0003)** | **-0.0058** | **15,052,491** |
+
+## Scaling Analysis: BPB & Artifact Size vs Training Steps
+
+Artifact size is computed as int6+LZMA compressed model + code bytes. Every 2,000 steps.
+
+### Seed 2024 (2,000-step intervals)
+
+| Steps | Pre-TTT val_bpb | artifact_bytes | Under 16MB? |
+|------:|--------:|---------------:|:-----------:|
+| 0 | 4.1037 | 4,577,366 | Yes |
+| 2,000 | 1.2651 | 13,800,942 | Yes |
+| 4,000 | 1.2286 | 16,959,534 | **No** |
+| 6,000 | 1.2122 | 17,243,366 | **No** |
+| 8,000 | 1.2046 | 17,248,774 | **No** |
+| 10,000 | 1.2007 | 17,246,738 | **No** |
+| 12,000 | 1.1994 | 17,231,058 | **No** |
+| 14,000 | 1.1835 | 16,929,622 | **No** |
+| 16,000 | 1.1672 | 16,321,958 | **No** |
+| 18,000 | 1.1429 | 15,534,274 | Yes |
+| 20,000 | 1.1110* | 14,942,394 | Yes |
+
+*Step 20K artifact reflects the final model with full warmdown applied. Intermediate checkpoints without warmdown exceed 16MB.
+
+**Note:** Intermediate checkpoints do not benefit from warmdown. The final model has full warmdown applied, resulting in ~15.05MB — well under the 16MB limit. The artifact size peaks mid-training when weights are high-entropy, then drops as warmdown smooths them.
+
+### BPB vs Steps (ASCII plot)
+
+Power-law decay with two distinct phases: rapid early learning, then warmdown-driven final drop.
+
+```
+BPB
+4.10 |*
+     |
+     |
+2.50 |
+     |
+1.26 | *
+1.23 |    *
+1.22 |       *
+1.20 |          *
+1.19 |            * *
+1.18 |                *
+1.10 |                  *
+     +---------+--------+-> steps (K)
+     0        10       20
+
+     |<early >|<warmdown>|
+      (rapid)  (sharp drop)
+```
+
+### Artifact Size vs Steps (ASCII plot)
+
+Non-monotonic: grows rapidly to a peak at ~15K steps, then shrinks back below budget during warmdown.
+
+```
+MB
+17.2 |    * * * * * *
+16.0 |--------------------  16MB limit
+15.1 |                  *
+14.1 | 
+13.1 | *
+ 4.6 |*
+     +---------+--------+-> steps (K)
+     0        10        20
+
+     |<-fits->|<over>|<fits>|
+```
+
+Intermediate checkpoints between steps ~10K–17.5K exceed the 16MB budget and cannot be submitted as-is. Only the final model (with warmdown complete) fits. This means **early stopping is not viable** for this architecture without a separate warmdown phase.
+
+### Observations
+
+1. **BPB vs steps follows a power-law decay** with diminishing returns. The biggest gains are in the first 7,500 steps, with warmdown driving the final sharp drop.
+2. **Artifact size is non-monotonic**: it grows rapidly from 4.6MB (init) to ~17.2MB (step 10K–15K peak), then shrinks back to ~15.1MB during warmdown. Intermediate checkpoints without warmdown exceed 16MB.
+3. **TTT gain scales with compute**: At 20K steps, -0.006 BPB. The base model benefits from test-time adaptation. 
+
+## Architecture
+
+Identical to [PR #549](https://github.com/openai/parameter-golf/pull/549) (LeakyReLU² + Legal TTT + Parallel Muon):
+
+| Component | Setting |
+|-----------|---------|
+| Layers | 11 (512d, 8H, 4KV) |
+| MLP | 3× with LeakyReLU(0.5)² |
+| BigramHash | 1536 |
+| XSA | Last 4 layers |
+| RoPE | Partial (16/64 dims) |
+| LN Scale | 1/√(layer+1) |
+| VE128 | Layers 9-10 |
+| Weight avg | EMA(0.997) + Tight SWA(every 50) |
+| Quantization | GPTQ-lite int6 + lzma |
+| Optimizer | Parameter Banking + Parallel Muon |
+
+### Hyperparameter scaling for extended training
+
+LR schedule parameters scaled proportionally to maintain the same warmup/warmdown ratios:
+
+| Parameter | This work (20K steps) |
+|-----------|-----------|
+| ITERATIONS | 20,000 |
+| WARMDOWN_ITERS | 7,800 (39.0%) |
+| MUON_MOMENTUM_WARMUP_STEPS | 3,340 (16.7%) |
+| MAX_WALLCLOCK_SECONDS | 0 (unlimited) |
+
+## Run Command
+
+```bash
+RUN_ID=run_recordsota_nonrecord_seed1337 \
+NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \
+EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
+ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
+VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
+TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
+TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
+MUON_WD=0.04 ADAM_WD=0.04 \
+MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
+MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
+MUON_MOMENTUM_WARMUP_STEPS=3340 WARMDOWN_ITERS=7800 \
+ITERATIONS=20000 MAX_WALLCLOCK_SECONDS=0 EVAL_STRIDE=64 \
+VAL_LOSS_EVERY=2000 \
+SEED=1337 \
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nproc_per_node=4 train_gpt.py
+```
+
+## Hardware
+
+- 4×NVIDIA A100 MIG instances
+- Seeds 1337 & 42: ~829ms/step | Seed 2024: ~840ms/step (vs 83ms/step on 8×H100 SXM — approximately 10× slower)
+- grad_accum_steps=2 (to match 8-GPU effective batch size of 786,432 tokens)
+- ~4.6–4.7h training + ~34min TTT eval = ~6 hours total per seed
+
+## Credits
+
+This submission uses the full architecture and code from the record-track PR #549 with no ML changes — only extended compute and proportionally scaled LR schedules.
+
+- **Base submission [PR #549](https://github.com/openai/parameter-golf/pull/549)**: @abaybektursun — LeakyReLU² + Legal TTT + Parallel Muon
+- **LeakyReLU² activation**: [PR #493](https://github.com/openai/parameter-golf/pull/493) by @parinzee, [PR #518](https://github.com/openai/parameter-golf/pull/518) by @sofiabod
+- **Optimizer (Parameter Banking + Parallel Muon)**: [PR #399](https://github.com/openai/parameter-golf/pull/399) by @abaybektursun
+- **TTT recipe**: [PR #461](https://github.com/openai/parameter-golf/pull/461) by @Christopher-Lee-McClendon
+- **Base model**: [PR #414](https://github.com/openai/parameter-golf/pull/414) by @signalrush
diff --git a/...d_16mb/2026-04-06_ExtendedCompute_20K_6hours_4xA100MIG_ScalingAnalysis/run_scaling_20k.sh b/...d_16mb/2026-04-06_ExtendedCompute_20K_6hours_4xA100MIG_ScalingAnalysis/run_scaling_20k.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+set -e
+
+cd "$(dirname "$0")"
+
+# 20k scaling run — intermediate val_bpb logged every 2K steps for scaling curve
+# Warmdown and momentum warmup scaled proportionally (~39% and ~16.7%)
+# 20k * 0.39 = 7800, 20k * 0.167 = 3340
+
+echo "=========================================="
+echo "Scaling run: 20k steps, seed=2024"
+echo "Started at: $(date)"
+echo "=========================================="
+
+env \
+    RUN_ID=train_step20k_seed2024 \
+    NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \
+    EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
+    ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
+    VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
+    TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
+    TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
+    MUON_WD=0.04 ADAM_WD=0.04 \
+    MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
+    MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
+    MUON_MOMENTUM_WARMUP_STEPS=3340 WARMDOWN_ITERS=7800 \
+    ITERATIONS=20000 MAX_WALLCLOCK_SECONDS=0 EVAL_STRIDE=64 \
+    VAL_LOSS_EVERY=2000 \
+    SEED=2024 \
+    CUDA_VISIBLE_DEVICES=0,1,2,3 \
+    torchrun --standalone --nproc_per_node=4 train_gpt.py \
+    2>&1 | tee logs/train_step20k_seed2024.txt
+
+echo "=========================================="
+echo "20k run finished at: $(date)"
+echo "=========================================="
diff --git a/...cord_16mb/2026-04-06_ExtendedCompute_20K_6hours_4xA100MIG_ScalingAnalysis/submission.json b/...cord_16mb/2026-04-06_ExtendedCompute_20K_6hours_4xA100MIG_ScalingAnalysis/submission.json
@@ -0,0 +1,20 @@
+{
+  "name": "Extended Compute Scaling Analysis (20K Steps, ~6 Hours)",
+  "val_bpb": 1.0960,
+  "val_bpb_std": 0.0003,
+  "bytes_total": 15052491,
+  "seeds": [1337, 42, 2024],
+  "seed_results": {
+    "1337": { "val_bpb": 1.0957, "pre_ttt_bpb": 1.1018, "ttt_gain": -0.0061, "artifact_bytes": 15077933, "step_avg_ms": 828.7 },
+    "42":   { "val_bpb": 1.0962, "pre_ttt_bpb": 1.1020, "ttt_gain": -0.0058, "artifact_bytes": 15137145, "step_avg_ms": 828.8 },
+    "2024": { "val_bpb": 1.0962, "pre_ttt_bpb": 1.1017, "ttt_gain": -0.0055, "artifact_bytes": 14942394, "step_avg_ms": 839.8 }
+  },
+  "blurb": "Scaling analysis of the PR #549 SOTA stack (LeakyReLU² + TTT + Parallel Muon) under extended compute on 4×A100 MIG. 20K steps (~6 hours training) achieves 1.0960 BPB post-TTT (3-seed mean, std 0.0003) with ~15.05MB artifact. Includes compute scaling curve showing warmdown-driven compression recovery and artifact size non-monotonicity.",
+  "author": "Jundong Hu",
+  "github_id": "OnlyJundong",
+  "date": "2026-04-06",
+  "track": "non_record_16mb",
+  "hardware": "4×A100 MIG",
+  "steps": 20000,
+  "step_avg_ms_mean": 832.4
+}