openai · aryanbhosale · Apr 3, 2026
diff --git a/records/track_10min_16mb/2026-04-03_DepthRecurrence_MuonEqR_GPTQ/README.md b/records/track_10min_16mb/2026-04-03_DepthRecurrence_MuonEqR_GPTQ/README.md
@@ -0,0 +1,104 @@
+# Record: Depth Recurrence + MuonEq-R + AR Self-Gen GPTQ — val_bpb 1.1104 (3-seed mean)
+
+**val_bpb = 1.1104** (3-seed mean, std 0.0009) | **~15.97 MB** | 8xH100 SXM
+
+## 3-Seed Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)
+
+| Seed | step_avg | steps | EMA bpb | **Sliding bpb** | val_loss (nats) | Artifact |
+|------|----------|-------|---------|-----------------|-----------------|----------|
+| 42   | 96.7ms   | 6,204 | 1.1300  | **1.1105**      | 1.8751          | 15,974,737 |
+| 314  | 96.7ms   | 6,205 | 1.1288  | **1.1094**      | 1.8731          | 15,972,993 |
+| 999  | 96.7ms   | 6,205 | 1.1292  | **1.1112**      | 1.8762          | 15,969,221 |
+| **Mean** | **96.7ms** | **6,205** | **1.1293** | **1.1104** | **1.8748** | |
+
+SOTA (PR #1019, 3-seed mean): **1.88218 nats**. This run: **1.87481 nats**. Delta: **-0.00737 nats**. Clears the 0.005-nat threshold (Welch t=-7.73, df=2.59).
+
+## Key Innovations
+
+### 1. Depth Recurrence (layers 4,5 repeated)
+
+Layers 4 and 5 (the U-Net hinge point) are executed twice during the forward pass using the same physical parameter banks and block modules. This creates a virtual 13-layer network from an 11-layer parameter budget — zero extra parameters, ~0.003 BPB improvement.
+
+Recurrence activates at step 3000 (after the model has learned basic representations), with the MLP weights primed from the base layers. Step time increases from ~87ms to ~97ms (13 virtual layers vs 11), but the improved per-step learning outweighs the step count reduction.
+
+Lineage: PR #1204 by @msisovic (concept), PR #1260 by @dexhunter (implementation on PR #1218 stack).
+
+### 2. MuonEq-R (Row-Normalized Muon)
+
+Before Newton-Schulz orthogonalization, gradient matrices are row-normalized:
+
+```python
+row_norms = update.float().norm(dim=-1, keepdim=True).clamp_min(1e-7)
+update = update / row_norms.to(update.dtype)
+```
+
+This equalizes row norms so Newton-Schulz operates on better-conditioned matrices. Zero additional bytes, ~0.001 BPB improvement. Source: arXiv:2603.28254, PR #1260 by @dexhunter.
+
+### 3. AR Self-Generated Full Hessian GPTQ (from PR #1019)
+
+After training, the model autoregressively generates 64 sequences of 2048 tokens (temperature=0.8) for GPTQ calibration. Full Hessian GPTQ with Cholesky error compensation. No val data, no train data accessed during quantization. Source: PR #1019 by @abaybektursun.
+
+## Architecture (PR #1019 stack + modifications)
+
+| Component | Setting | Source |
+|-----------|---------|--------|
+| Layers | 11 physical (13 virtual with recurrence) | **This work** |
+| Dimensions | 512d, 8 GQA / 4 KV heads | Baseline |
+| MLP | 3x (1536), LeakyReLU(0.5)^2 | PR #493 @parinzee |
+| Attention | XSA on all 11 layers | PR #1019 @abaybektursun |
+| BigramHash | 3072 x 112 | PR #1019 @abaybektursun |
+| RoPE | Partial (16/64 dims) | PR #315 @jfprincz |
+| LN Scale | 1/sqrt(layer+1) | PR #315 @jfprincz |
+| VE128 | Layers 9-10 | PR #374 @unnir |
+| SmearGate | Position-mixing gate | PR #65 @aquariouseworkman |
+| U-Net skips | Encoder-decoder connections | PR #289 |
+| Weight avg | EMA(0.997) + SWA | PR #401 @newjordan |
+| Optimizer | **MuonEq-R** (Parallel Muon + row-norm) | **This work** (Muon: PR #399 @abaybektursun) |
+| Depth Recurrence | **Layers 4,5 repeated** | **This work** (concept: PR #1204 @msisovic) |
+| Quantization | Full Hessian GPTQ int6 (AR self-gen) | PR #1019 @abaybektursun |
+| Compression | LZMA preset=9 | PR #160 @ChaseWNorton |
+| Warmdown | 4000 iterations | PR #364 @shikhar1729 |
+| Late QAT | STE at LR scale < 0.15 | PR #286 @chris-buckley |
+| Selective pruning | +/-1 by reconstruction error | PR #609 @saml212 |
+| Flash Attention 3 | Hopper kernels | PR #122 @mtybadger |
+
+## Training
+
+- MuonEq-R: lr=0.025, momentum 0.92->0.99/1500 steps, WD=0.04, Newton-Schulz 5 steps
+- Adam for embeddings (lr=0.035) and scalars (lr=0.025)
+- Batch 786,432 tokens, seq_len 2048, warmdown 4000 iters
+- Depth recurrence activates at step 3000
+- Late QAT via STE (final 15% wallclock)
+- Gradient clipping 0.3
+
+## Quantization Pipeline
+
+| Stage | BPB (seed 314) |
+|-------|----------------|
+| Pre-quant (post-EMA) | 1.1288 |
+| Post-GPTQ int6 roundtrip | 1.1331 |
+| Post-GPTQ sliding window | **1.1094** |
+
+## Compliance
+
+- No TTT, no SLOT, no n-gram cache, no eval-time adaptation
+- AR self-generated GPTQ calibration (no external data during quantization)
+- All seeds within 600s training, <16MB artifact
+- Fully legal under all four conditions (Issue #1017)
+
+## Run Command
+
+```bash
+SEED=42 BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 WARMDOWN_ITERS=4000 \
+RECUR_LAYERS=4,5 RECUR_START_STEP=3000 TARGET_MB=15.9 \
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+## Credits
+
+- **Base model + GPTQ + XSA-all + BigramHash**: PR #1019 by @abaybektursun
+- **Depth Recurrence concept**: PR #1204 by @msisovic
+- **MuonEq-R + Depth Recurrence implementation**: PR #1260 by @dexhunter
+- **Parallel Muon**: PR #399 by @abaybektursun
+- **LeakyReLU^2**: PR #493 by @parinzee, PR #518 by @sofiabod
+- **LN Scale + Partial RoPE**: PR #315 by @jfprincz
diff --git a/records/track_10min_16mb/2026-04-03_DepthRecurrence_MuonEqR_GPTQ/submission.json b/records/track_10min_16mb/2026-04-03_DepthRecurrence_MuonEqR_GPTQ/submission.json
@@ -0,0 +1,51 @@
+{
+  "author": "aryanbhosale",
+  "github_id": "aryanbhosale",
+  "name": "Depth Recurrence + MuonEq-R + AR Self-Gen GPTQ",
+  "blurb": "11L with depth recurrence (layers 4,5 repeated) + MuonEq-R optimizer (row-normalized Muon) + Full Hessian GPTQ with AR self-generated calibration on the PR #1019 stack. BigramHash(3072,112), warmdown=4000, XSA all 11 layers, LZMA preset=9. 3-seed exact mean: 1.11036981 BPB / 1.87480795 nats, beating PR #1019 exact 3-seed mean 1.11473509 BPB / 1.88217853 nats by 0.00737058 nats (Welch t=-7.73, df=2.59).",
+  "date": "2026-04-03",
+  "track": "10min_16mb",
+  "val_loss": 1.87480795,
+  "val_bpb": 1.11036981,
+  "val_loss_std": 0.00154055,
+  "val_bpb_std": 0.00091240,
+  "seeds": [42, 314, 999],
+  "seed_results": {
+    "42": {
+      "val_loss": 1.87509706,
+      "val_bpb": 1.11054104,
+      "artifact_bytes": 15974737,
+      "steps": 6204,
+      "step_avg_ms": 96.73
+    },
+    "314": {
+      "val_loss": 1.87314333,
+      "val_bpb": 1.10938392,
+      "artifact_bytes": 15972993,
+      "steps": 6205,
+      "step_avg_ms": 96.72
+    },
+    "999": {
+      "val_loss": 1.87618346,
+      "val_bpb": 1.11118446,
+      "artifact_bytes": 15969221,
+      "steps": 6205,
+      "step_avg_ms": 96.71
+    }
+  },
+  "comparison_baseline_pr": 1019,
+  "delta_vs_pr1019_nats": -0.00737058,
+  "delta_vs_pr1019_bpb": -0.00436528,
+  "t_statistic": -7.7261,
+  "welch_df": 2.5884,
+  "artifact_bytes_mean": 15972317,
+  "artifact_bytes_max": 15974737,
+  "train_steps_mean": 6205,
+  "step_avg_ms_mean": 96.72,
+  "hardware": "8xH100 80GB SXM",
+  "pytorch_version": "2.9.1+cu128",
+  "cuda_version": "12.8",
+  "flash_attn_version": "FA3 Hopper kernels",
+  "calibration": "AR self-generated (64 seqs x 2048 tokens, temp=0.8, no external data)",
+  "technique_summary": "Depth Recurrence (layers 4,5) + MuonEq-R + AR Self-Gen GPTQ int6 + XSA-all + BigramHash 3072x112 + LZMA9"
+}