## Record: MuonEq-R + 3-Layer Recurrence + WD=0.095 + MLR=0.022 + All-Int6 (val_bpb: 1.0900)

**val_bpb = 1.0900** (3-seed mean, std 0.0005) | **2.5078 nats** | **~15.96 MB** | 8xH100 SXM, 590s train + ~81s eval | No TTT

Built on [PR #1218](https://github.com/openai/parameter-golf/pull/1218) by @clarkkev.

Previous: [PR #1218](https://github.com/openai/parameter-golf/pull/1218) (1.0979) -> [PR #1285](https://github.com/openai/parameter-golf/pull/1285) (1.0912) -> this (1.0900)

### Changes from PR #1285

| | PR #1285 | This |
|---|---|---|
| val_bpb | 1.09124 | **1.08995** |
| Recurrence | Layers 4,5 (2-layer) | **Layers 3,4,5 (3-layer)** |
| Weight decay | 0.090 | **0.095** |
| Matrix LR | 0.020 | **0.022** |
| Everything else | Same | Same |

### Key Innovations

1. **3-Layer Depth Recurrence** — Layers 3, 4, and 5 are repeated (RECUR_LAYERS=3,4,5), creating 14 virtual layers from 11 physical ones. MLP weights are fully shared between passes. ~0.0005 bpb improvement over 2-layer recurrence.

2. **WD=0.095 + MLR=0.022 Synergy** — Higher weight decay (0.095 vs 0.090) makes the weights more compressible, while a slightly higher matrix LR (0.022 vs 0.020) recovers the quality lost to the stronger regularization. The net effect is better bpb at the same artifact budget. Note that the 3-layer recurrence needs WD ≥ 0.093 for all-int6 to fit under 16 MB.

3. **MuonEq-R + All-Int6 GPTQ** — Row-normalized Muon optimizer with all 66 layers at int6 precision (carried from PR #1285).
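
The virtual-layer expansion in (1) can be sketched as a layer schedule. This is an illustrative reconstruction, not code from `train_gpt.py`; `virtual_layer_schedule` is a hypothetical helper name:

```python
def virtual_layer_schedule(num_layers, recur_layers):
    """Expand the physical layer indices into the virtual execution order:
    the recurrent block runs a second time, reusing the same weights."""
    schedule = list(range(num_layers))
    insert_at = max(recur_layers) + 1
    # second pass over the shared block immediately after the first
    return schedule[:insert_at] + sorted(recur_layers) + schedule[insert_at:]

# 11 physical layers with RECUR_LAYERS=3,4,5 -> 14 virtual layers
schedule = virtual_layer_schedule(11, [3, 4, 5])
```

With this schedule the forward pass visits layers `0 1 2 3 4 5 3 4 5 6 7 8 9 10`, matching the `virtual_layers:14` line in the training log.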

### Configuration

```bash
NCCL_NET=Socket DATA_DIR=./data SEED=42 \
MIXED_QUANT=1 N_INT6_LAYERS=66 \
RECUR_LAYERS=3,4,5 MUON_WD=0.095 EMBED_WD=0.095 \
MATRIX_LR=0.022 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
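
A minimal sketch of how these environment variables might map onto hyperparameters. This is an assumption about `train_gpt.py`'s internals: the variable names match the command above, but `read_config` and the parsing logic are illustrative only:

```python
import os

def read_config(env=os.environ):
    """Illustrative parsing of the env-var interface shown above."""
    return {
        "recur_layers": [int(x) for x in env.get("RECUR_LAYERS", "").split(",") if x],
        "muon_wd": float(env.get("MUON_WD", "0.090")),
        "matrix_lr": float(env.get("MATRIX_LR", "0.020")),
        "n_int6_layers": int(env.get("N_INT6_LAYERS", "0")),
    }

cfg = read_config({"RECUR_LAYERS": "3,4,5", "MUON_WD": "0.095",
                   "MATRIX_LR": "0.022", "N_INT6_LAYERS": "66"})
```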

## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, no TTT)

### Core Results

| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | val_loss (nats) | Artifact |
|------|-------|---------|--------------|-------------|-----------------|----------|
| 42 | 5,540 | 106.5 | 1.0991 | 1.0898 | 2.50733 | 15,961,029 |
| 0 | 5,536 | 106.6 | 1.0993 | 1.0895 | 2.50672 | 15,955,962 |
| 7 | 5,538 | 106.6 | 1.0999 | 1.0905 | 2.50901 | 15,964,018 |
| **Mean** | **5,538** | **106.6** | **1.0994** | **1.0900** | **2.50769** | **15,960,336** |
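
The headline mean and std can be cross-checked from the per-seed sliding-window scores reported in `submission.json`:

```python
import statistics

# per-seed sliding-window val_bpb for seeds 42, 0, 7 (from submission.json)
seed_scores = [1.08980, 1.08953, 1.09053]
mean_bpb = sum(seed_scores) / len(seed_scores)   # ~1.08995, i.e. 1.0900 at 4 dp
std_bpb = statistics.stdev(seed_scores)          # sample std over the 3 seeds
```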

### Rule Compliance

- No TTT, no SLOT, no eval-time adaptation
- Artifact < 16,000,000 bytes for ALL seeds (max: 15,964,018)
- Train < 600s, eval < 600s on 8xH100 SXM

### Run Command (3-seed loop)

```bash
for SEED in 42 0 7; do
NCCL_NET=Socket DATA_DIR=./data SEED=$SEED \
MIXED_QUANT=1 N_INT6_LAYERS=66 \
RECUR_LAYERS=3,4,5 MUON_WD=0.095 EMBED_WD=0.095 \
MATRIX_LR=0.022 \
torchrun --standalone --nproc_per_node=8 train_gpt.py \
2>&1 | tee train_seed${SEED}.log
done
```

### Credits

- @clarkkev for PR #1218 (4096-Vocab + high-WD architecture)
- @abaybektursun for PR #1019 (GPTQ + XSA + BigramHash baseline)
- @msisovic for PR #1204 (depth recurrence concept)
- @dexhunter for PR #1285 (WD-quantization synergy discovery)

### Included Files

- `train_gpt.py` — full training + quantization + evaluation (20,302 bytes, self-extracting)
- `train_seed42.log`, `train_seed0.log`, `train_seed7.log`
- `submission.json`
{
"name": "Record: MuonEq-R + 3-Layer Recurrence + WD=0.095 + MLR=0.022 + All-Int6",
"val_bpb": 1.0900,
"bytes_total": 15964018,
"blurb": "3-layer depth recurrence (3,4,5) with WD=0.095 and MLR=0.022 on all-int6 GPTQ. WD-LR synergy: higher WD compresses for headroom, higher LR recovers quality. 3-seed mean 1.0900 bpb. No TTT, no SLOT.",
"author": "dexhunter",
"github_id": "dexhunter",
"date": "2026-04-04",
"pre_quant_val_bpb": 1.0994,
"bytes_model_compressed": 15943318,
"bytes_code": 20302,
"base_pr": 1218,
"seeds": [42, 0, 7],
"seed_scores": [1.08980, 1.08953, 1.09053],
"eval_time_seconds": [81, 81, 81]
}


*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: /home/dex/parameter-golf-with-cc/data
datasets_dir: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096
disable_layer0_attn: False
distributed: True
ema_decay: 0.997
embed_lr: 0.6
embed_wd: 0.095
embedding_dim: 512
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_enabled: True
gptq_reserve_seconds: 10.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/9ad94044-4030-491d-a000-f621565297c2.txt
logit_softcap: 30.0
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mixed_quant: True
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_wd: 0.095
muoneq_mode: r
n_int6_layers: 66
num_heads: 8
num_kv_heads: 4
num_layers: 11
parallel_residual: False
parallel_start_layer: 7
parallel_start_layer_is_physical: True
qk_gain_init: 4.0
quantized_model_path: final_model.int6.ptz
rank: 0
recur_layers_str: 3,4,5
recur_start_step: 3000
recur_warmup_steps: 20
repeat_untie_mlp: none
repeat_untie_mlp_layers:
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 9ad94044-4030-491d-a000-f621565297c2
scalar_lr: 0.02
seed: 0
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: /home/dex/parameter-golf-with-cc/data/tokenizers/fineweb_4096_bpe.model
train_batch_tokens: 786432
train_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
val_batch_tokens: 524288
val_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
val_loss_every: 4000
ve_dim: 128
ve_enabled: True
ve_layers: 9,10
vocab_size: 4096
warmdown_frac: 0.667
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 143
val_tokens: 45514752
model_params:34401371
parallel_residual: active=0 start_layer=7 start_mode=physical params=0
recurrence: layers=[3, 4, 5] start_step=3000 active=0
repeat_untie_mlp: mode=none layers=[] params=0
gptq:reserving 10s, effective=590000ms
[rank1]:[W403 15:43:41.333694228 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[ranks 1-7]: (same find_unused_parameters=True DDP warning repeated per rank)
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
recurrence:prewarm active=1 virtual_layers:14
recur_warmup_step: 1/20
recur_warmup_step: 2/20
recur_warmup_step: 3/20
recur_warmup_step: 4/20
recur_warmup_step: 5/20
recur_warmup_step: 6/20
recur_warmup_step: 10/20
recur_warmup_step: 20/20
0/20000 val_loss: 8.3145 val_bpb: 3.6139
1/20000 train_loss: 8.3158 train_time: 0.0m tok/s: 8414190
2/20000 train_loss: 12.3079 train_time: 0.0m tok/s: 8322806
3/20000 train_loss: 10.7562 train_time: 0.0m tok/s: 8223119
4/20000 train_loss: 9.0100 train_time: 0.0m tok/s: 8183649
5/20000 train_loss: 7.8375 train_time: 0.0m tok/s: 8156259
500/20000 train_loss: 2.9988 train_time: 0.8m tok/s: 7923791
1000/20000 train_loss: 2.9482 train_time: 1.7m tok/s: 7921373
1500/20000 train_loss: 2.9088 train_time: 2.5m tok/s: 7919597
2000/20000 train_loss: 2.8387 train_time: 3.3m tok/s: 7917465
2500/20000 train_loss: 2.7216 train_time: 4.1m tok/s: 7915663
3000/20000 train_loss: 2.8276 train_time: 5.0m tok/s: 7914736
recurrence:activated step:3000 layers:[3, 4, 5] virtual_layers:14
3500/20000 train_loss: 2.6979 train_time: 6.0m tok/s: 7653592
4000/20000 train_loss: 2.6219 train_time: 7.0m tok/s: 7468748
4000/20000 val_loss: 2.6481 val_bpb: 1.1510
4500/20000 train_loss: 2.5760 train_time: 8.1m tok/s: 7310121
5000/20000 train_loss: 2.6197 train_time: 9.1m tok/s: 7207245
5362/20000 val_loss: 2.5274 val_bpb: 1.0985
stopping_early: wallclock_cap train_time: 590082ms step: 5362/20000
peak memory allocated: 32484 MiB reserved: 32518 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.52510077 val_bpb:1.09752153 eval_time:2148ms
Serialized model: 132405891 bytes
Code size: 20700 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 10.4s
mixed_quant: sensitivity ranking -- 66 int6 (top), 0 int5 (bottom)
rank 0: int6 blocks.0.mlp.proj.weight sens=4291895808.0 numel=1048576
rank 1: int6 blocks.1.mlp.proj.weight sens=1131829504.0 numel=1048576
rank 2: int6 blocks.3.mlp.proj.weight sens=639102976.0 numel=1048576
rank 3: int6 blocks.2.mlp.proj.weight sens=533221920.0 numel=1048576
rank 4: int6 blocks.4.mlp.proj.weight sens=346102656.0 numel=1048576
rank 5: int6 blocks.5.mlp.proj.weight sens=297923904.0 numel=1048576
rank 6: int6 blocks.7.mlp.proj.weight sens=96664504.0 numel=1048576
rank 7: int6 blocks.6.mlp.proj.weight sens=95330528.0 numel=1048576
rank 8: int6 blocks.8.mlp.proj.weight sens=52402352.0 numel=1048576
rank 9: int6 blocks.0.attn.c_q.weight sens=50334144.0 numel=262144
rank 10: int6 blocks.0.attn.c_k.weight sens=50334144.0 numel=131072
rank 11: int6 blocks.0.attn.c_v.weight sens=50334144.0 numel=131072
rank 12: int6 blocks.0.mlp.fc.weight sens=50330312.0 numel=1048576
rank 13: int6 blocks.9.mlp.proj.weight sens=36778616.0 numel=1048576
rank 14: int6 blocks.0.attn.proj.weight sens=30242380.0 numel=262144
rank 15: int6 blocks.1.attn.c_q.weight sens=25165376.0 numel=262144
rank 16: int6 blocks.1.attn.c_k.weight sens=25165376.0 numel=131072
rank 17: int6 blocks.1.attn.c_v.weight sens=25165376.0 numel=131072
rank 18: int6 blocks.1.mlp.fc.weight sens=25165286.0 numel=1048576
rank 19: int6 blocks.3.attn.c_q.weight sens=25165124.0 numel=262144
rank 20: int6 blocks.3.attn.c_k.weight sens=25165124.0 numel=131072
rank 21: int6 blocks.3.attn.c_v.weight sens=25165124.0 numel=131072
rank 22: int6 blocks.3.mlp.fc.weight sens=25165122.0 numel=1048576
rank 23: int6 blocks.3.attn.proj.weight sens=23067072.0 numel=262144
rank 24: int6 blocks.4.attn.proj.weight sens=20999424.0 numel=262144
rank 25: int6 blocks.4.attn.c_q.weight sens=20133216.0 numel=262144
rank 26: int6 blocks.4.attn.c_k.weight sens=20133216.0 numel=131072
rank 27: int6 blocks.4.attn.c_v.weight sens=20133216.0 numel=131072
rank 28: int6 blocks.4.mlp.fc.weight sens=20133208.0 numel=1048576
rank 29: int6 blocks.5.attn.c_q.weight sens=16778130.0 numel=262144
rank 30: int6 blocks.5.attn.c_k.weight sens=16778130.0 numel=131072
rank 31: int6 blocks.5.attn.c_v.weight sens=16778130.0 numel=131072
rank 32: int6 blocks.5.mlp.fc.weight sens=16778130.0 numel=1048576
rank 33: int6 blocks.2.attn.c_q.weight sens=16776920.0 numel=262144
rank 34: int6 blocks.2.attn.c_k.weight sens=16776920.0 numel=131072
rank 35: int6 blocks.2.attn.c_v.weight sens=16776920.0 numel=131072
rank 36: int6 blocks.2.mlp.fc.weight sens=16776912.0 numel=1048576
rank 37: int6 blocks.10.mlp.proj.weight sens=15877998.0 numel=1048576
rank 38: int6 blocks.1.attn.proj.weight sens=15654982.0 numel=262144
rank 39: int6 blocks.2.attn.proj.weight sens=13337374.0 numel=262144
rank 40: int6 blocks.5.attn.proj.weight sens=11676767.0 numel=262144
rank 41: int6 blocks.6.attn.c_q.weight sens=7191526.0 numel=262144
rank 42: int6 blocks.6.attn.c_k.weight sens=7191526.0 numel=131072
rank 43: int6 blocks.6.attn.c_v.weight sens=7191526.0 numel=131072
rank 44: int6 blocks.6.mlp.fc.weight sens=7191522.0 numel=1048576
rank 45: int6 blocks.7.mlp.fc.weight sens=6291325.0 numel=1048576
rank 46: int6 blocks.7.attn.c_q.weight sens=6291324.5 numel=262144
rank 47: int6 blocks.7.attn.c_k.weight sens=6291324.5 numel=131072
rank 48: int6 blocks.7.attn.c_v.weight sens=6291324.5 numel=131072
rank 49: int6 blocks.8.mlp.fc.weight sens=5592424.5 numel=1048576
rank 50: int6 blocks.8.attn.c_q.weight sens=5592423.5 numel=262144
rank 51: int6 blocks.8.attn.c_k.weight sens=5592423.5 numel=131072
rank 52: int6 blocks.8.attn.c_v.weight sens=5592423.5 numel=131072
rank 53: int6 blocks.6.attn.proj.weight sens=5141315.5 numel=262144
rank 54: int6 blocks.9.attn.c_q.weight sens=5032689.5 numel=262144
rank 55: int6 blocks.9.attn.c_k.weight sens=5032689.5 numel=131072
rank 56: int6 blocks.9.attn.c_v.weight sens=5032689.5 numel=131072
rank 57: int6 blocks.9.mlp.fc.weight sens=5032684.5 numel=1048576
rank 58: int6 blocks.10.attn.c_q.weight sens=4575790.5 numel=262144
rank 59: int6 blocks.10.attn.c_k.weight sens=4575790.5 numel=131072
rank 60: int6 blocks.10.attn.c_v.weight sens=4575790.5 numel=131072
rank 61: int6 blocks.10.mlp.fc.weight sens=4575548.5 numel=1048576
rank 62: int6 blocks.7.attn.proj.weight sens=3795631.5 numel=262144
rank 63: int6 blocks.9.attn.proj.weight sens=2910124.0 numel=262144
rank 64: int6 blocks.10.attn.proj.weight sens=2810890.5 numel=262144
rank 65: int6 blocks.8.attn.proj.weight sens=2492911.0 numel=262144
mixed_quant: most sensitive=blocks.0.mlp.proj.weight (4291895808.0), least sensitive=blocks.8.attn.proj.weight (2492911.0)
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
mixed_quant: 66 int6, 0 int5
Serialized model mixed_int5_int6+brotli: 15935262 bytes
Total submission size mixed_int5_int6+brotli: 15955962 bytes
final_int6_roundtrip val_loss:2.54932435 val_bpb:1.10805018 eval_time:7204ms
final_int6_sliding_window val_loss:2.50671417 val_bpb:1.08952989 eval_time:81110ms