## Record: MuonEq-R + 3-Layer Recurrence + WD=0.095 + MLR=0.022 + All-Int6 (val_bpb: 1.0900)

**val_bpb = 1.0900** (3-seed mean, std 0.0005) | **2.5078 nats** | **~15.96 MB** | 8xH100 SXM, 590s train + ~81s eval | No TTT

Built on [PR #1218](https://github.com/openai/parameter-golf/pull/1218) by @clarkkev.

Previous: [PR #1218](https://github.com/openai/parameter-golf/pull/1218) (1.0979) -> [PR #1285](https://github.com/openai/parameter-golf/pull/1285) (1.0912) -> this (1.0900)

### Changes from PR #1285

| | PR #1285 | This |
|---|---|---|
| val_bpb | 1.09124 | **1.08995** |
| Recurrence | Layers 4,5 (2-layer) | **Layers 3,4,5 (3-layer)** |
| Weight decay | 0.090 | **0.095** |
| Matrix LR | 0.020 | **0.022** |
| Everything else | Same | Same |

### Key Innovations

1. **3-Layer Depth Recurrence** — Layers 3, 4, and 5 are repeated (RECUR_LAYERS=3,4,5), creating 14 virtual layers from 11 physical ones. MLP weights are fully shared between passes. ~0.0005 bpb improvement over 2-layer recurrence.

2. **WD=0.095 + MLR=0.022 Synergy** — Higher weight decay (0.095 vs 0.090) makes the weights more compressible, while a slightly higher matrix LR (0.022 vs 0.020) recovers the quality lost to the stronger regularization. The net effect is better bpb at the same artifact budget. Note that the 3-layer recurrence needs WD ≥ 0.093 for all-int6 to fit under 16 MB.

3. **MuonEq-R + All-Int6 GPTQ** — Row-normalized Muon optimizer with all 66 layers at int6 precision (carried from PR #1285).
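
The virtual-layer expansion in (1) can be sketched as a layer schedule. This is an illustrative reconstruction, not code from `train_gpt.py`; `virtual_layer_schedule` is a hypothetical helper name:

```python
def virtual_layer_schedule(num_layers, recur_layers):
    """Expand the physical layer indices into the virtual execution order:
    the recurrent block runs a second time, reusing the same weights."""
    schedule = list(range(num_layers))
    insert_at = max(recur_layers) + 1
    # second pass over the shared block immediately after the first
    return schedule[:insert_at] + sorted(recur_layers) + schedule[insert_at:]

# 11 physical layers with RECUR_LAYERS=3,4,5 -> 14 virtual layers
schedule = virtual_layer_schedule(11, [3, 4, 5])
```

With this schedule the forward pass visits layers `0 1 2 3 4 5 3 4 5 6 7 8 9 10`, matching the `virtual_layers:14` line in the training log.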

### Configuration

```bash
NCCL_NET=Socket DATA_DIR=./data SEED=42 \
MIXED_QUANT=1 N_INT6_LAYERS=66 \
RECUR_LAYERS=3,4,5 MUON_WD=0.095 EMBED_WD=0.095 \
MATRIX_LR=0.022 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
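
A minimal sketch of how these environment variables might map onto hyperparameters. This is an assumption about `train_gpt.py`'s internals: the variable names match the command above, but `read_config` and the parsing logic are illustrative only:

```python
import os

def read_config(env=os.environ):
    """Illustrative parsing of the env-var interface shown above."""
    return {
        "recur_layers": [int(x) for x in env.get("RECUR_LAYERS", "").split(",") if x],
        "muon_wd": float(env.get("MUON_WD", "0.090")),
        "matrix_lr": float(env.get("MATRIX_LR", "0.020")),
        "n_int6_layers": int(env.get("N_INT6_LAYERS", "0")),
    }

cfg = read_config({"RECUR_LAYERS": "3,4,5", "MUON_WD": "0.095",
                   "MATRIX_LR": "0.022", "N_INT6_LAYERS": "66"})
```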

## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, no TTT)

### Core Results

| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | val_loss (nats) | Artifact |
|------|-------|---------|--------------|-------------|-----------------|----------|
| 42 | 5,540 | 106.5 | 1.0991 | 1.0898 | 2.50733 | 15,961,029 |
| 0 | 5,536 | 106.6 | 1.0993 | 1.0895 | 2.50672 | 15,955,962 |
| 7 | 5,538 | 106.6 | 1.0999 | 1.0905 | 2.50901 | 15,964,018 |
| **Mean** | **5,538** | **106.6** | **1.0994** | **1.0900** | **2.50769** | **15,960,336** |
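
The headline mean and std can be cross-checked from the per-seed sliding-window scores reported in `submission.json`:

```python
import statistics

# per-seed sliding-window val_bpb for seeds 42, 0, 7 (from submission.json)
seed_scores = [1.08980, 1.08953, 1.09053]
mean_bpb = sum(seed_scores) / len(seed_scores)   # ~1.08995, i.e. 1.0900 at 4 dp
std_bpb = statistics.stdev(seed_scores)          # sample std over the 3 seeds
```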

### Rule Compliance

- No TTT, no SLOT, no eval-time adaptation
- Artifact < 16,000,000 bytes for ALL seeds (max: 15,964,018)
- Train < 600s, eval < 600s on 8xH100 SXM

### Run Command (3-seed loop)

```bash
for SEED in 42 0 7; do
NCCL_NET=Socket DATA_DIR=./data SEED=$SEED \
MIXED_QUANT=1 N_INT6_LAYERS=66 \
RECUR_LAYERS=3,4,5 MUON_WD=0.095 EMBED_WD=0.095 \
MATRIX_LR=0.022 \
torchrun --standalone --nproc_per_node=8 train_gpt.py \
2>&1 | tee train_seed${SEED}.log
done
```

### Credits

- @clarkkev for PR #1218 (4096-Vocab + high-WD architecture)
- @abaybektursun for PR #1019 (GPTQ + XSA + BigramHash baseline)
- @msisovic for PR #1204 (depth recurrence concept)
- @dexhunter for PR #1285 (WD-quantization synergy discovery)

### Included Files

- `train_gpt.py` — full training + quantization + evaluation (20,302 bytes, self-extracting)
- `train_seed42.log`, `train_seed0.log`, `train_seed7.log`
- `submission.json`
{
"name": "Record: MuonEq-R + 3-Layer Recurrence + WD=0.095 + MLR=0.022 + All-Int6",
"val_bpb": 1.0900,
"bytes_total": 15964018,
"blurb": "3-layer depth recurrence (3,4,5) with WD=0.095 and MLR=0.022 on all-int6 GPTQ. WD-LR synergy: higher WD compresses for headroom, higher LR recovers quality. 3-seed mean 1.0900 bpb. No TTT, no SLOT.",
"author": "dexhunter",
"github_id": "dexhunter",
"date": "2026-04-04",
"pre_quant_val_bpb": 1.0994,
"bytes_model_compressed": 15943318,
"bytes_code": 20302,
"base_pr": 1218,
"seeds": [42, 0, 7],
"seed_scores": [1.08980, 1.08953, 1.09053],
"eval_time_seconds": [81, 81, 81]
}


*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: /home/dex/parameter-golf-with-cc/data
datasets_dir: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096
disable_layer0_attn: False
distributed: True
ema_decay: 0.997
embed_lr: 0.6
embed_wd: 0.095
embedding_dim: 512
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_enabled: True
gptq_reserve_seconds: 10.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/9ad94044-4030-491d-a000-f621565297c2.txt
logit_softcap: 30.0
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mixed_quant: True
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_wd: 0.095
muoneq_mode: r
n_int6_layers: 66
num_heads: 8
num_kv_heads: 4
num_layers: 11
parallel_residual: False
parallel_start_layer: 7
parallel_start_layer_is_physical: True
qk_gain_init: 4.0
quantized_model_path: final_model.int6.ptz
rank: 0
recur_layers_str: 3,4,5
recur_start_step: 3000
recur_warmup_steps: 20
repeat_untie_mlp: none
repeat_untie_mlp_layers:
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 9ad94044-4030-491d-a000-f621565297c2
scalar_lr: 0.02
seed: 0
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: /home/dex/parameter-golf-with-cc/data/tokenizers/fineweb_4096_bpe.model
train_batch_tokens: 786432
train_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
val_batch_tokens: 524288
val_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
val_loss_every: 4000
ve_dim: 128
ve_enabled: True
ve_layers: 9,10
vocab_size: 4096
warmdown_frac: 0.667
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 143
val_tokens: 45514752
model_params:34401371
parallel_residual: active=0 start_layer=7 start_mode=physical params=0
recurrence: layers=[3, 4, 5] start_step=3000 active=0
repeat_untie_mlp: mode=none layers=[] params=0
gptq:reserving 10s, effective=590000ms
[rank1]:[W403 15:43:41.333694228 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[ranks 1-7]: (same find_unused_parameters=True DDP warning repeated per rank)
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
recurrence:prewarm active=1 virtual_layers:14
recur_warmup_step: 1/20
recur_warmup_step: 2/20
recur_warmup_step: 3/20
recur_warmup_step: 4/20
recur_warmup_step: 5/20
recur_warmup_step: 6/20
recur_warmup_step: 10/20
recur_warmup_step: 20/20
0/20000 val_loss: 8.3145 val_bpb: 3.6139
1/20000 train_loss: 8.3158 train_time: 0.0m tok/s: 8414190
2/20000 train_loss: 12.3079 train_time: 0.0m tok/s: 8322806
3/20000 train_loss: 10.7562 train_time: 0.0m tok/s: 8223119
4/20000 train_loss: 9.0100 train_time: 0.0m tok/s: 8183649
5/20000 train_loss: 7.8375 train_time: 0.0m tok/s: 8156259
500/20000 train_loss: 2.9988 train_time: 0.8m tok/s: 7923791
1000/20000 train_loss: 2.9482 train_time: 1.7m tok/s: 7921373
1500/20000 train_loss: 2.9088 train_time: 2.5m tok/s: 7919597
2000/20000 train_loss: 2.8387 train_time: 3.3m tok/s: 7917465
2500/20000 train_loss: 2.7216 train_time: 4.1m tok/s: 7915663
3000/20000 train_loss: 2.8276 train_time: 5.0m tok/s: 7914736
recurrence:activated step:3000 layers:[3, 4, 5] virtual_layers:14
3500/20000 train_loss: 2.6979 train_time: 6.0m tok/s: 7653592
4000/20000 train_loss: 2.6219 train_time: 7.0m tok/s: 7468748
4000/20000 val_loss: 2.6481 val_bpb: 1.1510
4500/20000 train_loss: 2.5760 train_time: 8.1m tok/s: 7310121
5000/20000 train_loss: 2.6197 train_time: 9.1m tok/s: 7207245
5362/20000 val_loss: 2.5274 val_bpb: 1.0985
stopping_early: wallclock_cap train_time: 590082ms step: 5362/20000
peak memory allocated: 32484 MiB reserved: 32518 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.52510077 val_bpb:1.09752153 eval_time:2148ms
Serialized model: 132405891 bytes
Code size: 20700 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 10.4s
mixed_quant: sensitivity ranking -- 66 int6 (top), 0 int5 (bottom)
rank 0: int6 blocks.0.mlp.proj.weight sens=4291895808.0 numel=1048576
rank 1: int6 blocks.1.mlp.proj.weight sens=1131829504.0 numel=1048576
rank 2: int6 blocks.3.mlp.proj.weight sens=639102976.0 numel=1048576
rank 3: int6 blocks.2.mlp.proj.weight sens=533221920.0 numel=1048576
rank 4: int6 blocks.4.mlp.proj.weight sens=346102656.0 numel=1048576
rank 5: int6 blocks.5.mlp.proj.weight sens=297923904.0 numel=1048576
rank 6: int6 blocks.7.mlp.proj.weight sens=96664504.0 numel=1048576
rank 7: int6 blocks.6.mlp.proj.weight sens=95330528.0 numel=1048576
rank 8: int6 blocks.8.mlp.proj.weight sens=52402352.0 numel=1048576
rank 9: int6 blocks.0.attn.c_q.weight sens=50334144.0 numel=262144
rank 10: int6 blocks.0.attn.c_k.weight sens=50334144.0 numel=131072
rank 11: int6 blocks.0.attn.c_v.weight sens=50334144.0 numel=131072
rank 12: int6 blocks.0.mlp.fc.weight sens=50330312.0 numel=1048576
rank 13: int6 blocks.9.mlp.proj.weight sens=36778616.0 numel=1048576
rank 14: int6 blocks.0.attn.proj.weight sens=30242380.0 numel=262144
rank 15: int6 blocks.1.attn.c_q.weight sens=25165376.0 numel=262144
rank 16: int6 blocks.1.attn.c_k.weight sens=25165376.0 numel=131072
rank 17: int6 blocks.1.attn.c_v.weight sens=25165376.0 numel=131072
rank 18: int6 blocks.1.mlp.fc.weight sens=25165286.0 numel=1048576
rank 19: int6 blocks.3.attn.c_q.weight sens=25165124.0 numel=262144
rank 20: int6 blocks.3.attn.c_k.weight sens=25165124.0 numel=131072
rank 21: int6 blocks.3.attn.c_v.weight sens=25165124.0 numel=131072
rank 22: int6 blocks.3.mlp.fc.weight sens=25165122.0 numel=1048576
rank 23: int6 blocks.3.attn.proj.weight sens=23067072.0 numel=262144
rank 24: int6 blocks.4.attn.proj.weight sens=20999424.0 numel=262144
rank 25: int6 blocks.4.attn.c_q.weight sens=20133216.0 numel=262144
rank 26: int6 blocks.4.attn.c_k.weight sens=20133216.0 numel=131072
rank 27: int6 blocks.4.attn.c_v.weight sens=20133216.0 numel=131072
rank 28: int6 blocks.4.mlp.fc.weight sens=20133208.0 numel=1048576
rank 29: int6 blocks.5.attn.c_q.weight sens=16778130.0 numel=262144
rank 30: int6 blocks.5.attn.c_k.weight sens=16778130.0 numel=131072
rank 31: int6 blocks.5.attn.c_v.weight sens=16778130.0 numel=131072
rank 32: int6 blocks.5.mlp.fc.weight sens=16778130.0 numel=1048576
rank 33: int6 blocks.2.attn.c_q.weight sens=16776920.0 numel=262144
rank 34: int6 blocks.2.attn.c_k.weight sens=16776920.0 numel=131072
rank 35: int6 blocks.2.attn.c_v.weight sens=16776920.0 numel=131072
rank 36: int6 blocks.2.mlp.fc.weight sens=16776912.0 numel=1048576
rank 37: int6 blocks.10.mlp.proj.weight sens=15877998.0 numel=1048576
rank 38: int6 blocks.1.attn.proj.weight sens=15654982.0 numel=262144
rank 39: int6 blocks.2.attn.proj.weight sens=13337374.0 numel=262144
rank 40: int6 blocks.5.attn.proj.weight sens=11676767.0 numel=262144
rank 41: int6 blocks.6.attn.c_q.weight sens=7191526.0 numel=262144
rank 42: int6 blocks.6.attn.c_k.weight sens=7191526.0 numel=131072
rank 43: int6 blocks.6.attn.c_v.weight sens=7191526.0 numel=131072
rank 44: int6 blocks.6.mlp.fc.weight sens=7191522.0 numel=1048576
rank 45: int6 blocks.7.mlp.fc.weight sens=6291325.0 numel=1048576
rank 46: int6 blocks.7.attn.c_q.weight sens=6291324.5 numel=262144
rank 47: int6 blocks.7.attn.c_k.weight sens=6291324.5 numel=131072
rank 48: int6 blocks.7.attn.c_v.weight sens=6291324.5 numel=131072
rank 49: int6 blocks.8.mlp.fc.weight sens=5592424.5 numel=1048576
rank 50: int6 blocks.8.attn.c_q.weight sens=5592423.5 numel=262144
rank 51: int6 blocks.8.attn.c_k.weight sens=5592423.5 numel=131072
rank 52: int6 blocks.8.attn.c_v.weight sens=5592423.5 numel=131072
rank 53: int6 blocks.6.attn.proj.weight sens=5141315.5 numel=262144
rank 54: int6 blocks.9.attn.c_q.weight sens=5032689.5 numel=262144
rank 55: int6 blocks.9.attn.c_k.weight sens=5032689.5 numel=131072
rank 56: int6 blocks.9.attn.c_v.weight sens=5032689.5 numel=131072
rank 57: int6 blocks.9.mlp.fc.weight sens=5032684.5 numel=1048576
rank 58: int6 blocks.10.attn.c_q.weight sens=4575790.5 numel=262144
rank 59: int6 blocks.10.attn.c_k.weight sens=4575790.5 numel=131072
rank 60: int6 blocks.10.attn.c_v.weight sens=4575790.5 numel=131072
rank 61: int6 blocks.10.mlp.fc.weight sens=4575548.5 numel=1048576
rank 62: int6 blocks.7.attn.proj.weight sens=3795631.5 numel=262144
rank 63: int6 blocks.9.attn.proj.weight sens=2910124.0 numel=262144
rank 64: int6 blocks.10.attn.proj.weight sens=2810890.5 numel=262144
rank 65: int6 blocks.8.attn.proj.weight sens=2492911.0 numel=262144
mixed_quant: most sensitive=blocks.0.mlp.proj.weight (4291895808.0), least sensitive=blocks.8.attn.proj.weight (2492911.0)
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
mixed_quant: 66 int6, 0 int5
Serialized model mixed_int5_int6+brotli: 15935262 bytes
Total submission size mixed_int5_int6+brotli: 15955962 bytes
final_int6_roundtrip val_loss:2.54932435 val_bpb:1.10805018 eval_time:7204ms
final_int6_sliding_window val_loss:2.50671417 val_bpb:1.08952989 eval_time:81110ms