# Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Causal SLOT — val_bpb 1.0766 (3-seed mean)

**val_bpb = 1.0766** (3-seed mean, std 0.0004) | **~16.00 MB** | 8xH100 SXM

## 3-Seed Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Sliding BPB | **Causal SLOT BPB** | SLOT gain | Artifact (bytes) |
|------|-------------|---------------------|-----------|------------------|
| 42 | 1.0893 | **1.0762** | -0.0131 | 15,999,461 |
| 314 | 1.0897 | **1.0766** | -0.0131 | 15,997,932 |
| 999 | 1.0897 | **1.0770** | -0.0127 | 15,994,941 |
| **Mean** | 1.0896 | **1.0766** | **-0.0130** | |

Merged SOTA (PR #1019): **1.1147 BPB**. Delta: **-0.0381 BPB**.
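The headline numbers can be rechecked from the per-seed results with the standard library (the reported std is the sample standard deviation, n−1):

```python
import statistics

seed_bpb = {42: 1.07620919, 314: 1.07660728, 999: 1.07700722}

mean = statistics.mean(seed_bpb.values())
std = statistics.stdev(seed_bpb.values())   # sample std, matching 0.00039902

print(round(mean, 4))            # → 1.0766
print(round(mean - 1.1147, 4))   # → -0.0381  (delta vs PR #1019)
```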

## Key Techniques

### Training (8 techniques)

1. **4096-Vocab + MLP 4x + WD 0.090** — PR #1218 @clarkkev, PR #1285 @dexhunter
2. **Depth Recurrence (layers 4,5)** — PR #1204 @msisovic, PR #1260 @dexhunter
3. **Parallel Residuals (from layer 7)** — PR #1204 @msisovic, PR #1289 @MatoTeziTanka
4. **MuonEq-R** — arXiv:2603.28254, PR #1260 @dexhunter
5. **QK-Gain 5.0** — PR #1217 @bigbag
6. **Full GPTQ int6 + Brotli + LZMA Compressed Wrapper**
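Depth recurrence (technique 2) re-runs a shared-weight block of physical layers during the forward pass once activated. A minimal sketch of the schedule construction, with a hypothetical `virtual_layer_schedule` helper that reproduces the `virtual_layers` list seen in the training logs (the actual `train_gpt.py` wiring may differ):

```python
def virtual_layer_schedule(num_layers, recur_layers):
    """Expand the physical layer stack into a virtual schedule where
    the recurrent block runs a second time immediately after its
    first pass. Hypothetical helper for illustration only."""
    schedule = []
    for i in range(num_layers):
        schedule.append(i)
        if i == max(recur_layers):
            # repeat the shared-weight block once (depth recurrence)
            schedule.extend(recur_layers)
    return schedule

# recur_layers=4,5 with 11 physical layers, as in the logs:
print(virtual_layer_schedule(11, [4, 5]))
# → [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
```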

### Evaluation: Causal SLOT (context-only delta optimization)

A per-batch additive delta vector (dim=512) is optimized with AdamW (lr=0.008, 16 steps) on **context-only positions** during sliding-window eval. Only already-scored tokens contribute to the optimization loss; the delta is re-initialized to zeros for each batch, and model weights remain completely frozen.

This is provably causal: the delta at position t depends only on tokens x_1, ..., x_{t-stride}, all of which have already been scored. New positions (the last stride=64 tokens per window) are scored with the context-adapted delta but never influence its optimization.

Source: arXiv:2505.12392v2, PR #1306 @resouer (causal variant), PR #1176 @bigbag (SLOT concept).
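A toy sketch of the context-only delta optimization, under illustrative assumptions: plain gradient descent stands in for AdamW, and a bare linear output head stands in for the full model; the names here are not the repo's actual API.

```python
import numpy as np

def optimize_context_delta(H_ctx, targets_ctx, W, lr=0.008, steps=16):
    """Optimize an additive hidden-state delta using ONLY already-scored
    (context) positions; the output projection W stays frozen.
    Toy stand-in for the record's per-batch SLOT procedure."""
    delta = np.zeros(W.shape[1])          # re-initialized to zeros per batch
    for _ in range(steps):
        logits = (H_ctx + delta) @ W.T    # full-vocab logits
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # softmax minus one-hot = gradient of cross-entropy w.r.t. logits
        probs[np.arange(len(targets_ctx)), targets_ctx] -= 1.0
        delta -= lr * (probs @ W).mean(axis=0)
    return delta

def mean_nll(H, targets, W, delta):
    logits = (H + delta) @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
H_ctx = rng.normal(size=(32, 16))         # hidden states of scored context tokens
y_ctx = rng.integers(0, 50, size=32)      # their next-token targets
W = rng.normal(size=(50, 16)) * 0.1       # frozen output head
delta = optimize_context_delta(H_ctx, y_ctx, W)
assert mean_nll(H_ctx, y_ctx, W, delta) < mean_nll(H_ctx, y_ctx, W, np.zeros(16))
# New-window tokens would now be scored with `delta` but never enter the loss.
```

The loss is convex in the delta (log-sum-exp minus a linear term), so small-step descent on context positions reliably reduces context NLL without touching model weights.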

## Compliance

- **Condition 1** (causal): delta optimized on context-only positions (already scored). New tokens excluded from optimization loss.
- **Condition 2** (full distribution): standard softmax over full 4096-token vocabulary
- **Condition 3** (score-before-update): new tokens scored AFTER delta optimization on context. Delta does not use new token information.
- **Condition 4** (single pass): single left-to-right sliding window, no rescoring
- Model weights frozen during eval — only delta vector optimized per-batch
- GPTQ calibration within training budget
- Total eval: ~520s (sliding ~76s + SLOT ~444s), within 600s budget
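The artifact-size accounting above boils down to: serialize, compress, verify the result fits the track's 16 MB cap. A minimal sketch using the standard library's `lzma` (the run's primary compressor is brotli; the exact 16,000,000-byte budget and the concatenation-style bundling are assumptions for illustration, consistent with the logged sizes):

```python
import lzma

SIZE_BUDGET = 16_000_000  # assumed 16 MB cap, consistent with logged artifact sizes

def wrap_artifact(model_bytes: bytes, code_bytes: bytes) -> bytes:
    """Compress serialized model weights and bundle them with the
    submission code. Hypothetical helper: the real wrapper applies
    brotli (with lzma as an alternative) to int6 GPTQ weights."""
    return lzma.compress(model_bytes, preset=9) + code_bytes

model_bytes = bytes(range(256)) * 4096   # stand-in for serialized weights
code_bytes = b"#" * 23803                # code size reported in the logs
artifact = wrap_artifact(model_bytes, code_bytes)
assert len(artifact) <= SIZE_BUDGET
```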

## Reproduction

```bash
pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
SLOT_ENABLED=1 SLOT_LR=0.008 SLOT_STEPS=16 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee, PR #1306 @resouer (causal SLOT), PR #1176 @bigbag (SLOT concept)
```json
{
"author": "aryanbhosale",
"github_id": "aryanbhosale",
"name": "SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Causal SLOT-16",
"date": "2026-04-04",
"track": "10min_16mb",
"val_bpb": 1.07660790,
"val_bpb_std": 0.00039902,
"seeds": [42, 314, 999],
"seed_results": {
"42": {"val_bpb": 1.07620919, "artifact_bytes": 15999461},
"314": {"val_bpb": 1.07660728, "artifact_bytes": 15997932},
"999": {"val_bpb": 1.07700722, "artifact_bytes": 15994941}
},
"comparison_baseline_pr": 1019,
"delta_vs_pr1019_bpb": -0.03813,
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"technique_summary": "SP4096 + MLP 4x + WD 0.090 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 + Causal SLOT-16 + Full GPTQ int6 + Brotli"
}
```

W0404 08:08:26.228000 78847 torch/distributed/run.py:803]
W0404 08:08:26.228000 78847 torch/distributed/run.py:803] *****************************************
W0404 08:08:26.228000 78847 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0404 08:08:26.228000 78847 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp4096
distributed: True
ema_decay: 0.997
embed_lr: 0.6
embed_wd: 0.09
embedding_dim: 512
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_enabled: True
gptq_reserve_seconds: 10.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/f7607a20-c299-450b-9170-973578a8b2ce.txt
logit_softcap: 30.0
matrix_lr: 0.02
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_wd: 0.09
num_heads: 8
num_kv_heads: 4
num_layers: 11
parallel_start_layer: 7
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
recur_layers: 4,5
recur_start_step: 3000
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: f7607a20-c299-450b-9170-973578a8b2ce
scalar_lr: 0.02
seed: 314
skip_gates_enabled: True
sliding_window_enabled: True
slot_enabled: True
slot_lr: 0.008
slot_steps: 16
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_4096_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
val_loss_every: 4000
ve_dim: 128
ve_enabled: True
ve_layers: 9,10
vocab_size: 4096
warmdown_frac: 0.667
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 45508608
model_params:34401372
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 8.3172 val_bpb: 3.6146
1/20000 train_loss: 8.3192 train_time: 0.0m tok/s: 8507828
2/20000 train_loss: 12.1995 train_time: 0.0m tok/s: 8377177
3/20000 train_loss: 10.6851 train_time: 0.0m tok/s: 8288110
4/20000 train_loss: 8.8318 train_time: 0.0m tok/s: 8233714
5/20000 train_loss: 7.6631 train_time: 0.0m tok/s: 8203041
500/20000 train_loss: 2.9028 train_time: 0.8m tok/s: 7976717
1000/20000 train_loss: 2.8869 train_time: 1.7m tok/s: 7942538
1500/20000 train_loss: 2.9120 train_time: 2.5m tok/s: 7935326
2000/20000 train_loss: 2.6523 train_time: 3.3m tok/s: 7932106
2500/20000 train_loss: 2.7109 train_time: 4.1m tok/s: 7930042
3000/20000 train_loss: 2.7611 train_time: 5.0m tok/s: 7929894
recurrence:activated at step 3000, virtual_layers=[0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
3500/20000 train_loss: 2.6827 train_time: 6.1m tok/s: 7529226
4000/20000 train_loss: 2.6169 train_time: 7.1m tok/s: 7435459
4000/20000 val_loss: 2.6413 val_bpb: 1.1479
4500/20000 train_loss: 2.5702 train_time: 8.0m tok/s: 7365310
5000/20000 train_loss: 2.5111 train_time: 9.0m tok/s: 7309592
5454/20000 val_loss: 2.5262 val_bpb: 1.0978
stopping_early: wallclock_cap train_time: 590094ms step: 5454/20000
peak memory allocated: 30120 MiB reserved: 30154 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.52368690 val_bpb:1.09676525 eval_time:2005ms
Serialized model: 132406149 bytes
Code size: 23803 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 9.8s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
selective_prune: unpruned=16.00MB target=16.0MB
selective_prune: already fits, no pruning needed
Serialized model int6+brotli: 15974129 bytes
Total submission size int6+brotli: 15997932 bytes
final_int6_roundtrip val_loss:2.55027811 val_bpb:1.10832149 eval_time:7527ms
final_int6_sliding_window val_loss:2.50739734 val_bpb:1.08968600 eval_time:76169ms
final_causal_slot val_loss:2.47727670 val_bpb:1.07660728 eval_time:444871ms
W0404 07:38:35.392000 77439 torch/distributed/run.py:803]
W0404 07:38:35.392000 77439 torch/distributed/run.py:803] *****************************************
W0404 07:38:35.392000 77439 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0404 07:38:35.392000 77439 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp4096
distributed: True
ema_decay: 0.997
embed_lr: 0.6
embed_wd: 0.09
embedding_dim: 512
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_enabled: True
gptq_reserve_seconds: 10.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/923648bf-e0a6-40d4-b29e-0299c4f40422.txt
logit_softcap: 30.0
matrix_lr: 0.02
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_wd: 0.09
num_heads: 8
num_kv_heads: 4
num_layers: 11
parallel_start_layer: 7
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
recur_layers: 4,5
recur_start_step: 3000
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 923648bf-e0a6-40d4-b29e-0299c4f40422
scalar_lr: 0.02
seed: 42
skip_gates_enabled: True
sliding_window_enabled: True
slot_enabled: True
slot_lr: 0.008
slot_steps: 16
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_4096_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
val_loss_every: 4000
ve_dim: 128
ve_enabled: True
ve_layers: 9,10
vocab_size: 4096
warmdown_frac: 0.667
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 45508608
model_params:34401372
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 8.3187 val_bpb: 3.6152
1/20000 train_loss: 8.3201 train_time: 0.0m tok/s: 8475031
2/20000 train_loss: 12.1482 train_time: 0.0m tok/s: 8359023
3/20000 train_loss: 10.6752 train_time: 0.0m tok/s: 8275865
4/20000 train_loss: 8.8831 train_time: 0.0m tok/s: 8193201
5/20000 train_loss: 7.6882 train_time: 0.0m tok/s: 8153963
500/20000 train_loss: 2.8980 train_time: 0.8m tok/s: 7964606
1000/20000 train_loss: 2.8826 train_time: 1.7m tok/s: 7943614
1500/20000 train_loss: 2.9046 train_time: 2.5m tok/s: 7936900
2000/20000 train_loss: 2.6485 train_time: 3.3m tok/s: 7933540
2500/20000 train_loss: 2.7097 train_time: 4.1m tok/s: 7931972
3000/20000 train_loss: 2.7596 train_time: 5.0m tok/s: 7931646
recurrence:activated at step 3000, virtual_layers=[0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
3500/20000 train_loss: 2.6817 train_time: 6.1m tok/s: 7528857
4000/20000 train_loss: 2.6179 train_time: 7.1m tok/s: 7435705
4000/20000 val_loss: 2.6409 val_bpb: 1.1477
4500/20000 train_loss: 2.5735 train_time: 8.0m tok/s: 7365391
5000/20000 train_loss: 2.5137 train_time: 9.0m tok/s: 7309483
5454/20000 val_loss: 2.5257 val_bpb: 1.0976
stopping_early: wallclock_cap train_time: 590101ms step: 5454/20000
peak memory allocated: 30120 MiB reserved: 30154 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.52314384 val_bpb:1.09652925 eval_time:2008ms
Serialized model: 132406149 bytes
Code size: 23803 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 9.7s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
selective_prune: unpruned=16.00MB target=16.0MB
selective_prune: already fits, no pruning needed
Serialized model int6+brotli: 15975658 bytes
Total submission size int6+brotli: 15999461 bytes
final_int6_roundtrip val_loss:2.54928373 val_bpb:1.10788934 eval_time:7568ms
final_int6_sliding_window val_loss:2.50641155 val_bpb:1.08925759 eval_time:76200ms
final_causal_slot val_loss:2.47636068 val_bpb:1.07620919 eval_time:444138ms