# Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Causal SLOT — val_bpb 1.0766 (3-seed mean)

**val_bpb = 1.0766** (3-seed mean, std 0.0004) | **~16.00 MB** | 8xH100 SXM

## 3-Seed Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Sliding BPB | **Causal SLOT BPB** | SLOT gain | Artifact (bytes) |
|------|-------------|---------------------|-----------|------------------|
| 42 | 1.0893 | **1.0762** | -0.0131 | 15,999,461 |
| 314 | 1.0897 | **1.0766** | -0.0131 | 15,997,932 |
| 999 | 1.0897 | **1.0770** | -0.0127 | 15,994,941 |
| **Mean** | 1.0896 | **1.0766** | **-0.0130** | |

Merged SOTA (PR #1019): **1.1147 BPB**. Delta: **-0.0381 BPB**.
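The headline numbers can be rechecked from the per-seed results with the standard library (the reported std is the sample standard deviation, n−1):

```python
import statistics

seed_bpb = {42: 1.07620919, 314: 1.07660728, 999: 1.07700722}

mean = statistics.mean(seed_bpb.values())
std = statistics.stdev(seed_bpb.values())   # sample std, matching 0.00039902

print(round(mean, 4))            # → 1.0766
print(round(mean - 1.1147, 4))   # → -0.0381  (delta vs PR #1019)
```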

## Key Techniques

### Training (8 techniques)

1. **4096-Vocab + MLP 4x + WD 0.090** — PR #1218 @clarkkev, PR #1285 @dexhunter
2. **Depth Recurrence (layers 4,5)** — PR #1204 @msisovic, PR #1260 @dexhunter
3. **Parallel Residuals (from layer 7)** — PR #1204 @msisovic, PR #1289 @MatoTeziTanka
4. **MuonEq-R** — arXiv:2603.28254, PR #1260 @dexhunter
5. **QK-Gain 5.0** — PR #1217 @bigbag
6. **Full GPTQ int6 + Brotli + LZMA Compressed Wrapper**
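Depth recurrence (technique 2) re-runs a shared-weight block of physical layers during the forward pass once activated. A minimal sketch of the schedule construction, with a hypothetical `virtual_layer_schedule` helper that reproduces the `virtual_layers` list seen in the training logs (the actual `train_gpt.py` wiring may differ):

```python
def virtual_layer_schedule(num_layers, recur_layers):
    """Expand the physical layer stack into a virtual schedule where
    the recurrent block runs a second time immediately after its
    first pass. Hypothetical helper for illustration only."""
    schedule = []
    for i in range(num_layers):
        schedule.append(i)
        if i == max(recur_layers):
            # repeat the shared-weight block once (depth recurrence)
            schedule.extend(recur_layers)
    return schedule

# recur_layers=4,5 with 11 physical layers, as in the logs:
print(virtual_layer_schedule(11, [4, 5]))
# → [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
```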

### Evaluation: Causal SLOT (context-only delta optimization)

A per-batch additive delta vector (dim=512) is optimized with AdamW (lr=0.008, 16 steps) on **context-only positions** during sliding-window eval. Only already-scored tokens contribute to the optimization loss; the delta is re-initialized to zeros for each batch, and model weights remain completely frozen.

This is provably causal: the delta at position t depends only on tokens x_1, ..., x_{t-stride}, all of which have already been scored. New positions (the last stride=64 tokens per window) are scored with the context-adapted delta but never influence its optimization.

Source: arXiv:2505.12392v2, PR #1306 @resouer (causal variant), PR #1176 @bigbag (SLOT concept).
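A toy sketch of the context-only delta optimization, under illustrative assumptions: plain gradient descent stands in for AdamW, and a bare linear output head stands in for the full model; the names here are not the repo's actual API.

```python
import numpy as np

def optimize_context_delta(H_ctx, targets_ctx, W, lr=0.008, steps=16):
    """Optimize an additive hidden-state delta using ONLY already-scored
    (context) positions; the output projection W stays frozen.
    Toy stand-in for the record's per-batch SLOT procedure."""
    delta = np.zeros(W.shape[1])          # re-initialized to zeros per batch
    for _ in range(steps):
        logits = (H_ctx + delta) @ W.T    # full-vocab logits
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # softmax minus one-hot = gradient of cross-entropy w.r.t. logits
        probs[np.arange(len(targets_ctx)), targets_ctx] -= 1.0
        delta -= lr * (probs @ W).mean(axis=0)
    return delta

def mean_nll(H, targets, W, delta):
    logits = (H + delta) @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
H_ctx = rng.normal(size=(32, 16))         # hidden states of scored context tokens
y_ctx = rng.integers(0, 50, size=32)      # their next-token targets
W = rng.normal(size=(50, 16)) * 0.1       # frozen output head
delta = optimize_context_delta(H_ctx, y_ctx, W)
assert mean_nll(H_ctx, y_ctx, W, delta) < mean_nll(H_ctx, y_ctx, W, np.zeros(16))
# New-window tokens would now be scored with `delta` but never enter the loss.
```

The loss is convex in the delta (log-sum-exp minus a linear term), so small-step descent on context positions reliably reduces context NLL without touching model weights.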

## Compliance

- **Condition 1** (causal): delta optimized on context-only positions (already scored). New tokens excluded from optimization loss.
- **Condition 2** (full distribution): standard softmax over full 4096-token vocabulary
- **Condition 3** (score-before-update): new tokens scored AFTER delta optimization on context. Delta does not use new token information.
- **Condition 4** (single pass): single left-to-right sliding window, no rescoring
- Model weights frozen during eval — only delta vector optimized per-batch
- GPTQ calibration within training budget
- Total eval: ~520s (sliding ~76s + SLOT ~444s), within 600s budget
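The artifact-size accounting above boils down to: serialize, compress, verify the result fits the track's 16 MB cap. A minimal sketch using the standard library's `lzma` (the run's primary compressor is brotli; the exact 16,000,000-byte budget and the concatenation-style bundling are assumptions for illustration, consistent with the logged sizes):

```python
import lzma

SIZE_BUDGET = 16_000_000  # assumed 16 MB cap, consistent with logged artifact sizes

def wrap_artifact(model_bytes: bytes, code_bytes: bytes) -> bytes:
    """Compress serialized model weights and bundle them with the
    submission code. Hypothetical helper: the real wrapper applies
    brotli (with lzma as an alternative) to int6 GPTQ weights."""
    return lzma.compress(model_bytes, preset=9) + code_bytes

model_bytes = bytes(range(256)) * 4096   # stand-in for serialized weights
code_bytes = b"#" * 23803                # code size reported in the logs
artifact = wrap_artifact(model_bytes, code_bytes)
assert len(artifact) <= SIZE_BUDGET
```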

## Reproduction

```bash
pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
SLOT_ENABLED=1 SLOT_LR=0.008 SLOT_STEPS=16 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee, PR #1306 @resouer (causal SLOT), PR #1176 @bigbag (SLOT concept)
```json
{
"author": "aryanbhosale",
"github_id": "aryanbhosale",
"name": "SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Causal SLOT-16",
"date": "2026-04-04",
"track": "10min_16mb",
"val_bpb": 1.07660790,
"val_bpb_std": 0.00039902,
"seeds": [42, 314, 999],
"seed_results": {
"42": {"val_bpb": 1.07620919, "artifact_bytes": 15999461},
"314": {"val_bpb": 1.07660728, "artifact_bytes": 15997932},
"999": {"val_bpb": 1.07700722, "artifact_bytes": 15994941}
},
"comparison_baseline_pr": 1019,
"delta_vs_pr1019_bpb": -0.03813,
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"technique_summary": "SP4096 + MLP 4x + WD 0.090 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 + Causal SLOT-16 + Full GPTQ int6 + Brotli"
}
```

W0404 08:08:26.228000 78847 torch/distributed/run.py:803]
W0404 08:08:26.228000 78847 torch/distributed/run.py:803] *****************************************
W0404 08:08:26.228000 78847 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0404 08:08:26.228000 78847 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp4096
distributed: True
ema_decay: 0.997
embed_lr: 0.6
embed_wd: 0.09
embedding_dim: 512
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_enabled: True
gptq_reserve_seconds: 10.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/f7607a20-c299-450b-9170-973578a8b2ce.txt
logit_softcap: 30.0
matrix_lr: 0.02
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_wd: 0.09
num_heads: 8
num_kv_heads: 4
num_layers: 11
parallel_start_layer: 7
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
recur_layers: 4,5
recur_start_step: 3000
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: f7607a20-c299-450b-9170-973578a8b2ce
scalar_lr: 0.02
seed: 314
skip_gates_enabled: True
sliding_window_enabled: True
slot_enabled: True
slot_lr: 0.008
slot_steps: 16
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_4096_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
val_loss_every: 4000
ve_dim: 128
ve_enabled: True
ve_layers: 9,10
vocab_size: 4096
warmdown_frac: 0.667
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 45508608
model_params:34401372
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 8.3172 val_bpb: 3.6146
1/20000 train_loss: 8.3192 train_time: 0.0m tok/s: 8507828
2/20000 train_loss: 12.1995 train_time: 0.0m tok/s: 8377177
3/20000 train_loss: 10.6851 train_time: 0.0m tok/s: 8288110
4/20000 train_loss: 8.8318 train_time: 0.0m tok/s: 8233714
5/20000 train_loss: 7.6631 train_time: 0.0m tok/s: 8203041
500/20000 train_loss: 2.9028 train_time: 0.8m tok/s: 7976717
1000/20000 train_loss: 2.8869 train_time: 1.7m tok/s: 7942538
1500/20000 train_loss: 2.9120 train_time: 2.5m tok/s: 7935326
2000/20000 train_loss: 2.6523 train_time: 3.3m tok/s: 7932106
2500/20000 train_loss: 2.7109 train_time: 4.1m tok/s: 7930042
3000/20000 train_loss: 2.7611 train_time: 5.0m tok/s: 7929894
recurrence:activated at step 3000, virtual_layers=[0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
3500/20000 train_loss: 2.6827 train_time: 6.1m tok/s: 7529226
4000/20000 train_loss: 2.6169 train_time: 7.1m tok/s: 7435459
4000/20000 val_loss: 2.6413 val_bpb: 1.1479
4500/20000 train_loss: 2.5702 train_time: 8.0m tok/s: 7365310
5000/20000 train_loss: 2.5111 train_time: 9.0m tok/s: 7309592
5454/20000 val_loss: 2.5262 val_bpb: 1.0978
stopping_early: wallclock_cap train_time: 590094ms step: 5454/20000
peak memory allocated: 30120 MiB reserved: 30154 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.52368690 val_bpb:1.09676525 eval_time:2005ms
Serialized model: 132406149 bytes
Code size: 23803 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 9.8s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
selective_prune: unpruned=16.00MB target=16.0MB
selective_prune: already fits, no pruning needed
Serialized model int6+brotli: 15974129 bytes
Total submission size int6+brotli: 15997932 bytes
final_int6_roundtrip val_loss:2.55027811 val_bpb:1.10832149 eval_time:7527ms
final_int6_sliding_window val_loss:2.50739734 val_bpb:1.08968600 eval_time:76169ms
final_causal_slot val_loss:2.47727670 val_bpb:1.07660728 eval_time:444871ms
W0404 07:38:35.392000 77439 torch/distributed/run.py:803]
W0404 07:38:35.392000 77439 torch/distributed/run.py:803] *****************************************
W0404 07:38:35.392000 77439 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0404 07:38:35.392000 77439 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp4096
distributed: True
ema_decay: 0.997
embed_lr: 0.6
embed_wd: 0.09
embedding_dim: 512
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_enabled: True
gptq_reserve_seconds: 10.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/923648bf-e0a6-40d4-b29e-0299c4f40422.txt
logit_softcap: 30.0
matrix_lr: 0.02
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_wd: 0.09
num_heads: 8
num_kv_heads: 4
num_layers: 11
parallel_start_layer: 7
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
recur_layers: 4,5
recur_start_step: 3000
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 923648bf-e0a6-40d4-b29e-0299c4f40422
scalar_lr: 0.02
seed: 42
skip_gates_enabled: True
sliding_window_enabled: True
slot_enabled: True
slot_lr: 0.008
slot_steps: 16
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_4096_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
val_loss_every: 4000
ve_dim: 128
ve_enabled: True
ve_layers: 9,10
vocab_size: 4096
warmdown_frac: 0.667
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 45508608
model_params:34401372
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 8.3187 val_bpb: 3.6152
1/20000 train_loss: 8.3201 train_time: 0.0m tok/s: 8475031
2/20000 train_loss: 12.1482 train_time: 0.0m tok/s: 8359023
3/20000 train_loss: 10.6752 train_time: 0.0m tok/s: 8275865
4/20000 train_loss: 8.8831 train_time: 0.0m tok/s: 8193201
5/20000 train_loss: 7.6882 train_time: 0.0m tok/s: 8153963
500/20000 train_loss: 2.8980 train_time: 0.8m tok/s: 7964606
1000/20000 train_loss: 2.8826 train_time: 1.7m tok/s: 7943614
1500/20000 train_loss: 2.9046 train_time: 2.5m tok/s: 7936900
2000/20000 train_loss: 2.6485 train_time: 3.3m tok/s: 7933540
2500/20000 train_loss: 2.7097 train_time: 4.1m tok/s: 7931972
3000/20000 train_loss: 2.7596 train_time: 5.0m tok/s: 7931646
recurrence:activated at step 3000, virtual_layers=[0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
3500/20000 train_loss: 2.6817 train_time: 6.1m tok/s: 7528857
4000/20000 train_loss: 2.6179 train_time: 7.1m tok/s: 7435705
4000/20000 val_loss: 2.6409 val_bpb: 1.1477
4500/20000 train_loss: 2.5735 train_time: 8.0m tok/s: 7365391
5000/20000 train_loss: 2.5137 train_time: 9.0m tok/s: 7309483
5454/20000 val_loss: 2.5257 val_bpb: 1.0976
stopping_early: wallclock_cap train_time: 590101ms step: 5454/20000
peak memory allocated: 30120 MiB reserved: 30154 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.52314384 val_bpb:1.09652925 eval_time:2008ms
Serialized model: 132406149 bytes
Code size: 23803 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 9.7s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
selective_prune: unpruned=16.00MB target=16.0MB
selective_prune: already fits, no pruning needed
Serialized model int6+brotli: 15975658 bytes
Total submission size int6+brotli: 15999461 bytes
final_int6_roundtrip val_loss:2.54928373 val_bpb:1.10788934 eval_time:7568ms
final_int6_sliding_window val_loss:2.50641155 val_bpb:1.08925759 eval_time:76200ms
final_causal_slot val_loss:2.47636068 val_bpb:1.07620919 eval_time:444138ms