# Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R — val_bpb 1.0897 (3-seed mean)

**val_bpb = 1.0897** (3-seed mean, std 0.0003) | **~15.99 MB** | 8xH100 SXM

## 3-Seed Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | **Sliding BPB** | Artifact (bytes) |
|------|-----------------|-------------------|
| 42 | **1.0894** | 15,999,165 |
| 314 | **1.0898** | 15,997,318 |
| 999 | **1.0899** | 15,990,607 |
| **Mean** | **1.0897** | |

Merged SOTA (PR #1019): **1.1147 BPB**. Delta: **-0.0250 BPB**.
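The "Sliding BPB" column comes from the sliding-window evaluation visible in the log (`eval_seq_len: 2048`, `eval_stride: 64`): each 64-token chunk is scored with up to ~2K tokens of preceding context, then mean NLL in nats is converted to bits per byte. A minimal sketch of that windowing and the conversion — `score_fn` and `tokens_per_byte` are hypothetical stand-ins, not the repo's actual interfaces:

```python
import math

def sliding_window_nll(score_fn, tokens, seq_len=2048, stride=64):
    """Mean per-token NLL (nats) under sliding-window eval: each
    stride-sized chunk is scored with up to seq_len - stride tokens
    of context. `score_fn(ctx, tgt)` returns the total NLL of `tgt`
    given `ctx` (hypothetical model interface)."""
    total, count = 0.0, 0
    for start in range(0, len(tokens), stride):
        tgt = tokens[start:start + stride]
        ctx = tokens[max(0, start + stride - seq_len):start]
        total += score_fn(ctx, tgt)
        count += len(tgt)
    return total / count

def nats_to_bpb(nll_nats, tokens_per_byte):
    """Convert mean NLL per token (nats) to bits per byte."""
    return nll_nats / math.log(2) * tokens_per_byte
```

The ratio implied by the logged final pair (2.50771838 nats ↔ 1.08982552 bpb) is ≈0.301 tokens per byte, i.e. ≈3.3 bytes per token for the 4096-symbol BPE; the exact ratio is a property of the held-out shards and is not stated in the log.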

## Key Techniques

1. **4096-Vocab + MLP 4x + WD 0.090** — PR #1218 @clarkkev, PR #1285 @dexhunter
2. **Depth Recurrence (layers 4,5)** — PR #1204 @msisovic, PR #1260 @dexhunter
3. **Parallel Residuals (from layer 7)** — PR #1204 @msisovic, PR #1289 @MatoTeziTanka
4. **MuonEq-R** — arXiv:2603.28254, PR #1260 @dexhunter
5. **QK-Gain 5.0** — PR #1217 @bigbag
6. **Full GPTQ int6 + Brotli + Compressed Wrapper**
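Techniques 2 and 3 can be sketched together, matching the log line `recurrence:activated at step 3000, virtual_layers=[0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]`: physical layers 4 and 5 are run a second time with shared weights, and from layer 7 on the attention and MLP branches both read the block input (parallel residual) rather than composing sequentially. The block internals below are simplified tanh linear maps standing in for the real attention/MLP code:

```python
import numpy as np

def virtual_schedule(num_layers, recur_layers):
    """Expand physical layers into the virtual depth order by running
    the recurrent span a second time right after its first pass."""
    order = []
    for i in range(num_layers):
        order.append(i)
        if i == recur_layers[-1]:
            order.extend(recur_layers)  # e.g. ...,4,5 -> ...,4,5,4,5,...
    return order

def toy_block(x, w_attn, w_mlp, parallel):
    """Stand-in block: tanh linears in place of attention and MLP."""
    if parallel:
        # parallel residual: both branches read the same input x
        return x + np.tanh(x @ w_attn) + np.tanh(x @ w_mlp)
    h = x + np.tanh(x @ w_attn)   # sequential: MLP sees post-attn state
    return h + np.tanh(h @ w_mlp)

def forward(x, weights, recur_layers=(4, 5), parallel_start_layer=7):
    """Reused layers index the same weight tuple, so the extra depth
    costs zero additional parameters."""
    for i in virtual_schedule(len(weights), list(recur_layers)):
        w_attn, w_mlp = weights[i]
        x = toy_block(x, w_attn, w_mlp, parallel=(i >= parallel_start_layer))
    return x
```

Per the log, recurrence only switches on mid-run (`recur_start_step: 3000`): before that step the schedule is the plain `[0..10]`, after it the 13-deep virtual order.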

## Compliance

No TTT, no SLOT, no n-gram cache, no eval-time adaptation. All four conditions from Issue #1017 satisfied.

## Reproduction

```bash
pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
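The artifact budget in the log (`Serialized model int6+brotli: 15968168 bytes` + `Code size: 24584 bytes` = `15992752` total) fits under 16,000,000 bytes. A sketch of the int6 bit-packing step (4 six-bit codes per 3 bytes); the real pipeline uses GPTQ to choose the 6-bit codes and `brotli` for the final compression, while this sketch imports stdlib `zlib` purely as a stand-in compressor:

```python
import zlib  # stand-in; the actual pipeline uses brotli (pip install brotli)

def pack_int6(values):
    """Pack unsigned 6-bit codes (0..63) MSB-first: 4 codes -> 3 bytes."""
    bits, nbits, out = 0, 0, bytearray()
    for v in values:
        assert 0 <= v < 64
        bits, nbits = (bits << 6) | v, nbits + 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:                        # left-align any trailing partial byte
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack_int6(data, count):
    """Inverse of pack_int6; `count` discards the trailing padding bits."""
    bits, nbits, out = 0, 0, []
    for b in data:
        bits, nbits = (bits << 8) | b, nbits + 8
        while nbits >= 6 and len(out) < count:
            nbits -= 6
            out.append((bits >> nbits) & 0x3F)
    return out

# budget check with the logged sizes: model + code under 16 MB (decimal)
assert 15_968_168 + 24_584 == 15_992_752 <= 16_000_000
```

The log's `selective_prune` lines show the last ~5 KB of excess being shaved by nudging the 40,288 lowest-error codes by ±1 so the compressed stream lands just under the target.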

## Credits

PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee
{
"author": "aryanbhosale",
"github_id": "aryanbhosale",
"name": "SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0",
"date": "2026-04-03",
"track": "10min_16mb",
"val_bpb": 1.08971631,
"val_bpb_std": 0.00028794,
"seeds": [42, 314, 999],
"seed_results": {
"42": {"val_bpb": 1.08938974, "artifact_bytes": 15999165},
"314": {"val_bpb": 1.08982552, "artifact_bytes": 15997318},
"999": {"val_bpb": 1.08993367, "artifact_bytes": 15990607}
},
"comparison_baseline_pr": 1019,
"delta_vs_pr1019_bpb": -0.02501878,
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"technique_summary": "SP4096 + MLP 4x + WD 0.090 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 + Full GPTQ int6 + Brotli"
}

W0403 15:01:26.741000 69615 torch/distributed/run.py:803]
W0403 15:01:26.741000 69615 torch/distributed/run.py:803] *****************************************
W0403 15:01:26.741000 69615 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0403 15:01:26.741000 69615 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp4096
distributed: True
ema_decay: 0.997
embed_lr: 0.6
embed_wd: 0.09
embedding_dim: 512
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_enabled: True
gptq_reserve_seconds: 10.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/ef1ace8d-5618-44d0-869c-bf47841c2cbb.txt
logit_softcap: 30.0
matrix_lr: 0.02
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_wd: 0.09
num_heads: 8
num_kv_heads: 4
num_layers: 11
parallel_start_layer: 7
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
recur_layers: 4,5
recur_start_step: 3000
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: ef1ace8d-5618-44d0-869c-bf47841c2cbb
scalar_lr: 0.02
seed: 314
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_4096_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_batch_seqs: 32
ttt_chunk_tokens: 32768
ttt_enabled: False
ttt_epochs: 3
ttt_freeze_blocks: 0
ttt_grad_clip: 1.0
ttt_lr: 0.002
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
val_loss_every: 4000
ve_dim: 128
ve_enabled: True
ve_layers: 9,10
vocab_size: 4096
warmdown_frac: 0.667
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 45508608
model_params:34401372
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 8.3172 val_bpb: 3.6146
1/20000 train_loss: 8.3192 train_time: 0.0m tok/s: 8485990
2/20000 train_loss: 12.1995 train_time: 0.0m tok/s: 8384997
3/20000 train_loss: 10.6851 train_time: 0.0m tok/s: 8291082
4/20000 train_loss: 8.8318 train_time: 0.0m tok/s: 8231004
5/20000 train_loss: 7.6631 train_time: 0.0m tok/s: 8204599
500/20000 train_loss: 2.9015 train_time: 0.8m tok/s: 7973158
1000/20000 train_loss: 2.8899 train_time: 1.6m tok/s: 7948160
1500/20000 train_loss: 2.9126 train_time: 2.5m tok/s: 7940104
2000/20000 train_loss: 2.6546 train_time: 3.3m tok/s: 7935938
2500/20000 train_loss: 2.7109 train_time: 4.1m tok/s: 7932999
3000/20000 train_loss: 2.7603 train_time: 5.0m tok/s: 7932017
recurrence:activated at step 3000, virtual_layers=[0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
3500/20000 train_loss: 2.6852 train_time: 6.1m tok/s: 7528600
4000/20000 train_loss: 2.6170 train_time: 7.1m tok/s: 7433377
4000/20000 val_loss: 2.6414 val_bpb: 1.1479
4500/20000 train_loss: 2.5682 train_time: 8.0m tok/s: 7357189
5000/20000 train_loss: 2.5132 train_time: 9.0m tok/s: 7302210
5449/20000 val_loss: 2.5263 val_bpb: 1.0979
stopping_early: wallclock_cap train_time: 590088ms step: 5449/20000
peak memory allocated: 30120 MiB reserved: 30154 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.52391662 val_bpb:1.09686509 eval_time:2006ms
Serialized model: 132406149 bytes
Code size: 24584 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 9.7s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
selective_prune: unpruned=16.01MB target=16.0MB
selective_prune: pruning 40288/9380500 lowest-error ±1 values (excess=5036B)
Serialized model int6+brotli: 15968168 bytes
Total submission size int6+brotli: 15992752 bytes
final_int6_roundtrip val_loss:2.55060505 val_bpb:1.10846357 eval_time:7667ms
final_int6_sliding_window val_loss:2.50771838 val_bpb:1.08982552 eval_time:76573ms