# Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Legal TTT — val_bpb 1.0896 (3-seed mean)

**val_bpb = 1.0896** (3-seed mean, std 0.0008) | **~15.99 MB** | 8xH100 SXM

## 3-Seed Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Sliding BPB | **TTT BPB** | TTT gain | Artifact |
|------|-------------|-------------|----------|----------|
| 42 | 1.0896 | **1.0889** | -0.0007 | 15,999,165 |
| 314 | 1.0915 | **1.0906** | -0.0010 | 15,974,112 |
| 999 | 1.0901 | **1.0894** | -0.0007 | 15,996,001 |
| **Mean** | 1.0904 | **1.0896** | **-0.0008** | |

Merged SOTA (PR #1019): **1.1147 BPB**. Delta: **-0.0251 BPB**.

## Key Techniques

1. **4096-Vocab + MLP 4x + WD 0.090** — PR #1218 @clarkkev, PR #1285 @dexhunter
2. **Depth Recurrence (layers 4,5)** — PR #1204 @msisovic, PR #1260 @dexhunter
3. **Parallel Residuals (from layer 7)** — PR #1204 @msisovic, PR #1289 @MatoTeziTanka
4. **MuonEq-R** — arXiv:2603.28254, PR #1260 @dexhunter
5. **QK-Gain 5.0** — PR #1217 @bigbag
6. **Legal Score-First TTT** — each 32K-token chunk is scored under `torch.no_grad` before any SGD update on that chunk; scoring runs through the compiled forward pass for correctness. PR #461 @Christopher-Lee-McClendon
7. **Full GPTQ int6 + Brotli + Compressed Wrapper**
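Techniques 2 and 3 can be read off the run log: depth recurrence replays layers 4 and 5 (shared weights) a second time in the forward pass, producing the `virtual_layers=[0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]` schedule logged at step 3000. A minimal sketch of that schedule (the function name is illustrative, not from `train_gpt.py`):

```python
def virtual_layer_order(num_layers, recur_layers):
    """Depth recurrence: after the last recurrent layer runs once,
    replay the recurrent layers a second time with shared weights."""
    order = []
    for i in range(num_layers):
        order.append(i)
        if i == max(recur_layers):
            order.extend(sorted(recur_layers))  # second pass, same parameters
    return order

# Matches the log: virtual_layers=[0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
print(virtual_layer_order(11, (4, 5)))
```

Parameter count is unchanged (the replayed layers reuse their weights); only the effective depth grows, which is consistent with the tok/s drop from ~7.9M to ~7.1M after step 3000 in the log.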

## TTT Compliance

Legal score-first per PR #461 framework:
- Every token scored BEFORE any weight update (enforced by torch.no_grad + compiled scoring)
- No training data access during evaluation
- No multi-epoch scoring — each chunk scored exactly once
- Total eval time: ~460s per the seed-314 log (int6 roundtrip ~24s + sliding ~100s + TTT ~335s)
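The score-first loop above can be sketched as follows. This is a minimal illustration of the ordering constraint only, not the PR #461 implementation: the toy linear model, MSE loss, and chunk format are stand-ins.

```python
import torch

def score_first_ttt(model, chunks, lr=0.002, epochs=3, grad_clip=1.0):
    """Score each chunk under no_grad BEFORE any weight update on it,
    then adapt on that same chunk with SGD (backward-looking only)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.MSELoss()
    scores = []
    for x, y in chunks:
        with torch.no_grad():                 # evaluation happens first
            scores.append(loss_fn(model(x), y).item())
        for _ in range(epochs):               # then train on the already-scored chunk
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
            opt.step()
    return sum(scores) / len(scores)

# toy chunks: each is scored exactly once, before the model adapts to it
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
chunks = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]
mean_score = score_first_ttt(model, chunks)
```

Because every chunk is scored before the optimizer ever sees it, no evaluation token contributes to the weights that score it; later chunks benefit only from adaptation on strictly earlier data.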

## Compliance

- Legal score-first TTT (backward-looking only)
- No SLOT, no n-gram cache
- GPTQ calibration within training budget
- All four conditions from Issue #1017 satisfied

## Reproduction

```bash
pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 TTT_FREEZE_BLOCKS=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
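Technique 7's packing step (6 bits per quantized value before the compressed wrapper) can be illustrated with a stdlib+NumPy sketch. The packing scheme here is an assumption for illustration, not the one in `train_gpt.py`, and `zlib` stands in for brotli so the sketch has no third-party dependency:

```python
import zlib
import numpy as np

def pack_int6(q):
    """Pack int6 values in [-32, 31] at 6 bits each (MSB-first bit stream)."""
    u = (q.astype(np.int16) + 32).astype(np.uint8)     # shift to [0, 63]
    bits = np.unpackbits(u[:, None], axis=1)[:, 2:]    # keep the low 6 bits
    flat = bits.reshape(-1)
    flat = np.concatenate([flat, np.zeros((-len(flat)) % 8, dtype=np.uint8)])
    return np.packbits(flat).tobytes()

def unpack_int6(buf, n):
    """Inverse of pack_int6 for n values."""
    bits = np.unpackbits(np.frombuffer(buf, dtype=np.uint8))[: n * 6].reshape(n, 6)
    return bits.dot(1 << np.arange(5, -1, -1)).astype(np.int16) - 32

q = np.array([-32, -1, 0, 5, 31], dtype=np.int16)
packed = pack_int6(q)                # 6 bits/value instead of 8
wrapped = zlib.compress(packed, 9)   # brotli plays this role in the real artifact
assert (unpack_int6(packed, len(q)) == q).all()
```

Packing alone gives the 8-to-6 bit saving; the entropy coder on top (brotli in the artifact) then exploits whatever structure survives quantization, which is how the serialized model fits the ~16 MB budget.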

## Credits

PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee, PR #461 @Christopher-Lee-McClendon
{
"author": "aryanbhosale",
"github_id": "aryanbhosale",
"name": "SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 + Legal TTT",
"date": "2026-04-04",
"track": "10min_16mb",
"val_bpb": 1.08963553,
"val_bpb_std": 0.00083386,
"seeds": [42, 314, 999],
"seed_results": {
"42": {"val_bpb": 1.08894061, "artifact_bytes": 15999165},
"314": {"val_bpb": 1.09056017, "artifact_bytes": 15974112},
"999": {"val_bpb": 1.08940582, "artifact_bytes": 15996001}
},
"comparison_baseline_pr": 1019,
"delta_vs_pr1019_bpb": -0.02509956,
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"technique_summary": "SP4096 + MLP 4x + WD 0.090 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 + Legal Score-First TTT + Full GPTQ int6 + Brotli"
}

W0404 05:42:56.497000 3603 torch/distributed/run.py:803]
W0404 05:42:56.497000 3603 torch/distributed/run.py:803] *****************************************
W0404 05:42:56.497000 3603 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0404 05:42:56.497000 3603 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp4096
distributed: True
ema_decay: 0.997
embed_lr: 0.6
embed_wd: 0.09
embedding_dim: 512
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_enabled: True
gptq_reserve_seconds: 10.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/654954dc-8b9e-4f2b-9eed-e6ad4f070f78.txt
logit_softcap: 30.0
matrix_lr: 0.02
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_wd: 0.09
num_heads: 8
num_kv_heads: 4
num_layers: 11
parallel_start_layer: 7
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
recur_layers: 4,5
recur_start_step: 3000
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 654954dc-8b9e-4f2b-9eed-e6ad4f070f78
scalar_lr: 0.02
seed: 314
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_4096_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_batch_seqs: 32
ttt_chunk_tokens: 32768
ttt_enabled: True
ttt_epochs: 3
ttt_freeze_blocks: 0
ttt_grad_clip: 1.0
ttt_lr: 0.002
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
val_loss_every: 4000
ve_dim: 128
ve_enabled: True
ve_layers: 9,10
vocab_size: 4096
warmdown_frac: 0.667
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 45508608
model_params:34401372
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 8.3172 val_bpb: 3.6146
1/20000 train_loss: 8.3192 train_time: 0.0m tok/s: 8357924
2/20000 train_loss: 12.1995 train_time: 0.0m tok/s: 8306297
3/20000 train_loss: 10.6851 train_time: 0.0m tok/s: 8223796
4/20000 train_loss: 8.8318 train_time: 0.0m tok/s: 8190815
5/20000 train_loss: 7.6630 train_time: 0.0m tok/s: 8172777
500/20000 train_loss: 2.9038 train_time: 0.8m tok/s: 7970694
1000/20000 train_loss: 2.8864 train_time: 1.6m tok/s: 7946427
1500/20000 train_loss: 2.9106 train_time: 2.5m tok/s: 7937314
2000/20000 train_loss: 2.6565 train_time: 3.3m tok/s: 7932891
2500/20000 train_loss: 2.7096 train_time: 4.1m tok/s: 7929903
3000/20000 train_loss: 2.7587 train_time: 5.0m tok/s: 7928547
recurrence:activated at step 3000, virtual_layers=[0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
3500/20000 train_loss: 2.6758 train_time: 6.4m tok/s: 7140710
4000/20000 train_loss: 2.6058 train_time: 7.4m tok/s: 7101011
4000/20000 val_loss: 2.6319 val_bpb: 1.1438
4500/20000 train_loss: 2.5588 train_time: 8.3m tok/s: 7072297
5000/20000 train_loss: 2.5020 train_time: 9.3m tok/s: 7047648
5279/20000 val_loss: 2.5285 val_bpb: 1.0989
stopping_early: wallclock_cap train_time: 590047ms step: 5279/20000
peak memory allocated: 30164 MiB reserved: 30190 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.52622647 val_bpb:1.09786892 eval_time:2010ms
Serialized model: 132406149 bytes
Code size: 24671 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 9.8s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
selective_prune: unpruned=16.02MB target=16.0MB
selective_prune: pruning 165352/9349306 lowest-error ±1 values (excess=20669B)
Serialized model int6+brotli: 15949441 bytes
Total submission size int6+brotli: 15974112 bytes
final_int6_roundtrip val_loss:2.55440867 val_bpb:1.11011658 eval_time:23964ms
final_int6_sliding_window val_loss:2.51160228 val_bpb:1.09151342 eval_time:100413ms
ttt_sliding:start chunks=1389 chunk_tokens=32768 total_windows=711072 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=0
ttt_sliding:params unfrozen=34401372 frozen=0
ttt_chunk [1/1389] bpb=1.127348 time=30.3s
ttt_chunk [11/1389] bpb=1.114810 time=35.4s
ttt_chunk [21/1389] bpb=1.079391 time=37.5s
ttt_chunk [31/1389] bpb=1.086754 time=39.5s
ttt_chunk [41/1389] bpb=1.095376 time=41.6s
ttt_chunk [51/1389] bpb=1.088475 time=43.7s
ttt_chunk [61/1389] bpb=1.084493 time=45.7s
ttt_chunk [71/1389] bpb=1.080428 time=47.8s
ttt_chunk [81/1389] bpb=1.089752 time=49.8s
ttt_chunk [91/1389] bpb=1.092473 time=51.9s
ttt_chunk [101/1389] bpb=1.090798 time=54.0s
ttt_chunk [111/1389] bpb=1.086302 time=56.0s
ttt_chunk [121/1389] bpb=1.090320 time=58.1s
ttt_chunk [131/1389] bpb=1.092314 time=60.1s
ttt_chunk [141/1389] bpb=1.095246 time=62.2s
ttt_chunk [151/1389] bpb=1.097732 time=64.3s
ttt_chunk [161/1389] bpb=1.097837 time=66.3s
ttt_chunk [171/1389] bpb=1.099782 time=68.4s
ttt_chunk [181/1389] bpb=1.099604 time=70.4s
ttt_chunk [191/1389] bpb=1.101394 time=72.5s
ttt_chunk [201/1389] bpb=1.099976 time=74.5s
ttt_chunk [211/1389] bpb=1.102296 time=76.6s
ttt_chunk [221/1389] bpb=1.104211 time=78.7s
ttt_chunk [231/1389] bpb=1.103940 time=80.7s
ttt_chunk [241/1389] bpb=1.104012 time=82.8s
ttt_chunk [251/1389] bpb=1.104307 time=84.9s
ttt_chunk [261/1389] bpb=1.105941 time=87.0s
ttt_chunk [271/1389] bpb=1.107934 time=89.0s
ttt_chunk [281/1389] bpb=1.108214 time=91.1s
ttt_chunk [291/1389] bpb=1.109896 time=93.2s
ttt_chunk [301/1389] bpb=1.108324 time=95.2s
ttt_chunk [311/1389] bpb=1.107919 time=97.3s
ttt_chunk [321/1389] bpb=1.108681 time=99.3s
ttt_chunk [331/1389] bpb=1.108468 time=101.4s
ttt_chunk [341/1389] bpb=1.108424 time=103.5s
ttt_chunk [351/1389] bpb=1.105785 time=105.5s
ttt_chunk [361/1389] bpb=1.106567 time=107.6s
ttt_chunk [371/1389] bpb=1.109563 time=109.6s
ttt_chunk [381/1389] bpb=1.106417 time=111.7s
ttt_chunk [391/1389] bpb=1.108109 time=113.7s
ttt_chunk [401/1389] bpb=1.108004 time=115.7s
ttt_chunk [411/1389] bpb=1.106038 time=117.8s
ttt_chunk [421/1389] bpb=1.103497 time=119.8s
ttt_chunk [431/1389] bpb=1.102606 time=121.9s
ttt_chunk [441/1389] bpb=1.102391 time=124.0s
ttt_chunk [451/1389] bpb=1.102026 time=126.0s
ttt_chunk [461/1389] bpb=1.100384 time=128.1s
ttt_chunk [471/1389] bpb=1.099923 time=130.1s
ttt_chunk [481/1389] bpb=1.099960 time=132.2s
ttt_chunk [491/1389] bpb=1.099663 time=134.2s
ttt_chunk [501/1389] bpb=1.099621 time=136.3s
ttt_chunk [511/1389] bpb=1.099851 time=138.3s
ttt_chunk [521/1389] bpb=1.099341 time=140.4s
ttt_chunk [531/1389] bpb=1.098565 time=142.5s
ttt_chunk [541/1389] bpb=1.098514 time=144.5s
ttt_chunk [551/1389] bpb=1.099201 time=146.6s
ttt_chunk [561/1389] bpb=1.099477 time=148.6s
ttt_chunk [571/1389] bpb=1.098962 time=150.7s
ttt_chunk [581/1389] bpb=1.099125 time=152.8s
ttt_chunk [591/1389] bpb=1.098695 time=154.8s
ttt_chunk [601/1389] bpb=1.098644 time=156.9s
ttt_chunk [611/1389] bpb=1.098761 time=159.0s
ttt_chunk [621/1389] bpb=1.098345 time=161.0s
ttt_chunk [631/1389] bpb=1.098143 time=163.1s
ttt_chunk [641/1389] bpb=1.098242 time=165.2s
ttt_chunk [651/1389] bpb=1.098413 time=167.2s
ttt_chunk [661/1389] bpb=1.098246 time=169.2s
ttt_chunk [671/1389] bpb=1.097271 time=171.3s
ttt_chunk [681/1389] bpb=1.096904 time=174.3s
ttt_chunk [691/1389] bpb=1.096885 time=176.4s
ttt_chunk [701/1389] bpb=1.097126 time=178.4s
ttt_chunk [711/1389] bpb=1.097580 time=180.5s
ttt_chunk [721/1389] bpb=1.097457 time=182.5s
ttt_chunk [731/1389] bpb=1.097344 time=184.6s
ttt_chunk [741/1389] bpb=1.097959 time=186.6s
ttt_chunk [751/1389] bpb=1.097916 time=188.7s
ttt_chunk [761/1389] bpb=1.098468 time=190.8s
ttt_chunk [771/1389] bpb=1.098491 time=192.8s
ttt_chunk [781/1389] bpb=1.098290 time=194.9s
ttt_chunk [791/1389] bpb=1.098107 time=196.9s
ttt_chunk [801/1389] bpb=1.097604 time=198.9s
ttt_chunk [811/1389] bpb=1.097748 time=201.0s
ttt_chunk [821/1389] bpb=1.098127 time=203.0s
ttt_chunk [831/1389] bpb=1.097955 time=205.1s
ttt_chunk [841/1389] bpb=1.096952 time=207.2s
ttt_chunk [851/1389] bpb=1.097484 time=209.2s
ttt_chunk [861/1389] bpb=1.097428 time=211.3s
ttt_chunk [871/1389] bpb=1.097301 time=213.3s
ttt_chunk [881/1389] bpb=1.097690 time=215.4s
ttt_chunk [891/1389] bpb=1.097021 time=217.4s
ttt_chunk [901/1389] bpb=1.096574 time=219.5s
ttt_chunk [911/1389] bpb=1.096044 time=221.5s
ttt_chunk [921/1389] bpb=1.095367 time=223.6s
ttt_chunk [931/1389] bpb=1.094654 time=225.6s
ttt_chunk [941/1389] bpb=1.094300 time=227.7s
ttt_chunk [951/1389] bpb=1.093788 time=229.7s
ttt_chunk [961/1389] bpb=1.093257 time=231.7s
ttt_chunk [971/1389] bpb=1.093169 time=233.8s
ttt_chunk [981/1389] bpb=1.092389 time=235.8s
ttt_chunk [991/1389] bpb=1.092472 time=237.8s
ttt_chunk [1001/1389] bpb=1.092522 time=239.9s
ttt_chunk [1011/1389] bpb=1.092527 time=241.9s
ttt_chunk [1021/1389] bpb=1.092118 time=243.9s
ttt_chunk [1031/1389] bpb=1.091741 time=246.0s
ttt_chunk [1041/1389] bpb=1.091469 time=248.0s
ttt_chunk [1051/1389] bpb=1.091907 time=250.1s
ttt_chunk [1061/1389] bpb=1.092532 time=252.2s
ttt_chunk [1071/1389] bpb=1.092534 time=254.2s
ttt_chunk [1081/1389] bpb=1.093267 time=256.3s
ttt_chunk [1091/1389] bpb=1.093350 time=258.3s
ttt_chunk [1101/1389] bpb=1.093063 time=260.4s
ttt_chunk [1111/1389] bpb=1.092545 time=262.4s
ttt_chunk [1121/1389] bpb=1.092939 time=264.5s
ttt_chunk [1131/1389] bpb=1.093828 time=266.5s
ttt_chunk [1141/1389] bpb=1.094179 time=268.6s
ttt_chunk [1151/1389] bpb=1.093981 time=270.6s
ttt_chunk [1161/1389] bpb=1.094321 time=272.6s
ttt_chunk [1171/1389] bpb=1.094494 time=274.7s
ttt_chunk [1181/1389] bpb=1.095050 time=276.7s
ttt_chunk [1191/1389] bpb=1.094972 time=278.8s
ttt_chunk [1201/1389] bpb=1.095399 time=280.8s
ttt_chunk [1211/1389] bpb=1.095523 time=282.8s
ttt_chunk [1221/1389] bpb=1.095434 time=284.9s
ttt_chunk [1231/1389] bpb=1.095637 time=286.9s
ttt_chunk [1241/1389] bpb=1.095774 time=288.9s
ttt_chunk [1251/1389] bpb=1.095904 time=291.0s
ttt_chunk [1261/1389] bpb=1.095304 time=293.0s
ttt_chunk [1271/1389] bpb=1.095093 time=295.1s
ttt_chunk [1281/1389] bpb=1.094801 time=297.1s
ttt_chunk [1291/1389] bpb=1.094560 time=299.1s
ttt_chunk [1301/1389] bpb=1.094508 time=301.2s
ttt_chunk [1311/1389] bpb=1.094365 time=303.2s
ttt_chunk [1321/1389] bpb=1.094307 time=305.3s
ttt_chunk [1331/1389] bpb=1.093594 time=307.3s
ttt_chunk [1341/1389] bpb=1.093242 time=309.3s
ttt_chunk [1351/1389] bpb=1.092585 time=311.4s
ttt_chunk [1361/1389] bpb=1.092305 time=313.4s
ttt_chunk [1371/1389] bpb=1.092083 time=315.5s
ttt_chunk [1381/1389] bpb=1.091971 time=317.5s
ttt_chunk [1389/1389] bpb=1.092046 time=333.9s
ttt_sliding:done val_loss=2.509382 val_bpb=1.090560 elapsed=335.0s
final_int6_ttt val_loss:2.50938234 val_bpb:1.09056017 eval_time:335375ms