# Record: SP8192 + Banking + Triple Recurrence + Parallel Residuals + Muon 0.97 + TTT — val_bpb 1.0790 (5-seed mean)

**val_bpb = 1.0790** (5-seed mean, std 0.0003) | **~15.99 MB** | 8xH100 SXM

## 5-Seed Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | **TTT BPB** | val_loss (nats) | Artifact (bytes) |
|------|-------------|-----------------|------------------|
| 42 | **1.0788** | 2.7866 | 15,988,830 |
| 314 | **1.0789** | 2.7868 | 15,983,617 |
| 1337 | **1.0788** | 2.7867 | 15,985,310 |
| 7 | **1.0793** | 2.7880 | 15,986,416 |
| 999 | **1.0795** | 2.7884 | 15,986,416 |
| **Mean** | **1.0790** | **2.7873** | |

Merged SOTA (PR #1493): **1.0810 BPB / 2.7920 nats**. Delta vs. merged SOTA (5-seed mean): **-0.0020 BPB / -0.0047 nats**.
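As a sanity check on the two metric columns: bits-per-byte and nats-per-token relate through the validation set's tokens-to-bytes ratio. The ratio below is inferred from the reported 5-seed means, not read from the harness:

```python
import math

# val_bpb = val_loss_nats / ln(2) * (tokens / bytes)
# Inverting this with the 5-seed means above gives the implied
# tokens-per-byte ratio of the validation set.
val_loss_nats = 2.7873
val_bpb = 1.0790
tokens_per_byte = val_bpb * math.log(2) / val_loss_nats
print(f"{tokens_per_byte:.4f}")  # ~0.2683 tokens/byte, i.e. ~3.7 bytes/token
```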

## Stack

PR #1523 base (@abaybektursun) with hash embedding disabled and Triton fused MLP removed (standard MLP used instead). Key components:

1. **SP8192** vocab with GPTQ embeddings and SDClip quantization
2. **Parameter Banking** — batched Newton-Schulz optimizer step
3. **Triple Depth Recurrence** (L3-5, 17 virtual layers from 11 physical)
4. **Parallel Residuals** (L7+, GPT-J style)
5. **Muon 0.97** momentum (from PR #1514 @dexhunter)
6. **QK-Gain 5.25**
7. **Score-First TTT** (3 epochs, SGD lr=0.005, PR #461 framework)
8. **EMA 0.9965, WD 0.095, warmdown 0.72**
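The depth recurrence can be read off the layer schedule the training log prints (`encoder:[0, 1, 2, 3, 4, 5, 3, 4]`, `decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]`): physical blocks 3-5 are each visited three times with shared weights, so 11 physical layers unroll into 17 virtual ones. A minimal sketch, with block internals elided:

```python
# Layer-index schedules as printed in the training log; blocks 3-5 each
# appear three times in total (weight-shared "triple" depth recurrence).
ENCODER = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER = [5, 3, 4, 5, 6, 7, 8, 9, 10]

def forward(blocks, x):
    # 17 virtual layers drawn from only 11 physical blocks
    for i in ENCODER + DECODER:
        x = blocks[i](x)
    return x

assert len(ENCODER + DECODER) == 17      # virtual depth
assert len(set(ENCODER + DECODER)) == 11  # physical depth
```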

## Compliance (Track B — Score-First TTT)

- Score-first TTT: each chunk scored under `torch.no_grad()` BEFORE SGD weight update
- No SLOT, no hash embedding, no pre-quant TTT, no n-gram cache, no ETLB
- All four conditions from Issue #1017 satisfied
- All artifacts < 16MB, train < 600s, eval < 600s
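The score-first ordering can be illustrated with a minimal sketch (this is not the harness code; `model` is assumed to map a chunk to a mean per-token loss, and `chunks` is a placeholder iterable):

```python
import torch

def score_first_ttt(model, chunks, lr=0.005, momentum=0.9, epochs=3, clip=1.0):
    """Score each chunk under frozen weights BEFORE adapting on it."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        with torch.no_grad():            # score first: no leakage from this chunk
            total_loss += model(chunk).item() * chunk.numel()
            total_tokens += chunk.numel()
        for _ in range(epochs):          # only then adapt: SGD, 3 epochs per chunk
            opt.zero_grad()
            model(chunk).backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            opt.step()
    return total_loss / total_tokens     # reported loss uses pre-update scores
```

Each chunk contributes to the reported loss only with the weights as they stood before any update on that chunk, which is the Track B requirement.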

## Reproduction

```bash
pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192 --skip-manifest
SEED=42 TTT_ENABLED=1 torchrun --standalone --nproc_per_node=8 train_gpt.py
```
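To regenerate the full 5-seed table, the same command can be looped over the seeds (assuming the harness reads `SEED` from the environment as in the single-seed command above):

```bash
for SEED in 42 314 1337 7 999; do
  SEED=$SEED TTT_ENABLED=1 torchrun --standalone --nproc_per_node=8 train_gpt.py
done
```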

## Credits

PR #1523 @abaybektursun (base: banking + triple recurrence + parallel residuals), PR #1394 @clarkkev (SP8192 + SDClip), PR #1514 @dexhunter (Muon 0.97), PR #1493 @bigbag (merged #1 hyperparameters), PR #1204 @msisovic (parallel residuals concept)

## Submission Metadata

```json
{"author":"aryanbhosale","github_id":"aryanbhosale","name":"SP8192 + Banking + Triple Recurrence + Parallel Residuals + Muon 0.97 + Score-First TTT","date":"2026-04-11","track":"10min_16mb","val_bpb":1.07904309,"val_bpb_std":0.00031891,"seeds":[42,314,999,1337,7],"seed_results":{"42":{"val_bpb":1.07876776,"val_loss":2.78656915,"artifact_bytes":15988830},"314":{"val_bpb":1.07887587,"val_loss":2.78684843,"artifact_bytes":15983617},"999":{"val_bpb":1.07948565,"val_loss":2.78842354,"artifact_bytes":15986416},"1337":{"val_bpb":1.07880325,"val_loss":2.78666083,"artifact_bytes":15985310},"7":{"val_bpb":1.07931921,"val_loss":2.78799360,"artifact_bytes":15986416}},"hardware":"8xH100 80GB SXM","pytorch_version":"2.9.1+cu128","technique_summary":"SP8192 + Parameter Banking + Triple Recurrence (L3-5) + Parallel Residuals (L7+) + Muon 0.97 + QK-Gain 5.25 + Score-First TTT + SDClip + Brotli"}
```

## Training Log (seed 1337)

W0410 17:28:23.818000 115879 torch/distributed/run.py:803]
W0410 17:28:23.818000 115879 torch/distributed/run.py:803] *****************************************
W0410 17:28:23.818000 115879 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0410 17:28:23.818000 115879 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.997
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.095
embedding_dim: 512
enable_looping_at: 0.35
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/8d3c73aa-de0b-4016-b727-bf25427820f6.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 8d3c73aa-de0b-4016-b727-bf25427820f6
scalar_lr: 0.02
seed: 1337
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_adamw_wd: 0.0
ttt_batch_seqs: 32
ttt_chunk_tokens: 32768
ttt_enabled: True
ttt_epochs: 3
ttt_freeze_blocks: 0
ttt_grad_clip: 1.0
ttt_lr: 0.005
ttt_momentum: 0.9
ttt_optimizer: sgd
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.667
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 40540160
model_params:35944537
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0095 val_bpb: 3.4878
1/20000 train_loss: 9.0103 train_time: 0.0m tok/s: 17481801
2/20000 train_loss: 12.2696 train_time: 0.0m tok/s: 12922068
3/20000 train_loss: 10.9255 train_time: 0.0m tok/s: 10700367
4/20000 train_loss: 9.3870 train_time: 0.0m tok/s: 9824180
5/20000 train_loss: 8.2725 train_time: 0.0m tok/s: 9340287
500/20000 train_loss: 3.3838 train_time: 0.8m tok/s: 7793593
1000/20000 train_loss: 3.2862 train_time: 1.7m tok/s: 7785651
1500/20000 train_loss: 3.1876 train_time: 2.5m tok/s: 7790707
2000/20000 train_loss: 3.0806 train_time: 3.4m tok/s: 7794319
layer_loop:enabled step:2040 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
2500/20000 train_loss: 3.1292 train_time: 4.6m tok/s: 7136033
3000/20000 train_loss: 2.8997 train_time: 5.8m tok/s: 6746291
3500/20000 train_loss: 2.9436 train_time: 7.1m tok/s: 6493105
4000/20000 train_loss: 2.8239 train_time: 8.3m tok/s: 6315839
4000/20000 val_loss: 2.8788 val_bpb: 1.1145
4500/20000 train_loss: 2.8368 train_time: 9.6m tok/s: 6173232
4600/20000 val_loss: 2.8075 val_bpb: 1.0869
stopping_early: wallclock_cap train_time: 588175ms step: 4600/20000
peak memory allocated: 39948 MiB reserved: 40026 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.80536149 val_bpb:1.08604285 eval_time:6102ms
Serialized model: 135408623 bytes
Code size: 19760 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 12.5s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, lane_merge, skip_gates, skip_weights
Serialized model quantized+brotli: 15965550 bytes
Total submission size quantized+brotli: 15985310 bytes
quantized val_loss:2.83913142 val_bpb:1.09911625 eval_time:8742ms
quantized_sliding_window val_loss:2.79371733 val_bpb:1.08153504 eval_time:91988ms
ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0 optimizer=sgd
ttt_sliding:params unfrozen=35944537 frozen=0
ttt_chunk [1/1238] bpb=1.115663 time=4.5s
ttt_chunk [11/1238] bpb=1.068298 time=8.9s
ttt_chunk [21/1238] bpb=1.105284 time=11.5s
ttt_chunk [31/1238] bpb=1.098863 time=14.2s
ttt_chunk [41/1238] bpb=1.092158 time=16.8s
ttt_chunk [51/1238] bpb=1.085537 time=19.3s
ttt_chunk [61/1238] bpb=1.077554 time=21.9s
ttt_chunk [71/1238] bpb=1.084923 time=24.5s
ttt_chunk [81/1238] bpb=1.078172 time=27.1s
ttt_chunk [91/1238] bpb=1.074957 time=29.7s
ttt_chunk [101/1238] bpb=1.074802 time=32.2s
ttt_chunk [111/1238] bpb=1.073105 time=34.8s
ttt_chunk [121/1238] bpb=1.076285 time=37.4s
ttt_chunk [131/1238] bpb=1.080146 time=40.0s
ttt_chunk [141/1238] bpb=1.080745 time=42.6s
ttt_chunk [151/1238] bpb=1.080495 time=46.3s
ttt_chunk [161/1238] bpb=1.081092 time=48.9s
ttt_chunk [171/1238] bpb=1.081037 time=51.6s
ttt_chunk [181/1238] bpb=1.079594 time=54.3s
ttt_chunk [191/1238] bpb=1.079371 time=56.9s
ttt_chunk [201/1238] bpb=1.077008 time=59.6s
ttt_chunk [211/1238] bpb=1.081377 time=62.3s
ttt_chunk [221/1238] bpb=1.081821 time=65.0s
ttt_chunk [231/1238] bpb=1.083430 time=67.6s
ttt_chunk [241/1238] bpb=1.081679 time=70.3s
ttt_chunk [251/1238] bpb=1.081678 time=73.0s
ttt_chunk [261/1238] bpb=1.082736 time=75.6s
ttt_chunk [271/1238] bpb=1.083118 time=78.3s
ttt_chunk [281/1238] bpb=1.082392 time=80.9s
ttt_chunk [291/1238] bpb=1.083544 time=83.6s
ttt_chunk [301/1238] bpb=1.083748 time=86.2s
ttt_chunk [311/1238] bpb=1.082694 time=88.9s
ttt_chunk [321/1238] bpb=1.082521 time=91.5s
ttt_chunk [331/1238] bpb=1.082746 time=94.2s
ttt_chunk [341/1238] bpb=1.081815 time=96.8s
ttt_chunk [351/1238] bpb=1.082566 time=99.5s
ttt_chunk [361/1238] bpb=1.081499 time=102.2s
ttt_chunk [371/1238] bpb=1.079964 time=104.9s
ttt_chunk [381/1238] bpb=1.080336 time=107.6s
ttt_chunk [391/1238] bpb=1.080019 time=110.2s
ttt_chunk [401/1238] bpb=1.080082 time=112.9s
ttt_chunk [411/1238] bpb=1.080628 time=115.6s
ttt_chunk [421/1238] bpb=1.080115 time=118.2s
ttt_chunk [431/1238] bpb=1.080277 time=120.9s
ttt_chunk [441/1238] bpb=1.080322 time=123.6s
ttt_chunk [451/1238] bpb=1.081518 time=126.3s
ttt_chunk [461/1238] bpb=1.079774 time=129.0s
ttt_chunk [471/1238] bpb=1.079763 time=131.7s
ttt_chunk [481/1238] bpb=1.079955 time=134.3s
ttt_chunk [491/1238] bpb=1.080431 time=137.0s
ttt_chunk [501/1238] bpb=1.080066 time=139.7s
ttt_chunk [511/1238] bpb=1.079709 time=142.3s
ttt_chunk [521/1238] bpb=1.079228 time=145.0s
ttt_chunk [531/1238] bpb=1.079181 time=147.7s
ttt_chunk [541/1238] bpb=1.079268 time=150.4s
ttt_chunk [551/1238] bpb=1.078811 time=153.1s
ttt_chunk [561/1238] bpb=1.078139 time=155.7s
ttt_chunk [571/1238] bpb=1.077603 time=158.4s
ttt_chunk [581/1238] bpb=1.077936 time=161.0s
ttt_chunk [591/1238] bpb=1.078147 time=163.7s
ttt_chunk [601/1238] bpb=1.078100 time=166.4s
ttt_chunk [611/1238] bpb=1.078704 time=169.1s
ttt_chunk [621/1238] bpb=1.079519 time=171.8s
ttt_chunk [631/1238] bpb=1.079612 time=174.4s
ttt_chunk [641/1238] bpb=1.080091 time=177.1s
ttt_chunk [651/1238] bpb=1.080410 time=179.7s
ttt_chunk [661/1238] bpb=1.079761 time=182.4s
ttt_chunk [671/1238] bpb=1.079528 time=185.1s
ttt_chunk [681/1238] bpb=1.080825 time=187.7s
ttt_chunk [691/1238] bpb=1.081040 time=190.3s
ttt_chunk [701/1238] bpb=1.080868 time=193.0s
ttt_chunk [711/1238] bpb=1.081575 time=195.7s
ttt_chunk [721/1238] bpb=1.081898 time=198.3s
ttt_chunk [731/1238] bpb=1.081249 time=201.0s
ttt_chunk [741/1238] bpb=1.080933 time=203.6s
ttt_chunk [751/1238] bpb=1.080023 time=206.3s
ttt_chunk [761/1238] bpb=1.079440 time=208.9s
ttt_chunk [771/1238] bpb=1.078425 time=211.6s
ttt_chunk [781/1238] bpb=1.078413 time=214.3s
ttt_chunk [791/1238] bpb=1.078739 time=216.9s
ttt_chunk [801/1238] bpb=1.079031 time=219.6s
ttt_chunk [811/1238] bpb=1.078520 time=222.3s
ttt_chunk [821/1238] bpb=1.077334 time=224.9s
ttt_chunk [831/1238] bpb=1.077004 time=227.6s
ttt_chunk [841/1238] bpb=1.076534 time=230.2s
ttt_chunk [851/1238] bpb=1.076257 time=232.9s
ttt_chunk [861/1238] bpb=1.075927 time=235.6s
ttt_chunk [871/1238] bpb=1.075805 time=238.2s
ttt_chunk [881/1238] bpb=1.075334 time=240.9s
ttt_chunk [891/1238] bpb=1.074814 time=243.5s
ttt_chunk [901/1238] bpb=1.075202 time=246.2s
ttt_chunk [911/1238] bpb=1.074872 time=248.8s
ttt_chunk [921/1238] bpb=1.075146 time=251.5s
ttt_chunk [931/1238] bpb=1.075814 time=254.1s
ttt_chunk [941/1238] bpb=1.076196 time=256.8s
ttt_chunk [951/1238] bpb=1.076099 time=259.5s
ttt_chunk [961/1238] bpb=1.076936 time=262.1s
ttt_chunk [971/1238] bpb=1.077335 time=264.8s
ttt_chunk [981/1238] bpb=1.077703 time=267.4s
ttt_chunk [991/1238] bpb=1.077494 time=270.1s
ttt_chunk [1001/1238] bpb=1.077528 time=272.8s
ttt_chunk [1011/1238] bpb=1.077874 time=275.4s
ttt_chunk [1021/1238] bpb=1.078589 time=278.1s
ttt_chunk [1031/1238] bpb=1.079046 time=280.7s
ttt_chunk [1041/1238] bpb=1.079513 time=283.4s
ttt_chunk [1051/1238] bpb=1.079439 time=286.1s
ttt_chunk [1061/1238] bpb=1.079454 time=288.7s
ttt_chunk [1071/1238] bpb=1.079607 time=291.3s
ttt_chunk [1081/1238] bpb=1.079504 time=294.0s
ttt_chunk [1091/1238] bpb=1.079706 time=296.6s
ttt_chunk [1101/1238] bpb=1.080237 time=299.3s
ttt_chunk [1111/1238] bpb=1.080528 time=302.0s
ttt_chunk [1121/1238] bpb=1.080703 time=304.6s
ttt_chunk [1131/1238] bpb=1.080373 time=307.2s
ttt_chunk [1141/1238] bpb=1.080022 time=309.9s
ttt_chunk [1151/1238] bpb=1.080075 time=312.5s
ttt_chunk [1161/1238] bpb=1.080210 time=315.2s
ttt_chunk [1171/1238] bpb=1.079987 time=317.8s
ttt_chunk [1181/1238] bpb=1.079519 time=320.5s
ttt_chunk [1191/1238] bpb=1.079634 time=323.1s
ttt_chunk [1201/1238] bpb=1.079662 time=325.8s
ttt_chunk [1211/1238] bpb=1.079337 time=328.4s
ttt_chunk [1221/1238] bpb=1.078874 time=331.0s
ttt_chunk [1231/1238] bpb=1.078516 time=333.7s
ttt_chunk [1238/1238] bpb=1.078533 time=337.6s
ttt_sliding:done val_loss=2.786661 val_bpb=1.07880325 elapsed=337.7s
legal_ttt_exact val_loss:2.78666083 val_bpb:1.07880325 eval_time:337888ms