@@ -0,0 +1,39 @@
# Record: SP8192 + Pre-Quant TTT (QK 5.25, 8ep, freeze-1) — val_bpb 1.0787 (3-seed mean)

- **3-seed mean sliding val_bpb:** `1.07873723` (std `0.00049363`)
- **3-seed mean roundtrip val_bpb:** `1.09258717` (std `0.00054369`)

Hardware: `8xH100 SXM` | Train cap target: `595s` | Eval: sliding window `stride=64`
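
Sliding-window eval scores every token exactly once while giving each (after the first window) a long left context: windows advance by `stride` tokens and only the trailing `stride` positions of each window contribute to the loss. A minimal sketch of the span bookkeeping, assuming a window of 2048 (matching `train_seq_len` in the logs; the actual eval window size is not stated here, and the helper name is ours):

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return (ctx_start, ctx_end, score_start) spans.

    The first window scores all of its tokens; every later window
    re-reads up to `window - stride` tokens of context and scores
    only the final `stride` tokens, so each token is scored once.
    """
    spans = []
    pos = 0
    while pos < n_tokens:
        if pos == 0:
            end = min(window, n_tokens)
            spans.append((0, end, 0))          # first window: score everything
        else:
            end = min(pos + stride, n_tokens)
            spans.append((max(0, end - window), end, pos))
        pos = end
    return spans
```

This is why the sliding eval is so much slower than the roundtrip eval (`73997ms` vs `6825ms` in the log): each 64-token step re-processes nearly a full window of context.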

## What changed

This package uses the same fixed-predictor lane as the prior SP8192 pre-quant TTT stack, with tuned settings from our April 8 sweep:

- `QK_GAIN_INIT=5.25`
- `TTT_ENABLED=1` with `TTT_EPOCHS=8`, `TTT_LR=0.00045`, `TTT_FREEZE_BLOCKS=1`
- same SP8192 + recurrence + GPTQ pipeline

No tokenizer/dataset modifications, no eval-time adaptation, no SLOT/ngram overlays.
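
For reference, the tuned knobs expressed as environment overrides, with runA's settings (from `runs.csv`) for contrast. The variable names are the ones quoted above; how the training script consumes them is an assumption of this sketch:

```python
import os

# runA vs runB settings from the April 8 sweep (see runs.csv).
RUN_A = {"QK_GAIN_INIT": "5.0", "TTT_ENABLED": "1", "TTT_EPOCHS": "6",
         "TTT_LR": "0.0005", "TTT_FREEZE_BLOCKS": "2"}
RUN_B = {"QK_GAIN_INIT": "5.25", "TTT_ENABLED": "1", "TTT_EPOCHS": "8",
         "TTT_LR": "0.00045", "TTT_FREEZE_BLOCKS": "1"}

# Which knobs actually moved between the two runs.
changed = {k: (RUN_A[k], RUN_B[k]) for k in RUN_B if RUN_A[k] != RUN_B[k]}

# Apply the winning configuration as environment overrides.
os.environ.update(RUN_B)
```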

## Seed Results

| Seed | sliding val_bpb | roundtrip val_bpb | train_s | eval_s | bytes_total |
|------|----------------:|------------------:|--------:|-------:|------------:|
| 42 | 1.07913183 | 1.09299539 | 595.162 | 74.678 | 15171524 |
| 1337 | 1.07804121 | 1.09181877 | 595.086 | 74.663 | 15163267 |
| 2025 | 1.07903865 | 1.09294735 | 595.162 | 74.560 | 15188203 |
| **Mean** | **1.07873723** | **1.09258717** | - | - | - |
| **Std** | **0.00049363** | **0.00054369** | - | - | - |
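
The mean and std rows can be reproduced directly from the per-seed values; note that the reported std is the population standard deviation (`pstdev`, ddof=0), not the sample one:

```python
from statistics import fmean, pstdev

# Per-seed values from the table above (seeds 42, 1337, 2025).
sliding = [1.07913183, 1.07804121, 1.07903865]
roundtrip = [1.09299539, 1.09181877, 1.09294735]

# pstdev (population std) matches the reported std; stdev would not.
print(f"sliding   mean={fmean(sliding):.8f} std={pstdev(sliding):.8f}")
print(f"roundtrip mean={fmean(roundtrip):.8f} std={pstdev(roundtrip):.8f}")
```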

## Sweep Provenance

- best single-seed sweep run (`runB_seed1337`): `1.07765960`
- confirmation seeds in this package: `42, 1337, 2025`
- raw sweep table included in `runs.csv`
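
The sweep winner can be re-derived from the raw table. A sketch over the same rows (later columns omitted here for width; in practice you would read `runs.csv` from disk):

```python
import csv
import io

# Sweep rows copied from runs.csv, truncated after the val_bpb column.
RUNS = """\
run_id,seed,qk_gain,ttt_enabled,ttt_lr,ttt_epochs,ttt_freeze,val_bpb
runA_seed1337_qk5_ttt6_f2,1337,5.0,1,0.0005,6,2,1.08250260
runB_seed1337_qk525_ttt8_f1,1337,5.25,1,0.00045,8,1,1.07765960
runC_seed1337_qk5_nottt,1337,5.0,0,0.0005,0,11,1.10884414
runD_seed42_qk525_ttt8_f1,42,5.25,1,0.00045,8,1,1.07913183
runD_seed1337_qk525_ttt8_f1,1337,5.25,1,0.00045,8,1,1.07804121
runD_seed2025_qk525_ttt8_f1,2025,5.25,1,0.00045,8,1,1.07903865
"""

rows = list(csv.DictReader(io.StringIO(RUNS)))
best = min(rows, key=lambda r: float(r["val_bpb"]))
print(best["run_id"], best["val_bpb"])
```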

## Compliance Notes

- score computed from `final_int6_sliding_window_exact`
- roundtrip reported from `final_int6_roundtrip_exact`
- train capped by wallclock (`stopping_early` line in each log)
- artifact size from `Total submission size int6+brotli+byteshuffle`
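
The compliance values above come from specific log lines. A hedged sketch of pulling them out of a run log with regexes (line formats are taken from the log in this package; the helper name is ours):

```python
import re

# One pattern per compliance field, keyed to the exact log line formats.
PATTERNS = {
    "score": r"final_int6_sliding_window_exact .*val_bpb:([0-9.]+)",
    "roundtrip": r"final_int6_roundtrip_exact .*val_bpb:([0-9.]+)",
    "train_cap_ms": r"stopping_early: wallclock_cap train_time:(\d+)ms",
    "bytes_total": r"Total submission size int6\+brotli\+byteshuffle: (\d+) bytes",
}

def extract_metrics(log_text):
    """Return whichever compliance fields appear in log_text."""
    return {key: m.group(1)
            for key, pat in PATTERNS.items()
            if (m := re.search(pat, log_text))}
```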
@@ -0,0 +1,7 @@
run_id,seed,qk_gain,ttt_enabled,ttt_lr,ttt_epochs,ttt_freeze,val_bpb,roundtrip_bpb,train_s,eval_s,bytes_total,log_path
runA_seed1337_qk5_ttt6_f2,1337,5.0,1,0.0005,6,2,1.08250260,1.09687424,595.091,97.120,15148097,analysis/2026-04-08_record_push/runA_seed1337_qk5_ttt6_f2.log
runB_seed1337_qk525_ttt8_f1,1337,5.25,1,0.00045,8,1,1.07765960,1.09149096,595.140,73.997,15161417,analysis/2026-04-08_record_push/runB_seed1337_qk525_ttt8_f1.log
runC_seed1337_qk5_nottt,1337,5.0,0,0.0005,0,11,1.10884414,1.12541234,595.085,74.456,15151733,analysis/2026-04-08_record_push/runC_seed1337_qk5_nottt.log
runD_seed42_qk525_ttt8_f1,42,5.25,1,0.00045,8,1,1.07913183,1.09299539,NA,NA,15171524,analysis/2026-04-08_record_push/runD_seed42_qk525_ttt8_f1.log
runD_seed1337_qk525_ttt8_f1,1337,5.25,1,0.00045,8,1,1.07804121,1.09181877,NA,NA,15163267,analysis/2026-04-08_record_push/runD_seed1337_qk525_ttt8_f1.log
runD_seed2025_qk525_ttt8_f1,2025,5.25,1,0.00045,8,1,1.07903865,1.09294735,NA,NA,15188203,analysis/2026-04-08_record_push/runD_seed2025_qk525_ttt8_f1.log
@@ -0,0 +1,37 @@
{
"author": "Aamod Bhatt",
"github_id": "aamodbhatt",
"name": "SP8192 + Pre-Quant TTT (QK 5.25, 8ep, freeze-1) + GPTQ int6",
"date": "2026-04-09",
"track": "10min_16mb",
"hardware": "8xH100 80GB SXM",
"val_bpb": 1.07873723,
"val_bpb_std": 0.00049363,
"roundtrip_val_bpb": 1.09258717,
"roundtrip_val_bpb_std": 0.00054369,
"seeds": [
42,
1337,
2025
],
"seed_results": {
"42": {
"val_bpb": 1.07913183,
"roundtrip_val_bpb": 1.09299539,
"bytes_total": 15171524
},
"1337": {
"val_bpb": 1.07804121,
"roundtrip_val_bpb": 1.09181877,
"bytes_total": 15163267
},
"2025": {
"val_bpb": 1.07903865,
"roundtrip_val_bpb": 1.09294735,
"bytes_total": 15188203
}
},
"best_single_seed_sweep_val_bpb": 1.0776596,
"technique_summary": "SP8192 + Pre-Quant AdamW TTT + QK_GAIN_INIT=5.25 + TTT(8 epochs, freeze first block) + Full Hessian GPTQ int6 + sliding-window eval",
"notes": "No tokenizer/dataset edits; no eval-time adaptation; fixed predictor at evaluation."
}
@@ -0,0 +1,93 @@
W0408 16:17:26.158000 15891 torch/distributed/run.py:851]
W0408 16:17:26.158000 15891 torch/distributed/run.py:851] *****************************************
W0408 16:17:26.158000 15891 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0408 16:17:26.158000 15891 torch/distributed/run.py:851] *****************************************
logs/94fa9714-551c-45cf-acf8-0f6e7ab65754.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/workspace/parameter-golf/data/tokenizers/fineweb_8192_bpe.model
train_loader:dataset:fineweb10B_sp8192 train_shards:80
val_loader:shards pattern=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin tokens:40540160
model_params:35941976
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:595.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:9.0006 val_bpb:3.4844 train_time:0ms step_avg:0.03ms
step:1/20000 train_loss:9.0023 train_time:157ms step_avg:156.90ms
step:2/20000 train_loss:9.7770 train_time:190ms step_avg:94.91ms
step:3/20000 train_loss:10.0653 train_time:299ms step_avg:99.75ms
step:4/20000 train_loss:9.5484 train_time:410ms step_avg:102.54ms
step:5/20000 train_loss:8.9076 train_time:521ms step_avg:104.21ms
step:6/20000 train_loss:8.6493 train_time:632ms step_avg:105.28ms
step:7/20000 train_loss:8.3058 train_time:743ms step_avg:106.17ms
step:8/20000 train_loss:7.8565 train_time:854ms step_avg:106.70ms
step:9/20000 train_loss:7.8107 train_time:965ms step_avg:107.24ms
step:10/20000 train_loss:7.6008 train_time:1077ms step_avg:107.68ms
step:500/20000 train_loss:3.3487 train_time:56415ms step_avg:112.83ms
step:1000/20000 train_loss:3.0751 train_time:113132ms step_avg:113.13ms
step:1500/20000 train_loss:3.2878 train_time:169826ms step_avg:113.22ms
step:2000/20000 train_loss:3.1082 train_time:226351ms step_avg:113.18ms
step:2500/20000 train_loss:3.2439 train_time:282853ms step_avg:113.14ms
step:3000/20000 train_loss:2.9583 train_time:339352ms step_avg:113.12ms
step:3500/20000 train_loss:3.0975 train_time:395804ms step_avg:113.09ms
step:4000/20000 train_loss:3.0032 train_time:452215ms step_avg:113.05ms
step:4000/20000 val_loss:2.9840 val_bpb:1.1552 train_time:452295ms step_avg:113.07ms
step:4500/20000 train_loss:3.0079 train_time:508620ms step_avg:113.03ms
swa:start step:4600
late_qat:enabled step:4735 scale:0.1499
step:5000/20000 train_loss:2.8252 train_time:565718ms step_avg:113.14ms
step:5256/20000 val_loss:2.8603 val_bpb:1.1073 train_time:595140ms step_avg:113.23ms
stopping_early: wallclock_cap train_time:595140ms step:5256/20000
peak memory allocated: 28774 MiB reserved: 28832 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:2.8573 val_bpb:1.1061 eval_time:1832ms
ttt:start lr=0.00045 epochs=8 freeze_blocks=1 cosine_decay=True
ttt_adamw:params trainable=35939920 frozen=2056
ttt_adamw:epoch 1/8 loss:2.9018 time:15.2s
ttt_adamw:epoch 2/8 loss:2.8439 time:30.2s
ttt_adamw:epoch 3/8 loss:2.8229 time:45.2s
ttt_adamw:epoch 4/8 loss:2.8049 time:60.2s
ttt_adamw:epoch 5/8 loss:2.7892 time:75.2s
ttt_adamw:epoch 6/8 loss:2.7755 time:90.2s
ttt_adamw:epoch 7/8 loss:2.7646 time:105.2s
ttt_adamw:epoch 8/8 loss:2.7574 time:120.2s
ttt_adamw:done elapsed=120.2s
ttt:elapsed=120.2s
DIAGNOSTIC post_ttt val_loss:2.7524 val_bpb:1.0656 eval_time:5069ms
Serialized model: 135395695 bytes
Code size: 116724 bytes
gptq:building non-banked model for Hessian collection...
gptq:collecting hessians from training data (256 batches)...
gptq:collected hessians for 67 layers (training data)
selective_prune: 9712210 ±1 candidates, unpruned=14.46MB target=15.9MB
selective_prune: already fits, no pruning needed
Serialized model int6+brotli+byteshuffle: 15044693 bytes
Total submission size int6+brotli+byteshuffle: 15161417 bytes
final_int6_roundtrip val_loss:2.8194 val_bpb:1.0915 eval_time:6825ms
final_int6_roundtrip_exact val_loss:2.81943450 val_bpb:1.09149096
final_int6_sliding_window val_loss:2.7837 val_bpb:1.0777 stride:64 eval_time:73997ms
final_int6_sliding_window_exact val_loss:2.78369496 val_bpb:1.07765960
final_int8_zlib_roundtrip_exact val_loss:2.78369496 val_bpb:1.07765960