@@ -0,0 +1,39 @@
# Record: SP8192 + Pre-Quant TTT (QK 5.25, 8ep, freeze-1) — val_bpb 1.0787 (3-seed mean)

- **3-seed mean sliding val_bpb:** `1.07873723` (std `0.00049363`)
- **3-seed mean roundtrip val_bpb:** `1.09258717` (std `0.00054369`)

Hardware: `8xH100 SXM` | Train cap target: `595s` | Eval: sliding window `stride=64`
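
Sliding-window eval scores every token exactly once while giving each (after the first window) a long left context: windows advance by `stride` tokens and only the trailing `stride` positions of each window contribute to the loss. A minimal sketch of the span bookkeeping, assuming a window of 2048 (matching `train_seq_len` in the logs; the actual eval window size is not stated here, and the helper name is ours):

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return (ctx_start, ctx_end, score_start) spans.

    The first window scores all of its tokens; every later window
    re-reads up to `window - stride` tokens of context and scores
    only the final `stride` tokens, so each token is scored once.
    """
    spans = []
    pos = 0
    while pos < n_tokens:
        if pos == 0:
            end = min(window, n_tokens)
            spans.append((0, end, 0))          # first window: score everything
        else:
            end = min(pos + stride, n_tokens)
            spans.append((max(0, end - window), end, pos))
        pos = end
    return spans
```

This is why the sliding eval is so much slower than the roundtrip eval (`73997ms` vs `6825ms` in the log): each 64-token step re-processes nearly a full window of context.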

## What changed

This package uses the same fixed-predictor lane as the prior SP8192 pre-quant TTT stack, with tuned settings from our April 8 sweep:

- `QK_GAIN_INIT=5.25`
- `TTT_ENABLED=1` with `TTT_EPOCHS=8`, `TTT_LR=0.00045`, `TTT_FREEZE_BLOCKS=1`
- same SP8192 + recurrence + GPTQ pipeline

No tokenizer/dataset modifications, no eval-time adaptation, no SLOT/ngram overlays.
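
For reference, the tuned knobs expressed as environment overrides, with runA's settings (from `runs.csv`) for contrast. The variable names are the ones quoted above; how the training script consumes them is an assumption of this sketch:

```python
import os

# runA vs runB settings from the April 8 sweep (see runs.csv).
RUN_A = {"QK_GAIN_INIT": "5.0", "TTT_ENABLED": "1", "TTT_EPOCHS": "6",
         "TTT_LR": "0.0005", "TTT_FREEZE_BLOCKS": "2"}
RUN_B = {"QK_GAIN_INIT": "5.25", "TTT_ENABLED": "1", "TTT_EPOCHS": "8",
         "TTT_LR": "0.00045", "TTT_FREEZE_BLOCKS": "1"}

# Which knobs actually moved between the two runs.
changed = {k: (RUN_A[k], RUN_B[k]) for k in RUN_B if RUN_A[k] != RUN_B[k]}

# Apply the winning configuration as environment overrides.
os.environ.update(RUN_B)
```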

## Seed Results

| Seed | sliding val_bpb | roundtrip val_bpb | train_s | eval_s | bytes_total |
|------|----------------:|------------------:|--------:|-------:|------------:|
| 42 | 1.07913183 | 1.09299539 | 595.162 | 74.678 | 15171524 |
| 1337 | 1.07804121 | 1.09181877 | 595.086 | 74.663 | 15163267 |
| 2025 | 1.07903865 | 1.09294735 | 595.162 | 74.560 | 15188203 |
| **Mean** | **1.07873723** | **1.09258717** | - | - | - |
| **Std** | **0.00049363** | **0.00054369** | - | - | - |
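
The mean and std rows can be reproduced directly from the per-seed values; note that the reported std is the population standard deviation (`pstdev`, ddof=0), not the sample one:

```python
from statistics import fmean, pstdev

# Per-seed values from the table above (seeds 42, 1337, 2025).
sliding = [1.07913183, 1.07804121, 1.07903865]
roundtrip = [1.09299539, 1.09181877, 1.09294735]

# pstdev (population std) matches the reported std; stdev would not.
print(f"sliding   mean={fmean(sliding):.8f} std={pstdev(sliding):.8f}")
print(f"roundtrip mean={fmean(roundtrip):.8f} std={pstdev(roundtrip):.8f}")
```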

## Sweep Provenance

- best single-seed sweep run (`runB_seed1337`): `1.07765960`
- confirmation seeds in this package: `42, 1337, 2025`
- raw sweep table included in `runs.csv`
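
The sweep winner can be re-derived from the raw table. A sketch over the same rows (later columns omitted here for width; in practice you would read `runs.csv` from disk):

```python
import csv
import io

# Sweep rows copied from runs.csv, truncated after the val_bpb column.
RUNS = """\
run_id,seed,qk_gain,ttt_enabled,ttt_lr,ttt_epochs,ttt_freeze,val_bpb
runA_seed1337_qk5_ttt6_f2,1337,5.0,1,0.0005,6,2,1.08250260
runB_seed1337_qk525_ttt8_f1,1337,5.25,1,0.00045,8,1,1.07765960
runC_seed1337_qk5_nottt,1337,5.0,0,0.0005,0,11,1.10884414
runD_seed42_qk525_ttt8_f1,42,5.25,1,0.00045,8,1,1.07913183
runD_seed1337_qk525_ttt8_f1,1337,5.25,1,0.00045,8,1,1.07804121
runD_seed2025_qk525_ttt8_f1,2025,5.25,1,0.00045,8,1,1.07903865
"""

rows = list(csv.DictReader(io.StringIO(RUNS)))
best = min(rows, key=lambda r: float(r["val_bpb"]))
print(best["run_id"], best["val_bpb"])
```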

## Compliance Notes

- score computed from `final_int6_sliding_window_exact`
- roundtrip reported from `final_int6_roundtrip_exact`
- train capped by wallclock (`stopping_early` line in each log)
- artifact size from `Total submission size int6+brotli+byteshuffle`
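
The compliance values above come from specific log lines. A hedged sketch of pulling them out of a run log with regexes (line formats are taken from the log in this package; the helper name is ours):

```python
import re

# One pattern per compliance field, keyed to the exact log line formats.
PATTERNS = {
    "score": r"final_int6_sliding_window_exact .*val_bpb:([0-9.]+)",
    "roundtrip": r"final_int6_roundtrip_exact .*val_bpb:([0-9.]+)",
    "train_cap_ms": r"stopping_early: wallclock_cap train_time:(\d+)ms",
    "bytes_total": r"Total submission size int6\+brotli\+byteshuffle: (\d+) bytes",
}

def extract_metrics(log_text):
    """Return whichever compliance fields appear in log_text."""
    return {key: m.group(1)
            for key, pat in PATTERNS.items()
            if (m := re.search(pat, log_text))}
```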
@@ -0,0 +1,7 @@
run_id,seed,qk_gain,ttt_enabled,ttt_lr,ttt_epochs,ttt_freeze,val_bpb,roundtrip_bpb,train_s,eval_s,bytes_total,log_path
runA_seed1337_qk5_ttt6_f2,1337,5.0,1,0.0005,6,2,1.08250260,1.09687424,595.091,97.120,15148097,analysis/2026-04-08_record_push/runA_seed1337_qk5_ttt6_f2.log
runB_seed1337_qk525_ttt8_f1,1337,5.25,1,0.00045,8,1,1.07765960,1.09149096,595.140,73.997,15161417,analysis/2026-04-08_record_push/runB_seed1337_qk525_ttt8_f1.log
runC_seed1337_qk5_nottt,1337,5.0,0,0.0005,0,11,1.10884414,1.12541234,595.085,74.456,15151733,analysis/2026-04-08_record_push/runC_seed1337_qk5_nottt.log
runD_seed42_qk525_ttt8_f1,42,5.25,1,0.00045,8,1,1.07913183,1.09299539,NA,NA,15171524,analysis/2026-04-08_record_push/runD_seed42_qk525_ttt8_f1.log
runD_seed1337_qk525_ttt8_f1,1337,5.25,1,0.00045,8,1,1.07804121,1.09181877,NA,NA,15163267,analysis/2026-04-08_record_push/runD_seed1337_qk525_ttt8_f1.log
runD_seed2025_qk525_ttt8_f1,2025,5.25,1,0.00045,8,1,1.07903865,1.09294735,NA,NA,15188203,analysis/2026-04-08_record_push/runD_seed2025_qk525_ttt8_f1.log
@@ -0,0 +1,37 @@
{
"author": "Aamod Bhatt",
"github_id": "aamodbhatt",
"name": "SP8192 + Pre-Quant TTT (QK 5.25, 8ep, freeze-1) + GPTQ int6",
"date": "2026-04-09",
"track": "10min_16mb",
"hardware": "8xH100 80GB SXM",
"val_bpb": 1.07873723,
"val_bpb_std": 0.00049363,
"roundtrip_val_bpb": 1.09258717,
"roundtrip_val_bpb_std": 0.00054369,
"seeds": [
42,
1337,
2025
],
"seed_results": {
"42": {
"val_bpb": 1.07913183,
"roundtrip_val_bpb": 1.09299539,
"bytes_total": 15171524
},
"1337": {
"val_bpb": 1.07804121,
"roundtrip_val_bpb": 1.09181877,
"bytes_total": 15163267
},
"2025": {
"val_bpb": 1.07903865,
"roundtrip_val_bpb": 1.09294735,
"bytes_total": 15188203
}
},
"best_single_seed_sweep_val_bpb": 1.0776596,
"technique_summary": "SP8192 + Pre-Quant AdamW TTT + QK_GAIN_INIT=5.25 + TTT(8 epochs, freeze first block) + Full Hessian GPTQ int6 + sliding-window eval",
"notes": "No tokenizer/dataset edits; no eval-time adaptation; fixed predictor at evaluation."
}
@@ -0,0 +1,93 @@
W0408 16:17:26.158000 15891 torch/distributed/run.py:851]
W0408 16:17:26.158000 15891 torch/distributed/run.py:851] *****************************************
W0408 16:17:26.158000 15891 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0408 16:17:26.158000 15891 torch/distributed/run.py:851] *****************************************
logs/94fa9714-551c-45cf-acf8-0f6e7ab65754.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/workspace/parameter-golf/data/tokenizers/fineweb_8192_bpe.model
train_loader:dataset:fineweb10B_sp8192 train_shards:80
val_loader:shards pattern=/workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin tokens:40540160
model_params:35941976
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:595.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:9.0006 val_bpb:3.4844 train_time:0ms step_avg:0.03ms
step:1/20000 train_loss:9.0023 train_time:157ms step_avg:156.90ms
step:2/20000 train_loss:9.7770 train_time:190ms step_avg:94.91ms
step:3/20000 train_loss:10.0653 train_time:299ms step_avg:99.75ms
step:4/20000 train_loss:9.5484 train_time:410ms step_avg:102.54ms
step:5/20000 train_loss:8.9076 train_time:521ms step_avg:104.21ms
step:6/20000 train_loss:8.6493 train_time:632ms step_avg:105.28ms
step:7/20000 train_loss:8.3058 train_time:743ms step_avg:106.17ms
step:8/20000 train_loss:7.8565 train_time:854ms step_avg:106.70ms
step:9/20000 train_loss:7.8107 train_time:965ms step_avg:107.24ms
step:10/20000 train_loss:7.6008 train_time:1077ms step_avg:107.68ms
step:500/20000 train_loss:3.3487 train_time:56415ms step_avg:112.83ms
step:1000/20000 train_loss:3.0751 train_time:113132ms step_avg:113.13ms
step:1500/20000 train_loss:3.2878 train_time:169826ms step_avg:113.22ms
step:2000/20000 train_loss:3.1082 train_time:226351ms step_avg:113.18ms
step:2500/20000 train_loss:3.2439 train_time:282853ms step_avg:113.14ms
step:3000/20000 train_loss:2.9583 train_time:339352ms step_avg:113.12ms
step:3500/20000 train_loss:3.0975 train_time:395804ms step_avg:113.09ms
step:4000/20000 train_loss:3.0032 train_time:452215ms step_avg:113.05ms
step:4000/20000 val_loss:2.9840 val_bpb:1.1552 train_time:452295ms step_avg:113.07ms
step:4500/20000 train_loss:3.0079 train_time:508620ms step_avg:113.03ms
swa:start step:4600
late_qat:enabled step:4735 scale:0.1499
step:5000/20000 train_loss:2.8252 train_time:565718ms step_avg:113.14ms
step:5256/20000 val_loss:2.8603 val_bpb:1.1073 train_time:595140ms step_avg:113.23ms
stopping_early: wallclock_cap train_time:595140ms step:5256/20000
peak memory allocated: 28774 MiB reserved: 28832 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:2.8573 val_bpb:1.1061 eval_time:1832ms
ttt:start lr=0.00045 epochs=8 freeze_blocks=1 cosine_decay=True
ttt_adamw:params trainable=35939920 frozen=2056
ttt_adamw:epoch 1/8 loss:2.9018 time:15.2s
ttt_adamw:epoch 2/8 loss:2.8439 time:30.2s
ttt_adamw:epoch 3/8 loss:2.8229 time:45.2s
ttt_adamw:epoch 4/8 loss:2.8049 time:60.2s
ttt_adamw:epoch 5/8 loss:2.7892 time:75.2s
ttt_adamw:epoch 6/8 loss:2.7755 time:90.2s
ttt_adamw:epoch 7/8 loss:2.7646 time:105.2s
ttt_adamw:epoch 8/8 loss:2.7574 time:120.2s
ttt_adamw:done elapsed=120.2s
ttt:elapsed=120.2s
DIAGNOSTIC post_ttt val_loss:2.7524 val_bpb:1.0656 eval_time:5069ms
Serialized model: 135395695 bytes
Code size: 116724 bytes
gptq:building non-banked model for Hessian collection...
gptq:collecting hessians from training data (256 batches)...
gptq:collected hessians for 67 layers (training data)
selective_prune: 9712210 ±1 candidates, unpruned=14.46MB target=15.9MB
selective_prune: already fits, no pruning needed
Serialized model int6+brotli+byteshuffle: 15044693 bytes
Total submission size int6+brotli+byteshuffle: 15161417 bytes
final_int6_roundtrip val_loss:2.8194 val_bpb:1.0915 eval_time:6825ms
final_int6_roundtrip_exact val_loss:2.81943450 val_bpb:1.09149096
final_int6_sliding_window val_loss:2.7837 val_bpb:1.0777 stride:64 eval_time:73997ms
final_int6_sliding_window_exact val_loss:2.78369496 val_bpb:1.07765960
final_int8_zlib_roundtrip_exact val_loss:2.78369496 val_bpb:1.07765960