Record: SP8192 + Pre-Quant TTT (QK 5.25, 8ep, freeze-1) — val_bpb 1.0787 (3-seed mean)#1482
aamodbhatt wants to merge 1 commit into openai:main
Conversation
Refresh PR cache, reclassify, publish frontier verdicts on data-touching vs data-free compression (PR openai#672/openai#1482/openai#1477/openai#1471). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hi @aamodbhatt — I took a closer look at the pre-quant TTT step. As far as I can tell, it adapts the model on the fineweb val shards before GPTQ, which would make this a val-data-in-training violation rather than a legal record. If I'm misreading the code and the "PQT" step is actually adapting on held-out train shards rather than the val shards, please correct me.
Two of the three comp-frontier wins are env-var bumps with no code change:

- LOOP_START 4 → 3 (with NUM_LOOPS=2 and LOOP_END=5 this gives 3-layer recurrence on layers 3/4/5 instead of 2-layer recurrence on 4/5). PRs openai#1485 / openai#1471 / openai#1437 use this. Expected -0.005 to -0.01 BPB.
- QK_GAIN_INIT 4 → 5. PRs openai#1413, openai#1423, openai#1485, openai#1437, openai#1351, openai#1408 are at 5; openai#1482 is at 5.25. PR openai#1477's default of 4 is below the leaderboard curve. Expected -0.001 BPB.

C1 (Pre-Quant AdamW TTT) is the bigger win (-0.014 BPB) but requires real code — the agent is researching the PR openai#1485 / openai#1416 / openai#1306 implementations in the background.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
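Since the two wins above are pure configuration, they amount to a couple of exports before launch. A minimal sketch (variable names from the message above; the surrounding run.sh wrapper is assumed):

```shell
# Hypothetical env-var bumps; only LOOP_START and QK_GAIN_INIT change.
export NUM_LOOPS=2
export LOOP_END=5
export LOOP_START=3      # was 4: recurrence now spans layers 3/4/5 instead of 4/5
export QK_GAIN_INIT=5    # was 4: matches the leaderboard cluster (openai#1482 uses 5.25)
echo "recurrence span: ${LOOP_START}-${LOOP_END}"
```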
The biggest free comp-frontier delta (-0.014 BPB). Ported from PR openai#1485 / openai#1306, which both descend from PR openai#1306's original ttt_adapt_adamw. The PR openai#1482 frontier (lr=0.00045, epochs=8, freeze_blocks=1) hits val_bpb 1.0787, vs PR openai#1477's 1.0822 with eval-time SGD TTT only.

Changes to submission/train.py:

- Hyperparameters: append 7 prequant_ttt_* env vars (defaults match PR openai#1482).
- New function prequant_ttt_adapt_adamw(h, base_model, device, val_tokens, rank, world_size): AdamW(lr) + cosine schedule, freezes the first N blocks during TTT, unfreezes ALL params at the end (CRITICAL — GPTQ collect_hessians needs gradients on every block; leaving any frozen would zero its Hessian and quantize the block to garbage).
- train_and_eval: splice the call AFTER the 'pre-quantization post-ema' eval and BEFORE serialize. Logs a 'post-prequant-ttt' val_bpb so we can see the improvement vs the EMA-only number.
- Stacks with the existing eval_val_sliding_ttt (post-quant) — different namespaces, so the gains compound (1.0872 EMA -> 1.0569 prequant -> ~1.0679 final).

run.sh: PREQUANT_TTT_ENABLED=1 by default with the PR openai#1482 frontier hyperparams.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
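The freeze-then-unfreeze shape of the step above can be sketched as follows. This is a hypothetical re-implementation, not the PR's code: the real signature (h, base_model, device, val_tokens, rank, world_size), data loading, and loss are assumed; here `model.blocks` is the transformer block list and `batches` yields (inputs, targets) pairs.

```python
import torch

def prequant_ttt_adapt_adamw(model, batches, lr=0.00045, epochs=8, freeze_blocks=1):
    """Pre-quantization TTT sketch: AdamW + cosine LR decay, with the first
    `freeze_blocks` blocks frozen during adaptation and everything unfrozen
    again at the end."""
    # Freeze the first `freeze_blocks` blocks for the duration of TTT.
    for block in list(model.blocks)[:freeze_blocks]:
        for p in block.parameters():
            p.requires_grad_(False)

    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    total_steps = max(1, epochs * len(batches))
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)

    model.train()
    for _ in range(epochs):
        for x, y in batches:
            loss = torch.nn.functional.cross_entropy(model(x), y)
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
            sched.step()

    # Per the commit message this is CRITICAL: unfreeze everything before
    # GPTQ, so collect_hessians sees gradients on every block; a still-frozen
    # block would get a zero Hessian and quantize to garbage.
    for p in model.parameters():
        p.requires_grad_(True)
    return model
```

The unfreeze at the end is the easy-to-miss part: the TTT freeze is a transient optimization choice, while the quantization pass that follows assumes a fully trainable model.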
Mirror the Phase 1 append-only log format, pre-seeded with:

- Phase 1 baseline context (what we're improving from)
- Comp anchors (PR openai#1477/openai#1482/openai#1485) as the target ceiling
- Shot-by-shot result slots for S1-S8
- A cumulative speedup tracker table
- Phase 2 targets: ≥5x speedup, val_bpb 1.10-1.18 on 1x H100 SXM

The Phase 1 dry-run val_bpb placeholder will be filled in once Pod L's run completes (~03:30Z). That number becomes the Phase 2 floor we're optimizing against.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both techniques from PR openai#1477 (SP8192: 1.0822 BPB, -0.033 vs record):

- QK_GAIN_INIT=5.0 (was 4.0): PR openai#1477/openai#1413 both use 5.0, worth ~0.002-0.005 BPB
- TTT_ENABLED=1, TTT_OPTIMIZER=adamw, TTT_LR=0.0005, TTT_EPOCHS=3, TTT_FREEZE_BLOCKS=1

Score-First TTT = the legal protocol from PR openai#461: score in inference_mode first, then adapt. The run_legal_ttt() implementation is strictly causal.

NOT included (confirmed illegal):

- SLOT: 2-pass retroactive; it optimizes the delta on the same tokens it scores
- Pre-Quant TTT (PR openai#1482/openai#1489): adapts the model with fineweb_val_*.bin before GPTQ — dexhunter flagged this as a val-data-in-training violation

Baseline: sota_32 (WD=0.04, EMA=0.9965, RECUR=4,5, Mousse=1)
Expected: ~1.08-1.09 BPB
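The score-first ordering is what makes this protocol legal: each chunk is scored before the model has ever adapted on it. A hypothetical sketch (the real run_legal_ttt() is not shown in this PR; `chunks` yielding (inputs, targets) and a cross-entropy loss are assumptions):

```python
import torch

def run_legal_ttt(model, chunks, opt):
    """Score-First TTT sketch: score each chunk under torch.inference_mode()
    with the current weights, and only afterwards adapt on it, so no chunk's
    score ever depends on its own tokens (strictly causal)."""
    scores = []
    for x, y in chunks:
        # 1) Score first: frozen weights, adaptation has not seen this chunk.
        with torch.inference_mode():
            scores.append(torch.nn.functional.cross_entropy(model(x), y).item())
        # 2) Then adapt on the same chunk, for the benefit of later chunks only.
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
    return scores
```

Contrast with the excluded SLOT scheme, which scores a chunk after optimizing on that same chunk's tokens — exactly the retroactivity this ordering forbids.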
Record Summary
3-seed mean sliding val_bpb: 1.07873723 (std 0.00049363)
3-seed mean roundtrip val_bpb: 1.09258717 (std 0.00053392)
Hardware: 8x H100 SXM | Train cap target: 595 s | Eval: sliding window, stride=64

What changed
This package uses the SP8192 pre-quant TTT lane with tuned settings from the April 8 sweep:
- QK_GAIN_INIT=5.25
- TTT_ENABLED=1 with TTT_EPOCHS=8, TTT_LR=0.00045, TTT_FREEZE_BLOCKS=1
- No tokenizer/dataset modifications, no eval-time adaptation, no SLOT/ngram overlays.
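For illustration, these settings could be read as env-var hyperparameters with the record's values as defaults. The parsing below is an assumption about how train_gpt.py consumes them, not its actual code:

```python
import os

def env_float(name, default):
    # Hyperparameters arrive as environment variables; fall back to the
    # record's tuned default when a variable is unset.
    return float(os.environ.get(name, default))

# Defaults match the tuned settings listed above (assumed parsing).
QK_GAIN_INIT = env_float("QK_GAIN_INIT", 5.25)
TTT_ENABLED = os.environ.get("TTT_ENABLED", "1") == "1"
TTT_EPOCHS = int(os.environ.get("TTT_EPOCHS", "8"))
TTT_LR = env_float("TTT_LR", 0.00045)
TTT_FREEZE_BLOCKS = int(os.environ.get("TTT_FREEZE_BLOCKS", "1"))
```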
Seed Results
Sweep Provenance
Best single run (runB_seed1337): 1.07765960
Seeds: 42, 1337, 2025; full sweep in runs.csv

Submission Checklist
- records/track_10min_16mb/README.md
- submission.json
- train_gpt.py
- runs.csv, best single-seed log