Skip to content

Record: SP8192 + Pre-Quant TTT (QK 5.25, 8ep, freeze-1) — val_bpb 1.0787 (3-seed mean)#1482

Open
aamodbhatt wants to merge 1 commit intoopenai:mainfrom
aamodbhatt:record-2026-04-09-sp8192-qk525-ttt8-f1
Open

Record: SP8192 + Pre-Quant TTT (QK 5.25, 8ep, freeze-1) — val_bpb 1.0787 (3-seed mean)#1482
aamodbhatt wants to merge 1 commit intoopenai:mainfrom
aamodbhatt:record-2026-04-09-sp8192-qk525-ttt8-f1

Conversation

@aamodbhatt
Copy link
Copy Markdown

Record Summary

3-seed mean sliding val_bpb: 1.07873723 (std 0.00049363)
3-seed mean roundtrip val_bpb: 1.09258717 (std 0.00053392)

Hardware: 8xH100 SXM | Train cap target: 595s | Eval: sliding window stride=64

What changed

This package uses the SP8192 pre-quant TTT lane with tuned settings from the April 8 sweep:

  • QK_GAIN_INIT=5.25
  • TTT_ENABLED=1 with TTT_EPOCHS=8, TTT_LR=0.00045, TTT_FREEZE_BLOCKS=1
  • same SP8192 + recurrence + GPTQ pipeline

No tokenizer/dataset modifications, no eval-time adaptation, no SLOT/ngram overlays.

Seed Results

Seed sliding val_bpb roundtrip val_bpb train_s eval_s bytes_total
42 1.07913183 1.09299539 595.162 74.678 15,171,524
1337 1.07804121 1.09181877 595.086 74.663 15,163,267
2025 1.07903865 1.09294735 595.162 74.560 15,188,203
Mean 1.07873723 1.09258717 - - -
Std 0.00049363 0.00053392 - - -

Sweep Provenance

  • best single-seed sweep run (runB_seed1337): 1.07765960
  • confirmation seeds in this package: 42, 1337, 2025
  • raw sweep table included in runs.csv

Submission Checklist

  • One folder added under records/track_10min_16mb/
  • Included README.md
  • Included submission.json
  • Included train_gpt.py
  • Included train logs for 3 confirmation seeds
  • Included sweep provenance (runs.csv, best single-seed log)
  • All artifacts under 16,000,000 bytes
  • Train wallclock under 600s

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 8, 2026
Refresh PR cache, reclassify, publish frontier verdicts on
data-touching vs data-free compression (PR openai#672/openai#1482/openai#1477/openai#1471).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dexhunter
Copy link
Copy Markdown

Hi @aamodbhatt — I took a closer look at train_gpt.py in this submission and want to flag a potential compliance concern so it can be resolved before review.

The TTT function ttt_adapt_adamw (line 1132) takes val_tokens: Tensor as an argument, and inside the loop at line 1164-1166 it slices val_tokens[raw_start:raw_end] and feeds those tokens through the model with gradient updates. This function is called from the main training path at line 2135 with val_tokens sourced from fineweb_val_*.bin via load_validation_tokens(args.val_files, ...) at line 1787 — so the TTT pass is training the backbone on the validation shards before the GPTQ quantization step.

Per track_10min_16mb/README.md, validation data cannot be used during training. This is the same pattern that @ClassicLarry flagged on PR #1423, which also produced a low BPB number that wouldn't hold after the val-data adaptation was removed (@aryanbhosale's PR #1477 is the clean resubmission at 1.0822 after stripping PQT).

If I'm misreading the code and the "PQT" step is actually adapting on held-out train shards rather than fineweb_val_*.bin, please confirm — that would be consistent with PR #1429's approach and would resolve the concern. Otherwise this should probably be relabeled non-record.

taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Two of the three comp-frontier wins are env-var bumps with no code change:
- LOOP_START 4 → 3 (with NUM_LOOPS=2 and LOOP_END=5 this gives 3-layer
  recurrence on layers 3/4/5 instead of 2-layer on 4/5). PR openai#1485 / openai#1471 /
  openai#1437 use this. Expected -0.005 to -0.01 BPB.
- QK_GAIN_INIT 4 → 5. PRs openai#1413, openai#1423, openai#1485, openai#1437, openai#1351, openai#1408 are at 5;
  openai#1482 is at 5.25. PR openai#1477's default 4 is below the leaderboard curve.
  Expected -0.001 BPB.

C1 (Pre-Quant AdamW TTT) is the bigger win (-0.014 BPB) but requires real
code — agent is researching PR openai#1485 / openai#1416 / openai#1306 implementations in
background.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
The biggest free comp-frontier delta (-0.014 BPB). Ported from PR openai#1485 / openai#1306
which both descend from PR openai#1306's original ttt_adapt_adamw. PR openai#1482 frontier
(lr=0.00045, epochs=8, freeze_blocks=1) hits val_bpb 1.0787, vs PR openai#1477's
1.0822 with eval-time SGD TTT only.

Changes to submission/train.py:
- Hyperparameters: append 7 prequant_ttt_* env vars (defaults match PR openai#1482).
- New function prequant_ttt_adapt_adamw(h, base_model, device, val_tokens, rank,
  world_size): AdamW(lr) + cosine schedule, freezes first N blocks during TTT,
  unfreezes ALL params at the end (CRITICAL — GPTQ collect_hessians needs
  gradients on every block, leaving any frozen would zero its Hessian and
  quantize the block to garbage).
- train_and_eval: splice the call AFTER 'pre-quantization post-ema' eval and
  BEFORE serialize. Logs a 'post-prequant-ttt' val_bpb so we can see the
  improvement vs the EMA-only number.
- Stacks with the existing eval_val_sliding_ttt (post-quant) — different
  namespaces, gains compound (1.0872 EMA -> 1.0569 prequant -> ~1.0679 final).

run.sh: PREQUANT_TTT_ENABLED=1 by default with PR openai#1482 frontier hyperparams.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Mirror the Phase 1 append-only log format, pre-seeded with:
- Phase 1 baseline context (what we're improving from)
- Comp anchors (PR openai#1477/openai#1482/openai#1485) as the target ceiling
- Shot-by-shot result slots for S1-S8
- A cumulative speedup tracker table
- Phase 2 targets: ≥5x speedup, val_bpb 1.10-1.18 on 1xH100 SXM

The Phase 1 dry run val_bpb placeholder will be filled in once Pod L's run
completes (~03:30Z). That number becomes the Phase 2 floor we're optimizing
against.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 9, 2026
Both techniques from PR openai#1477 (SP8192: 1.0822 BPB, -0.033 vs record):
- QK_GAIN_INIT=5.0 (was 4.0): PR openai#1477/openai#1413 both use 5.0, ~0.002-0.005 BPB
- TTT_ENABLED=1, TTT_OPTIMIZER=adamw, TTT_LR=0.0005, TTT_EPOCHS=3, TTT_FREEZE_BLOCKS=1
  Score-First TTT = legal PR openai#461 protocol: score in inference_mode first,
  then adapt. run_legal_ttt() implementation is strictly causal.

NOT included (confirmed illegal):
- SLOT: 2-pass retroactive, optimizes delta on same tokens it scores
- Pre-Quant TTT (PR openai#1482/openai#1489): adapts model with fineweb_val_*.bin
  before GPTQ — dexhunter flagged as val-data-in-training violation

Baseline: sota_32 (WD=0.04, EMA=0.9965, RECUR=4,5, Mousse=1)
Expected: ~1.08-1.09 BPB
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants