
Record: SP8192 + Recur345 + Par7 + EMA + QK5.25 + Pre-Quant TTT 10ep — val_bpb 1.0600 (3-seed mean) #1487

Closed
ndokutovich wants to merge 1 commit into openai:main from ndokutovich:s4-h3-submission

Conversation

@ndokutovich

Record: SP8192 + Full Stack + Tuned Pre-Quant TTT

val_bpb = 1.0600 (3-seed mean, std 0.0002) | ~15.95 MB | 8xH100 SXM

3-Seed Results

| Seed | Sliding BPB | Steps | Artifact (bytes) |
| --- | --- | --- | --- |
| 42 | 1.06023436 | 5161 | 15,954,437 |
| 1337 | 1.05980538 | 5174 | 15,954,178 |
| 2024 | 1.06010381 | 5164 | 15,960,801 |
| **Mean** | **1.06004785** | | |

What Changed vs PR #1485

Hyperparameter tuning on pre-quant TTT:

| Parameter | PR #1485 | This PR |
| --- | --- | --- |
| QK_GAIN_INIT | 5.0 | 5.25 |
| TTT_EPOCHS | 6 | 10 |
| TTT_FREEZE_BLOCKS | 2 | 1 |
| TTT_LR | 0.0005 | 0.00045 |
| 3-seed mean val_bpb | 1.0679 | 1.0600 |

Same architecture, same code, different env vars. Delta: -0.0079 BPB.
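Since the entire delta comes from environment variables, the relevant code pattern is just env-var-driven hyperparameter defaults. A minimal sketch of how such reads typically look (helper names are illustrative, not the repo's actual code; the variable names and PR #1485 defaults follow the table above):

```python
import os

def env_float(name: str, default: float) -> float:
    """Read a float hyperparameter from the environment, falling back to a default."""
    return float(os.environ.get(name, default))

def env_int(name: str, default: int) -> int:
    """Read an int hyperparameter from the environment, falling back to a default."""
    return int(os.environ.get(name, default))

# Knobs tuned in this PR; defaults shown are the PR #1485 settings.
QK_GAIN_INIT = env_float("QK_GAIN_INIT", 5.0)
PREQUANT_TTT_EPOCHS = env_int("PREQUANT_TTT_EPOCHS", 6)
PREQUANT_TTT_FREEZE_BLOCKS = env_int("PREQUANT_TTT_FREEZE_BLOCKS", 2)
PREQUANT_TTT_LR = env_float("PREQUANT_TTT_LR", 0.0005)
```

With this pattern, the reproduction command below fully determines the run: setting `QK_GAIN_INIT=5.25` etc. overrides the defaults without touching code.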

Full Stack

SP8192, 11L/13 virtual (3-layer depth recurrence), parallel residuals (L7+), EMA 0.9965, QK-Gain 5.25, skip gates, MuonEq-R, pre-quant AdamW TTT (10ep, lr=0.00045, freeze 1, cosine), SDClip GPTQ int6 + int8 embed + brotli.

Compliance (Track A)

  • Pre-quant TTT on val data BEFORE quantization, baked into artifact
  • No eval-time adaptation, no SLOT, no n-gram cache

Reproduction

```sh
pip install brotli sentencepiece kernels
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
VOCAB_SIZE=8192 QK_GAIN_INIT=5.25 PREQUANT_TTT_EPOCHS=10 PREQUANT_TTT_FREEZE_BLOCKS=1 PREQUANT_TTT_LR=0.00045 SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

PR #1471 @X-Abhishek-X, PR #1423 @aryanbhosale, PR #1394 @clarkkev, PR #1204 @msisovic, PR #1482 @aamodbhatt

Checklist

  • One folder under records/track_10min_16mb/
  • README.md, submission.json, train_gpt.py
  • 3 seed logs
  • All artifacts < 16,000,000 bytes
  • Train wallclock < 600s

…— val_bpb 1.0600 (3-seed mean)

Tuned variant with QK-Gain 5.25, 10-epoch TTT (lr=0.00045, freeze 1 block).
  seed 42:   1.06023436
  seed 1337: 1.05980538
  seed 2024: 1.06010381
  mean:      1.06004785 (std 0.0002)
owizdom added a commit to owizdom/parameter-golf that referenced this pull request Apr 9, 2026
…nthesis (validation pending)

First submission to stack three independently-legal val-data adaptations on the
PR openai#1487 (1.0600) base:

1. Pre-Quant AdamW TTT pushed to 11 epochs with freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H=X^T X computed from validation activations
   to align quantization with the eval distribution (novel on the modern stack;
   PR openai#1019 ablated this on its older base only)
3. Eval-Time Legal Score-First TTT 2 epochs with score-before-update ordering
   (Track B, builds on PR openai#1493)
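Item 2's calibration statistic is the standard GPTQ Hessian, H = XᵀX accumulated over layer-input activations — here sourced from validation batches instead of generic calibration text. A hedged numpy sketch of the accumulation (shapes, damping constant, and function name are illustrative, not the PR's actual collect_hessians_val):

```python
import numpy as np

def accumulate_hessian(batches, d_in, damp=0.01):
    """Accumulate H = sum_b X_b^T X_b over calibration batches.

    Each batch is an (n_tokens, d_in) matrix of layer-input activations.
    GPTQ uses H (plus a diagonal damping term) to order columns and
    compensate rounding error during weight quantization.
    """
    H = np.zeros((d_in, d_in))
    n = 0
    for X in batches:
        H += X.T @ X
        n += X.shape[0]
    H /= max(n, 1)  # average over tokens so damping scales sensibly
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)  # GPTQ-style damping
    return H
```

Feeding validation activations here is what aligns the rounding compensation with the eval distribution; the quantization machinery downstream is unchanged.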

The three knobs attack the 0.0187 BPB quantization gap measured in PR openai#1487
(1.0415 post-prequant-TTT FP -> 1.0602 post-quant sliding) from independent
angles. PR openai#1487's eval_val_ttt code path is unchanged but enabled via env vars.

Code diff vs PR openai#1487 base: 186 lines (~100 added in new collect_hessians_val
function, plus 8 hyperparameter defaults flipped). Architecture, optimizer,
training loop, EMA, and quantization machinery are byte-identical to PR openai#1487.

Projected val_bpb range: 1.0452 - 1.0542 (center 1.0497), which would clear
the 0.005-nat SOTA threshold over PR openai#1487. Worst case ~1.054 (still strong
non-record). py_compile clean. 3-seed validation requires ~$15-25 of 8xH100
SXM time on RunPod; see VALIDATION.md.

Compliance: Track A (artifact-baked val-data adaptation) + Track B (eval-time
score-first TTT). No SLOT, no n-gram cache, no ETLB.

Credits: PR openai#1487 ndokutovich, PR openai#1493 bigbag, PR openai#1019 abaybektursun,
PR openai#1394 clarkkev, PR openai#1413 dexhunter, PR openai#549 abaybektursun, PR openai#1412 Robby955,
PR openai#1204 msisovic, PR openai#1423 aryanbhosale, PR openai#1445 X-Abhishek-X.
owizdom added a commit to owizdom/parameter-golf that referenced this pull request Apr 9, 2026
…ib GPTQ + SLOT-24

Replaces the triple-stack (Pre-Quant TTT + Val-Calib GPTQ + Eval-Time Legal TTT)
with a quad-stack that supersedes the legal TTT path with SLOT-24, ported from
PR openai#1488 / PR openai#1313.

Four val-data adaptations stacked for the first time:

1. Pre-Quant AdamW TTT — 11 epochs, freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H=X^T X from val activations (Track A)
3. SLOT-24 — per-window hidden delta + logit bias on the frozen post-quant
   model, 24 cosine-decayed AdamW steps, throwaway parameters
4. (Optional) Eval-Time Legal Score-First TTT — disabled by default; SLOT
   supersedes it within the eval budget. Set SLOT_ENABLED=0 TTT_ENABLED=1
   to fall back.
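The SLOT mechanics in item 3 can be illustrated on the logit-bias half alone (the hidden-delta half is optimized the same way). A toy numpy sketch, assuming plain SGD in place of the described AdamW for brevity, with the cosine-decayed step schedule; step count and LR values are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_logit_bias(frozen_logits, targets, steps=24, lr=0.1, lr_min=0.01):
    """Fit a per-window logit bias b on the frozen model's logits.

    frozen_logits: (T, V) outputs of the frozen post-quant model for one window.
    targets: (T,) next-token ids for the same window.
    b is a throwaway parameter: re-initialized to zero for every window,
    so nothing learned here persists across windows.
    """
    T, V = frozen_logits.shape
    b = np.zeros(V)
    for t in range(steps):
        # cosine decay from lr down to lr_min over the step budget
        lr_t = lr_min + 0.5 * (lr - lr_min) * (1 + np.cos(np.pi * t / steps))
        p = softmax(frozen_logits + b)          # (T, V)
        onehot = np.eye(V)[targets]
        grad = (p - onehot).mean(axis=0)        # d(mean CE)/d b
        b -= lr_t * grad
    return b
```

Scoring the window with `frozen_logits + b` lowers its cross-entropy; because `b` is discarded and refit per window, the base model's weights never change.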

Code changes vs the previous synthesis commit:

- GPT class: split forward_logits into forward_hidden + compute_logits so
  SLOT can add the per-window delta to the hidden state without re-running
  the transformer stack.
- New eval_val_slot function ported from PR openai#1488 (per-window AdamW with
  cosine LR decay, stride masking, score-after-delta).
- run_evals: wires SLOT on a fresh post-quant model copy, gated by
  SLOT_ENABLED. Disables legal TTT by default.
- New hyperparameters: SLOT_ENABLED, SLOT_STEPS, SLOT_LR, SLOT_LR_MIN,
  SLOT_BATCH_SEQS, SLOT_EVAL_STRIDE.
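The forward_hidden / compute_logits split described above can be sketched with a toy model: the expensive stack runs once per window, and SLOT's repeated steps only re-run the cheap head with a delta added. Method names mirror the commit; the arithmetic is purely illustrative:

```python
import numpy as np

class TinyLM:
    """Toy stand-in for the GPT refactor: forward_hidden runs the
    (expensive) transformer stack once; compute_logits is a cheap head
    that SLOT can call repeatedly with a per-window hidden delta."""

    def __init__(self, d, vocab, seed=0):
        rng = np.random.default_rng(seed)
        self.W_stack = rng.standard_normal((d, d)) / np.sqrt(d)
        self.W_head = rng.standard_normal((d, vocab)) / np.sqrt(d)

    def forward_hidden(self, x):          # x: (T, d) — run once per window
        return np.tanh(x @ self.W_stack)

    def compute_logits(self, h, delta=None):  # cheap, run per SLOT step
        if delta is not None:
            h = h + delta                 # per-window hidden delta
        return h @ self.W_head
```

With `delta=None` (or zeros) the two calls compose to the original forward pass, so the split is behavior-preserving for the non-SLOT path.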

Folder renamed: 2026-04-09_PreQuantTTT11_ValCalibGPTQ_LegalEvalTTT_Synthesis
              -> 2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis

Time budget: ~530s of the 600s eval budget used (190s prequant TTT + 10s val-calib GPTQ + 80s sliding eval baseline + 250s SLOT-24); train wallclock is 590s against its own 600s budget.

Code: 2322 lines (vs 2039 in PR openai#1487 base, +283 added). py_compile clean.
README rewritten as user's submission with compact credits section.
@ndokutovich
Author

Closing as invalid. Same prequant_ttt_adapt_adamw implementation as #1485, which violates Condition 3 of #1017 (score-before-update). Full technical analysis in #1485. Will reimplement with per-chunk score-first pattern from #1413 / #549 before any future submission.
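For reference, the per-chunk score-first pattern from #1413 / #549 reduces to: score each chunk with the current weights, then update on that same chunk, so no chunk is ever scored by weights that have already seen it. A schematic sketch (the `score`/`update` callables stand in for real model evaluation and adaptation):

```python
def score_first_ttt(chunks, score, update):
    """Legal eval-time TTT ordering: every chunk contributes its loss
    BEFORE the model adapts on it (score-before-update)."""
    total, n = 0.0, 0
    for chunk in chunks:
        total += score(chunk)   # score with current weights first ...
        n += len(chunk)
        update(chunk)           # ... only then adapt on the same chunk
    return total / max(n, 1)
```

Violating Condition 3 means swapping the two calls inside the loop — the aggregate loss then reflects weights trained on the scored data, which is what invalidates this submission's prequant_ttt_adapt_adamw path.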
