
Record: SP8192 + Pre-Quant TTT + QK-Gain 5.0 + Depth Recurrence + MuonEq-R — val_bpb 1.0791 (3-seed mean)#1423

Open
aryanbhosale wants to merge 1 commit into openai:main from aryanbhosale:submission/sp8192-prequant-ttt-qkgain5

Conversation

@aryanbhosale

Record: SP8192 + Pre-Quant TTT + QK-Gain 5.0

val_bpb = 1.0791 (3-seed mean, std 0.0012) | ~15.12 MB | 8×H100 SXM

3-Seed Results

  Seed   Sliding BPB   Artifact (bytes)
  42     1.0802        15,123,918
  314    1.0778        15,118,254
  999    1.0794        15,127,567
  Mean   1.0791

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0356 BPB.

Key Change

Takes @clarkkev's SP8192 base (PR #1394, 1.0856 BPB) plus @stukenov's pre-quant TTT (PR #1364), and raises QK-Gain from 4.0 to 5.0 (an init validated by PR #1217 from @bigbag). A single hyperparameter change that improves the 3-seed mean by 0.0004 over PR #1416.
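For readers unfamiliar with the knob being turned, here is a hedged sketch of what a QK-Gain could look like. It assumes the gain is a learnable scalar applied to RMS-normalized queries before the attention softmax, replacing the usual 1/sqrt(d) factor; the actual placement inside train_gpt.py may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKGainAttention(nn.Module):
    """Minimal single-head causal attention with a learnable QK gain.

    Hypothetical sketch: assumes the gain scales normalized queries
    before the dot product; train_gpt.py may place it differently.
    """
    def __init__(self, dim: int, qk_gain_init: float = 5.0):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # Learnable scalar gain; this PR initializes it to 5.0 (was 4.0).
        self.qk_gain = nn.Parameter(torch.tensor(qk_gain_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Unit-normalize q and k, then scale by the gain: attention
        # logit magnitude is set by qk_gain rather than 1/sqrt(d).
        q = F.normalize(q, dim=-1) * self.qk_gain
        k = F.normalize(k, dim=-1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), 1)
        att = (q @ k.transpose(-2, -1)).masked_fill(causal, float("-inf"))
        return self.proj(att.softmax(dim=-1) @ v)

out = QKGainAttention(16)(torch.randn(2, 8, 16))
print(out.shape)  # torch.Size([2, 8, 16])
```

Under this reading, raising QK_GAIN_INIT from 4.0 to 5.0 only sharpens the initial attention logits; the gain remains trainable afterward.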

Full Stack

SP8192 vocab, MLP 4x, depth recurrence (loop 4,5), MuonEq-R, SDClip quantization, GPTQ embeddings, sigmoid-gated U-Net skips, pre-quant AdamW TTT (6 epochs, lr=0.0005, freeze first 2 blocks, cosine decay), brotli compression.
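The pre-quant TTT step (6 epochs, lr=0.0005, freeze first 2 blocks, cosine decay) can be sketched roughly as below. This is a hedged sketch, not the PR's code: the `model.blocks` attribute name and a forward signature that returns the loss are assumptions.

```python
import torch
import torch.nn as nn

def ttt_adapt_sketch(model: nn.Module, tokens: torch.Tensor,
                     epochs: int = 6, lr: float = 5e-4,
                     freeze_first: int = 2, seq_len: int = 64):
    """Illustrative pre-quant AdamW TTT: adapt weights before
    quantization, freezing the first `freeze_first` blocks and
    decaying lr on a cosine schedule. `model.blocks` and the
    loss-returning forward are assumed names, not train_gpt.py's."""
    for block in list(model.blocks)[:freeze_first]:
        for p in block.parameters():
            p.requires_grad_(False)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        for start in range(0, tokens.numel() - seq_len - 1, seq_len):
            x = tokens[start:start + seq_len].unsqueeze(0)
            y = tokens[start + 1:start + seq_len + 1].unsqueeze(0)
            loss = model(x, y)      # forward returns scalar loss (assumed)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()                # cosine decay per epoch
    model.eval()                    # frozen afterwards; quantization next
    return model
```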

Compliance (Track A — Fixed Predictor)

  • No eval-time adaptation — model frozen after training + pre-quant TTT + GPTQ
  • No SLOT, no n-gram cache
  • Pre-quant TTT baked into artifact (weights adapted before quantization, then frozen)
  • Standard sliding-window eval (stride=64)
  • All four conditions from Issue #1017 ("A Field Guide to Valid Submissions") trivially satisfied
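For reference, stride-64 sliding-window scoring under a frozen model can be sketched as follows. This is a hedged approximation: `model(x)` returning per-position logits is an assumption, and true bits-per-byte additionally normalizes by raw byte count, which this per-token version omits.

```python
import math
import torch
import torch.nn.functional as F

@torch.inference_mode()               # model stays frozen during eval
def sliding_window_bits_per_token(model, tokens, window=1024, stride=64):
    """Score each target token exactly once, conditioned on up to
    `window` tokens of left context, advancing `stride` at a time.
    Returns bits per token (true BPB would divide by byte count)."""
    n = tokens.numel()
    total_nll, scored = 0.0, 0
    for end in range(1, n, stride):
        stop = min(end + stride - 1, n - 1)   # targets tokens[end..stop]
        ctx = max(0, stop - window)           # cap context at `window`
        x = tokens[ctx:stop].unsqueeze(0)
        y = tokens[ctx + 1:stop + 1].unsqueeze(0)
        logits = model(x)                     # (1, T, vocab), assumed
        nll = F.cross_entropy(logits[0], y[0], reduction="none")
        new = stop - end + 1                  # only newly covered targets;
        total_nll += nll[-new:].sum().item()  # earlier ones were scored
        scored += new                         # by the previous window
    return total_nll / scored / math.log(2)   # nats -> bits
```

As a sanity check, a uniform model over a vocab of size V should score exactly log2(V) bits per token.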

Reproduction

pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192 --skip-manifest
SEED=42 QK_GAIN_INIT=5.0 torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

PR #1394 @clarkkev, PR #1364 @stukenov, PR #1416 @erichroepke, PR #1217 @bigbag, PR #1204 @msisovic, PR #1260 @dexhunter, PR #1019 @abaybektursun

@abaybektursun
Contributor

Hey, just a heads up: you are fine-tuning the model directly on the validation data for 6 epochs before quantization:

The function (https://github.com/openai/parameter-golf/pull/1423/files#diff-train_gpt.py, ~line 1208):

  def ttt_adapt_adamw(args, base_model, device, val_tokens, ...):
      """AdamW TTT: fine-tune on val data BEFORE quantization"""
      for epoch in range(args.ttt_epochs):        # 6 epochs
          ...
          local = val_tokens[raw_start:raw_end]   # validation data
          loss = base_model(x, y)                 # forward on val
          loss.backward()                         # backward on val
          optimizer.step()                        # update weights

The call site (~line 2204) passes the actual validation tokens:

  # AdamW TTT: fine-tune EMA model on val data BEFORE quantization
  if args.ttt_enabled:
      ttt_adapt_adamw(args, base_model, device, val_tokens, ...)

The logs confirm it (seed 42):

  post_ema val_bpb:  1.1026          # before touching val data
  ttt_adamw: epoch 1/6 loss: 2.9122
  ttt_adamw: epoch 6/6 loss: 2.7668  # loss drops across epochs
  post_ttt val_bpb:  1.0687          # after training on val: -0.034 BPB

This is not score-first TTT (PR #461 style), where each chunk is scored under inference_mode() before any weight update.
The same concern applies to PRs #1364, #1406, and #1408 which use the same pre-quant TTT mechanism.
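For concreteness, the score-first pattern described above can be sketched like this (illustrative names only, not the PR #461 code):

```python
import torch

def score_first_ttt(model, chunks, make_optimizer):
    """Score-first TTT sketch: each chunk is scored under
    inference_mode() BEFORE any weight update, so the loss recorded
    for chunk i never sees chunk i through training. Contrast with
    the flagged pre-quant TTT, which trains on the full val set for
    6 epochs and only then reports val_bpb on the same data."""
    opt = make_optimizer(model)
    losses = []
    for x, y in chunks:
        with torch.inference_mode():
            losses.append(model(x, y).item())  # score first, frozen
        loss = model(x, y)                     # then adapt, but only on
        loss.backward()                        # the already-scored chunk
        opt.step()
        opt.zero_grad()
    return losses
```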
