
Record: TTT-AdamW + SLOT L-BFGS25 LogitDelta + GPTQ DAMP=0.005 — val_bpb 1.00955#1318

Open
renqianluo wants to merge 1 commit into openai:main from renqianluo:record/gptq-damp005-1.00955

Conversation

@renqianluo

Result

val_bpb: 1.00955 (3-seed mean) | ~15.71 MB | 8×H100 SXM | ~568s eval

| Seed | Base int6 BPB | Final SLOT BPB | TTT+SLOT time |
|------|---------------|----------------|---------------|
| 1337 | 1.11745 | 1.00988 | 276+293=569s ✓ |
| 42   | 1.11648 | 1.00877 | 276+291=567s ✓ |
| 314  | 1.11733 | 1.01001 | 277+292=570s ✓ |

Key Changes vs Leaderboard SOTA (1.11437)

  1. TTT (~276s): 1 epoch AdamW (lr=0.001) on the test sequence, freezing the first 10/11 blocks. Adapts the last transformer block to the specific test distribution before quantization scoring.
  2. SLOT in logit space (~292s): Optimizes a global delta d ∈ R^{1024} added to logits for each sliding window via L-BFGS (max_iter=25, history=20, strong-Wolfe, warm-start). Uses focal loss on the last 128 tokens per window. Delta clamped to ±5 for stability.
  3. GPTQ damp=0.005: Halves the standard Hessian damping of 0.01, allowing more aggressive weight-error compensation; improves base int6 BPB by ~0.001.
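The SLOT step in point 2 can be sketched as follows. This is a minimal illustrative reading of the PR description, not the PR's actual code: the function name `slot_step` and the exact focal-loss form are assumptions; only the knobs (logit-space delta, L-BFGS with max_iter=25, history=20, strong Wolfe, focal loss on the last 128 tokens, clamp ±5, warm start via the `delta` argument) come from the writeup.

```python
import torch
import torch.nn.functional as F

def slot_step(logits, targets, delta=None, focal_tokens=128,
              focal_gamma=2.0, max_iter=25, history=20, clip=5.0):
    """One SLOT update in logit space for a single sliding window (sketch).

    logits:  [T, V] frozen model logits for the window
    targets: [T]    next-token targets for the window
    delta:   [V]    warm-start delta carried from the previous window (or None)
    """
    V = logits.size(-1)
    d = torch.zeros(V) if delta is None else delta.clone()
    d.requires_grad_(True)

    # Focal loss restricted to the last `focal_tokens` positions of the window.
    sl = slice(max(0, logits.size(0) - focal_tokens), None)

    opt = torch.optim.LBFGS([d], max_iter=max_iter, history_size=history,
                            line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        # delta is added in logit space; the output is still a normalized softmax
        logp = F.log_softmax(logits[sl] + d, dim=-1)
        logpt = logp.gather(-1, targets[sl].unsqueeze(-1)).squeeze(-1)
        pt = logpt.exp()
        loss = (-(1.0 - pt) ** focal_gamma * logpt).mean()  # focal loss
        loss.backward()
        return loss

    opt.step(closure)
    with torch.no_grad():
        d.clamp_(-clip, clip)  # stability clamp at ±5
    return d.detach()
```

Warm-starting is then just feeding the returned delta back in for the next window.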

Technique Stack

  • FA3 (PyTorch 2.9.1+cu128, ~91ms/step, 8×H100)
  • GPTQ int6, block_size=128, damp=0.005, val-data calibration
  • TTT: 1ep AdamW lr=0.001, freeze blocks 0-9
  • SLOT: L-BFGS25, history=20, warm-start, focal_tokens=128, delta_clip=5, logit space
  • BigramHash 3072×112, MTP (2 heads, weight=0.1), QK_GAIN=4.0, EMA+SWA, SoftSTE QAT
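For context on the damp=0.005 bullet: GPTQ regularizes the layer Hessian with a relative diagonal term before inverting it, and damp is the only knob in that step. A minimal sketch of where the parameter enters (the helper name `dampened_hessian_inverse` is illustrative, not from this repo):

```python
import torch

def dampened_hessian_inverse(H, damp=0.005):
    """Dampen a GPTQ layer Hessian and invert it (sketch).

    GPTQ regularizes H (the second-moment matrix of layer inputs) as
    H + damp * mean(diag(H)) * I before its Cholesky-based inverse.
    damp=0.005 halves the usual 0.01, trading numerical conditioning
    for more faithful weight-error compensation.
    """
    d = H.size(0)
    lam = damp * torch.diagonal(H).mean()
    H_damped = H + lam * torch.eye(d, dtype=H.dtype)
    # Cholesky-based inverse, as used by GPTQ's column-wise updates.
    return torch.cholesky_inverse(torch.linalg.cholesky(H_damped))
```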

Reproduction

export DATA_PATH=... TOKENIZER_PATH=...
torchrun --standalone --nproc_per_node=8 train_gpt.py        # seed 1337
SEED=42  torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=314 torchrun --standalone --nproc_per_node=8 train_gpt.py

…Clip5 + GPTQ DAMP=0.005 — val_bpb 1.00955 (3-seed mean)
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 4, 2026
…lip=5, warm-start

Port L-BFGS SLOT from PR openai#1318 into our causal SLOT framework:
- Delta in logit space [1,1,vocab_size=1024] instead of hidden space [1,1,512]
- L-BFGS optimizer (strong_wolfe, max_iter=25, history=20) replaces AdamW
- Focal loss: optimize on last 128 tokens intersected with causal context
- Warm-start: carry delta from previous batch
- Delta clamp ±5 for stability
- All config HARDCODED (env vars not forwarded to GPU)
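The "last 128 tokens intersected with causal context" bullet above admits a simple reading: take the window's focal tail, then keep only positions that lie strictly before the scoring frontier. A hypothetical sketch (the function name and the `scored_upto` frontier argument are assumptions, not names from the commit):

```python
import torch

def focal_positions(window_start, window_len, scored_upto, focal_tokens=128):
    """Window-relative indices eligible for the focal loss (sketch):
    the last `focal_tokens` positions of the window, intersected with
    positions whose absolute index precedes the scoring frontier."""
    lo = max(window_len - focal_tokens, 0)
    idx = torch.arange(lo, window_len)
    keep = (window_start + idx) < scored_upto  # causal-context intersection
    return idx[keep]
```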
@dexhunter
Contributor

dexhunter commented Apr 4, 2026

I think this PR would be much easier to evaluate if it added a short explicit compliance section against the current README / #1017 framing.

Right now, the two places that seem ambiguous are:

  1. TTT (~276s): 1 epoch AdamW ... on the test sequence
  2. SLOT ... Uses focal loss on the last 128 tokens per window

Under #1017, the relevant questions are basically the four conditions:

  • Condition 1 (causality): does the score for token t depend only on the artifact and prefix x_<t?
  • Condition 2 (normalized probabilities): are you still producing an ordinary full-vocabulary softmax distribution at each scored position?
  • Condition 3 (score-before-update): are the TTT and SLOT objectives restricted to already-scored positions, with the currently scored token excluded?
  • Condition 4 (single pass): is evaluation still exactly one left-to-right pass, with no rescoring after adaptation?
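To make the four conditions concrete, here is a toy single-pass scorer that structurally satisfies conditions 1, 2, and 4 (condition 3 constrains what any adaptation step inside the loop may train on). This is an illustrative sketch, not the harness's actual eval loop, and it reports bits per token; the leaderboard's bpb additionally divides by byte counts rather than token counts.

```python
import math
import torch
import torch.nn.functional as F

def single_pass_bits_per_token(model, tokens):
    """One left-to-right scoring pass over `tokens` (sketch).

    Condition 1: the score for token t uses only the prefix x_<t.
    Condition 2: each score comes from a full-vocabulary softmax.
    Condition 4: exactly one pass; no position is ever rescored.
    """
    nats = 0.0
    for t in range(1, len(tokens)):
        with torch.no_grad():
            logits = model(tokens[:t])            # prefix-only forward
        logp = F.log_softmax(logits[-1], dim=-1)  # normalized full-vocab distribution
        nats -= logp[tokens[t]].item()
        # Condition 3: any adaptation inserted here may only train on
        # tokens[:t+1], i.e. positions that have already been scored.
    return nats / (len(tokens) - 1) / math.log(2)
```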

If the answer to those is yes, I think it would really help reviewers if the PR body said so explicitly, for example in a small Compliance section.

Concretely, I think the most useful clarifications would be:

  • whether the TTT loss is computed only on previously scored tokens,
  • whether the SLOT focal-loss positions are also previously scored/context-only positions,
  • and whether any token is ever scored after adapting on that same token.

Not trying to nitpick the result here — I think the current writeup just leaves the legality story underspecified relative to the current README / #1017 guidance.

@MatoTeziTanka

Community Review — Record: TTT-AdamW + SLOT L-BFGS25 LogitDelta + GPTQ DAMP=0.005 — val_bpb 1.00955

BPB: 1.00955 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (head SHA a49a3130390f, file records/track_10min_16mb/2026-04-03_FA3_TTT_LBFGS25_LogitDelta_GPTQ_DAMP005_DeltaClip5_History20_WarmStart_Focal128_Freeze10/train_gpt.py):

The TTT path at line 1230 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape of the current leaderboard's legal frontier (PR #1413 dexhunter, the 1.0828 SP8192 + QK-Gain 5 + Legal TTT entry — verified at its head SHA against the is_last_chunk + torch.no_grad() score-first accumulator pattern).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
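The score-first-per-chunk shape described above can be sketched as follows. This is a structural illustration of the pattern the review attributes to the code (score each chunk under no_grad before adapting on it, with an is_last_chunk guard), not the file at that SHA; `model.loss` and `make_optimizer` are hypothetical stand-ins.

```python
import torch

def score_first_ttt(model, chunks, make_optimizer):
    """Score-first-per-chunk TTT (sketch): chunk i is scored under weights
    adapted only on chunks 0..i-1; adaptation on chunk i runs strictly
    after its scoring, and the final chunk is never adapted on."""
    opt = make_optimizer(model.parameters())
    total_loss, total_tokens = 0.0, 0
    for i, (inputs, targets) in enumerate(chunks):
        model.eval()
        with torch.no_grad():                     # score BEFORE any update
            total_loss += model.loss(inputs, targets).item() * len(targets)
            total_tokens += len(targets)
        if i == len(chunks) - 1:
            break                                 # is_last_chunk guard: no final adapt
        model.train()
        opt.zero_grad()
        model.loss(inputs, targets).backward()    # adapt on the just-scored chunk
        opt.step()
    return total_loss / total_tokens
```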

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 1.53s, dim=512, layers=11, vocab=1024, code=149224 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
