
Record: TTT-AdamW + SLOT L-BFGS25 LogitDelta + GPTQ DAMP=0.005 — val_bpb 1.00955#1318

Open
renqianluo wants to merge 1 commit into openai:main from renqianluo:record/gptq-damp005-1.00955

Conversation

@renqianluo

Result

val_bpb: 1.00955 (3-seed mean) | ~15.71 MB | 8×H100 SXM | ~568s eval

| Seed | Base int6 BPB | Final SLOT BPB | TTT+SLOT time |
|------|---------------|----------------|---------------|
| 1337 | 1.11745 | 1.00988 | 276+293=569s ✓ |
| 42   | 1.11648 | 1.00877 | 276+291=567s ✓ |
| 314  | 1.11733 | 1.01001 | 277+292=570s ✓ |

Key Changes vs Leaderboard SOTA (1.11437)

  1. TTT (~276s): 1 epoch AdamW (lr=0.001) on the test sequence, freezing the first 10/11 blocks. Adapts the last transformer block to the specific test distribution before quantization scoring.
  2. SLOT in logit space (~292s): Optimizes a global delta d ∈ R^{1024} added to logits for each sliding window via L-BFGS (max_iter=25, history=20, strong-Wolfe, warm-start). Uses focal loss on the last 128 tokens per window. Delta clamped to ±5 for stability.
  3. GPTQ damp=0.005: Halves the standard Hessian damping of 0.01, allowing more aggressive weight-error compensation; improves base int6 BPB by ~0.001.
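The SLOT step in point 2 can be sketched as follows. This is a minimal illustrative reading of the PR description, not the PR's actual code: the function name `slot_step` and the exact focal-loss form are assumptions; only the knobs (logit-space delta, L-BFGS with max_iter=25, history=20, strong Wolfe, focal loss on the last 128 tokens, clamp ±5, warm start via the `delta` argument) come from the writeup.

```python
import torch
import torch.nn.functional as F

def slot_step(logits, targets, delta=None, focal_tokens=128,
              focal_gamma=2.0, max_iter=25, history=20, clip=5.0):
    """One SLOT update in logit space for a single sliding window (sketch).

    logits:  [T, V] frozen model logits for the window
    targets: [T]    next-token targets for the window
    delta:   [V]    warm-start delta carried from the previous window (or None)
    """
    V = logits.size(-1)
    d = torch.zeros(V) if delta is None else delta.clone()
    d.requires_grad_(True)

    # Focal loss restricted to the last `focal_tokens` positions of the window.
    sl = slice(max(0, logits.size(0) - focal_tokens), None)

    opt = torch.optim.LBFGS([d], max_iter=max_iter, history_size=history,
                            line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        # delta is added in logit space; the output is still a normalized softmax
        logp = F.log_softmax(logits[sl] + d, dim=-1)
        logpt = logp.gather(-1, targets[sl].unsqueeze(-1)).squeeze(-1)
        pt = logpt.exp()
        loss = (-(1.0 - pt) ** focal_gamma * logpt).mean()  # focal loss
        loss.backward()
        return loss

    opt.step(closure)
    with torch.no_grad():
        d.clamp_(-clip, clip)  # stability clamp at ±5
    return d.detach()
```

Warm-starting is then just feeding the returned delta back in for the next window.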

Technique Stack

  • FA3 (PyTorch 2.9.1+cu128, ~91ms/step, 8×H100)
  • GPTQ int6, block_size=128, damp=0.005, val-data calibration
  • TTT: 1ep AdamW lr=0.001, freeze blocks 0-9
  • SLOT: L-BFGS25, history=20, warm-start, focal_tokens=128, delta_clip=5, logit space
  • BigramHash 3072×112, MTP (2 heads, weight=0.1), QK_GAIN=4.0, EMA+SWA, SoftSTE QAT
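For context on the damp=0.005 bullet: GPTQ regularizes the layer Hessian with a relative diagonal term before inverting it, and damp is the only knob in that step. A minimal sketch of where the parameter enters (the helper name `dampened_hessian_inverse` is illustrative, not from this repo):

```python
import torch

def dampened_hessian_inverse(H, damp=0.005):
    """Dampen a GPTQ layer Hessian and invert it (sketch).

    GPTQ regularizes H (the second-moment matrix of layer inputs) as
    H + damp * mean(diag(H)) * I before its Cholesky-based inverse.
    damp=0.005 halves the usual 0.01, trading numerical conditioning
    for more faithful weight-error compensation.
    """
    d = H.size(0)
    lam = damp * torch.diagonal(H).mean()
    H_damped = H + lam * torch.eye(d, dtype=H.dtype)
    # Cholesky-based inverse, as used by GPTQ's column-wise updates.
    return torch.cholesky_inverse(torch.linalg.cholesky(H_damped))
```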

Reproduction

export DATA_PATH=... TOKENIZER_PATH=...
torchrun --standalone --nproc_per_node=8 train_gpt.py        # seed 1337
SEED=42  torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=314 torchrun --standalone --nproc_per_node=8 train_gpt.py

…Clip5 + GPTQ DAMP=0.005 — val_bpb 1.00955 (3-seed mean)
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 4, 2026
…lip=5, warm-start

Port L-BFGS SLOT from PR openai#1318 into our causal SLOT framework:
- Delta in logit space [1,1,vocab_size=1024] instead of hidden space [1,1,512]
- L-BFGS optimizer (strong_wolfe, max_iter=25, history=20) replaces AdamW
- Focal loss: optimize on last 128 tokens intersected with causal context
- Warm-start: carry delta from previous batch
- Delta clamp ±5 for stability
- All config HARDCODED (env vars not forwarded to GPU)
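The "last 128 tokens intersected with causal context" bullet above admits a simple reading: take the window's focal tail, then keep only positions that lie strictly before the scoring frontier. A hypothetical sketch (the function name and the `scored_upto` frontier argument are assumptions, not names from the commit):

```python
import torch

def focal_positions(window_start, window_len, scored_upto, focal_tokens=128):
    """Window-relative indices eligible for the focal loss (sketch):
    the last `focal_tokens` positions of the window, intersected with
    positions whose absolute index precedes the scoring frontier."""
    lo = max(window_len - focal_tokens, 0)
    idx = torch.arange(lo, window_len)
    keep = (window_start + idx) < scored_upto  # causal-context intersection
    return idx[keep]
```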
@dexhunter
Contributor

dexhunter commented Apr 4, 2026

I think this PR would be much easier to evaluate if it added a short explicit compliance section against the current README / #1017 framing.

Right now, the two places that seem ambiguous are:

  1. TTT (~276s): 1 epoch AdamW ... on the test sequence
  2. SLOT ... Uses focal loss on the last 128 tokens per window

Under #1017, the relevant questions are basically the four conditions:

  • Condition 1 (causality): does the score for token t depend only on the artifact and prefix x_<t?
  • Condition 2 (normalized probabilities): are you still producing an ordinary full-vocabulary softmax distribution at each scored position?
  • Condition 3 (score-before-update): are the TTT and SLOT objectives restricted to already-scored positions, with the currently scored token excluded?
  • Condition 4 (single pass): is evaluation still exactly one left-to-right pass, with no rescoring after adaptation?
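To make the four conditions concrete, here is a toy single-pass scorer that structurally satisfies conditions 1, 2, and 4 (condition 3 constrains what any adaptation step inside the loop may train on). This is an illustrative sketch, not the harness's actual eval loop, and it reports bits per token; the leaderboard's bpb additionally divides by byte counts rather than token counts.

```python
import math
import torch
import torch.nn.functional as F

def single_pass_bits_per_token(model, tokens):
    """One left-to-right scoring pass over `tokens` (sketch).

    Condition 1: the score for token t uses only the prefix x_<t.
    Condition 2: each score comes from a full-vocabulary softmax.
    Condition 4: exactly one pass; no position is ever rescored.
    """
    nats = 0.0
    for t in range(1, len(tokens)):
        with torch.no_grad():
            logits = model(tokens[:t])            # prefix-only forward
        logp = F.log_softmax(logits[-1], dim=-1)  # normalized full-vocab distribution
        nats -= logp[tokens[t]].item()
        # Condition 3: any adaptation inserted here may only train on
        # tokens[:t+1], i.e. positions that have already been scored.
    return nats / (len(tokens) - 1) / math.log(2)
```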

If the answer to those is yes, I think it would really help reviewers if the PR body said so explicitly, for example in a small Compliance section.

Concretely, I think the most useful clarifications would be:

  • whether the TTT loss is computed only on previously scored tokens,
  • whether the SLOT focal-loss positions are also previously scored/context-only positions,
  • and whether any token is ever scored after adapting on that same token.

Not trying to nitpick the result here — I think the current writeup just leaves the legality story underspecified relative to the current README / #1017 guidance.

@MatoTeziTanka

Community Review — Record: TTT-AdamW + SLOT L-BFGS25 LogitDelta + GPTQ DAMP=0.005 — val_bpb 1.00955

BPB: 1.00955 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (head SHA a49a3130390f, file records/track_10min_16mb/2026-04-03_FA3_TTT_LBFGS25_LogitDelta_GPTQ_DAMP005_DeltaClip5_History20_WarmStart_Focal128_Freeze10/train_gpt.py):

The TTT path at line 1230 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape of the current leaderboard's legal frontier (PR #1413 dexhunter, the 1.0828 SP8192 + QK-Gain 5 + Legal TTT entry — verified at its head SHA against the is_last_chunk + torch.no_grad() score-first accumulator pattern).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
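The score-first-per-chunk shape described above can be sketched as follows. This is a structural illustration of the pattern the review attributes to the code (score each chunk under no_grad before adapting on it, with an is_last_chunk guard), not the file at that SHA; `model.loss` and `make_optimizer` are hypothetical stand-ins.

```python
import torch

def score_first_ttt(model, chunks, make_optimizer):
    """Score-first-per-chunk TTT (sketch): chunk i is scored under weights
    adapted only on chunks 0..i-1; adaptation on chunk i runs strictly
    after its scoring, and the final chunk is never adapted on."""
    opt = make_optimizer(model.parameters())
    total_loss, total_tokens = 0.0, 0
    for i, (inputs, targets) in enumerate(chunks):
        model.eval()
        with torch.no_grad():                     # score BEFORE any update
            total_loss += model.loss(inputs, targets).item() * len(targets)
            total_tokens += len(targets)
        if i == len(chunks) - 1:
            break                                 # is_last_chunk guard: no final adapt
        model.train()
        opt.zero_grad()
        model.loss(inputs, targets).backward()    # adapt on the just-scored chunk
        opt.step()
    return total_loss / total_tokens
```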

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 1.53s, dim=512, layers=11, vocab=1024, code=149224 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
