
Record: Dynamic Eval + TTT on SOTA Pipeline (val_bpb=1.1364)#397

Closed
translatingthename wants to merge 1 commit intoopenai:mainfrom
translatingthename:submission-dyneval-ttt

Conversation

@translatingthename

Summary

Dynamic evaluation (Krause et al., ICML 2018) applied to the SOTA pipeline without modifying training. The model takes periodic SGD gradient steps during sliding-window scoring, adapting to the local text distribution as it scores. A consistent 2.0% bpb improvement at zero artifact cost.

3-seed mean: 1.1371 (seeds 42, 7, 2024). Best seed: 1.1364. Merged SOTA: 1.1428.

Results (3-seed, 8xH100 SXM, SDPA backend)

| Seed | Steps | Int6 Roundtrip + TTT | + Dynamic Eval | Delta   | Artifact |
|------|-------|----------------------|----------------|---------|----------|
| 42   | 5,604 | 1.1607               | 1.1364         | -0.0243 | 15.65 MB |
| 7    | 5,590 | 1.1618               | 1.1369         | -0.0249 | 15.80 MB |
| 2024 | 5,620 | 1.1613               | 1.1380         | -0.0233 | 15.35 MB |
| Mean |       | 1.1613               | 1.1371         | -0.0242 |          |

Novel Contribution: Dynamic Evaluation

After TTT adaptation, we score the validation stream using sliding windows (stride=64). Between batches of scored windows, we take an SGD gradient step (lr=0.001) on the model weights. The model adapts to the local distribution as it scores. TTT adapts weights before scoring; dynamic eval adapts during scoring. The two are complementary.

  • Windows scored in batches of 32, gradient step every 4 batches
  • SGD without momentum, rank-local adaptation
  • ~344s eval time, total eval ~427s (well under 600s budget)
  • Zero additional artifact bytes
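The interleaved score-then-step loop described above can be sketched as follows. This is a minimal single-GPU illustration, not the PR's actual code: the function name `dynamic_eval_score`, the model interface, and the window/batch defaults are assumptions; only stride=64, lr=0.001, batches of 32, and a step every 4 batches come from the description above.

```python
import math
import torch

def dynamic_eval_score(model, tokens, window=2048, stride=64,
                       batch_size=32, step_every=4, lr=1e-3):
    """Sliding-window scoring with interleaved SGD adaptation
    (dynamic evaluation, Krause et al. 2018). `model(x)` is assumed
    to return per-position logits; shapes here are illustrative."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.0)
    windows = [tokens[i:i + window + 1]
               for i in range(0, len(tokens) - window, stride)]
    total_nll, total_count = 0.0, 0
    for b in range(0, len(windows), batch_size):
        batch = torch.stack(windows[b:b + batch_size])
        x, y = batch[:, :-1], batch[:, 1:]
        logits = model(x)
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        # the score is recorded on the CURRENT weights, before any update
        total_nll += loss.item() * y.numel()
        total_count += y.numel()
        loss.backward()  # accumulate gradients across batches
        if (b // batch_size + 1) % step_every == 0:
            opt.step()   # adapt to the local distribution mid-stream
            opt.zero_grad(set_to_none=True)
    # mean NLL in bits per token (bpb up to a bytes-per-token factor)
    return total_nll / total_count / math.log(2)
```

Note that each window is scored before the optimizer step that its gradient contributes to, which is what keeps the adaptation causal with respect to the scores.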

Attribution

Built on PR #315 (jfprincz): XSA, EMA, Partial RoPE, LN Scale, Late QAT.
PR #338 (alertcat): TTT integration.
SmearGate/BigramHash/OrthoInit originally by unnir.

Reference: Krause et al., "Dynamic Evaluation of Neural Sequence Models," ICML 2018.

See records/track_10min_16mb/2026-03-22_DynamicEval_TTT_11L/README.md for full details, ablation, what didn't work, and reproduction instructions.

@MatoTeziTanka

Community Review — Record: Dynamic Eval + TTT on SOTA Pipeline (val_bpb=1.1364)

BPB: 1.1364 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA 6b2ad269e11b, file records/track_10min_16mb/2026-03-22_DynamicEval_TTT_11L/train_gpt.py):

At line 1198 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

`ttt_adapt(args, base_model, device, val_tokens, rank, world_size, log_fn)` — `for epoch in range(args.ttt_epochs)` with `loss.backward()` and no prior `no_grad` scoring pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is visible in the function signature itself: which argument tensor is passed in.
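For reference, a minimal sketch of the score-first-per-chunk ordering constraint described above. This is illustrative only, not the #1416/#1423 code: the function name, chunking, and shapes are assumptions, and the legal lineage additionally draws its adaptation data from a held-out training slice rather than val_tokens.

```python
import math
import torch

def ttt_score_first(model, opt, stream, chunk=512):
    """Score-first-per-chunk TTT: every chunk is scored BEFORE the
    model takes a gradient step on it, so no token's recorded score
    benefits from having been trained on. Interface is illustrative."""
    total_nll, total_count = 0.0, 0
    for i in range(0, len(stream) - 1, chunk):
        x = stream[i:i + chunk]
        y = stream[i + 1:i + 1 + chunk]
        x = x[:len(y)].unsqueeze(0)          # trim a short final chunk
        y = y.unsqueeze(0)
        with torch.no_grad():                # 1) score on current weights
            logits = model(x)
            nll = torch.nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        total_nll += nll.item() * y.numel()
        total_count += y.numel()
        logits = model(x)                    # 2) only then adapt on it
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)
    return total_nll / total_count / math.log(2)
```

The flagged pattern inverts this ordering: a multi-epoch `loss.backward()`/`optimizer.step()` loop over val_tokens in which scoring only happens on the final pass, after the weights have already fit those tokens.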

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=77946 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
