Record: FarnsworthEngine v1 — TTT + 11L Int6 MLP3x, val_bpb=1.1303 #254
timowhite88 wants to merge 1 commit into openai:main from
Conversation
@notapplica 3 seeds submitted now, mean is posted, all logs contained. Ready for @0hq review.
18aa3cc to 479b8bc
this is aura
Interesting that freezing early blocks during TTT helps stability. Have you experimented with freezing more or fewer blocks to see where the sweet spot is?
Matching PR #254 (1.1313 BPB) TTT approach:
- SGD optimizer instead of Adam (better for non-stationary TTT)
- 3 epochs per document (more adaptation)
- lr=0.002, momentum=0.9
- Freeze first 2 blocks' LoRA (stable features don't need adaptation)

New env vars: TTT_EPOCHS, TTT_OPTIMIZER, TTT_MOMENTUM, TTT_FREEZE_FIRST_N

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
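The recipe above can be sketched in a few lines. This is a minimal illustrative stand-in, not the PR's actual code: `TinyLM` and its block layout are invented for the example, and only the per-block parameters play the role of the LoRA adapters. What it shows is the stated hyperparameter choices — SGD with momentum instead of Adam, 3 epochs over a document, and the first 2 blocks frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Toy stand-in for the real model; per-block params act as 'adapters'."""
    def __init__(self, vocab=64, dim=32, n_blocks=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h = self.emb(x)
        for blk in self.blocks:
            h = h + torch.relu(blk(h))
        return self.head(h)

def ttt_adapt(model, doc, epochs=3, lr=0.002, momentum=0.9, freeze_first_n=2):
    # Freeze the first N blocks' adaptable params: early features are stable
    # and (per the commit message) don't need per-document adaptation.
    trainable = []
    for i, blk in enumerate(model.blocks):
        for p in blk.parameters():
            p.requires_grad_(i >= freeze_first_n)
            if i >= freeze_first_n:
                trainable.append(p)
    # SGD rather than Adam: no second-moment state to go stale on a
    # non-stationary per-document stream.
    opt = torch.optim.SGD(trainable, lr=lr, momentum=momentum)
    for _ in range(epochs):
        logits = model(doc[:, :-1])                      # next-token prediction
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               doc[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

doc = torch.randint(0, 64, (1, 33))
ttt_adapt(TinyLM(), doc)
```

Note this sketch deliberately says nothing about *when* in the eval loop it runs — that ordering is exactly what the compliance discussion below is about.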
Combines PR openai#287 (XSA + EMA + Int6 QAT) with PR openai#254 TTT adaptation.

Changes: FA2 fallback import, TTT hyperparameters, ttt_adapt function, TTT call before torch.compile in eval section.
"If it isn't abundantly obvious: You can't cheat on your test loss. You can't cheat by training on the validation set before you evaluate on the validation set. The language around test-time training has been confusing people: you are only allowed to test-time train on validation set tokens you've already evaluated your model on, since those tokens have already been graded!"
11L Int6 MLP3x + SmearGate + BigramHash + OrthoInit + TTT SGD 3ep

Exact reproduction of @timowhite88's FarnsworthEngine recipe. No modifications — run as-is to validate baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan just out here blatantly copying people lol |
#1 untried combination from competition commentary: TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB.

XSA_LAST_N=3 excludes self-attention in the final 3 layers. Zero extra params; frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
exp_a: Multi-Token Prediction (MTP_NUM_HEADS=2, excluded from export)
exp_b: SwiGLU MLP replacing ReLU² (hidden=1024, same param count)
exp_c: Vocab 1536 tokenizer for better bytes-per-token ratio

All based on PR #254 SOTA clone (1.1303 BPB). Priority: exp_c first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442) flagged as potentially invalid for adapting on eval tokens BEFORE scoring them. Added correct score-then-adapt protocol with implementation guide. https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
restore_low_dim_params_to_fp32(eval_model)
eval_model.load_state_dict(deq_state, strict=True)
# TTT: adapt model on validation data before eval
I don't think this is how it should be done :)
Community Review — Record: FarnsworthEngine v1 — TTT + 11L Int6 MLP3x, val_bpb=1.1303

BPB: 1.1303 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on

What I found in the code (head SHA

At line 1038 the pre-quant TTT function takes

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=68235 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of

Reviewed by @MatoTeziTanka — The Agora.

Classification via deterministic AST-based
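The legal Pre-Quant TTT pattern the review contrasts with can be sketched as follows. This is an assumption-laden illustration, not the code from PR #1416 or #1423: `adapt_fn`, `quantize_fn`, and the `holdout` size are hypothetical names. The one property it demonstrates is the review's distinction — the adapter trains on a held-out slice of *training* data before quantization, and validation tokens never appear anywhere in the pipeline before they are scored.

```python
def pre_quant_ttt(model, train_tokens, adapt_fn, quantize_fn, holdout=4096):
    """Adapt on a trailing slice of TRAINING data, then quantize.

    Validation data is deliberately absent from this function's signature:
    the compliant pattern never touches eval tokens before scoring them.
    """
    adapt_slice = train_tokens[-holdout:]   # training data, not val tokens
    adapt_fn(model, adapt_slice)            # adapter update before quantization
    return quantize_fn(model)               # quantize the adapted weights
```

Under this sketch, the flagged submissions differ only in what they pass as the adaptation slice — the eval set instead of a training-data holdout — which is the entire substance of the compliance flag.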
No description provided.