Record: FarnsworthEngine v1 — TTT + 11L Int6 MLP3x, val_bpb=1.1303#254

Open
timowhite88 wants to merge 1 commit into openai:main from timowhite88:farnsworth-engine-v1

Conversation

@timowhite88

No description provided.

@timowhite88
Author


@0hq Ready for review — 3 seeds complete with tight reproducibility:

  • Seed 1337: 1.1303
  • Seed 42: 1.1312
  • Seed 7: 1.1323
  • Mean: 1.1313

15.88 MB artifact, 600s train, 129s eval. Full logs for all 3 seeds included. This supersedes our earlier PRs #152 and #178.

@timowhite88
Author

@notapplica All 3 seeds are submitted, the mean is posted, and all logs are included. Ready for @0hq review.

@timowhite88 force-pushed the farnsworth-engine-v1 branch from 18aa3cc to 479b8bc on March 20, 2026 at 20:12
@himanalot

this is aura

@mohosy

mohosy commented Mar 20, 2026

Interesting that freezing early blocks during TTT helps stability. Have you experimented with freezing more or fewer blocks to find the sweet spot?

newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
Matching PR #254 (1.1313 BPB) TTT approach:
- SGD optimizer instead of Adam (better for non-stationary TTT)
- 3 epochs per document (more adaptation)
- lr=0.002, momentum=0.9
- Freeze first 2 blocks' LoRA (stable features don't need adaptation)

New env vars: TTT_EPOCHS, TTT_OPTIMIZER, TTT_MOMENTUM, TTT_FREEZE_FIRST_N

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
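The env vars listed in that commit message could be read with a small parser along these lines. This is an illustrative sketch only: the defaults mirror the values stated in the commit (3 epochs, SGD, momentum 0.9, freeze first 2 blocks), but the function name and dict layout are invented, not taken from the actual repo.

```python
import os

# Illustrative parser for the TTT env vars listed above. Defaults mirror the
# commit message (3 epochs, SGD, momentum 0.9, freeze first 2 blocks); the
# function name and dict layout are invented for this sketch.
def read_ttt_config(env=None):
    env = os.environ if env is None else env
    return {
        "epochs": int(env.get("TTT_EPOCHS", "3")),
        "optimizer": env.get("TTT_OPTIMIZER", "sgd").lower(),
        "momentum": float(env.get("TTT_MOMENTUM", "0.9")),
        "freeze_first_n": int(env.get("TTT_FREEZE_FIRST_N", "2")),
    }

# Override one knob, inherit defaults for the rest.
cfg = read_ttt_config(env={"TTT_EPOCHS": "5"})
```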
sseanliu added a commit to sseanliu/parameter-golf that referenced this pull request Mar 21, 2026
Combines PR openai#287 (XSA + EMA + Int6 QAT) with PR openai#254 TTT adaptation.
Changes: FA2 fallback import, TTT hyperparameters, ttt_adapt function,
TTT call before torch.compile in eval section.
@sharpobject

sharpobject commented Mar 21, 2026

"If it isn't abundantly obvious: You can't cheat on your test loss. You can't cheat by training on the validation set before you evaluate on the validation set. The validation language around test-time training has been confusing people: you are only allowed to test-time train on validation set tokens you've already evaluated your model on, since those tokens have already been graded!"
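The rule sharpobject quotes can be sketched with a toy model: every validation chunk must be graded with the current weights before the model is allowed to train on it. `toy_loss` and `adapt_step` below are invented scalar stand-ins, not the competition's actual loss or optimizer.

```python
# Toy sketch of the score-then-adapt rule: grade each chunk BEFORE adapting.
def toy_loss(w, chunk):
    # Stand-in for an LM loss: squared error of a scalar "weight"
    # against the chunk mean.
    m = sum(chunk) / len(chunk)
    return (w - m) ** 2

def adapt_step(w, chunk, lr=0.1):
    # One SGD step on toy_loss for this chunk.
    m = sum(chunk) / len(chunk)
    return w - lr * 2 * (w - m)

def score_then_adapt(w, chunks):
    """Legal TTT: each chunk is scored first, then adapted on."""
    losses = []
    for chunk in chunks:
        losses.append(toy_loss(w, chunk))  # graded before any update
        w = adapt_step(w, chunk)           # adaptation only after scoring
    return sum(losses) / len(losses), w

avg_loss, final_w = score_then_adapt(0.0, [[1.0, 1.0], [2.0, 2.0], [2.0, 2.0]])
```

Reversing the two lines inside the loop (adapt, then score) is exactly the pattern the quote rules out: the loss would then be measured on tokens the model has already trained on.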

newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
11L Int6 MLP3x + SmearGate + BigramHash + OrthoInit + TTT SGD 3ep
Exact reproduction of @timowhite88's FarnsworthEngine recipe.
No modifications — run as-is to validate baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@himanalot

newjordan just out here blatantly copying people lol

newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
#1 untried combination from competition commentary:
TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB
XSA_LAST_N=3 excludes self-attention in final 3 layers.
Zero extra params, frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan referenced this pull request in newjordan/parameter-golf Mar 21, 2026
exp_a: Multi-Token Prediction (MTP_NUM_HEADS=2, excluded from export)
exp_b: SwiGLU MLP replacing ReLU² (hidden=1024, same param count)
exp_c: Vocab 1536 tokenizer for better bytes-per-token ratio

All based on PR #254 SOTA clone (1.1303 BPB). Priority: exp_c first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ThomAub pushed a commit to ThomAub/parameter-golf that referenced this pull request Mar 22, 2026
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442)
flagged as potentially invalid for adapting on eval tokens BEFORE scoring them.
Added correct score-then-adapt protocol with implementation guide.

https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
restore_low_dim_params_to_fp32(eval_model)
eval_model.load_state_dict(deq_state, strict=True)

# TTT: adapt model on validation data before eval
I don't think this is how it should be done :)

newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
11L Int6 MLP3x + SmearGate + BigramHash + OrthoInit + TTT SGD 3ep
Exact reproduction of @timowhite88's FarnsworthEngine recipe.
No modifications — run as-is to validate baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
openai#1 untried combination from competition commentary:
TTT (from openai#254) + XSA (from openai#265) = estimated 1.117-1.121 BPB
XSA_LAST_N=3 excludes self-attention in final 3 layers.
Zero extra params, frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
exp_a: Multi-Token Prediction (MTP_NUM_HEADS=2, excluded from export)
exp_b: SwiGLU MLP replacing ReLU² (hidden=1024, same param count)
exp_c: Vocab 1536 tokenizer for better bytes-per-token ratio

All based on PR openai#254 SOTA clone (1.1303 BPB). Priority: exp_c first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: FarnsworthEngine v1 — TTT + 11L Int6 MLP3x, val_bpb=1.1303

BPB: 1.1303 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA 479b8bcfa8ea, file records/track_10min_16mb/2026-03-20_FarnsworthEngine_TTT_11L_Int6_MLP3x/train_gpt.py):

At line 1038 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

ttt_adapt(args, base_model, device, val_tokens, rank, world_size, log_fn) — for epoch in range(args.ttt_epochs), loss.backward() without prior no_grad score pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern for which PR #1376 (stukenov) was closed, subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction shows up in the function signature itself: which tensor is passed in.
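The gap between the flagged pattern and the legal one can be shown numerically with a toy scalar model. `chunk_loss` and `sgd_step` below are invented stand-ins, not code from train_gpt.py; the point is only that adapting on all validation chunks before scoring deflates the measured loss.

```python
# Hypothetical contrast: multi-epoch adapt-then-score (flagged) vs
# score-first-per-chunk (legal), on an invented scalar toy model.
def chunk_loss(w, chunk):
    m = sum(chunk) / len(chunk)
    return (w - m) ** 2

def sgd_step(w, chunk, lr=0.1):
    m = sum(chunk) / len(chunk)
    return w - lr * 2 * (w - m)

def flagged_eval(w, chunks, epochs=3):
    # Flagged: train on every val chunk for several epochs, score afterwards.
    for _ in range(epochs):
        for c in chunks:
            w = sgd_step(w, c)
    return sum(chunk_loss(w, c) for c in chunks) / len(chunks)

def legal_eval(w, chunks):
    # Legal: each chunk is scored before the model ever trains on it.
    total = 0.0
    for c in chunks:
        total += chunk_loss(w, c)
        w = sgd_step(w, c)
    return total / len(chunks)

chunks = [[1.0, 1.0], [2.0, 2.0], [2.0, 2.0]]
cheated, honest = flagged_eval(0.0, chunks), legal_eval(0.0, chunks)
```

On this toy the flagged protocol reports a much lower loss for the identical model and data, which is precisely why the score-first discipline matters.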

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=68235 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
