Record: FarnsworthEngine v1 — TTT + 11L Int6 MLP3x, val_bpb=1.1303 #254
timowhite88 wants to merge 1 commit into openai:main from
Conversation
@notapplica 3 seeds submitted now, mean is posted, all logs contained. Ready for @0hq review.
18aa3cc to 479b8bc
this is aura
Interesting that freezing early blocks during TTT helps stability. Have you experimented with freezing more or fewer blocks to see where the sweet spot is?
Matching PR #254 (1.1313 BPB) TTT approach:
- SGD optimizer instead of Adam (better for non-stationary TTT)
- 3 epochs per document (more adaptation)
- lr=0.002, momentum=0.9
- Freeze first 2 blocks' LoRA (stable features don't need adaptation)

New env vars: TTT_EPOCHS, TTT_OPTIMIZER, TTT_MOMENTUM, TTT_FREEZE_FIRST_N

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
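The recipe above can be sketched in a few lines. This is a minimal illustrative stand-in, not the PR's actual code: `TinyLM` and its block layout are invented for the example, and only the per-block parameters play the role of the LoRA adapters. What it shows is the stated hyperparameter choices — SGD with momentum instead of Adam, 3 epochs over a document, and the first 2 blocks frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Toy stand-in for the real model; per-block params act as 'adapters'."""
    def __init__(self, vocab=64, dim=32, n_blocks=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h = self.emb(x)
        for blk in self.blocks:
            h = h + torch.relu(blk(h))
        return self.head(h)

def ttt_adapt(model, doc, epochs=3, lr=0.002, momentum=0.9, freeze_first_n=2):
    # Freeze the first N blocks' adaptable params: early features are stable
    # and (per the commit message) don't need per-document adaptation.
    trainable = []
    for i, blk in enumerate(model.blocks):
        for p in blk.parameters():
            p.requires_grad_(i >= freeze_first_n)
            if i >= freeze_first_n:
                trainable.append(p)
    # SGD rather than Adam: no second-moment state to go stale on a
    # non-stationary per-document stream.
    opt = torch.optim.SGD(trainable, lr=lr, momentum=momentum)
    for _ in range(epochs):
        logits = model(doc[:, :-1])                      # next-token prediction
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               doc[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

doc = torch.randint(0, 64, (1, 33))
ttt_adapt(TinyLM(), doc)
```

Note this sketch deliberately says nothing about *when* in the eval loop it runs — that ordering is exactly what the compliance discussion below is about.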
Combines PR openai#287 (XSA + EMA + Int6 QAT) with PR openai#254 TTT adaptation.

Changes: FA2 fallback import, TTT hyperparameters, ttt_adapt function, TTT call before torch.compile in eval section.
"If it isn't abundantly obvious: You can't cheat on your test loss. You can't cheat by training on the validation set before you evaluate on the validation set. The language around test-time training has been confusing people: you are only allowed to test-time train on validation set tokens you've already evaluated your model on, since those tokens have already been graded!"
11L Int6 MLP3x + SmearGate + BigramHash + OrthoInit + TTT SGD 3ep

Exact reproduction of @timowhite88's FarnsworthEngine recipe. No modifications — run as-is to validate baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan just out here blatantly copying people lol |
#1 untried combination from competition commentary: TTT (from #254) + XSA (from #265) = estimated 1.117-1.121 BPB.

XSA_LAST_N=3 excludes self-attention in the final 3 layers. Zero extra params; frees attention capacity for cross-token focus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
exp_a: Multi-Token Prediction (MTP_NUM_HEADS=2, excluded from export)
exp_b: SwiGLU MLP replacing ReLU² (hidden=1024, same param count)
exp_c: Vocab 1536 tokenizer for better bytes-per-token ratio

All based on PR #254 SOTA clone (1.1303 BPB). Priority: exp_c first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442) flagged as potentially invalid for adapting on eval tokens BEFORE scoring them. Added correct score-then-adapt protocol with implementation guide. https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
restore_low_dim_params_to_fp32(eval_model)
eval_model.load_state_dict(deq_state, strict=True)
# TTT: adapt model on validation data before eval
I don't think this is how it should be done :)
Community Review — Record: FarnsworthEngine v1 — TTT + 11L Int6 MLP3x, val_bpb=1.1303

BPB: 1.1303 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on

What I found in the code (head SHA

At line 1038 the pre-quant TTT function takes

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=68235 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of

Reviewed by @MatoTeziTanka — The Agora.

Classification via deterministic AST-based
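The legal Pre-Quant TTT pattern the review contrasts with can be sketched as follows. This is an assumption-laden illustration, not the code from PR #1416 or #1423: `adapt_fn`, `quantize_fn`, and the `holdout` size are hypothetical names. The one property it demonstrates is the review's distinction — the adapter trains on a held-out slice of *training* data before quantization, and validation tokens never appear anywhere in the pipeline before they are scored.

```python
def pre_quant_ttt(model, train_tokens, adapt_fn, quantize_fn, holdout=4096):
    """Adapt on a trailing slice of TRAINING data, then quantize.

    Validation data is deliberately absent from this function's signature:
    the compliant pattern never touches eval tokens before scoring them.
    """
    adapt_slice = train_tokens[-holdout:]   # training data, not val tokens
    adapt_fn(model, adapt_slice)            # adapter update before quantization
    return quantize_fn(model)               # quantize the adapted weights
```

Under this sketch, the flagged submissions differ only in what they pass as the adaptation slice — the eval set instead of a training-data holdout — which is the entire substance of the compliance flag.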
No description provided.