[Non-record] Prime MLP TTT: Naive vs Meta-Learned (E2E) #1222
abaybektursun wants to merge 8 commits into openai:main
Conversation
Explores end-to-end meta-learned TTT (arxiv 2512.23675) for Parameter Golf. Adds prime MLPs to last 3 blocks, meta-learns init via FOMAML, adapts at eval with score-first SGD. Validates machinery on 1x L40S; TTT adaptation recovers 0.076 BPB but doesn't beat baseline on limited data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Naive TTT with zero-init prime MLPs beats baseline by -0.022 BPB (full eval). FOMAML meta-learning hurts — the architecture alone is the key insight. LR sweep shows monotonic improvement up to 0.1.
layer=all (11 prime MLPs, rank=256) gives 1.4288 BPB vs 1.5019 baseline. LR=1.0 still improving. Momentum=0.9 helps. Rank matters less than layer count.
40-shard training (10K steps, MLP 3.5x) baseline 1.3696. Naive TTT, all 11 layers: 1.2906 (-0.079). The effect scales with model quality.
Update: Re-running the E2E FOMAML study with the full 40-shard model and data (1500 FOMAML steps). The earlier negative E2E result was on a 1-shard model with insufficient meta-learning diversity. This run will settle whether FOMAML helps or hurts with proper data. Results in ~1 hour.
Phase 2 joint training (base at 0.0003 LR + prime at 0.003) on the 40-shard checkpoint massively improves the model even without TTT. TTT on top of FOMAML slightly hurts (the meta-learned init is already optimal). The key finding: FOMAML makes training better, not just eval-time adaptation. Results on the strong 40-shard model:
- Baseline: 1.5185 BPB
- Post-FOMAML (no TTT): 1.2588 (-0.260)
- E2E TTT: 1.2656 (-0.253)
- Naive TTT: 1.2776 (-0.241)
Full E2E results on the strong 40-shard model
TTT LR sweep on FOMAML model (5K chunks)
TTT adds only -0.001 on top of FOMAML — the meta-learning already captured the adaptation value during training.
Complete comparison
FOMAML gives an extra -0.018 BPB over naive TTT but costs 44% of the training budget. Naive TTT is the better deal — zero training cost, 81% of the FOMAML benefit.
Adds E2E FOMAML findings: -0.097 total but 44% of the training budget. Head-to-head comparison shows naive TTT is the practical winner.
First TTT results on the real 8xH100 sp4608 model (1.098 BPB sliding window). Chunk-based baseline: 1.454. Best TTT: lr=0.03 at 1.248 (-0.206). The strong model prefers a lower LR (0.03) vs the weak models' preference for high LRs (0.1-1.0). Full eval in progress. Sliding window TTT next.
8xH100 sp4608 model results (in progress)
Running naive TTT on the actual 8xH100 sp4608 model (val_bpb 1.098 sliding window). Using chunk-based eval (not sliding window yet — implementing next).
LR sweep (5K chunks, all 11 layers, rank=256)
Key finding: The strong model (1.098 BPB) prefers a much lower LR (0.03) than the weak models (which preferred 0.1-1.0). Higher LRs diverge catastrophically. This makes sense — the strong model's representations are more precise, so large perturbations are destructive. Full eval at lr=0.03 running now. Sliding window TTT implementation next.
Fix RoPE (train_seq_len scaling was missing). With a proper sliding window (stride=512), baseline is 1.0849 BPB. TTT at all LRs (0.003-0.1) is neutral-to-negative. The strong model with full context leaves no headroom for prime MLP adaptation. Earlier positive chunk-based results (-0.25 BPB) were an artifact of short-context eval (1024 tokens) — TTT was compensating for missing context, not adding genuine adaptation value.
Definitive result: TTT is neutral on the 8xH100 model with proper eval
Sliding window eval (stride=512, seq_len=2048)
TTT provides zero benefit with proper sliding window evaluation; every LR makes things slightly worse.

Why the earlier chunk-based results were misleading
The chunk-based eval (1024-token sequences, no overlap) showed a massive TTT improvement (-0.25 BPB at lr=0.03). But that eval gives each token only ~512 tokens of average context, so TTT was compensating for missing context, not providing genuine adaptation. With a proper sliding window (each scored token has ~1536-2048 tokens of context), the model already has all the information it needs, and prime MLP adaptation only adds noise.

Conclusion
Naive TTT with prime MLPs is an artifact of short-context evaluation, not a real improvement. The technique does not help on production-quality models evaluated with a proper sliding window, and TTT-E2E (FOMAML) is similarly not useful in this setting. (Note: earlier results on weak L40S models showed real improvement because those models had much higher baseline BPB, i.e. more headroom — but that headroom was also partly due to insufficient training, not a fundamental property.)
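The context argument above can be checked with back-of-envelope arithmetic. This is a minimal sketch (it ignores the first window of a document, which necessarily has less context): in chunked eval every position in a 1024-token sequence is scored, while in sliding window eval only the last `stride` tokens of each 2048-token window are scored, so each scored token carries far more left-context.

```python
import numpy as np

def avg_context(seq_len, stride):
    """Average left-context (preceding tokens) of a scored token.
    stride == seq_len is plain chunked eval (every position scored);
    stride < seq_len scores only the last `stride` positions per window."""
    scored_positions = np.arange(seq_len - stride, seq_len)
    return scored_positions.mean()

# Chunked eval (1024 tokens, no overlap): ~512 tokens of context on average.
assert avg_context(1024, 1024) == 511.5
# Sliding window (seq_len=2048, stride=512): scored tokens sit at
# positions 1536..2047, i.e. ~1536-2048 tokens of context.
assert avg_context(2048, 512) == 1791.5
```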
Prime MLP Test-Time Training: Naive vs E2E (FOMAML)
Two studies on test-time training with prime MLP adapters. Naive TTT gives
-0.079 BPB for free (eval-only). E2E FOMAML gives -0.097 total but costs
44% of the training budget.
Motivation
All 25 prior naive TTT attempts failed because they perturbed GPTQ'd int5/int6
weights. Prime MLPs are separate bf16 parameters — they don't touch GPTQ'd weights.
Architecture
Rank-256 prime MLPs are attached to all 11 blocks, running before the main MLP.
The down projection is zero-initialized, so the model starts unchanged, and score-first eval (score each chunk before adapting on it) is legal.
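The zero-init property can be sketched with a toy residual adapter. This is a minimal numpy illustration under assumed details (a two-matrix up/down projection with ReLU, added residually); the PR's exact nonlinearity and placement relative to the main MLP may differ:

```python
import numpy as np

def prime_mlp(x, w_up, w_down):
    """Residual prime-MLP branch: x + relu(x @ w_up) @ w_down."""
    return x + np.maximum(x @ w_up, 0.0) @ w_down

d_model, rank = 8, 4                      # toy sizes; the PR uses rank=256
rng = np.random.default_rng(0)
w_up = rng.normal(0.0, 0.02, (d_model, rank))
w_down = np.zeros((rank, d_model))        # zero-init down projection

x = rng.normal(size=(3, d_model))
# With w_down = 0 the branch outputs exactly 0: the model starts unchanged.
assert np.allclose(prime_mlp(x, w_up, w_down), x)
```

Because only `w_down` must be zero, `w_up` can carry a normal random init and the adapter still leaves the base model's function untouched until the first TTT step.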
Results
Full-data training (40 shards, MLP 3.5x, 10K steps, 1x L40S)
Sweep summary (5K chunks, full-data model)
Key findings (Study 1)
1-shard control (earlier experiment)
Artifact size for PR 1105
rank=64 with all layers fits, and rank matters far less than layer count.
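The score-first loop from Study 1 (score each chunk with the current prime weights, then take an SGD-with-momentum step on them) can be sketched as follows. This is a toy stand-in: the quadratic loss replaces the real LM/BPB objective, and the hyperparameters are illustrative, not the PR's:

```python
import numpy as np

def chunk_loss_and_grad(w, chunk):
    """Toy quadratic loss standing in for the LM loss on one chunk,
    with its gradient w.r.t. the prime weights w."""
    diff = w - chunk.mean(axis=0)
    return 0.5 * diff @ diff, diff

def score_first_ttt(chunks, w0, lr=0.1, momentum=0.9):
    """Score each chunk BEFORE adapting on it (legal eval: no peeking),
    then take one SGD(+momentum) step on the prime weights."""
    w, v, scores = w0.copy(), np.zeros_like(w0), []
    for chunk in chunks:
        loss, grad = chunk_loss_and_grad(w, chunk)
        scores.append(loss)              # score first...
        v = momentum * v - lr * grad     # ...then adapt
        w = w + v
    return scores, w

rng = np.random.default_rng(0)
chunks = [rng.normal(1.0, 0.1, size=(16, 4)) for _ in range(20)]
scores, _ = score_first_ttt(chunks, w0=np.zeros(4))
assert scores[-1] < scores[0]  # adaptation lowers later chunk scores
```

Because each chunk is scored before the weights see it, the cumulative BPB is an honest held-out measurement even though the weights drift over the stream.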
Study 2: E2E TTT (FOMAML Meta-Learning)
Method
Phase 2 FOMAML joint training on the strong 40-shard checkpoint. Base model at
0.0003 LR, prime MLPs at 0.003 LR. Inner loop: K=1 SGD step on prime weights.
Outer loop: both base and prime get gradients. 3000 steps.
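The inner/outer structure above can be sketched with a first-order MAML step on a toy loss. This is a hedged sketch: the quadratic loss stands in for the LM loss, only the prime-init update is shown (the PR also sends outer gradients to the base model), and the support/query batching is an assumption:

```python
import numpy as np

def loss_grad(w, batch):
    """Toy loss 0.5*||w - mean(batch)||^2 and its gradient,
    standing in for the LM loss w.r.t. the prime weights."""
    g = w - batch.mean(axis=0)
    return 0.5 * g @ g, g

def fomaml_step(w_prime, support, query, inner_lr=0.003, outer_lr=0.003, k=1):
    """One FOMAML outer step: K inner SGD steps adapt a copy of the prime
    weights; first-order means the outer gradient is simply the gradient
    evaluated at the adapted weights (second-order terms dropped)."""
    w = w_prime.copy()
    for _ in range(k):                    # inner loop: adapt on support
        _, g = loss_grad(w, support)
        w = w - inner_lr * g
    _, g_outer = loss_grad(w, query)      # gradient at adapted weights
    return w_prime - outer_lr * g_outer   # update the INIT, not w

rng = np.random.default_rng(0)
w0 = np.zeros(4)
for _ in range(2000):
    support = rng.normal(1.0, 0.1, (8, 4))
    query = rng.normal(1.0, 0.1, (8, 4))
    w0 = fomaml_step(w0, support, query)
assert np.allclose(w0, 1.0, atol=0.1)  # init converges to an adaptable point
```

The "TTT on top of FOMAML slightly hurts" result is consistent with this picture: once the init already sits where one inner step lands, further eval-time SGD can only overshoot.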
Results
Joint FOMAML massively improves the model even without TTT (-0.260 from FOMAML
baseline). But TTT on top of FOMAML slightly hurts — the meta-learned init
is already tuned and SGD overshoots.
TTT LR sweep on FOMAML model (5K chunks)
TTT adds only -0.001 on top of FOMAML. The meta-learning already captured the
adaptation value during training.
Key findings (Study 2)
Head-to-head: Naive TTT vs E2E FOMAML
Naive TTT is the practical winner — zero training cost, 81% of FOMAML's benefit.
FOMAML is worth it only if the 44% training budget can be absorbed.
Next steps
Files
- train_ttt_e2e.py — Model with prime MLPs + FOMAML + TTT eval
- train_e2e_proper.py — Proper E2E training (Phase 1 + Phase 2 joint)
- sweep_naive_ttt.py — Naive TTT LR/chunk/reset sweep
- sweep_v2.py — LR/rank/layer/momentum sweep