[Non-record] Prime MLP TTT: Naive vs Meta-Learned (E2E) #1222

Open
abaybektursun wants to merge 8 commits into openai:main from abaybektursun:non-record/ttt-e2e-meta-learning

Conversation

abaybektursun (Contributor) commented Apr 1, 2026

Prime MLP Test-Time Training: Naive vs E2E (FOMAML)

Two studies on test-time training with prime MLP adapters. Naive TTT gives
-0.079 BPB for free (eval-only). E2E FOMAML gives -0.097 total but costs
44% of the training budget.

Motivation

All 25 prior naive TTT attempts failed because they perturbed GPTQ'd int5/int6
weights. Prime MLPs are separate bf16 parameters — they don't touch GPTQ'd weights.

Architecture

Rank-256 prime MLPs on all 11 blocks, running before the main MLP:

```
h = h + attn(norm(h))
h = h + prime_MLP(prime_norm(h))   # bf16, adapted via SGD at eval time
h = h + MLP(mlp_norm(h))           # GPTQ'd int5/int6, frozen
```

The down projection is zero-initialized, so the model starts unchanged. Score-first eval (each chunk is scored before the SGD step that trains on it) is legal.
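For concreteness, here is a minimal numpy sketch of one prime-MLP residual branch with the zero-initialized down projection. The d_model of 512, RMSNorm, and ReLU are my assumptions (the PR only states the rank), not the actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 512, 256  # d_model, RMSNorm, and ReLU are assumptions; the PR states only the rank

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

# Prime MLP adapter: the down projection starts at zero, so the residual
# branch contributes nothing until TTT (or FOMAML) moves its weights.
W_up = rng.normal(0.0, 0.02, size=(d_model, rank))
W_down = np.zeros((rank, d_model))

def prime_mlp(h):
    return np.maximum(h @ W_up, 0.0) @ W_down

h = rng.normal(size=(4, d_model))
out = h + prime_mlp(rms_norm(h))
assert np.allclose(out, h)  # zero-init: the block is an identity at the start
```

Because the branch is exactly the identity at initialization, adding it cannot hurt the frozen GPTQ'd model before adaptation begins.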

Results

Full-data training (40 shards, MLP 3.5x, 10K steps, 1x L40S)

| Config | val_bpb | Δ |
| --- | --- | --- |
| Baseline (EMA, no TTT) | 1.3696 | |
| TTT lr=0.1, all 11 layers | 1.2906 | -0.079 |

Sweep summary (5K chunks, full-data model)

| Experiment | val_bpb | Δ |
| --- | --- | --- |
| Baseline | 1.3696 | |
| lr=0.03 | 1.3670 | -0.003 |
| lr=0.1 | 1.3636 | -0.006 |
| lr=0.3 | 1.3601 | -0.010 |
| lr=1.0 | 1.3550 | -0.015 |
| rank=64 (3 layers) | 1.3661 | -0.004 |
| rank=512 (3 layers) | 1.3638 | -0.006 |
| layer=[10] only | 1.3669 | -0.003 |
| layer=[6..10] | 1.3609 | -0.009 |
| layer=all (11) | 1.3242 | -0.045 |
| momentum=0.9 | 1.3574 | -0.012 |

Key findings (Study 1)

  1. Layer count >> rank. All 11 layers (-0.045) crushes rank=512 on 3 layers (-0.006)
  2. Higher LR is better up to 1.0 (still improving, ceiling not found)
  3. Full eval compounds — 60K chunks gives ~1.8x the 5K-chunk delta
  4. Effect scales with model quality — -0.079 on strong model vs -0.073 on weak
  5. Momentum=0.9 helps (+2x at same LR on 3 layers)
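The score-first loop behind findings 2-3 can be sketched with a toy scalar model. Here the "adapter" is a single bias, purely illustrative of the mechanics, not the prime-MLP code:

```python
import numpy as np

# Toy stand-in for score-first TTT: a single scalar "adapter" b plays the
# role of the prime-MLP weights. Each chunk is scored BEFORE the SGD step
# that trains on it, so no chunk's own update leaks into its score.
rng = np.random.default_rng(1)
chunks = [rng.normal(loc=3.0, size=32) for _ in range(200)]

def ttt_eval(lr):
    b, losses = 0.0, []
    for chunk in chunks:
        losses.append(np.mean((chunk - b) ** 2))  # score first...
        b -= lr * -2.0 * np.mean(chunk - b)       # ...then adapt on that chunk
    return float(np.mean(losses))

baseline = ttt_eval(0.0)                          # no adaptation
assert ttt_eval(0.1) < ttt_eval(0.01) < baseline  # gains compound across chunks
assert ttt_eval(1.5) > baseline                   # until the LR is too high and SGD diverges
```

The compounding in finding 3 falls out naturally: early chunks pay full price while later chunks benefit from all prior updates, so longer evals see a larger average delta.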

1-shard control (earlier experiment)

| Config | Baseline | TTT | Δ |
| --- | --- | --- | --- |
| 1 shard, 7200 steps | 1.5019 | 1.4288 | -0.073 |
| 40 shards, 10K steps | 1.3696 | 1.2906 | -0.079 |

Artifact size for PR 1105

| Config | Prime MLP size | Fits 16 MB? |
| --- | --- | --- |
| rank=256, all 11 layers | 5.75 MB | No |
| rank=64, all 11 layers | 1.41 MB | Yes |
| rank=32, all 11 layers | 0.70 MB | Yes |

rank=64 with all 11 layers fits the budget, and rank barely matters compared to layer count.
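A back-of-envelope check of these sizes, assuming d_model = 512 and two bias-free bf16 matrices (up + down) per prime MLP; both assumptions are mine, not stated in the PR:

```python
# Back-of-envelope adapter sizes. Assumes d_model = 512 and two bias-free
# bf16 matrices (up + down) per prime MLP; both are assumptions, not
# values stated in the PR.
def prime_mlp_mb(rank, layers=11, d_model=512, bytes_per_param=2):
    params = layers * 2 * rank * d_model
    return params * bytes_per_param / 1e6

# rank=256 -> ~5.77 MB (reported: 5.75), rank=64 -> ~1.44 (1.41),
# rank=32 -> ~0.72 (0.70); the small gaps are presumably rounding or
# accounting details.
assert abs(prime_mlp_mb(256) - 5.75) < 0.3
assert abs(prime_mlp_mb(64) - 1.41) < 0.1
assert abs(prime_mlp_mb(32) - 0.70) < 0.05
```

Note that size is linear in both rank and layer count, which is why trading rank for layers is nearly free in bytes while being a big win in BPB.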


Study 2: E2E TTT (FOMAML Meta-Learning)

Method

Phase 2 FOMAML joint training on the strong 40-shard checkpoint. Base model at
0.0003 LR, prime MLPs at 0.003 LR. Inner loop: K=1 SGD step on prime weights.
Outer loop: both base and prime get gradients. 3000 steps.
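A minimal sketch of the FOMAML mechanics described above, on a toy regression. The two learning rates mirror the 10x base/prime ratio (scaled up so the toy converges quickly); every name and number here is illustrative, not the PR's code:

```python
import numpy as np

# Toy FOMAML: "base" parameter theta (low LR) and "prime" parameter phi
# (10x LR), predicting theta + phi for data drawn from per-task means.
# Inner loop: K=1 SGD step on phi only. Outer loop: the query gradient,
# taken at the ADAPTED parameters (first order, no second derivatives),
# updates both parameter groups.
rng = np.random.default_rng(2)
lr_base, lr_prime, lr_inner = 0.003, 0.03, 0.1
theta, phi = 0.0, 0.0

def grad(pred, y):  # d/d(pred) of mean squared error
    return 2.0 * np.mean(pred - y)

for _ in range(2000):
    mu = rng.normal(5.0, 1.0)                # a fresh "task"
    support = rng.normal(mu, 0.5, size=16)   # inner-loop (adaptation) data
    query = rng.normal(mu, 0.5, size=16)     # outer-loop (meta) data
    phi_adapted = phi - lr_inner * grad(theta + phi, support)  # inner: prime only
    g = grad(theta + phi_adapted, query)     # outer grad at adapted params
    theta -= lr_base * g                     # first-order: same g for both groups
    phi -= lr_prime * g

assert abs(theta + phi - 5.0) < 0.8  # meta-learned init sits near the task mean
assert abs(phi) > abs(theta)         # the 10x-LR prime group absorbs most of the shift
```

The first-order shortcut is visible in the last three lines of the loop: the same query gradient `g` is applied to both groups, with no backprop through the inner step.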

Results

| Stage | val_bpb | Δ vs orig baseline |
| --- | --- | --- |
| Original baseline (no prime MLPs) | 1.3696 | |
| FOMAML baseline (prime at zero) | 1.5185 | |
| Post-FOMAML (no TTT) | 1.2588 | -0.111 |
| E2E TTT (meta-learned init, lr=0.1) | 1.2656 | -0.104 |
| Naive TTT (zero-init on FOMAML model) | 1.2776 | -0.092 |

Joint FOMAML massively improves the model even without TTT (-0.260 from FOMAML
baseline). But TTT on top of FOMAML slightly hurts — the meta-learned init
is already tuned and SGD overshoots.

TTT LR sweep on FOMAML model (5K chunks)

| Config | val_bpb | Δ vs FOMAML no-TTT |
| --- | --- | --- |
| FOMAML, no TTT | 1.2732 | |
| + TTT lr=0.001 | 1.2731 | -0.000 |
| + TTT lr=0.01 | 1.2726 | -0.001 |
| + TTT lr=0.1 | 1.2720 | -0.001 |

TTT adds only -0.001 on top of FOMAML. The meta-learning already captured the
adaptation value during training.

Key findings (Study 2)

  1. Joint FOMAML makes training better — -0.260 BPB from the FOMAML baseline, even standalone
  2. TTT is nearly redundant after FOMAML — only -0.001 additional benefit
  3. The base model co-adapts — this isn't just adapter training, the whole model improves

Head-to-head: Naive TTT vs E2E FOMAML

| Approach | Baseline | Best | Total Δ | Training cost |
| --- | --- | --- | --- | --- |
| Naive TTT (eval-only) | 1.3696 | 1.2906 | -0.079 | 0 |
| FOMAML + TTT | 1.3696 | 1.2720 | -0.097 | 3000 steps (~44% budget) |

Naive TTT is the practical winner — zero training cost, 81% of FOMAML's benefit.
FOMAML is worth it only if the 44% training budget can be absorbed.
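A quick sanity check of the quoted deltas and the "81% of the benefit" figure:

```python
# Sanity check on the head-to-head numbers quoted above.
naive = 1.3696 - 1.2906    # naive TTT delta, eval-only
fomaml = 1.3696 - 1.2720   # FOMAML + TTT total delta
assert abs(naive - 0.079) < 1e-6
assert abs(fomaml - 0.097) < 1e-3          # 0.0976, reported as -0.097
assert round(100 * naive / fomaml) == 81   # naive TTT keeps 81% of FOMAML's benefit
```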


Next steps

  • 8xH100 validation on actual PR 1105 model (1.1125 BPB)
  • Combine lr=1.0 + all layers + momentum=0.9 (untested combination)
  • rank=64 all-layers full eval (fits 16 MB budget)

Files

  • train_ttt_e2e.py — Model with prime MLPs + FOMAML + TTT eval
  • train_e2e_proper.py — Proper E2E training (Phase 1 + Phase 2 joint)
  • sweep_naive_ttt.py — Naive TTT LR/chunk/reset sweep
  • sweep_v2.py — LR/rank/layer/momentum sweep

🤖 Generated with Claude Code

abaybektursun and others added 4 commits April 1, 2026 10:10
Explores end-to-end meta-learned TTT (arxiv 2512.23675) for Parameter Golf.
Adds prime MLPs to last 3 blocks, meta-learns init via FOMAML, adapts at
eval with score-first SGD. Validates machinery on 1x L40S; TTT adaptation
recovers 0.076 BPB but doesn't beat baseline on limited data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Naive TTT with zero-init prime MLPs beats baseline by -0.022 BPB (full eval).
FOMAML meta-learning hurts — the architecture alone is the key insight.
LR sweep shows monotonic improvement up to 0.1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
layer=all (11 prime MLPs, rank=256) gives 1.4288 BPB vs 1.5019 baseline.
LR=1.0 still improving. Momentum=0.9 helps. Rank matters less than layer count.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
40-shard training (10K steps, MLP 3.5x) baseline 1.3696.
Naive TTT all-11-layers: 1.2906 (-0.079). Effect scales with model quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title [Non-record] TTT-E2E: Meta-learned test-time training via FOMAML [Non-record] Prime MLP TTT: Naive vs Meta-Learned (E2E) Apr 2, 2026
abaybektursun commented:

Update: Re-running E2E FOMAML study with the full 40-shard model and data (1500 FOMAML steps). The earlier negative E2E result was on a 1-shard model with insufficient meta-learning diversity. This run will settle whether FOMAML helps or hurts with proper data. Results in ~1 hour.

Phase 2 joint training (base at 0.0003 LR + prime at 0.003) on 40-shard
checkpoint massively improves the model even without TTT. TTT on top of
FOMAML slightly hurts (meta-learned init is already optimal). The key
finding: FOMAML makes training better, not just eval-time adaptation.

Results on strong 40-shard model:
- Baseline: 1.5185 BPB
- Post-FOMAML (no TTT): 1.2588 (-0.260)
- E2E TTT: 1.2656 (-0.253)
- Naive TTT: 1.2776 (-0.241)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun commented:

Full E2E results on strong 40-shard model

TTT LR sweep on FOMAML model (5K chunks)

| Config | val_bpb | Δ |
| --- | --- | --- |
| FOMAML model, no TTT | 1.2732 | |
| + TTT lr=0.01 | 1.2726 | -0.001 |
| + TTT lr=0.1 | 1.2720 | -0.001 |

TTT adds only -0.001 on top of FOMAML — the meta-learning already captured the adaptation value during training.

Complete comparison

| Approach | Baseline | Best | Total Δ | Training cost |
| --- | --- | --- | --- | --- |
| Naive TTT (eval-only) | 1.3696 | 1.2906 | -0.079 | 0 |
| FOMAML + TTT | 1.3696 | 1.2720 | -0.097 | 3000 steps (~44% of budget) |

FOMAML gives an extra -0.018 BPB over naive TTT but costs 44% of the training budget. Naive TTT is the better deal — zero training cost, 81% of the FOMAML benefit.

abaybektursun and others added 2 commits April 2, 2026 08:25
Adds E2E FOMAML findings: -0.097 total but 44% training budget.
Head-to-head comparison shows naive TTT is the practical winner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
First TTT results on the real 8xH100 sp4608 model (1.098 BPB sliding window).
Chunk-based baseline: 1.454. Best TTT: lr=0.03 at 1.248 (-0.206).
Strong model prefers lower LR (0.03) vs weak model's preference for high LR (0.1-1.0).
Full eval in progress. Sliding window TTT next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun commented:

8xH100 sp4608 model results (in progress)

Running naive TTT on the actual 8xH100 sp4608 model (val_bpb 1.098 sliding window). Using chunk-based eval (not sliding window yet — implementing next).

LR sweep (5K chunks, all 11 layers, rank=256)

| LR | val_bpb | Δ |
| --- | --- | --- |
| 0.0 (baseline) | 1.454 | |
| 0.03 | 1.248 | -0.206 |
| 0.1 | 1.390 | -0.064 |
| 0.3 | 2.897 | +1.443 (diverged) |
| 1.0 | 3.060 | +1.606 (diverged) |

Key finding: Strong model (1.098 BPB) prefers much lower LR (0.03) than the weak models (which preferred 0.1-1.0). Higher LRs diverge catastrophically. This makes sense — the strong model's representations are more precise, so large perturbations are destructive.

Full eval at lr=0.03 running now. Sliding window TTT implementation next.

Fix RoPE (train_seq_len scaling was missing). With proper sliding window
(stride=512), baseline is 1.0849 BPB. TTT at all LRs (0.003-0.1) is
neutral-to-negative. The strong model with full context leaves no headroom
for prime MLP adaptation.

Earlier positive chunk-based results (-0.25 BPB) were an artifact of
short-context eval (1024 tokens) — TTT was compensating for missing context,
not adding genuine adaptation value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun commented:

Definitive result: TTT is neutral on the 8xH100 model with proper eval

Sliding window eval (stride=512, seq_len=2048)

| LR | val_bpb | Δ |
| --- | --- | --- |
| 0.0 (baseline) | 1.0849 | |
| 0.003 | 1.0850 | +0.0001 |
| 0.01 | 1.0858 | +0.0008 |
| 0.03 | 1.0865 | +0.0015 |
| 0.1 | 1.0876 | +0.0027 |

TTT provides zero benefit with proper sliding window evaluation. Every LR makes things slightly worse.

Why the earlier chunk-based results were misleading

The chunk-based eval (1024-token sequences, no overlap) showed massive TTT improvement (-0.25 BPB at lr=0.03). But that eval gives each token only ~512 average context. TTT was compensating for missing context, not providing genuine adaptation.

With proper sliding window (each scored token has ~1536-2048 context), the model already has all the information it needs. Prime MLP adaptation only adds noise.
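The context-budget arithmetic behind this explanation, assuming the sliding window scores only the last `stride` tokens of each window (consistent with the ~1536-2048 range quoted above):

```python
# Context-budget arithmetic for the two evaluation schemes.

# Chunked eval: independent 1024-token chunks, every token scored.
chunk_len = 1024
avg_chunk_ctx = sum(range(chunk_len)) / chunk_len  # mean of 0..1023 prior tokens
assert avg_chunk_ctx == 511.5                      # ~512 tokens, as stated above

# Sliding window (seq_len=2048, stride=512): assuming only the last
# `stride` tokens of each window are scored, every scored token sees
# between seq_len - stride and seq_len - 1 prior tokens.
seq_len, stride = 2048, 512
min_ctx, max_ctx = seq_len - stride, seq_len - 1
assert (min_ctx, max_ctx) == (1536, 2047)          # the ~1536-2048 range above
```

A ~3-4x difference in average context is easily large enough to dominate a 0.2-BPB effect, which is why the chunked numbers looked so favorable for TTT.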

Conclusion

Naive TTT with prime MLPs is an artifact of short-context evaluation, not a real improvement. The technique does not help on production-quality models evaluated with proper sliding window. TTT-E2E (FOMAML) is similarly not useful for this setting.

(Note: earlier results on weak L40S models showed real improvement because those models had much higher baseline BPB — more headroom. But this headroom was also partially due to insufficient training, not a fundamental property.)

