[Non-record] Prime MLP TTT: Naive vs Meta-Learned (E2E) #1222

Open
abaybektursun wants to merge 8 commits into openai:main from abaybektursun:non-record/ttt-e2e-meta-learning

Conversation

abaybektursun (Contributor) commented Apr 1, 2026

Prime MLP Test-Time Training: Naive vs E2E (FOMAML)

Two studies on test-time training with prime MLP adapters. Naive TTT gives
-0.079 BPB for free (eval-only). E2E FOMAML gives -0.097 total but costs
44% of the training budget.

Motivation

All 25 prior naive TTT attempts failed because they perturbed GPTQ'd int5/int6
weights. Prime MLPs are separate bf16 parameters — they don't touch GPTQ'd weights.

Architecture

Rank-256 prime MLPs on all 11 blocks, running before the main MLP:

```
h = h + attn(norm(h))
h = h + prime_MLP(prime_norm(h))   # bf16, adapted via SGD at eval time
h = h + MLP(mlp_norm(h))           # GPTQ'd int5/int6, frozen
```

The down projection is zero-initialized, so the model starts unchanged. Score-first eval (each chunk is scored before the SGD step that trains on it) is legal.
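For concreteness, here is a minimal numpy sketch of one prime-MLP residual branch with the zero-initialized down projection. The d_model of 512, RMSNorm, and ReLU are my assumptions (the PR only states the rank), not the actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 512, 256  # d_model, RMSNorm, and ReLU are assumptions; the PR states only the rank

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

# Prime MLP adapter: the down projection starts at zero, so the residual
# branch contributes nothing until TTT (or FOMAML) moves its weights.
W_up = rng.normal(0.0, 0.02, size=(d_model, rank))
W_down = np.zeros((rank, d_model))

def prime_mlp(h):
    return np.maximum(h @ W_up, 0.0) @ W_down

h = rng.normal(size=(4, d_model))
out = h + prime_mlp(rms_norm(h))
assert np.allclose(out, h)  # zero-init: the block is an identity at the start
```

Because the branch is exactly the identity at initialization, adding it cannot hurt the frozen GPTQ'd model before adaptation begins.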

Results

Full-data training (40 shards, MLP 3.5x, 10K steps, 1x L40S)

| Config | val_bpb | Δ |
| --- | --- | --- |
| Baseline (EMA, no TTT) | 1.3696 | |
| TTT lr=0.1, all 11 layers | 1.2906 | -0.079 |

Sweep summary (5K chunks, full-data model)

| Experiment | val_bpb | Δ |
| --- | --- | --- |
| Baseline | 1.3696 | |
| lr=0.03 | 1.3670 | -0.003 |
| lr=0.1 | 1.3636 | -0.006 |
| lr=0.3 | 1.3601 | -0.010 |
| lr=1.0 | 1.3550 | -0.015 |
| rank=64 (3 layers) | 1.3661 | -0.004 |
| rank=512 (3 layers) | 1.3638 | -0.006 |
| layer=[10] only | 1.3669 | -0.003 |
| layer=[6..10] | 1.3609 | -0.009 |
| layer=all (11) | 1.3242 | -0.045 |
| momentum=0.9 | 1.3574 | -0.012 |

Key findings (Study 1)

  1. Layer count >> rank. All 11 layers (-0.045) crushes rank=512 on 3 layers (-0.006)
  2. Higher LR is better up to 1.0 (still improving, ceiling not found)
  3. Full eval compounds — 60K chunks gives ~1.8x the 5K-chunk delta
  4. Effect scales with model quality — -0.079 on strong model vs -0.073 on weak
  5. Momentum=0.9 helps (+2x at same LR on 3 layers)
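The score-first loop behind findings 2-3 can be sketched with a toy scalar model. Here the "adapter" is a single bias, purely illustrative of the mechanics, not the prime-MLP code:

```python
import numpy as np

# Toy stand-in for score-first TTT: a single scalar "adapter" b plays the
# role of the prime-MLP weights. Each chunk is scored BEFORE the SGD step
# that trains on it, so no chunk's own update leaks into its score.
rng = np.random.default_rng(1)
chunks = [rng.normal(loc=3.0, size=32) for _ in range(200)]

def ttt_eval(lr):
    b, losses = 0.0, []
    for chunk in chunks:
        losses.append(np.mean((chunk - b) ** 2))  # score first...
        b -= lr * -2.0 * np.mean(chunk - b)       # ...then adapt on that chunk
    return float(np.mean(losses))

baseline = ttt_eval(0.0)                          # no adaptation
assert ttt_eval(0.1) < ttt_eval(0.01) < baseline  # gains compound across chunks
assert ttt_eval(1.5) > baseline                   # until the LR is too high and SGD diverges
```

The compounding in finding 3 falls out naturally: early chunks pay full price while later chunks benefit from all prior updates, so longer evals see a larger average delta.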

1-shard control (earlier experiment)

| Config | Baseline | TTT | Δ |
| --- | --- | --- | --- |
| 1 shard, 7200 steps | 1.5019 | 1.4288 | -0.073 |
| 40 shards, 10K steps | 1.3696 | 1.2906 | -0.079 |

Artifact size for PR 1105

| Config | Prime MLP size | Fits 16 MB? |
| --- | --- | --- |
| rank=256, all 11 layers | 5.75 MB | No |
| rank=64, all 11 layers | 1.41 MB | Yes |
| rank=32, all 11 layers | 0.70 MB | Yes |

rank=64 with all 11 layers fits the budget, and rank barely matters compared to layer count.
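A back-of-envelope check of these sizes, assuming d_model = 512 and two bias-free bf16 matrices (up + down) per prime MLP; both assumptions are mine, not stated in the PR:

```python
# Back-of-envelope adapter sizes. Assumes d_model = 512 and two bias-free
# bf16 matrices (up + down) per prime MLP; both are assumptions, not
# values stated in the PR.
def prime_mlp_mb(rank, layers=11, d_model=512, bytes_per_param=2):
    params = layers * 2 * rank * d_model
    return params * bytes_per_param / 1e6

# rank=256 -> ~5.77 MB (reported: 5.75), rank=64 -> ~1.44 (1.41),
# rank=32 -> ~0.72 (0.70); the small gaps are presumably rounding or
# accounting details.
assert abs(prime_mlp_mb(256) - 5.75) < 0.3
assert abs(prime_mlp_mb(64) - 1.41) < 0.1
assert abs(prime_mlp_mb(32) - 0.70) < 0.05
```

Note that size is linear in both rank and layer count, which is why trading rank for layers is nearly free in bytes while being a big win in BPB.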


Study 2: E2E TTT (FOMAML Meta-Learning)

Method

Phase 2 FOMAML joint training on the strong 40-shard checkpoint. Base model at
0.0003 LR, prime MLPs at 0.003 LR. Inner loop: K=1 SGD step on prime weights.
Outer loop: both base and prime get gradients. 3000 steps.
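A minimal sketch of the FOMAML mechanics described above, on a toy regression. The two learning rates mirror the 10x base/prime ratio (scaled up so the toy converges quickly); every name and number here is illustrative, not the PR's code:

```python
import numpy as np

# Toy FOMAML: "base" parameter theta (low LR) and "prime" parameter phi
# (10x LR), predicting theta + phi for data drawn from per-task means.
# Inner loop: K=1 SGD step on phi only. Outer loop: the query gradient,
# taken at the ADAPTED parameters (first order, no second derivatives),
# updates both parameter groups.
rng = np.random.default_rng(2)
lr_base, lr_prime, lr_inner = 0.003, 0.03, 0.1
theta, phi = 0.0, 0.0

def grad(pred, y):  # d/d(pred) of mean squared error
    return 2.0 * np.mean(pred - y)

for _ in range(2000):
    mu = rng.normal(5.0, 1.0)                # a fresh "task"
    support = rng.normal(mu, 0.5, size=16)   # inner-loop (adaptation) data
    query = rng.normal(mu, 0.5, size=16)     # outer-loop (meta) data
    phi_adapted = phi - lr_inner * grad(theta + phi, support)  # inner: prime only
    g = grad(theta + phi_adapted, query)     # outer grad at adapted params
    theta -= lr_base * g                     # first-order: same g for both groups
    phi -= lr_prime * g

assert abs(theta + phi - 5.0) < 0.8  # meta-learned init sits near the task mean
assert abs(phi) > abs(theta)         # the 10x-LR prime group absorbs most of the shift
```

The first-order shortcut is visible in the last three lines of the loop: the same query gradient `g` is applied to both groups, with no backprop through the inner step.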

Results

| Stage | val_bpb | Δ vs orig baseline |
| --- | --- | --- |
| Original baseline (no prime MLPs) | 1.3696 | |
| FOMAML baseline (prime at zero) | 1.5185 | |
| Post-FOMAML (no TTT) | 1.2588 | -0.111 |
| E2E TTT (meta-learned init, lr=0.1) | 1.2656 | -0.104 |
| Naive TTT (zero-init on FOMAML model) | 1.2776 | -0.092 |

Joint FOMAML massively improves the model even without TTT (-0.260 from FOMAML
baseline). But TTT on top of FOMAML slightly hurts — the meta-learned init
is already tuned and SGD overshoots.

TTT LR sweep on FOMAML model (5K chunks)

| Config | val_bpb | Δ vs FOMAML no-TTT |
| --- | --- | --- |
| FOMAML, no TTT | 1.2732 | |
| + TTT lr=0.001 | 1.2731 | -0.000 |
| + TTT lr=0.01 | 1.2726 | -0.001 |
| + TTT lr=0.1 | 1.2720 | -0.001 |

TTT adds only -0.001 on top of FOMAML. The meta-learning already captured the
adaptation value during training.

Key findings (Study 2)

  1. Joint FOMAML makes training better — -0.260 BPB from the FOMAML baseline, even standalone
  2. TTT is nearly redundant after FOMAML — only -0.001 additional benefit
  3. The base model co-adapts — this isn't just adapter training, the whole model improves

Head-to-head: Naive TTT vs E2E FOMAML

| Approach | Baseline | Best | Total Δ | Training cost |
| --- | --- | --- | --- | --- |
| Naive TTT (eval-only) | 1.3696 | 1.2906 | -0.079 | 0 |
| FOMAML + TTT | 1.3696 | 1.2720 | -0.097 | 3000 steps (~44% budget) |

Naive TTT is the practical winner — zero training cost, 81% of FOMAML's benefit.
FOMAML is worth it only if the 44% training budget can be absorbed.
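A quick sanity check of the quoted deltas and the "81% of the benefit" figure:

```python
# Sanity check on the head-to-head numbers quoted above.
naive = 1.3696 - 1.2906    # naive TTT delta, eval-only
fomaml = 1.3696 - 1.2720   # FOMAML + TTT total delta
assert abs(naive - 0.079) < 1e-6
assert abs(fomaml - 0.097) < 1e-3          # 0.0976, reported as -0.097
assert round(100 * naive / fomaml) == 81   # naive TTT keeps 81% of FOMAML's benefit
```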


Next steps

  • 8xH100 validation on actual PR 1105 model (1.1125 BPB)
  • Combine lr=1.0 + all layers + momentum=0.9 (untested combination)
  • rank=64 all-layers full eval (fits 16 MB budget)

Files

  • train_ttt_e2e.py — Model with prime MLPs + FOMAML + TTT eval
  • train_e2e_proper.py — Proper E2E training (Phase 1 + Phase 2 joint)
  • sweep_naive_ttt.py — Naive TTT LR/chunk/reset sweep
  • sweep_v2.py — LR/rank/layer/momentum sweep

🤖 Generated with Claude Code

abaybektursun and others added 4 commits April 1, 2026 10:10
Explores end-to-end meta-learned TTT (arxiv 2512.23675) for Parameter Golf.
Adds prime MLPs to last 3 blocks, meta-learns init via FOMAML, adapts at
eval with score-first SGD. Validates machinery on 1x L40S; TTT adaptation
recovers 0.076 BPB but doesn't beat baseline on limited data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Naive TTT with zero-init prime MLPs beats baseline by -0.022 BPB (full eval).
FOMAML meta-learning hurts — the architecture alone is the key insight.
LR sweep shows monotonic improvement up to 0.1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
layer=all (11 prime MLPs, rank=256) gives 1.4288 BPB vs 1.5019 baseline.
LR=1.0 still improving. Momentum=0.9 helps. Rank matters less than layer count.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
40-shard training (10K steps, MLP 3.5x) baseline 1.3696.
Naive TTT all-11-layers: 1.2906 (-0.079). Effect scales with model quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title [Non-record] TTT-E2E: Meta-learned test-time training via FOMAML [Non-record] Prime MLP TTT: Naive vs Meta-Learned (E2E) Apr 2, 2026
abaybektursun commented:

Update: Re-running E2E FOMAML study with the full 40-shard model and data (1500 FOMAML steps). The earlier negative E2E result was on a 1-shard model with insufficient meta-learning diversity. This run will settle whether FOMAML helps or hurts with proper data. Results in ~1 hour.

Phase 2 joint training (base at 0.0003 LR + prime at 0.003) on 40-shard
checkpoint massively improves the model even without TTT. TTT on top of
FOMAML slightly hurts (meta-learned init is already optimal). The key
finding: FOMAML makes training better, not just eval-time adaptation.

Results on strong 40-shard model:
- Baseline: 1.5185 BPB
- Post-FOMAML (no TTT): 1.2588 (-0.260)
- E2E TTT: 1.2656 (-0.253)
- Naive TTT: 1.2776 (-0.241)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun commented:

Full E2E results on strong 40-shard model

TTT LR sweep on FOMAML model (5K chunks)

| Config | val_bpb | Δ |
| --- | --- | --- |
| FOMAML model, no TTT | 1.2732 | |
| + TTT lr=0.01 | 1.2726 | -0.001 |
| + TTT lr=0.1 | 1.2720 | -0.001 |

TTT adds only -0.001 on top of FOMAML — the meta-learning already captured the adaptation value during training.

Complete comparison

| Approach | Baseline | Best | Total Δ | Training cost |
| --- | --- | --- | --- | --- |
| Naive TTT (eval-only) | 1.3696 | 1.2906 | -0.079 | 0 |
| FOMAML + TTT | 1.3696 | 1.2720 | -0.097 | 3000 steps (~44% of budget) |

FOMAML gives an extra -0.018 BPB over naive TTT but costs 44% of the training budget. Naive TTT is the better deal — zero training cost, 81% of the FOMAML benefit.

abaybektursun and others added 2 commits April 2, 2026 08:25
Adds E2E FOMAML findings: -0.097 total but 44% training budget.
Head-to-head comparison shows naive TTT is the practical winner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
First TTT results on the real 8xH100 sp4608 model (1.098 BPB sliding window).
Chunk-based baseline: 1.454. Best TTT: lr=0.03 at 1.248 (-0.206).
Strong model prefers lower LR (0.03) vs weak model's preference for high LR (0.1-1.0).
Full eval in progress. Sliding window TTT next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun commented:

8xH100 sp4608 model results (in progress)

Running naive TTT on the actual 8xH100 sp4608 model (val_bpb 1.098 sliding window). Using chunk-based eval (not sliding window yet — implementing next).

LR sweep (5K chunks, all 11 layers, rank=256)

| LR | val_bpb | Δ |
| --- | --- | --- |
| 0.0 (baseline) | 1.454 | |
| 0.03 | 1.248 | -0.206 |
| 0.1 | 1.390 | -0.064 |
| 0.3 | 2.897 | +1.443 (diverged) |
| 1.0 | 3.060 | +1.606 (diverged) |

Key finding: Strong model (1.098 BPB) prefers much lower LR (0.03) than the weak models (which preferred 0.1-1.0). Higher LRs diverge catastrophically. This makes sense — the strong model's representations are more precise, so large perturbations are destructive.

Full eval at lr=0.03 running now. Sliding window TTT implementation next.

Fix RoPE (train_seq_len scaling was missing). With proper sliding window
(stride=512), baseline is 1.0849 BPB. TTT at all LRs (0.003-0.1) is
neutral-to-negative. The strong model with full context leaves no headroom
for prime MLP adaptation.

Earlier positive chunk-based results (-0.25 BPB) were an artifact of
short-context eval (1024 tokens) — TTT was compensating for missing context,
not adding genuine adaptation value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun commented:

Definitive result: TTT is neutral on the 8xH100 model with proper eval

Sliding window eval (stride=512, seq_len=2048)

| LR | val_bpb | Δ |
| --- | --- | --- |
| 0.0 (baseline) | 1.0849 | |
| 0.003 | 1.0850 | +0.0001 |
| 0.01 | 1.0858 | +0.0008 |
| 0.03 | 1.0865 | +0.0015 |
| 0.1 | 1.0876 | +0.0027 |

TTT provides zero benefit with proper sliding window evaluation. Every LR makes things slightly worse.

Why the earlier chunk-based results were misleading

The chunk-based eval (1024-token sequences, no overlap) showed massive TTT improvement (-0.25 BPB at lr=0.03). But that eval gives each token only ~512 average context. TTT was compensating for missing context, not providing genuine adaptation.

With proper sliding window (each scored token has ~1536-2048 context), the model already has all the information it needs. Prime MLP adaptation only adds noise.
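The context-budget arithmetic behind this explanation, assuming the sliding window scores only the last `stride` tokens of each window (consistent with the ~1536-2048 range quoted above):

```python
# Context-budget arithmetic for the two evaluation schemes.

# Chunked eval: independent 1024-token chunks, every token scored.
chunk_len = 1024
avg_chunk_ctx = sum(range(chunk_len)) / chunk_len  # mean of 0..1023 prior tokens
assert avg_chunk_ctx == 511.5                      # ~512 tokens, as stated above

# Sliding window (seq_len=2048, stride=512): assuming only the last
# `stride` tokens of each window are scored, every scored token sees
# between seq_len - stride and seq_len - 1 prior tokens.
seq_len, stride = 2048, 512
min_ctx, max_ctx = seq_len - stride, seq_len - 1
assert (min_ctx, max_ctx) == (1536, 2047)          # the ~1536-2048 range above
```

A ~3-4x difference in average context is easily large enough to dominate a 0.2-BPB effect, which is why the chunked numbers looked so favorable for TTT.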

Conclusion

Naive TTT with prime MLPs is an artifact of short-context evaluation, not a real improvement. The technique does not help on production-quality models evaluated with proper sliding window. TTT-E2E (FOMAML) is similarly not useful for this setting.

(Note: earlier results on weak L40S models showed real improvement because those models had much higher baseline BPB — more headroom. But this headroom was also partially due to insufficient training, not a fundamental property.)

