Non-record: 11L EMA + TTT(20ep,freeze=0) + 15-run ablation study — val_bpb=1.1213 (3-seed)#398
Conversation
…ai#398 findings; disable XSA for throughput
… DDP 14.8s/epoch)
PR openai#398: 11L EMA + TTT(20ep, freeze=0), no XSA, no Late QAT
- Best seed 1.1213 BPB, 3-seed mean 1.1221
- 7386 steps at ~81ms/step
- Has: FA3, NTK RoPE, MTP, TTT, (B,S,H,D) layout
- Missing: memory tokens, magnitude pruning, late-K passthrough
…#398 base
Built on PR openai#398 (1.1213 BPB). Three targeted improvements:
1. Cautious Muon: mask Muon updates that disagree with the gradient direction (~1.47x convergence speedup, 2 lines, zero risk)
2. Magnitude pruning (5% default): zero the smallest weights before quantization, improving the zstd compression ratio by 5-15%
3. allow_in_graph + cache_size_limit=32: safer torch.compile with FA3 custom ops and 11-block guard specialization
Respects PR openai#398 negative results: no memory tokens, no Late QAT.
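The "cautious" masking idea above (drop optimizer-update components whose sign disagrees with the raw gradient) can be sketched in a few lines. This is a schematic, framework-free version; the function name, the rescaling choice, and the flat-list representation are illustrative, not the PR's actual Muon implementation.

```python
# Sketch of cautious update masking: zero components of the optimizer's
# proposed update whose sign disagrees with the gradient, then rescale the
# survivors so the overall update scale is roughly preserved.
# All names here are illustrative stand-ins.

def cautious_mask(update, grad):
    """Keep only update components that agree in sign with the gradient."""
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    kept = sum(mask)
    # Rescale surviving components; if nothing survives, the update is zero.
    scale = len(mask) / kept if kept > 0 else 0.0
    return [u * m * scale for u, m in zip(update, mask)]

masked = cautious_mask([0.5, -0.2, 0.1, -0.3], [1.0, 0.4, -0.2, -0.9])
# components 0 and 3 agree in sign with the gradient and are kept (rescaled);
# components 1 and 2 are zeroed
```

In a real optimizer step this mask would be applied to the Muon-orthogonalized update tensor, element-wise, just before the weights are modified.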
…agnitude pruning
Replace SGD with AdamW for test-time training. A 3-line diff from PR openai#398. Mean val_bpb 1.1027 (3-seed), best 1.0992. Improves on the prior SOTA of 1.1213 by 0.019.
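The gist of the SGD-to-AdamW swap can be shown with a single-parameter update rule. This is a toy, pure-Python rendering of the standard AdamW step (the hyperparameters are illustrative, not the PR's), meant only to show what the adaptive step adds over plain SGD for test-time training:

```python
import math

# Toy single-parameter steps: plain SGD vs AdamW with bias correction.
# Illustrative hyperparameters; the real change is one optimizer constructor.

def sgd_step(w, g, lr=0.008):
    return w - lr * g

def adamw_step(w, g, m, v, t, lr=5e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g * g        # second-moment EMA
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```

The per-parameter normalization by `sqrt(v_hat)` is what makes AdamW less sensitive to the learning rate than SGD during the short, noisy TTT phase.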
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442) flagged as potentially invalid for adapting on eval tokens BEFORE scoring them. Added correct score-then-adapt protocol with implementation guide. https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
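The score-then-adapt protocol flagged here can be sketched as a loop where each evaluation chunk is scored with the current weights before the model is allowed to train on it. `score` and `adapt` are hypothetical stand-ins for the real loss and update functions:

```python
# Sketch of a legal score-then-adapt TTT loop: no token's score ever
# benefits from the model having trained on that token.

def score_then_adapt(chunks, score, adapt, state):
    total, count = 0.0, 0
    for chunk in chunks:
        total += score(state, chunk)   # scored first, with pre-adaptation weights
        count += len(chunk)
        state = adapt(state, chunk)    # only now may the model learn from it
    return total / count, state
```

The flagged submissions invert this ordering (adapt over multiple epochs, then score), which is what the ruling calls invalid.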
Architecture discovered via GEPA (Gemini-driven evolutionary search). SwiGLU FFN, Star-ReLU, U-Net skip gates, BigramHash 8192, XSA4. AdamW TTT (lr=0.0005, 10ep) from @sjp611 (openai#442). EMA, RoPE, LN Scale, QAT from @felipe-parodi (openai#398) and @fbedev (openai#410). 3-seed results: 1.06733 / 1.06833 / 1.06580 Mean: 1.06715, Std: 0.00104 Built by @joepro with AI agents via OpenClaw. Compute provided by Modal.
…ttern) Root cause: per-sequence indexing from permuted indices was ~100x slower than contiguous val_tokens slicing. Each GPU now takes a contiguous shard and iterates sequentially, matching openai#398's working implementation.
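The fix described above amounts to giving each rank one contiguous slice of the flat validation token stream instead of gathering permuted per-sequence indices. A minimal sketch (variable names illustrative, even division assumed):

```python
# Each of `world_size` ranks takes one contiguous slice of val_tokens and
# walks it sequentially, avoiding slow scattered indexing.

def contiguous_shard(val_tokens, rank, world_size, seq_len):
    n_seq = len(val_tokens) // seq_len
    per_rank = n_seq // world_size            # assume it divides evenly here
    start = rank * per_rank * seq_len
    end = start + per_rank * seq_len
    shard = val_tokens[start:end]             # one contiguous slice per GPU
    return [shard[i:i + seq_len] for i in range(0, len(shard), seq_len)]
```

With a tensor backend the contiguous slice keeps memory access sequential, which is where the reported ~100x speedup over permuted gathers comes from.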
Major rewrite based on latest meta (PRs openai#398, openai#442, openai#462):
- SwiGLU FFN with Star-ReLU (hidden=1792)
- U-Net skip connections with learned gating
- EMA (decay=0.9985) replacing SWA
- AdamW TTT (legal score-first protocol)
- Partial RoPE (16 dims)
- LN Scale (1/sqrt(layer_idx+1))
- BigramHash(8192) + SmearGate
- GPTQ-lite quantization
- DDP compile fix for multi-GPU

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
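The LN Scale term listed above is just a depth-dependent damping factor on each block's residual branch. A one-line sketch (the surrounding block wiring is schematic and not this PR's exact code):

```python
import math

# Depth-dependent residual scaling: deeper blocks contribute less,
# following the 1/sqrt(layer_idx + 1) rule named in the PR description.

def ln_scale(layer_idx):
    return 1.0 / math.sqrt(layer_idx + 1)

# usage inside a hypothetical block: x = x + ln_scale(i) * block_i(norm(x))
```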
… 3 seeds) AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups (3x for MLP output projections, 0.5x for input projections). 34 TTT configurations tested. FINDINGS.md documents 31 experiments including negative results on codebook quantization, symmetry-transport, layer dropping, focal loss, and KL divergence TTT. Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
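The per-layer lr groups described above (3x base lr for MLP output projections, 0.5x for input projections) reduce to a name-based lookup when building optimizer parameter groups. The suffix matching below is illustrative; the real grouping depends on the model's actual module names:

```python
# Map a parameter name to its TTT learning rate: 3x for MLP output
# projections, 0.5x for MLP input projections, base lr otherwise.
# The "mlp.out_proj" / "mlp.in_proj" names are hypothetical.

def lr_for(name, base_lr):
    if name.endswith("mlp.out_proj.weight"):
        return 3.0 * base_lr
    if name.endswith("mlp.in_proj.weight"):
        return 0.5 * base_lr
    return base_lr
```

In PyTorch this would feed a list of `{"params": ..., "lr": lr_for(name, base_lr)}` groups into the AdamW constructor.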
Converting to non-record per TTT ruling (Issue #402). The README documents a 15-run ablation (memory tokens, causal TTT, PPM-C blending, grad-guided quant, aggressive warmdown — all negative results at the frontier) and the freeze_blocks=0 finding for aggressive TTT. Working on a non-TTT submission.
3-seed mean: 0.9789 BPB (sliding window stride=64) | Best seed: 0.9779 (seed 7) | Std: 0.0015

Key innovation: autonomous ML research methodology. An AI coding agent discovered cosine LR scaling for TTT in a single 2-hour session — 7 experiments from hypothesis to record.

Technical: CosineAnnealingLR over 100 TTT epochs (3-line change). Architecture: PR openai#398/openai#442 base (11L, int6+zstd, 15.51MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
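The cosine schedule named above follows PyTorch's CosineAnnealingLR rule, lr_t = eta_min + (base_lr - eta_min) * (1 + cos(pi * t / T_max)) / 2. Reimplemented here in plain Python for clarity; T_max=100 mirrors the "100 TTT epochs" setup, while the base lr and eta_min values are illustrative:

```python
import math

# CosineAnnealingLR formula: anneal from base_lr at t=0 to eta_min at
# t=T_max along a half cosine. Matches the torch scheduler's closed form.

def cosine_lr(t, base_lr, T_max=100, eta_min=0.0):
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t / T_max)) / 2
```

Swapping a fixed TTT lr for this schedule is the reported 3-line change: construct the scheduler and call its step once per TTT epoch.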
…nai#400 openai#369 openai#398) KEY DISCOVERY: PR#414 stacks EMA + Tight SWA together (-0.0006 BPB free) GPTQ should be per-ROW not per-matrix (-0.0006 BPB) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
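The per-row vs per-matrix point above is easy to see with plain symmetric rounding (this sketch is not GPTQ itself — it only illustrates why one scale per output row preserves small-magnitude rows that a single per-matrix scale would crush; the 6-bit width matches the int6 artifacts mentioned elsewhere in the thread):

```python
# Symmetric round-to-nearest quantization with one scale per row.
# Returns integer codes; each row's scale is max(|row|) / qmax.

def quantize_rows(matrix, n_bits=6):
    qmax = 2 ** (n_bits - 1) - 1
    out = []
    for row in matrix:
        scale = max(abs(x) for x in row) / qmax or 1.0  # avoid div-by-zero on all-zero rows
        out.append([round(x / scale) for x in row])
    return out
```

With a single per-matrix scale, a row whose largest entry is 100x smaller than the matrix maximum would collapse to mostly zero codes; per-row scales give every row the full code range.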
Community Review — Non-record: 11L EMA + TTT(20ep,freeze=0) + 15-run ablation study — val_bpb=1.1213 (3-seed)

BPB: 1.1213 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on …

What I found in the code (head SHA …): at line 1007 the pre-quant TTT function takes …. Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not …).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=9, vocab=1024, code=71770 B, SMOKE_TEST_PASS.

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of ….

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based …
Record: 11L EMA + TTT(20ep) — val_bpb: 1.1213
val_bpb = 1.1213 (sliding window stride=64, best seed 1337) | 15.53 MB artifact | 8xH100 SXM, 600s
Key Finding: EMA + Aggressive TTT with All Blocks Unfrozen
EMA(0.997) weight averaging combined with aggressive test-time training (20 epochs SGD, lr=0.008, all blocks unfrozen) outperforms Tight SWA + VE128 approaches (PR #388, 1.1231).
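The EMA(0.997) averaging named in the key finding keeps a shadow copy of the weights that moves toward the live weights after every optimizer step, and it is the shadow copy that gets evaluated. A schematic per-scalar version (real implementations apply this tensor-wise over the whole state dict):

```python
# Exponential moving average of weights: after each optimizer step the
# shadow copy takes a small step toward the live weights (decay=0.997).

def ema_update(shadow, weights, decay=0.997):
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, weights)]
```

With decay 0.997 the shadow averages over roughly the last 1/(1-0.997) ≈ 333 steps, smoothing out late-training noise before TTT begins.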
Results (3-seed, 8xH100 SXM)
Mean: 1.1221 | Std: 0.0008
Critical Discoveries (15-run ablation)
Run Command
See the full README in the submission folder for detailed architecture, training config, TTT analysis, and the complete 15-run ablation table.