Non-record: 11L EMA + TTT(20ep,freeze=0) + 15-run ablation study — val_bpb=1.1213 (3-seed)#398
Conversation
…ai#398 findings; disable XSA for throughput
… DDP 14.8s/epoch)
PR openai#398: 11L EMA + TTT(20ep, freeze=0), no XSA, no Late QAT
- Best seed 1.1213 BPB, 3-seed mean 1.1221
- 7386 steps at ~81ms/step
- Has: FA3, NTK RoPE, MTP, TTT, (B,S,H,D) layout
- Missing: memory tokens, magnitude pruning, late-K passthrough
…#398 base
Built on PR openai#398 (1.1213 BPB). Three targeted improvements:
1. Cautious Muon: mask Muon updates that disagree with the gradient direction (~1.47x convergence speedup, 2 lines, zero risk)
2. Magnitude pruning (5% default): zero the smallest weights before quantization, improving the zstd compression ratio by 5-15%
3. allow_in_graph + cache_size_limit=32: safer torch.compile with FA3 custom ops and 11-block guard specialization
Respects PR openai#398 negative results: no memory tokens, no Late QAT.
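The "cautious" masking idea above (drop optimizer-update components whose sign disagrees with the raw gradient) can be sketched in a few lines. This is a schematic, framework-free version; the function name, the rescaling choice, and the flat-list representation are illustrative, not the PR's actual Muon implementation.

```python
# Sketch of cautious update masking: zero components of the optimizer's
# proposed update whose sign disagrees with the gradient, then rescale the
# survivors so the overall update scale is roughly preserved.
# All names here are illustrative stand-ins.

def cautious_mask(update, grad):
    """Keep only update components that agree in sign with the gradient."""
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    kept = sum(mask)
    # Rescale surviving components; if nothing survives, the update is zero.
    scale = len(mask) / kept if kept > 0 else 0.0
    return [u * m * scale for u, m in zip(update, mask)]

masked = cautious_mask([0.5, -0.2, 0.1, -0.3], [1.0, 0.4, -0.2, -0.9])
# components 0 and 3 agree in sign with the gradient and are kept (rescaled);
# components 1 and 2 are zeroed
```

In a real optimizer step this mask would be applied to the Muon-orthogonalized update tensor, element-wise, just before the weights are modified.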
…agnitude pruning
Replace SGD with AdamW for test-time training. A 3-line diff from PR openai#398. Mean val_bpb 1.1027 (3-seed), best 1.0992. Improves on the prior SOTA of 1.1213 by 0.019.
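The gist of the SGD-to-AdamW swap can be shown with a single-parameter update rule. This is a toy, pure-Python rendering of the standard AdamW step (the hyperparameters are illustrative, not the PR's), meant only to show what the adaptive step adds over plain SGD for test-time training:

```python
import math

# Toy single-parameter steps: plain SGD vs AdamW with bias correction.
# Illustrative hyperparameters; the real change is one optimizer constructor.

def sgd_step(w, g, lr=0.008):
    return w - lr * g

def adamw_step(w, g, m, v, t, lr=5e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g * g        # second-moment EMA
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```

The per-parameter normalization by `sqrt(v_hat)` is what makes AdamW less sensitive to the learning rate than SGD during the short, noisy TTT phase.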
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442) flagged as potentially invalid for adapting on eval tokens BEFORE scoring them. Added correct score-then-adapt protocol with implementation guide. https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
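The score-then-adapt protocol flagged here can be sketched as a loop where each evaluation chunk is scored with the current weights before the model is allowed to train on it. `score` and `adapt` are hypothetical stand-ins for the real loss and update functions:

```python
# Sketch of a legal score-then-adapt TTT loop: no token's score ever
# benefits from the model having trained on that token.

def score_then_adapt(chunks, score, adapt, state):
    total, count = 0.0, 0
    for chunk in chunks:
        total += score(state, chunk)   # scored first, with pre-adaptation weights
        count += len(chunk)
        state = adapt(state, chunk)    # only now may the model learn from it
    return total / count, state
```

The flagged submissions invert this ordering (adapt over multiple epochs, then score), which is what the ruling calls invalid.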
Architecture discovered via GEPA (Gemini-driven evolutionary search). SwiGLU FFN, Star-ReLU, U-Net skip gates, BigramHash 8192, XSA4. AdamW TTT (lr=0.0005, 10ep) from @sjp611 (openai#442). EMA, RoPE, LN Scale, QAT from @felipe-parodi (openai#398) and @fbedev (openai#410). 3-seed results: 1.06733 / 1.06833 / 1.06580 Mean: 1.06715, Std: 0.00104 Built by @joepro with AI agents via OpenClaw. Compute provided by Modal.
…ttern) Root cause: per-sequence indexing from permuted indices was ~100x slower than contiguous val_tokens slicing. Each GPU now takes a contiguous shard and iterates sequentially, matching openai#398's working implementation.
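The fix described above amounts to giving each rank one contiguous slice of the flat validation token stream instead of gathering permuted per-sequence indices. A minimal sketch (variable names illustrative, even division assumed):

```python
# Each of `world_size` ranks takes one contiguous slice of val_tokens and
# walks it sequentially, avoiding slow scattered indexing.

def contiguous_shard(val_tokens, rank, world_size, seq_len):
    n_seq = len(val_tokens) // seq_len
    per_rank = n_seq // world_size            # assume it divides evenly here
    start = rank * per_rank * seq_len
    end = start + per_rank * seq_len
    shard = val_tokens[start:end]             # one contiguous slice per GPU
    return [shard[i:i + seq_len] for i in range(0, len(shard), seq_len)]
```

With a tensor backend the contiguous slice keeps memory access sequential, which is where the reported ~100x speedup over permuted gathers comes from.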
Major rewrite based on latest meta (PRs openai#398, openai#442, openai#462):
- SwiGLU FFN with Star-ReLU (hidden=1792)
- U-Net skip connections with learned gating
- EMA (decay=0.9985) replacing SWA
- AdamW TTT (legal score-first protocol)
- Partial RoPE (16 dims)
- LN Scale (1/sqrt(layer_idx+1))
- BigramHash(8192) + SmearGate
- GPTQ-lite quantization
- DDP compile fix for multi-GPU

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
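The LN Scale term listed above is just a depth-dependent damping factor on each block's residual branch. A one-line sketch (the surrounding block wiring is schematic and not this PR's exact code):

```python
import math

# Depth-dependent residual scaling: deeper blocks contribute less,
# following the 1/sqrt(layer_idx + 1) rule named in the PR description.

def ln_scale(layer_idx):
    return 1.0 / math.sqrt(layer_idx + 1)

# usage inside a hypothetical block: x = x + ln_scale(i) * block_i(norm(x))
```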
… 3 seeds) AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups (3x for MLP output projections, 0.5x for input projections). 34 TTT configurations tested. FINDINGS.md documents 31 experiments including negative results on codebook quantization, symmetry-transport, layer dropping, focal loss, and KL divergence TTT. Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
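The per-layer lr groups described above (3x base lr for MLP output projections, 0.5x for input projections) reduce to a name-based lookup when building optimizer parameter groups. The suffix matching below is illustrative; the real grouping depends on the model's actual module names:

```python
# Map a parameter name to its TTT learning rate: 3x for MLP output
# projections, 0.5x for MLP input projections, base lr otherwise.
# The "mlp.out_proj" / "mlp.in_proj" names are hypothetical.

def lr_for(name, base_lr):
    if name.endswith("mlp.out_proj.weight"):
        return 3.0 * base_lr
    if name.endswith("mlp.in_proj.weight"):
        return 0.5 * base_lr
    return base_lr
```

In PyTorch this would feed a list of `{"params": ..., "lr": lr_for(name, base_lr)}` groups into the AdamW constructor.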
Converting to non-record per TTT ruling (Issue #402). The README documents a 15-run ablation (memory tokens, causal TTT, PPM-C blending, grad-guided quant, aggressive warmdown — all negative results at the frontier) and the freeze_blocks=0 finding for aggressive TTT. Working on a non-TTT submission.
3-seed mean: 0.9789 BPB (sliding window stride=64) | Best seed: 0.9779 (seed 7) | Std: 0.0015

Key innovation: autonomous ML research methodology. An AI coding agent discovered cosine LR scaling for TTT in a single 2-hour session — 7 experiments from hypothesis to record.

Technical: CosineAnnealingLR over 100 TTT epochs (3-line change). Architecture: PR openai#398/openai#442 base (11L, int6+zstd, 15.51MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
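The cosine schedule named above follows PyTorch's CosineAnnealingLR rule, lr_t = eta_min + (base_lr - eta_min) * (1 + cos(pi * t / T_max)) / 2. Reimplemented here in plain Python for clarity; T_max=100 mirrors the "100 TTT epochs" setup, while the base lr and eta_min values are illustrative:

```python
import math

# CosineAnnealingLR formula: anneal from base_lr at t=0 to eta_min at
# t=T_max along a half cosine. Matches the torch scheduler's closed form.

def cosine_lr(t, base_lr, T_max=100, eta_min=0.0):
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t / T_max)) / 2
```

Swapping a fixed TTT lr for this schedule is the reported 3-line change: construct the scheduler and call its step once per TTT epoch.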
…nai#400 openai#369 openai#398) KEY DISCOVERY: PR#414 stacks EMA + Tight SWA together (-0.0006 BPB free) GPTQ should be per-ROW not per-matrix (-0.0006 BPB) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
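The per-row vs per-matrix point above is easy to see with plain symmetric rounding (this sketch is not GPTQ itself — it only illustrates why one scale per output row preserves small-magnitude rows that a single per-matrix scale would crush; the 6-bit width matches the int6 artifacts mentioned elsewhere in the thread):

```python
# Symmetric round-to-nearest quantization with one scale per row.
# Returns integer codes; each row's scale is max(|row|) / qmax.

def quantize_rows(matrix, n_bits=6):
    qmax = 2 ** (n_bits - 1) - 1
    out = []
    for row in matrix:
        scale = max(abs(x) for x in row) / qmax or 1.0  # avoid div-by-zero on all-zero rows
        out.append([round(x / scale) for x in row])
    return out
```

With a single per-matrix scale, a row whose largest entry is 100x smaller than the matrix maximum would collapse to mostly zero codes; per-row scales give every row the full code range.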
Community Review — Non-record: 11L EMA + TTT(20ep,freeze=0) + 15-run ablation study — val_bpb=1.1213 (3-seed)

BPB: 1.1213 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on …

What I found in the code (head SHA …): at line 1007 the pre-quant TTT function takes …. Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not …).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=9, vocab=1024, code=71770 B, SMOKE_TEST_PASS.

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of ….

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based …
Record: 11L EMA + TTT(20ep) — val_bpb: 1.1213
val_bpb = 1.1213 (sliding window stride=64, best seed 1337) | 15.53 MB artifact | 8xH100 SXM, 600s
Key Finding: EMA + Aggressive TTT with All Blocks Unfrozen
EMA(0.997) weight averaging combined with aggressive test-time training (20 epochs SGD, lr=0.008, all blocks unfrozen) outperforms Tight SWA + VE128 approaches (PR #388, 1.1231).
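The EMA(0.997) averaging named in the key finding keeps a shadow copy of the weights that moves toward the live weights after every optimizer step, and it is the shadow copy that gets evaluated. A schematic per-scalar version (real implementations apply this tensor-wise over the whole state dict):

```python
# Exponential moving average of weights: after each optimizer step the
# shadow copy takes a small step toward the live weights (decay=0.997).

def ema_update(shadow, weights, decay=0.997):
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, weights)]
```

With decay 0.997 the shadow averages over roughly the last 1/(1-0.997) ≈ 333 steps, smoothing out late-training noise before TTT begins.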
Results (3-seed, 8xH100 SXM)
Mean: 1.1221 | Std: 0.0008
Critical Discoveries (15-run ablation)
Run Command
See the full README in the submission folder for detailed architecture, training config, TTT analysis, and the complete 15-run ablation table.