Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT — val_bpb 1.07785 (3-seed mean) #1541
Conversation
…Legal TTT — val_bpb 1.07785 (3-seed mean)

3-seed mean: 1.07785 (std 0.00047), seeds 42/314/999
All artifacts under 16MB, training under 600s, eval under 600s
Improved parallel residuals (cross-lane routing), Muon 0.97, MATRIX_LR=0.03
Score-first TTT (SGD 3ep), no SLOT, no pre-quant TTT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was essential for running 180+ experiments across Steps 1-23 that led to this result.
One small clarification request: the PR body says "No hash embed", but the attached seed logs show ttt_hash_embed: True and ttt_hash_buckets: 16384 for all three runs. I could not quickly tell whether that path is dead/unused code in the current implementation or actually part of the scored eval path. If it is inactive, it would help to say so explicitly, or to set the flag to 0 in the published runs, so the legality story is easier to follow.
…RA TTT doc-independent legal; BPB bug alert

- PR openai#1541 (bigbag, 1.07785): Improved Parallel Residuals cross-lane + Muon 0.97 — open, hash embed flag pending
- PR openai#1540 (aryanbhosale, 1.0777): VarLen Attention + Doc-Independent LoRA TTT rank-96 (score-first, resets per batch) — appears legal
- PR openai#1539 confirmed illegal (Pre-Quant AdamW TTT, same ruling as openai#771)
- PR openai#1545 BPB double-counting bug: claimed ~1.028, actual score ~1.18
- PR openai#758 effectively dead: TTT contradiction + unnormalized n-gram both flagged
- Session 10 lessons: MATRIX_LR=0.03 pairs with Muon 0.97; doc-independent LoRA TTT is adoptable
- No merged SOTA change (still 1.0810); target remains ≤1.0760

https://claude.ai/code/session_01LgqwEDyFnyHsBbyJiSFUjK
Community Review — SP8192 + Improved Parallel Residuals + Muon 0.97 + Legal TTT

Thanks @bigbag — as the author of merged SOTA #1493, this is a high-visibility iteration. I have one parse-blocking finding that affects whether this runs on the eval image, and otherwise a clean compliance read.

What I found in the code (head SHA …)

1. Parse-blocking SyntaxError on Python 3.10

The decoded payload fails to parse on Python 3.10 (the eval image's interpreter version). The offending line uses an f-string with an inner double-quoted string:

log(f" {cat}: {", ".join(sorted(categories[cat]))}")

This is valid on Python 3.12+ (PEP 701 relaxed the nested-string rules) but invalid on Python 3.10. The CT2038 container I tested on runs Python 3.10.12 and reproduces the failure. The cleanest fix is a one-character change — swap the inner double quotes for single quotes:

log(f" {cat}: {', '.join(sorted(categories[cat]))}")

This blocks the submission from running at all on the eval image; it should be the first thing addressed.

2. TTT pattern is LEGAL — once (1) is fixed

Reading past the parse error via static source inspection, this is the #1416 / #1423 legal reference pattern.

3. Smoke test (CT2038 proteus-engine, 2026-04-11)

Parse-blocked by (1). Once (1) is fixed, the structural code matches the legal reference shape and should pass the smoke test on Python 3.10.

Verdict

NEEDS AUTHOR ACTION — parse-blocking syntax error on Python 3.10 at decoded-payload line 289 in the logging helper. One-character fix (swap the inner double quotes for single quotes).

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: NEEDS AUTHOR ACTION. The core technique is clean; the submission is just gated on a Python-version compatibility fix that trivially falls out of a local syntax check.

Reviewed by @MatoTeziTanka — The Agora.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL due to Python 3.10 f-string syntax incompatibility at decoded-payload line 289. Static review of the remaining 50,817 bytes of decoded source confirmed the TTT follows the #1416 / #1423 legal score-first-per-chunk pattern — no scored-region SLOT.
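The reviewer's finding can be reproduced without the submission itself. This is a throwaway check (the `parses` helper is ours, not from the PR): `compile()` only parses the source, so no `log` function or `categories` dict needs to exist.

```python
import sys

# The two variants from the review: nested double quotes inside an f-string
# (valid only on Python 3.12+, per PEP 701) vs. inner single quotes (3.10+).
broken = '''log(f" {cat}: {", ".join(sorted(categories[cat]))}")'''
fixed = '''log(f" {cat}: {', '.join(sorted(categories[cat]))}")'''

def parses(src: str) -> bool:
    try:
        compile(src, "<snippet>", "exec")  # parse only; never executed
        return True
    except SyntaxError:
        return False

assert parses(fixed)                                    # OK on 3.10 and up
assert parses(broken) == (sys.version_info >= (3, 12))  # PEP 701 gate
```

Running this under a 3.10 interpreter confirms the one-character fix is sufficient.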
Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT
val_bpb = 1.07785 (3-seed mean, std 0.00047) | ~15.99 MB | 8xH100 SXM
3-Seed Results
Merged SOTA (PR #1493, our previous): 1.0810 BPB. Delta: -0.0032 BPB.
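For context on these numbers, a minimal sketch of the standard bits-per-byte definition (assumed from the usual convention; this is not the PR's eval code):

```python
import math

def bits_per_byte(total_loss_nats: float, n_bytes: int) -> float:
    """Summed NLL in nats, converted to bits, divided by the raw byte
    count of the scored text. Each byte must be counted exactly once:
    double-counting bytes inflates the denominator and understates BPB."""
    return total_loss_nats / (math.log(2) * n_bytes)

# e.g. a total loss of 100*ln(2) nats over 100 bytes is exactly 1.0 BPB
assert abs(bits_per_byte(100 * math.log(2), 100) - 1.0) < 1e-12
```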
Key Techniques
Improved Parallel Residuals (from PR #1529 @msisovic, "Record: ParallelResiduals, 1.0753 BPB / 2.7777 nats, -0.0025 BPB / -0.0064 nats vs PR #1523") -- cross-lane routing where attention and MLP outputs route to BOTH lanes via learned scalars. 66 new scalar params (par_post[11,2,2] + par_resid[11,2]). Final output = MLP lane (lane1). Starts at layer 7.
Muon Momentum 0.97 (from PR #1514 @dexhunter, "Record: SP8192 + Muon 0.97 + Legal Score-First TTT — val_bpb 1.07983 (3-seed mean)") -- reduced from 0.99. Shorter memory horizon (~33 steps, roughly 1/(1-0.97)) better tracks the rapidly changing loss surface during warmdown.
MATRIX_LR = 0.03 -- re-tuned for momentum 0.97 (higher LR pairs with lower momentum). Sweep: 0.022 → 1.0797, 0.03 → 1.0795, 0.04 → 1.0811.
3-Layer Depth Recurrence (L3-5, activate at frac=0.35) -- 17 virtual layers from 11 physical.
QK-Gain 5.25 -- monotonic improvement from 4.0 to 5.25.
Legal Score-First TTT -- SGD (lr=0.005, mom=0.9), 3 epochs per 32K-token chunk, cosine LR decay.
SP8192 + GPTQ SDClip -- int6 matrices (k=12.85), int8 embeddings (k=20.0), Brotli-11 compression.
Tuned Hyperparameters -- WD=0.095, EMA=0.9965, warmdown=0.72.
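The depth-recurrence claim (17 virtual layers from 11 physical) can be sanity-checked from the layer schedule given in the Architecture section; this sketch is illustrative, not the training code:

```python
# Layer index schedule from the Architecture section: physical blocks
# 3-5 are revisited, adding "virtual" depth without new parameters.
encoder = [0, 1, 2, 3, 4, 5, 3, 4]
decoder = [5, 3, 4, 5, 6, 7, 8, 9, 10]
schedule = encoder + decoder

n_virtual = len(schedule)        # passes through a transformer block
n_physical = len(set(schedule))  # distinct parameter sets
assert (n_virtual, n_physical) == (17, 11)
```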
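The legality of the TTT hinges on ordering: each chunk is scored before any adaptation uses it. A minimal sketch of that score-first loop with the settings above (SGD lr=0.005, 3 epochs, cosine decay); `model_score` and `model_update` are placeholder callables, not the PR's interfaces:

```python
import math

def process_chunks(chunks, model_score, model_update, base_lr=0.005, epochs=3):
    """Score-first TTT: every chunk is scored with the CURRENT weights
    before any test-time update sees that chunk (the legal ordering)."""
    scores = []
    for chunk in chunks:
        scores.append(model_score(chunk))          # score first
        for ep in range(epochs):                   # then 3 SGD epochs
            lr = base_lr * 0.5 * (1 + math.cos(math.pi * ep / epochs))
            model_update(chunk, lr)                # cosine-decayed step
    return scores
```

In the real submission the update would be an SGD step (momentum 0.9) on the chunk's loss; the sketch only pins down the score-before-update ordering and the cosine schedule, and the exact decay formula is an assumption.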
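For the quantization step, a generic sketch of std-clipped symmetric quantization shows where the bit-width and a clip constant k could enter; the actual GPTQ SDClip procedure is not shown in this README, so every detail here (including how k is used) is an assumption:

```python
import statistics

def sdclip_quantize(weights, bits=6, k=12.85):
    """Illustrative only: clip values to +/- k*sigma, then map the
    clipped range onto signed integers (qmax = 31 for int6)."""
    sigma = statistics.pstdev(weights) or 1.0
    clip = k * sigma
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    q = [round(max(-clip, min(clip, w)) / scale) for w in weights]
    deq = [v * scale for v in q]  # what the eval-time model would load
    return q, deq
```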
Architecture
11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4], decoder [5,3,4,5,6,7,8,9,10]. Improved parallel residuals from layer 7: attention reads from lane0, MLP reads from lane1, both outputs route to both lanes via learned par_post and par_resid scalars. Skip gates (sigmoid-gated U-Net connections).

Compliance (Track B)
Per Issue #1017:
torch.no_grad() BEFORE SGD. No SLOT, no pre-quant TTT, no ETLB, no n-gram cache. All artifacts < 16MB, train < 600s, eval < 600s.
Reproduction
Credits
Acknowledgements
Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod).
Included Files
README.md (this file), submission.json, train_gpt.py, train_seed42.log, train_seed314.log, train_seed999.log

🤖 Generated with Claude Code