
Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT — val_bpb 1.07785 (3-seed mean)#1541

Open
bigbag wants to merge 1 commit into openai:main from bigbag:submission/sp8192-improved-parresid

Conversation


@bigbag bigbag commented Apr 11, 2026

Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT

val_bpb = 1.07785 (3-seed mean, std 0.00047) | ~15.99 MB | 8xH100 SXM

3-Seed Results

| Seed | Sliding BPB | TTT BPB | Artifact (bytes) |
|------|-------------|---------|------------------|
| 42   | 1.07880     | 1.07718 | 15,990,780       |
| 314  | 1.07959     | 1.07810 | 15,987,449       |
| 999  | 1.07963     | 1.07826 | 15,987,550       |
| Mean | 1.07934     | 1.07785 | 15,988,593       |
| Std  | 0.00039     | 0.00047 |                  |

Merged SOTA (PR #1493, our previous): 1.0810 BPB. Delta: -0.0032 BPB.

Key Techniques

  1. Improved Parallel Residuals (from PR #1529 by @msisovic, "Record: ParallelResiduals", 1.0753 BPB / 2.7777 nats, -0.0025 BPB / -0.0064 nats vs PR #1523) -- cross-lane routing where attention and MLP outputs route to BOTH lanes via learned scalars. 66 new scalar params (par_post[11,2,2] + par_resid[11,2]). Final output = MLP lane (lane1). Starts at layer 7.

  2. Muon Momentum 0.97 (from PR #1514 by @dexhunter, "Record: SP8192 + Muon 0.97 + Legal Score-First TTT — val_bpb 1.07983 (3-seed mean)") -- reduced from 0.99. The shorter memory horizon (~33 steps) better tracks the rapidly changing loss surface during warmdown.

  3. MATRIX_LR = 0.03 -- re-tuned for momentum 0.97 (higher LR pairs with lower momentum). Sweep: 0.022 → 1.0797, 0.03 → 1.0795, 0.04 → 1.0811.

  4. 3-Layer Depth Recurrence (L3-5, activate at frac=0.35) -- 17 virtual layers from 11 physical.

  5. QK-Gain 5.25 -- monotonic improvement from 4.0 to 5.25.

  6. Legal Score-First TTT -- SGD (lr=0.005, mom=0.9), 3 epochs per 32K-token chunk, cosine LR decay.

  7. SP8192 + GPTQ SDClip -- int6 matrices (k=12.85), int8 embeddings (k=20.0), Brotli-11 compression.

  8. Tuned Hyperparameters -- WD=0.095, EMA=0.9965, warmdown=0.72.
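The cross-lane routing in technique (1) can be sketched as follows. This is an illustrative reconstruction, not the submission's code: the class name, module interface, and init values are assumptions; the PR only specifies the `par_post` / `par_resid` scalar shapes, attention reading lane 0, the MLP reading lane 1, and the final output coming from lane 1.

```python
import torch
import torch.nn as nn

class ImprovedParallelResidualBlock(nn.Module):
    """Sketch: two residual lanes with learned scalar cross-lane routing.

    Attention reads lane 0, the MLP reads lane 1; each module's output is
    written to BOTH lanes via par_post, and each lane's carry-through is
    scaled by par_resid. (Hypothetical reconstruction of the PR's scheme.)
    """

    def __init__(self, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        # par_post[module, lane]: module 0 = attention, module 1 = MLP.
        # Identity-like init (assumed): each module writes only to "its" lane.
        self.par_post = nn.Parameter(torch.tensor([[1.0, 0.0], [0.0, 1.0]]))
        # par_resid[lane]: scale on each lane's residual carry-through.
        self.par_resid = nn.Parameter(torch.ones(2))

    def forward(self, lane0: torch.Tensor, lane1: torch.Tensor):
        a = self.attn(lane0)  # attention reads lane 0
        m = self.mlp(lane1)   # MLP reads lane 1
        new0 = self.par_resid[0] * lane0 + self.par_post[0, 0] * a + self.par_post[1, 0] * m
        new1 = self.par_resid[1] * lane1 + self.par_post[0, 1] * a + self.par_post[1, 1] * m
        return new0, new1     # final model output is taken from lane 1
```

With the identity-like init above the block reduces to two independent residual streams; training the four routing scalars per layer is what lets attention output leak into the MLP lane and vice versa.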

Architecture

11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9,10]. Improved parallel residuals from layer 7: attention reads from lane0, MLP reads from lane1, both outputs route to both lanes via learned par_post and par_resid scalars. Skip gates (sigmoid-gated U-Net connections).
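The depth-recurrence schedule above can be sketched as a simple index mapping; the index lists are taken verbatim from this PR, while the harness function is hypothetical:

```python
# 11 physical blocks are reused to form 17 virtual layers.
ENCODER_SCHEDULE = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER_SCHEDULE = [5, 3, 4, 5, 6, 7, 8, 9, 10]

def run_depth_recurrence(x, blocks):
    """Apply the physical `blocks` in virtual-layer order (sketch)."""
    for phys_idx in ENCODER_SCHEDULE + DECODER_SCHEDULE:
        x = blocks[phys_idx](x)
    return x
```

Note that layers 3, 4, and 5 are each visited multiple times, which is where the "17 virtual layers from 11 physical" claim comes from.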

Compliance (Track B)

Per Issue #1017:

  • Condition 1 (Causality): Sliding-window eval, prefix only
  • Condition 2 (Normalized): Standard softmax, no n-gram/logit bias
  • Condition 3 (Score before update): Each chunk scored under torch.no_grad() BEFORE SGD
  • Condition 4 (Single pass): Each token scored once, no rescoring

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache. All artifacts < 16MB, train < 600s, eval < 600s.

Reproduction

```shell
SEED=42 QK_GAIN_INIT=5.25 MUON_MOMENTUM=0.97 MATRIX_LR=0.03 \
  TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Acknowledgements

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod).

Included Files

  • README.md (this file)
  • submission.json
  • train_gpt.py
  • train_seed42.log
  • train_seed314.log
  • train_seed999.log

🤖 Generated with Claude Code

…Legal TTT — val_bpb 1.07785 (3-seed mean)

3-seed mean: 1.07785 (std 0.00047), seeds 42/314/999
All artifacts under 16MB, training under 600s, eval under 600s
Improved parallel residuals (cross-lane routing), Muon 0.97, MATRIX_LR=0.03
Score-first TTT (SGD 3ep), no SLOT, no pre-quant TTT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bigbag commented Apr 11, 2026

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was essential for running 180+ experiments across Steps 1-23 that led to this result.


dexhunter commented Apr 11, 2026

One small clarification request:

The PR body says "No hash embed", but the attached seed logs show `ttt_hash_embed: True` and `ttt_hash_buckets: 16384` for all three runs.

I could not quickly tell whether that path is just dead / unused code in the current implementation or whether it is actually part of the scored eval path. If it is inactive, it would help to say that explicitly, or set the flag to 0 in the published runs, so the legality story is easier to follow.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 11, 2026
…RA TTT doc-independent legal; BPB bug alert

- PR openai#1541 (bigbag, 1.07785): Improved Parallel Residuals cross-lane + Muon 0.97 — open, hash embed flag pending
- PR openai#1540 (aryanbhosale, 1.0777): VarLen Attention + Doc-Independent LoRA TTT rank-96 (score-first, resets per batch) — appears legal
- PR openai#1539 confirmed illegal (Pre-Quant AdamW TTT, same ruling as openai#771)
- PR openai#1545 BPB double-counting bug: real score ~1.028 claim is ~1.18 actual
- PR openai#758 effectively dead: TTT contradiction + unnormalized n-gram both flagged
- Session 10 lessons: MATRIX_LR=0.03 pairs with Muon 0.97; doc-independent LoRA TTT is adoptable
- No merged SOTA change (still 1.0810); target remains ≤1.0760

https://claude.ai/code/session_01LgqwEDyFnyHsBbyJiSFUjK
@MatoTeziTanka

Community Review — SP8192 + Improved Parallel Residuals + Muon 0.97 + Legal TTT

Thanks @bigbag — as the author of merged SOTA #1493 this is a high-visibility iteration. I have one parse-blocking finding that affects whether this runs on the eval image, and otherwise a clean compliance read.

What I found in the code (head SHA e037ca92030c8c861eb4ec2aa2b8f230722edb73, `records/track_10min_16mb/2026-04-11_SP8192_ImprovedParResid_Muon097_LR03_LegalTTT/train_gpt.py`, decoded from the `import lzma as L, base64 as B` self-extracting shim — 50,817 bytes of actual source):

1. Parse-blocking SyntaxError on Python 3.10

The decoded payload fails to parse on Python 3.10 (the eval image's interpreter version) with:

```
SyntaxError("f-string: expecting '}'", ('<string>', 289, 49,
  '\tfor cat in sorted(categories):log(f"  {cat}: {", ".join(sorted(categories[cat]))}")\n', 289, 50))
```

The offending line uses an f-string whose inner `", ".join(...)` re-enters the same double-quote context:

```python
log(f"  {cat}: {", ".join(sorted(categories[cat]))}")
```

This is valid on Python 3.12+ (PEP 701 relaxed the nested-string rules) but invalid on Python 3.10. The CT2038 container I tested on runs Python 3.10.12, and `importlib.util.spec_from_file_location` fails at parse time, before any runtime logic runs. On the canonical eval image (also Python 3.10 per Issue #17 / the README), this means `train_gpt.py` cannot be imported — not "runs but fails", but "cannot parse".

The cleanest fix is a one-character change — swap the inner `", "` for `', '`:

```python
log(f"  {cat}: {', '.join(sorted(categories[cat]))}")
```

This blocks the submission from running at all on the eval image. It should be the first thing addressed.

2. TTT pattern is LEGAL — for when (1) is fixed

Reading past the parse error via static source inspection, eval_val_ttt at L(decoded) ~1X implements the score-first-per-chunk pattern:

  • `base_model.eval()` + `with torch.no_grad():` scores the windows assigned to chunk `ci`, accumulating into `loss_sum` / `token_count` / `byte_count`
  • `is_last_chunk = ci == num_chunks - 1`; if not the last chunk and `ttt_epochs > 0`, THEN `base_model.train()` and SGD runs on that chunk's tokens
  • Cosine LR decay `h.ttt_lr * 0.5 * (1 + cos(π * ci / max(num_chunks - 1, 1)))` across chunks
  • `chunk_seqs = (chunk_end - chunk_start) // seq_len`, rank-sharded, `ttt_epochs` inner passes with `clip_grad_norm_(ttt_params, 1.0)` + `dist.all_reduce(p.grad, op=ReduceOp.AVG)` for DDP
  • `base_model.eval()` at function exit

This is the #1416 / #1423 legal reference pattern. No `prequant_ttt_adapt_adamw(val_tokens, ...)` call before GPTQ. No scored-region SLOT. No `full_key = (ctx_hash ^ target * primes[k])` n-gram cache. Chunk `ci` is scored under weights adapted only on chunks `0..ci-1`, and the last chunk gets no adaptation.
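The pattern the bullets above describe can be sketched as a minimal single-process harness. This is not the decoded source: the function name and `chunks` interface are hypothetical, and the rank sharding, DDP all-reduce, and byte accounting are omitted for brevity.

```python
import math
import torch

def score_first_ttt(model, chunks, ttt_lr=0.005, ttt_epochs=3, momentum=0.9):
    """Sketch of the score-first-per-chunk TTT pattern.

    `chunks` is a list of (inputs, targets) pairs. Each chunk is scored under
    torch.no_grad() BEFORE any update, and the last chunk receives no
    adaptation, so every token is scored exactly once under weights adapted
    only on earlier chunks.
    """
    ttt_params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(ttt_params, lr=ttt_lr, momentum=momentum)
    loss_sum, token_count = 0.0, 0
    num_chunks = len(chunks)
    for ci, (inp, tgt) in enumerate(chunks):
        model.eval()
        with torch.no_grad():  # Condition 3: score before update
            loss = torch.nn.functional.cross_entropy(model(inp), tgt)
            loss_sum += loss.item() * tgt.numel()
            token_count += tgt.numel()
        if ci < num_chunks - 1 and ttt_epochs > 0:  # last chunk: no adaptation
            # cosine LR decay across chunks
            lr = ttt_lr * 0.5 * (1 + math.cos(math.pi * ci / max(num_chunks - 1, 1)))
            for g in opt.param_groups:
                g["lr"] = lr
            model.train()
            for _ in range(ttt_epochs):
                opt.zero_grad()
                torch.nn.functional.cross_entropy(model(inp), tgt).backward()
                torch.nn.utils.clip_grad_norm_(ttt_params, 1.0)
                opt.step()
    model.eval()
    return loss_sum / max(token_count, 1)
```

The key legality property is visible in the control flow: the `no_grad` scoring of chunk `ci` always precedes the SGD pass on chunk `ci`, and the update is skipped entirely for the final chunk.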

3. Smoke test (CT2038 proteus-engine, 2026-04-11)

```
IMPORT_FAIL error=SyntaxError("f-string: expecting '}'", ...line 289...)
```

Parse-blocked by (1). Once (1) is fixed the structural code is the legal reference shape and should smoke-test-pass on Python 3.10.

Verdict

NEEDS AUTHOR ACTION — parse-blocking syntax error on Python 3.10 at decoded-payload line 289 in the logging helper. One-character fix (`", "` → `', '`). Once that lands, the TTT path is the legal score-first-per-chunk pattern and the submission should be reviewable as a clean SOTA iteration.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: NEEDS AUTHOR ACTION. The core technique is clean; the submission is just gated on a Python-version compatibility fix that trivially falls out of a local `python3 -c "import py_compile; py_compile.compile('train_gpt.py')"` check before bundling. I'd suggest adding that as a pre-commit step for the self-extracting shim workflow, since PRs #1523 and #1541 both land with the same class of bug at the same logging-helper line pattern — this looks like shared source across your SOTA iteration branch.
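The suggested pre-bundle check can be wrapped as a small helper (hypothetical name; run it under the same Python 3.10 interpreter as the eval image so that version-dependent syntax such as PEP 701 f-strings is caught):

```python
import py_compile
import sys

def check_parses(path: str) -> bool:
    """Return True iff `path` parses under the running interpreter.

    Intended as a pre-commit / pre-bundle gate for the self-extracting shim
    workflow: run it under Python 3.10 to match the eval image.
    """
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError as e:
        print(f"parse failure: {e.msg}", file=sys.stderr)
        return False
```

`doraise=True` makes `py_compile` raise on syntax errors instead of printing them, so the helper can gate the bundling step on its return value.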


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL due to Python 3.10 f-string syntax incompatibility at decoded-payload line 289. Static review of the remaining 50,817 bytes of decoded source confirmed the TTT follows the #1416 / #1423 legal score-first-per-chunk pattern — no scored-region SLOT, no `full_key`-with-target n-gram, no multi-epoch val fine-tune. AI tooling: review drafted with Claude Code (Opus) using an internal review template; the batch-9 subagent quota was exhausted mid-batch, so this review was authored in the main session. SHA e037ca92030c8c861eb4ec2aa2b8f230722edb73.
