
Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT — val_bpb 1.07785 (3-seed mean)#1541

Open
bigbag wants to merge 1 commit into openai:main from bigbag:submission/sp8192-improved-parresid

Conversation


@bigbag bigbag commented Apr 11, 2026

Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT

val_bpb = 1.07785 (3-seed mean, std 0.00047) | ~15.99 MB | 8xH100 SXM

3-Seed Results

| Seed | Sliding BPB | TTT BPB | Artifact (bytes) |
|------|-------------|---------|------------------|
| 42   | 1.07880     | 1.07718 | 15,990,780       |
| 314  | 1.07959     | 1.07810 | 15,987,449       |
| 999  | 1.07963     | 1.07826 | 15,987,550       |
| Mean | 1.07934     | 1.07785 | 15,988,593       |
| Std  | 0.00039     | 0.00047 |                  |

Merged SOTA (PR #1493, our previous): 1.0810 BPB. Delta: -0.0032 BPB.

Key Techniques

  1. Improved Parallel Residuals (from PR #1529 by @msisovic, "Record: ParallelResiduals", 1.0753 BPB / 2.7777 nats, -0.0025 BPB / -0.0064 nats vs PR #1523) -- cross-lane routing where attention and MLP outputs route to BOTH lanes via learned scalars. 66 new scalar params (par_post[11,2,2] + par_resid[11,2]). Final output = MLP lane (lane1). Starts at layer 7.

  2. Muon Momentum 0.97 (from PR #1514 by @dexhunter, "Record: SP8192 + Muon 0.97 + Legal Score-First TTT — val_bpb 1.07983 (3-seed mean)") -- reduced from 0.99. The shorter memory horizon (~33 steps) better tracks the rapidly changing loss surface during warmdown.

  3. MATRIX_LR = 0.03 -- re-tuned for momentum 0.97 (higher LR pairs with lower momentum). Sweep: 0.022 → 1.0797, 0.03 → 1.0795, 0.04 → 1.0811.

  4. 3-Layer Depth Recurrence (L3-5, activate at frac=0.35) -- 17 virtual layers from 11 physical.

  5. QK-Gain 5.25 -- monotonic improvement from 4.0 to 5.25.

  6. Legal Score-First TTT -- SGD (lr=0.005, mom=0.9), 3 epochs per 32K-token chunk, cosine LR decay.

  7. SP8192 + GPTQ SDClip -- int6 matrices (k=12.85), int8 embeddings (k=20.0), Brotli-11 compression.

  8. Tuned Hyperparameters -- WD=0.095, EMA=0.9965, warmdown=0.72.
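The cross-lane routing in technique (1) can be sketched as follows. This is an illustrative reconstruction, not the submission's code: the class name, module interface, and init values are assumptions; the PR only specifies the `par_post` / `par_resid` scalar shapes, attention reading lane 0, the MLP reading lane 1, and the final output coming from lane 1.

```python
import torch
import torch.nn as nn

class ImprovedParallelResidualBlock(nn.Module):
    """Sketch: two residual lanes with learned scalar cross-lane routing.

    Attention reads lane 0, the MLP reads lane 1; each module's output is
    written to BOTH lanes via par_post, and each lane's carry-through is
    scaled by par_resid. (Hypothetical reconstruction of the PR's scheme.)
    """

    def __init__(self, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        # par_post[module, lane]: module 0 = attention, module 1 = MLP.
        # Identity-like init (assumed): each module writes only to "its" lane.
        self.par_post = nn.Parameter(torch.tensor([[1.0, 0.0], [0.0, 1.0]]))
        # par_resid[lane]: scale on each lane's residual carry-through.
        self.par_resid = nn.Parameter(torch.ones(2))

    def forward(self, lane0: torch.Tensor, lane1: torch.Tensor):
        a = self.attn(lane0)  # attention reads lane 0
        m = self.mlp(lane1)   # MLP reads lane 1
        new0 = self.par_resid[0] * lane0 + self.par_post[0, 0] * a + self.par_post[1, 0] * m
        new1 = self.par_resid[1] * lane1 + self.par_post[0, 1] * a + self.par_post[1, 1] * m
        return new0, new1     # final model output is taken from lane 1
```

With the identity-like init above the block reduces to two independent residual streams; training the four routing scalars per layer is what lets attention output leak into the MLP lane and vice versa.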

Architecture

11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9,10]. Improved parallel residuals from layer 7: attention reads from lane0, MLP reads from lane1, both outputs route to both lanes via learned par_post and par_resid scalars. Skip gates (sigmoid-gated U-Net connections).
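The depth-recurrence schedule above can be sketched as a simple index mapping; the index lists are taken verbatim from this PR, while the harness function is hypothetical:

```python
# 11 physical blocks are reused to form 17 virtual layers.
ENCODER_SCHEDULE = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER_SCHEDULE = [5, 3, 4, 5, 6, 7, 8, 9, 10]

def run_depth_recurrence(x, blocks):
    """Apply the physical `blocks` in virtual-layer order (sketch)."""
    for phys_idx in ENCODER_SCHEDULE + DECODER_SCHEDULE:
        x = blocks[phys_idx](x)
    return x
```

Note that layers 3, 4, and 5 are each visited multiple times, which is where the "17 virtual layers from 11 physical" claim comes from.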

Compliance (Track B)

Per Issue #1017:

  • Condition 1 (Causality): Sliding-window eval, prefix only
  • Condition 2 (Normalized): Standard softmax, no n-gram/logit bias
  • Condition 3 (Score before update): Each chunk scored under torch.no_grad() BEFORE SGD
  • Condition 4 (Single pass): Each token scored once, no rescoring

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache. All artifacts < 16MB, train < 600s, eval < 600s.

Reproduction

```shell
SEED=42 QK_GAIN_INIT=5.25 MUON_MOMENTUM=0.97 MATRIX_LR=0.03 \
  TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Acknowledgements

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod).

Included Files

  • README.md (this file)
  • submission.json
  • train_gpt.py
  • train_seed42.log
  • train_seed314.log
  • train_seed999.log

🤖 Generated with Claude Code

…Legal TTT — val_bpb 1.07785 (3-seed mean)

3-seed mean: 1.07785 (std 0.00047), seeds 42/314/999
All artifacts under 16MB, training under 600s, eval under 600s
Improved parallel residuals (cross-lane routing), Muon 0.97, MATRIX_LR=0.03
Score-first TTT (SGD 3ep), no SLOT, no pre-quant TTT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bigbag commented Apr 11, 2026

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was essential for running 180+ experiments across Steps 1-23 that led to this result.


dexhunter commented Apr 11, 2026

One small clarification request:

The PR body says "No hash embed", but the attached seed logs show `ttt_hash_embed: True` and `ttt_hash_buckets: 16384` for all three runs.

I could not quickly tell whether that path is just dead / unused code in the current implementation or whether it is actually part of the scored eval path. If it is inactive, it would help to say that explicitly, or set the flag to 0 in the published runs, so the legality story is easier to follow.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 11, 2026
…RA TTT doc-independent legal; BPB bug alert

- PR openai#1541 (bigbag, 1.07785): Improved Parallel Residuals cross-lane + Muon 0.97 — open, hash embed flag pending
- PR openai#1540 (aryanbhosale, 1.0777): VarLen Attention + Doc-Independent LoRA TTT rank-96 (score-first, resets per batch) — appears legal
- PR openai#1539 confirmed illegal (Pre-Quant AdamW TTT, same ruling as openai#771)
- PR openai#1545 BPB double-counting bug: real score ~1.028 claim is ~1.18 actual
- PR openai#758 effectively dead: TTT contradiction + unnormalized n-gram both flagged
- Session 10 lessons: MATRIX_LR=0.03 pairs with Muon 0.97; doc-independent LoRA TTT is adoptable
- No merged SOTA change (still 1.0810); target remains ≤1.0760

https://claude.ai/code/session_01LgqwEDyFnyHsBbyJiSFUjK
@MatoTeziTanka

Community Review — SP8192 + Improved Parallel Residuals + Muon 0.97 + Legal TTT

Thanks @bigbag — as the author of merged SOTA #1493 this is a high-visibility iteration. I have one parse-blocking finding that affects whether this runs on the eval image, and otherwise a clean compliance read.

What I found in the code (head SHA e037ca92030c8c861eb4ec2aa2b8f230722edb73, `records/track_10min_16mb/2026-04-11_SP8192_ImprovedParResid_Muon097_LR03_LegalTTT/train_gpt.py`, decoded from the `import lzma as L, base64 as B` self-extracting shim — 50,817 bytes of actual source):

1. Parse-blocking SyntaxError on Python 3.10

The decoded payload fails to parse on Python 3.10 (the eval image's interpreter version) with:

```
SyntaxError("f-string: expecting '}'", ('<string>', 289, 49,
  '\tfor cat in sorted(categories):log(f"  {cat}: {", ".join(sorted(categories[cat]))}")\n', 289, 50))
```

The offending line uses an f-string whose inner `", ".join(...)` re-enters the same double-quote context:

```python
log(f"  {cat}: {", ".join(sorted(categories[cat]))}")
```

This is valid on Python 3.12+ (PEP 701 relaxed the nested-string rules) but invalid on Python 3.10. The CT2038 container I tested on runs Python 3.10.12, and `importlib.util.spec_from_file_location` fails at parse time, before any runtime logic runs. On the canonical eval image (also Python 3.10 per Issue #17 / the README), this means `train_gpt.py` cannot be imported — not "runs but fails", but "cannot parse".

The cleanest fix is a one-character change — swap the inner `", "` for `', '`:

```python
log(f"  {cat}: {', '.join(sorted(categories[cat]))}")
```

This blocks the submission from running at all on the eval image. It should be the first thing addressed.

2. TTT pattern is LEGAL — for when (1) is fixed

Reading past the parse error via static source inspection, eval_val_ttt at L(decoded) ~1X implements the score-first-per-chunk pattern:

  • `base_model.eval()` + `with torch.no_grad():` scores the windows assigned to chunk `ci`, accumulating into `loss_sum` / `token_count` / `byte_count`
  • `is_last_chunk = ci == num_chunks - 1`; if not the last chunk and `ttt_epochs > 0`, THEN `base_model.train()` and SGD runs on that chunk's tokens
  • Cosine LR decay `h.ttt_lr * 0.5 * (1 + cos(π * ci / max(num_chunks - 1, 1)))` across chunks
  • `chunk_seqs = (chunk_end - chunk_start) // seq_len`, rank-sharded, `ttt_epochs` inner passes with `clip_grad_norm_(ttt_params, 1.0)` + `dist.all_reduce(p.grad, op=ReduceOp.AVG)` for DDP
  • `base_model.eval()` at function exit

This is the #1416 / #1423 legal reference pattern. No `prequant_ttt_adapt_adamw(val_tokens, ...)` call before GPTQ. No scored-region SLOT. No `full_key = (ctx_hash ^ target * primes[k])` n-gram cache. Chunk `ci` is scored under weights adapted only on chunks `0..ci-1`, and the last chunk gets no adaptation.
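The pattern the bullets above describe can be sketched as a minimal single-process harness. This is not the decoded source: the function name and `chunks` interface are hypothetical, and the rank sharding, DDP all-reduce, and byte accounting are omitted for brevity.

```python
import math
import torch

def score_first_ttt(model, chunks, ttt_lr=0.005, ttt_epochs=3, momentum=0.9):
    """Sketch of the score-first-per-chunk TTT pattern.

    `chunks` is a list of (inputs, targets) pairs. Each chunk is scored under
    torch.no_grad() BEFORE any update, and the last chunk receives no
    adaptation, so every token is scored exactly once under weights adapted
    only on earlier chunks.
    """
    ttt_params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(ttt_params, lr=ttt_lr, momentum=momentum)
    loss_sum, token_count = 0.0, 0
    num_chunks = len(chunks)
    for ci, (inp, tgt) in enumerate(chunks):
        model.eval()
        with torch.no_grad():  # Condition 3: score before update
            loss = torch.nn.functional.cross_entropy(model(inp), tgt)
            loss_sum += loss.item() * tgt.numel()
            token_count += tgt.numel()
        if ci < num_chunks - 1 and ttt_epochs > 0:  # last chunk: no adaptation
            # cosine LR decay across chunks
            lr = ttt_lr * 0.5 * (1 + math.cos(math.pi * ci / max(num_chunks - 1, 1)))
            for g in opt.param_groups:
                g["lr"] = lr
            model.train()
            for _ in range(ttt_epochs):
                opt.zero_grad()
                torch.nn.functional.cross_entropy(model(inp), tgt).backward()
                torch.nn.utils.clip_grad_norm_(ttt_params, 1.0)
                opt.step()
    model.eval()
    return loss_sum / max(token_count, 1)
```

The key legality property is visible in the control flow: the `no_grad` scoring of chunk `ci` always precedes the SGD pass on chunk `ci`, and the update is skipped entirely for the final chunk.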

3. Smoke test (CT2038 proteus-engine, 2026-04-11)

```
IMPORT_FAIL error=SyntaxError("f-string: expecting '}'", ...line 289...)
```

Parse-blocked by (1). Once (1) is fixed the structural code is the legal reference shape and should smoke-test-pass on Python 3.10.

Verdict

NEEDS AUTHOR ACTION — parse-blocking syntax error on Python 3.10 at decoded-payload line 289 in the logging helper. One-character fix (`", "` → `', '`). Once that lands, the TTT path is the legal score-first-per-chunk pattern and the submission should be reviewable as a clean SOTA iteration.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: NEEDS AUTHOR ACTION. The core technique is clean; the submission is just gated on a Python-version compatibility fix that trivially falls out of a local `python3 -c "import py_compile; py_compile.compile('train_gpt.py')"` check before bundling. I'd suggest adding that as a pre-commit step for the self-extracting shim workflow, since PRs #1523 and #1541 both land with the same class of bug at the same logging-helper line pattern — this looks like shared source across your SOTA iteration branch.
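The suggested pre-bundle check can be wrapped as a small helper (hypothetical name; run it under the same Python 3.10 interpreter as the eval image so that version-dependent syntax such as PEP 701 f-strings is caught):

```python
import py_compile
import sys

def check_parses(path: str) -> bool:
    """Return True iff `path` parses under the running interpreter.

    Intended as a pre-commit / pre-bundle gate for the self-extracting shim
    workflow: run it under Python 3.10 to match the eval image.
    """
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError as e:
        print(f"parse failure: {e.msg}", file=sys.stderr)
        return False
```

`doraise=True` makes `py_compile` raise on syntax errors instead of printing them, so the helper can gate the bundling step on its return value.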


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL due to Python 3.10 f-string syntax incompatibility at decoded-payload line 289. Static review of the remaining 50,817 bytes of decoded source confirmed the TTT follows the #1416 / #1423 legal score-first-per-chunk pattern — no scored-region SLOT, no `full_key`-with-target n-gram, no multi-epoch val fine-tune. AI tooling: review drafted with Claude Code (Opus) using an internal review template; the batch-9 subagent quota was exhausted mid-batch, so this review was authored in the main session. SHA e037ca92030c8c861eb4ec2aa2b8f230722edb73.
