
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)#1492

Closed
bigbag wants to merge 3 commits into openai:main from bigbag:submission/sp8192-3recur-parresid-qk525-ttt

Conversation

@bigbag bigbag commented Apr 9, 2026

Summary

  • val_bpb = 1.0810 (3-seed mean, std 0.0002) | ~15.99 MB | 8×H100 SXM
  • SP8192 + 3-layer depth recurrence (L3-5) + parallel residuals (L7+) + QK-Gain 5.25 + legal score-first TTT
  • No SLOT, no pre-quant TTT, no n-gram cache, no ETLB — fully compliant

3-Seed Results

| Seed | Sliding bpb | TTT bpb | Artifact (bytes) |
|------|-------------|---------|------------------|
| 42   | 1.0829      | 1.0808  | 15,991,930       |
| 314  | 1.0827      | 1.0810  | 15,992,919       |
| 999  | 1.0826      | 1.0812  | 15,992,919       |
| Mean | 1.0827      | 1.0810  | 15,992,589       |
| Std  | 0.0002      | 0.0002  |                  |

Merged SOTA (PR #1019): 1.1147 bpb. Delta: −0.0337 bpb.

Key Techniques

  1. SP8192 + GPTQ SDClip — int6 matrices (k=12.85), int8 embeddings (k=20.0), zero pruning (PR #1394 @clarkkev)
  2. 3-Layer Depth Recurrence (L3-5, activated at 0.35) — 17 virtual layers from 11 physical
  3. Parallel Residuals (L7+) — GPT-J style (PR #1412 @Robby955, PR #1204 @msisovic)
  4. QK-Gain 5.25 — monotonic improvement from 4.0 → 5.0 → 5.25
  5. Legal Score-First TTT — SGD (lr=0.005, momentum=0.9), 3 epochs, cosine decay (PR #549 @abaybektursun, PR #1413 @dexhunter)
  6. Tuned Hyperparameters — WD=0.095, MLR=0.022, EMA=0.9965, warmdown=0.72 (PR #1445 @X-Abhishek-X)
  7. LZMA code wrapper — 16.6 KB code footprint
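As a rough illustration of technique 2, the depth recurrence amounts to replaying a span of physical layers during the forward pass. A minimal sketch of the layer schedule (the function name and scheduling details are hypothetical; the PR does not publish its implementation):

```python
def layer_schedule(n_physical=11, recur=(3, 4, 5), extra_passes=2):
    """Order in which physical layers run when a span is replayed.

    With 11 physical layers and layers 3-5 replayed twice more,
    the model executes 11 + 2*3 = 17 virtual layers.
    """
    order = []
    for i in range(n_physical):
        order.append(i)
        if i == max(recur):  # after the recurrent span, replay it
            for _ in range(extra_passes):
                order.extend(recur)
    return order

print(len(layer_schedule()))  # 17 virtual layers from 11 physical
```

Weight sharing across the replayed span keeps the parameter count (and hence the artifact size) at the 11-layer level while adding effective depth.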

Compliance (Track B)

Per Issue #1017:

  • Condition 1 (Causality): Sliding-window eval, prefix only
  • Condition 2 (Normalized): Standard softmax, no n-gram/logit bias
  • Condition 3 (Score before update): Each chunk scored under torch.no_grad() BEFORE SGD
  • Condition 4 (Single pass): Each token scored once, no rescoring

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache. All artifacts < 16MB, train < 600s, eval < 600s.
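The score-first constraint (Conditions 3 and 4) can be sketched on a toy model. This hypothetical example uses plain SGD-with-momentum on a linear least-squares objective (the actual model, loss, and cosine-decay schedule in the PR differ); the key invariant is that each chunk is scored with weights that no update on that chunk has touched:

```python
import numpy as np

def grad_and_loss(w, x, y):
    r = x @ w - y
    return 2 * x.T @ r / len(y), float(np.mean(r ** 2))

def score_first_ttt(chunks, w, lr=0.005, momentum=0.9, epochs=3):
    """Score each chunk BEFORE adapting on it (score-first, single pass)."""
    losses, v = [], np.zeros_like(w)
    for x, y in chunks:
        _, loss = grad_and_loss(w, x, y)   # Condition 3: frozen-weight score
        losses.append(loss)                # Condition 4: scored exactly once
        for _ in range(epochs):            # only now adapt on this chunk
            g, _ = grad_and_loss(w, x, y)
            v = momentum * v - lr * g
            w = w + v
    return float(np.mean(losses)), w

rng = np.random.default_rng(0)
w_true = rng.normal(size=4)
chunks = [(x := rng.normal(size=(32, 4)), x @ w_true) for _ in range(20)]
adapted, _ = score_first_ttt(chunks, np.zeros(4))
frozen, _ = score_first_ttt(chunks, np.zeros(4), epochs=0)
print(adapted < frozen)  # adaptation on earlier chunks helps later ones
```

Later chunks benefit from updates made on earlier chunks, which is legal; what would be illegal is rescoring a chunk after updating on it.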

Credits

PR #1394 @clarkkev, PR #1413 @dexhunter, PR #549 @abaybektursun, PR #1412 @Robby955, PR #1204 @msisovic, PR #1445 @X-Abhishek-X, PR #1331 @dexhunter

Acknowledgements

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was instrumental in running 160+ experiments that led to this result.

Reproduction

SEED=42 QK_GAIN_INIT=5.25 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
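The `QK_GAIN_INIT` flag above presumably seeds a scalar gain on the query-key logits. A hypothetical single-head sketch, assuming the gain multiplies cosine-normalized QK scores (QK-norm style; the PR does not document the exact placement):

```python
import numpy as np

def qk_gain_attention(q, k, v, gain=5.25):
    """Single-head attention with a scalar gain on normalized QK logits."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = gain * (qn @ kn.T)                   # cosine similarity * gain
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 16)) for _ in range(3))
out = qk_gain_attention(q, k, v)
print(out.shape)  # (8, 16)
```

Because normalized logits are bounded in [−1, 1], the gain sets the effective softmax temperature; a larger gain sharpens the attention distribution, which is consistent with the reported monotonic gains from 4.0 to 5.25.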

Test plan

  • 3-seed validation (42, 314, 999)
  • All artifacts under 16,000,000 bytes
  • Training under 600s (588s actual)
  • Eval (sliding + TTT) under 600s (~500s actual)
  • Score-first TTT: compliant with conditions 1-4 of Issue #1017 ("A Field Guide to Valid Submissions")
  • No SLOT, no pre-quant TTT, no ETLB, no n-gram cache

🤖 Generated with Claude Code

Pavel Liashkov and others added 3 commits March 22, 2026 23:41
…25 + Legal TTT — val_bpb 1.0810 (3-seed mean)

3-seed mean: 1.0810 (std 0.0002), seeds 42/314/999
All artifacts under 16MB, training under 600s, eval under 600s
Score-first TTT (SGD 3ep, cosine decay), no SLOT, no pre-quant TTT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bigbag commented Apr 9, 2026

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was essential for running the experiments that led to this result. The grant covered ~320 compute hours across 160+ experiments over Steps 1-22 of our optimization journey.


bigbag commented Apr 9, 2026

Closing — PR includes unrelated files from working branch. Will resubmit clean.

@bigbag bigbag closed this Apr 9, 2026
@bigbag bigbag deleted the submission/sp8192-3recur-parresid-qk525-ttt branch April 9, 2026 07:18