
Non-record: HybridMamba-11 — First SSM with Parallel Associative Scan #1365

Open

PersusUS wants to merge 2 commits into openai:main from PersusUS:hybridmamba-11

Conversation

@PersusUS PersusUS commented Apr 4, 2026

Summary

Current Status

Preliminary only — 2.12 bpb on 1xH100 / 312 steps (per-block compile, no fullgraph). Not competitive yet: without fullgraph compile, only ~300 steps complete in 600 s vs ~20k for SOTA. The parallel scan is implemented and tested, but validating the inductor backend needs an H100 with Triton (not available on Windows).

Requesting compute credits to validate fullgraph compile on H100 and run 8xH100 competition-scale experiments.

Key Technical Details

| Component | Detail |
| --- | --- |
| Architecture | 11 layers: Mamba (0, 2, 4, 6, 8) + Transformer (1, 3, 5, 7, 9, 10) |
| Parallel scan | Hillis-Steele, pure PyTorch, O(L log L) work / O(log L) depth |
| Compile | `fullgraph=True` on the whole model (parallel scan is dynamo-traceable) |
| Banks | Only for the 6 Transformer layers (12.5M dead params eliminated) |
| Quantization | int6 mixed + LZMA; 13.6 MB artifact |
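To make the scan row concrete: the recurrence `h[t] = A[t] * h[t-1] + B[t] * u[t]` can be computed as an inclusive scan over pairs `(A[t], B[t]*u[t])` with the associative combine `(a2, b2) ∘ (a1, b1) = (a2*a1, a2*b1 + b2)`. Below is a minimal Hillis-Steele sketch of that idea in pure PyTorch; the function name and shapes are illustrative assumptions, not the PR's actual code.

```python
import torch

def hillis_steele_linear_scan(A, Bu):
    """Inclusive scan of h[t] = A[t] * h[t-1] + Bu[t] along dim 0 (h[-1] = 0).

    Hillis-Steele: O(L log L) work, O(log L) depth. The only Python control
    flow is a fixed log2(L)-length loop over tensor ops, which is the kind
    of structure dynamo can trace for torch.compile(fullgraph=True).
    """
    L = A.shape[0]
    a, b = A.clone(), Bu.clone()
    d = 1
    while d < L:
        # Combine each position with the partial result d steps back.
        # Positions t < d combine with the identity element (1, 0).
        a_prev = torch.cat([torch.ones_like(a[:d]), a[:-d]], dim=0)
        b_prev = torch.cat([torch.zeros_like(b[:d]), b[:-d]], dim=0)
        a, b = a * a_prev, a * b_prev + b
        d *= 2
    return b  # b[t] == h[t]
```

Each of the ~log2(L) iterations doubles the span of history folded into every position, which is where the O(log L) depth claim in the table comes from.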

Test plan

  • Correctness: parallel scan matches sequential across all lengths 1-2048
  • Gradients: match between parallel and sequential implementations
  • FP16 stability: no NaN/Inf
  • torch.compile(fullgraph=True): dynamo traces successfully (eager backend)
  • Integration: full MambaSSM forward + backward
  • Triton/inductor backend on H100 (needs RunPod credits)
  • 8xH100 DDP full competition run
  • BPB competitive with SOTA

🤖 Generated with Claude Code

…iative scan

Hybrid Mamba SSM + Transformer architecture (5 Mamba + 6 Transformer layers)
with Hillis-Steele parallel associative scan enabling torch.compile(fullgraph=True).

Key contributions:
- Parallel scan replaces sequential loop: O(log L) depth, 48x speedup
- torch.compile(fullgraph=True) compatible (dynamo traces successfully)
- Parameter banks restructured for hybrid layers (-12.5M dead params)
- 31.8M params, 13.6MB artifact (under 16MB cap)

Preliminary: 2.12 bpb on 1xH100/312 steps (not competitive yet — needs
fullgraph compile on 8xH100 to reach 15-20k steps for competitive BPB).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 4, 2026 23:53
Contributor

Copilot AI left a comment


Pull request overview

Adds a non-record submission entry documenting HybridMamba-11, a hybrid Mamba SSM + Transformer model that introduces a torch.compile(fullgraph=True)-compatible parallel associative scan to unblock SSM training throughput in this repo.

Changes:

  • Adds a new non-record submission JSON with metrics and artifact size.
  • Adds a detailed README describing the architecture, the parallel scan approach, and reproducibility steps.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `records/track_non_record_16mb/2026-04-05_HybridMamba11_SSM_ParallelAssociativeScan/submission.json` | Registers the non-record submission metadata (date, metrics, artifact bytes). |
| `records/track_non_record_16mb/2026-04-05_HybridMamba11_SSM_ParallelAssociativeScan/README.md` | Documents the motivation, approach, validation results, and reproduction commands for the submission. |



## The Solution: Parallel Associative Scan

The SSM recurrence `h[t] = A[t] * h[t-1] + B[t] * u[t]` is a **linear recurrence** — parallelizable via Blelloch-style prefix scan since 1990. I implemented a Hillis-Steele parallel associative scan in pure PyTorch:

Copilot AI Apr 4, 2026


This sentence mixes two different scan algorithms/citations (Blelloch vs Hillis–Steele) in a way that’s internally inconsistent. Suggestion (mandatory for clarity): either (a) describe it consistently as Hillis–Steele (and cite Hillis–Steele), or (b) describe it as Blelloch scan and ensure the implementation/text matches that algorithm. This helps readers accurately assess work/depth claims and compare with other scan implementations.

Suggested change
The SSM recurrence `h[t] = A[t] * h[t-1] + B[t] * u[t]` is a **linear recurrence** — parallelizable via Blelloch-style prefix scan since 1990. I implemented a Hillis-Steele parallel associative scan in pure PyTorch:
The SSM recurrence `h[t] = A[t] * h[t-1] + B[t] * u[t]` is a **linear recurrence** — parallelizable via a Hillis-Steele-style prefix scan. I implemented a Hillis-Steele parallel associative scan in pure PyTorch:

The implementation is Hillis-Steele (1986), not Blelloch (1990).
Fixed all references for consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: HybridMamba-11 — First SSM with Parallel Associative Scan

BPB: (not parsed — see PR title) | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 5452b445b580, file records/track_non_record_16mb/2026-04-05_HybridMamba11_SSM_ParallelAssociativeScan/train_gpt.py):

The TTT path at line 1312 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
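The score-first-per-chunk shape described above can be sketched as follows. This is an illustrative reconstruction of the pattern, not the PR's `train_gpt.py`: `score_first_ttt`, the chunk layout, and a model whose forward returns a scalar loss are all assumptions made for the example.

```python
import torch

def score_first_ttt(base_model, chunks, lr=1e-4):
    """Score-first-per-chunk TTT sketch: chunk ci is scored under weights
    adapted only on chunks 0..ci-1, then (unless it is the last chunk) the
    model takes one SGD step on that same chunk. Assumes base_model(chunk)
    returns a scalar loss; names are hypothetical.
    """
    opt = torch.optim.SGD(base_model.parameters(), lr=lr)
    total_loss = 0.0
    for ci, chunk in enumerate(chunks):
        base_model.eval()
        with torch.no_grad():
            total_loss += base_model(chunk).item()  # score BEFORE adapting
        is_last_chunk = ci == len(chunks) - 1
        if not is_last_chunk:  # final chunk gets no adaptation pass
            base_model.train()
            opt.zero_grad()
            base_model(chunk).backward()
            opt.step()
    return total_loss / len(chunks)
```

The key legality property is the ordering inside the loop: the `torch.no_grad()` scoring of chunk `ci` happens before any update on `ci`, and the `is_last_chunk` guard ensures no adaptation happens after the final score.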

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=106121 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
