Non-record: HybridMamba-11 — First SSM with Parallel Associative Scan #1365
PersusUS wants to merge 2 commits into `openai:main` from …
Conversation
…iative scan

Hybrid Mamba SSM + Transformer architecture (5 Mamba + 6 Transformer layers) with a Hillis-Steele parallel associative scan enabling `torch.compile(fullgraph=True)`. Key contributions:

- Parallel scan replaces the sequential loop: O(log L) depth, 48x speedup
- `torch.compile(fullgraph=True)` compatible (dynamo traces successfully)
- Parameter banks restructured for hybrid layers (-12.5M dead params)
- 31.8M params, 13.6MB artifact (under the 16MB cap)

Preliminary: 2.12 bpb on 1xH100 / 312 steps (not competitive yet — needs fullgraph compile on 8xH100 to reach 15-20k steps for competitive BPB).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Adds a non-record submission entry documenting HybridMamba-11, a hybrid Mamba SSM + Transformer model that introduces a torch.compile(fullgraph=True)-compatible parallel associative scan to unblock SSM training throughput in this repo.
Changes:
- Adds a new non-record submission JSON with metrics and artifact size.
- Adds a detailed README describing the architecture, the parallel scan approach, and reproducibility steps.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| records/track_non_record_16mb/2026-04-05_HybridMamba11_SSM_ParallelAssociativeScan/submission.json | Registers the non-record submission metadata (date, metrics, artifact bytes). |
| records/track_non_record_16mb/2026-04-05_HybridMamba11_SSM_ParallelAssociativeScan/README.md | Documents the motivation, approach, validation results, and reproduction commands for the submission. |
> ## The Solution: Parallel Associative Scan
>
> The SSM recurrence `h[t] = A[t] * h[t-1] + B[t] * u[t]` is a **linear recurrence** — parallelizable via Blelloch-style prefix scan since 1990. I implemented a Hillis-Steele parallel associative scan in pure PyTorch:
This sentence mixes two different scan algorithms/citations (Blelloch vs Hillis–Steele) in a way that’s internally inconsistent. Suggestion (mandatory for clarity): either (a) describe it consistently as Hillis–Steele (and cite Hillis–Steele), or (b) describe it as Blelloch scan and ensure the implementation/text matches that algorithm. This helps readers accurately assess work/depth claims and compare with other scan implementations.
Suggested change:

- The SSM recurrence `h[t] = A[t] * h[t-1] + B[t] * u[t]` is a **linear recurrence** — parallelizable via Blelloch-style prefix scan since 1990. I implemented a Hillis-Steele parallel associative scan in pure PyTorch:
+ The SSM recurrence `h[t] = A[t] * h[t-1] + B[t] * u[t]` is a **linear recurrence** — parallelizable via a Hillis-Steele-style prefix scan. I implemented a Hillis-Steele parallel associative scan in pure PyTorch:
The implementation is Hillis-Steele (1986), not Blelloch (1990). Fixed all references for consistency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
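For readers outside the repo, the scan being discussed can be sketched in a few lines of plain Python. This is a scalar toy version under my own assumptions, not the PR's batched PyTorch implementation, and the function names are mine:

```python
def sequential_scan(A, B, u, h0=0.0):
    # Reference loop: h[t] = A[t] * h[t-1] + B[t] * u[t]
    h, out = h0, []
    for a, b, x in zip(A, B, u):
        h = a * h + b * x
        out.append(h)
    return out


def hillis_steele_scan(A, B, u, h0=0.0):
    # Each timestep is the affine map h -> a*h + c with c = B[t]*u[t].
    # Composing (a1, c1) then (a2, c2) yields (a2*a1, a2*c1 + c2),
    # which is associative, so an inclusive prefix scan applies.
    elems = [(a, b * x) for a, b, x in zip(A, B, u)]
    d = 1
    while d < len(elems):
        nxt = list(elems)  # double buffer: read old values, write new ones
        for i in range(d, len(elems)):
            a1, c1 = elems[i - d]
            a2, c2 = elems[i]
            nxt[i] = (a2 * a1, a2 * c1 + c2)
        elems = nxt
        d *= 2
    # elems[t] is now the composed map over steps 0..t; apply it to h0.
    return [a * h0 + c for a, c in elems]
```

The point is that the combine step `(a2*a1, a2*c1 + c2)` is associative, so the outer `while` loop needs only ceil(log2 L) passes, and each pass is elementwise over the sequence, which is the shape of loop a batched tensor version can express without data-dependent Python control flow.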
Community Review — Non-record: HybridMamba-11 — First SSM with Parallel Associative Scan

BPB: (not parsed — see PR title) | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA …): the TTT path at line 1312 implements the score-first-per-chunk pattern: each chunk is scored before the adapter updates on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=106121 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based …
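As a reading aid, the "score-first-per-chunk" ordering described in the review above can be sketched as follows. All names here (`score`, `adapter_update`, `ttt_eval`) are hypothetical stand-ins of mine, not this PR's functions:

```python
def ttt_eval(chunks, model, score, adapter_update):
    """Score-first-per-chunk test-time training: every chunk is scored
    under the current adapter state *before* the adapter trains on it,
    so no token is ever evaluated by weights that have already seen it."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n = score(model, chunk)   # evaluate first (no leakage)
        total_loss += loss * n
        total_tokens += n
        adapter_update(model, chunk)    # then adapt on the just-scored chunk
    return total_loss / total_tokens
```

The legality argument hinges entirely on the call order inside the loop: swapping the two calls (update before score) would let the adapter see targets before they are evaluated.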
Summary
`torch.compile(fullgraph=True)` compatible — the barrier that killed all prior SSM attempts (Issue #140 "Parameter Golf — Formerly Live AI Commentary ⛳ + Analysis / Ideas — every 10 minutes. Now disabled", PR #831 "Research: Why Novel Architectures Fail at 16MB — Throughput-Quantization Co-optimization")

Current Status
Preliminary only — 2.12 bpb on 1xH100 / 312 steps (per-block compile, no fullgraph). Not competitive yet because without fullgraph compile only ~300 steps complete in 600s vs ~20k for SOTA. The parallel scan is implemented and tested but needs H100 with Triton to validate the inductor backend (not available on Windows).
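For intuition on the step budget: a sequential recurrence over length L issues L dependent steps, while a Hillis-Steele scan needs only ceil(log2 L) passes. A toy depth calculation (the sequence length 2048 is my assumption for illustration; the PR does not state one):

```python
import math

def hillis_steele_depth(L):
    # Number of dependent passes in a Hillis-Steele inclusive scan of length L.
    return max(0, math.ceil(math.log2(L)))

L = 2048  # hypothetical sequence length, not stated in this PR
print(f"{L} dependent steps sequentially vs {hillis_steele_depth(L)} scan passes")
```

Actual wall-clock gains depend on kernel launch overhead and memory traffic, so this depth ratio is an upper bound on the achievable speedup, not a prediction of the reported 48x.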
Requesting compute credits to validate fullgraph compile on H100 and run 8xH100 competition-scale experiments.
Key Technical Details
Test plan
🤖 Generated with Claude Code