
Non-record: HybridMamba-11 — First SSM with Parallel Associative Scan #1365

Open

PersusUS wants to merge 2 commits into openai:main from PersusUS:hybridmamba-11

Conversation

@PersusUS PersusUS commented Apr 4, 2026

Summary

Current Status

Preliminary only — 2.12 bpb on 1xH100 / 312 steps (per-block compile, no fullgraph). Not competitive yet: without fullgraph compile, only ~300 steps complete in 600 s vs ~20k for SOTA. The parallel scan is implemented and tested, but validating the inductor backend needs an H100 with Triton (not available on Windows).

Requesting compute credits to validate fullgraph compile on H100 and run 8xH100 competition-scale experiments.

Key Technical Details

| Component | Detail |
| --- | --- |
| Architecture | 11 layers: Mamba (0, 2, 4, 6, 8) + Transformer (1, 3, 5, 7, 9, 10) |
| Parallel scan | Hillis-Steele, pure PyTorch, O(L log L) work / O(log L) depth |
| Compile | `fullgraph=True` on the whole model (parallel scan is dynamo-traceable) |
| Banks | Only for the 6 Transformer layers (12.5M dead params eliminated) |
| Quantization | int6 mixed + LZMA; 13.6 MB artifact |
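To make the scan row concrete: the recurrence `h[t] = A[t] * h[t-1] + B[t] * u[t]` can be computed as an inclusive scan over pairs `(A[t], B[t]*u[t])` with the associative combine `(a2, b2) ∘ (a1, b1) = (a2*a1, a2*b1 + b2)`. Below is a minimal Hillis-Steele sketch of that idea in pure PyTorch; the function name and shapes are illustrative assumptions, not the PR's actual code.

```python
import torch

def hillis_steele_linear_scan(A, Bu):
    """Inclusive scan of h[t] = A[t] * h[t-1] + Bu[t] along dim 0 (h[-1] = 0).

    Hillis-Steele: O(L log L) work, O(log L) depth. The only Python control
    flow is a fixed log2(L)-length loop over tensor ops, which is the kind
    of structure dynamo can trace for torch.compile(fullgraph=True).
    """
    L = A.shape[0]
    a, b = A.clone(), Bu.clone()
    d = 1
    while d < L:
        # Combine each position with the partial result d steps back.
        # Positions t < d combine with the identity element (1, 0).
        a_prev = torch.cat([torch.ones_like(a[:d]), a[:-d]], dim=0)
        b_prev = torch.cat([torch.zeros_like(b[:d]), b[:-d]], dim=0)
        a, b = a * a_prev, a * b_prev + b
        d *= 2
    return b  # b[t] == h[t]
```

Each of the ~log2(L) iterations doubles the span of history folded into every position, which is where the O(log L) depth claim in the table comes from.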

Test plan

  • Correctness: parallel scan matches sequential across all lengths 1-2048
  • Gradients: match between parallel and sequential implementations
  • FP16 stability: no NaN/Inf
  • torch.compile(fullgraph=True): dynamo traces successfully (eager backend)
  • Integration: full MambaSSM forward + backward
  • Triton/inductor backend on H100 (needs RunPod credits)
  • 8xH100 DDP full competition run
  • BPB competitive with SOTA

🤖 Generated with Claude Code

…iative scan

Hybrid Mamba SSM + Transformer architecture (5 Mamba + 6 Transformer layers)
with Hillis-Steele parallel associative scan enabling torch.compile(fullgraph=True).

Key contributions:
- Parallel scan replaces sequential loop: O(log L) depth, 48x speedup
- torch.compile(fullgraph=True) compatible (dynamo traces successfully)
- Parameter banks restructured for hybrid layers (-12.5M dead params)
- 31.8M params, 13.6MB artifact (under 16MB cap)

Preliminary: 2.12 bpb on 1xH100/312 steps (not competitive yet — needs
fullgraph compile on 8xH100 to reach 15-20k steps for competitive BPB).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 4, 2026 23:53
Contributor

Copilot AI left a comment


Pull request overview

Adds a non-record submission entry documenting HybridMamba-11, a hybrid Mamba SSM + Transformer model that introduces a torch.compile(fullgraph=True)-compatible parallel associative scan to unblock SSM training throughput in this repo.

Changes:

  • Adds a new non-record submission JSON with metrics and artifact size.
  • Adds a detailed README describing the architecture, the parallel scan approach, and reproducibility steps.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `records/track_non_record_16mb/2026-04-05_HybridMamba11_SSM_ParallelAssociativeScan/submission.json` | Registers the non-record submission metadata (date, metrics, artifact bytes). |
| `records/track_non_record_16mb/2026-04-05_HybridMamba11_SSM_ParallelAssociativeScan/README.md` | Documents the motivation, approach, validation results, and reproduction commands for the submission. |



## The Solution: Parallel Associative Scan

The SSM recurrence `h[t] = A[t] * h[t-1] + B[t] * u[t]` is a **linear recurrence** — parallelizable via Blelloch-style prefix scan since 1990. I implemented a Hillis-Steele parallel associative scan in pure PyTorch:

Copilot AI Apr 4, 2026


This sentence mixes two different scan algorithms/citations (Blelloch vs Hillis–Steele) in a way that’s internally inconsistent. Suggestion (mandatory for clarity): either (a) describe it consistently as Hillis–Steele (and cite Hillis–Steele), or (b) describe it as Blelloch scan and ensure the implementation/text matches that algorithm. This helps readers accurately assess work/depth claims and compare with other scan implementations.

Suggested change
The SSM recurrence `h[t] = A[t] * h[t-1] + B[t] * u[t]` is a **linear recurrence** — parallelizable via Blelloch-style prefix scan since 1990. I implemented a Hillis-Steele parallel associative scan in pure PyTorch:
The SSM recurrence `h[t] = A[t] * h[t-1] + B[t] * u[t]` is a **linear recurrence** — parallelizable via a Hillis-Steele-style prefix scan. I implemented a Hillis-Steele parallel associative scan in pure PyTorch:

The implementation is Hillis-Steele (1986), not Blelloch (1990).
Fixed all references for consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: HybridMamba-11 — First SSM with Parallel Associative Scan

BPB: (not parsed — see PR title) | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 5452b445b580, file records/track_non_record_16mb/2026-04-05_HybridMamba11_SSM_ParallelAssociativeScan/train_gpt.py):

The TTT path at line 1312 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
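The score-first-per-chunk shape described above can be sketched as follows. This is an illustrative reconstruction of the pattern, not the PR's `train_gpt.py`: `score_first_ttt`, the chunk layout, and a model whose forward returns a scalar loss are all assumptions made for the example.

```python
import torch

def score_first_ttt(base_model, chunks, lr=1e-4):
    """Score-first-per-chunk TTT sketch: chunk ci is scored under weights
    adapted only on chunks 0..ci-1, then (unless it is the last chunk) the
    model takes one SGD step on that same chunk. Assumes base_model(chunk)
    returns a scalar loss; names are hypothetical.
    """
    opt = torch.optim.SGD(base_model.parameters(), lr=lr)
    total_loss = 0.0
    for ci, chunk in enumerate(chunks):
        base_model.eval()
        with torch.no_grad():
            total_loss += base_model(chunk).item()  # score BEFORE adapting
        is_last_chunk = ci == len(chunks) - 1
        if not is_last_chunk:  # final chunk gets no adaptation pass
            base_model.train()
            opt.zero_grad()
            base_model(chunk).backward()
            opt.step()
    return total_loss / len(chunks)
```

The key legality property is the ordering inside the loop: the `torch.no_grad()` scoring of chunk `ci` happens before any update on `ci`, and the `is_last_chunk` guard ensures no adaptation happens after the final score.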

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=106121 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
