experiment: M4 Pro 24GB — 3.50 tok/s at 4-bit, architecture confirmed on 24GB#21
Open
JackCid89 wants to merge 1 commit into danveloper:main from
Conversation
Verified flash-moe on a MacBook Pro M4 Pro with 24GB unified memory (half the RAM and GPU cores of the original M3 Max 48GB machine).

Results:
- 4-bit experts: 3.50 tok/s steady-state, TTFT 4613 ms
- Only ~20% slower despite halved memory bandwidth (~273 vs ~400 GB/s)
- OS page cache ~14GB (vs ~35GB on the 48GB machine) still effective
- No code changes required — architecture scales down without modification

Also documents the GPT-2 byte-to-unicode decoding fix for vocab.bin: `export_tokenizer.py` must reverse the BPE encoding (Ġ → space, Ċ → newline) when building vocab.bin, otherwise raw BPE unicode leaks into the output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What
Ran flash-moe on a MacBook Pro M4 Pro with 24GB unified memory and documented the results.
Results
4-bit experts reached 3.50 tok/s steady-state with a TTFT of 4613 ms: only ~20% slower than the M3 Max 48GB baseline despite half the RAM, half the GPU cores, and roughly two-thirds the memory bandwidth (~273 vs ~400 GB/s). The OS page cache, with ~14GB available (vs ~35GB on the 48GB machine), is still effective. No code changes were required; the architecture scales down to 24GB unified memory without modification.
Also documents:
vocab.bin GPT-2 byte decoding bug: `export_tokenizer.py` must reverse the GPT-2 byte-to-unicode encoding when building vocab.bin. Without this, raw BPE symbols leak into output (`Ġ` instead of space, `Ċ` instead of newline). The fix is to apply the `bytes_to_unicode()` reverse mapping when writing each token string.

Changes
- CLAUDE.md / README.md: add M4 Pro 24GB to the Hardware section; document `export_tokenizer.py` in the project structure; add the vocab.bin fix to the What We Tried table
- results.tsv + metal_infer/results.tsv: add the M4 Pro experiment row

🤖 Generated with Claude Code
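For reference, the byte-decoding fix described above can be sketched as follows. The `bytes_to_unicode()` helper mirrors GPT-2's published tokenizer code; the `decode_token` wrapper and its use inside `export_tokenizer.py` are an illustration of the approach, not the PR's actual implementation.

```python
def bytes_to_unicode():
    """GPT-2's bijection from raw bytes to printable unicode characters."""
    # Printable ASCII plus two Latin-1 ranges map to themselves...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # ...while unprintable bytes (space, newline, ...) are shifted
            # up past 255, which is where Ġ (space) and Ċ (newline) come from.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

# Reverse mapping: BPE unicode character -> original byte.
UNICODE_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}

def decode_token(token: str) -> str:
    """Undo the BPE byte encoding, e.g. turning 'Ġ' back into a space."""
    raw = bytes(UNICODE_TO_BYTE[c] for c in token)
    return raw.decode("utf-8", errors="replace")
```

Applying `decode_token` to each token string before writing it to vocab.bin is what keeps `Ġ` and `Ċ` out of the generated text, e.g. `decode_token("Ġhello")` yields `" hello"`.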