experiment: M4 Pro 24GB — 3.50 tok/s at 4-bit, architecture confirmed on 24GB#21
Open
JackCid89 wants to merge 1 commit into danveloper:main from
Conversation
Verified flash-moe on a MacBook Pro M4 Pro with 24GB unified memory (half the RAM and GPU cores of the original M3 Max 48GB machine).

Results:
- 4-bit experts: 3.50 tok/s steady-state, TTFT 4613 ms
- Only ~20% slower despite halved memory bandwidth (~273 vs ~400 GB/s)
- OS page cache ~14GB (vs ~35GB on the 48GB machine) still effective
- No code changes required — architecture scales down without modification

Also documents the GPT-2 byte-to-unicode decoding fix for vocab.bin: `export_tokenizer.py` must reverse the BPE encoding (Ġ → space, Ċ → newline) when building vocab.bin, otherwise raw BPE unicode leaks into the output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What
Ran flash-moe on a MacBook Pro M4 Pro with 24GB unified memory and documented the results.
Results
4-bit experts reached 3.50 tok/s steady-state with a TTFT of 4613 ms: only ~20% slower than the M3 Max 48GB baseline despite half the RAM, half the GPU cores, and roughly two-thirds the memory bandwidth (~273 vs ~400 GB/s). The OS page cache, with ~14GB available (vs ~35GB on the 48GB machine), is still effective. No code changes were required; the architecture scales down to 24GB unified memory without modification.
Also documents:
vocab.bin GPT-2 byte decoding bug: `export_tokenizer.py` must reverse the GPT-2 byte-to-unicode encoding when building vocab.bin. Without this, raw BPE symbols leak into output (`Ġ` instead of space, `Ċ` instead of newline). The fix is to apply the `bytes_to_unicode()` reverse mapping when writing each token string.

Changes
- CLAUDE.md / README.md: add M4 Pro 24GB to the Hardware section; document `export_tokenizer.py` in the project structure; add the vocab.bin fix to the What We Tried table
- results.tsv + metal_infer/results.tsv: add the M4 Pro experiment row

🤖 Generated with Claude Code
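For reference, the byte-decoding fix described above can be sketched as follows. The `bytes_to_unicode()` helper mirrors GPT-2's published tokenizer code; the `decode_token` wrapper and its use inside `export_tokenizer.py` are an illustration of the approach, not the PR's actual implementation.

```python
def bytes_to_unicode():
    """GPT-2's bijection from raw bytes to printable unicode characters."""
    # Printable ASCII plus two Latin-1 ranges map to themselves...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # ...while unprintable bytes (space, newline, ...) are shifted
            # up past 255, which is where Ġ (space) and Ċ (newline) come from.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

# Reverse mapping: BPE unicode character -> original byte.
UNICODE_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}

def decode_token(token: str) -> str:
    """Undo the BPE byte encoding, e.g. turning 'Ġ' back into a space."""
    raw = bytes(UNICODE_TO_BYTE[c] for c in token)
    return raw.decode("utf-8", errors="replace")
```

Applying `decode_token` to each token string before writing it to vocab.bin is what keeps `Ġ` and `Ċ` out of the generated text, e.g. `decode_token("Ġhello")` yields `" hello"`.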