Nonsensical output on Apple M4 Pro (Mac Mini 64GB) — 14.5 tok/s but garbage generation #10
Description
Environment
• Machine: Mac Mini M4 Pro, 64GB unified memory
• macOS: Tahoe 26.3
• GPU: Apple M4 Pro (20-core GPU)
What works
• Compilation: clean build ✅
• Model loading: model_weights.bin (5.52 GB) mmap'd ✅
• Vocab: 248,077 tokens loaded ✅
• Metal shaders: compile in 1ms ✅
• Speed: 14.3-14.5 tok/s sustained — significantly faster than M3 Max (5.7 tok/s) ✅
What's broken
Generated tokens are nonsensical regardless of prompt or sampling strategy.
CLI mode (greedy):
./infer --prompt "What is prostate surgery?" --tokens 20 --k 4
Output: The prostate surgery is a surgery that is a surgery that is... (repeats indefinitely)
Server mode:
Generates one token (#), then emits EOS immediately.
Chat template in CLI:
Echoes the prompt, then produces garbage tokens.
With temperature sampling (0.7):
Same garbage, different random tokens.
Hypothesis
Metal shaders tuned for the M3 Max (40 GPU cores) may misbehave on the M4 Pro (20 GPU cores): different SIMD group sizes or threadgroup configurations could introduce numerical errors in the dequant/matvec kernels.
That generation is FASTER while the output is garbage suggests the kernels do execute, but produce incorrect results rather than crashing or stalling.
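To make the hypothesis concrete, here is a minimal NumPy sketch (hypothetical values and a made-up `dequantize` helper, not the project's actual Metal kernels) of group-wise 4-bit dequantization, where a scale-indexing mismatch of the kind a different SIMD width could cause yields outputs that are finite (so inference stays fast) but numerically wrong:

```python
import numpy as np

# Hypothetical sketch: group-wise 4-bit dequantization where each group
# of `group_size` weights shares one scale. If a kernel pairs weights
# with the wrong group's scale, every output stays finite, so inference
# runs at full speed, but the values are wrong and generation degrades
# to garbage.

def dequantize(q, scales, group_size):
    """q: int4 values stored as int8 in [-8, 7]; scales: one per group."""
    out = np.empty(q.shape, dtype=np.float32)
    for g, s in enumerate(scales):
        lo = g * group_size
        out[lo:lo + group_size] = q[lo:lo + group_size].astype(np.float32) * s
    return out

rng = np.random.default_rng(0)
q = rng.integers(-8, 8, size=128, dtype=np.int8)
scales = rng.uniform(0.01, 0.1, size=128 // 32).astype(np.float32)

right = dequantize(q, scales, 32)
# Simulate a scale-indexing bug (the kind a SIMD-width mismatch can
# cause): each group reads its neighbor's scale instead of its own.
wrong = dequantize(q, np.roll(scales, 1), 32)

print(np.allclose(right, wrong))
```

If something like this is happening in the matvec path, every token's logits would be plausibly scaled but semantically meaningless, which matches the repetitive/garbage output above.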
Model
mlx-community/Qwen3.5-397B-A17B-4bit (snapshot: 39159bd8)
Happy to run diagnostic builds or Metal profiling. Great project!