Nonsensical output on Apple M4 Pro (Mac Mini 64GB) — 14.5 tok/s but garbage generation #10
Description
Environment
• Machine: Mac Mini M4 Pro, 64GB unified memory
• macOS: Tahoe 26.3
• GPU: Apple M4 Pro (20-core GPU)
What works
• Compilation: clean build ✅
• Model loading: model_weights.bin (5.52 GB) mmap'd ✅
• Vocab: 248,077 tokens loaded ✅
• Metal shaders: compile in 1ms ✅
• Speed: 14.3-14.5 tok/s sustained — significantly faster than M3 Max (5.7 tok/s) ✅
What's broken
Generated tokens are nonsensical regardless of prompt or sampling strategy.
CLI mode (greedy):
./infer --prompt "What is prostate surgery?" --tokens 20 --k 4
Output: The prostate surgery is a surgery that is a surgery that is... (repeats indefinitely)
Server mode:
Generates one token (#), then emits EOS immediately.
Chat template in CLI:
Echoes the prompt, then produces garbage tokens.
With temperature sampling (0.7):
Same garbage, different random tokens.
Hypothesis
Metal shaders tuned for the M3 Max (40 GPU cores) may misbehave on the M4 Pro (20 GPU cores): different SIMD group sizes or threadgroup configurations could introduce numerical errors in the dequant/matvec kernels.
That generation is FASTER while the output is garbage suggests the kernels do execute, but produce incorrect results rather than crashing or stalling.
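To make the hypothesis concrete, here is a minimal NumPy sketch (hypothetical values and a made-up `dequantize` helper, not the project's actual Metal kernels) of group-wise 4-bit dequantization, where a scale-indexing mismatch of the kind a different SIMD width could cause yields outputs that are finite (so inference stays fast) but numerically wrong:

```python
import numpy as np

# Hypothetical sketch: group-wise 4-bit dequantization where each group
# of `group_size` weights shares one scale. If a kernel pairs weights
# with the wrong group's scale, every output stays finite, so inference
# runs at full speed, but the values are wrong and generation degrades
# to garbage.

def dequantize(q, scales, group_size):
    """q: int4 values stored as int8 in [-8, 7]; scales: one per group."""
    out = np.empty(q.shape, dtype=np.float32)
    for g, s in enumerate(scales):
        lo = g * group_size
        out[lo:lo + group_size] = q[lo:lo + group_size].astype(np.float32) * s
    return out

rng = np.random.default_rng(0)
q = rng.integers(-8, 8, size=128, dtype=np.int8)
scales = rng.uniform(0.01, 0.1, size=128 // 32).astype(np.float32)

right = dequantize(q, scales, 32)
# Simulate a scale-indexing bug (the kind a SIMD-width mismatch can
# cause): each group reads its neighbor's scale instead of its own.
wrong = dequantize(q, np.roll(scales, 1), 32)

print(np.allclose(right, wrong))
```

If something like this is happening in the matvec path, every token's logits would be plausibly scaled but semantically meaningless, which matches the repetitive/garbage output above.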
Model
mlx-community/Qwen3.5-397B-A17B-4bit (snapshot: 39159bd8)
Happy to run diagnostic builds or Metal profiling. Great project!