
Nonsensical output on Apple M4 Pro (Mac Mini 64GB) — 14.5 tok/s but garbage generation #10

@fgomsan

Description

Environment

• Machine: Mac Mini M4 Pro, 64GB unified memory
• macOS: Tahoe 26.3
• GPU: Apple M4 Pro (20-core GPU)

What works

• Compilation: clean build ✅
• Model loading: model_weights.bin (5.52 GB) mmap'd ✅
• Vocab: 248,077 tokens loaded ✅
• Metal shaders: compile in 1ms ✅
• Speed: 14.3-14.5 tok/s sustained — significantly faster than M3 Max (5.7 tok/s) ✅

What's broken

Generated tokens are nonsensical regardless of prompt or sampling strategy.

CLI mode (greedy):
./infer --prompt "What is prostate surgery?" --tokens 20 --k 4
Output: The prostate surgery is a surgery that is a surgery that is... (infinite loop)

Server mode:
Generates a single token (#), then an immediate EOS.

Chat template in CLI:
Echoes prompt then produces garbage tokens.

With temperature sampling (0.7):
Same garbage, different random tokens.

Hypothesis

Metal shaders tuned for the M3 Max (40 GPU cores) may be incompatible with the M4 Pro (20 GPU cores): hardcoded SIMD-group sizes or threadgroup configurations could cause numerical errors in the dequant/matvec kernels.
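A simplified way to see how hardcoded launch geometry could corrupt results: the Python sketch below models a partial-sum reduction where the kernel accumulates a hardcoded number of partial sums. `reduce_sum` and the group counts are illustrative only, not this project's actual kernel logic.

```python
# Hypothetical illustration (not the project's kernel code): a reduction
# that accumulates a hardcoded number of partial sums, the way a kernel
# tuned for one GPU's threadgroup geometry might.

def reduce_sum(values, assumed_groups, actual_groups):
    # The hardware produces `actual_groups` partial sums...
    chunk = len(values) // actual_groups
    partials = [sum(values[i * chunk:(i + 1) * chunk])
                for i in range(actual_groups)]
    # ...but the kernel only accumulates the first `assumed_groups` of them.
    return sum(partials[:assumed_groups])

data = list(range(1, 41))        # true sum: 820
print(reduce_sum(data, 40, 40))  # geometry matches -> 820
print(reduce_sum(data, 20, 40))  # mismatch -> 210, half the data silently dropped
```

If a dequant or matvec reduction drops (or double-counts) lanes like this, every logit is wrong while throughput is unaffected, which matches the symptom here.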

The fact that generation is faster while the output is garbage suggests the kernels run to completion but compute incorrect results.
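One way to confirm numerical errors independent of Metal would be a CPU reference for the 4-bit dequant + matvec path. The sketch below assumes a simple packing (two 4-bit codes per byte, per-group scale and bias); the real layout in model_weights.bin may differ, so treat it as a placeholder.

```python
# Hypothetical CPU reference for a group-quantized 4-bit matvec.
# Assumed layout: two 4-bit codes per byte (low nibble first), with one
# (scale, bias) pair per `group_size` weights. Adjust to the real format.

def dequant_row(packed, scales, biases, group_size=32):
    out = []
    for byte in packed:
        for q in (byte & 0x0F, byte >> 4):  # low nibble, then high nibble
            g = len(out) // group_size      # which quantization group
            out.append(q * scales[g] + biases[g])
    return out

def matvec_ref(rows_packed, scales, biases, x, group_size=32):
    # One full-precision dot product per quantized row.
    return [sum(w * xi for w, xi in zip(dequant_row(p, s, b, group_size), x))
            for p, s, b in zip(rows_packed, scales, biases)]
```

Comparing this against the GPU output for a single layer (with a loose tolerance appropriate for 4-bit) would localize whether dequant, the reduction, or something downstream is at fault.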

Model

mlx-community/Qwen3.5-397B-A17B-4bit (snapshot: 39159bd8)

Happy to run diagnostic builds or Metal profiling. Great project!
