This plugin enables vLLM to run on Apple Silicon Macs using Metal Performance Shaders (MPS) for GPU acceleration.
Install the latest release with a single command:
```bash
curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash
```

To set up a development environment instead, clone the repository with submodules and run the CI script:

```bash
git clone --recursive https://github.com/vllm-project/vllm-metal.git
cd vllm-metal
scripts/ci.sh  # This will set up the dev environment
```

Note: This project uses a specific version of PyTorch from a git submodule (extern/pytorch).
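After installing, it can help to confirm that the PyTorch build can actually see the Metal device. The check below uses PyTorch's standard MPS backend API; the only assumption is that the submodule build keeps that API intact:

```python
import torch

# Both should print True on a working Apple Silicon setup;
# "available" additionally requires a recent enough macOS.
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())
```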
Features:

- Native Apple Silicon Support: Run LLMs on Apple Silicon Macs
- MPS Acceleration: Leverages PyTorch's MPS backend for GPU operations
- Paged Attention: Full support for vLLM's paged attention mechanism
- Memory Efficient: Optimized for unified memory architecture
- Drop-in Replacement: Works with existing vLLM APIs (see the sketch below)
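A minimal sketch of that drop-in usage via the standard vLLM Python API; the model name is only an example, and nothing here is Metal-specific:

```python
from vllm import LLM, SamplingParams

# The same code you would run against any other vLLM backend.
llm = LLM(model="HuggingFaceTB/SmolLM2-135M-Instruct", dtype="float16")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What is unified memory?"], params)
print(outputs[0].outputs[0].text)
```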
Requirements:

- Apple Silicon Mac
- Python 3.11 or later
- vLLM 0.12.0 or later
Configuration:

| Variable | Default | Description |
|---|---|---|
| `VLLM_METAL_DEVICE_ID` | `0` | MPS device ID |
| `VLLM_METAL_MEMORY_FRACTION` | `0.9` | Fraction of memory to use |
| `VLLM_METAL_ATTENTION_BACKEND` | `mps` | Attention backend (`mps` or `eager`) |
| `VLLM_METAL_EAGER_MODE` | `1` | Use eager mode (disable graph compilation) |
| `VLLM_METAL_MAX_BATCH_SIZE` | `256` | Maximum batch size |
| `VLLM_METAL_KV_CACHE_DTYPE` | None | KV cache dtype (default: model dtype) |
| `VLLM_METAL_ENABLE_PROFILING` | `0` | Enable profiling |
Example:

```bash
# Use 80% of available memory
export VLLM_METAL_MEMORY_FRACTION=0.8

# Enable profiling
export VLLM_METAL_ENABLE_PROFILING=1

# Run vLLM
python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceTB/SmolLM2-135M-Instruct \
    --dtype float16
```
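Once the server is up, it speaks vLLM's OpenAI-compatible HTTP API. A quick way to exercise it, assuming the default bind of `localhost:8000` and that `requests` is installed:

```python
import requests

# Send a completion request to the server started above.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "HuggingFaceTB/SmolLM2-135M-Instruct",
        "prompt": "Apple Silicon is",
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["text"])
```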
Limitations:

- Single GPU Only: MPS does not support multi-GPU configurations
- No Distributed Inference: Tensor and pipeline parallelism not supported
- Limited Quantization: Some quantization methods (FP8) not available
- Memory Sharing: GPU memory is shared with system memory
Performance tips:

- Use Float16: Metal works best with `dtype=float16`
- Adjust Memory Fraction: If you encounter OOM errors, reduce `VLLM_METAL_MEMORY_FRACTION` (see the sketch after this list)
- Batch Size: Larger batch sizes can improve throughput
- Model Size: Unified memory allows larger models than discrete GPU memory
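A minimal OOM-mitigation sketch using only the variables from the table above; the specific values are assumptions to adapt to your machine, and it assumes, as is typical for vLLM settings, that they are read once at engine startup:

```python
import os

# Set before the engine is created.
os.environ["VLLM_METAL_MEMORY_FRACTION"] = "0.7"  # leave more headroom for the OS
os.environ["VLLM_METAL_MAX_BATCH_SIZE"] = "64"    # smaller batches lower peak memory

from vllm import LLM
llm = LLM(model="HuggingFaceTB/SmolLM2-135M-Instruct", dtype="float16")
```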
Troubleshooting slow generation:

- Ensure you're using `dtype=float16`
- Check that MPS is being used, not CPU fallback (see the check below)
- Consider enabling eager mode (`VLLM_METAL_EAGER_MODE=1`) if graph compilation is slow
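A quick way to confirm the MPS device is actually in use, with plain PyTorch and independent of vLLM:

```python
import torch

# Allocate directly on the Metal device; if this raises or reports
# a CPU device, operations are falling back to the CPU.
x = torch.ones(4, device="mps")
print(x.device)  # expected: mps:0
```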