vllm-metal

This plugin enables vLLM to run on Apple Silicon Macs using Metal Performance Shaders (MPS) for GPU acceleration.

Installation

Quick Install

Install the latest release with a single command:

curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash

From Source

git clone --recursive https://github.com/vllm-project/vllm-metal.git
cd vllm-metal
scripts/ci.sh  # This sets up the dev environment

Note: This project uses a specific version of PyTorch from a git submodule (extern/pytorch).

Features

  • Native Apple Silicon Support: Run LLMs on Apple Silicon Macs
  • MPS Acceleration: Leverages PyTorch's MPS backend for GPU operations
  • Paged Attention: Full support for vLLM's paged attention mechanism
  • Memory Efficient: Optimized for unified memory architecture
  • Drop-in Replacement: Works with existing vLLM APIs
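The paged attention mentioned above manages the KV cache in fixed-size blocks indexed through a per-sequence block table. As a toy illustration of that bookkeeping (hypothetical names, not the plugin's actual implementation):

```python
# Toy sketch of paged KV-cache bookkeeping (illustrative only): logical
# token positions map to fixed-size physical blocks via a block table.

BLOCK_SIZE = 16  # tokens per physical KV-cache block

class Allocator:
    """Hands out physical block IDs in order (a stand-in for a free list)."""
    def __init__(self):
        self.next_id = 0

    def allocate(self):
        self.next_id += 1
        return self.next_id - 1

class BlockTable:
    """Per-sequence mapping from logical token positions to physical blocks."""
    def __init__(self):
        self.blocks = []      # physical block IDs, in logical order
        self.num_tokens = 0

    def append_token(self, allocator):
        # Allocate a new physical block whenever the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, position):
        # Translate a logical token position into (block_id, offset).
        return self.blocks[position // BLOCK_SIZE], position % BLOCK_SIZE

alloc = Allocator()
table = BlockTable()
for _ in range(20):
    table.append_token(alloc)

print(table.blocks)             # [0, 1] -- two blocks cover 20 tokens
print(table.physical_slot(17))  # (1, 1)
```

Because blocks are allocated on demand rather than reserved up front, sequences of different lengths share the cache without fragmentation; this is the property that makes paged attention memory efficient on a unified-memory machine.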

Requirements

  • Apple Silicon Mac
  • Python 3.11 or later
  • vLLM 0.12.0 or later

Configuration

Environment Variables

Variable                      Default  Description
VLLM_METAL_DEVICE_ID          0        MPS device ID
VLLM_METAL_MEMORY_FRACTION    0.9      Fraction of memory to use
VLLM_METAL_ATTENTION_BACKEND  mps      Attention backend (mps or eager)
VLLM_METAL_EAGER_MODE         1        Use eager mode (disable graph compilation)
VLLM_METAL_MAX_BATCH_SIZE     256      Maximum batch size
VLLM_METAL_KV_CACHE_DTYPE     (unset)  KV cache dtype (default: model dtype)
VLLM_METAL_ENABLE_PROFILING   0        Enable profiling
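To see how these variables and their defaults combine, here is a small illustrative sketch of reading them in Python (a hypothetical helper, not the plugin's actual parsing code):

```python
import os

def metal_config():
    """Read the VLLM_METAL_* environment variables with their defaults.
    Illustrative only; the plugin's real parsing may differ."""
    return {
        "device_id": int(os.environ.get("VLLM_METAL_DEVICE_ID", "0")),
        "memory_fraction": float(os.environ.get("VLLM_METAL_MEMORY_FRACTION", "0.9")),
        "attention_backend": os.environ.get("VLLM_METAL_ATTENTION_BACKEND", "mps"),
        "eager_mode": os.environ.get("VLLM_METAL_EAGER_MODE", "1") == "1",
        "max_batch_size": int(os.environ.get("VLLM_METAL_MAX_BATCH_SIZE", "256")),
        # Unset means "fall back to the model dtype".
        "kv_cache_dtype": os.environ.get("VLLM_METAL_KV_CACHE_DTYPE"),
        "enable_profiling": os.environ.get("VLLM_METAL_ENABLE_PROFILING", "0") == "1",
    }

os.environ["VLLM_METAL_MEMORY_FRACTION"] = "0.8"
cfg = metal_config()
print(cfg["memory_fraction"])    # 0.8 (from the environment)
print(cfg["attention_backend"])  # mps (the default)
```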

Example Configuration

# Use 80% of available memory
export VLLM_METAL_MEMORY_FRACTION=0.8

# Enable profiling
export VLLM_METAL_ENABLE_PROFILING=1

# Run vLLM
python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceTB/SmolLM2-135M-Instruct \
    --dtype float16

Limitations

  • Single GPU Only: MPS does not support multi-GPU configurations
  • No Distributed Inference: Tensor and pipeline parallelism are not supported
  • Limited Quantization: Some quantization methods (e.g., FP8) are not available
  • Memory Sharing: GPU memory is shared with system memory

Performance Tips

  1. Use Float16: Metal works best with dtype=float16
  2. Adjust Memory Fraction: If you encounter OOM errors, reduce VLLM_METAL_MEMORY_FRACTION
  3. Batch Size: Larger batch sizes can improve throughput
  4. Model Size: Unified memory allows larger models than discrete GPU memory
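As a rough illustration of tips 2 and 4: float16 weights take about two bytes per parameter, so you can estimate whether a model's weights fit within the budget set by VLLM_METAL_MEMORY_FRACTION. This back-of-envelope sketch ignores the KV cache, activations, and runtime overhead:

```python
def fits_in_memory(num_params, unified_memory_gib, memory_fraction=0.9,
                   bytes_per_param=2):
    """Back-of-envelope check: float16 weights use ~2 bytes per parameter.
    Ignores KV cache, activations, and runtime overhead."""
    budget_bytes = unified_memory_gib * (1 << 30) * memory_fraction
    return num_params * bytes_per_param <= budget_bytes

# A 7B-parameter model in float16 (~14 GiB of weights) on a 16 GiB Mac
# with the default 0.9 memory fraction (~14.4 GiB budget):
print(fits_in_memory(7e9, 16))  # True (barely -- no room left for KV cache)
# The same model on an 8 GiB Mac:
print(fits_in_memory(7e9, 8))   # False
```

Because unified memory serves both the CPU and GPU, leave headroom for the OS and other processes; lowering the fraction is the first remedy for OOM errors.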

Troubleshooting Slow Performance

  • Ensure you're using dtype=float16
  • Check that MPS is being used (not CPU fallback)
  • Consider eager mode (VLLM_METAL_EAGER_MODE=1, the default) if graph compilation is slow

Development

Running CI

scripts/ci.sh