GPU environment:
- OS: Linux
- Python: 3.12
- GPU: NVIDIA compute capability 8.0+ (e.g., L20, L40, H20)
- CUDA: CUDA Version 12.8
- vLLM: v0.9.2
NPU environment:
- OS: Linux
- Python: >= 3.9, < 3.12
- NPU: Atlas 800 A2/A3 series
- CANN: CANN Version 8.1.RC1
- vLLM: v0.9.2
- vLLM Ascend: v0.9.2rc1
Before you start with UCM, please make sure that you have installed UCM correctly by following the GPU Installation guide or NPU Installation guide.
UCM supports two key features: Prefix Cache and Sparse Attention.
Each feature supports both Offline Inference and Online API modes.
To get started quickly, follow the usage guide below to launch your own inference run.
For further details on Prefix Cache, see the link below:
Various Sparse Attention features are also available; to try GSA Sparsity, see the link below:
Offline Inference
You can use our official example script to run offline inference with the following commands:
cd examples/
# Change the model path to your own model path
export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
python offline_inference.py
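If you prefer to adapt the example to your own code, the sketch below shows roughly what such a script can look like. It is an illustrative sketch only: it assumes vLLM's offline LLM entry point and reuses the same UCM connector settings as the online serving example below; the actual examples/offline_inference.py may differ, and the comments on the size fields are inferred from the field names.

```python
# Illustrative sketch of offline inference with the UCM connector
# (assumed settings; the shipped examples/offline_inference.py may differ).
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

model_path = os.environ.get("MODEL_PATH", "/home/models/Qwen2.5-14B-Instruct")

# Same connector configuration as in the online serving example below.
kv_transfer_config = KVTransferConfig(
    kv_connector="UnifiedCacheConnectorV1",
    kv_connector_module_path="ucm.integration.vllm.uc_connector",
    kv_role="kv_both",
    kv_connector_extra_config={
        "ucm_connector_name": "UcmDramStore",
        "ucm_connector_config": {
            "max_cache_size": 5368709120,  # assumed: DRAM cache capacity in bytes (5 GiB)
            "kv_block_size": 262144,       # assumed: bytes per offloaded KV block (256 KiB)
        },
    },
)

llm = LLM(model=model_path, kv_transfer_config=kv_transfer_config, max_model_len=20000)
outputs = llm.generate(["Shanghai is a"], SamplingParams(temperature=0, max_tokens=7))
for output in outputs:
    print(output.outputs[0].text)
```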
OpenAI-Compatible Online API
For online inference, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.
First, specify the Python hash seed:
export PYTHONHASHSEED=123456
Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:
# Change the model path to your own model path
export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
vllm serve ${MODEL_PATH} \
--served-model-name vllm_cpu_offload \
--max-model-len 20000 \
--tensor-parallel-size 2 \
--gpu_memory_utilization 0.87 \
--trust-remote-code \
--port 7800 \
--kv-transfer-config \
'{
"kv_connector": "UnifiedCacheConnectorV1",
"kv_connector_module_path": "ucm.integration.vllm.uc_connector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"ucm_connector_name": "UcmDramStore",
"ucm_connector_config": {
"max_cache_size": 5368709120,
"kv_block_size": 262144
}
}
}'
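For reference, the kv_connector_extra_config passed above can be written as a plain Python dict with the sizes spelled out. The meanings in the comments are inferred from the field names, not taken from UCM documentation.

```python
# The connector settings from the serve command above, with the sizes made explicit.
# Field meanings are assumptions based on the names.
ucm_extra_config = {
    "ucm_connector_name": "UcmDramStore",   # DRAM-backed KV cache store
    "ucm_connector_config": {
        "max_cache_size": 5 * 1024**3,      # 5368709120 bytes = 5 GiB of host DRAM
        "kv_block_size": 256 * 1024,        # 262144 bytes per KV block
    },
}
```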
If you see logs like the following:
INFO: Started server process [32890]
INFO: Waiting for application startup.
INFO: Application startup complete.
Congratulations, you have successfully started the vLLM server with UCM!
After the vLLM server has started successfully, you can interact with the API as follows:
curl http://localhost:7800/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vllm_cpu_offload",
"prompt": "Shanghai is a",
"max_tokens": 7,
"temperature": 0
}'
Note: If you want to disable the vLLM prefix cache to test UCM's caching ability, you can add --no-enable-prefix-caching to the command line.
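Equivalently, you can query the server from Python with the official openai client. This is a minimal sketch; it assumes the server started above is reachable at localhost:7800 and that any placeholder API key is accepted.

```python
# Minimal sketch: query the UCM-backed vLLM server via the OpenAI-compatible API.
# Assumes the server started above is listening on localhost:7800.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7800/v1", api_key="EMPTY")

completion = client.completions.create(
    model="vllm_cpu_offload",   # must match --served-model-name
    prompt="Shanghai is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)
```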