
Quickstart

Prerequisites

GPU

  • OS: Linux
  • Python: 3.12
  • GPU: NVIDIA compute capability 8.0+ (e.g., L20, L40, H20)
  • CUDA: 12.8
  • vLLM: v0.9.2
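
To sanity-check these requirements, you can run a quick probe (a minimal sketch, assuming PyTorch with CUDA support is already installed; not part of the official guide):

import sys
import torch

# Python 3.12 and a compute-capability-8.0+ GPU are required for UCM on GPU.
assert sys.version_info[:2] == (3, 12), "UCM on GPU expects Python 3.12"
assert torch.cuda.is_available(), "no CUDA device visible"
cap = torch.cuda.get_device_capability()
assert cap >= (8, 0), f"compute capability {cap[0]}.{cap[1]} is below 8.0"
print(f"OK: {torch.cuda.get_device_name(0)}, capability {cap[0]}.{cap[1]}, CUDA {torch.version.cuda}")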

NPU

  • OS: Linux
  • Python: >= 3.9, < 3.12
  • NPU: Atlas 800 A2/A3 series
  • CANN: 8.1.RC1
  • vLLM: v0.9.2
  • vLLM Ascend: v0.9.2rc1
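
A similar probe for the NPU environment (a minimal sketch, assuming torch and the torch_npu plugin are installed; not part of the official guide):

import sys
import torch
import torch_npu  # noqa: F401 -- registers the "npu" device with torch

# Python >= 3.9, < 3.12 and at least one visible Ascend NPU are required.
assert (3, 9) <= sys.version_info[:2] < (3, 12), "UCM on NPU expects Python >= 3.9, < 3.12"
assert torch.npu.is_available(), "no Ascend NPU visible"
print(f"OK: {torch.npu.device_count()} NPU device(s) detected")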

Installation

Before you start with UCM, make sure it is installed correctly by following the GPU Installation guide or the NPU Installation guide.

Features Overview

UCM supports two key features: Prefix Cache and Sparse Attention.

Each feature supports both Offline Inference and Online API modes.

To get started quickly, just follow the usage guide below to launch your own inference run.

For further research on Prefix Cache, see the Prefix Cache documentation.

Various Sparse Attention features are now available; to try GSA Sparsity, see the Sparse Attention documentation.

Usage

Offline Inference

You can run offline inference with our official example script using the following commands:

cd examples/
# Set this to your own model path
export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
python offline_inference.py
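
For orientation, here is a minimal sketch of what such an offline run can look like with vLLM's Python API. This is illustrative only; examples/offline_inference.py is the authoritative version. The connector settings mirror the server example below:

import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route KV cache through the UCM connector, offloading to a DRAM store.
ktc = KVTransferConfig(
    kv_connector="UnifiedCacheConnectorV1",
    kv_connector_module_path="ucm.integration.vllm.uc_connector",
    kv_role="kv_both",
    kv_connector_extra_config={
        "ucm_connector_name": "UcmDramStore",
        "ucm_connector_config": {
            "max_cache_size": 5368709120,
            "kv_block_size": 262144,
        },
    },
)

llm = LLM(
    model=os.environ["MODEL_PATH"],
    kv_transfer_config=ktc,
    max_model_len=20000,
    trust_remote_code=True,
)
outputs = llm.generate(["Shanghai is a"], SamplingParams(temperature=0, max_tokens=7))
print(outputs[0].outputs[0].text)
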
OpenAI-Compatible Online API

For online inference, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.

First, set the Python hash seed so that prefix-cache block hashes stay stable across processes and restarts:

export PYTHONHASHSEED=123456

Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:

# Set this to your own model path
export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
vllm serve ${MODEL_PATH} \
--served-model-name vllm_cpu_offload \
--max-model-len 20000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.87 \
--trust-remote-code \
--port 7800 \
--kv-transfer-config \
'{
    "kv_connector": "UnifiedCacheConnectorV1",
    "kv_connector_module_path": "ucm.integration.vllm.uc_connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "ucm_connector_name": "UcmDramStore",
        "ucm_connector_config": {
            "max_cache_size": 5368709120,
            "kv_block_size": 262144
        }
    }
}'
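
Here UcmDramStore offloads KV blocks to host DRAM. Assuming the two sizes are byte counts, the values above correspond to a 5 GiB cache (5368709120 = 5 * 2^30) and 256 KiB KV blocks (262144 = 2^18); tune them to your host memory budget.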

If you see log output like the following:

INFO:     Started server process [32890]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Congratulations, you have successfully started the vLLM server with UCM!

Once the server is running, you can interact with the API as follows:

curl http://localhost:7800/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "vllm_cpu_offload",
        "prompt": "Shanghai is a",
        "max_tokens": 7,
        "temperature": 0
    }'
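
Equivalently, you can query the server from Python (assuming the openai package is installed):

from openai import OpenAI

# Any placeholder API key works when the server is started without --api-key.
client = OpenAI(base_url="http://localhost:7800/v1", api_key="EMPTY")
completion = client.completions.create(
    model="vllm_cpu_offload",
    prompt="Shanghai is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)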

Note: If you want to disable vLLM's built-in prefix caching to test UCM's caching ability in isolation, you can add --no-enable-prefix-caching to the serve command.