
Commit a2b9d95

Authored by gavinlichn, chen-hu-97, pre-commit-ci[bot], and eero-t
Add vLLM ARC support with OpenVINO backend (opea-project#641)
* Add vllm Arc Dockerfile support

  Support vllm inference on Intel ARC GPU.

  Signed-off-by: Li Gang <[email protected]>
  Co-authored-by: Chen, Hu1 <[email protected]>

* Add vLLM ARC support

  Uses the official vLLM repo (https://github.com/vllm-project/vllm/) with the OpenVINO backend.
  The Dockerfile is based on Dockerfile.openvino
  (https://github.com/vllm-project/vllm/blob/main/Dockerfile.openvino)
  with ARC support packages added.
  Default model: meta-llama/Llama-3.2-3B-Instruct, to fit ARC A770 VRAM.

  Signed-off-by: Li Gang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  For more information, see https://pre-commit.ci

* Add README and .github workflow for vLLM ARC support

  Signed-off-by: Li Gang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  For more information, see https://pre-commit.ci

* Update comps/llms/text-generation/vllm/langchain/README.md

  Co-authored-by: Eero Tamminen <[email protected]>

* Rename Dockerfile to meet Contribution Guidelines

  Signed-off-by: Li Gang <[email protected]>

* Align image names as opea/vllm-arc:latest

  Signed-off-by: Li Gang <[email protected]>

---------

Signed-off-by: Li Gang <[email protected]>
Co-authored-by: Chen, Hu1 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eero Tamminen <[email protected]>
1 parent 617e119 commit a2b9d95

File tree

6 files changed (+133, -16 lines)


.github/workflows/docker/compose/llms-compose-cd.yaml

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,10 @@ services:
       context: vllm-openvino
       dockerfile: Dockerfile.openvino
     image: ${REGISTRY:-opea}/vllm-openvino:${TAG:-latest}
+  vllm-arc:
+    build:
+      dockerfile: comps/llms/text-generation/vllm/langchain/dependency/Dockerfile.intel_gpu
+    image: ${REGISTRY:-opea}/vllm-arc:${TAG:-latest}
   llm-eval:
     build:
       dockerfile: comps/llms/utils/lm-eval/Dockerfile
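For context, the image this CD entry produces can also be built by hand; a minimal sketch, assuming the repository root as the working directory (the image tag mirrors the compose entry above):

```bash
# Manual equivalent of the vllm-arc CD build entry above.
# The Dockerfile clones vLLM itself, so nothing from the build context is copied in.
docker build \
  -f comps/llms/text-generation/vllm/langchain/dependency/Dockerfile.intel_gpu \
  -t opea/vllm-arc:latest \
  --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy \
  .
```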

comps/llms/text-generation/vllm/langchain/README.md

Lines changed: 39 additions & 4 deletions
@@ -98,23 +98,31 @@ For example, if we run `meta-llama/Meta-Llama-3-70b` with 8 cards, we can use fo
 bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-70b hpu 8
 ```
 
-### 2.3 vLLM with OpenVINO
+### 2.3 vLLM with OpenVINO (on Intel GPU and CPU)
 
-vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](https://github.com/vllm-project/vllm/blob/main/docs/source/models/supported_models.rst) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support. OpenVINO vLLM backend supports the following advanced vLLM features:
+vLLM powered by OpenVINO supports all LLM models from the [vLLM supported models list](https://github.com/vllm-project/vllm/blob/main/docs/source/models/supported_models.rst) and can perform optimal model serving on all x86-64 CPUs with at least AVX2 support, as well as on both integrated and discrete Intel® GPUs (starting from the Intel® UHD Graphics generation). The OpenVINO vLLM backend supports the following advanced vLLM features:
 
 - Prefix caching (`--enable-prefix-caching`)
 - Chunked prefill (`--enable-chunked-prefill`)
 
 #### Build Docker Image
 
-To build the docker image, run the command
+To build the docker image for Intel CPU, run the command
 
 ```bash
 bash ./build_docker_vllm_openvino.sh
 ```
 
 Once it successfully builds, you will have the `vllm:openvino` image. It can be used to spawn a serving container with an OpenAI API endpoint, or you can work with it interactively via a bash shell.
 
+To build the docker image for Intel GPU, run the command
+
+```bash
+bash ./build_docker_vllm_openvino.sh gpu
+```
+
+Once it successfully builds, you will have the `opea/vllm-arc:latest` image. It can be used to spawn a serving container with an OpenAI API endpoint, or you can work with it interactively via a bash shell.
+
 #### Launch vLLM service
 
 For gated models, such as `LLAMA-2`, you will have to pass -e HUGGING_FACE_HUB_TOKEN=\<token\> to the docker run command above with a valid Hugging Face Hub read token.
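As a follow-up to the interactive-use note above, a hedged sketch of opening a shell in the newly built ARC image (the `--device /dev/dri` passthrough mirrors the launch script changed later in this commit; the container name is arbitrary):

```bash
# The image's default CMD is /bin/bash, so this drops straight into a shell.
docker run -it --rm --name vllm-arc-shell \
  --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path \
  opea/vllm-arc:latest
# Inside the container, clinfo (installed by the Dockerfile) should list the Intel GPU:
#   clinfo -l
```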
@@ -125,14 +133,30 @@ Please follow this link [huggingface token](https://huggingface.co/docs/hub/secu
 export HUGGINGFACEHUB_API_TOKEN=<token>
 ```
 
-To start the model server:
+To start the model server for Intel CPU:
 
 ```bash
 bash launch_vllm_service_openvino.sh
 ```
 
+To start the model server for Intel GPU:
+
+```bash
+bash launch_vllm_service_openvino.sh -d gpu
+```
+
 #### Performance tips
 
+---
+
+vLLM OpenVINO backend environment variables:
+
+- `VLLM_OPENVINO_DEVICE` specifies which device to use for inference. If there are multiple GPUs in the system, additional indexes can be used to choose the proper one (e.g., `VLLM_OPENVINO_DEVICE=GPU.1`). If the value is not specified, the CPU device is used by default.
+
+- `VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON` enables U8 weight compression during the model loading stage. By default, compression is turned off. You can also export a model with different compression techniques using `optimum-cli` and pass the exported folder as `<model_id>`.
+
+##### CPU performance tips
+
 vLLM OpenVINO backend uses the following environment variables to control behavior:
 
 - `VLLM_OPENVINO_KVCACHE_SPACE` to specify the KV Cache size (e.g., `VLLM_OPENVINO_KVCACHE_SPACE=40` means 40 GB space for the KV cache); a larger setting allows vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
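To show how these variables combine, a hedged sketch of serving on the second GPU of a multi-GPU host with an 8 GB KV cache. It mirrors the `docker run` issued by `launch_vllm_service_openvino.sh -d gpu` (the port, model, and `--max_model_len` value come from that script; the device index and cache size shown here are illustrative assumptions):

```bash
# Sketch: pick GPU.1 explicitly and cap the KV cache at 8 GB.
docker run -d --rm --name vllm-openvino-server \
  -p 8008:80 --ipc=host \
  --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path \
  -e VLLM_OPENVINO_DEVICE=GPU.1 \
  -e VLLM_OPENVINO_KVCACHE_SPACE=8 \
  -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  opea/vllm-arc:latest /bin/bash -c "\
    python3 -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.2-3B-Instruct --max_model_len=1024 \
      --host 0.0.0.0 --port 80"
```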
@@ -148,6 +172,17 @@ OpenVINO best known configuration is:
 $ VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
     python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256
 
+##### GPU performance tips
+
+The GPU device implements logic for automatic detection of available GPU memory and, by default, tries to reserve as much memory as possible for the KV cache (taking the `gpu_memory_utilization` option into account). However, this behavior can be overridden by explicitly specifying the desired amount of memory for the KV cache with the `VLLM_OPENVINO_KVCACHE_SPACE` environment variable (e.g., `VLLM_OPENVINO_KVCACHE_SPACE=8` means 8 GB space for the KV cache).
+
+Currently, the best GPU performance can be achieved with the default vLLM execution parameters for models with quantized weights (8- and 4-bit integer data types are supported) and `preemption-mode=swap`.
+
+The OpenVINO best known configuration for GPU is:
+
+    $ VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
+        python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json
+
 ### 2.4 Query the service
 
 And then you can make requests like below to check the service status:
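Once the container is up, a small hedged status check, assuming the default 8008 port published by the launch script (the `jq` filter matches the updated `query.sh` below):

```bash
# List the model(s) the server is currently serving.
curl -s http://localhost:8008/v1/models | jq -r '.data[].id'
```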
comps/llms/text-generation/vllm/langchain/dependency/Dockerfile.intel_gpu

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# The vLLM Dockerfile is used to construct vLLM image that can be directly used
+# to run the OpenAI compatible server.
+# Based on https://github.com/vllm-project/vllm/blob/main/Dockerfile.openvino
+# add Intel ARC support package
+
+FROM ubuntu:22.04 AS dev
+
+RUN apt-get update -y && \
+    apt-get install -y \
+    git python3-pip \
+    ffmpeg libsm6 libxext6 libgl1 \
+    gpg-agent wget
+
+RUN wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg && \
+    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" | \
+    tee /etc/apt/sources.list.d/intel-gpu-jammy.list &&\
+    apt update -y &&\
+    apt install -y \
+    intel-opencl-icd intel-level-zero-gpu level-zero \
+    intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
+    libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
+    libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
+    mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
+
+WORKDIR /workspace
+
+RUN git clone -b v0.6.3.post1 https://github.com/vllm-project/vllm.git
+
+#ARG GIT_REPO_CHECK=0
+#RUN --mount=type=bind,source=.git,target=.git \
+#    if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
+
+# install build requirements
+RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/vllm/requirements-build.txt
+# build vLLM with OpenVINO backend
+RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace/vllm/
+
+#COPY examples/ /workspace/vllm/examples
+#COPY benchmarks/ /workspace/vllm/benchmarks
+
+
+CMD ["/bin/bash"]
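A hedged smoke test of the resulting image, assuming the working directory contains `Dockerfile.intel_gpu`; the module names are assumptions based on the two `pip install` steps above, not part of this commit:

```bash
# Build with the commit's image name, then confirm the OpenVINO-backed vLLM imports cleanly.
docker build -f Dockerfile.intel_gpu -t opea/vllm-arc:latest .
docker run --rm opea/vllm-arc:latest \
  python3 -c "import vllm, openvino; print(vllm.__version__, openvino.__version__)"
```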

comps/llms/text-generation/vllm/langchain/dependency/build_docker_vllm_openvino.sh

Lines changed: 24 additions & 5 deletions
@@ -3,8 +3,27 @@
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0
 
-BASEDIR="$( cd "$( dirname "$0" )" && pwd )"
-git clone https://github.com/vllm-project/vllm.git vllm
-cd ./vllm/ && git checkout v0.6.1
-docker build -t vllm:openvino -f Dockerfile.openvino . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
-cd $BASEDIR && rm -rf vllm
+# Set default values
+default_hw_mode="cpu"
+
+# Assign arguments to variable
+hw_mode=${1:-$default_hw_mode}
+
+# Check if all required arguments are provided
+if [ "$#" -lt 0 ] || [ "$#" -gt 1 ]; then
+    echo "Usage: $0 [hw_mode]"
+    echo "Please customize the arguments you want to use.
+    - hw_mode: The hardware mode for the vLLM endpoint, with the default being 'cpu', and the optional selection can be 'cpu' and 'gpu'."
+    exit 1
+fi
+
+# Build the docker image for vLLM based on the hardware mode
+if [ "$hw_mode" = "gpu" ]; then
+    docker build -f Dockerfile.intel_gpu -t opea/vllm-arc:latest . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
+else
+    BASEDIR="$( cd "$( dirname "$0" )" && pwd )"
+    git clone https://github.com/vllm-project/vllm.git vllm
+    cd ./vllm/ && git checkout v0.6.1
+    docker build -t vllm:openvino -f Dockerfile.openvino . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
+    cd $BASEDIR && rm -rf vllm
+fi

comps/llms/text-generation/vllm/langchain/dependency/launch_vllm_service_openvino.sh

Lines changed: 21 additions & 6 deletions
@@ -9,16 +9,20 @@
 
 default_port=8008
 default_model="meta-llama/Llama-2-7b-hf"
+default_device="cpu"
 swap_space=50
+image="vllm:openvino"
 
-while getopts ":hm:p:" opt; do
+while getopts ":hm:p:d:" opt; do
   case $opt in
     h)
-      echo "Usage: $0 [-h] [-m model] [-p port]"
+      echo "Usage: $0 [-h] [-m model] [-p port] [-d device]"
       echo "Options:"
       echo "  -h            Display this help message"
-      echo "  -m model      Model (default: meta-llama/Llama-2-7b-hf)"
+      echo "  -m model      Model (default: meta-llama/Llama-2-7b-hf for cpu"
+      echo "                        meta-llama/Llama-3.2-3B-Instruct for gpu)"
       echo "  -p port       Port (default: 8000)"
+      echo "  -d device     Target Device (Default: cpu, optional selection can be 'cpu' and 'gpu')"
       exit 0
       ;;
     m)
@@ -27,6 +31,9 @@ while getopts ":hm:p:" opt; do
     p)
       port=$OPTARG
       ;;
+    d)
+      device=$OPTARG
+      ;;
     \?)
       echo "Invalid option: -$OPTARG" >&2
       exit 1
@@ -37,25 +44,33 @@ done
 # Assign arguments to variables
 model_name=${model:-$default_model}
 port_number=${port:-$default_port}
+device=${device:-$default_device}
 
 
 # Set the Huggingface cache directory variable
 HF_CACHE_DIR=$HOME/.cache/huggingface
-
+if [ "$device" = "gpu" ]; then
+  docker_args="-e VLLM_OPENVINO_DEVICE=GPU --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path"
+  vllm_args="--max_model_len=1024"
+  model_name="meta-llama/Llama-3.2-3B-Instruct"
+  image="opea/vllm-arc:latest"
+fi
 # Start the model server using Openvino as the backend inference engine.
 # Provide the container name that is unique and meaningful, typically one that includes the model name.
 
 docker run -d --rm --name="vllm-openvino-server" \
   -p $port_number:80 \
   --ipc=host \
+  $docker_args \
   -e HTTPS_PROXY=$https_proxy \
   -e HTTP_PROXY=$https_proxy \
   -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} \
-  -v $HOME/.cache/huggingface:/home/user/.cache/huggingface \
-  vllm:openvino /bin/bash -c "\
+  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
+  $image /bin/bash -c "\
     cd / && \
     export VLLM_CPU_KVCACHE_SPACE=50 && \
     python3 -m vllm.entrypoints.openai.api_server \
      --model \"$model_name\" \
+     $vllm_args \
      --host 0.0.0.0 \
      --port 80"
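After launching with `-d gpu`, a hedged way to confirm the container actually sees the Intel GPU (the container name matches the script above; `clinfo` is installed by the new Dockerfile):

```bash
# Check the server log, then list OpenCL devices inside the container.
docker logs vllm-openvino-server
docker exec vllm-openvino-server clinfo -l
```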

comps/llms/text-generation/vllm/langchain/query.sh

Lines changed: 2 additions & 1 deletion
@@ -2,11 +2,12 @@
 # SPDX-License-Identifier: Apache-2.0
 
 your_ip="0.0.0.0"
+model=$(curl http://localhost:8008/v1/models -s|jq -r '.data[].id')
 
 curl http://${your_ip}:8008/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
-  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+  "model": "'$model'",
   "prompt": "What is Deep Learning?",
   "max_tokens": 32,
   "temperature": 0
