
Commit 8fe1929

[AudioQnA] Enable vLLM and set it as default LLM serving (opea-project#1657)
Signed-off-by: Wang, Kai Lawrence <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 35c5cf5 commit 8fe1929

File tree

16 files changed: +750, -102 lines changed


AudioQnA/audioqna.py

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
 SPEECHT5_SERVER_PORT = int(os.getenv("SPEECHT5_SERVER_PORT", 7055))
 LLM_SERVER_HOST_IP = os.getenv("LLM_SERVER_HOST_IP", "0.0.0.0")
 LLM_SERVER_PORT = int(os.getenv("LLM_SERVER_PORT", 3006))
-LLM_MODEL_ID = os.getenv("LLM_MODEL_ID", "Intel/neural-chat-7b-v3-3")
+LLM_MODEL_ID = os.getenv("LLM_MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")


 def align_inputs(self, inputs, cur_node, runtime_graph, llm_parameters_dict, **kwargs):

AudioQnA/audioqna_multilang.py

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
 GPT_SOVITS_SERVER_PORT = int(os.getenv("GPT_SOVITS_SERVER_PORT", 9088))
 LLM_SERVER_HOST_IP = os.getenv("LLM_SERVER_HOST_IP", "0.0.0.0")
 LLM_SERVER_PORT = int(os.getenv("LLM_SERVER_PORT", 8888))
-LLM_MODEL_ID = os.getenv("LLM_MODEL_ID", "Intel/neural-chat-7b-v3-3")
+LLM_MODEL_ID = os.getenv("LLM_MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")


 def align_inputs(self, inputs, cur_node, runtime_graph, llm_parameters_dict, **kwargs):
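
Both gateway scripts read their LLM connection settings from the environment; only the fallback model ID changes in this commit. Below is a minimal sketch of overriding those values before starting the backend; the host IP is purely illustrative, while the variable names are the ones read by the `os.getenv()` calls above.

```bash
# Hypothetical override: point the AudioQnA gateway at a non-default vLLM endpoint.
# Variable names match the os.getenv() calls in audioqna.py / audioqna_multilang.py.
export LLM_SERVER_HOST_IP="192.168.1.10"   # illustrative host running vllm-service
export LLM_SERVER_PORT=3006                # host port mapped to the serving container
export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"   # must match the model vLLM loads
```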

AudioQnA/docker_compose/intel/cpu/xeon/README.md

Lines changed: 86 additions & 25 deletions
@@ -2,6 +2,10 @@
 
 This document outlines the deployment process for a AudioQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on Intel Xeon server.
 
+The default pipeline deploys with vLLM as the LLM serving component. It also provides the option of using the TGI backend for the LLM microservice; please refer to the [Start the MegaService](#-start-the-megaservice) section of this page.
+
+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, please make sure you have either requested and been granted access to it on [Huggingface](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).
+
 ## 🚀 Build Docker images
 
 ### 1. Source Code install GenAIComps
@@ -17,9 +21,15 @@ cd GenAIComps
 docker build -t opea/whisper:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/asr/src/integrations/dependency/whisper/Dockerfile .
 ```
 
-### 3. Build LLM Image
+### 3. Build vLLM Image
 
-Intel Xeon optimized image hosted in huggingface repo will be used for TGI service: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu (https://github.com/huggingface/text-generation-inference)
+```bash
+git clone https://github.com/vllm-project/vllm.git
+cd ./vllm/
+VLLM_VER="$(git describe --tags "$(git rev-list --tags --max-count=1)" )"
+git checkout ${VLLM_VER}
+docker build --no-cache --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile.cpu -t opea/vllm:latest --shm-size=128g .
+```
 
 ### 4. Build TTS Image
 
@@ -43,9 +53,10 @@ docker build --no-cache -t opea/audioqna:latest --build-arg https_proxy=$https_p
 Then run the command `docker images`, you will have following images ready:
 
 1. `opea/whisper:latest`
-2. `opea/speecht5:latest`
-3. `opea/audioqna:latest`
-4. `opea/gpt-sovits:latest` (optional)
+2. `opea/vllm:latest`
+3. `opea/speecht5:latest`
+4. `opea/audioqna:latest`
+5. `opea/gpt-sovits:latest` (optional)
 
 ## 🚀 Set the environment variables
 
@@ -55,7 +66,7 @@ Before starting the services with `docker compose`, you have to recheck the foll
 export host_ip=<your External Public IP> # export host_ip=$(hostname -I | awk '{print $1}')
 export HUGGINGFACEHUB_API_TOKEN=<your HF token>
 
-export LLM_MODEL_ID=Intel/neural-chat-7b-v3-3
+export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
 
 export MEGA_SERVICE_HOST_IP=${host_ip}
 export WHISPER_SERVER_HOST_IP=${host_ip}
@@ -73,40 +84,90 @@ export BACKEND_SERVICE_ENDPOINT=http://${host_ip}:3008/v1/audioqna
 
 or use set_env.sh file to setup environment variables.
 
-Note: Please replace with host_ip with your external IP address, do not use localhost.
+Note:
+
+- Please replace host_ip with your external IP address, do not use localhost.
+- If you are in a proxy environment, also set the proxy-related environment variables:
+
+```
+export http_proxy="Your_HTTP_Proxy"
+export https_proxy="Your_HTTPs_Proxy"
+# Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
+export no_proxy="Your_No_Proxy",${host_ip},whisper-service,speecht5-service,gpt-sovits-service,tgi-service,vllm-service,audioqna-xeon-backend-server,audioqna-xeon-ui-server
+```
 
 ## 🚀 Start the MegaService
 
 ```bash
 cd GenAIExamples/AudioQnA/docker_compose/intel/cpu/xeon/
+```
+
+If using vLLM as the LLM serving backend:
+
+```
 docker compose up -d
 
 # multilang tts (optional)
 docker compose -f compose_multilang.yaml up -d
 ```
 
+If using TGI as the LLM serving backend:
+
+```
+docker compose -f compose_tgi.yaml up -d
+```
+
 ## 🚀 Test MicroServices
 
-```bash
-# whisper service
-wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav
-curl http://${host_ip}:7066/v1/audio/transcriptions \
-  -H "Content-Type: multipart/form-data" \
-  -F file="@./sample.wav" \
-  -F model="openai/whisper-small"
-
-# tgi service
-curl http://${host_ip}:3006/generate \
-  -X POST \
-  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
-  -H 'Content-Type: application/json'
+1. Whisper Service
 
-# speecht5 service
-curl http://${host_ip}:7055/v1/audio/speech -XPOST -d '{"input": "Who are you?"}' -H 'Content-Type: application/json' --output speech.mp3
+```bash
+wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav
+curl http://${host_ip}:${WHISPER_SERVER_PORT}/v1/audio/transcriptions \
+  -H "Content-Type: multipart/form-data" \
+  -F file="@./sample.wav" \
+  -F model="openai/whisper-small"
+```
 
-# gpt-sovits service (optional)
-curl http://${host_ip}:9880/v1/audio/speech -XPOST -d '{"input": "Who are you?"}' -H 'Content-Type: application/json' --output speech.mp3
-```
+2. LLM backend Service
+
+On the first startup, this service will take more time to download, load, and warm up the model. Once it finishes, the service will be ready and the container (`vllm-service` or `tgi-service`) status shown via `docker ps` will be `healthy`. Before that, the status will be `health: starting`.
+
+Alternatively, run the command below to check whether the LLM serving is ready.
+
+```bash
+# vLLM service
+docker logs vllm-service 2>&1 | grep complete
+# If the service is ready, you will get a response like the one below.
+INFO: Application startup complete.
+```
+
+```bash
+# TGI service
+docker logs tgi-service | grep Connected
+# If the service is ready, you will get a response like the one below.
+2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
+```
+
+Then try the `cURL` command below to validate services.
+
+```bash
+# either vLLM or TGI service
+curl http://${host_ip}:${LLM_SERVER_PORT}/v1/chat/completions \
+  -X POST \
+  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
+  -H 'Content-Type: application/json'
+```
+
+3. TTS Service
+
+```
+# speecht5 service
+curl http://${host_ip}:${SPEECHT5_SERVER_PORT}/v1/audio/speech -XPOST -d '{"input": "Who are you?"}' -H 'Content-Type: application/json' --output speech.mp3
+
+# gpt-sovits service (optional)
+curl http://${host_ip}:${GPT_SOVITS_SERVER_PORT}/v1/audio/speech -XPOST -d '{"input": "Who are you?"}' -H 'Content-Type: application/json' --output speech.mp3
+```
 
 ## 🚀 Test MegaService
 
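The readiness and validation steps added above can also be strung together. A rough sketch, assuming the default vLLM deployment and the port values from set_env.sh:

```bash
# Wait for the vllm-service healthcheck to pass, then send one test completion.
until [ "$(docker inspect -f '{{.State.Health.Status}}' vllm-service)" = "healthy" ]; do
  sleep 10
done
curl http://${host_ip}:${LLM_SERVER_PORT:-3006}/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens": 17}'
```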

AudioQnA/docker_compose/intel/cpu/xeon/compose.yaml

Lines changed: 14 additions & 11 deletions
@@ -6,7 +6,7 @@ services:
     image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
     container_name: whisper-service
     ports:
-      - "7066:7066"
+      - ${WHISPER_SERVER_PORT:-7066}:7066
     ipc: host
     environment:
       no_proxy: ${no_proxy}
@@ -17,38 +17,41 @@ services:
     image: ${REGISTRY:-opea}/speecht5:${TAG:-latest}
     container_name: speecht5-service
     ports:
-      - "7055:7055"
+      - ${SPEECHT5_SERVER_PORT:-7055}:7055
     ipc: host
     environment:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
     restart: unless-stopped
-  tgi-service:
-    image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
-    container_name: tgi-service
+  vllm-service:
+    image: ${REGISTRY:-opea}/vllm:${TAG:-latest}
+    container_name: vllm-service
     ports:
-      - "3006:80"
+      - ${LLM_SERVER_PORT:-3006}:80
     volumes:
-      - "${MODEL_CACHE:-./data}:/data"
-    shm_size: 1g
+      - "${MODEL_CACHE:-./data}:/root/.cache/huggingface/hub"
+    shm_size: 128g
     environment:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
       HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
+      LLM_MODEL_ID: ${LLM_MODEL_ID}
+      VLLM_TORCH_PROFILER_DIR: "/mnt"
+      LLM_SERVER_PORT: ${LLM_SERVER_PORT}
     healthcheck:
-      test: ["CMD-SHELL", "curl -f http://$host_ip:3006/health || exit 1"]
+      test: ["CMD-SHELL", "curl -f http://$host_ip:${LLM_SERVER_PORT}/health || exit 1"]
       interval: 10s
       timeout: 10s
       retries: 100
-    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0
+    command: --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80
   audioqna-xeon-backend-server:
     image: ${REGISTRY:-opea}/audioqna:${TAG:-latest}
     container_name: audioqna-xeon-backend-server
     depends_on:
       - whisper-service
-      - tgi-service
+      - vllm-service
       - speecht5-service
     ports:
       - "3008:8888"

AudioQnA/docker_compose/intel/cpu/xeon/compose_multilang.yaml

Lines changed: 17 additions & 9 deletions
@@ -6,7 +6,7 @@ services:
     image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
     container_name: whisper-service
     ports:
-      - "7066:7066"
+      - ${WHISPER_SERVER_PORT:-7066}:7066
     ipc: host
     environment:
       no_proxy: ${no_proxy}
@@ -18,27 +18,35 @@ services:
     image: ${REGISTRY:-opea}/gpt-sovits:${TAG:-latest}
     container_name: gpt-sovits-service
     ports:
-      - "9880:9880"
+      - ${GPT_SOVITS_SERVER_PORT:-9880}:9880
     ipc: host
     environment:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
     restart: unless-stopped
-  tgi-service:
-    image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
-    container_name: tgi-service
+  vllm-service:
+    image: ${REGISTRY:-opea}/vllm:${TAG:-latest}
+    container_name: vllm-service
     ports:
-      - "3006:80"
+      - ${LLM_SERVER_PORT:-3006}:80
     volumes:
-      - "${MODEL_CACHE:-./data}:/data"
-    shm_size: 1g
+      - "${MODEL_CACHE:-./data}:/root/.cache/huggingface/hub"
+    shm_size: 128g
     environment:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
       HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
-    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0
+      LLM_MODEL_ID: ${LLM_MODEL_ID}
+      VLLM_TORCH_PROFILER_DIR: "/mnt"
+      LLM_SERVER_PORT: ${LLM_SERVER_PORT}
+    healthcheck:
+      test: ["CMD-SHELL", "curl -f http://$host_ip:${LLM_SERVER_PORT}/health || exit 1"]
+      interval: 10s
+      timeout: 10s
+      retries: 100
+    command: --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80
   audioqna-xeon-backend-server:
     image: ${REGISTRY:-opea}/audioqna-multilang:${TAG:-latest}
     container_name: audioqna-xeon-backend-server
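
Note that the model cache mount moves from `/data` (TGI) to `/root/.cache/huggingface/hub` (vLLM), so an existing Hugging Face cache on the host can be reused. A sketch, assuming the cache sits in the usual per-user location:

```bash
# Reuse the host's Hugging Face cache so vLLM does not re-download the model.
# MODEL_CACHE is the host path mounted into the container (default ./data).
export MODEL_CACHE="$HOME/.cache/huggingface/hub"
docker compose -f compose_multilang.yaml up -d
```
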
AudioQnA/docker_compose/intel/cpu/xeon/compose_tgi.yaml

Lines changed: 87 additions & 0 deletions

@@ -0,0 +1,87 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+services:
+  whisper-service:
+    image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
+    container_name: whisper-service
+    ports:
+      - ${WHISPER_SERVER_PORT:-7066}:7066
+    ipc: host
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+    restart: unless-stopped
+  speecht5-service:
+    image: ${REGISTRY:-opea}/speecht5:${TAG:-latest}
+    container_name: speecht5-service
+    ports:
+      - ${SPEECHT5_SERVER_PORT:-7055}:7055
+    ipc: host
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+    restart: unless-stopped
+  tgi-service:
+    image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
+    container_name: tgi-service
+    ports:
+      - ${LLM_SERVER_PORT:-3006}:80
+    volumes:
+      - "${MODEL_CACHE:-./data}:/data"
+    shm_size: 1g
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
+      LLM_SERVER_PORT: ${LLM_SERVER_PORT}
+    healthcheck:
+      test: ["CMD-SHELL", "curl -f http://$host_ip:${LLM_SERVER_PORT}/health || exit 1"]
+      interval: 10s
+      timeout: 10s
+      retries: 100
+    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0
+  audioqna-xeon-backend-server:
+    image: ${REGISTRY:-opea}/audioqna:${TAG:-latest}
+    container_name: audioqna-xeon-backend-server
+    depends_on:
+      - whisper-service
+      - tgi-service
+      - speecht5-service
+    ports:
+      - "3008:8888"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - MEGA_SERVICE_HOST_IP=${MEGA_SERVICE_HOST_IP}
+      - WHISPER_SERVER_HOST_IP=${WHISPER_SERVER_HOST_IP}
+      - WHISPER_SERVER_PORT=${WHISPER_SERVER_PORT}
+      - LLM_SERVER_HOST_IP=${LLM_SERVER_HOST_IP}
+      - LLM_SERVER_PORT=${LLM_SERVER_PORT}
+      - LLM_MODEL_ID=${LLM_MODEL_ID}
+      - SPEECHT5_SERVER_HOST_IP=${SPEECHT5_SERVER_HOST_IP}
+      - SPEECHT5_SERVER_PORT=${SPEECHT5_SERVER_PORT}
+    ipc: host
+    restart: always
+  audioqna-xeon-ui-server:
+    image: ${REGISTRY:-opea}/audioqna-ui:${TAG:-latest}
+    container_name: audioqna-xeon-ui-server
+    depends_on:
+      - audioqna-xeon-backend-server
+    ports:
+      - "5173:5173"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - CHAT_URL=${BACKEND_SERVICE_ENDPOINT}
+    ipc: host
+    restart: always
+
+networks:
+  default:
+    driver: bridge
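
This compose file keeps the original TGI image and launch flags, so the TGI variant can still be brought up and checked on its own, following the commands the updated README introduces:

```bash
# Start the TGI-backed pipeline instead of the default vLLM one.
docker compose -f compose_tgi.yaml up -d
# TGI logs "Connected" once the router is ready to accept requests.
docker logs tgi-service | grep Connected
```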

AudioQnA/docker_compose/intel/cpu/xeon/set_env.sh

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ export host_ip=$(hostname -I | awk '{print $1}')
 export HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN}
 # <token>
 
-export LLM_MODEL_ID=Intel/neural-chat-7b-v3-3
+export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
 
 export MEGA_SERVICE_HOST_IP=${host_ip}
 export WHISPER_SERVER_HOST_IP=${host_ip}
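
With the new default exported by set_env.sh, a typical launch sequence looks like the following sketch; the HF token is required because Meta-Llama-3-8B-Instruct is a gated model:

```bash
cd GenAIExamples/AudioQnA/docker_compose/intel/cpu/xeon/
export HUGGINGFACEHUB_API_TOKEN=<your HF token>   # needed to pull the gated Llama 3 model
source ./set_env.sh
docker compose up -d
```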
