vLLM is a high-performance and easy-to-use LLM inference and serving framework. vLLM supports most commonly used LLMs, including Qwen models. vLLM uses technologies such as PagedAttention, continuous batching, and model quantization to greatly improve the inference efficiency of LLMs. For more information about the vLLM framework, see the vLLM GitHub repository.
- An ACK Pro cluster that contains GPU-accelerated nodes is created. The Kubernetes version of the cluster is 1.22 or later. Each GPU-accelerated node provides at least 16 GB of GPU memory. For more information, see Create an ACK managed cluster.
This example uses the Qwen1.5-4B-Chat model to show how to download a Qwen model, upload the model to Object Storage Service (OSS), and create a persistent volume (PV) and persistent volume claim (PVC) in an ACK cluster.
- Run the following commands to download the model file:
# Install Git and Git LFS.
yum install git git-lfs
# Clone the repository but skip downloading the large files that are managed by Git LFS.
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
cd Qwen1.5-4B-Chat
# Download the model weights.
git lfs pull
- Upload the Qwen1.5-4B-Chat model file to OSS. If ossutil has not been configured yet, see the note after the following commands.
ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
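ossutil reads the OSS endpoint and AccessKey pair from its configuration. If ossutil has not been configured on the machine, the following example configures it non-interactively; the endpoint and credential values are placeholders that you must replace:
ossutil config -e <Your-OSS-Endpoint> -i <Your-AccessKey-ID> -k <Your-AccessKey-Secret>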
- Configure a PV and a PVC in the destination cluster.
Replace the variables in the file with actual values before you apply it. A sketch of what the manifest might contain is provided after the following command.
kubectl apply -f ./yamls/dataset.yaml
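The content of dataset.yaml is not included in this topic. The following is only a minimal sketch of a statically provisioned OSS volume, assuming the OSS CSI plugin (ossplugin.csi.alibabacloud.com) and placeholder names such as qwen-model-pv, qwen-model-pvc, and oss-secret. Your actual file may differ.
# Sketch only. Replace the bucket name, endpoint, Secret name, and capacity with real values.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: qwen-model-pv
  labels:
    alicloud-pvname: qwen-model-pv
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: qwen-model-pv
    nodePublishSecretRef:
      name: oss-secret                     # Secret that stores the AccessKey pair
      namespace: default
    volumeAttributes:
      bucket: "<Your-Bucket-Name>"
      url: "<Your-OSS-Endpoint>"           # for example, an internal endpoint in the same region
      path: "/Qwen1.5-4B-Chat"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen-model-pvc
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: qwen-model-pv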
- Run the following commands to deploy the Qwen1.5-4B-Chat model as an inference service by using vLLM. A sketch of the two manifests is provided after the commands:
kubectl apply -f ./yamls/deploy.yaml
kubectl apply -f ./yamls/service.yaml
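deploy.yaml and service.yaml are not included in this topic either. The following sketch shows what they might contain, assuming the OpenAI-compatible server that vLLM provides (vllm.entrypoints.openai.api_server), the placeholder PVC qwen-model-pvc from the previous sketch, and a placeholder container image. Adjust the image, resources, and mount path to your environment.
# Sketch only: a Deployment that runs the vLLM OpenAI-compatible server and a Service in front of it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: qwen-model-pvc      # PVC created from dataset.yaml
      containers:
        - name: vllm
          image: <vLLM-Image>              # placeholder: an image that has vLLM installed
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model=/models/Qwen1.5-4B-Chat
            - --served-model-name=qwen     # matches the "model" field in the inference request below
            - --gpu-memory-utilization=0.95
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model
              mountPath: /models/Qwen1.5-4B-Chat
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000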
- Run the following command to query the status of the pod that runs the inference service:
kubectl get po | grep vllm
Expected output:
vllm-7cfb9cd9f4-w2hk8   1/1   Running   0   7m11s
The output indicates that the pod is in the Running state and the inference service is ready to serve requests.
- Run the following command to create a port forwarding rule between the inference service and the local environment:
kubectl port-forward svc/vllm-service 8000:8000
Expected output:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
- Run the following command to send an inference request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen", "messages": [{"role": "user", "content": "测试一下"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
Expected output:
{"id":"cmpl-e3efa23d803349d2ad6b44411811109f","object":"chat.completion","created":1720161589,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"好的,请问您需要测试什么?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":30,"completion_tokens":9}}
If you no longer need the resources, delete them at the earliest opportunity.
- Run the following command to delete the inference service:
kubectl delete -f ./yamls
| Tag | Date | Release |
| --- | --- | --- |
| 0.4.1 | 2024-07 | init |