vLLM support for FAQGen #884
Merged
Changes from all commits (13 commits):
- 32b18a3 (sgurunat): Add model parameter for FaqGenGateway in gateway.py file
- 20fbed5 (sgurunat): Add langchain vllm support for FaqGen along with authentication suppo…
- 89d30d2 (pre-commit-ci[bot]): [pre-commit.ci] auto fixes from pre-commit.com hooks
- 3b93cce (sgurunat): Updated docker_compose_llm.yaml and README file with vLLM information
- 0dbae69 (sgurunat): Merge branch 'vllm-faq' of https://github.com/sgurunat/GenAIComps int…
- fa46865 (pre-commit-ci[bot]): [pre-commit.ci] auto fixes from pre-commit.com hooks
- 5ef1c4a (sgurunat): Merge branch 'main' into vllm-faq
- 4777fa9 (sgurunat): Merge branch 'main' into vllm-faq
- 802d99d (sgurunat): Updated faq-vllm Dockerfile into llm-compose-cd.yaml under github wor…
- 09b1979 (sgurunat): Merge branch 'main' into vllm-faq
- 9056766 (sgurunat): resolved merge conflicts
- 47cc36c (sgurunat): Merge branch 'vllm-faq' of https://github.com/sgurunat/GenAIComps int…
- 33f5011 (sgurunat): Updated llm-compose.yaml file to include vllm faqgen build
comps/llms/faq-generation/vllm/langchain/Dockerfile (new file, 25 lines):

```dockerfile
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    libgl1-mesa-glx \
    libjemalloc-dev

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
    pip install --no-cache-dir -r /home/user/comps/llms/faq-generation/vllm/langchain/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/llms/faq-generation/vllm/langchain

ENTRYPOINT ["bash", "entrypoint.sh"]
```
README.md (new file, 77 lines):

# vLLM FAQGen LLM Microservice

This microservice interacts with the vLLM server to generate FAQs from input text. [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving, delivering state-of-the-art serving throughput with advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products).

## 🚀1. Start Microservice with Docker

If you start the LLM microservice with Docker, the `docker_compose_llm.yaml` file will also start a vLLM service automatically.

To set up or build the vLLM image, follow the instructions provided in [vLLM Gaudi](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/text-generation/vllm/langchain#22-vllm-on-gaudi).

### 1.1 Setup Environment Variables

To start the vLLM and LLM services, first set the following environment variables.

```bash
export HF_TOKEN=${your_hf_api_token}
export vLLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
```

### 1.2 Build Docker Image

```bash
cd ../../../../../
docker build -t opea/llm-faqgen-vllm:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/faq-generation/vllm/langchain/Dockerfile .
```

To start a docker container, you have two options:

- A. Run Docker with CLI
- B. Run Docker with Docker Compose

You can choose one as needed.

### 1.3 Run Docker with CLI (Option A)

```bash
docker run -d -p 8008:80 -v ./data:/data --name vllm-service --shm-size 1g opea/vllm:hpu --model-id ${LLM_MODEL_ID}
```

```bash
docker run -d --name="llm-faqgen-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e vLLM_ENDPOINT=$vLLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HF_TOKEN opea/llm-faqgen-vllm:latest
```

### 1.4 Run Docker with Docker Compose (Option B)

```bash
docker compose -f docker_compose_llm.yaml up -d
```

## 🚀2. Consume LLM Service

### 2.1 Check Service Status

```bash
curl http://${your_ip}:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```

### 2.2 Consume FAQGen LLM Service

```bash
# Streaming Response
# Set streaming to True. The default is True.
curl http://${your_ip}:9000/v1/faqgen \
  -X POST \
  -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5."}' \
  -H 'Content-Type: application/json'

# Non-Streaming Response
# Set streaming to False.
curl http://${your_ip}:9000/v1/faqgen \
  -X POST \
  -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "streaming":false}' \
  -H 'Content-Type: application/json'
```
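As a complement to the curl commands above, a minimal Python client for the non-streaming path could look like the sketch below. The endpoint, port, and payload fields come from the README examples; the host value and the use of the `requests` package are assumptions.

```python
# Minimal sketch of a non-streaming FAQGen client.
# Assumes the microservice is reachable at localhost:9000 as configured above.
import requests

payload = {
    "query": (
        "Text Embeddings Inference (TEI) is a toolkit for deploying and serving "
        "open source text embeddings and sequence classification models."
    ),
    "streaming": False,  # ask for a single JSON document instead of an SSE stream
}

resp = requests.post("http://localhost:9000/v1/faqgen", json=payload, timeout=120)
resp.raise_for_status()
doc = resp.json()
print(doc.get("text"))  # the generated FAQs (GeneratedDoc.text in llm.py below)
```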
A new file containing only the license header:

```python
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
```
comps/llms/faq-generation/vllm/langchain/docker_compose_llm.yaml (new file, 46 lines):

```yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3.8"

services:
  vllm-service:
    image: opea/vllm:hpu
    container_name: vllm-gaudi-server
    ports:
      - "8008:80"
    volumes:
      - "./data:/data"
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HF_TOKEN}
      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      LLM_MODEL_ID: ${LLM_MODEL_ID}
    runtime: habana
    cap_add:
      - SYS_NICE
    ipc: host
    command: --enforce-eager --model $LLM_MODEL_ID --tensor-parallel-size 1 --host 0.0.0.0 --port 80
  llm:
    image: opea/llm-faqgen-vllm:latest
    container_name: llm-faqgen-server
    depends_on:
      - vllm-service
    ports:
      - "9000:9000"
    ipc: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      vLLM_ENDPOINT: ${vLLM_ENDPOINT}
      HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN}
      LLM_MODEL_ID: ${LLM_MODEL_ID}
    restart: unless-stopped

networks:
  default:
    driver: bridge
```
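The compose file exposes the vLLM server on host port 8008 and the FAQGen wrapper on 9000. Before exercising the wrapper, it can help to confirm that vLLM's OpenAI-compatible API answers directly. The sketch below is illustrative only; it assumes localhost, that `LLM_MODEL_ID` is exported as in the README, and that the `requests` package is available.

```python
# Sanity check against the vLLM server's OpenAI-compatible completions route.
# Host/port mirror the compose mapping above; LLM_MODEL_ID must be exported.
import os
import requests

endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8008")
payload = {
    "model": os.environ["LLM_MODEL_ID"],  # same model the vllm-service container serves
    "prompt": "Ping",
    "max_tokens": 16,
}

resp = requests.post(f"{endpoint}/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```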
entrypoint.sh (new file, 8 lines):

```bash
#!/usr/bin/env bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

pip --no-cache-dir install -r requirements-runtime.txt

python llm.py
```
llm.py (new file, 102 lines):

```python
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

from fastapi.responses import StreamingResponse
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.llms import VLLMOpenAI

from comps import CustomLogger, GeneratedDoc, LLMParamsDoc, ServiceType, opea_microservices, register_microservice
from comps.cores.mega.utils import get_access_token

logger = CustomLogger("llm_faqgen")
logflag = os.getenv("LOGFLAG", False)

# Environment variables
TOKEN_URL = os.getenv("TOKEN_URL")
CLIENTID = os.getenv("CLIENTID")
CLIENT_SECRET = os.getenv("CLIENT_SECRET")


def post_process_text(text: str):
    if text == " ":
        return "data: @#$\n\n"
    if text == "\n":
        return "data: <br/>\n\n"
    if text.isspace():
        return None
    new_text = text.replace(" ", "@#$")
    return f"data: {new_text}\n\n"


@register_microservice(
    name="opea_service@llm_faqgen",
    service_type=ServiceType.LLM,
    endpoint="/v1/faqgen",
    host="0.0.0.0",
    port=9000,
)
async def llm_generate(input: LLMParamsDoc):
    if logflag:
        logger.info(input)
    access_token = (
        get_access_token(TOKEN_URL, CLIENTID, CLIENT_SECRET) if TOKEN_URL and CLIENTID and CLIENT_SECRET else None
    )
    headers = {}
    if access_token:
        headers = {"Authorization": f"Bearer {access_token}"}

    model = input.model if input.model else os.getenv("LLM_MODEL_ID")
    llm = VLLMOpenAI(
        openai_api_key="EMPTY",
        openai_api_base=llm_endpoint + "/v1",
        model_name=model,
        default_headers=headers,
        max_tokens=input.max_tokens,
        top_p=input.top_p,
        streaming=input.streaming,
        temperature=input.temperature,
    )

    templ = """Create a concise FAQs (frequently asked questions and answers) for following text:
        TEXT: {text}
        Do not use any prefix or suffix to the FAQ.
    """
    PROMPT = PromptTemplate.from_template(templ)
    llm_chain = load_summarize_chain(llm=llm, prompt=PROMPT)
    texts = text_splitter.split_text(input.query)

    # Create multiple documents
    docs = [Document(page_content=t) for t in texts]

    if input.streaming:

        async def stream_generator():
            from langserve.serialization import WellKnownLCSerializer

            _serializer = WellKnownLCSerializer()
            async for chunk in llm_chain.astream_log(docs):
                data = _serializer.dumps({"ops": chunk.ops}).decode("utf-8")
                if logflag:
                    logger.info(data)
                yield f"data: {data}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(stream_generator(), media_type="text/event-stream")
    else:
        response = await llm_chain.ainvoke(docs)
        response = response["output_text"]
        if logflag:
            logger.info(response)
        return GeneratedDoc(text=response, prompt=input.query)


if __name__ == "__main__":
    llm_endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8080")
    # Split text
    text_splitter = CharacterTextSplitter()
    opea_microservices["opea_service@llm_faqgen"].start()
```
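Because the streaming branch above serializes each `RunLogPatch` with langserve's `WellKnownLCSerializer` and emits it as a server-sent event, a client has to read `data:` lines until the `[DONE]` sentinel. A minimal consumption sketch follows; the host, port, and use of the `requests` package are assumptions, while the event framing matches `stream_generator()` above.

```python
# Sketch of a client for the streaming /v1/faqgen response produced above.
import json
import requests

payload = {
    "query": "Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings.",
    "streaming": True,
}

with requests.post("http://localhost:9000/v1/faqgen", json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip blank separators between SSE events
        body = line[len("data: "):]
        if body == "[DONE]":
            break
        patch = json.loads(body)  # {"ops": [...]} produced by WellKnownLCSerializer
        print(patch["ops"])
```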
comps/llms/faq-generation/vllm/langchain/requirements-runtime.txt (new file, 1 line):

```text
langserve
```
comps/llms/faq-generation/vllm/langchain/requirements.txt (new file, 15 lines):

```text
docarray[full]
fastapi
huggingface_hub
langchain
langchain-huggingface
langchain-openai
langchain_community
langchainhub
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
prometheus-fastapi-instrumentator
shortuuid
transformers
uvicorn
```