Adding a llama.cpp LLM Component #1052
Closed
Commits (51)
- 397f7b8 First commit of llamacpp Opea component (edlee123)
- cb4f5e5 Removed unneeded requirements file (edlee123)
- df3d943 Merge branch 'main' into llamacpp (edlee123)
- 8893f38 Merge branch 'main' into llamacpp (edlee123)
- 2a48bae Pin the llama.cpp server version, and fix small typo (edlee123)
- 644ecce Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp (edlee123)
- 4e82152 Update README.md to describe hardware support, and provide reference. (edlee123)
- baf381d Updated docker_compose_llm.yaml so that the llamacpp-server so the pu… (edlee123)
- 7bab970 Merge branch 'main' into llamacpp (edlee123)
- e4f4b70 Merge branch 'main' into llamacpp (edlee123)
- 9d7539d Small adjustments to README.md (edlee123)
- 2cf25e5 Merge branch 'main' into llamacpp (edlee123)
- fd15ee7 This removes unneeded dependencies in the Dockerfile, unneeded entryp… (edlee123)
- 666196c Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp (edlee123)
- 104527a Merge branch 'main' into llamacpp (edlee123)
- c931902 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
- 6b98403 Merge branch 'main' into llamacpp (edlee123)
- 240d3d1 Merge branch 'main' into llamacpp (edlee123)
- 91e0fd4 Merge branch 'main' into llamacpp (edlee123)
- a75d28d Refactored llama cpp and text-generation README_llamacpp.md (edlee123)
- 830da58 Delete unrefactored files (edlee123)
- 8d058bb Adding llama.cpp backend include in the compose_text-genearation.yaml (edlee123)
- a0294a5 Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp (edlee123)
- a6740b6 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
- d0e27bf Fix service name (edlee123)
- 91324af Revise llamacpp, using smaller Qwen model and remove unnecessary curl… (edlee123)
- f295e29 Update llamacpp thirdparty readme to use smaller model (edlee123)
- 480cb69 Fix healthcheck in llamacpp deployment compose.yaml (edlee123)
- 2c9f877 Wrote a test and tested for llamacpp text gen service (edlee123)
- f3147f1 Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp (edlee123)
- 7310d6a [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
- 80ed9b0 Merge branch 'main' into llamacpp (edlee123)
- efde309 Increase the llamacpp-server wait time (edlee123)
- 1a7db52 Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp (edlee123)
- c474a64 Fixed typos on http environment variables, and volumes (edlee123)
- 712f575 Splitting the llama.cpp test to use compose up on the llama.cpp third… (edlee123)
- 68cc00f add alternate command to stop and remove docker containers from previ… (edlee123)
- 2dd2064 Modifying tear down of stop_docker in llamacpp tests to try to remove… (edlee123)
- dbff6fc Adding some logs output to debug llamacpp test (edlee123)
- f184897 Found model path bug and fixed it to run llama.cpp test (edlee123)
- ea4ea38 Adjusted LLM_ENDPOINT env variable (edlee123)
- 01fca03 Cleaned up test file (edlee123)
- dfd5057 Adjust host_ip env variable in scope of start_service (edlee123)
- a741320 Merge branch 'main' into llamacpp (edlee123)
- 4a965da Docker ps to debug orphaned containers. (edlee123)
- 25240da Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp (edlee123)
- 32b06e9 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
- 3363504 Adding output to debug orphaned docker containers (edlee123)
- 421b1ab Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp (edlee123)
- d5d3c1e Merge branch 'main' into llamacpp (edlee123)
- d85c60e Merge branch 'main' into llamacpp (xiguiw)
comps/llms/text-generation/llamacpp/Dockerfile (new file, +27 lines)

```dockerfile
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    curl \
    libgl1-mesa-glx \
    libjemalloc-dev

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

USER user

# Assumes we're building from the GenAIComps directory.
COPY ../../../comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
    pip install --no-cache-dir -r /home/user/comps/llms/text-generation/llamacpp/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/llms/text-generation/llamacpp/

ENTRYPOINT ["bash", "entrypoint.sh"]
```
README for the llama.cpp text-generation component (new file, +88 lines)

# Introduction

[llama.cpp](https://github.com/ggerganov/llama.cpp) provides inference in pure C/C++, and enables "LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud".

This OPEA component wraps the llama.cpp server so that it can interface with other OPEA components or be used to create OPEA Megaservices.

llama.cpp supports the [hardware listed here](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#supported-backends); this component has only been tested on CPU.

To use a CUDA server, please refer to [this llama.cpp reference](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md#docker) and modify docker_compose_llm.yaml accordingly.

## TLDR

```bash
cd GenAIComps/
docker compose -f comps/llms/text-generation/llamacpp/docker_compose_llm.yaml up
```

Please note it is instructive to run and validate the llama.cpp server and the OPEA component individually, as described below.

## 1. Run the llama.cpp server

```bash
cd GenAIComps
docker compose -f comps/llms/text-generation/llamacpp/docker_compose_llm.yaml up llamacpp-server --force-recreate
```

Notes:

i) If you prefer to run the above in the background without screen output, use `up -d`. The `--force-recreate` flag clears the cache.

ii) To tear down the llama.cpp server and remove the container:

`docker compose -f comps/llms/text-generation/llamacpp/docker_compose_llm.yaml down llamacpp-server`

iii) [llama.cpp settings](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md) should be specified in the docker_compose_llm.yaml file.

#### Verify the llama.cpp Service:

```bash
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```

## 2. Run the llama.cpp OPEA Service

This is essentially a wrapper component for the llama.cpp server. OPEA standardizes and validates LLM inputs with the LLMParamsDoc class (see llm.py).

### 2.1 Build and run the llama.cpp OPEA service:

```bash
cd GenAIComps/
docker compose -f comps/llms/text-generation/llamacpp/docker_compose_llm.yaml up llamacpp-opea-llm
```

Equivalently, the above can be achieved with `build` and `run` from the Dockerfile. Build:

```bash
cd GenAIComps/
docker build --no-cache -t opea/llm-llamacpp:latest \
  --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy \
  -f comps/llms/text-generation/llamacpp/Dockerfile .
```

And run:

```bash
docker run --network host -e http_proxy=$http_proxy -e https_proxy=$https_proxy \
  opea/llm-llamacpp:latest
```

### 2.2 Consume the llama.cpp Microservice:

```bash
curl http://127.0.0.1:9000/v1/chat/completions -X POST \
  -d '{"query":"What is Deep Learning?","max_tokens":32,"top_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
  -H 'Content-Type: application/json'
```

### Notes

Tearing down services and removing containers:

```bash
cd GenAIComps/
docker compose -f comps/llms/text-generation/llamacpp/docker_compose_llm.yaml down
```
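As a supplement to the curl call in section 2.2 of the README above, the same request can be made from Python. This is a minimal sketch, assuming the `requests` package is installed and the OPEA service is listening on port 9000; the field names simply mirror the curl payload.

```python
# Minimal sketch: POST the same JSON payload as the curl example to the OPEA wrapper.
# Assumes `requests` is installed (pip install requests) and the service runs on port 9000.
import requests

payload = {
    "query": "What is Deep Learning?",
    "max_tokens": 32,
    "top_p": 0.95,
    "temperature": 0.01,
    "repetition_penalty": 1.03,
    "streaming": False,
}

response = requests.post(
    "http://127.0.0.1:9000/v1/chat/completions",
    json=payload,
    timeout=120,
)
response.raise_for_status()
print(response.json())
```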
New file (license header only, +2 lines)

```
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
```
comps/llms/text-generation/llamacpp/docker_compose_llm.yaml (new file, 39 additions, 0 deletions)

```yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

services:
  llamacpp-server:
    image: ghcr.io/ggerganov/llama.cpp:server-b4419
    ports:
      - 8080:8080
    environment:
      # Refer to settings here: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
      # Llama.cpp is based on .gguf format, and Hugging Face offers many .gguf format models.
      LLAMA_ARG_MODEL_URL: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_PARALLEL: 2
      LLAMA_ARG_ENDPOINT_METRICS: 1
      LLAMA_ARG_PORT: 8080

  llamacpp-opea-llm:
    image: opea/llm-llamacpp:latest
    build:
      # Set this to allow COPY comps in the Dockerfile.
      # When using docker compose with -f, the comps context is 4 levels down from docker_compose_llm.yaml.
      context: ../../../../
      dockerfile: ./comps/llms/text-generation/llamacpp/Dockerfile
    depends_on:
      - llamacpp-server
    ports:
      - "9000:9000"
    network_mode: "host" # equivalent to: docker run --network host ...
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      # LLAMACPP_ENDPOINT: ${LLAMACPP_ENDPOINT}
    restart: unless-stopped

networks:
  default:
    driver: bridge
```
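Because llm.py talks to the llamacpp-server through an OpenAI-compatible API, the server defined above can also be sanity-checked directly from Python before bringing up the wrapper. This is a sketch under stated assumptions: it uses the `openai` package (already in requirements.txt), the default port 8080, and a `/v1` base path on the llama.cpp server; the model name is a placeholder, since the server answers with whichever .gguf model it was started with.

```python
# Hedged sketch: call the llamacpp-server directly with the OpenAI-compatible client,
# mirroring what llm.py does internally. The base_url path and model name are assumptions.
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # llama.cpp server's OpenAI-compatible endpoint (assumed path)
    api_key="sk-no-key-required",         # same dummy key used in llm.py
)

completion = client.chat.completions.create(
    model="phi-3-mini-4k-instruct",  # placeholder; the server serves the model it loaded at startup
    messages=[{"role": "user", "content": "What is Deep Learning?"}],
    max_tokens=32,
    temperature=0.01,
)
print(completion.choices[0].message.content)
```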
comps/llms/text-generation/llamacpp/entrypoint.sh (new file, +8 lines)

```bash
#!/usr/bin/env bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# pip --no-cache-dir install -r requirements-runtime.txt

python llm.py
```
comps/llms/text-generation/llamacpp/llm.py (new file, +65 lines)

```python
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

import openai
from fastapi.responses import StreamingResponse

from comps import CustomLogger, LLMParamsDoc, ServiceType, opea_microservices, register_microservice

logger = CustomLogger("llm_llamacpp")
logflag = os.getenv("LOGFLAG", False)
llamacpp_endpoint = os.getenv("LLAMACPP_ENDPOINT", "http://localhost:8080/")


# OPEA microservice wrapper of llama.cpp
# llama.cpp server uses openai API format: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
@register_microservice(
    name="opea_service@llm_llamacpp",
    service_type=ServiceType.LLM,
    endpoint="/v1/chat/completions",
    host="0.0.0.0",
    port=9000,
)
async def llm_generate(input: LLMParamsDoc):
    if logflag:
        logger.info(input)
        logger.info(llamacpp_endpoint)

    client = openai.OpenAI(
        base_url=llamacpp_endpoint, api_key="sk-no-key-required"  # "http://<Your api-server IP>:port"
    )

    # Llama.cpp works with openai API format
    # The openai api doesn't have top_k parameter
    # https://community.openai.com/t/which-openai-gpt-models-if-any-allow-specifying-top-k/777982/2
    chat_completion = client.chat.completions.create(
        model=input.model,
        messages=[{"role": "user", "content": input.query}],
        max_tokens=input.max_tokens,
        temperature=input.temperature,
        top_p=input.top_p,
        frequency_penalty=input.frequency_penalty,
        presence_penalty=input.presence_penalty,
        stream=input.streaming,
    )

    if input.streaming:

        def stream_generator():
            for c in chat_completion:
                if logflag:
                    logger.info(c)
                yield f"data: {c.model_dump_json()}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(stream_generator(), media_type="text/event-stream")
    else:
        if logflag:
            logger.info(chat_completion)
        return chat_completion


if __name__ == "__main__":
    opea_microservices["opea_service@llm_llamacpp"].start()
```
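When `streaming` is true, `stream_generator()` above emits Server-Sent Events lines of the form `data: {...}` terminated by `data: [DONE]`. The sketch below shows one way a client might consume that stream; it assumes the `requests` package is available and that each chunk follows the OpenAI chunk layout (`choices[0].delta.content`), which is what the llama.cpp OpenAI-compatible responses serialize to.

```python
# Hedged sketch of reading the SSE stream produced by llm.py's streaming branch.
# Assumes `requests` is installed and the OPEA service is on port 9000.
import json
import requests

payload = {"query": "What is Deep Learning?", "max_tokens": 32, "streaming": True}

with requests.post(
    "http://127.0.0.1:9000/v1/chat/completions", json=payload, stream=True, timeout=120
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)  # one chunk serialized by model_dump_json() in stream_generator()
        delta = chunk["choices"][0].get("delta", {})  # OpenAI-style chunk layout (assumed)
        print(delta.get("content") or "", end="", flush=True)
    print()
```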
comps/llms/text-generation/llamacpp/requirements.txt (new file, +12 lines)

```
aiohttp
docarray[full]
fastapi
huggingface_hub
openai
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
prometheus-fastapi-instrumentator
shortuuid
transformers
uvicorn
```