Skip to content

Conversation

@edlee123
Copy link
Contributor

@edlee123 edlee123 commented Dec 20, 2024

Description

I added a llama.cpp LLM OPEA component. Llama.cpp is a popular LLM inference library/server "with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud" written in pure C/C++.

The component code is written in llm.py, and is most similar to the existing code in llms/text-generation/ray_serve. I also referred to ollama, and tgi to try keep with conventions.

Please see the README.md provides instructions how to use it.

Issues

List the issue or RFC link this PR is working on. If there is no such link, please mark it as n/a.

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

The dependencies are similar to other llm components.

Tests

This was tested on CPU (laptop) with Phi3.5 mini 4k instruct. The Llama Cpp can use GPU as needed but didn't test it.

@xiguiw
Copy link
Collaborator

xiguiw commented Jan 3, 2025

@edlee123
GenAIComps is under refactor - There are too much duplicated code.

Please wait for several days. And integrate it into new interface. Sorry for this.
But code refactor is helpful the new interface is simple. You just focus on LLM, no micro-service code needed.

@xiguiw xiguiw requested a review from letonghan January 6, 2025 09:17
@edlee123
Copy link
Contributor Author

edlee123 commented Jan 6, 2025

Hi @xiguiw

GenAIComps is under refactor - There are too much duplicated code.

No problem, which refactor branches should I wait for? I can wait for the refactoring to merge, and then I see how I can use the same approach.

@xiguiw
Copy link
Collaborator

xiguiw commented Jan 13, 2025

Hi @xiguiw

GenAIComps is under refactor - There are too much duplicated code.

No problem, which refactor branches should I wait for? I can wait for the refactoring to merge, and then I see how I can use the same approach.

@edlee123

The LLM refactor code is merged.
Here is the code structure for your reference:

https://github.com/opea-project/GenAIComps/tree/main/comps/llms

.
├── deployment
│   ├── docker_compose
│   └── kubernetes
├── src
│   └── text-generation
           └── integrations

Foy your reference:

  1. The docker-compose yaml file is put in deployment folder.
  2. There is only one micro-service code for llm, opea_llm_microservice.py. For integration of each LLM services engine, it is located in integrations. Please refer to opea_llm_microservice.py and opea.py.
  3. For multiple services/engines integrations, please can refer to https://github.com/opea-project/GenAIComps/tree/main/comps/embeddings/src/integrations

@edlee123
Copy link
Contributor Author

edlee123 commented Jan 13, 2025

Thank you @xiguiw - would good next steps for this PR be:

  1. Move the comps/llms/text-generation/llamacpp/docker_compose_llm.yaml file to comps/llms/deployment/docker_compose/text-generation_llamacpp.yaml (renaming it).
  2. Test the new refactored opea llm microservice works with the above docker compose file.
  3. Update the README.md in comps/llms/text-generation/llamacpp to use the above workflow?
  4. Delete files from comps/llms/text-generation/llamacpp that are no longer needed.
  5. Fix the last two github checks Check Online Document Building and Compose file and dockerfile path checking

Thank you for your guidance

@xiguiw
Copy link
Collaborator

xiguiw commented Feb 25, 2025

Thank you @xiguiw. One question for above 1. create Dockefile to build docker image. vllm has src/Dockerfile.intel_gpu to build for Intel GPU. For llama.cpp CPU, it wouldn't need a special build, and could use the image: ghcr.io/ggerganov/llama.cpp:server-b4419. Am I understanding we should include the Dockerfile for this ghcr.io/ggerganov/llama.cpp:server-b4419 in the repo?

@edlee123
I agree we use the image and include the Docker file for the image ghcr.io/ggerganov/llama.cpp:server-b4419.
We need a stable version. Is this image a stable release version?

Perhaps I can write the test it'll be easier to see, and we can look at the test's function build_docker_images() ?
Yes, it's great that you write the test.

@edlee123
Copy link
Contributor Author

edlee123 commented Feb 25, 2025

Hi @xiguiw

It looks like there's some debate in llama.cpp about semver or "stable" tagging to balance their rapid development and stability: ggml-org/llama.cpp#9276

But I don't think their process looks resolved yet: "The llama-server example is on path to become another production ready artifact".

Would you have some suggestion how I should proceed for now - maybe provide a disclaimer to check changelog carefully?

For the test I can do something like vLLM, e.g., pull the Docker file that goes with a particular release, and build with a build_docker_images function.

@edlee123
Copy link
Contributor Author

Changed to WIP pending bug #1323 Orphan container.

Copy link

@jrevillard jrevillard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello,

A general comment is that the llama.cpp repository moved so you should adapt the docker images URL (and all the links): ggml-org/llama.cpp#11801

Best,
Jérôme

@yinghu5 yinghu5 added the WIP label Mar 19, 2025
@xiguiw
Copy link
Collaborator

xiguiw commented Mar 19, 2025

Hi @xiguiw

It looks like there's some debate in llama.cpp about semver or "stable" tagging to balance their rapid development and stability: ggml-org/llama.cpp#9276

But I don't think their process looks resolved yet: "The llama-server example is on path to become another production ready artifact".

Would you have some suggestion how I should proceed for now - maybe provide a disclaimer to check changelog carefully?

For the test I can do something like vLLM, e.g., pull the Docker file that goes with a particular release, and build with a build_docker_images function.

@edlee123
Thanks for sharing the information.
Yes, there is braking changes in llama.cpp.

If we cannot get a stable version, we have the risk of broken llm->llama sometimes (This happened in vLLM and Gaudi. So we get the latest stable version in OPEA now).

If we fixed a verified commit ID, that's acceptable. But we need to update the commit id from time to time to get latest change from llama.cpp. There are some maintaining work.

Who will take this work, will you take this work?
If there is maintainer, that's not a problem. Otherwise, it's a potential issue.

@edlee123
Copy link
Contributor Author

edlee123 commented Mar 19, 2025

Hi @xiguiw, with the maintenance I'm thinking we revisit at a future time when llama.cpp is more stable, and we can close for now. Seems there would be many changes in llama.cpp to stay on top.

Also with #1395 can use openai style remote endpoints, and there are many good free options in OpenRouter.ai. I think this would give many options for devs on low compute machines.

@xiguiw
Copy link
Collaborator

xiguiw commented Mar 26, 2025

@edlee123

remind:

Thanks for exploration extending the OPEA llm backend.
How about this PR, shall we close this an open a new one in the future?

@edlee123
Copy link
Contributor Author

@xiguiw sounds good to me, closing it now. Thank you for all for the review

@edlee123 edlee123 closed this Mar 26, 2025
@yinghu5 yinghu5 added Backlog features in backlog A3 Maintain labels Mar 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A3 Maintain Backlog features in backlog WIP

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants