Add llama-cpp-python server #452
base: main
Conversation
Reviewer's Guide by Sourcery

This PR changes the default runtime from 'llama.cpp' to 'llama-cpp-python' and adds support for the 'llama-cpp-python' server implementation. The changes involve modifying the server execution logic and updating the CLI configuration to accommodate the new runtime option.

Sequence diagram for server execution logic

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Model
    User->>CLI: Run command with --runtime flag
    CLI->>Model: Pass runtime argument
    alt Runtime is vllm
        Model->>CLI: Execute vllm server
    else Runtime is llama.cpp
        Model->>CLI: Execute llama-server
    else Runtime is llama-cpp-python
        Model->>CLI: Execute llama_cpp.server
    end
```
Updated class diagram for runtime configuration

```mermaid
classDiagram
    class CLI {
        -runtime: String
        +configure_arguments(parser)
    }
    class Model {
        +serve(args)
    }
    CLI --> Model: uses
    note for CLI "Updated default runtime to 'llama-cpp-python' and added it as a choice"
```
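For orientation, here is a minimal Python sketch of the runtime dispatch the sequence diagram describes. The function name `exec_server` and the exact command-line flags are assumptions for the sketch, not the actual ramalama code.

```python
import shlex
import subprocess


def exec_server(runtime: str, model_path: str, port: int = 8080) -> None:
    """Dispatch to a server command based on the selected runtime.

    Mirrors the sequence diagram above; flags are illustrative only.
    """
    if runtime == "vllm":
        # vLLM ships its own OpenAI-compatible server
        cmd = ["vllm", "serve", model_path, "--port", str(port)]
    elif runtime == "llama.cpp":
        # llama.cpp's native server binary
        cmd = ["llama-server", "-m", model_path, "--port", str(port)]
    elif runtime == "llama-cpp-python":
        # llama-cpp-python exposes its server as a Python module
        cmd = [
            "python3", "-m", "llama_cpp.server",
            "--model", model_path,
            "--port", str(port),
        ]
    else:
        raise ValueError(f"unknown runtime: {runtime}")

    print("+", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```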
Hey @ericcurtin - I've reviewed your changes - here's some feedback:
Overall Comments:
- Please provide justification for changing the default runtime from llama.cpp to llama-cpp-python. What are the benefits that led to this decision?
- Consider updating any additional documentation to reflect the new runtime option and explain the differences between the available runtimes
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
We have skipped reviewing this pull request. It looks like we've already reviewed the commit 47fdf9a in this pull request.
Force-pushed from c4f91bc to fe9c0ca
LGTM for the code. But, I think I'm missing something. Is `llama-cpp-python` already a dependency?
@ygalblum we probably need to push some container images before merging this, but when we do that, we should be all good.
Force-pushed from 1d79a3c to 4c35af3
ramalama/cli.py (Outdated)

```diff
@@ -103,7 +103,7 @@ def load_and_merge_config():
     )

     ramalama_config['carimage'] = ramalama_config.get('carimage', "registry.access.redhat.com/ubi9-micro:latest")
-    ramalama_config['runtime'] = ramalama_config.get('runtime', 'llama.cpp')
+    ramalama_config['runtime'] = ramalama_config.get('runtime', 'llama-cpp-python')
```
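As a side note, here is a minimal sketch of how this `dict.get`-based default behaves (standalone dicts for illustration, not the real config loader): an explicitly configured runtime still wins, and the new default only applies when nothing is set.

```python
# Standalone illustration of the default-resolution pattern used above.
ramalama_config = {"runtime": "vllm"}  # user explicitly configured a runtime
ramalama_config["runtime"] = ramalama_config.get("runtime", "llama-cpp-python")
assert ramalama_config["runtime"] == "vllm"  # explicit value is preserved

ramalama_config = {}  # nothing configured
ramalama_config["runtime"] = ramalama_config.get("runtime", "llama-cpp-python")
assert ramalama_config["runtime"] == "llama-cpp-python"  # new default applies
```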
Looks like we have a different runtime for `ramalama serve` versus `ramalama run`?
For `ramalama run`, the only commands available to do that in a single process are the llama.cpp ones, like llama-cli (and llama-run, which I'll be adding in this PR now that the last few bits were merged a couple of hours ago). There are other possibilities for chatbots, but the others require us to set up a server process and a client process. I think llama-run is cleaner than setting up a client and server.
As regards serving, there are 3 clear options: vLLM, llama.cpp and llama-cpp-python (which is just a Python wrapper around llama.cpp). Although vLLM is the best enterprise option today for a server in my opinion, it lacks the diverse hardware support for local development, which leaves us the other two. The llama.cpp server is a very minimal implementation of an OpenAI-compatible server, as in it is barely OpenAI-compatible; last I tried it, it didn't work with OpenAI-compatible clients like https://github.com/open-webui/open-webui
The llama.cpp server might catch up as it is actively being contributed to, but it's not there today. One key feature it is missing is multi-model support, as the client normally specifies the model it wants to use.
llama-cpp-python is further ahead in terms of its OpenAI-compatible server implementation; that's why I'm proposing we switch in this PR.
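To make the OpenAI-compatibility point concrete, here is a small sketch of a client talking to a locally running llama-cpp-python server via the `openai` Python package. The base URL, port, and model name are placeholders, not values from this PR; the point is that the client names the model it wants, which is where multi-model support matters.

```python
from openai import OpenAI

# llama-cpp-python serves an OpenAI-compatible API; the URL/port and model
# name here are placeholders for whatever the local server was started with.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="granite-code",  # hypothetical model name; the client picks the model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```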
My opinion here is that choice is bad. We should only offer llama.cpp, and if the server tool llama-cpp-python is better, then default to using that. If a better one shows up in the future, then we could move to it, as long as it implements the same API as llama-cpp-python. We are doing the same thing on the run side, moving from llama-client to llama-run.
Bottom line: keep the selection to llama.cpp and vLLM (and whisper.cpp), and we can modify the tools used to work with the libraries under the covers. Eventually we could allow specifying the commands to run within the containers for run and serve, but that is a ways away.
Force-pushed from acb465f to 678ca99
Adding temperature option to llama-run here:
ramalama/cli.py (Outdated)

```diff
@@ -205,8 +205,8 @@ def configure_arguments(parser):
     parser.add_argument(
         "--runtime",
         default=config.get("runtime"),
-        choices=["llama.cpp", "vllm"],
-        help="specify the runtime to use; valid options are 'llama.cpp' and 'vllm'",
+        choices=["llama-cpp-python", "llama.cpp", "vllm"],
```
As stated above, do not give this choice; just stick with llama.cpp and use the llama-cpp-python server for serve.
Force-pushed from ca00357 to 07202c5
Changed default runtime from 'llama.cpp' to 'llama-cpp-python'. Added 'llama-cpp-python' as a runtime option for better flexibility with the `--runtime` flag. Signed-off-by: Eric Curtin <[email protected]>
Force-pushed from 07202c5 to ebf4d34
Changed default runtime from 'llama.cpp' to 'llama-cpp-python'. Added 'llama-cpp-python' as a runtime option for better flexibility with the `--runtime` flag.

Summary by Sourcery
Add 'llama-cpp-python' as a new runtime option and set it as the default runtime, enhancing flexibility in model serving.