feat(forge/llm): Add LlamafileProvider
#7091
base: master
Conversation
…der for llamafiles. Currently it just extends OpenAIProvider and only overrides methods that are necessary to get the system to work at a basic level. Update ModelProviderName schema and config/configurator so that app startup using this provider is handled correctly. Add 'mistral-7b-instruct-v0' to OpenAIModelName/OPEN_AI_CHAT_MODELS registries.
…-Instruct chat template, which supports the 'user' & 'assistant' roles but does not support the 'system' role.
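A minimal sketch of the message conversion this template constraint implies (the function name and message shape are illustrative assumptions, not the PR's actual code):

```python
def convert_for_mistral_instruct(messages: list[dict[str, str]]) -> list[dict[str, str]]:
    """Re-label 'system' messages as 'user' messages, since the
    Mistral-7B-Instruct chat template only knows 'user' and 'assistant'."""
    return [
        {"role": "user" if m["role"] == "system" else m["role"], "content": m["content"]}
        for m in messages
    ]
```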
…kens`, and `get_tokenizer` from classmethods so I can override them in `LlamafileProvider` (and so I can access instance attributes from inside them). Implement class `LlamafileTokenizer` that calls the llamafile server's `/tokenize` API endpoint.
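A rough sketch of what such a tokenizer could look like (the class shape and method names are assumptions based on the description above; the `/tokenize` endpoint and its payload follow the llama.cpp server API that llamafile wraps):

```python
import requests

class LlamafileTokenizer:
    """Encodes text via a running llamafile server's /tokenize endpoint."""

    def __init__(self, api_base: str = "http://localhost:8080"):
        self._api_base = api_base.rstrip("/")

    def encode(self, text: str) -> list[int]:
        # llama.cpp-based servers accept POST /tokenize with {"content": <text>}
        # and return {"tokens": [<int>, ...]}.
        response = requests.post(f"{self._api_base}/tokenize", json={"content": text})
        response.raise_for_status()
        return response.json()["tokens"]
```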
…tes on the integration; add helper scripts for downloading/running a llamafile + example env file.
…gs for reproducibility
…ange serve.sh to use the model's full context size (this does not seem to cause OOM errors, surprisingly).
❌ Deploy Preview for auto-gpt-docs canceled.
This pull request has conflicts with the base branch; please resolve them so we can evaluate the pull request.
@CodiumAI-Agent /review
PR Review
Code feedback:
@k8si any chance you could enable maintainer write access on this PR?
@Pwuts it doesn't look like I have the ability to do that. I added you as a maintainer to the forked project; is that sufficient, or do others need write access? Alternatively, you could branch off my branch and I can just accept the changes via PR?
Conflicts have been resolved! 🎉 A maintainer will review the pull request shortly.
…vider`, `GroqProvider` and `LlamafileProvider` and rebase the latter three on `BaseOpenAIProvider`
Conflicts have been resolved! 🎉 A maintainer will review the pull request shortly.
Codecov Report. Attention: Patch coverage is …
Additional details and impacted files:

    @@            Coverage Diff             @@
    ##           master    #7091      +/-   ##
    ==========================================
    - Coverage   25.50%   24.73%   -0.77%
    ==========================================
      Files          80       82       +2
      Lines        4661     4806     +145
      Branches      631      659      +28
    ==========================================
      Hits         1189     1189
    - Misses       3402     3547     +145
      Partials       70       70
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
I am currently experiencing these issues: Mozilla-Ocho/llamafile#356, Mozilla-Ocho/llamafile#100. May need to amend … Update: this isn't scriptable and isn't our problem. I'll amend the docs with a note that llamafiles can't be run from WSL, but can still be used by running them on Windows and then connecting to them from WSL.
Force-pushed from fb6a4ab to 9ee1e8f
    ## LLAMAFILE_API_BASE - Llamafile API base URL
    # LLAMAFILE_API_BASE=http://localhost:8080/v1
Env var not added to options.md
"--port", type=int, help="Specify the port for the llamafile server to listen on" | ||
) | ||
@click.option( | ||
"--use-gpu", is_flag=True, help="Use an AMD or Nvidia GPU to speed up inference" |
nit: it works on Apple Silicon (ARM64) as well
"--ctx-size", | ||
"0", |
I think the context size should be parametrizable; it has an impact on performance, so it's important to have a way of limiting it.
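For illustration, a parametrized context size could look something like this in the serve script (the flag name mirrors the snippet above; the 0-means-full-context default is an assumption):

```python
import click

@click.command()
@click.option(
    "--ctx-size",
    type=int,
    default=0,  # assumption: 0 lets the server use the model's full context size
    help="Context window size in tokens; smaller values limit memory use",
)
def serve(ctx_size: int) -> None:
    # The value would be forwarded to the llamafile server's --ctx-size flag.
    click.echo(f"launching llamafile with --ctx-size {ctx_size}")

if __name__ == "__main__":
    serve()
```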
The first time this is run, it will download a file containing the model + runtime, which may take a while and a few gigabytes of disk space.

To force GPU acceleration, add `--use-gpu` to the command.
This sounds like it'll attempt to use the GPU but fall back to the CPU if that's not possible.
            # 5 tokens for [INST], [/INST], which actually get
            # tokenized into "[, INST, ]" and "[, /, INST, ]"
            # by the mistral tokenizer
            prompt_added += 5
That's 7? 🤔
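One way to check the actual counts (a sketch using the Hugging Face tokenizer; requires the `transformers` package and access to the Mistral repo on Hugging Face):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# Per the comment above, '[INST]' splits into 3 pieces and '[/INST]' into 4,
# which would make the overhead 7 rather than 5.
print(tok.tokenize("[INST]"))   # expected: 3 tokens, e.g. ['▁[', 'INST', ']']
print(tok.tokenize("[/INST]"))  # expected: 4 tokens, e.g. ['▁[', '/', 'INST', ']']
```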
this "works" but the model isn't great. Can we constrain the output schema to our models like is an option in the llamafile UI?
    if not click.prompt(
        click.style(
            "You seem to have specified a different URL for the default model "
            f"({llamafile.name}). Are you sure this is correct? "
f"({llamafile.name}). Are you sure this is correct? " | |
f"({llamafile}). Are you sure this is correct? " |
llamafile.name doesn't exist
Passing only `--llamafile Mixtral-8x22B-Instruct-v0.1-llamafile` causes a weird prompt that can't be escaped and needs a reply like "yes" to continue, before crashing on the attempt to check `llamafile.is_file()`.
Doesn't support models larger than 50GB: https://huggingface.co/Mozilla/Mixtral-8x22B-Instruct-v0.1-llamafile#about-quantization-formats-general-advice
> Passing only `--llamafile Mixtral-8x22B-Instruct-v0.1-llamafile` causes a weird prompt that can't be escaped and needs a reply like "yes" to continue, before crashing on the attempt to check `llamafile.is_file()`.

That's how I intended it; why don't you pass something with a `.llamafile` extension instead of `-llamafile`?
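A hypothetical way to fail fast instead of showing the confusing prompt (`validate_llamafile` is an invented helper, not part of the PR):

```python
from pathlib import Path

import click

def validate_llamafile(ctx, param, value):
    path = Path(value)
    # Require the conventional '.llamafile' extension so a typo like
    # 'Mixtral-...-llamafile' (hyphen instead of dot) is caught early.
    if path.suffix != ".llamafile":
        raise click.BadParameter(f"expected a .llamafile file, got {value!r}")
    return path

# usage sketch: @click.option("--llamafile", callback=validate_llamafile)
```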
    on_windows = platform.system() == "Windows"

    if not llamafile.is_file():
Running with `--use-gpu --llamafile rocket-3b.Q5_K_M.llamafile --llamafile_url https://huggingface.co/Mozilla/rocket-3B-llamafile/resolve/main/rocket-3b.Q5_K_M.llamafile` will crash here.
    if model_name == LlamafileModelName.MISTRAL_7B_INSTRUCT:
        # For mistral-instruct, num added tokens depends on if the message
        # is a prompt/instruction or an assistant-generated message.
        # - prompt gets [INST], [/INST] added and the first instruction
        #   begins with '<s>' ('beginning-of-sentence' token).
        # - assistant-generated messages get '</s>' added
        # see: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
        prompt_added = 1  # one for '<s>' token
        assistant_num_added = 0
        ntokens = 0
        for message in messages:
            if (
                message.role == ChatMessage.Role.USER
                # note that 'system' messages will get converted
                # to 'user' messages before being sent to the model
                or message.role == ChatMessage.Role.SYSTEM
            ):
                # 5 tokens for [INST], [/INST], which actually get
                # tokenized into "[, INST, ]" and "[, /, INST, ]"
                # by the mistral tokenizer
                prompt_added += 5
Sucks that this isn't consistent across models, and that there's no global tokenizer.
I think we could implement something like that by allowing a model to be passed as the …
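A model-agnostic alternative (a sketch, not from this PR) would be to measure template overhead by tokenizing the fully rendered prompt via the server's `/tokenize` endpoint instead of hard-coding per-model constants:

```python
import requests

def count_prompt_tokens(rendered_prompt: str, api_base: str = "http://localhost:8080") -> int:
    # Let the running llamafile server do the counting, so per-model chat
    # template overhead never needs to be hard-coded client-side.
    resp = requests.post(f"{api_base}/tokenize", json={"content": rendered_prompt})
    resp.raise_for_status()
    return len(resp.json()["tokens"])
```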
Background
This draft PR is a step toward enabling the use of local models in AutoGPT by adding llamafile as an LLM provider.
Implementation notes are included in `forge/forge/llm/providers/llamafile/README.md`.
Related issues:
Depends on:

- `BaseOpenAIProvider` -> deduplicate `GroqProvider` & `OpenAIProvider` #7178

Changes 🏗️
- Add a minimal implementation of `LlamafileProvider`, a new `ChatModelProvider` for llamafiles. It extends `BaseOpenAIProvider` and only overrides methods that are necessary to get the system to work at a basic level.
- Add support for `mistral-7b-instruct-v0.2`. This is the only model currently supported by `LlamafileProvider`, because it is the only model I tested anything with.
- Misc changes to app configuration to enable switching between openai/llamafile providers. In particular, added a config field `LLM_PROVIDER` that, when set to 'llamafile', will use `LlamafileProvider` in agents rather than `OpenAIProvider` (see the sketch after this list).
- Add instructions for using AutoGPT with llamafile to the docs at `autogpt/setup/index.md`.
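A sketch of the kind of switch the `LLM_PROVIDER` field enables (the factory function, env-var lookup, and import paths are illustrative assumptions; the real wiring lives in the app's configurator):

```python
import os

def get_llm_provider():
    # Hypothetical factory illustrating the LLM_PROVIDER switch.
    if os.getenv("LLM_PROVIDER", "openai") == "llamafile":
        from forge.llm.providers.llamafile import LlamafileProvider  # assumed path
        return LlamafileProvider()
    from forge.llm.providers.openai import OpenAIProvider  # assumed path
    return OpenAIProvider()
```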
Limitations:
PR Quality Scorecard ✨

- `agbenchmark` to verify that these changes do not regress performance? +10 pts