feat(forge/llm): Add LlamafileProvider #7091

Open. Wants to merge 43 commits into base: master.
Changes from 39 of 43 commits.

Commits:
03d8e1e
Add minimal implementation of LlamafileProvider, a new ChatModelProvi…
k8si Apr 18, 2024
ed1dfd0
Adapt model prompt message roles to be compatible with the Mistral-7b…
k8si Apr 18, 2024
c56c290
In `OpenAIProvider`, change methods `count_message_tokens`, `count_to…
k8si Apr 18, 2024
234d059
misc cleanup
k8si Apr 18, 2024
05d2b81
add README for llamafile integration including setup instruction + no…
k8si Apr 18, 2024
1cd3e8b
simplify mistral message handling; set seed=0 in chat completion kwar…
k8si Apr 19, 2024
dc36c69
set mistral max_tokens to actual value configured in the model and ch…
k8si Apr 19, 2024
e426766
Merge branch 'master' into draft-llamafile-support
k8si Apr 19, 2024
d63aa23
Merge branch 'master' into draft-llamafile-support
Pwuts May 24, 2024
7e7037d
remove llamafile stuff from openai.py
Pwuts May 25, 2024
3c1f283
Merge branch 'master' into draft-llamafile-support
Pwuts May 30, 2024
5d0f8b0
fix linting errors
Pwuts May 31, 2024
960155a
Create `BaseOpenAIProvider` with common functionality from `OpenAIPro…
Pwuts May 31, 2024
7aed930
Merge branch 'master' into draft-llamafile-support
Pwuts Jun 2, 2024
02d0691
Merge branch 'master' into draft-llamafile-support
Pwuts Jun 3, 2024
f53c2de
move llamafile stuff into folders
Pwuts Jun 3, 2024
f78ad94
clean up llamafile readme
Pwuts Jun 3, 2024
1a00ecf
Improve llamafile model name cleaning logic
Pwuts Jun 3, 2024
3c8bf3c
expand setup instructions and info for llamafile
Pwuts Jun 3, 2024
65433ba
combine llamafile setup.sh and serve.sh into single cross-platform se…
Pwuts Jun 3, 2024
bc372cb
Merge branch 'master' into draft-llamafile-support
Pwuts Jun 14, 2024
e1bcb03
fix llamafile/serve.py for Windows
Pwuts Jun 14, 2024
df3278f
address review comment on clean_model_name in llamafile.py
Pwuts Jun 14, 2024
6858b22
add --llamafile and --llamafile_url options to llamafile/serve.py
Pwuts Jun 14, 2024
d73a98c
tweaks to llamafile/serve.py
Pwuts Jun 14, 2024
4d64b45
address comment by Nick
Pwuts Jun 14, 2024
e5c5163
fix llamafile/serve.py execution path error
Pwuts Jun 14, 2024
3cd7b0e
Merge branch 'master' into draft-llamafile-support
Pwuts Jun 20, 2024
0e081f4
improve debug logging messages in `LlamafileProvider.get_available_mo…
Pwuts Jun 20, 2024
529314e
small refactor for readability/simplicity in llamafile/serve.py
Pwuts Jun 21, 2024
072e674
amend docs regarding WSL
Pwuts Jun 21, 2024
01372d1
add --use-gpu option to llamafile/serve.py
Pwuts Jun 21, 2024
63fe5b5
set llamafile host to 0.0.0.0
Pwuts Jun 21, 2024
271e59b
debug llamafile init
Pwuts Jun 21, 2024
aecc363
add --host and --port options to llamafile/serve.py
Pwuts Jun 21, 2024
9ee1e8f
add instructions to run llamafiles with WSL
Pwuts Jun 21, 2024
242753e
add note about `--use-gpu` to the docs
Pwuts Jun 21, 2024
e8905d1
Convert messages with content blocks to plain text messages
Pwuts Jun 24, 2024
3621be0
add LLAMAFILE_API_BASE to .env.template
Pwuts Jun 24, 2024
f33c2d2
resolve TODO regarding `seed` parameter
Pwuts Jun 24, 2024
75e0301
minor refactor
Pwuts Jun 24, 2024
deb7d11
add reference to llamafile documentation
Pwuts Jun 24, 2024
74923f1
fix type errors
Pwuts Jun 24, 2024
3 changes: 3 additions & 0 deletions autogpt/.env.template
@@ -11,6 +11,9 @@
## GROQ_API_KEY - Groq API Key (Example: gsk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
# GROQ_API_KEY=

## LLAMAFILE_API_BASE - Llamafile API base URL
# LLAMAFILE_API_BASE=http://localhost:8080/v1
Comment on lines +14 to +15 (Contributor):
Env var not added to options.md


## TELEMETRY_OPT_IN - Share telemetry on errors and other issues with the AutoGPT team, e.g. through Sentry.
## This helps us to spot and solve problems earlier & faster. (Default: DISABLED)
# TELEMETRY_OPT_IN=true
3 changes: 3 additions & 0 deletions autogpt/scripts/llamafile/.gitignore
@@ -0,0 +1,3 @@
*.llamafile
*.llamafile.exe
llamafile.exe
160 changes: 160 additions & 0 deletions autogpt/scripts/llamafile/serve.py
@@ -0,0 +1,160 @@
#!/usr/bin/env python3
"""
Use llamafile to serve a (quantized) mistral-7b-instruct-v0.2 model
Usage:
cd <repo-root>/autogpt
./scripts/llamafile/serve.py
"""

import os
import platform
import subprocess
from pathlib import Path
from typing import Optional

import click

LLAMAFILE = Path("mistral-7b-instruct-v0.2.Q5_K_M.llamafile")
LLAMAFILE_URL = f"https://huggingface.co/jartine/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/{LLAMAFILE.name}" # noqa
LLAMAFILE_EXE = Path("llamafile.exe")
LLAMAFILE_EXE_URL = "https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.6/llamafile-0.8.6" # noqa


@click.command()
@click.option(
"--llamafile",
type=click.Path(dir_okay=False),
help=f"Name of the llamafile to serve. Default: {LLAMAFILE.name}",
)
@click.option("--llamafile_url", help="Download URL for the llamafile you want to use")
@click.option(
"--host", help="Specify the address for the llamafile server to listen on"
)
@click.option(
"--port", type=int, help="Specify the port for the llamafile server to listen on"
)
@click.option(
"--use-gpu", is_flag=True, help="Use an AMD or Nvidia GPU to speed up inference"
)
def main(
llamafile: Optional[Path] = None,
llamafile_url: Optional[str] = None,
host: Optional[str] = None,
port: Optional[int] = None,
use_gpu: bool = False,
):
if not llamafile:
if not llamafile_url:
llamafile = LLAMAFILE
else:
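# derive the local filename from the last path segment of the download URL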
llamafile = Path(llamafile_url.rsplit("/", 1)[1])
if llamafile.suffix != ".llamafile":
click.echo(
click.style(
"The given URL does not end with '.llamafile' -> "
"can't get filename from URL. "
"Specify the filename using --llamafile.",
fg="red",
),
err=True,
)
return

if llamafile == LLAMAFILE and not llamafile_url:
llamafile_url = LLAMAFILE_URL
elif llamafile_url != LLAMAFILE_URL:
if not click.prompt(
click.style(
"You seem to have specified a different URL for the default model "
f"({llamafile.name}). Are you sure this is correct? "
Comment (Member), suggested change:
-    f"({llamafile.name}). Are you sure this is correct? "
+    f"({llamafile}). Are you sure this is correct? "
llamafile.name doesn't exist

Comment (Member):
Passing only `--llamafile Mixtral-8x22B-Instruct-v0.1-llamafile` causes a weird prompt input that can't be escaped and needs a reply like "yes" to continue, before crashing on the attempt to check `llamafile.is_file()`.

Reply (Member):
That's how I intended it; why don't you pass something with a `.llamafile` extension instead of `-llamafile`?

"If you want to use a different model, also specify --llamafile.",
fg="yellow",
),
type=bool,
):
return

# Go to autogpt/scripts/llamafile/
os.chdir(Path(__file__).resolve().parent)

on_windows = platform.system() == "Windows"

if not llamafile.is_file():
Comment (Member):
Running with `--use-gpu --llamafile rocket-3b.Q5_K_M.llamafile --llamafile_url https://huggingface.co/Mozilla/rocket-3B-llamafile/resolve/main/rocket-3b.Q5_K_M.llamafile` will crash here

if not llamafile_url:
click.echo(
click.style(
"Please use --lamafile_url to specify a download URL for "
f"'{llamafile.name}'. "
"This will only be necessary once, so we can download the model.",
fg="red",
),
err=True,
)
return

download_file(llamafile_url, llamafile)

if not on_windows:
llamafile.chmod(0o755)
subprocess.run([llamafile, "--version"], check=True)

if not on_windows:
base_command = [f"./{llamafile}"]
else:
# Windows does not allow executables over 4GB, so we have to download a
# model-less llamafile.exe and run that instead.
if not LLAMAFILE_EXE.is_file():
download_file(LLAMAFILE_EXE_URL, LLAMAFILE_EXE)
LLAMAFILE_EXE.chmod(0o755)
subprocess.run([f".\\{LLAMAFILE_EXE}", "--version"], check=True)

base_command = [f".\\{LLAMAFILE_EXE}", "-m", llamafile]

if host:
base_command.extend(["--host", host])
if port:
base_command.extend(["--port", str(port)])
if use_gpu:
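# -ngl sets the number of model layers to offload to the GPU; 9999 effectively offloads all of them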
base_command.extend(["-ngl", "9999"])

subprocess.run(
[
*base_command,
"--server",
"--nobrowser",
"--ctx-size",
"0",
Comment on lines +126 to +127 (Contributor):
I think context size should be parametrizable; it has impact on performance so it's important to have a way of limiting it.
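One possible shape for this, sketched below as a standalone example; the `--ctx-size` option and the way it would feed into `base_command` are assumptions for illustration, not part of this PR:

```python
# Hypothetical sketch: exposing the context size as a CLI option instead of
# hard-coding "0". Names and defaults are assumptions, not part of this diff.
import click


@click.command()
@click.option(
    "--ctx-size",
    type=int,
    default=0,
    show_default=True,
    help="Prompt context size; 0 derives it from the model's own configuration",
)
def serve(ctx_size: int) -> None:
    # In serve.py, this would replace the hard-coded "--ctx-size", "0" pair
    # that gets appended to base_command below.
    server_args = ["--server", "--nobrowser", "--ctx-size", str(ctx_size)]
    click.echo(" ".join(server_args))


if __name__ == "__main__":
    serve()
```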

"--n-predict",
"1024",
],
check=True,
)

# note: --ctx-size 0 means the prompt context size will be set directly from the
# underlying model configuration. This may cause slow response times or consume
# a lot of memory.


def download_file(url: str, to_file: Path) -> None:
print(f"Downloading {to_file.name}...")
import urllib.request

urllib.request.urlretrieve(url, to_file, reporthook=report_download_progress)
print()


def report_download_progress(chunk_number: int, chunk_size: int, total_size: int):
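# note: urllib passes total_size == -1 when the server does not send a Content-Length header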
if total_size != -1:
downloaded_size = chunk_number * chunk_size
percent = min(1, downloaded_size / total_size)
bar = "#" * int(40 * percent)
print(
f"\rDownloading: [{bar:<40}] {percent:.0%}"
f" - {downloaded_size/1e6:.1f}/{total_size/1e6:.1f} MB",
end="",
)


if __name__ == "__main__":
main()
60 changes: 60 additions & 0 deletions docs/content/AutoGPT/setup/index.md
@@ -190,3 +190,63 @@ If you don't know which to choose, you can safely go with OpenAI*.

[groq/api-keys]: https://console.groq.com/keys
[groq/models]: https://console.groq.com/docs/models


### Llamafile

With llamafile you can run models locally, which means there is no need to set up
billing, and data privacy is guaranteed.

!!! warning
At the moment, llamafile only serves one model at a time. This means you can not
set `SMART_LLM` and `FAST_LLM` to two different llamafile models.

!!! warning
Due to the issues linked below, llamafiles don't work on WSL. To use a llamafile
with AutoGPT in WSL, you will have to run the llamafile in Windows (outside WSL).

<details>
<summary>Instructions</summary>

1. Get the `llamafile/serve.py` script through one of these two ways:
1. Clone the AutoGPT repo somewhere in your Windows environment,
with the script located at `autogpt/scripts/llamafile/serve.py`
2. Download just the [serve.py] script somewhere in your Windows environment
2. Make sure you have `click` installed: `pip install click`
3. Run `ip route | grep default | awk '{print $3}'` *inside WSL* to get the address
of the WSL host machine
4. Run `python3 serve.py --host {WSL_HOST_ADDR}`, where `{WSL_HOST_ADDR}`
is the address you found at step 3.
If port 8080 is taken, also specify a different port using `--port {PORT}`.
5. In WSL, set `LLAMAFILE_API_BASE=http://{WSL_HOST_ADDR}:8080/v1` in your `.env`.
6. Follow the rest of the regular instructions below.

[serve.py]: https://github.com/Significant-Gravitas/AutoGPT/blob/master/autogpt/scripts/llamafile/serve.py
</details>

* [Mozilla-Ocho/llamafile#356](https://github.com/Mozilla-Ocho/llamafile/issues/356)
* [Mozilla-Ocho/llamafile#100](https://github.com/Mozilla-Ocho/llamafile/issues/100)

!!! note
These instructions will download and use `mistral-7b-instruct-v0.2.Q5_K_M.llamafile`.
`mistral-7b-instruct-v0.2` is currently the only tested and supported model.
If you want to try other models, you'll have to add them to `LlamafileModelName` in
[`llamafile.py`][forge/llamafile.py].
For optimal results, you may also have to add some logic to adapt the message format,
like `LlamafileProvider._adapt_chat_messages_for_mistral_instruct(..)` does.
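    For illustration only, here is a minimal sketch of what adding a model could look like;
    the enum shape and the example model name are assumptions, so check
    [`llamafile.py`][forge/llamafile.py] for the actual definitions:

```python
# Hypothetical sketch -- the class layout and model name below are assumptions,
# not the actual contents of llamafile.py.
import enum


class LlamafileModelName(str, enum.Enum):
    MISTRAL_7B_INSTRUCT = "mistral-7b-instruct-v0.2"  # existing, tested model
    # 1. add the new model under the name the llamafile server reports for it:
    MIXTRAL_8X7B_INSTRUCT = "mixtral-8x7b-instruct-v0.1"


# 2. register the new model (context window, capabilities, ...) wherever
#    LLAMAFILE_CHAT_MODELS is built, and
# 3. if the model expects a special prompt format, add an adaptation step
#    similar to LlamafileProvider._adapt_chat_messages_for_mistral_instruct(..).
```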

1. Run the llamafile serve script:
```shell
python3 ./scripts/llamafile/serve.py
```
    The first time this is run, it will download a file containing the model + runtime,
    which may take a while and use a few gigabytes of disk space.

To force GPU acceleration, add `--use-gpu` to the command.
Comment (Contributor):
This sounds like it'll attempt to use GPU but use CPU if not possible


2. In `.env`, set `SMART_LLM`, `FAST_LLM`, or both to `mistral-7b-instruct-v0.2`

3. If the server is running on a different address than `http://localhost:8080/v1`,
   set `LLAMAFILE_API_BASE` in `.env` to the right base URL

[forge/llamafile.py]: https://github.com/Significant-Gravitas/AutoGPT/blob/master/forge/forge/llm/providers/llamafile/llamafile.py
36 changes: 36 additions & 0 deletions forge/forge/llm/providers/llamafile/README.md
@@ -0,0 +1,36 @@
# Llamafile Integration Notes

Tested with:
* Python 3.11
* Apple M2 Pro (32 GB), macOS 14.2.1
* quantized mistral-7b-instruct-v0.2

## Setup

Download a `mistral-7b-instruct-v0.2` llamafile:
```shell
wget -nc https://huggingface.co/jartine/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q5_K_M.llamafile
chmod +x mistral-7b-instruct-v0.2.Q5_K_M.llamafile
./mistral-7b-instruct-v0.2.Q5_K_M.llamafile --version
```

Run the llamafile server:
```shell
LLAMAFILE="./mistral-7b-instruct-v0.2.Q5_K_M.llamafile"

"${LLAMAFILE}" \
--server \
--nobrowser \
--ctx-size 0 \
--n-predict 1024

# note: ctx-size=0 means the prompt context size will be set directly from the
# underlying model configuration. This may cause slow response times or consume
# a lot of memory.
```

## TODOs

* `SMART_LLM`/`FAST_LLM` configuration: Currently, the llamafile server only serves one model at a time. However, there's no reason you can't start multiple llamafile servers on different ports. To support using different models for `smart_llm` and `fast_llm`, you could implement config vars like `LLAMAFILE_SMART_LLM_URL` and `LLAMAFILE_FAST_LLM_URL` that point to different llamafile servers (one serving a 'big model' and one serving a 'fast model').
* Authorization: the `serve.py` script does not set up any authorization for the llamafile server; this can be enabled by adding the `--api-key <some-key>` argument to the server startup command. However, I haven't tested whether the integration with AutoGPT works when this feature is turned on.
* Test with other models
17 changes: 17 additions & 0 deletions forge/forge/llm/providers/llamafile/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
from .llamafile import (

LLAMAFILE_CHAT_MODELS,
LLAMAFILE_EMBEDDING_MODELS,
LlamafileCredentials,
LlamafileModelName,
LlamafileProvider,
LlamafileSettings,
)

__all__ = [

"LLAMAFILE_CHAT_MODELS",
"LLAMAFILE_EMBEDDING_MODELS",
"LlamafileCredentials",
"LlamafileModelName",
"LlamafileProvider",
"LlamafileSettings",
]