<div align="center">
<h1>🦾 OpenLLM: Self-Hosting LLMs Made Easy</h1>
[License](https://github.com/bentoml/OpenLLM/blob/main/LICENSE)
[PyPI](https://pypi.org/project/openllm)
[pre-commit.ci status](https://results.pre-commit.ci/latest/github/bentoml/OpenLLM/main)
[Twitter](https://twitter.com/bentomlai)
[Community Slack](https://l.bentoml.com/join-slack)
</div>
OpenLLM allows developers to run **any open-source LLMs** (Llama 3.3, Qwen2.5, Phi3 and [more](#supported-models)) or **custom models** as **OpenAI-compatible APIs** with a single command. It features a [built-in chat UI](#chat-ui), state-of-the-art inference backends, and a simplified workflow for creating enterprise-grade cloud deployments with Docker, Kubernetes, and [BentoCloud](#deploy-to-bentocloud).
Understand the [design philosophy of OpenLLM](https://www.bentoml.com/blog/from-ollama-to-openllm-running-llms-in-the-cloud).
## Get Started
Run the following commands to install OpenLLM and explore it interactively.
```bash
pip install openllm # or pip3 install openllm
openllm hello
```
## Supported models
OpenLLM supports a wide range of state-of-the-art open-source LLMs. You can also add a [model repository to run custom models](#set-up-a-custom-repository) with OpenLLM.
<table>
<tr>
<th>Model</th>
<th>Parameters</th>
<th>Required GPU</th>
<th>Start a Server</th>
</tr>
{%- for key, value in model_dict|items %}
<tr>
<td>{{key}}</td>
<td>{{value['version']}}</td>
<td>{{value['pretty_gpu']}}</td>
<td><code>{{value['command']}}</code></td>
</tr>
{%- endfor %}
</table>
For the full model list, see the [OpenLLM models repository](https://github.com/bentoml/openllm-models).
## Start an LLM server
To start an LLM server locally, use the `openllm serve` command and specify the model version.
> [!NOTE]
> OpenLLM does not store model weights. A Hugging Face token (HF_TOKEN) is required for gated models.
>
> 1. Create your Hugging Face token [here](https://huggingface.co/settings/tokens).
> 2. Request access to the gated model, such as [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).
> 3. Set your token as an environment variable by running:
> ```bash
> export HF_TOKEN=<your token>
> ```
```bash
{{model_dict.get("llama3.2")["command"]}}
```
The server will be accessible at [http://localhost:3000](http://localhost:3000/), providing OpenAI-compatible APIs for interaction. You can call the endpoints with different frameworks and tools that support OpenAI-compatible APIs. Typically, you may need to specify the following:
- **The API host address**: By default, the LLM is hosted at [http://localhost:3000](http://localhost:3000/).
- **The model name**: The name may differ depending on the tool you use.
- **The API key**: The API key used for client authentication. This is optional.
Here are some examples:
<details>
<summary>OpenAI Python client</summary>
```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# Use the following to list the available models
# model_list = client.models.list()
# print(model_list)

chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old"
        }
    ],
    stream=True,
)
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
```
</details>
<details>
<summary>LlamaIndex</summary>
```python
from llama_index.llms.openai import OpenAI
llm = OpenAI(api_base="http://localhost:3000/v1", model="meta-llama/Llama-3.2-1B-Instruct", api_key="dummy")
...
```
</details>
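<details>
<summary>Plain HTTP request (Python)</summary>

If you prefer not to use an SDK, you can call the server's OpenAI-compatible REST API directly. The snippet below is a minimal sketch using the third-party `requests` library; it assumes the default local address, the standard OpenAI `/v1/chat/completions` route, and the model name from the example above, so adjust these to match your deployment.

```python
import requests

# Assumes the default local address and the standard OpenAI-compatible route.
response = requests.post(
    "http://localhost:3000/v1/chat/completions",
    headers={"Authorization": "Bearer na"},  # the API key is optional unless you configure one
    json={
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain superconductors like I'm five years old"}
        ],
        "stream": False,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

</details>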
## Chat UI
OpenLLM provides a chat UI for the launched LLM server at the `/chat` endpoint (http://localhost:3000/chat).
<img width="800" alt="openllm_ui" src="https://github.com/bentoml/OpenLLM/assets/5886138/8b426b2b-67da-4545-8b09-2dc96ff8a707">
## Chat with a model in the CLI
To start a chat conversation in the CLI, use the `openllm run` command and specify the model version.
```bash
openllm run llama3:8b
```
## Model repository
A model repository in OpenLLM represents a catalog of available LLMs that you can run. OpenLLM provides a default model repository that includes the latest open-source LLMs like Llama 3, Mistral, and Qwen2, hosted at [this GitHub repository](https://github.com/bentoml/openllm-models). To see all available models from the default and any added repository, use:
```bash
openllm model list
```
To ensure your local list of models is synchronized with the latest updates from all connected repositories, run:
```bash
openllm repo update
```
To review a model’s information, run:
```bash
openllm model get {{model_dict.get("llama3.2")["tag"]}}
```
### Add a model to the default model repository
You can contribute to the default model repository by adding new models that others can use. This involves creating and submitting a Bento of the LLM. For more information, check out this [example pull request](https://github.com/bentoml/openllm-models/pull/1).
### Set up a custom repository
You can add your own repository to OpenLLM with custom models. To do so, follow the format in the default OpenLLM model repository with a `bentos` directory to store custom LLMs. You need to [build your Bentos with BentoML](https://docs.bentoml.com/en/latest/guides/build-options.html) and submit them to your model repository.
First, prepare your custom models in a `bentos` directory following the guidelines provided by [BentoML to build Bentos](https://docs.bentoml.com/en/latest/guides/build-options.html). Check out the [default model repository](https://github.com/bentoml/openllm-repo) for an example and read the [Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md) for details.
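As a rough illustration only (not the exact interface the default repository uses), a Bento is built from a regular BentoML service. A minimal sketch, assuming BentoML 1.2+ and a hypothetical `MyCustomLLM` service with a `generate` API, might look like this:

```python
import bentoml

# Illustrative sketch only: the services in the default model repository expose
# OpenAI-compatible endpoints and are more involved than this example.
@bentoml.service(resources={"gpu": 1})
class MyCustomLLM:
    @bentoml.api
    def generate(self, prompt: str) -> str:
        # Replace with your model's actual loading and inference logic.
        return f"echo: {prompt}"
```

Package the service into a Bento by following the BentoML build options guide linked above, then place the result under the `bentos` directory of your repository.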
Then, register your custom model repository with OpenLLM:
```bash
openllm repo add <repo-name> <repo-url>
```
**Note**: Currently, OpenLLM only supports adding public repositories.
## Deploy to BentoCloud
OpenLLM supports LLM cloud deployment via BentoML, the unified model serving framework, and BentoCloud, an AI inference platform for enterprise AI teams. BentoCloud provides fully managed infrastructure optimized for LLM inference with autoscaling, model orchestration, and observability, allowing you to run any AI model in the cloud.
[Sign up for BentoCloud](https://www.bentoml.com/) for free and [log in](https://docs.bentoml.com/en/latest/bentocloud/how-tos/manage-access-token.html). Then, run `openllm deploy` to deploy a model to BentoCloud:
```bash
openllm deploy {{model_dict.get("llama3.2")["tag"]}}
```
> [!NOTE]
> If you are deploying a gated model, make sure to set HF_TOKEN as an environment variable.
Once the deployment is complete, you can run model inference on the BentoCloud console:
<img width="800" alt="bentocloud_ui" src="https://github.com/bentoml/OpenLLM/assets/65327072/4f7819d9-73ea-488a-a66c-f724e5d063e6">
## Community
OpenLLM is actively maintained by the BentoML team. Feel free to reach out and join us in our pursuit to make LLMs more accessible and easy to use 👉 [Join our Slack community!](https://l.bentoml.com/join-slack)
## Contributing
As an open-source project, we welcome contributions of all kinds, such as new features, bug fixes, and documentation. Here are some of the ways to contribute:
- Report a bug by [creating a GitHub issue](https://github.com/bentoml/OpenLLM/issues/new/choose).
- [Submit a pull request](https://github.com/bentoml/OpenLLM/compare) or help review other developers’ [pull requests](https://github.com/bentoml/OpenLLM/pulls).
- Add an LLM to the OpenLLM default model repository so that other users can run your model. See the [pull request template](https://github.com/bentoml/openllm-models/pull/1).
- Check out the [Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md) to learn more.
## Acknowledgements
This project uses the following open-source projects:
- [bentoml/bentoml](https://github.com/bentoml/bentoml) for production-level model serving
- [vllm-project/vllm](https://github.com/vllm-project/vllm) for the production-level LLM inference backend
- [blrchen/chatgpt-lite](https://github.com/blrchen/chatgpt-lite) for a fancy web chat UI
- [astral-sh/uv](https://github.com/astral-sh/uv) for blazing-fast installation of model requirements
We are grateful to the developers and contributors of these projects for their hard work and dedication.