Run local LLMs on iGPU, APU and CPU (AMD, Intel, and Qualcomm (Coming Soon)). The easiest way to launch an OpenAI API compatible server on Windows, Linux and MacOS.
| Support matrix | Supported now | Under Development | On the roadmap |
|---|---|---|---|
| Model architectures | Gemma<br/>Llama *<br/>Mistral +<br/>Phi | | |
| Platform | Linux<br/>Windows | | |
| Architecture | x86<br/>x64 | Arm64 | |
| Hardware Acceleration | CUDA<br/>DirectML | QNN<br/>ROCm | OpenVINO |
* The Llama model architecture supports similar model families such as CodeLlama, Vicuna, Yi, and more.
+ The Mistral model architecture supports similar model families such as Zephyr.
- [2024/06] Support Phi-3 (mini, small, medium), Phi-3-Vision-Mini, Llama-2, Llama-3, Gemma (v1), Mistral v0.3, Starling-LM, Yi-1.5.
- [2024/06] Support vision/chat inference on iGPU, APU, CPU and CUDA.
| Models | Parameters | Context Length | Link |
|---|---|---|---|
| Gemma-2b-Instruct v1 | 2B | 8192 | EmbeddedLLM/gemma-2b-it-onnx |
| Llama-2-7b-chat | 7B | 4096 | EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml |
| Llama-2-13b-chat | 13B | 4096 | EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml |
| Llama-3-8b-chat | 8B | 8192 | EmbeddedLLM/mistral-7b-instruct-v0.3-onnx |
| Mistral-7b-v0.3-instruct | 7B | 32768 | EmbeddedLLM/mistral-7b-instruct-v0.3-onnx |
| Phi-3-mini-4k-instruct-062024 | 3.8B | 4096 | EmbeddedLLM/Phi-3-mini-4k-instruct-062024-onnx |
| Phi3-mini-4k-instruct | 3.8B | 4096 | microsoft/Phi-3-mini-4k-instruct-onnx |
| Phi3-mini-128k-instruct | 3.8B | 131072 | microsoft/Phi-3-mini-128k-instruct-onnx |
| Phi3-medium-4k-instruct | 14B | 4096 | microsoft/Phi-3-medium-4k-instruct-onnx-directml |
| Phi3-medium-128k-instruct | 14B | 131072 | microsoft/Phi-3-medium-128k-instruct-onnx-directml |
| Openchat-3.6-8b | 8B | 8192 | EmbeddedLLM/openchat-3.6-8b-20240522-onnx |
| Yi-1.5-6b-chat | 6B | 32768 | EmbeddedLLM/01-ai_Yi-1.5-6B-Chat-onnx |
| Phi-3-vision-128k-instruct | 4.2B | 131072 | EmbeddedLLM/Phi-3-vision-128k-instruct-onnx |
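The entries in the Link column are Hugging Face model repositories. As a rough sketch (the repo id and target directory below are only examples), a repo can be downloaded locally with `huggingface_hub` before being passed to `ellm_server` via `--model_path`:

```python
# Sketch: fetch one of the ONNX model repos listed above from the Hugging Face Hub.
# The repo id and local_dir are illustrative; pick any repo from the Link column.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="EmbeddedLLM/Phi-3-mini-4k-instruct-062024-onnx",
    local_dir="./models/Phi-3-mini-4k-instruct-062024-onnx",
)
print(f"Model downloaded to: {model_dir}")
```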
Windows
- Install the embeddedllm package: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .`. Note: currently supports `cpu`, `directml` and `cuda`.
  - DirectML: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml]`
  - CPU: `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu]`
  - CUDA: `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda]`
  - With Web UI:
    - DirectML: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml,webui]`
    - CPU: `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu,webui]`
    - CUDA: `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda,webui]`
Linux
- Install the embeddedllm package: `ELLM_TARGET_DEVICE='directml' pip install -e .`. Note: currently supports `cpu`, `directml` and `cuda`.
  - DirectML: `ELLM_TARGET_DEVICE='directml' pip install -e .[directml]`
  - CPU: `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu]`
  - CUDA: `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda]`
  - With Web UI:
    - DirectML: `ELLM_TARGET_DEVICE='directml' pip install -e .[directml,webui]`
    - CPU: `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu,webui]`
    - CUDA: `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda,webui]`
Note
- If you are using a Conda environment, install the additional dependency: `conda install conda-forge::vs2015_runtime`.
```
usage: ellm_server.exe [-h] [--port int] [--host str] [--response_role str] [--uvicorn_log_level str]
                       [--served_model_name str] [--model_path str] [--vision bool]

options:
  -h, --help            show this help message and exit
  --port int            Server port. (default: 6979)
  --host str            Server host. (default: 0.0.0.0)
  --response_role str   Server response role. (default: assistant)
  --uvicorn_log_level str
                        Uvicorn logging level. `debug`, `info`, `trace`, `warning`, `critical` (default: info)
  --served_model_name str
                        Model name. (default: phi3-mini-int4)
  --model_path str      Path to model weights. (required)
  --vision bool         Enable vision capability, only if model supports vision input. (default: False)
```
```
ellm_server --model_path <path/to/model/weight>
```

Example code to connect to the API server can be found in `scripts/python`.
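As a quick illustration (not part of the repository), the snippet below connects to the server started above using the official `openai` Python client. It assumes the defaults from the usage output (port `6979`, served model name `phi3-mini-int4`) and the usual `/v1` route prefix of OpenAI-compatible servers; adjust these to match your launch flags.

```python
# Minimal sketch of an OpenAI-compatible client for ellm_server.
# Assumptions: server running locally with the defaults shown above
# (port 6979, served model name phi3-mini-int4) and a /v1 route prefix.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:6979/v1",  # ellm_server default port
    api_key="EMPTY",                      # local server; the key is not validated
)

response = client.chat.completions.create(
    model="phi3-mini-int4",  # must match --served_model_name
    messages=[{"role": "user", "content": "What is ONNX Runtime in one sentence?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```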
Launch the Chatbot Web UI:

```
ellm_chatbot --port 7788 --host localhost --server_port <ellm_server_port> --server_host localhost
```
The Model UI (`ellm_modelui`) is an interface that allows you to download and deploy an OpenAI API compatible server. The disk space required to download each model is shown in the UI.

```
ellm_modelui --port 6678
```
- Install `embeddedllm`.
- Install PyInstaller: `pip install pyinstaller`.
- Compile the Windows executable: `pyinstaller .\ellm_api_server.spec`.
- You can find the executable in the `dist\ellm_api_server` directory.
- Excellent open-source projects: vLLM, onnxruntime-genai and many others.