Note
Need help? Join the Discord Server and get the Tabby
role. Please be nice when asking questions.
Welcome to YALS, also known as Yet Another Llamacpp Server.
YALS is a friendly OAI-compatible API server built with Deno, Hono, and Zod, designed to facilitate LLM text generation via the llama.cpp backend.
This project is in an alpha state. There may be bugs, possibly even ones that could cause thermonuclear war. Please note that commits happen frequently, and builds are distributed via CI.
YALS is a hobby project made for a small amount of users. It is not meant to run on production servers. For that, please look at other solutions that support those workloads.
The AI space is full of backend projects that wrap llama.cpp, but I felt that something was missing. This led me to create my own backend, one which is extensible, speedy, and as elegant as TabbyAPI, but specifically for llama.cpp and GGUF.
Here are the reasons why I decided to create a separate project instead of integrating llama.cpp support into TabbyAPI:
- Separation of concerns: I want TabbyAPI to stay focused on ExLlama, not become a monolithic backend.
- Distribution patterns: Unlike TabbyAPI, llama.cpp backends are often distributed as binaries. Deno’s compile command is vastly superior to PyInstaller, making binary distribution easier.
- Dependency hell: Python’s dependency system is a mess. Adding another layer of abstractions would confuse users further.
- New technologies: Since C++ (through C bindings) is universally accessible over an FFI interface, I wanted to try something new instead of struggling with Python. The main reason for using Deno is that it augments an easy-to-learn language (TypeScript) with built-in tooling and a robust FFI system (sketched below).
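For a taste of what that FFI surface looks like, here is a minimal sketch using `Deno.dlopen`. The library path and the single symbol are placeholders for illustration; YALS's actual bindings cover far more of llama.cpp's C API.

```ts
// Minimal Deno FFI sketch. The library path is a placeholder; YALS's real
// bindings load the llama.cpp shared library and expose many more symbols.
const lib = Deno.dlopen("./libllama.so", {
  // void llama_backend_init(void) from llama.cpp's C API
  llama_backend_init: { parameters: [], result: "void" },
} as const);

lib.symbols.llama_backend_init();
lib.close();
```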
To get started, download the latest zip from releases that corresponds to your setup.
The currently supported builds via CI are:
- macOS: Metal
- Windows/Linux: CPU
- Windows/Linux: CUDA (built for Turing architectures and newer)
Note
If your specific setup is not available via CI, you can build locally via the building guide, or request a certain architecture in issues.
Then follow these steps:

1. Extract the zip file
2. Copy `config_sample.yml` to a file called `config.yml`
3. Edit `config.yml` to configure model loading, networking, and other parameters
   - All options are commented: if you're unsure about an option, it's best to leave it unchanged.
   - You can also use CLI arguments, similar to TabbyAPI (ex. `--flash-attention true`).
4. Download a `.gguf` model into the `models` directory (or whatever you set your directory to)
   - If the model is split into multiple parts (`00001-of-0000x.gguf`), set `model_name` in `config.yml` to the first part (ending in `00001`). Other parts will load automatically.
5. Start YALS:
   - Windows: Double click `YALS.exe` or run `.\YALS.exe` from the terminal (recommended)
   - macOS/Linux: Open a terminal and run `./YALS`
6. Navigate to `http://<your URL>/docs` (ex. `http://localhost:5000/docs`) to view the YALS Scalar API documentation (a sample request follows below).
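Once the server is running, you can sanity-check it from any HTTP client. Here is a minimal sketch in TypeScript, assuming the default `http://localhost:5000` address from the example above, the conventional OpenAI-style chat completions route, and no API key; consult the Scalar docs at `/docs` for the exact request schema.

```ts
// Hedged example: the route shape follows the OpenAI chat completions
// convention; check http://localhost:5000/docs for YALS's exact schema.
const response = await fetch("http://localhost:5000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [{ role: "user", content: "Hello there!" }],
    max_tokens: 64,
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);
```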
Current features include:

- OpenAI-compatible API
- Loading/unloading models
- Flexible Jinja2 template engine for chat completions that conforms to HuggingFace's chat template format
- String banning
- Concurrent inference with Hono + async TypeScript
- Robust validation with Zod (see the Hono + Zod sketch below)
More features will be added as the project matures. If something is missing here, PR it in!
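To illustrate how the Hono + Zod combination above fits together, here is a hedged sketch of a validated route. The route path and schema are assumptions for demonstration, not YALS's actual code:

```ts
import { Hono } from "jsr:@hono/hono";
import { z } from "npm:zod";

// Hypothetical request schema; YALS's real schemas are more extensive.
const CompletionRequest = z.object({
  prompt: z.string(),
  max_tokens: z.number().int().positive().default(128),
});

const app = new Hono();

app.post("/v1/completions", async (c) => {
  const parsed = CompletionRequest.safeParse(await c.req.json());
  if (!parsed.success) {
    // Zod reports every failed field, which makes for clear API errors
    return c.json({ error: parsed.error.issues }, 422);
  }
  // Hand parsed.data off to the inference backend here
  return c.json({ choices: [{ text: "..." }] });
});

Deno.serve(app.fetch);
```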
Since YALS uses llama.cpp for inference, the only supported model format is GGUF.
If you want to use other model formats, such as Exl2, try TabbyAPI.
Use the template when creating issues or pull requests, otherwise the developers may not look at your post.
If you have issues with the project:
- Describe the issue in detail
- If you have a feature request, please indicate it as such.
If you have a Pull Request:
- Describe the pull request in detail: what you are changing and why
Creators/Developers:
- kingbri - TypeScript, Deno, and some C++
- CoffeeVampire - Main C++ developer
YALS would not exist without the work of other contributors and FOSS projects: