Smart launcher for ik_llama.cpp and llama.cpp. Auto-detects your hardware, figures out the optimal configuration, and launches the server — no manual flag tuning. Then lets the model tune its own flags for up to +54% tokens/sec.
llm-server model.gguf # launch a model
llm-server-gui # interactive TUI picker

Recommended self-contained setup:
curl -fsSL https://raw.githubusercontent.com/raketenkater/llm-server/main/setup.sh | bash

Or from a clone:
git clone https://github.com/raketenkater/llm-server.git
cd llm-server
./setup.sh # auto-detects Linux / WSL2 / macOS

This creates ~/llm-server with:
~/llm-server/bin # llm-server, llm-server-gui, bundled backend when available
~/llm-server/models # GGUF models and downloaded mmproj files
~/llm-server/cache # AI Tune caches
~/llm-server/logs # setup/server logs
~/llm-server/config # local config loaded automatically
~/llm-server/src # backend source/build fallback
Use it with:
source ~/llm-server/env.sh
llm-server-gui
llm-server <repo/name> --download

Legacy one-line install to ~/.local/bin:
curl -fsSL https://raw.githubusercontent.com/raketenkater/llm-server/main/install.sh | bash

Detects your GPU, installs scripts to ~/.local/bin, tries a matching prebuilt release bundle, then falls back to building the right backend from source (ik_llama.cpp for CUDA, llama.cpp for Vulkan/Metal/CPU). CUDA currently uses source builds unless a CUDA bundle was manually attached to the release. See Requirements for what gets installed.
Prefer the manual path? git clone … && cd llm-server && ./install.sh works identically.
Installer controls:
LLM_INSTALL_MODE=release ./install.sh # require a prebuilt bundle
LLM_INSTALL_MODE=build ./install.sh # force source build
LLM_INSTALL_BACKEND=skip ./install.sh # install scripts only
LLM_INSTALL_PY_DEPS=skip ./install.sh # skip downloader Python deps
LLM_INSTALL_PREFIX=/usr/local/bin ./install.sh # install to a custom directory

- Quick start
- Why llm-server?
- Features
- AI self-tuning
- Usage
- How it works
- Requirements
- Roadmap
- Changelog
# Basic — auto-detects everything
llm-server model.gguf
# Let the model tune its own flags (cached forever after)
llm-server model.gguf --ai-tune
# Grab a quant from HuggingFace, recommends a size for your VRAM+RAM
llm-server unsloth/Qwen3.5-27B-GGUF --download
# Interactive TUI picker
llm-server-gui

| | raw llama.cpp | Ollama | LM Studio | llm-server |
|---|---|---|---|---|
| Multi-GPU auto-placement | ❌ manual flags | partial | ❌ | ✅ |
| MoE expert offload (`-ot`/`--n-cpu-moe`) | ❌ manual | ❌ | ❌ | ✅ auto |
| AI self-tuning | ❌ | ❌ | ❌ | ✅ |
| Heterogeneous GPUs (different VRAM/PCIe) | ❌ manual | ❌ | ❌ | ✅ weighted |
| CLI-first / scriptable | ✅ | partial | ❌ | ✅ |
| Uses upstream llama.cpp / ik_llama.cpp | ✅ | fork | fork | ✅ |
Running llama.cpp on multi-GPU means juggling dozens of flags. llm-server figures it all out:
Without llm-server:

llama-server \
-m model.gguf \
-ngl 81 \
--ctx-size 32768 \
--tensor-split 24,12,12 \
--split-mode graph -mg 0 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-fa on \
--threads 8 --threads-batch 16 \
-b 4096 -ub 1024 \
--jinja --run-time-repack \
-khad -defrag-thold 0.1 \
--port 8081

With llm-server:

llm-server model.gguf
- AI self-tuning — the model being served proposes its own flags, benchmarks, and iterates 8 rounds. Crashes get free retries. Winner cached. Up to +54% tok/s in real tests.
- Smart multi-GPU placement — any mix of NVIDIA cards (0-8+), heterogeneous VRAM sizes, PCIe-bandwidth weighted. `--gpus 0,1` restricts to specific GPUs for multi-instance workloads (see the example after this list).
- MoE expert auto-placement — starts conservative, measures actual VRAM, optimizes, caches. Automatic `--n-cpu-moe` fallback if `-ot` placement fails.
- Auto-update + rollback — `--update` pulls and rebuilds both backends; if the new binary breaks, rolls back to the last working commit.
- Auto-fallback — if ik_llama can't load a model, strips ik-specific flags and retries on mainline llama.cpp mid-launch.
- Built-in downloader — `--download` with any HuggingFace repo; recommends the best quant for your VRAM+RAM budget.
- Vision (multimodal) — `--vision` auto-detects and downloads the matching `mmproj` from HuggingFace.
- Terminal GUI — `llm-server-gui` for interactive model picking with option toggles.
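For instance, the downloader, GPU restriction, and self-tuning compose on one command line; the downloaded filename below is illustrative and depends on the quant you pick:

```bash
# Fetch a quant sized for your VRAM+RAM, then serve it on two GPUs and tune once
llm-server unsloth/Qwen3.5-27B-GGUF --download
llm-server Qwen3.5-27B-Q4_K_M.gguf --gpus 0,1 --ai-tune
```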
Most llm-server users leave 20-50% performance on the table because "good enough" defaults aren't optimal for their specific hardware + model combination. --ai-tune fixes this:
llm-server model.gguf --ai-tune

The model being served proposes its own optimal flags, benchmarks each one, learns from the results, and iterates — 8 rounds of self-optimization. Crashes get free retries. The winner is cached for instant use on every future launch.
Round 0/8: Benchmarking baseline (heuristic config)...
Baseline: gen=23.17 tok/s pp=119.87 tok/s
Round 1/8: Config: optimized_split_hadamard_v2
Result: gen=22.05 tok/s (-4.8%)
Round 3/8: Config: optimized_tensor_split_q4
★ NEW BEST: gen=24.56 tok/s (+6.0%)
Round 7/8: Config: ultra_stable_3090ti_bias
★ NEW BEST: gen=24.77 tok/s (+6.9%)
AI Tune complete: ultra_stable_3090ti_bias wins!
Baseline: 23.17 tok/s → Best: 24.77 tok/s (+6.9%)
No external API, no internet required. The model tunes itself using its own intelligence. Results are cached — run --ai-tune once, benefit forever.
| Model | raw llama-server | llm-server heuristic | llm-server --ai-tune |
|---|---|---|---|
| Qwen3.5-122B | 4.1 tok/s | 11.2 tok/s | 17.47 tok/s (+326%) |
| Qwen3.5-27B Q4_K_M | 18.5 tok/s | 25.94 tok/s | 40.05 tok/s (+116%) |
| gemma-4-31B UD-Q4_K_XL | 14.2 tok/s | 23.17 tok/s | 24.77 tok/s (+74%) |
RTX 3090 Ti + RTX 4070 + RTX 3060 (49GB total VRAM) + 128GB RAM. More hardware results in this r/LocalLLaMA post.
# AI tune + re-tune
llm-server model.gguf --ai-tune
llm-server model.gguf --ai-tune --retune
# Vision
llm-server model.gguf --vision
# Show all cached configs / detail for one model
llm-server --show-configs
llm-server --show-configs model.gguf
# Launch with a specific tuned config
llm-server model.gguf --tune-cache ~/.cache/llm-server/tune_model_hw_llama.json
# Force a specific backend
llm-server --backend llama model.gguf
llm-server --backend ik_llama model.gguf
# Unattended / always-on
llm-server model.gguf --keep-alive
# Multi-instance: big model on GPUs 0+1, small on GPU 2
llm-server big-model.gguf --gpus 0,1 --port 8081 --ram-budget 90G
llm-server small-model.gguf --gpus 2 --port 8082 --ram-budget 30G
# Update backends (rolls back on failure)
llm-server --update
# Quick tok/s benchmark then exit
llm-server --benchmark model.gguf

Any flag llm-server doesn't recognize is forwarded to llama-server, so every upstream option works without wrapping.
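For example, standard llama-server options can be appended as-is; the flags below belong to upstream llama-server, not to llm-server itself:

```bash
# Unrecognized flags are forwarded verbatim to the underlying llama-server
llm-server model.gguf --temp 0.6 --top-p 0.9 --host 0.0.0.0
```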
- Launches with heuristic config → benchmarks as baseline
- Sends the running model its hardware profile, GGUF metadata, full `--help`, and baseline
- Model proposes flag overrides as JSON → script applies, launches, benchmarks
- Feeds results back → model proposes next config → repeat for 8 rounds
- Crashes get free retries (up to 4). Large models (>70% RAM) get a confirmation prompt.
- Server processes are marked as OOM-kill targets — a bad config kills the server, never your system.
- Winner cached at `~/.cache/llm-server/` — used automatically on future launches.
- All results persisted to `tune_history.jsonl` — the model learns across sessions (see the example after this list).
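A minimal way to inspect what AI Tune has saved, assuming the history file lives alongside the per-model caches in `~/.cache/llm-server/` (the notes above name the directory and the file, not the file's exact location):

```bash
# Cached winning configs plus the cross-session benchmark history
ls ~/.cache/llm-server/                            # tune_*.json caches, tune_history.jsonl
tail -n 3 ~/.cache/llm-server/tune_history.jsonl   # last few results, one JSON record per line
```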
--download calculates Total = System VRAM + System RAM, inspects the HuggingFace repo, and recommends the quantization level with the best speed/quality balance for your exact hardware.
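To sanity-check that budget yourself, the usual tools report the two inputs (a rough manual check, not llm-server's exact accounting):

```bash
# Total = sum of GPU VRAM + system RAM; --download recommends a quant that fits it
nvidia-smi --query-gpu=memory.total --format=csv,noheader   # VRAM per GPU
free -h | awk '/^Mem:/ {print $2, "system RAM"}'             # total system RAM
```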
Downloaded and locally discovered GGUFs are indexed in <model-dir>/.llm-server/models.json. The index records model path, size, architecture, quant, matching local vision projectors, HuggingFace source metadata, and available AI Tune configs so the GUI can show runnable models and configs without guessing from filenames only.
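To peek at the index, pretty-print it with Python; the path below assumes the default `~/llm-server/models` directory created by setup.sh:

```bash
# Pretty-print the model index that the GUI and launcher read
python3 -m json.tool ~/llm-server/models/.llm-server/models.json
```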
With --vision, llm-server checks for an existing mmproj, verifies it matches the loaded model, and downloads the correct mmproj-F16.gguf from HuggingFace if missing. The correct repo is inferred from GGUF metadata (general.basename + general.quantized_by).
llm-server --update pulls llm-server, ik_llama.cpp, and llama.cpp; backs up each working binary; rebuilds with the existing cmake config; smoke-tests; and rolls back to the previous commit if anything fails. Update fearlessly.
If ik_llama.cpp can't load a model (unsupported architecture), llm-server detects the load failure, switches to mainline llama.cpp, strips ik-specific flags, and retries — no manual intervention.
Linux:
- Prebuilt release bundle when available, otherwise ik_llama.cpp for CUDA or llama.cpp for Vulkan/CPU
- CUDA source builds need NVIDIA drivers, CUDA toolkit/nvcc, `cmake`, `make`, and a compiler toolchain
- `nvidia-smi` (for GPU detection)
- `python3`, `huggingface_hub`, `tqdm`, `curl`
macOS (Apple Silicon):
- Prebuilt Metal release bundle when available, otherwise llama.cpp built with Metal (or `brew install llama.cpp`)
- `python3`, `huggingface_hub`, `tqdm`, `curl`

`llm-server-mac` supports Metal auto-offload, downloader, vision/mmproj auto-detect/download, benchmarks, GUI launching, startup fallbacks, and macOS AI Tune for Metal-safe flags (batch, ubatch, KV type, threads, parallel, flash-attn). Placement tuning is Linux/WSL2-only because Apple Silicon exposes a single unified Metal GPU.
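On Apple Silicon the launch pattern is the same; the example below assumes `llm-server-mac` takes the usual model argument and the `--ai-tune` flag described above:

```bash
# Metal-safe AI Tune on macOS (GPU placement tuning is skipped; Metal is a single unified GPU)
llm-server-mac model.gguf --ai-tune
```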
Windows:
- Install WSL2 (`wsl --install` in PowerShell)
- Inside WSL2, follow the Linux instructions above
- NVIDIA GPU passthrough works automatically in WSL2 with up-to-date drivers
- Native Windows is not supported by llm-server today.
The install.sh one-liner handles release bundle detection, dependency checks, Python downloader dependencies, and backend source builds automatically.
MIT
