A high-performance inference server for large language models with OpenAI-compatible API endpoints. Now available for both Windows and Linux systems!
- 🪟 Windows: Full support with Visual Studio and MSVC
- 🐧 Linux: Native support with GCC/Clang
- 🎮 GPU Acceleration: NVIDIA CUDA and Vulkan support
- 📦 Easy Installation: Direct binary installation or build from source
- 🚀 Fast Inference: Built with llama.cpp for optimized model inference
- 🔗 OpenAI Compatible: Drop-in replacement for OpenAI API endpoints
- 📡 Streaming Support: Real-time streaming responses for chat completions
- 🎛️ Multi-Model Management: Load and manage multiple models simultaneously
- 📊 Real-time Metrics: Monitor completion performance with TPS, TTFT, and success rates
- ⚙️ Lazy Loading: Defer model loading until first request with load_immediately=false
- 🔧 Configurable: Flexible model loading parameters and inference settings
- 🔒 Authentication: API key and rate limiting support
- 🌐 Cross-Platform: Windows and Linux native builds
- 📚 RAG Retrieval: Native FAISS vector store (default) with optional Qdrant backend
System Requirements:
- Ubuntu 20.04+ or equivalent Linux distribution (CentOS 8+, Fedora 32+, Arch Linux)
- GCC 9+ or Clang 10+
- CMake 3.14+
- Git with submodule support
- At least 4GB RAM (8GB+ recommended for larger models)
- CUDA Toolkit 11.0+ (optional, for NVIDIA GPU acceleration)
- Vulkan SDK (optional, for alternative GPU acceleration)
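To sanity-check your toolchain before building, you can query the installed versions (standard commands, not specific to Kolosal Server):
# Verify the build toolchain meets the minimum versions above
gcc --version    # or: clang --version
cmake --version
git --version
nvcc --version   # only relevant if the CUDA Toolkit is installed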
Ubuntu/Debian:
# Update package list
sudo apt update
# Install essential build tools
sudo apt install -y build-essential cmake git pkg-config
# Install required libraries
sudo apt install -y libcurl4-openssl-dev libyaml-cpp-dev
# Optional: Install PoDoFo dependencies for PDF support
sudo apt install -y libfreetype6-dev libjpeg-dev libpng-dev libtiff-dev libxml2-dev libfontconfig1-dev
# Optional: Install FAISS dependencies
sudo apt install libopenblas-dev liblapack-dev
# Optional: Install CUDA for GPU support
# Follow NVIDIA's official installation guide for your distribution
CentOS/RHEL/Fedora:
# For CentOS/RHEL 8+
sudo dnf groupinstall "Development Tools"
sudo dnf install cmake git curl-devel yaml-cpp-devel
# For Fedora
sudo dnf install gcc-c++ cmake git libcurl-devel yaml-cpp-devel
# Optional: Install PoDoFo dependencies for PDF support (Fedora)
sudo dnf install freetype-devel libjpeg-devel libpng-devel libtiff-devel libxml2-devel fontconfig-devel
# Optional: Install FAISS dependencies
sudo dnf install openblas-devel lapack-devel
Arch Linux:
sudo pacman -S base-devel cmake git curl yaml-cpp
# Optional: Install PoDoFo dependencies for PDF support
sudo pacman -S freetype2 libjpeg-turbo libpng libtiff libxml2 fontconfig
# Optional: Install FAISS dependencies
sudo pacman -S openblas lapack
# Note: Package manager installation will be available in future releases
# For now, use the build from source method below
1. Clone the Repository with Submodules:
FAISS is bundled as a submodule in external/faiss. If you clone without --recursive, FAISS will be disabled (the build creates a stub); re-run the submodule command, then reconfigure.
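If you already cloned without --recursive, a minimal recovery sequence looks like this (assuming you configure in a build/ directory as described below):
git submodule update --init --recursive
# then reconfigure from your build directory so FAISS is picked up
cd build && cmake ..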
git clone https://github.com/kolosalai/kolosal-server.git --recursive
cd kolosal-server
2. Create Build Directory:
mkdir build && cd build
3. Configure Build:
Standard Build (CPU-only):
cmake -DCMAKE_BUILD_TYPE=Release ..
With CUDA Support:
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON ..
With Vulkan Support:
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_VULKAN=ON ..
With PoDoFo PDF Support (requires dependencies installed):
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_PODOFO=ON ..
With FAISS Support (requires dependencies installed):
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_FAISS=ON ..
Combined Options:
# CUDA + PoDoFo
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON -DUSE_PODOFO=ON ..
# Vulkan + PoDoFo
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_VULKAN=ON -DUSE_PODOFO=ON ..
Debug Build:
cmake -DCMAKE_BUILD_TYPE=Debug ..
4. Build the Project:
# Use all available CPU cores
make -j
# Or specify number of cores manually
make -j4
5. Verify Build:
# Check if the executable was created
cd Release && ls -la kolosal-server
# Test basic functionality
./kolosal-server --help
6. Install to System Path (Optional):
# Install binary to /usr/local/bin
sudo cp build/Release/kolosal-server /usr/local/bin/
# Make it executable
sudo chmod +x /usr/local/bin/kolosal-server
# Now you can run from anywhere
kolosal-server --help
Start the Server:
# From build/Release directory
./kolosal-server
# Check where the config file is
./kolosal-server --config 
# Or specify a config file
./kolosal-server --config ../config.yaml
Background Service:
# Run in background
nohup ./kolosal-server > server.log 2>&1 &
# Check if running
ps aux | grep kolosal-server
Check Server Status:
# Test if server is responding
curl http://localhost:8080/v1/health
Create Service File:
sudo tee /etc/systemd/system/kolosal-server.service > /dev/null << EOF
[Unit]
Description=Kolosal Server - LLM Inference Server
After=network.target
[Service]
Type=simple
User=kolosal
Group=kolosal
WorkingDirectory=/opt/kolosal-server
ExecStart=/opt/kolosal-server/kolosal-server --config /etc/kolosal-server/config.yaml
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
EOF
Enable and Start Service:
# Create user for service
sudo useradd -r -s /bin/false kolosal
# Install binary and config
sudo mkdir -p /opt/kolosal-server /etc/kolosal-server
sudo cp build/Release/kolosal-server /opt/kolosal-server/
sudo cp config.example.yaml /etc/kolosal-server/config.yaml
sudo chown -R kolosal:kolosal /opt/kolosal-server
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable kolosal-server
sudo systemctl start kolosal-server
# Check status
sudo systemctl status kolosal-server
Common Build Issues:
- Missing dependencies:
# Check for missing packages
ldd build/kolosal-server
# Install missing development packages
sudo apt install -y libssl-dev libcurl4-openssl-dev
- CMake version too old:
# Install newer CMake from Kitware APT repository
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | sudo tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null
sudo apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main'
sudo apt update && sudo apt install cmake
- CUDA compilation errors:
# Verify CUDA installation
nvcc --version
nvidia-smi
# Set CUDA environment variables if needed
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
- Permission issues:
# Fix ownership
sudo chown -R $USER:$USER ./build
# Make executable
chmod +x build/kolosal-server
Performance Optimization:
- CPU Optimization:
# Build with native optimizations
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-march=native" ..
- Memory Settings:
# For systems with limited RAM, reduce parallel jobs
make -j2
# Set memory limits in config
echo "server.max_memory_mb: 4096" >> config.yaml
- GPU Memory:
# Monitor GPU usage
watch nvidia-smi
# Adjust GPU layers in model config
# Reduce n_gpu_layers if running out of VRAM
macOS Prerequisites:
- macOS 10.15 (Catalina) or later
- Xcode Command Line Tools or Xcode
- CMake 3.14+
- Homebrew (recommended for dependency management)
# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install build tools and required libraries
brew install cmake git curl yaml-cpp
# Optional: Install PoDoFo dependencies for PDF support
brew install freetype jpeg libpng libtiff libxml2
Building:
git clone https://github.com/kolosalai/kolosal-server.git
cd kolosal-server
git submodule update --init --recursive
mkdir build && cd build
# Standard build
cmake -DCMAKE_BUILD_TYPE=Release ..
# With PoDoFo PDF support (if dependencies are installed)
# cmake -DCMAKE_BUILD_TYPE=Release -DUSE_PODOFO=ON ..
# With Metal acceleration (automatically enabled on Apple Silicon)
# Metal support is automatically enabled on macOS
make -j$(sysctl -n hw.ncpu)
Running the Server:
./kolosal-server
Windows Prerequisites:
- Windows 10/11
- Visual Studio 2019 or later
- CMake 3.20+
- VCPKG
- CUDA Toolkit (optional, for GPU acceleration)
- Run git clone https://github.com/kolosalai/kolosal-server.git --recursive
- Run cd kolosal-server
- Make a vcpkg.json file at the root of the project:
{
  "name": "kolosal-server",
  "version-string": "1.0.0",
  "dependencies": [
    "curl",
    "fontconfig",
    "freetype",
    "libjpeg-turbo",
    "libpng",
    "openssl",
    "libxml2",
    "tiff",
    "zlib",
    "openblas",
    "lapack-reference"
  ]
}
- Run mkdir build
- From the project root, run:
cmake -S . -B build -G "Visual Studio 17 2022" -A x64 ^
  -DCMAKE_TOOLCHAIN_FILE="$env:VCPKG_ROOT\scripts\buildsystems\vcpkg.cmake" ^
  -DVCPKG_TARGET_TRIPLET=x64-windows ^
  -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreadedDLL ^
  -DCMAKE_POLICY_DEFAULT_CMP0091=NEW
- Run cmake --build build --config Release --target kolosal_server_exe
./build/Release/kolosal-server.exe
The server will start on http://localhost:8080 by default.
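Once the server is up, you can confirm it is responding by calling the health endpoint shown later in this README (curl ships with recent Windows 10/11; the same check works on Linux and macOS):
curl http://localhost:8080/v1/health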
Kolosal Server supports configuration through JSON and YAML files for advanced setup including authentication, logging, model preloading, and server parameters.
The retrieval endpoints (/add_documents, /retrieve, /remove_documents, /list_documents, /info_documents) use a pluggable vector store:
- FAISS (default, in-process, zero external dependencies)
- Qdrant (optional external service)
If database.vector_database is omitted, FAISS is selected automatically.
database:
  vector_database: faiss  # or qdrant
  faiss:
    index_type: Flat
    index_path: ./data/faiss_index
    dimensions: 1536
    normalize_vectors: true
    metric_type: IP  # IP + normalization approximates cosine
  qdrant:
    enabled: true
    host: localhost
    port: 6333
    collection_name: documents
    default_embedding_model: text-embedding-3-small
FAISS build notes:
- Controlled by the CMake option USE_FAISS (ON by default)
- GPU acceleration toggles automatically if CUDA is found and USE_CUDA is enabled
- Disable with -DUSE_FAISS=OFF
Example build enabling CUDA + FAISS:
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON -DUSE_FAISS=ON ..
Minimal configuration (config.yaml):
server:
  port: "8080"
models:
  - id: "my-model"
    path: "./models/model.gguf"
    load_immediately: true
Full configuration example with authentication, GPU offload, and metrics:
server:
  port: "8080"
  max_connections: 500
  worker_threads: 8
auth:
  enabled: true
  require_api_key: true
  api_keys:
    - "sk-your-api-key-here"
models:
  - id: "gpt-3.5-turbo"
    path: "./models/gpt-3.5-turbo.gguf"
    load_immediately: true
    main_gpu_id: 0
    load_params:
      n_ctx: 4096
      n_gpu_layers: 50
features:
  metrics: true  # Enable /metrics and /completion-metrics
For complete configuration documentation including all parameters, authentication setup, CORS configuration, and more examples, see the Configuration Guide.
Before using chat completions, you need to add a model engine:
curl -X POST http://localhost:8080/engines \
  -H "Content-Type: application/json" \
  -d '{
    "engine_id": "my-model",
    "model_path": "path/to/your/model.gguf",
    "load_immediately": true,
    "n_ctx": 2048,
    "n_gpu_layers": 0,
    "main_gpu_id": 0
  }'
For faster startup times, you can defer model loading until first use:
curl -X POST http://localhost:8080/engines \
  -H "Content-Type: application/json" \
  -d '{
    "engine_id": "my-model",
    "model_path": "https://huggingface.co/model-repo/model.gguf",
    "load_immediately": false,
    "n_ctx": 4096,
    "n_gpu_layers": 30,
    "main_gpu_id": 0
  }'
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you today?"
      }
    ],
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 100
  }'
Response:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Hello! I'm doing well, thank you for asking. How can I help you today?",
        "role": "assistant"
      }
    }
  ],
  "created": 1749981228,
  "id": "chatcmpl-80HTkM01z7aaaThFbuALkbTu",
  "model": "my-model",
  "object": "chat.completion",
  "system_fingerprint": "fp_4d29efe704",
  "usage": {
    "completion_tokens": 15,
    "prompt_tokens": 9,
    "total_tokens": 24
  }
}
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "Tell me a short story about a robot."
      }
    ],
    "stream": true,
    "temperature": 0.8,
    "max_tokens": 150
  }'
Response (Server-Sent Events):
data: {"choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: {"choices":[{"delta":{"content":"Once"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: {"choices":[{"delta":{"content":" upon"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: {"choices":[{"delta":{"content":""},"finish_reason":"stop","index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: [DONE]
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful programming assistant."
      },
      {
        "role": "user",
        "content": "How do I create a simple HTTP server in Python?"
      },
      {
        "role": "assistant",
        "content": "You can create a simple HTTP server in Python using the built-in http.server module..."
      },
      {
        "role": "user",
        "content": "Can you show me the code?"
      }
    ],
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 200
  }'
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "stream": false,
    "temperature": 0.1,
    "top_p": 0.9,
    "max_tokens": 50,
    "seed": 42,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0
  }'
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": "The future of artificial intelligence is",
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 100
  }'
Response:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "text": " bright and full of possibilities. As we continue to advance in machine learning and deep learning technologies, we can expect to see significant improvements in various fields..."
    }
  ],
  "created": 1749981288,
  "id": "cmpl-80HTkM01z7aaaThFbuALkbTu",
  "model": "my-model",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 25,
    "prompt_tokens": 8,
    "total_tokens": 33
  }
}
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "my-model",
    "prompt": "Write a haiku about programming:",
    "stream": true,
    "temperature": 0.8,
    "max_tokens": 50
  }'
Response (Server-Sent Events):
data: {"choices":[{"finish_reason":"","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: {"choices":[{"finish_reason":"","index":0,"text":"Code"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: {"choices":[{"finish_reason":"","index":0,"text":" flows"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: {"choices":[{"finish_reason":"stop","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: [DONE]
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": [
      "The weather today is",
      "In other news,"
    ],
    "stream": false,
    "temperature": 0.5,
    "max_tokens": 30
  }'
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": "Explain quantum computing:",
    "stream": false,
    "temperature": 0.2,
    "top_p": 0.9,
    "max_tokens": 100,
    "seed": 123,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.1
  }'
# List all engines
curl -X GET http://localhost:8080/v1/engines
# Check the status of a specific engine
curl -X GET http://localhost:8080/engines/my-model/status
# Remove an engine
curl -X DELETE http://localhost:8080/engines/my-model
The server provides real-time completion metrics for monitoring performance and usage:
curl -X GET http://localhost:8080/completion-metrics
Response:
{
  "completion_metrics": {
    "summary": {
      "total_requests": 15,
      "completed_requests": 14,
      "failed_requests": 1,
      "success_rate_percent": 93.33,
      "total_input_tokens": 120,
      "total_output_tokens": 350,
      "avg_turnaround_time_ms": 1250.5,
      "avg_tps": 12.8,
      "avg_output_tps": 8.4,
      "avg_ttft_ms": 245.2,
      "avg_rps": 0.85
    },
    "per_engine": [
      {
        "model_name": "my-model",
        "engine_id": "default",
        "total_requests": 15,
        "completed_requests": 14,
        "failed_requests": 1,
        "total_input_tokens": 120,
        "total_output_tokens": 350,
        "tps": 12.8,
        "output_tps": 8.4,
        "avg_ttft": 245.2,
        "rps": 0.85,
        "last_updated": "2025-06-16T17:04:12.123Z"
      }
    ],
    "timestamp": "2025-06-16T17:04:12.123Z"
  }
}
Alternative endpoints:
# OpenAI-style endpoint
curl -X GET http://localhost:8080/v1/completion-metrics
# Alternative path
curl -X GET http://localhost:8080/completion/metrics
| Metric | Description |
|---|---|
| total_requests | Total number of completion requests received | 
| completed_requests | Number of successfully completed requests | 
| failed_requests | Number of requests that failed | 
| success_rate_percent | Success rate as a percentage | 
| total_input_tokens | Total input tokens processed | 
| total_output_tokens | Total output tokens generated | 
| avg_turnaround_time_ms | Average time from request to completion (ms) | 
| avg_tps | Average tokens per second (input + output) | 
| avg_output_tps | Average output tokens per second | 
| avg_ttft_ms | Average time to first token (ms) | 
| avg_rps | Average requests per second | 
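On Linux or macOS you can pull individual summary fields out of this response with curl and jq (assuming jq is installed; the field names follow the example response above):
# Print the overall success rate and average tokens per second
curl -s http://localhost:8080/completion-metrics | jq '.completion_metrics.summary | {success_rate_percent, avg_tps}'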
# Get completion metrics
$metrics = Invoke-RestMethod -Uri "http://localhost:8080/completion-metrics" -Method GET
Write-Output "Success Rate: $($metrics.completion_metrics.summary.success_rate_percent)%"
Write-Output "Average TPS: $($metrics.completion_metrics.summary.avg_tps)"curl -X GET http://localhost:8080/v1/healthFor Windows users, here are PowerShell equivalents:
$body = @{
    engine_id = "my-model"
    model_path = "C:\path\to\model.gguf"
    load_immediately = $true
    n_ctx = 2048
    n_gpu_layers = 0
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:8080/engines" -Method POST -Body $body -ContentType "application/json"$body = @{
    model = "my-model"
    messages = @(
        @{
            role = "user"
            content = "Hello, how are you?"
        }
    )
    stream = $false
    temperature = 0.7
    max_tokens = 100
} | ConvertTo-Json -Depth 3
Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" -Method POST -Body $body -ContentType "application/json"$body = @{
    model = "my-model"
    prompt = "The future of AI is"
    stream = $false
    temperature = 0.7
    max_tokens = 50
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:8080/v1/completions" -Method POST -Body $body -ContentType "application/json"| Parameter | Type | Default | Description | 
|---|---|---|---|
| model | string | required | The ID of the model to use | 
| messages | array | required | List of message objects | 
| stream | boolean | false | Whether to stream responses | 
| temperature | number | 1.0 | Sampling temperature (0.0-2.0) | 
| top_p | number | 1.0 | Nucleus sampling parameter | 
| max_tokens | integer | 128 | Maximum tokens to generate | 
| seed | integer | random | Random seed for reproducible outputs | 
| presence_penalty | number | 0.0 | Presence penalty (-2.0 to 2.0) | 
| frequency_penalty | number | 0.0 | Frequency penalty (-2.0 to 2.0) | 
| Parameter | Type | Default | Description | 
|---|---|---|---|
| model | string | required | The ID of the model to use | 
| prompt | string/array | required | Text prompt or array of prompts | 
| stream | boolean | false | Whether to stream responses | 
| temperature | number | 1.0 | Sampling temperature (0.0-2.0) | 
| top_p | number | 1.0 | Nucleus sampling parameter | 
| max_tokens | integer | 16 | Maximum tokens to generate | 
| seed | integer | random | Random seed for reproducible outputs | 
| presence_penalty | number | 0.0 | Presence penalty (-2.0 to 2.0) | 
| frequency_penalty | number | 0.0 | Frequency penalty (-2.0 to 2.0) | 
| Field | Type | Description | 
|---|---|---|
| role | string | Role: "system", "user", or "assistant" | 
| content | string | The content of the message | 
| Parameter | Type | Default | Description | 
|---|---|---|---|
| engine_id | string | required | Unique identifier for the engine | 
| model_path | string | required | Path to the GGUF model file or URL | 
| load_immediately | boolean | true | Whether to load the model immediately or defer until first use | 
| n_ctx | integer | 4096 | Context window size | 
| n_gpu_layers | integer | 100 | Number of layers to offload to GPU | 
| main_gpu_id | integer | 0 | Primary GPU device ID | 
The server returns standard HTTP status codes and JSON error responses:
{
  "error": {
    "message": "Model 'non-existent-model' not found or could not be loaded",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}
Common error codes:
- 400 - Bad Request (invalid JSON, missing parameters)
- 404 - Not Found (model/engine not found)
- 500 - Internal Server Error (inference failures)
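To see which status code accompanies an error body, add -i to curl so the HTTP status line is printed before the JSON. For example, requesting an engine ID that was never added should return an error similar to the one shown above (the exact message may differ):
# -i prints the HTTP status line and headers before the JSON error body
curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "non-existent-model", "messages": [{"role": "user", "content": "Hello"}]}'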
For developers looking to contribute to or extend Kolosal Server, comprehensive documentation is available in the docs/ directory:
- Developer Guide - Complete setup, architecture, and development workflows
- Configuration Guide - Complete server configuration in JSON and YAML formats
- Architecture Overview - Detailed system design and component relationships
- Adding New Routes - Step-by-step guide for implementing API endpoints
- Adding New Models - Guide for creating data models and JSON handling
- API Specification - Complete API reference with examples
- Documentation Index - Complete documentation overview
- Project Structure - Understanding the codebase
- Contributing Guidelines - How to contribute
Kolosal Server is built on top of excellent open-source projects and we want to acknowledge their contributions:
This project is powered by llama.cpp, developed by Georgi Gerganov and the ggml-org community. llama.cpp provides the high-performance inference engine that makes Kolosal Server possible.
- Project: https://github.com/ggml-org/llama.cpp
- License: MIT License
- Description: Inference of Meta's LLaMA model (and others) in pure C/C++
We extend our gratitude to the llama.cpp team for their incredible work on optimized LLM inference, which forms the foundation of our server's performance capabilities.
- yaml-cpp: YAML parsing and emitting library
- nlohmann/json: JSON library for Modern C++
- libcurl: Client-side URL transfer library
- prometheus-cpp: Prometheus metrics library for C++
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
We welcome contributions! Please see our Developer Documentation for detailed guides on:
- Getting Started: Developer Guide
- Understanding the System: Architecture Overview
- Adding Features: Route and Model guides
- API Changes: API Specification
- Fork the repository
- Follow the Developer Guide for setup
- Create a feature branch
- Implement your changes following our guides
- Add tests and update documentation
- Submit a Pull Request
- Issues: Report bugs and feature requests on GitHub Issues
- Documentation: Check the docs/ directory for comprehensive guides
- Discussions: Join Kolosal AI Discord for questions and community support