A high-performance inference server for large language models with OpenAI-compatible API endpoints. Now available for both Windows and Linux systems!
- 🪟 Windows: Full support with Visual Studio and MSVC
- 🐧 Linux: Native support with GCC/Clang
- 🎮 GPU Acceleration: NVIDIA CUDA and Vulkan support
- 📦 Easy Installation: Direct binary installation or build from source
- 🚀 Fast Inference: Built with llama.cpp for optimized model inference
- 🔗 OpenAI Compatible: Drop-in replacement for OpenAI API endpoints
- 📡 Streaming Support: Real-time streaming responses for chat completions
- 🎛️ Multi-Model Management: Load and manage multiple models simultaneously
- 📊 Real-time Metrics: Monitor completion performance with TPS, TTFT, and success rates
- ⚙️ Lazy Loading: Defer model loading until first request with load_immediately=false
- 🔧 Configurable: Flexible model loading parameters and inference settings
- 🔒 Authentication: API key and rate limiting support
- 🌐 Cross-Platform: Windows and Linux native builds
- 📚 RAG Retrieval: Native FAISS vector store (default) with optional Qdrant backend
System Requirements:
- Ubuntu 20.04+ or equivalent Linux distribution (CentOS 8+, Fedora 32+, Arch Linux)
- GCC 9+ or Clang 10+
- CMake 3.14+
- Git with submodule support
- At least 4GB RAM (8GB+ recommended for larger models)
- CUDA Toolkit 11.0+ (optional, for NVIDIA GPU acceleration)
- Vulkan SDK (optional, for alternative GPU acceleration)
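To sanity-check your toolchain before building, you can query the installed versions (standard commands, not specific to Kolosal Server):
# Verify the build toolchain meets the minimum versions above
gcc --version    # or: clang --version
cmake --version
git --version
nvcc --version   # only relevant if the CUDA Toolkit is installed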
Ubuntu/Debian:
# Update package list
sudo apt update
# Install essential build tools
sudo apt install -y build-essential cmake git pkg-config
# Install required libraries
sudo apt install -y libcurl4-openssl-dev libyaml-cpp-dev
# Optional: Install PoDoFo dependencies for PDF support
sudo apt install -y libfreetype6-dev libjpeg-dev libpng-dev libtiff-dev libxml2-dev libfontconfig1-dev
# Optional: Install FAISS dependencies
sudo apt install libopenblas-dev liblapack-dev
# Optional: Install CUDA for GPU support
# Follow NVIDIA's official installation guide for your distribution
CentOS/RHEL/Fedora:
# For CentOS/RHEL 8+
sudo dnf groupinstall "Development Tools"
sudo dnf install cmake git curl-devel yaml-cpp-devel
# For Fedora
sudo dnf install gcc-c++ cmake git libcurl-devel yaml-cpp-devel
# Optional: Install PoDoFo dependencies for PDF support (Fedora)
sudo dnf install freetype-devel libjpeg-devel libpng-devel libtiff-devel libxml2-devel fontconfig-devel
# Optional: Install FAISS dependencies
sudo dnf install openblas-devel lapack-devel
Arch Linux:
sudo pacman -S base-devel cmake git curl yaml-cpp
# Optional: Install PoDoFo dependencies for PDF support
sudo pacman -S freetype2 libjpeg-turbo libpng libtiff libxml2 fontconfig
# Optional: Install FAISS dependencies
sudo pacman -S openblas lapack
# Note: Package manager installation will be available in future releases
# For now, use the build from source method below
1. Clone the Repository with Submodules:
FAISS is bundled as a submodule in external/faiss. If you clone without --recursive, FAISS will be disabled (the build creates a stub); re-run the submodule command, then reconfigure.
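If you already cloned without --recursive, a minimal recovery sequence looks like this (assuming you configure in a build/ directory as described below):
git submodule update --init --recursive
# then reconfigure from your build directory so FAISS is picked up
cd build && cmake ..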
git clone https://github.com/kolosalai/kolosal-server.git --recursive
cd kolosal-server
2. Create Build Directory:
mkdir build && cd build
3. Configure Build:
Standard Build (CPU-only):
cmake -DCMAKE_BUILD_TYPE=Release ..
With CUDA Support:
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON ..
With Vulkan Support:
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_VULKAN=ON ..
With PoDoFo PDF Support (requires dependencies installed):
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_PODOFO=ON ..
With FAISS Support (requires dependencies installed):
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_FAISS=ON ..
Combined Options:
# CUDA + PoDoFo
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON -DUSE_PODOFO=ON ..
# Vulkan + PoDoFo
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_VULKAN=ON -DUSE_PODOFO=ON ..
Debug Build:
cmake -DCMAKE_BUILD_TYPE=Debug ..
4. Build the Project:
# Use all available CPU cores
make -j
# Or specify number of cores manually
make -j4
5. Verify Build:
# Check if the executable was created
cd Release && ls -la kolosal-server
# Test basic functionality
./kolosal-server --help
6. Install to System Path (Optional):
# Install binary to /usr/local/bin
sudo cp build/Release/kolosal-server /usr/local/bin/
# Make it executable
sudo chmod +x /usr/local/bin/kolosal-server
# Now you can run from anywhere
kolosal-server --help
Start the Server:
# From build/Release directory
./kolosal-server
# Check where the config file is
./kolosal-server --config 
# Or specify a config file
./kolosal-server --config ../config.yaml
Background Service:
# Run in background
nohup ./kolosal-server > server.log 2>&1 &
# Check if running
ps aux | grep kolosal-server
Check Server Status:
# Test if server is responding
curl http://localhost:8080/v1/health
Create Service File:
sudo tee /etc/systemd/system/kolosal-server.service > /dev/null << EOF
[Unit]
Description=Kolosal Server - LLM Inference Server
After=network.target
[Service]
Type=simple
User=kolosal
Group=kolosal
WorkingDirectory=/opt/kolosal-server
ExecStart=/opt/kolosal-server/kolosal-server --config /etc/kolosal-server/config.yaml
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
EOF
Enable and Start Service:
# Create user for service
sudo useradd -r -s /bin/false kolosal
# Install binary and config
sudo mkdir -p /opt/kolosal-server /etc/kolosal-server
sudo cp build/Release/kolosal-server /opt/kolosal-server/
sudo cp config.example.yaml /etc/kolosal-server/config.yaml
sudo chown -R kolosal:kolosal /opt/kolosal-server
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable kolosal-server
sudo systemctl start kolosal-server
# Check status
sudo systemctl status kolosal-server
Common Build Issues:
- Missing dependencies:
# Check for missing packages
ldd build/kolosal-server
# Install missing development packages
sudo apt install -y libssl-dev libcurl4-openssl-dev
- CMake version too old:
# Install newer CMake from Kitware APT repository
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | sudo tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null
sudo apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main'
sudo apt update && sudo apt install cmake
- CUDA compilation errors:
# Verify CUDA installation
nvcc --version
nvidia-smi
# Set CUDA environment variables if needed
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
- Permission issues:
# Fix ownership
sudo chown -R $USER:$USER ./build
# Make executable
chmod +x build/kolosal-server
Performance Optimization:
- CPU Optimization:
# Build with native optimizations
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-march=native" ..
- Memory Settings:
# For systems with limited RAM, reduce parallel jobs
make -j2
# Set memory limits in config
echo "server.max_memory_mb: 4096" >> config.yaml
- GPU Memory:
# Monitor GPU usage
watch nvidia-smi
# Adjust GPU layers in model config
# Reduce n_gpu_layers if running out of VRAM
macOS Prerequisites:
- macOS 10.15 (Catalina) or later
- Xcode Command Line Tools or Xcode
- CMake 3.14+
- Homebrew (recommended for dependency management)
# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install build tools and required libraries
brew install cmake git curl yaml-cpp
# Optional: Install PoDoFo dependencies for PDF support
brew install freetype jpeg libpng libtiff libxml2
Building:
git clone https://github.com/kolosalai/kolosal-server.git
cd kolosal-server
git submodule update --init --recursive
mkdir build && cd build
# Standard build
cmake -DCMAKE_BUILD_TYPE=Release ..
# With PoDoFo PDF support (if dependencies are installed)
# cmake -DCMAKE_BUILD_TYPE=Release -DUSE_PODOFO=ON ..
# With Metal acceleration (automatically enabled on Apple Silicon)
# Metal support is automatically enabled on macOS
make -j$(sysctl -n hw.ncpu)
Running the Server:
./kolosal-server
Windows Prerequisites:
- Windows 10/11
- Visual Studio 2019 or later
- CMake 3.20+
- VCPKG
- CUDA Toolkit (optional, for GPU acceleration)
- Run git clone https://github.com/kolosalai/kolosal-server.git --recursive
- Run cd kolosal-server
- Make a vcpkg.json file at the root of the project:
{
  "name": "kolosal-server",
  "version-string": "1.0.0",
  "dependencies": [
    "curl",
    "fontconfig",
    "freetype",
    "libjpeg-turbo",
    "libpng",
    "openssl",
    "libxml2",
    "tiff",
    "zlib",
    "openblas",
    "lapack-reference"
  ]
}
- Run mkdir build
- From the project root, run:
cmake -S . -B build -G "Visual Studio 17 2022" -A x64 ^
  -DCMAKE_TOOLCHAIN_FILE="$env:VCPKG_ROOT\scripts\buildsystems\vcpkg.cmake" ^
  -DVCPKG_TARGET_TRIPLET=x64-windows ^
  -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreadedDLL ^
  -DCMAKE_POLICY_DEFAULT_CMP0091=NEW
- Run cmake --build build --config Release --target kolosal_server_exe
./build/Release/kolosal-server.exe
The server will start on http://localhost:8080 by default.
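Once the server is up, you can confirm it is responding by calling the health endpoint shown later in this README (curl ships with recent Windows 10/11; the same check works on Linux and macOS):
curl http://localhost:8080/v1/health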
Kolosal Server supports configuration through JSON and YAML files for advanced setup including authentication, logging, model preloading, and server parameters.
The retrieval endpoints (/add_documents, /retrieve, /remove_documents, /list_documents, /info_documents) use a pluggable vector store:
- FAISS (default, in-process, zero external dependencies)
- Qdrant (optional external service)
If database.vector_database is omitted, FAISS is selected automatically.
database:
  vector_database: faiss  # or qdrant
  faiss:
    index_type: Flat
    index_path: ./data/faiss_index
    dimensions: 1536
    normalize_vectors: true
    metric_type: IP  # IP + normalization approximates cosine
  qdrant:
    enabled: true
    host: localhost
    port: 6333
    collection_name: documents
    default_embedding_model: text-embedding-3-small
FAISS build notes:
- Controlled by the CMake option USE_FAISS (ON by default)
- GPU acceleration toggles automatically if CUDA is found and USE_CUDA is enabled
- Disable with -DUSE_FAISS=OFF
Example build enabling CUDA + FAISS:
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON -DUSE_FAISS=ON ..
Minimal configuration (config.yaml):
server:
  port: "8080"
models:
  - id: "my-model"
    path: "./models/model.gguf"
    load_immediately: true
Full configuration example with authentication, GPU offload, and metrics:
server:
  port: "8080"
  max_connections: 500
  worker_threads: 8
auth:
  enabled: true
  require_api_key: true
  api_keys:
    - "sk-your-api-key-here"
models:
  - id: "gpt-3.5-turbo"
    path: "./models/gpt-3.5-turbo.gguf"
    load_immediately: true
    main_gpu_id: 0
    load_params:
      n_ctx: 4096
      n_gpu_layers: 50
features:
  metrics: true  # Enable /metrics and /completion-metrics
For complete configuration documentation including all parameters, authentication setup, CORS configuration, and more examples, see the Configuration Guide.
Before using chat completions, you need to add a model engine:
curl -X POST http://localhost:8080/engines \
  -H "Content-Type: application/json" \
  -d '{
    "engine_id": "my-model",
    "model_path": "path/to/your/model.gguf",
    "load_immediately": true,
    "n_ctx": 2048,
    "n_gpu_layers": 0,
    "main_gpu_id": 0
  }'
For faster startup times, you can defer model loading until first use:
curl -X POST http://localhost:8080/engines \
  -H "Content-Type: application/json" \
  -d '{
    "engine_id": "my-model",
    "model_path": "https://huggingface.co/model-repo/model.gguf",
    "load_immediately": false,
    "n_ctx": 4096,
    "n_gpu_layers": 30,
    "main_gpu_id": 0
  }'
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you today?"
      }
    ],
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 100
  }'
Response:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Hello! I'm doing well, thank you for asking. How can I help you today?",
        "role": "assistant"
      }
    }
  ],
  "created": 1749981228,
  "id": "chatcmpl-80HTkM01z7aaaThFbuALkbTu",
  "model": "my-model",
  "object": "chat.completion",
  "system_fingerprint": "fp_4d29efe704",
  "usage": {
    "completion_tokens": 15,
    "prompt_tokens": 9,
    "total_tokens": 24
  }
}
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "Tell me a short story about a robot."
      }
    ],
    "stream": true,
    "temperature": 0.8,
    "max_tokens": 150
  }'
Response (Server-Sent Events):
data: {"choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: {"choices":[{"delta":{"content":"Once"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: {"choices":[{"delta":{"content":" upon"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: {"choices":[{"delta":{"content":""},"finish_reason":"stop","index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: [DONE]
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful programming assistant."
      },
      {
        "role": "user",
        "content": "How do I create a simple HTTP server in Python?"
      },
      {
        "role": "assistant",
        "content": "You can create a simple HTTP server in Python using the built-in http.server module..."
      },
      {
        "role": "user",
        "content": "Can you show me the code?"
      }
    ],
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 200
  }'
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "stream": false,
    "temperature": 0.1,
    "top_p": 0.9,
    "max_tokens": 50,
    "seed": 42,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0
  }'
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": "The future of artificial intelligence is",
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 100
  }'
Response:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "text": " bright and full of possibilities. As we continue to advance in machine learning and deep learning technologies, we can expect to see significant improvements in various fields..."
    }
  ],
  "created": 1749981288,
  "id": "cmpl-80HTkM01z7aaaThFbuALkbTu",
  "model": "my-model",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 25,
    "prompt_tokens": 8,
    "total_tokens": 33
  }
}
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "my-model",
    "prompt": "Write a haiku about programming:",
    "stream": true,
    "temperature": 0.8,
    "max_tokens": 50
  }'
Response (Server-Sent Events):
data: {"choices":[{"finish_reason":"","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: {"choices":[{"finish_reason":"","index":0,"text":"Code"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: {"choices":[{"finish_reason":"","index":0,"text":" flows"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: {"choices":[{"finish_reason":"stop","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: [DONE]
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": [
      "The weather today is",
      "In other news,"
    ],
    "stream": false,
    "temperature": 0.5,
    "max_tokens": 30
  }'
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": "Explain quantum computing:",
    "stream": false,
    "temperature": 0.2,
    "top_p": 0.9,
    "max_tokens": 100,
    "seed": 123,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.1
  }'
# List all engines
curl -X GET http://localhost:8080/v1/engines
# Check the status of a specific engine
curl -X GET http://localhost:8080/engines/my-model/status
# Remove an engine
curl -X DELETE http://localhost:8080/engines/my-model
The server provides real-time completion metrics for monitoring performance and usage:
curl -X GET http://localhost:8080/completion-metrics
Response:
{
  "completion_metrics": {
    "summary": {
      "total_requests": 15,
      "completed_requests": 14,
      "failed_requests": 1,
      "success_rate_percent": 93.33,
      "total_input_tokens": 120,
      "total_output_tokens": 350,
      "avg_turnaround_time_ms": 1250.5,
      "avg_tps": 12.8,
      "avg_output_tps": 8.4,
      "avg_ttft_ms": 245.2,
      "avg_rps": 0.85
    },
    "per_engine": [
      {
        "model_name": "my-model",
        "engine_id": "default",
        "total_requests": 15,
        "completed_requests": 14,
        "failed_requests": 1,
        "total_input_tokens": 120,
        "total_output_tokens": 350,
        "tps": 12.8,
        "output_tps": 8.4,
        "avg_ttft": 245.2,
        "rps": 0.85,
        "last_updated": "2025-06-16T17:04:12.123Z"
      }
    ],
    "timestamp": "2025-06-16T17:04:12.123Z"
  }
}
Alternative endpoints:
# OpenAI-style endpoint
curl -X GET http://localhost:8080/v1/completion-metrics
# Alternative path
curl -X GET http://localhost:8080/completion/metrics
| Metric | Description |
|---|---|
| total_requests | Total number of completion requests received | 
| completed_requests | Number of successfully completed requests | 
| failed_requests | Number of requests that failed | 
| success_rate_percent | Success rate as a percentage | 
| total_input_tokens | Total input tokens processed | 
| total_output_tokens | Total output tokens generated | 
| avg_turnaround_time_ms | Average time from request to completion (ms) | 
| avg_tps | Average tokens per second (input + output) | 
| avg_output_tps | Average output tokens per second | 
| avg_ttft_ms | Average time to first token (ms) | 
| avg_rps | Average requests per second | 
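On Linux or macOS you can pull individual summary fields out of this response with curl and jq (assuming jq is installed; the field names follow the example response above):
# Print the overall success rate and average tokens per second
curl -s http://localhost:8080/completion-metrics | jq '.completion_metrics.summary | {success_rate_percent, avg_tps}'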
# Get completion metrics
$metrics = Invoke-RestMethod -Uri "http://localhost:8080/completion-metrics" -Method GET
Write-Output "Success Rate: $($metrics.completion_metrics.summary.success_rate_percent)%"
Write-Output "Average TPS: $($metrics.completion_metrics.summary.avg_tps)"curl -X GET http://localhost:8080/v1/healthFor Windows users, here are PowerShell equivalents:
$body = @{
    engine_id = "my-model"
    model_path = "C:\path\to\model.gguf"
    load_immediately = $true
    n_ctx = 2048
    n_gpu_layers = 0
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:8080/engines" -Method POST -Body $body -ContentType "application/json"$body = @{
    model = "my-model"
    messages = @(
        @{
            role = "user"
            content = "Hello, how are you?"
        }
    )
    stream = $false
    temperature = 0.7
    max_tokens = 100
} | ConvertTo-Json -Depth 3
Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" -Method POST -Body $body -ContentType "application/json"$body = @{
    model = "my-model"
    prompt = "The future of AI is"
    stream = $false
    temperature = 0.7
    max_tokens = 50
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:8080/v1/completions" -Method POST -Body $body -ContentType "application/json"| Parameter | Type | Default | Description | 
|---|---|---|---|
| model | string | required | The ID of the model to use | 
| messages | array | required | List of message objects | 
| stream | boolean | false | Whether to stream responses | 
| temperature | number | 1.0 | Sampling temperature (0.0-2.0) | 
| top_p | number | 1.0 | Nucleus sampling parameter | 
| max_tokens | integer | 128 | Maximum tokens to generate | 
| seed | integer | random | Random seed for reproducible outputs | 
| presence_penalty | number | 0.0 | Presence penalty (-2.0 to 2.0) | 
| frequency_penalty | number | 0.0 | Frequency penalty (-2.0 to 2.0) | 
| Parameter | Type | Default | Description | 
|---|---|---|---|
| model | string | required | The ID of the model to use | 
| prompt | string/array | required | Text prompt or array of prompts | 
| stream | boolean | false | Whether to stream responses | 
| temperature | number | 1.0 | Sampling temperature (0.0-2.0) | 
| top_p | number | 1.0 | Nucleus sampling parameter | 
| max_tokens | integer | 16 | Maximum tokens to generate | 
| seed | integer | random | Random seed for reproducible outputs | 
| presence_penalty | number | 0.0 | Presence penalty (-2.0 to 2.0) | 
| frequency_penalty | number | 0.0 | Frequency penalty (-2.0 to 2.0) | 
| Field | Type | Description | 
|---|---|---|
| role | string | Role: "system", "user", or "assistant" | 
| content | string | The content of the message | 
| Parameter | Type | Default | Description | 
|---|---|---|---|
| engine_id | string | required | Unique identifier for the engine | 
| model_path | string | required | Path to the GGUF model file or URL | 
| load_immediately | boolean | true | Whether to load the model immediately or defer until first use | 
| n_ctx | integer | 4096 | Context window size | 
| n_gpu_layers | integer | 100 | Number of layers to offload to GPU | 
| main_gpu_id | integer | 0 | Primary GPU device ID | 
The server returns standard HTTP status codes and JSON error responses:
{
  "error": {
    "message": "Model 'non-existent-model' not found or could not be loaded",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}
Common error codes:
- 400 - Bad Request (invalid JSON, missing parameters)
- 404 - Not Found (model/engine not found)
- 500 - Internal Server Error (inference failures)
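To see which status code accompanies an error body, add -i to curl so the HTTP status line is printed before the JSON. For example, requesting an engine ID that was never added should return an error similar to the one shown above (the exact message may differ):
# -i prints the HTTP status line and headers before the JSON error body
curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "non-existent-model", "messages": [{"role": "user", "content": "Hello"}]}'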
For developers looking to contribute to or extend Kolosal Server, comprehensive documentation is available in the docs/ directory:
- Developer Guide - Complete setup, architecture, and development workflows
- Configuration Guide - Complete server configuration in JSON and YAML formats
- Architecture Overview - Detailed system design and component relationships
- Adding New Routes - Step-by-step guide for implementing API endpoints
- Adding New Models - Guide for creating data models and JSON handling
- API Specification - Complete API reference with examples
- Documentation Index - Complete documentation overview
- Project Structure - Understanding the codebase
- Contributing Guidelines - How to contribute
Kolosal Server is built on top of excellent open-source projects and we want to acknowledge their contributions:
This project is powered by llama.cpp, developed by Georgi Gerganov and the ggml-org community. llama.cpp provides the high-performance inference engine that makes Kolosal Server possible.
- Project: https://github.com/ggml-org/llama.cpp
- License: MIT License
- Description: Inference of Meta's LLaMA model (and others) in pure C/C++
We extend our gratitude to the llama.cpp team for their incredible work on optimized LLM inference, which forms the foundation of our server's performance capabilities.
- yaml-cpp: YAML parsing and emitting library
- nlohmann/json: JSON library for Modern C++
- libcurl: Client-side URL transfer library
- prometheus-cpp: Prometheus metrics library for C++
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
We welcome contributions! Please see our Developer Documentation for detailed guides on:
- Getting Started: Developer Guide
- Understanding the System: Architecture Overview
- Adding Features: Route and Model guides
- API Changes: API Specification
- Fork the repository
- Follow the Developer Guide for setup
- Create a feature branch
- Implement your changes following our guides
- Add tests and update documentation
- Submit a Pull Request
- Issues: Report bugs and feature requests on GitHub Issues
- Documentation: Check the docs/ directory for comprehensive guides
- Discussions: Join Kolosal AI Discord for questions and community support