ClawRAG

Production-Ready RAG Engine. Self-Hosted. Data Sovereign. Enterprise-Grade.

ClawRAG is a self-hosted Retrieval-Augmented Generation (RAG) system that gives you complete control over your AI document processing. Run everything against local LLMs or connect to cloud APIs: your data never leaves your infrastructure unless you choose to send it.

🚀 Quick Start • 📊 Editions • 🏢 Enterprise • 📖 Documentation

🎯 Why ClawRAG?

The Problem

Enterprise document processing forces you to choose:

❌ Cloud solutions: Send sensitive data to third parties
❌ Simple tools: Fail on complex PDFs, tables, mixed content
❌ DIY approaches: Months of integration, no production readiness

Our Solution

ClawRAG provides production-grade document intelligence that runs entirely on your infrastructure:

✅ Self-Hosted: Complete data sovereignty
✅ Intelligent Processing: Handles complex PDFs, tables, scanned documents
✅ Hybrid Search: Combines semantic + keyword search for accuracy
✅ Production-Ready: Docker-first, scalable, monitored
✅ Enterprise Path: Seamless upgrade to advanced features

🚀 Quick Start

Prerequisites

Docker & Docker Compose installed
8GB+ RAM (16GB recommended)

One-Command Setup

# 1. Clone and enter
git clone https://github.com/2dogsandanerd/ClawRag.git
cd ClawRag

# 2. Configure
cp .env.example .env

# 3. Start everything
docker compose up -d

# 4. Verify
curl http://localhost:8080/health

Access your instance:

🌐 Web UI: http://localhost:8080
📚 API Docs: http://localhost:8080/docs
🔍 Health Check: http://localhost:8080/health
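
The containers can take a while to answer after `docker compose up -d`, so a script that calls the API right away may fail. Below is a minimal sketch of a readiness poll against the health endpoint, using only the standard library. The function name `wait_for_health`, the timeout values, the injectable `fetch` parameter, and the assumption that a healthy instance answers with HTTP 200 are illustrative and not part of ClawRAG.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url="http://localhost:8080/health", timeout_s=120.0,
                    interval_s=2.0, fetch=None, sleep=time.sleep):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses.

    `fetch` is a hypothetical injection point (url -> status code or None)
    so the loop can be exercised without a running instance.
    """
    def default_fetch(target):
        try:
            with urllib.request.urlopen(target, timeout=5) as resp:
                return resp.status
        except urllib.error.HTTPError as exc:
            return exc.code  # server answered, but not with a success code
        except (urllib.error.URLError, OSError):
            return None  # not reachable yet

    fetch = fetch or default_fetch
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_for_health()` before the first API request returns `True` once the gateway responds and `False` if the timeout is reached first.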

📊 Editions

Choose the edition that fits your needs:

Feature	Community Edition	Enterprise Edition
Document Processing	Single-engine (Docling)	Multi-engine consensus
Supported Formats	PDF, DOCX, MD, TXT, CSV	+ Images, Email, Code, PPT, XLS
Search	Vector + BM25 Hybrid	+ Graph RAG, Semantic Routing
Data Validation	Basic quality scoring	Consensus validation, 100% accuracy
Human Verification	Manual review	Surgical precision (only conflicts)
Multi-Tenancy	Single user	Mission-based isolation
Quality Assurance	Basic logging	Real-time dashboards, automated testing
Support	Community	SLA-backed, dedicated support
Pricing	Free (MIT License)	Custom enterprise licensing

→ Compare full feature matrix

🏢 Enterprise Features

Need more? ClawRAG Enterprise Core provides advanced capabilities:

🧠 Adaptive Intelligence

Multi-Engine Processing: 3+ specialized extractors work in parallel
Consensus Validation: Automated comparison ensures 100% data integrity
Intelligent Routing: Analyzes document type and selects optimal strategy

🔒 Zero Data Loss Architecture

Parallel Processing: Multiple engines analyze each document independently
Conflict Detection: Flags discrepancies for targeted human review
Visual Verification: See exactly where conflicts occur on source documents
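
This README does not describe how the Enterprise consensus step is implemented. As a rough illustration of the general idea only, the sketch below compares the fields returned by several extractors, accepts values that all of them agree on, and flags everything else for review. The function name `find_conflicts`, the engine names, and the sample invoice fields are all hypothetical.

```python
def find_conflicts(extractions):
    """Split extracted fields into agreed values and conflicts.

    `extractions` maps an engine name to the {field: value} dict it produced.
    A field is accepted only when every engine reports the same value; a
    field that an engine did not report counts as a disagreement.
    """
    fields = set()
    for result in extractions.values():
        fields.update(result)

    agreed, conflicts = {}, {}
    for field in sorted(fields):
        values = {engine: result.get(field) for engine, result in extractions.items()}
        distinct = set(values.values())
        if len(distinct) == 1:
            agreed[field] = distinct.pop()
        else:
            conflicts[field] = values  # keep per-engine values for the reviewer
    return agreed, conflicts


# Illustrative input: three hypothetical engines reading the same invoice.
engines = {
    "engine_a": {"invoice_no": "INV-1001", "total": "1,250.00"},
    "engine_b": {"invoice_no": "INV-1001", "total": "1,250.00"},
    "engine_c": {"invoice_no": "INV-1001", "total": "1,280.00"},
}
agreed, conflicts = find_conflicts(engines)
print(agreed)     # only the field all engines agree on
print(conflicts)  # the disputed field, with each engine's reading
```

Only the disputed field reaches a human reviewer here, which is the behaviour the "targeted human review" bullet describes. A real system would likely also normalize values and weigh engines differently.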

🌐 Advanced Knowledge Retrieval

Graph RAG: Neo4j-powered relationship traversal between documents
Semantic + Graph Hybrid: Combines concept search with factual relationships
Query Decomposition: Complex queries automatically split into sub-tasks

📊 Mission-Based Multi-Tenancy

Customer Isolation: Complete data separation per mission
Hot Configuration: Update rules without system restart
Quality Gates: Per-mission thresholds and processing rules

📈 Continuous Quality Assurance

Real-Time Observability: Grafana dashboards for all processing stages
Automated Testing: Continuous validation against reference datasets
Performance Monitoring: Track accuracy, speed, and system health

→ View Enterprise Manifest

💼 Use Cases

📄 Legal & Compliance

Process contracts, court filings, and regulatory documents with:

Citation extraction and validation
Clause detection and comparison
Audit-compliant processing trails
PII detection and redaction

🏥 Healthcare & Research

Analyze medical literature and patient records:

Medical entity extraction
Cross-document concept linking
Citation graph analysis
HIPAA-compliant processing

💰 Financial Services

Process invoices, reports, and due diligence documents:

Entity mapping and verification
Anomaly detection in financial data
Structured data extraction from tables
Multi-document reconciliation

🔬 Technical Documentation

Ingest API docs, manuals, and technical specifications:

Code block preservation
Structured content extraction
Semantic chunking for technical terms
Cross-reference resolution

🏗️ Architecture

┌─────────────────────────────────────────────────────────┐
│                    Nginx Gateway (8080)                 │
│  ┌──────────────┐  ┌──────────────────────────────┐    │
│  │  Frontend UI │  │  FastAPI Backend (8081)      │    │
│  │  (Vue.js)    │  │  - RAG API                   │    │
│  └──────────────┘  │  - Document Processing       │    │
│                    │  - Hybrid Search             │    │
│                    └──────────────┬───────────────┘    │
└───────────────────────────────────┼─────────────────────┘
                                    │
         ┌──────────────────────────┼──────────────────┐
         │                          │                  │
         ▼                          ▼                  ▼
┌─────────────────┐  ┌──────────────────────┐  ┌──────────────┐
│  ChromaDB       │  │  Ollama / LLM        │  │  (Optional)  │
│  Vector Store   │  │  Embeddings & Chat   │  │  Enterprise  │
│  (Port 8001)    │  │  (Port 11434)        │  │  Extensions  │
└─────────────────┘  └──────────────────────┘  └──────────────┘

Key Components

Backend (backend/src/):

api/v1/rag/ - REST API endpoints (ingestion, query, collections)
core/ - ChromaDB manager, document loaders, retrievers
services/ - Document processing, deduplication, task management

Processing Pipeline:

Ingestion: Multi-format support with intelligent parsing
Chunking: Configurable semantic and sentence-based strategies
Embedding: Multi-provider support (Ollama, OpenAI, Anthropic)
Storage: ChromaDB with metadata and hybrid search
Retrieval: Vector + BM25 with Reciprocal Rank Fusion
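
To make the retrieval step more concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked result lists. It is not ClawRAG's own code: the function name, the document IDs, and the constant `k=60` (a commonly used default for RRF) are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the two retrievers for the same query.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents returned by both retrievers (`doc_a`, `doc_c`) rise to the top because their reciprocal-rank scores add up. RRF uses only rank positions, so the vector and BM25 scores never need to be put on a common scale.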

⚙️ Configuration

Environment Variables

Variable	Default	Description
`PORT`	`8080`	External port for nginx gateway
`DOCS_DIR`	`./data/docs`	Host directory for folder ingestion
`LLM_PROVIDER`	`ollama`	LLM provider (ollama, openai, openai_compatible, anthropic, gemini)
`LLM_MODEL`	`llama3:latest`	Model name for selected provider
`EMBEDDING_MODEL`	`nomic-embed-text`	Embedding model name
`CHUNK_SIZE`	`512`	Size of text chunks
`CHUNK_OVERLAP`	`128`	Overlap between chunks
`DEBUG`	`false`	Enable debug logging
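
To show how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, the sketch below splits text with a plain sliding window. It is an illustration only: it counts characters, whereas the table does not say whether ClawRAG measures chunks in characters or tokens, and the real chunkers are described as semantic and sentence-based. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=128):
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so the window
    start advances by chunk_size - chunk_overlap each step.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(1000))  # 1000 characters
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [512, 512, 232]
```

With the defaults, each new chunk starts 384 characters after the previous one. A larger overlap preserves more context across chunk boundaries but stores and embeds more duplicated text.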

LLM Configuration Examples

Local (Ollama) - Privacy First:

LLM_PROVIDER=ollama
LLM_MODEL=llama3:latest
EMBEDDING_MODEL=nomic-embed-text

Cloud (OpenAI) - Maximum Performance:

LLM_PROVIDER=openai
LLM_MODEL=gpt-4o
OPENAI_API_KEY=sk-proj-...

Local Server (LM Studio, etc.):

LLM_PROVIDER=openai_compatible
LLM_MODEL=local-model
OPENAI_BASE_URL=http://host.docker.internal:1234/v1
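
A quick sanity check of these settings before starting the stack can save a restart cycle. The sketch below is hypothetical and not part of ClawRAG: the defaults come from the environment variable table, and the required extra variables are inferred from the three examples above. The anthropic and gemini providers are not covered because this README shows no example for them.

```python
import os

# Defaults taken from the environment variable table.
DEFAULTS = {
    "LLM_PROVIDER": "ollama",
    "LLM_MODEL": "llama3:latest",
    "EMBEDDING_MODEL": "nomic-embed-text",
}

# Extra variables each provider appears to need (assumption based on the examples).
REQUIRED = {
    "ollama": [],
    "openai": ["OPENAI_API_KEY"],
    "openai_compatible": ["OPENAI_BASE_URL"],
}


def check_llm_config(env=None):
    """Return (settings, problems) for the LLM-related variables."""
    env = os.environ if env is None else env
    settings = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    provider = settings["LLM_PROVIDER"]

    problems = []
    if provider not in REQUIRED:
        problems.append(f"no check defined for LLM_PROVIDER={provider!r}")
    for name in REQUIRED.get(provider, []):
        if not env.get(name):
            problems.append(f"{name} is required when LLM_PROVIDER={provider}")
    return settings, problems
```

Called without arguments, `check_llm_config()` reads the current process environment; passing a dict makes it easy to check a parsed `.env` file instead.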

🛠️ Using the API

Python Example

import requests

BASE_URL = "http://localhost:8080/api/v1/rag"

# 1. Create a collection (fields are sent as multipart form data)
resp = requests.post(f"{BASE_URL}/collections", files={
    "collection_name": (None, "legal_docs"),
    "embedding_model": (None, "nomic-embed-text")
}, timeout=30)
resp.raise_for_status()  # stop here if the collection could not be created

# 2. Upload documents
with open("contract.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/documents/upload",
        files={"files": f},
        data={"collection_name": "legal_docs"},
        timeout=300
    )
resp.raise_for_status()

# 3. Query the knowledge base
resp = requests.post(
    f"{BASE_URL}/query",
    json={
        "query": "What are the termination clauses?",
        "collection": "legal_docs",
        "k": 5
    },
    timeout=120
)
resp.raise_for_status()
print(resp.json()["answer"])

cURL Examples

# Health check
curl http://localhost:8080/health

# List collections
curl http://localhost:8080/api/v1/rag/collections

# Query with filters
curl -X POST http://localhost:8080/api/v1/rag/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Explain the architecture",
    "collection": "my_docs",
    "k": 10,
    "similarity_threshold": 0.5
  }'

📦 What's Included

Community Edition

✅ FastAPI backend with comprehensive API
✅ Web UI for document management and querying
✅ ChromaDB vector database
✅ Ollama LLM/embedding service
✅ Nginx reverse proxy
✅ Multi-provider LLM support
✅ Hybrid search (vector + BM25)
✅ Document chunking strategies
✅ Folder batch ingestion

Optional Enterprise Extensions

🔧 Multi-engine consensus processing
🔧 Graph database (Neo4j) integration
🔧 Real-time monitoring dashboards
🔧 Advanced quality assurance pipeline
🔧 Mission-based multi-tenancy
🔧 PII detection and compliance tools

🚨 Troubleshooting

Common Issues

LLM Connection Problems:

# Check LLM initialization
docker compose logs backend | grep "LLM"

# For local servers, verify network
docker exec clawrag-backend curl http://host.docker.internal:11434

Folder Ingestion Not Working:

# Verify DOCS_DIR is set correctly
grep DOCS_DIR .env

# Check mount inside container
docker exec clawrag-backend ls -la /host_root/

Performance Issues:

# Check resource usage
docker stats

# Review logs for bottlenecks
docker compose logs -f backend

→ Full troubleshooting guide

🤝 Community & Support

Community Edition

🐛 GitHub Issues - Bug reports
💡 Discussions - Feature requests
📖 Documentation - Setup and configuration
📝 Contributing Guide - How to contribute

Enterprise Support

🎫 SLA-backed support with response time guarantees
📞 Direct engineering contact
🚀 Priority feature development
🏢 Custom integration assistance

→ Contact for Enterprise

📄 License

Community Edition: MIT License - Free for commercial and personal use.

Enterprise Edition: Proprietary license with custom terms. Contact for details.

🎯 Roadmap

Version 1.3 (Next)

Multi-collection search
Enhanced UI with query history
Additional document format support

Version 2.0 (Planned)

Conversational memory
Advanced analytics dashboard
Plugin system for custom processors

Enterprise Roadmap

Kubernetes deployment
Advanced graph reasoning
Multi-modal search (text + images)
ISO 27001 certification

Built with ❤️ for developers who value data sovereignty
GitHub • Enterprise • Documentation
