2dogsandanerd/ClawRag


⚠️ IMPORTANT NOTICE REGARDING PLAGIARISM & AUTHORSHIP ⚠️

It has come to my attention that the GitHub user @tfantas (Thiago Antas) and his automated account @jarvis-aix are falsely claiming credit for my architecture. They have explicitly listed my original repositories (including RAG_enterprise_core, smart-ingest-kit, DAUT, etc.) as their own '🔬 Featured Work' on their public profile without authorization or proper attribution. Below is the documented proof.

Screenshot: plagiarism

https://github.com/tfantas seems to have 20+ years of experience but no ideas of his own. I'm going to make him famous. If you enjoyed my repos and found them useful, I'm sorry, but I'm out of this game! No more open source, sorry. I'm sure you will find my further-developed repos at https://github.com/jarvis-aix. What a disgrace and disrespect!

This repository, the Multi-Lane Consensus Architecture, and the V4.0 Manifest are 100% my original work, built over two years. Please be highly cautious of actors in the AI space attempting to rebrand, clone, or take credit for this Enterprise RAG system.

⚠️ ⚠️ ⚠️

ClawRAG

Version License: MIT Docker Python 3.12 Enterprise

Production-Ready RAG Engine. Self-Hosted. Data Sovereign. Enterprise-Grade.

ClawRAG is a powerful, self-hosted Retrieval-Augmented Generation (RAG) system that gives you complete control over your AI document processing. Process documents locally with local LLMs or connect to cloud APIs; your data never leaves your infrastructure unless you want it to.

🚀 Quick Start • 📊 Editions • 🏢 Enterprise • 📖 Documentation


🎯 Why ClawRAG?

The Problem

Enterprise document processing forces you to choose:

  • ❌ Cloud solutions: Send sensitive data to third parties
  • ❌ Simple tools: Fail on complex PDFs, tables, mixed content
  • ❌ DIY approaches: Months of integration, no production readiness

Our Solution

ClawRAG provides production-grade document intelligence that runs entirely on your infrastructure:

✅ Self-Hosted: Complete data sovereignty
✅ Intelligent Processing: Handles complex PDFs, tables, scanned documents
✅ Hybrid Search: Combines semantic + keyword search for accuracy
✅ Production-Ready: Docker-first, scalable, monitored
✅ Enterprise Path: Seamless upgrade to advanced features


🚀 Quick Start

Prerequisites

  • Docker & Docker Compose installed
  • 8GB+ RAM (16GB recommended)

One-Command Setup

# 1. Clone and enter
git clone https://github.com/2dogsandanerd/ClawRag.git
cd ClawRag

# 2. Configure
cp .env.example .env

# 3. Start everything
docker compose up -d

# 4. Verify
curl http://localhost:8080/health
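
Containers can take a while to become ready after `docker compose up -d`, so a single `curl` may fail on the first try. The following is a minimal sketch of a readiness poll against the same `/health` URL. The function names, retry count, and delay are illustrative assumptions, not part of ClawRAG, and the sketch assumes only that the endpoint answers with HTTP 200 once the stack is up.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, 4xx/5xx, or a malformed response.
        return False


def wait_until_healthy(url="http://localhost:8080/health", attempts=30,
                       delay=2.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until the probe succeeds.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. `probe` and `sleep` are injectable so the loop can be
    exercised without a running server.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Calling `wait_until_healthy()` with its defaults waits up to roughly a minute; a return value of `None` suggests checking `docker compose logs backend`.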

Access your instance:


📊 Editions

Choose the edition that fits your needs:

| Feature | Community Edition | Enterprise Edition |
| --- | --- | --- |
| Document Processing | Single-engine (Docling) | Multi-engine consensus |
| Supported Formats | PDF, DOCX, MD, TXT, CSV | + Images, Email, Code, PPT, XLS |
| Search | Vector + BM25 Hybrid | + Graph RAG, Semantic Routing |
| Data Validation | Basic quality scoring | Consensus validation, 100% accuracy |
| Human Verification | Manual review | Surgical precision (only conflicts) |
| Multi-Tenancy | Single user | Mission-based isolation |
| Quality Assurance | Basic logging | Real-time dashboards, automated testing |
| Support | Community | SLA-backed, dedicated support |
| Pricing | Free (MIT License) | Custom enterprise licensing |

→ Compare full feature matrix


🏢 Enterprise Features

Need more? ClawRAG Enterprise Core provides advanced capabilities:

🧠 Adaptive Intelligence

  • Multi-Engine Processing: 3+ specialized extractors work in parallel
  • Consensus Validation: Automated comparison ensures 100% data integrity
  • Intelligent Routing: Analyzes document type and selects optimal strategy

🔒 Zero Data Loss Architecture

  • Parallel Processing: Multiple engines analyze each document independently
  • Conflict Detection: Flags discrepancies for targeted human review
  • Visual Verification: See exactly where conflicts occur on source documents
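
To illustrate the idea of running several engines independently and flagging only the discrepancies, here is a minimal sketch. It is not the Enterprise implementation: the engine names, the field-level data model, and the "accept only on unanimous agreement" rule are all assumptions made for the example.

```python
def find_consensus(extractions, min_engines=2):
    """Split extracted fields into agreed values and conflicts.

    `extractions` maps a (hypothetical) engine name to a dict of
    field -> extracted value. A field is accepted only when at least
    `min_engines` engines reported it and all of them agree; everything
    else is returned as a conflict for human review.
    """
    fields = sorted({f for result in extractions.values() for f in result})
    accepted, conflicts = {}, {}
    for field in fields:
        # Collect each engine's vote for this field.
        votes = {engine: result[field]
                 for engine, result in extractions.items()
                 if field in result}
        distinct = set(votes.values())
        if len(distinct) == 1 and len(votes) >= min_engines:
            accepted[field] = next(iter(distinct))
        else:
            conflicts[field] = votes
    return accepted, conflicts


# Three illustrative engines disagree on one field only.
results = {
    "engine_a": {"invoice_no": "INV-1", "total": "100.00"},
    "engine_b": {"invoice_no": "INV-1", "total": "100.00"},
    "engine_c": {"invoice_no": "INV-1", "total": "108.00"},
}
accepted, conflicts = find_consensus(results)
```

Here only `total` lands in `conflicts`, so a reviewer would look at that single field instead of the whole document.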

🌐 Advanced Knowledge Retrieval

  • Graph RAG: Neo4j-powered relationship traversal between documents
  • Semantic + Graph Hybrid: Combines concept search with factual relationships
  • Query Decomposition: Complex queries automatically split into sub-tasks

πŸ“Š Mission-Based Multi-Tenancy

  • Customer Isolation: Complete data separation per mission
  • Hot Configuration: Update rules without system restart
  • Quality Gates: Per-mission thresholds and processing rules

📈 Continuous Quality Assurance

  • Real-Time Observability: Grafana dashboards for all processing stages
  • Automated Testing: Continuous validation against reference datasets
  • Performance Monitoring: Track accuracy, speed, and system health

→ View Enterprise Manifest


💼 Use Cases

📄 Legal & Compliance

Process contracts, court filings, and regulatory documents with:

  • Citation extraction and validation
  • Clause detection and comparison
  • Audit-compliant processing trails
  • PII detection and redaction

πŸ₯ Healthcare & Research

Analyze medical literature and patient records:

  • Medical entity extraction
  • Cross-document concept linking
  • Citation graph analysis
  • HIPAA-compliant processing

💰 Financial Services

Process invoices, reports, and due diligence documents:

  • Entity mapping and verification
  • Anomaly detection in financial data
  • Structured data extraction from tables
  • Multi-document reconciliation

🔬 Technical Documentation

Ingest API docs, manuals, and technical specifications:

  • Code block preservation
  • Structured content extraction
  • Semantic chunking for technical terms
  • Cross-reference resolution

πŸ—οΈ Architecture

┌────────────────────────────────────────────────────────┐
│                    Nginx Gateway (8080)                │
│  ┌──────────────┐  ┌──────────────────────────────┐    │
│  │  Frontend UI │  │  FastAPI Backend (8081)      │    │
│  │  (Vue.js)    │  │  - RAG API                   │    │
│  └──────────────┘  │  - Document Processing       │    │
│                    │  - Hybrid Search             │    │
│                    └──────────────┬───────────────┘    │
└───────────────────────────────────┼────────────────────┘
                                    │
         ┌──────────────────────────┼──────────────────┐
         │                          │                  │
         ▼                          ▼                  ▼
┌─────────────────┐  ┌──────────────────────┐  ┌──────────────┐
│  ChromaDB       │  │  Ollama / LLM        │  │  (Optional)  │
│  Vector Store   │  │  Embeddings & Chat   │  │  Enterprise  │
│  (Port 8001)    │  │  (Port 11434)        │  │  Extensions  │
└─────────────────┘  └──────────────────────┘  └──────────────┘

Key Components

Backend (backend/src/):

  • api/v1/rag/ - REST API endpoints (ingestion, query, collections)
  • core/ - ChromaDB manager, document loaders, retrievers
  • services/ - Document processing, deduplication, task management

Processing Pipeline:

  1. Ingestion: Multi-format support with intelligent parsing
  2. Chunking: Configurable semantic and sentence-based strategies
  3. Embedding: Multi-provider support (Ollama, OpenAI, Anthropic)
  4. Storage: ChromaDB with metadata and hybrid search
  5. Retrieval: Vector + BM25 with Reciprocal Rank Fusion
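
Step 5 names Reciprocal Rank Fusion (RRF) as the way the vector and BM25 result lists are merged. The sketch below shows the standard RRF formula, where each document scores the sum of `1 / (k + rank)` over the lists it appears in. The function name and the document IDs are made up for illustration, and `k=60` is the constant commonly used in the literature; the value ClawRAG actually uses is not stated here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each list contributes 1 / (k + rank) to a document's score, with
    ranks starting at 1. Documents found by more than one retriever
    accumulate score from each list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to the ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF works on ranks rather than raw scores, it needs no normalization between cosine similarities and BM25 scores, which is one reason it is a common choice for hybrid search.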

βš™οΈ Configuration

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| PORT | 8080 | External port for nginx gateway |
| DOCS_DIR | ./data/docs | Host directory for folder ingestion |
| LLM_PROVIDER | ollama | LLM provider (ollama, openai, anthropic, gemini) |
| LLM_MODEL | llama3:latest | Model name for selected provider |
| EMBEDDING_MODEL | nomic-embed-text | Embedding model name |
| CHUNK_SIZE | 512 | Size of text chunks |
| CHUNK_OVERLAP | 128 | Overlap between chunks |
| DEBUG | false | Enable debug logging |
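
To make the interaction between `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sliding-window sketch. It assumes both values are measured in characters, which the table does not specify (the backend may count tokens, and its semantic and sentence-based strategies are more involved than this). The function name is hypothetical.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=128):
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so each new
    window starts `chunk_size - chunk_overlap` characters after the
    previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# 1000 characters with the default settings: windows start at 0, 384, 768.
sample = "".join(chr(97 + i % 26) for i in range(1000))
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # [512, 512, 232]
```

With the defaults, a quarter of every chunk repeats the tail of the previous one. A larger overlap helps keep sentences that straddle a boundary retrievable, at the cost of more chunks to embed and store.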

LLM Configuration Examples

Local (Ollama) - Privacy First:

LLM_PROVIDER=ollama
LLM_MODEL=llama3:latest
EMBEDDING_MODEL=nomic-embed-text

Cloud (OpenAI) - Maximum Performance:

LLM_PROVIDER=openai
LLM_MODEL=gpt-4o
OPENAI_API_KEY=sk-proj-...

Local Server (LM Studio, etc.):

LLM_PROVIDER=openai_compatible
LLM_MODEL=local-model
OPENAI_BASE_URL=http://host.docker.internal:1234/v1
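
A quick pre-flight check of `.env` can catch a missing key before the containers start. The sketch below is only an illustration: the required-key mapping mirrors the three examples above and nothing more, the helper names are hypothetical, and the keys needed for the `anthropic` and `gemini` providers are not listed because they are not documented here.

```python
# Extra keys each provider needs, taken only from the examples above (assumption).
REQUIRED_KEYS = {
    "ollama": [],
    "openai": ["OPENAI_API_KEY"],
    "openai_compatible": ["OPENAI_BASE_URL"],
}


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(env):
    """Return the keys the selected provider needs but `env` lacks."""
    provider = env.get("LLM_PROVIDER", "ollama")  # documented default
    return [key for key in REQUIRED_KEYS.get(provider, []) if not env.get(key)]


env = parse_env("LLM_PROVIDER=openai\nLLM_MODEL=gpt-4o\n")
print(missing_settings(env))  # ['OPENAI_API_KEY']
```

Reading the real file would be `parse_env(open(".env").read())`; providers not present in the mapping are simply not checked.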

πŸ› οΈ Using the API

Python Example

import requests

BASE_URL = "http://localhost:8080/api/v1/rag"

# 1. Create a collection
requests.post(f"{BASE_URL}/collections", files={
    "collection_name": (None, "legal_docs"),
    "embedding_model": (None, "nomic-embed-text")
})

# 2. Upload documents
with open("contract.pdf", "rb") as f:
    requests.post(
        f"{BASE_URL}/documents/upload",
        files={"files": f},
        data={"collection_name": "legal_docs"}
    )

# 3. Query knowledge base
response = requests.post(
    f"{BASE_URL}/query",
    json={
        "query": "What are the termination clauses?",
        "collection": "legal_docs",
        "k": 5
    }
)
print(response.json()["answer"])

cURL Examples

# Health check
curl http://localhost:8080/health

# List collections
curl http://localhost:8080/api/v1/rag/collections

# Query with filters
curl -X POST http://localhost:8080/api/v1/rag/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Explain the architecture",
    "collection": "my_docs",
    "k": 10,
    "similarity_threshold": 0.5
  }'

📦 What's Included

Community Edition

  • ✅ FastAPI backend with comprehensive API
  • ✅ Web UI for document management and querying
  • ✅ ChromaDB vector database
  • ✅ Ollama LLM/embedding service
  • ✅ Nginx reverse proxy
  • ✅ Multi-provider LLM support
  • ✅ Hybrid search (vector + BM25)
  • ✅ Document chunking strategies
  • ✅ Folder batch ingestion

Optional Enterprise Extensions

  • 🔧 Multi-engine consensus processing
  • 🔧 Graph database (Neo4j) integration
  • 🔧 Real-time monitoring dashboards
  • 🔧 Advanced quality assurance pipeline
  • 🔧 Mission-based multi-tenancy
  • 🔧 PII detection and compliance tools

🚨 Troubleshooting

Common Issues

LLM Connection Problems:

# Check LLM initialization
docker compose logs backend | grep "LLM"

# For local servers, verify network
docker exec clawrag-backend curl http://host.docker.internal:11434

Folder Ingestion Not Working:

# Verify DOCS_DIR is set correctly
cat .env | grep DOCS_DIR

# Check mount inside container
docker exec clawrag-backend ls -la /host_root/

Performance Issues:

# Check resource usage
docker stats

# Review logs for bottlenecks
docker compose logs -f backend

→ Full troubleshooting guide


🤝 Community & Support

Community Edition

Enterprise Support

  • 🎫 SLA-backed support with response time guarantees
  • 📞 Direct engineering contact
  • 🚀 Priority feature development
  • 🏢 Custom integration assistance

→ Contact for Enterprise


📄 License

Community Edition: MIT License - Free for commercial and personal use.

Enterprise Edition: Proprietary license with custom terms. Contact for details.

Copyright (c) 2025 2dogsandanerd


🎯 Roadmap

Version 1.3 (Next)

  • Multi-collection search
  • Enhanced UI with query history
  • Additional document format support

Version 2.0 (Planned)

  • Conversational memory
  • Advanced analytics dashboard
  • Plugin system for custom processors

Enterprise Roadmap

  • Kubernetes deployment
  • Advanced graph reasoning
  • Multi-modal search (text + images)
  • ISO 27001 certification

Built with ❤️ for developers who value data sovereignty
GitHub • Enterprise • Documentation

About

RAG system combining Docling document processing with ChromaDB vector storage to power openclaw
