Learn to build production-grade RAG systems with local LLMs, vector search, and intelligent document processing
LocalRAG is a complete, production-ready starter template for building Retrieval-Augmented Generation (RAG) systems. It demonstrates how to capture knowledge from multiple sources, process documents intelligently, and serve context-aware AI responses - all running locally with full cost and model control.
This is not just a demo - it's a foundational architecture you can fork and adapt to any domain: customer support, legal document analysis, medical research, internal knowledge bases, or any use case requiring AI with grounded, factual responses.
LocalRAG demonstrates production RAG architecture patterns that work regardless of your tech stack. The key differentiators are:
1. Hybrid Search Architecture
- Combines vector similarity (semantic) with BM25 (keyword) search
- Configurable weighting between the two approaches
- Addresses the "lexical gap" problem where pure vector search fails on exact terms
- See implementation:
services/knowledge/services/search.py
2. Intelligent Document Processing Pipeline
- Async worker-based architecture separates ingestion from serving
- Configurable chunking strategies (fixed-size, sentence-aware, section-based)
- Preserves context across chunk boundaries with overlap
- Batch embedding generation with caching
- See implementation:
services/knowledge/services/chunker.py
3. Model and Provider Agnostic
- OpenAI-compatible API interface means you can swap in any provider
- Local Ollama, OpenAI, Anthropic, or your own hosted models
- Same codebase works with 3B parameter models or 70B+
- Embedding models are similarly swappable
- No vendor lock-in at any layer
4. Observable and Measurable
- Every request traced through the full pipeline
- Measure retrieval quality (precision@k, recall@k)
- Track inference costs per query
- A/B test chunking strategies or models
- See: Jaeger traces, Prometheus metrics
5. Separation of Concerns
- Knowledge Service: Document processing + search (plug into your existing API)
- Inference Service: LLM + embeddings (swap for your provider)
- API Service: Orchestration (integrate with your auth system)
- Workers: Async processing (scale independently)
Each service is independently deployable and replaceable.
You don't need to use everything. Pick what you need:
Just need document processing?
- Use Knowledge Service standalone
- Provides /parse, /chunk, /embed, and /search endpoints
- Plug into your existing application
Just need hybrid search?
- Reuse the search implementation
- Works with any pgvector database
- Combine with your existing vector embeddings
Just need Ollama integration?
- Use the Inference Service setup
- OpenAI-compatible API for any client
- Includes model management and GPU optimization
Need reference architecture?
- Study the service boundaries and communication patterns
- Apply microservices design to your RAG system
- Adapt observability setup to your monitoring stack
| Comparison | LocalRAG | Alternative |
|---|---|---|
| vs LangChain/LlamaIndex | Complete production system (API, auth, workers, observability) | Libraries only - infrastructure not included |
| vs Managed Services (Pinecone, Weaviate) | Fixed cost, data on your infrastructure, portable | Per-vector pricing, vendor lock-in, black box |
| vs OpenAI Assistants | Any model, fixed cost, local privacy | OpenAI only, per-token pricing, data sent to OpenAI |
| vs RAG Frameworks (Haystack, txtai) | Full stack with UI, auth, deployment | Backend library only, manual setup |
Key differentiator: LocalRAG = RAG library + production infrastructure + deployment tooling in one package
┌─────────────────────────────────────────────────────────────────┐
│ Frontend (Next.js + TypeScript) │
│ http://localhost:3000 │
│ • User interface for document upload & chat │
│ • Real-time streaming responses │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────▼────────────────────────────────────┐
│ API Gateway (FastAPI) │
│ http://localhost:8080 │
│ • REST API for all operations │
│ • JWT authentication & user management │
│ • Request orchestration & response streaming │
└─────┬─────────────────────┬────────────────────────┬───────────┘
│ │ │
│ │ │
┌─────▼──────────────┐ ┌────▼─────────────┐ ┌──────▼────────────┐
│ Knowledge Service │ │ Inference Service│ │ Worker Service │
│ (Port 8081) │ │ (Port 8000) │ │ (Background) │
│ │ │ │ │ │
│ • Document parsing │ │ • Ollama LLM │ │ • Async jobs │
│ • Chunking logic │ │ • Embeddings │ │ • Batch processing│
│ • Embedding gen │ │ • OpenAI API │ │ • Document queue │
│ • Hybrid search │ │ compatible │ │ • Celery workers │
│ • BM25 + vectors │ │ │ │ │
└─────┬──────────────┘ └──────────────────┘ └───────────────────┘
│
│
┌─────▼──────────────────────────────────────────────────────────┐
│ Data Layer │
│ │
│ PostgreSQL + pgvector Redis Cache MinIO (S3) Jaeger │
│ • Vector storage • Session • File • Tracing │
│ • Full-text search • Queue • Storage • Metrics │
│ • User data • Cache • Objects │
└─────────────────────────────────────────────────────────────────┘
User Upload → Parser → Chunker → Embedder → Vector DB → Search Index
                ↓         ↓         ↓           ↓            ↓
             Extract    Split     Generate    Store in   Build BM25
              text    into 512    vectors     pgvector      index
                       chunks
1. Parser (services/knowledge/services/parser.py)
- Handles PDF, DOCX, TXT, MD, HTML, JSON
- Extracts text + metadata (title, author, date)
- Preserves document structure where possible
2. Chunker (services/knowledge/services/chunker.py)
- Splits documents into semantic chunks
- Default: 512 tokens with 64 token overlap
- Preserves context across chunk boundaries
- Configurable chunk size and overlap
3. Embedder (services/knowledge/services/embedder.py)
- Generates vector embeddings locally
- Uses inference service (Ollama or API)
- Batch processing for efficiency
- Caches embeddings to avoid recomputation
4. Search (services/knowledge/services/search.py)
- Hybrid search: Combines semantic + keyword
- Vector similarity (pgvector cosine distance)
- BM25 full-text search for exact matches
- Configurable weighting (default 50/50)
- Returns top-k chunks with scores and metadata
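The fusion step is conceptually simple: normalize both score sets, then take a weighted sum controlled by the BM25 weight. The sketch below is illustrative only; the actual logic lives in services/knowledge/services/search.py and may differ in detail.

```python
# Minimal sketch of weighted hybrid score fusion (illustrative, not the exact
# implementation in services/knowledge/services/search.py).
from typing import Dict, List, Tuple

def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1] so vector and BM25 results are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_merge(
    vector_scores: Dict[str, float],   # chunk_id -> cosine similarity
    bm25_scores: Dict[str, float],     # chunk_id -> BM25 score
    bm25_weight: float = 0.5,          # same role as BM25_WEIGHT in .env
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    v = min_max_normalize(vector_scores)
    b = min_max_normalize(bm25_scores)
    combined = {
        chunk_id: (1 - bm25_weight) * v.get(chunk_id, 0.0)
                  + bm25_weight * b.get(chunk_id, 0.0)
        for chunk_id in set(v) | set(b)
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```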
User Question
↓
[1] Embed query → vector
↓
[2] Hybrid Search (Vector + BM25)
↓
[3] Retrieve top 10 chunks
↓
[4] Build context prompt
↓
[5] Send to LLM (Ollama/OpenAI/Claude)
↓
[6] Stream response with citations
↓
User sees answer + sources
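The same flow, condensed into a single function. This is a sketch, not the API service's actual orchestration code; the embed, search, and LLM calls are passed in as callables so it stays independent of any particular service.

```python
# Sketch of the query flow above. The embed/search/LLM calls are injected as
# callables; streaming and error handling are omitted for brevity.
from typing import Callable, Dict, List

def answer_question(
    question: str,
    embed: Callable[[str], List[float]],                    # [1] query -> vector
    search: Callable[[str, List[float], int], List[Dict]],  # [2]+[3] hybrid retrieval
    complete: Callable[[str], str],                          # [5] LLM call
    top_k: int = 10,
) -> Dict:
    query_vector = embed(question)
    chunks = search(question, query_vector, top_k)
    context = "\n\n".join(c["text"] for c in chunks)         # [4] build context prompt
    prompt = (
        "Answer the question using only the context below, and cite sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    answer = complete(prompt)                                 # [6] answer + citations
    return {
        "response": answer,
        "citations": [
            {"chunk_id": c.get("id"), "score": c.get("score"), "source": c.get("source")}
            for c in chunks
        ],
    }
```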
- Docker Desktop (Windows/Mac) or Docker + Docker Compose (Linux)
- 8GB+ RAM (16GB recommended for better performance)
- 10GB free disk space (for model storage)
Windows:
start.bat cpu
Mac/Linux:
chmod +x start.sh
./start.sh cpu
Manual start:
docker-compose -f docker-compose.simple.yml --profile cpu up -d
On first startup, Ollama downloads Llama 3.2 3B (~2GB). This takes 2-5 minutes.
Monitor progress:
docker logs localrag-inference-cpu -f
- Frontend: http://localhost:3000
- API Docs: http://localhost:8080/docs (interactive Swagger UI)
- Health Check: http://localhost:8080/health
Default login:
- Email: [email protected]
- Password: admin123
| Service | Port | Purpose | Technology |
|---|---|---|---|
| Frontend | 3000 | Web UI | Next.js 15, React 19, TypeScript |
| API Gateway | 8080 | REST API | FastAPI, Pydantic, SQLAlchemy |
| Knowledge Service | 8081 | RAG pipeline | Document parsing, chunking, search |
| Inference | 8000/11434 | LLM + embeddings | Ollama, LiteLLM proxy |
| Workers | - | Background jobs | Celery, async processing |
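Because the Inference service is OpenAI-compatible, any standard OpenAI client can talk to it. A minimal sketch using the openai Python package; the base_url path and the "no API key needed locally" placeholder are assumptions about the default setup, so adjust them to match your deployment and keep the model name in sync with INFERENCE_MODEL.

```python
# Point the standard openai client at the local inference service.
# base_url assumes the OpenAI-compatible endpoint on port 8000 from the table
# above; adjust base_url/api_key/model to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="llama3.2:3b",  # must match INFERENCE_MODEL in .env
    messages=[{"role": "user", "content": "Summarize what RAG is in one sentence."}],
)
print(resp.choices[0].message.content)
```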
| Component | Purpose | Why It Matters |
|---|---|---|
| PostgreSQL + pgvector | Vector storage & search | Production-grade vector DB, scales to billions of vectors |
| Redis | Cache + message queue | Fast session storage, Celery broker |
| MinIO | Object storage | S3-compatible file storage (dev), swap to real S3 in prod |
| Jaeger | Distributed tracing | Debug RAG pipeline, find bottlenecks |
| Prometheus | Metrics collection | Monitor performance, cost per query |
| Grafana | Visualization | Custom dashboards for RAG metrics |
Copy .env.example to .env and customize:
cp .env.example .env
Key configurations:
# LLM Model (Ollama format)
INFERENCE_MODEL=llama3.2:3b
# Options: llama3.2:1b (fast), llama3.2:3b (balanced),
# llama3.1:8b (quality), qwen2.5:7b (multilingual)
# Chunking Strategy
CHUNK_SIZE=512 # Tokens per chunk
CHUNK_OVERLAP=64 # Overlap between chunks
# Retrieval Configuration
RETRIEVAL_TOP_K=50 # Chunks to retrieve
RERANK_TOP_K=10 # Top chunks after reranking
BM25_WEIGHT=0.5 # Keyword (BM25) share of the hybrid score; 0.5 = equal weight with vector search
# Optional: External LLM APIs (if not using local)
OPENAI_API_KEY= # OpenAI
ANTHROPIC_API_KEY= # Claude
GOOGLE_API_KEY= # Gemini
Local (Ollama):
# Pull different model
docker exec localrag-inference-cpu ollama pull mistral:7b
# Update .env
INFERENCE_MODEL=mistral:7b
# Restart services
docker-compose -f docker-compose.simple.yml --profile cpu restart
Cloud API (OpenAI/Anthropic):
# Set API key in .env
OPENAI_API_KEY=sk-...
# API service automatically falls back to OpenAI if the local model is unavailable
Edit services/knowledge/services/chunker.py:
# Current: Fixed-size chunks
def chunk_document(text: str, chunk_size=512, overlap=64):
    # ...

# Custom: Sentence-aware chunking
from nltk.tokenize import sent_tokenize  # or any sentence splitter

def chunk_by_sentences(text: str, min_chunk=256, max_chunk=1024):
    sentences = sent_tokenize(text)
    # Group sentences intelligently

# Custom: Section-based chunking (for structured docs)
def chunk_by_headers(text: str):
    # Split on markdown headers, preserve hierarchy
docker-compose -f docker-compose.simple.yml --profile cpu up -d
Best for:
- Development & learning
- Small-scale deployments (< 50 users)
- Budget constraints
Performance:
- Response time: 2-5 seconds
- RAM: 8GB minimum
- Model: Llama 3.2 3B (2GB download)
docker-compose -f docker-compose.simple.yml --profile gpu up -d
Best for:
- Production deployments
- Real-time responses
- Larger models (7B-14B parameters)
Performance:
- Response time: < 1 second
- VRAM: 8GB+ (3B), 24GB+ (7B)
- Model: Any size supported by your GPU
docker-compose -f docker-compose.simple.yml --profile full up -d
Includes:
- Jaeger UI: http://localhost:16686 (trace requests)
- Grafana: http://localhost:3001 (dashboards)
- Prometheus: http://localhost:9090 (metrics)
Best for:
- Debugging RAG pipeline
- Performance optimization
- Production monitoring
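If you add your own pipeline steps, you can emit custom spans alongside the built-in traces. A minimal OpenTelemetry sketch (requires opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http; the OTLP endpoint below assumes Jaeger's default collector port and is not taken from this repo's config, so adjust it to your deployment):

```python
# Emit a custom span for a retrieval step and export it via OTLP.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("rag.pipeline")

with tracer.start_as_current_span("hybrid_search") as span:
    span.set_attribute("retrieval.top_k", 10)
    # ... run retrieval here and record timings/attributes as needed ...
```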
Via API:
curl -X POST http://localhost:8080/admin/documents \
-H "Authorization: Bearer YOUR_JWT_TOKEN" \
-F "[email protected]" \
-F "title=My Document" \
-F "metadata={\"category\":\"research\"}"What happens:
- Document queued for processing
- Worker extracts text (async)
- Text chunked into 512-token segments
- Each chunk embedded into vectors
- Stored in PostgreSQL + pgvector
- BM25 index updated
- Available for RAG queries immediately
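Step 3 above (chunking into 512-token segments) boils down to a sliding window with overlap, as in this simplified sketch; the real chunker in services/knowledge/services/chunker.py uses a proper tokenizer rather than whitespace-separated words.

```python
# Minimal fixed-size chunking with overlap, approximating tokens by whitespace
# words. Illustrative only; see services/knowledge/services/chunker.py for the
# actual implementation.
from typing import List

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    words = text.split()
    step = chunk_size - overlap          # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break                        # last window reached the end of the text
    return chunks
```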
Supported formats:
- PDF (.pdf)
- Word (.docx)
- Text (.txt, .md)
- Web pages (URL scraping)
Via API:
curl -X POST http://localhost:8080/chat \
-H "Authorization: Bearer YOUR_JWT_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"message": "What are the key findings?",
"thread_id": "optional-thread-id"
}'
Response includes:
{
  "response": "Based on the documents, the key findings are...",
  "citations": [
    {
      "doc_id": "uuid",
      "chunk_id": "uuid",
      "text": "...relevant excerpt...",
      "score": 0.89,
      "source": "document.pdf"
    }
  ],
  "metadata": {
    "chunks_retrieved": 10,
    "search_time_ms": 45,
    "llm_time_ms": 1250
  }
}
- Customer Support: Support docs, FAQs → consistent, cited answers for agents
- Legal Research: Contracts, case law → find relevant precedents quickly
- Internal Wiki: Policies, procedures → reduce HR/IT support load
- Academic Research: Papers, textbooks → synthesize information across sources
- Medical Documentation: Clinical guidelines → evidence-based recommendations
The included "Career Mentor" is one example - the architecture works for any domain.
- Deploy with start.bat cpu
- Upload a PDF document
- Ask questions and see citations
- Review services/knowledge/services/search.py to see the hybrid search
- Check docker-compose.simple.yml to understand the service connections
- Modify chunk size in .env (try 256, 512, 1024)
- Adjust BM25_WEIGHT (try 0.3, 0.5, 0.7)
- Swap Ollama model (llama vs qwen vs mistral)
- Add new document parser (Excel, CSV)
- Implement custom reranking logic
- Enable GPU mode for performance
- Set up observability (Jaeger + Prometheus)
- Implement fine-tuning pipeline for embeddings
- Add custom evaluation metrics
- Deploy to Kubernetes with autoscaling
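For the custom reranking item above, a common pattern is cross-encoder reranking: retrieve broadly (RETRIEVAL_TOP_K), then rescore query/chunk pairs and keep RERANK_TOP_K. A minimal sketch assuming the sentence-transformers package is available; the model name is a common public checkpoint, not something this repo ships.

```python
# Cross-encoder reranking sketch: rescore retrieved chunks against the query.
# Assumes sentence-transformers is installed; the model name is illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[dict], top_k: int = 10) -> list[dict]:
    pairs = [(query, c["text"]) for c in chunks]
    scores = reranker.predict(pairs)             # higher = more relevant
    for chunk, score in zip(chunks, scores):
        chunk["rerank_score"] = float(score)
    return sorted(chunks, key=lambda c: c["rerank_score"], reverse=True)[:top_k]
```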
1. Data Collection Built-In
- User queries logged with retrieved chunks
- Track which chunks lead to good answers
- Export for fine-tuning dataset
2. Evaluation Framework Ready
- Measure retrieval accuracy (top-k)
- Track answer quality (user feedback)
- A/B test different models/prompts
3. Model Flexibility
- Swap embedding models easily
- Test different LLMs side-by-side
- Measure cost vs quality tradeoffs
4. Production Pipeline
- Same code from dev → staging → prod
- Database migrations included
- Monitoring built-in
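The retrieval metrics from point 2 reduce to simple set arithmetic once you have a labeled evaluation set (query → relevant chunk IDs). A minimal sketch:

```python
# precision@k and recall@k for retrieval evaluation.
# retrieved: ranked chunk IDs returned by search; relevant: labeled ground truth.
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for c in top if c in relevant) / len(top)

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return sum(1 for c in retrieved[:k] if c in relevant) / len(relevant)

# Example: 2 of the top 5 results are relevant, out of 3 relevant chunks total.
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "d", "z"}, k=5))  # 0.4
print(recall_at_k(["a", "b", "c", "d", "e"], {"a", "d", "z"}, k=5))     # 0.67
```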
# 1. Collect training data (queries + relevant docs)
# services/knowledge/services/embedder.py
# 2. Export to training format
SELECT query, relevant_chunk, user_feedback
FROM query_logs WHERE feedback > 4;
# 3. Fine-tune embedding model (sentence-transformers)
from sentence_transformers import SentenceTransformer, losses
model = SentenceTransformer('base-model')
train_loss = losses.MultipleNegativesRankingLoss(model)
# Train on your domain-specific (query, relevant_chunk) pairs
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
# 4. Deploy custom model
# Update embedder.py to use your model
- CPU mode: 10-50 concurrent users, 2-5s response time
- GPU mode: 100+ concurrent users, <1s response time
# Add more workers
docker-compose up --scale worker=5
# Load balance API instances
# (nginx/Caddy in front of multiple API containers)
# Shared infrastructure
# PostgreSQL, Redis, MinIO can be managed services (AWS RDS, ElastiCache, S3)
Local deployment (CPU):
- Hardware: $0 (use existing)
- Electricity: ~$10-30/month
- Total: ~$30/month
Cloud deployment (AWS example):
- EC2 g5.2xlarge (GPU): ~$1.20/hr = ~$864/month
- RDS PostgreSQL: ~$50/month
- Total: ~$900/month
Hybrid (local LLM + cloud infrastructure):
- Local GPU server: ~$30/month electricity
- Cloud DB/storage: ~$100/month
- Total: ~$130/month
Compare to:
- OpenAI API (1M tokens): ~$30-60
- Anthropic Claude (1M tokens): ~$15-75
- Heavy usage (100M tokens/month): $1,500-7,500/month (100M tokens × $15-75 per 1M)
- Local-first: Documents never leave your infrastructure (unless you configure external APIs)
- No telemetry: Zero tracking or phone-home
- Audit trail: All queries logged locally
- Change default passwords (POSTGRES_PASSWORD, MINIO_PASSWORD)
- Generate a strong JWT secret: openssl rand -hex 32
- Enable HTTPS (nginx/Caddy reverse proxy)
- Set up database backups (pg_dump automation)
- Configure firewall rules
- Enable rate limiting on API
- Review CORS settings
- Set up monitoring alerts
✅ Safe to commit: .env.example, code, configs with env var substitution
❌ Never commit: .env, API keys, database backups, user data
This is a learning resource and starter template. Contributions welcome:
- Document format parsers: Add support for new file types
- Search algorithms: Implement MMR, ColBERT, etc.
- Evaluation tools: Add RAG metrics (faithfulness, relevance)
- Deployment guides: Kubernetes, AWS, GCP examples
- Fine-tuning examples: Show domain adaptation
See CONTRIBUTING.md for guidelines.
MIT License - free for commercial and personal use. See LICENSE.
Core RAG Stack:
- Ollama - Local LLM runtime
- pgvector - PostgreSQL vector extension
- LangChain - LLM orchestration (optional, not required)
Backend:
- FastAPI - Modern Python web framework
- SQLAlchemy - ORM & database toolkit
- Celery - Distributed task queue
Frontend:
- Next.js - React framework
- TypeScript - Type safety
- Tailwind CSS - Styling
Infrastructure:
- PostgreSQL - Primary database
- Redis - Cache & queue
- MinIO - S3-compatible storage
- Jaeger - Distributed tracing
Q: Do I need a GPU? A: No. CPU mode works fine for learning and development. GPU recommended for production.
Q: Can I use this commercially? A: Yes, MIT license allows commercial use.
Q: How do I add support for [file format]?
A: Add parser in services/knowledge/services/parser.py, see existing PDF/DOCX parsers.
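For example, a CSV parser could look like the sketch below; the function name and return shape here are assumptions for illustration, so mirror whatever interface the existing PDF/DOCX parsers actually use.

```python
# Hypothetical CSV parser sketch; match the real parser interface in
# services/knowledge/services/parser.py (names and return shape assumed here).
import csv
from pathlib import Path

def parse_csv(path: str) -> dict:
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            rows.append(", ".join(cell.strip() for cell in row))
    return {
        "text": "\n".join(rows),  # flattened text, ready for chunking
        "metadata": {"title": Path(path).stem, "format": "csv"},
    }
```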
Q: Can I use OpenAI instead of Ollama?
A: Yes, set OPENAI_API_KEY in .env. System falls back automatically.
Q: How accurate is the RAG system? A: Depends on your documents, chunk size, and model. Start with defaults, measure, iterate.
Q: Can I deploy this to Kubernetes? A: Yes, services are containerized. Create k8s manifests or Helm chart.
Q: What's the vector database performance? A: pgvector handles millions of vectors. For billions, consider Qdrant, Milvus, or Weaviate.
Q: How do I evaluate RAG quality? A: Track metrics: retrieval accuracy (top-k), answer faithfulness, user feedback. Build eval set.
Ready to learn RAG? Start with ./start.sh cpu and build your first knowledge system in 5 minutes.
Questions? Open an issue or discussion on GitHub.
Built something cool? Share it! Tag your fork with #LocalRAG.