Learn to build production-grade RAG systems with local LLMs, vector search, and intelligent document processing
LocalRAG is a complete, production-ready starter template for building Retrieval-Augmented Generation (RAG) systems. It demonstrates how to capture knowledge from multiple sources, process documents intelligently, and serve context-aware AI responses - all running locally with full cost and model control.
This is not just a demo - it's a foundational architecture you can fork and adapt to any domain: customer support, legal document analysis, medical research, internal knowledge bases, or any use case requiring AI with grounded, factual responses.
LocalRAG demonstrates production RAG architecture patterns that work regardless of your tech stack. The key differentiators are:
1. Hybrid Search Architecture
- Combines vector similarity (semantic) with BM25 (keyword) search
- Configurable weighting between the two approaches
- Addresses the "lexical gap" problem where pure vector search fails on exact terms
- See implementation:
services/knowledge/services/search.py
2. Intelligent Document Processing Pipeline
- Async worker-based architecture separates ingestion from serving
- Configurable chunking strategies (fixed-size, sentence-aware, section-based)
- Preserves context across chunk boundaries with overlap
- Batch embedding generation with caching
- See implementation:
services/knowledge/services/chunker.py
3. Model and Provider Agnostic
- OpenAI-compatible API interface means you can swap in any provider
- Local Ollama, OpenAI, Anthropic, or your own hosted models
- Same codebase works with 3B parameter models or 70B+
- Embedding models are similarly swappable
- No vendor lock-in at any layer
4. Observable and Measurable
- Every request traced through the full pipeline
- Measure retrieval quality (precision@k, recall@k)
- Track inference costs per query
- A/B test chunking strategies or models
- See: Jaeger traces, Prometheus metrics
5. Separation of Concerns
- Knowledge Service: Document processing + search (plug into your existing API)
- Inference Service: LLM + embeddings (swap for your provider)
- API Service: Orchestration (integrate with your auth system)
- Workers: Async processing (scale independently)
Each service is independently deployable and replaceable.
You don't need to use everything. Pick what you need:
Just need document processing?
- Use Knowledge Service standalone
- Provides /parse, /chunk, /embed, and /search endpoints
- Plug into your existing application
Just need hybrid search?
- Reuse the search implementation
- Works with any pgvector database
- Combine with your existing vector embeddings
Just need Ollama integration?
- Use the Inference Service setup
- OpenAI-compatible API for any client
- Includes model management and GPU optimization
Need reference architecture?
- Study the service boundaries and communication patterns
- Apply microservices design to your RAG system
- Adapt observability setup to your monitoring stack
| Comparison | LocalRAG | Alternative |
|---|---|---|
| vs LangChain/LlamaIndex | Complete production system (API, auth, workers, observability) | Libraries only - infrastructure not included |
| vs Managed Services (Pinecone, Weaviate) | Fixed cost, data on your infrastructure, portable | Per-vector pricing, vendor lock-in, black box |
| vs OpenAI Assistants | Any model, fixed cost, local privacy | OpenAI only, per-token pricing, data sent to OpenAI |
| vs RAG Frameworks (Haystack, txtai) | Full stack with UI, auth, deployment | Backend library only, manual setup |
Key differentiator: LocalRAG = RAG library + production infrastructure + deployment tooling in one package
┌─────────────────────────────────────────────────────────────────┐
│ Frontend (Next.js + TypeScript) │
│ http://localhost:3000 │
│ • User interface for document upload & chat │
│ • Real-time streaming responses │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────▼────────────────────────────────────┐
│ API Gateway (FastAPI) │
│ http://localhost:8080 │
│ • REST API for all operations │
│ • JWT authentication & user management │
│ • Request orchestration & response streaming │
└─────┬─────────────────────┬────────────────────────┬───────────┘
│ │ │
│ │ │
┌─────▼──────────────┐ ┌────▼─────────────┐ ┌──────▼────────────┐
│ Knowledge Service │ │ Inference Service│ │ Worker Service │
│ (Port 8081) │ │ (Port 8000) │ │ (Background) │
│ │ │ │ │ │
│ • Document parsing │ │ • Ollama LLM │ │ • Async jobs │
│ • Chunking logic │ │ • Embeddings │ │ • Batch processing│
│ • Embedding gen │ │ • OpenAI API │ │ • Document queue │
│ • Hybrid search │ │ compatible │ │ • Celery workers │
│ • BM25 + vectors │ │ │ │ │
└─────┬──────────────┘ └──────────────────┘ └───────────────────┘
│
│
┌─────▼──────────────────────────────────────────────────────────┐
│ Data Layer │
│ │
│ PostgreSQL + pgvector Redis Cache MinIO (S3) Jaeger │
│ • Vector storage • Session • File • Tracing │
│ • Full-text search • Queue • Storage • Metrics │
│ • User data • Cache • Objects │
└─────────────────────────────────────────────────────────────────┘
User Upload → Parser → Chunker → Embedder → Vector DB → Search Index
                ↓         ↓         ↓           ↓            ↓
             Extract    Split     Generate    Store in   Build BM25
              text    into 512    vectors     pgvector      index
                       chunks
1. Parser (services/knowledge/services/parser.py)
- Handles PDF, DOCX, TXT, MD, HTML, JSON
- Extracts text + metadata (title, author, date)
- Preserves document structure where possible
2. Chunker (services/knowledge/services/chunker.py)
- Splits documents into semantic chunks
- Default: 512 tokens with 64 token overlap
- Preserves context across chunk boundaries
- Configurable chunk size and overlap
3. Embedder (services/knowledge/services/embedder.py)
- Generates vector embeddings locally
- Uses inference service (Ollama or API)
- Batch processing for efficiency
- Caches embeddings to avoid recomputation
4. Search (services/knowledge/services/search.py)
- Hybrid search: Combines semantic + keyword
- Vector similarity (pgvector cosine distance)
- BM25 full-text search for exact matches
- Configurable weighting (default 50/50)
- Returns top-k chunks with scores and metadata
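The fusion step is conceptually simple: normalize both score sets, then take a weighted sum controlled by the BM25 weight. The sketch below is illustrative only; the actual logic lives in services/knowledge/services/search.py and may differ in detail.

```python
# Minimal sketch of weighted hybrid score fusion (illustrative, not the exact
# implementation in services/knowledge/services/search.py).
from typing import Dict, List, Tuple

def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1] so vector and BM25 results are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_merge(
    vector_scores: Dict[str, float],   # chunk_id -> cosine similarity
    bm25_scores: Dict[str, float],     # chunk_id -> BM25 score
    bm25_weight: float = 0.5,          # same role as BM25_WEIGHT in .env
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    v = min_max_normalize(vector_scores)
    b = min_max_normalize(bm25_scores)
    combined = {
        chunk_id: (1 - bm25_weight) * v.get(chunk_id, 0.0)
                  + bm25_weight * b.get(chunk_id, 0.0)
        for chunk_id in set(v) | set(b)
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```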
User Question
↓
[1] Embed query → vector
↓
[2] Hybrid Search (Vector + BM25)
↓
[3] Retrieve top 10 chunks
↓
[4] Build context prompt
↓
[5] Send to LLM (Ollama/OpenAI/Claude)
↓
[6] Stream response with citations
↓
User sees answer + sources
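The same flow, condensed into a single function. This is a sketch, not the API service's actual orchestration code; the embed, search, and LLM calls are passed in as callables so it stays independent of any particular service.

```python
# Sketch of the query flow above. The embed/search/LLM calls are injected as
# callables; streaming and error handling are omitted for brevity.
from typing import Callable, Dict, List

def answer_question(
    question: str,
    embed: Callable[[str], List[float]],                    # [1] query -> vector
    search: Callable[[str, List[float], int], List[Dict]],  # [2]+[3] hybrid retrieval
    complete: Callable[[str], str],                          # [5] LLM call
    top_k: int = 10,
) -> Dict:
    query_vector = embed(question)
    chunks = search(question, query_vector, top_k)
    context = "\n\n".join(c["text"] for c in chunks)         # [4] build context prompt
    prompt = (
        "Answer the question using only the context below, and cite sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    answer = complete(prompt)                                 # [6] answer + citations
    return {
        "response": answer,
        "citations": [
            {"chunk_id": c.get("id"), "score": c.get("score"), "source": c.get("source")}
            for c in chunks
        ],
    }
```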
- Docker Desktop (Windows/Mac) or Docker + Docker Compose (Linux)
- 8GB+ RAM (16GB recommended for better performance)
- 10GB free disk space (for model storage)
Windows:
start.bat cpu
Mac/Linux:
chmod +x start.sh
./start.sh cpu
Manual start:
docker-compose -f docker-compose.simple.yml --profile cpu up -d
On first startup, Ollama downloads Llama 3.2 3B (~2GB). This takes 2-5 minutes.
Monitor progress:
docker logs localrag-inference-cpu -f
- Frontend: http://localhost:3000
- API Docs: http://localhost:8080/docs (interactive Swagger UI)
- Health Check: http://localhost:8080/health
Default login:
- Email: [email protected]
- Password: admin123
| Service | Port | Purpose | Technology |
|---|---|---|---|
| Frontend | 3000 | Web UI | Next.js 15, React 19, TypeScript |
| API Gateway | 8080 | REST API | FastAPI, Pydantic, SQLAlchemy |
| Knowledge Service | 8081 | RAG pipeline | Document parsing, chunking, search |
| Inference | 8000/11434 | LLM + embeddings | Ollama, LiteLLM proxy |
| Workers | - | Background jobs | Celery, async processing |
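Because the Inference service is OpenAI-compatible, any standard OpenAI client can talk to it. A minimal sketch using the openai Python package; the base_url path and the "no API key needed locally" placeholder are assumptions about the default setup, so adjust them to match your deployment and keep the model name in sync with INFERENCE_MODEL.

```python
# Point the standard openai client at the local inference service.
# base_url assumes the OpenAI-compatible endpoint on port 8000 from the table
# above; adjust base_url/api_key/model to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="llama3.2:3b",  # must match INFERENCE_MODEL in .env
    messages=[{"role": "user", "content": "Summarize what RAG is in one sentence."}],
)
print(resp.choices[0].message.content)
```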
| Component | Purpose | Why It Matters |
|---|---|---|
| PostgreSQL + pgvector | Vector storage & search | Production-grade vector DB, scales to billions of vectors |
| Redis | Cache + message queue | Fast session storage, Celery broker |
| MinIO | Object storage | S3-compatible file storage (dev), swap to real S3 in prod |
| Jaeger | Distributed tracing | Debug RAG pipeline, find bottlenecks |
| Prometheus | Metrics collection | Monitor performance, cost per query |
| Grafana | Visualization | Custom dashboards for RAG metrics |
Copy .env.example to .env and customize:
cp .env.example .env
Key configurations:
# LLM Model (Ollama format)
INFERENCE_MODEL=llama3.2:3b
# Options: llama3.2:1b (fast), llama3.2:3b (balanced),
# llama3.1:8b (quality), qwen2.5:7b (multilingual)
# Chunking Strategy
CHUNK_SIZE=512 # Tokens per chunk
CHUNK_OVERLAP=64 # Overlap between chunks
# Retrieval Configuration
RETRIEVAL_TOP_K=50 # Chunks to retrieve
RERANK_TOP_K=10 # Top chunks after reranking
BM25_WEIGHT=0.5 # Keyword (BM25) share of the hybrid score; 0.5 = equal weight with vector search
# Optional: External LLM APIs (if not using local)
OPENAI_API_KEY= # OpenAI
ANTHROPIC_API_KEY= # Claude
GOOGLE_API_KEY= # Gemini
Local (Ollama):
# Pull different model
docker exec localrag-inference-cpu ollama pull mistral:7b
# Update .env
INFERENCE_MODEL=mistral:7b
# Restart services
docker-compose -f docker-compose.simple.yml --profile cpu restart
Cloud API (OpenAI/Anthropic):
# Set API key in .env
OPENAI_API_KEY=sk-...
# API service automatically falls back to OpenAI if the local model is unavailable
Edit services/knowledge/services/chunker.py:
# Current: Fixed-size chunks
def chunk_document(text: str, chunk_size=512, overlap=64):
    # ...

# Custom: Sentence-aware chunking
from nltk.tokenize import sent_tokenize  # or any sentence splitter

def chunk_by_sentences(text: str, min_chunk=256, max_chunk=1024):
    sentences = sent_tokenize(text)
    # Group sentences intelligently

# Custom: Section-based chunking (for structured docs)
def chunk_by_headers(text: str):
    # Split on markdown headers, preserve hierarchy
docker-compose -f docker-compose.simple.yml --profile cpu up -d
Best for:
- Development & learning
- Small-scale deployments (< 50 users)
- Budget constraints
Performance:
- Response time: 2-5 seconds
- RAM: 8GB minimum
- Model: Llama 3.2 3B (2GB download)
docker-compose -f docker-compose.simple.yml --profile gpu up -d
Best for:
- Production deployments
- Real-time responses
- Larger models (7B-14B parameters)
Performance:
- Response time: < 1 second
- VRAM: 8GB+ (3B), 24GB+ (7B)
- Model: Any size supported by your GPU
docker-compose -f docker-compose.simple.yml --profile full up -d
Includes:
- Jaeger UI: http://localhost:16686 (trace requests)
- Grafana: http://localhost:3001 (dashboards)
- Prometheus: http://localhost:9090 (metrics)
Best for:
- Debugging RAG pipeline
- Performance optimization
- Production monitoring
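If you add your own pipeline steps, you can emit custom spans alongside the built-in traces. A minimal OpenTelemetry sketch (requires opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http; the OTLP endpoint below assumes Jaeger's default collector port and is not taken from this repo's config, so adjust it to your deployment):

```python
# Emit a custom span for a retrieval step and export it via OTLP.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("rag.pipeline")

with tracer.start_as_current_span("hybrid_search") as span:
    span.set_attribute("retrieval.top_k", 10)
    # ... run retrieval here and record timings/attributes as needed ...
```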
Via API:
curl -X POST http://localhost:8080/admin/documents \
-H "Authorization: Bearer YOUR_JWT_TOKEN" \
-F "[email protected]" \
-F "title=My Document" \
-F "metadata={\"category\":\"research\"}"What happens:
- Document queued for processing
- Worker extracts text (async)
- Text chunked into 512-token segments
- Each chunk embedded into vectors
- Stored in PostgreSQL + pgvector
- BM25 index updated
- Available for RAG queries immediately
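Step 3 above (chunking into 512-token segments) boils down to a sliding window with overlap, as in this simplified sketch; the real chunker in services/knowledge/services/chunker.py uses a proper tokenizer rather than whitespace-separated words.

```python
# Minimal fixed-size chunking with overlap, approximating tokens by whitespace
# words. Illustrative only; see services/knowledge/services/chunker.py for the
# actual implementation.
from typing import List

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    words = text.split()
    step = chunk_size - overlap          # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break                        # last window reached the end of the text
    return chunks
```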
Supported formats:
- PDF (.pdf)
- Word (.docx)
- Text (.txt, .md)
- Web pages (URL scraping)
Via API:
curl -X POST http://localhost:8080/chat \
-H "Authorization: Bearer YOUR_JWT_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"message": "What are the key findings?",
"thread_id": "optional-thread-id"
}'
Response includes:
{
  "response": "Based on the documents, the key findings are...",
  "citations": [
    {
      "doc_id": "uuid",
      "chunk_id": "uuid",
      "text": "...relevant excerpt...",
      "score": 0.89,
      "source": "document.pdf"
    }
  ],
  "metadata": {
    "chunks_retrieved": 10,
    "search_time_ms": 45,
    "llm_time_ms": 1250
  }
}
- Customer Support: Support docs, FAQs → consistent, cited answers for agents
- Legal Research: Contracts, case law → find relevant precedents quickly
- Internal Wiki: Policies, procedures → reduce HR/IT support load
- Academic Research: Papers, textbooks → synthesize information across sources
- Medical Documentation: Clinical guidelines → evidence-based recommendations
The included "Career Mentor" is one example - the architecture works for any domain.
- Deploy with start.bat cpu
- Upload a PDF document
- Ask questions and see citations
- Review services/knowledge/services/search.py to see the hybrid search
- Check docker-compose.simple.yml to understand the service connections
- Modify chunk size in .env (try 256, 512, 1024)
- Adjust BM25_WEIGHT (try 0.3, 0.5, 0.7)
- Swap Ollama model (llama vs qwen vs mistral)
- Add new document parser (Excel, CSV)
- Implement custom reranking logic
- Enable GPU mode for performance
- Set up observability (Jaeger + Prometheus)
- Implement fine-tuning pipeline for embeddings
- Add custom evaluation metrics
- Deploy to Kubernetes with autoscaling
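For the custom reranking item above, a common pattern is cross-encoder reranking: retrieve broadly (RETRIEVAL_TOP_K), then rescore query/chunk pairs and keep RERANK_TOP_K. A minimal sketch assuming the sentence-transformers package is available; the model name is a common public checkpoint, not something this repo ships.

```python
# Cross-encoder reranking sketch: rescore retrieved chunks against the query.
# Assumes sentence-transformers is installed; the model name is illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[dict], top_k: int = 10) -> list[dict]:
    pairs = [(query, c["text"]) for c in chunks]
    scores = reranker.predict(pairs)             # higher = more relevant
    for chunk, score in zip(chunks, scores):
        chunk["rerank_score"] = float(score)
    return sorted(chunks, key=lambda c: c["rerank_score"], reverse=True)[:top_k]
```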
1. Data Collection Built-In
- User queries logged with retrieved chunks
- Track which chunks lead to good answers
- Export for fine-tuning dataset
2. Evaluation Framework Ready
- Measure retrieval accuracy (top-k)
- Track answer quality (user feedback)
- A/B test different models/prompts
3. Model Flexibility
- Swap embedding models easily
- Test different LLMs side-by-side
- Measure cost vs quality tradeoffs
4. Production Pipeline
- Same code from dev → staging → prod
- Database migrations included
- Monitoring built-in
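The retrieval metrics from point 2 reduce to simple set arithmetic once you have a labeled evaluation set (query → relevant chunk IDs). A minimal sketch:

```python
# precision@k and recall@k for retrieval evaluation.
# retrieved: ranked chunk IDs returned by search; relevant: labeled ground truth.
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for c in top if c in relevant) / len(top)

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return sum(1 for c in retrieved[:k] if c in relevant) / len(relevant)

# Example: 2 of the top 5 results are relevant, out of 3 relevant chunks total.
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "d", "z"}, k=5))  # 0.4
print(recall_at_k(["a", "b", "c", "d", "e"], {"a", "d", "z"}, k=5))     # 0.67
```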
# 1. Collect training data (queries + relevant docs)
# services/knowledge/services/embedder.py
# 2. Export to training format
SELECT query, relevant_chunk, user_feedback
FROM query_logs WHERE feedback > 4;
# 3. Fine-tune embedding model (sentence-transformers)
from sentence_transformers import SentenceTransformer, losses
model = SentenceTransformer('base-model')
train_loss = losses.MultipleNegativesRankingLoss(model)
# Train on your domain-specific (query, relevant_chunk) pairs
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
# 4. Deploy custom model
# Update embedder.py to use your model
- CPU mode: 10-50 concurrent users, 2-5s response time
- GPU mode: 100+ concurrent users, <1s response time
# Add more workers
docker-compose up --scale worker=5
# Load balance API instances
# (nginx/Caddy in front of multiple API containers)
# Shared infrastructure
# PostgreSQL, Redis, MinIO can be managed services (AWS RDS, ElastiCache, S3)
Local deployment (CPU):
- Hardware: $0 (use existing)
- Electricity: ~$10-30/month
- Total: ~$30/month
Cloud deployment (AWS example):
- EC2 g5.2xlarge (GPU): ~$1.20/hr = ~$864/month
- RDS PostgreSQL: ~$50/month
- Total: ~$900/month
Hybrid (local LLM + cloud infrastructure):
- Local GPU server: ~$30/month electricity
- Cloud DB/storage: ~$100/month
- Total: ~$130/month
Compare to:
- OpenAI API (1M tokens): ~$30-60
- Anthropic Claude (1M tokens): ~$15-75
- Heavy usage (100M tokens/month): $1,500-7,500/month (100M tokens × $15-75 per 1M)
- Local-first: Documents never leave your infrastructure (unless you configure external APIs)
- No telemetry: Zero tracking or phone-home
- Audit trail: All queries logged locally
- Change default passwords (POSTGRES_PASSWORD, MINIO_PASSWORD)
- Generate a strong JWT secret: openssl rand -hex 32
- Enable HTTPS (nginx/Caddy reverse proxy)
- Set up database backups (pg_dump automation)
- Configure firewall rules
- Enable rate limiting on API
- Review CORS settings
- Set up monitoring alerts
✅ Safe to commit: .env.example, code, configs with env var substitution
❌ Never commit: .env, API keys, database backups, user data
This is a learning resource and starter template. Contributions welcome:
- Document format parsers: Add support for new file types
- Search algorithms: Implement MMR, ColBERT, etc.
- Evaluation tools: Add RAG metrics (faithfulness, relevance)
- Deployment guides: Kubernetes, AWS, GCP examples
- Fine-tuning examples: Show domain adaptation
See CONTRIBUTING.md for guidelines.
MIT License - free for commercial and personal use. See LICENSE.
Core RAG Stack:
- Ollama - Local LLM runtime
- pgvector - PostgreSQL vector extension
- LangChain - LLM orchestration (optional, not required)
Backend:
- FastAPI - Modern Python web framework
- SQLAlchemy - ORM & database toolkit
- Celery - Distributed task queue
Frontend:
- Next.js - React framework
- TypeScript - Type safety
- Tailwind CSS - Styling
Infrastructure:
- PostgreSQL - Primary database
- Redis - Cache & queue
- MinIO - S3-compatible storage
- Jaeger - Distributed tracing
Q: Do I need a GPU? A: No. CPU mode works fine for learning and development. GPU recommended for production.
Q: Can I use this commercially? A: Yes, MIT license allows commercial use.
Q: How do I add support for [file format]?
A: Add parser in services/knowledge/services/parser.py, see existing PDF/DOCX parsers.
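For example, a CSV parser could look like the sketch below; the function name and return shape here are assumptions for illustration, so mirror whatever interface the existing PDF/DOCX parsers actually use.

```python
# Hypothetical CSV parser sketch; match the real parser interface in
# services/knowledge/services/parser.py (names and return shape assumed here).
import csv
from pathlib import Path

def parse_csv(path: str) -> dict:
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            rows.append(", ".join(cell.strip() for cell in row))
    return {
        "text": "\n".join(rows),  # flattened text, ready for chunking
        "metadata": {"title": Path(path).stem, "format": "csv"},
    }
```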
Q: Can I use OpenAI instead of Ollama?
A: Yes, set OPENAI_API_KEY in .env. System falls back automatically.
Q: How accurate is the RAG system? A: Depends on your documents, chunk size, and model. Start with defaults, measure, iterate.
Q: Can I deploy this to Kubernetes? A: Yes, services are containerized. Create k8s manifests or Helm chart.
Q: What's the vector database performance? A: pgvector handles millions of vectors. For billions, consider Qdrant, Milvus, or Weaviate.
Q: How do I evaluate RAG quality? A: Track metrics: retrieval accuracy (top-k), answer faithfulness, user feedback. Build eval set.
Ready to learn RAG? Start with ./start.sh cpu and build your first knowledge system in 5 minutes.
Questions? Open an issue or discussion on GitHub.
Built something cool? Share it! Tag your fork with #LocalRAG.