A comprehensive collection of Python utilities and demonstrations for building Retrieval-Augmented Generation (RAG) systems. This repository contains parsers, chunkers, vector databases, and complete RAG implementations to help you understand and build your own RAG applications.
This repository provides practical examples and reusable tools for every stage of the RAG pipeline:
- Document Parsing: Extract text from PDFs, Word docs, CSV, and plain text
- Text Chunking: Multiple strategies for splitting documents intelligently
- Vector Storage: Examples with ChromaDB and FAISS
- Semantic Search: BM25, vector search, and hybrid approaches
- Complete RAG: End-to-end implementations using LLMs served locally through Ollama
### BookSearch

Full-featured RAG implementations with progressive complexity.
Contains:
- `app_v1.py` - Basic RAG demo with in-memory documents
- `app_v2.py` - File-based RAG with ingestion pipeline
- `hybrid_rag.py` - Hybrid search (BM25 + semantic vector search with RRF)
Key Features:
- Ollama integration (llama3.3 + nomic-embed-text)
- ChromaDB vector storage
- Automatic document chunking with overlap
- Source citation in answers
- Reciprocal Rank Fusion (RRF) for hybrid search
Use Case: Question-answering over book collections (Shakespeare, Sherlock Holmes, Frankenstein, etc.)
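The Reciprocal Rank Fusion used by `hybrid_rag.py` is simple enough to sketch in a few lines. This is an illustration of the technique, not the demo's actual code: each document's fused score is the sum of `1 / (k + rank)` over every ranked list it appears in, with `k = 60` as the conventional constant.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked result lists with Reciprocal Rank Fusion (RRF).

    Each ranking is a list of document IDs, best first. A document's
    fused score is the sum of 1 / (k + rank) over every list it
    appears in; k=60 is the commonly used constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # keyword ranking
vector_hits = ["doc1", "doc9", "doc3"]  # semantic ranking
fused = rrf_fuse([bm25_hits, vector_hits])
```

Because RRF only looks at ranks, it needs no score normalization between BM25 and cosine similarity, which is why it is a popular fusion choice for hybrid search.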
### ChromaDemo

Minimal ChromaDB implementation for semantic search.
Contains:
- `ingest_and_query.py` - Ingest text files and perform semantic queries
Key Features:
- Persistent ChromaDB storage
- Automatic chunking (1500 chars, 200 overlap)
- Sentence-transformers embeddings (all-MiniLM-L6-v2)
- Metadata filtering
- Idempotent ingestion
Use Case: Simple semantic search over text document collections
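The demo's chunking parameters (1500-character chunks with 200 characters of overlap) follow the standard fixed-size pattern. A stdlib-only sketch of that pattern, not the demo's exact code:

```python
def chunk_text(text, chunk_size=1500, overlap=200):
    """Split text into fixed-size chunks; each chunk starts with the
    last `overlap` characters of the previous one, so content cut at
    a boundary still appears whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks
```

For example, a 4000-character document yields three chunks (1500 + 1500 + 1400 characters), with each chunk after the first repeating the previous chunk's final 200 characters.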
### PDFParser

Extract and chunk text from PDF documents for RAG ingestion.
Contains:
- `pdf_parser.py` - PDFParser class implementation
- `main.py` - Usage examples
Key Features:
- PyPDF2-based text extraction
- LangChain text splitter integration
- Intelligent boundary detection
- Structured JSON output with metadata
- Multiple output formats (JSON, text)
Dependencies: PyPDF2, langchain-text-splitters
### WordParser

Parse Microsoft Word (.docx) documents for RAG systems.
Contains:
- `main.py` - DocxParser class and demo
Key Features:
- Extracts text from .docx files
- Document properties (title, author, dates)
- Hierarchical chunking (paragraph → sentence → word)
- Preserves paragraph structure
- MD5-based document IDs
Dependencies: python-docx, lxml
### TextParser

Parse plain text files with smart chunking.
Contains:
- `main.py` - TextDocumentParser class and demo
Key Features:
- Zero dependencies (Python stdlib only)
- Sentence-aware boundary detection
- Rich file metadata extraction
- Configurable chunk size and overlap
- MD5 document IDs
Use Case: Processing plain text documents, logs, markdown files
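Because TextParser is stdlib-only, its two core ideas are easy to sketch. The function names below are illustrative, not the demo's actual API: sentence-aware chunking packs whole sentences up to a size limit, and an MD5 hash of path plus content gives a stable document ID.

```python
import hashlib
import re

def sentence_chunks(text, max_chars=500):
    """Pack whole sentences greedily into chunks of at most max_chars.
    Sentence boundaries are approximated as ., !, or ? followed by
    whitespace - a heuristic, not a full sentence segmenter."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def document_id(path, text):
    """Stable MD5-based ID derived from the file path and its content:
    re-ingesting the same file produces the same ID (idempotence)."""
    return hashlib.md5(f"{path}:{text}".encode("utf-8")).hexdigest()
```

Hash-based IDs are what make re-ingestion idempotent: the vector database sees the same ID for unchanged content and can skip or upsert it.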
### CSVParser

Convert CSV data to RAG-ready JSON documents.
Contains:
- `main.py` - CSV to JSON converter
- Sample datasets (`sample_data.csv`, `big_data.csv`)
Key Features:
- Transforms rows into searchable text
- Preserves all columns as metadata
- JSON output for vector database ingestion
- No external dependencies
Use Case: Making structured data semantically searchable
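The conversion idea fits in a few lines of stdlib code. The `rows_to_documents` helper and column names here are illustrative, not the demo's actual code: selected columns become the searchable text, while the full row is kept as metadata.

```python
import csv
import io

def rows_to_documents(csv_text, text_columns):
    """Turn each CSV row into a searchable text string plus the full
    row preserved as metadata, ready for vector-DB ingestion."""
    documents = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        text = ". ".join(f"{col}: {row[col]}" for col in text_columns)
        documents.append({"id": f"row_{i}", "text": text, "metadata": dict(row)})
    return documents

sample = "name,role,city\nAda,Engineer,London\nLin,Analyst,Oslo\n"
docs = rows_to_documents(sample, ["name", "role"])
```

Keeping every column in metadata means the vector database can still filter on fields (e.g. `city`) that were left out of the embedded text.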
### ChunkingDemo

Comprehensive document chunker with 8 different strategies.
Contains:
- `document_chunker.py` - CLI tool with multiple chunking methods
Chunking Methods:
- Line-by-Line: Group by number of lines
- Fixed Size: Fixed character chunks with overlap
- Sliding Window: Overlapping windows
- Sentence-Based: Group by sentences
- Paragraph-Based: Group by paragraphs
- Page-Based: Simulate pages (by line count)
- Section-Based: Split on headings (markdown, etc.)
- Token-Based: BERT tokenizer-based chunking
Supports: TXT, PDF, DOC, DOCX files
Dependencies: PyPDF2, python-docx, transformers (optional)
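As one example of the strategies above, section-based splitting on markdown headings needs only a regex. This is a sketch of the idea, not the CLI tool's implementation: a zero-width lookahead splits at each heading line while keeping the heading attached to its body.

```python
import re

def split_on_headings(markdown_text):
    """Split markdown into sections at heading lines (#, ##, ...).
    The zero-width lookahead means the split point sits just before
    each heading, so the heading stays with the body that follows."""
    parts = re.split(r"(?m)^(?=#{1,6}\s)", markdown_text)
    return [part for part in parts if part.strip()]
```

Section-based chunks are attractive for RAG because they follow the author's own topic boundaries instead of arbitrary character counts.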
### SearchTool

Jupyter notebooks demonstrating different search approaches.
Contains:
- `BM25-vs-Semantic.ipynb` - Comparison of keyword vs vector search
- `SearchDemo.ipynb` - Search implementation examples
- `Semantic-Demo.ipynb` - Semantic search demonstration
Key Topics:
- BM25 keyword search
- Vector embeddings and similarity
- Hybrid search strategies
- Search quality comparison
Use Case: Understanding search approaches for RAG
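The notebooks use the rank-bm25 package; to make the scoring formula concrete, here is a from-scratch BM25 sketch (standard Okapi weighting, not the library's code). Each matching term contributes an IDF weight times a saturating, length-normalized term frequency.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score pre-tokenized documents against query terms with Okapi
    BM25. k1 controls term-frequency saturation; b controls how much
    long documents are penalized."""
    n = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n  # average document length
    df = Counter()                             # document frequency per term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because BM25 matches exact tokens, it excels at names and rare terms where embeddings can blur meaning; that complementarity is why hybrid search combines the two.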
### SearchFiles

Collection of classic literature texts for search and RAG testing.
Contains: 18 classic books in plain text format
- Adventures of Huckleberry Finn
- Adventures of Sherlock Holmes
- Alice in Wonderland
- Beowulf
- Complete Works of William Shakespeare
- Dracula
- Frankenstein
- Great Gatsby
- Jane Eyre
- Moby Dick
- Pride and Prejudice
- And more...
Use Case: Test corpus for RAG and search implementations
- Python 3.8 or higher
- pip package manager
Most demos have their own `requirements.txt`. Install per project:

```bash
cd BookSearch
pip install -r requirements.txt
```

1. Basic RAG with BookSearch:

```bash
cd BookSearch
python app_v2.py init    # Check setup
python app_v2.py ingest  # Ingest documents
python app_v2.py ask "Who is Sherlock Holmes?"
```

2. ChromaDB Semantic Search:

```bash
cd ChromaDemo
pip install -r requirements.txt
python ingest_and_query.py
```

3. Parse a PDF:

```bash
cd PDFParser
pip install -r requirements.txt
python main.py
```

4. Parse a Word Document:

```bash
cd WordParser
pip install -r requirements.txt
python main.py
```

5. Chunk a Document:

```bash
cd ChunkingDemo
pip install -r requirements.txt
python document_chunker.py sample_document.txt sentence --max-sentences 5
```

6. Convert CSV to RAG Format:

```bash
cd CSVParser
python main.py
```
1. Choose Your Parser (based on document format)
   - PDF → PDFParser
   - Word → WordParser
   - Plain text → TextParser
   - Structured data → CSVParser

2. Select Chunking Strategy
   - Small chunks (300-500 chars) → Precise retrieval
   - Medium chunks (1000-1500 chars) → Balanced
   - Large chunks (2000+ chars) → Maximum context
   - Use ChunkingDemo to experiment

3. Pick Vector Database
   - ChromaDB → Easy setup, great for prototypes
   - FAISS → High performance, production-ready
   - Pinecone/Weaviate → Managed, scalable

4. Implement Search
   - Semantic only → Simple, fast
   - Hybrid (BM25 + Vector) → Best quality
   - See SearchTool notebooks for comparisons

5. Add LLM Generation
   - Ollama → Local, free
   - OpenAI → High quality
   - Anthropic → Long context
```python
# 1. Parse documents (PDFParser class from PDFParser/pdf_parser.py)
from pdf_parser import PDFParser

parser = PDFParser(chunk_size=1000, chunk_overlap=200)
chunks = parser.process_document('document.pdf')

# 2. Store in vector DB
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")
collection.add(
    ids=[f"chunk_{c['chunk_id']}" for c in chunks],
    documents=[c['text'] for c in chunks],
    metadatas=[c['document_metadata'] for c in chunks]
)

# 3. Query
results = collection.query(
    query_texts=["What is the main topic?"],
    n_results=5
)

# 4. Generate answer with LLM
import ollama

context = "\n\n".join(results['documents'][0])
prompt = f"Answer based on context:\n{context}\n\nQuestion: What is the main topic?"
answer = ollama.generate(model="llama3.3", prompt=prompt)
print(answer['response'])
```

| Document Type | Chunk Size (chars) | Overlap (chars) | Rationale |
|---|---|---|---|
| Technical docs | 800-1200 | 150-250 | Balance detail & context |
| Books/Articles | 1000-1500 | 200-300 | Preserve narrative flow |
| Code documentation | 500-800 | 100-150 | Precise code examples |
| Chat logs | 300-500 | 50-100 | Short exchanges |
| Research papers | 1500-2000 | 300-400 | Complex arguments |
| Model | Dimensions | Speed | Best For |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Very Fast | General purpose |
| nomic-embed-text | 768 | Fast | Ollama integration |
| text-embedding-ada-002 | 1536 | Medium | High quality (OpenAI) |
| instructor-large | 768 | Slow | Domain-specific |
| Feature | BookSearch | ChromaDemo | PDFParser | WordParser | TextParser | CSVParser |
|---|---|---|---|---|---|---|
| Complete RAG | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Vector DB | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| LLM Integration | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Hybrid Search | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| PDF Support | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Word Support | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| Structured Data | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Zero Dependencies | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
- Python 3.8+: Primary language
- ChromaDB: Vector database
- Ollama: Local LLM inference
- PyPDF2: PDF parsing
- python-docx: Word document parsing
- LangChain: Text splitting utilities
- rank-bm25: BM25 search algorithm
- transformers: BERT tokenizer for chunking
- sentence-transformers: Embedding models
- Start with ChromaDemo to understand vector databases
- Explore TextParser for basic document processing
- Try ChunkingDemo to experiment with chunking strategies
- Use PDFParser and WordParser for real documents
- Study BookSearch/app_v1.py and app_v2.py for RAG basics
- Review SearchTool notebooks for search concepts
- Implement hybrid_rag.py for production-quality search
- Build custom parsers combining multiple demos
- Scale to production with proper error handling and monitoring
- Tools: PDFParser + BookSearch
- Example: Company policy documents, technical manuals
- Tools: ChromaDemo + hybrid search
- Example: FAQ database, support ticket history
- Tools: PDFParser + BookSearch
- Example: Academic papers, research notes
- Tools: TextParser + semantic search
- Example: README files, code comments
- Tools: CSVParser + vector search
- Example: Customer databases, log analysis
"Module not found" errors

```bash
pip install -r requirements.txt
```

Ollama connection errors (BookSearch)

```bash
ollama serve  # Start Ollama server
ollama pull llama3.3
ollama pull nomic-embed-text
```

PDF extraction issues
- Scanned PDFs need OCR (pytesseract + pdf2image)
- Password-protected PDFs not supported
- Try alternative: pdfplumber or PyMuPDF
Memory errors with large files
- Reduce chunk size
- Process files in batches
- Use streaming approaches
Poor search results
- Adjust chunk size (try smaller/larger)
- Increase overlap (15-20% of chunk size)
- Use hybrid search instead of semantic only
- Start Simple: Begin with ChromaDemo, then add complexity
- Test Chunking: Use ChunkingDemo to find optimal strategy
- Version Control: Track changes to chunk size and overlap
- Monitor Quality: Regularly evaluate retrieval accuracy
- Document Metadata: Always preserve source information
- Error Handling: Wrap parsers in try-except for production
- Batch Processing: Process multiple files efficiently
- Clean Data: Preprocess documents before ingestion
- LangChain - RAG framework
- LlamaIndex - Data framework for LLMs
- ChromaDB - Vector database
- Ollama - Local LLM runtime
This is an educational repository. Feel free to:
- Fork and modify for your projects
- Report issues or suggest improvements
- Share your implementations and learnings
Educational demo project. All code provided as-is for learning purposes.
Sample texts in SearchFiles/ are public domain works from Project Gutenberg.
Happy Building!
Start with any demo that matches your needs, or combine multiple parsers for a complete solution. Each subdirectory has detailed documentation to guide you.