Semantic Document Processor is an intelligent, open-source document analysis system that leverages Large Language Models (LLMs) to process and analyze complex documents like insurance policies, contracts, and legal documents. Built with modern AI/ML technologies, it provides structured, decision-ready responses to natural language queries against document collections.
- Smart Document Processing: Automatically extracts, cleans, and chunks text from PDF, DOCX, EML, and TXT files
- Semantic Search: Uses advanced embeddings (BGE model) to find relevant document sections based on meaning, not just keywords
- Intelligent Analysis: Leverages local LLMs (via Ollama) to analyze queries and provide structured responses with decisions, justifications, and confidence scores
- Production Ready: Includes a FastAPI REST API, comprehensive testing, and deployment configurations
- Multi-format Support: Handles PDF, DOCX, EML, and TXT files seamlessly
- Local LLM Integration: Uses Ollama for privacy-focused, offline document analysis
- Structured Output: Returns JSON responses with decisions, amounts, justifications, and relevant clauses
- Scalable Architecture: Modular design with separate phases for document processing, search indexing, and query analysis
- API-First Design: RESTful API endpoints for easy integration into existing systems
- Insurance Claims Processing: Analyze policy documents to determine coverage eligibility
- Contract Review: Extract key terms and conditions from legal documents
- Compliance Checking: Verify document requirements against regulatory standards
- Document Q&A: Natural language interface for large document collections
- Research & Analysis: Semantic search across academic or technical documents
- AI/ML: Ollama (local LLMs), BGE embeddings, semantic search
- Backend: Python, FastAPI, ChromaDB (vector database)
- Document Processing: PyMuPDF, python-docx, email parsing
- Infrastructure: Docker support, comprehensive logging, configuration management
This project demonstrates how to build enterprise-grade document intelligence systems using open-source AI technologies, making advanced document analysis accessible to developers and organizations who prioritize privacy and cost-effectiveness.
This is a complete, production-ready LLM Document Processing System that processes natural language queries against unstructured documents (insurance policies, contracts, emails) and returns structured JSON responses with decisions and justifications.
Semantic-Document-Processor/
├── README.md                         # This file
├── requirements.txt                  # Python dependencies
├── config.yaml                       # Main configuration
├── .gitignore                        # Git ignore file
│
├── data/                             # All data files
│   ├── raw_documents/                # Place your PDF/DOCX/EML files here
│   ├── processed/                    # Generated processed files
│   └── vector_db/                    # ChromaDB storage
│
├── src/                              # Main source code
│   ├── phase1_document_processing.py
│   ├── phase2_semantic_search.py
│   ├── phase3_query_processing.py
│   ├── phase4_ollama_llm.py
│   └── main_pipeline_ollama.py       # Main orchestrator (Ollama-based)
│
├── notebooks/                        # Jupyter notebooks for development
│   ├── Phase1.ipynb
│   ├── Phase2.ipynb
│   └── Phase3.ipynb
│
├── api/                              # FastAPI REST API
│   └── main.py
│
└── tests/                            # Test files
    └── __init__.py
# Clone the repository
git clone https://github.com/yourusername/Semantic-Document-Processor.git
cd Semantic-Document-Processor
# Create Python virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

Copy the provided files into your project directory:
- requirements.txt
- config.yaml
- All Python files into the src/ directory
Install and start Ollama (for local LLM support):
# Install Ollama from https://ollama.ai/
# Then pull the required model:
ollama pull llama3
# Start the Ollama service
ollama serve

Create the directory structure:
mkdir -p data/raw_documents data/processed data/vector_db
mkdir -p src api tests notebooks
Place your documents in data/raw_documents/:
- Insurance policies (PDF)
- Contracts (PDF, DOCX)
- Email files (EML)
- Text files (TXT)
Example document structure:
data/raw_documents/
├── health_policy_2024.pdf
├── travel_insurance.pdf
├── claims_procedure.docx
└── policy_terms.pdf
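To sanity-check which files the pipeline will actually pick up, here is a small standalone sketch. The function name and its use are illustrative, not part of the project code; the supported extensions come from the multi-format support list above.

```python
from pathlib import Path

# Extensions the pipeline understands (per the multi-format support list)
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".eml", ".txt"}

def find_documents(raw_dir: str = "data/raw_documents") -> list:
    """Return supported document files under raw_dir, or [] if it is missing."""
    root = Path(raw_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )

if __name__ == "__main__":
    for doc in find_documents():
        print(doc)
```

Run it from the project root; an empty result usually means the files are in the wrong directory or have an unsupported extension.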
# From project root directory
python src/main_pipeline_ollama.py

This will:
- ✅ Process all documents in data/raw_documents/
- ✅ Generate embeddings and build the search index
- ✅ Run system tests
- ✅ Start interactive query mode
Note: Make sure Ollama is running (ollama serve) and the llama3 model is downloaded (ollama pull llama3) before running the pipeline.
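A quick pre-flight check can save a confusing failure later. This standalone sketch (not part of the project code) asks Ollama's tag-listing endpoint, `/api/tags` on the default port 11434, whether the daemon is reachable and the llama3 model is pulled:

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local port

def ollama_has_model(model: str = "llama3", timeout: float = 2.0) -> bool:
    """Return True if the Ollama daemon is up and the given model is pulled."""
    try:
        with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=timeout) as resp:
            tags = json.load(resp)
    except (urllib.error.URLError, OSError):
        return False  # daemon not running / not reachable
    # Model names are reported like "llama3:latest"
    return any(m.get("name", "").startswith(model) for m in tags.get("models", []))

if __name__ == "__main__":
    print("Ollama ready:", ollama_has_model("llama3"))
```

If this prints False, run `ollama serve` and `ollama pull llama3` before starting the pipeline.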
# Step 1: Process documents
python src/phase1_document_processing.py
# Step 2: Build search index
python src/phase2_semantic_search.py
# Step 3: Test query processing
python src/phase3_query_processing.py
# Step 4: Test LLM analysis (requires Ollama running)
python src/phase4_ollama_llm.py

Your current notebooks (Phase1.ipynb, Phase2.ipynb, Phase3.ipynb) will work with this structure. Just update the file paths in them:
# Update paths in your notebooks
config_path = "../config.yaml"
raw_documents_path = "../data/raw_documents/"
processed_path = "../data/processed/"

from src.main_pipeline_ollama import LLMDocumentProcessor
# Initialize system
processor = LLMDocumentProcessor()
# Setup (only needed once)
processor.setup_system()
# Process a query
response = processor.process_query(
"46-year-old male, knee surgery in Pune, 3-month-old insurance policy"
)
print(f"Decision: {response.decision}")
print(f"Amount: {response.amount}")
print(f"Justification: {response.justification}")

queries = [
"What is the grace period for premium payment?",
"Is cataract surgery covered under the policy?",
"How to file a cashless claim?",
"What are waiting periods for pre-existing diseases?"
]
responses = processor.process_batch_queries(queries)
for response in responses:
print(f"Q: {response.query}")
    print(f"A: {response.decision} - {response.justification[:100]}...")

Key settings in config.yaml:

# Document processing
document_processing:
chunk_size: 200 # Words per chunk
chunk_overlap: 50 # Overlapping words
# Embeddings
embeddings:
model_name: "BAAI/bge-base-en-v1.5"
batch_size: 32
# LLM settings
llm:
provider: "ollama"
model: "llama3"
temperature: 0.1
max_tokens: 2000
# Search settings
semantic_search:
top_k: 5 # Number of results to retrieve
  similarity_threshold: 0.7

Start the REST API server:
cd api/
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# Process a single query
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{"query": "What is the grace period for premium payment?"}'
# System health check
curl "http://localhost:8000/health"
# System statistics
curl "http://localhost:8000/stats"

Run the full test suite:
python -m pytest tests/ -v

Or test components individually:
# Test document processing
python src/phase1_document_processing.py
# Test semantic search
python src/phase2_semantic_search.py
# Test query processing
python src/phase3_query_processing.py
# Test LLM analysis (requires Ollama running)
python src/phase4_ollama_llm.py

Example structured response:
{
"query": "46-year-old male, knee surgery in Pune, 3-month-old insurance policy",
"decision": "rejected",
"amount": null,
"justification": "Knee surgery requires 24-month waiting period. Policy is only 3 months old, therefore not eligible for coverage under current terms.",
"relevant_clauses": [
{
"clause_text": "Knee surgery and joint replacement procedures are covered after completion of 24 months waiting period...",
"source_document": "health_policy.pdf",
"relevance": "Defines waiting period for orthopedic procedures"
}
],
"query_analysis": {
"category": "Coverage",
"entities": {
"age": 46,
"gender": "male",
"procedure": "knee surgery",
"location": "Pune",
"policy_duration": "3 months"
}
},
"confidence": 0.92,
"processing_time": 2.34,
"timestamp": "2024-08-06T22:30:45"
}
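Clients can consume this JSON directly. The sketch below is a hypothetical helper, with field names taken from the sample response above, that condenses a response into a one-line summary:

```python
import json

# A trimmed-down version of the sample response shown above
sample = """
{"query": "46-year-old male, knee surgery in Pune, 3-month-old insurance policy",
 "decision": "rejected", "amount": null, "confidence": 0.92,
 "justification": "Knee surgery requires 24-month waiting period."}
"""

def summarize(response_json: str) -> str:
    """One-line summary of a structured response (keys as in the sample)."""
    r = json.loads(response_json)
    amount = r.get("amount")
    amount_str = str(amount) if amount is not None else "n/a"
    return f"{r['decision'].upper()} (amount: {amount_str}, confidence: {r.get('confidence', 0):.0%})"

print(summarize(sample))  # REJECTED (amount: n/a, confidence: 92%)
```

Note that `amount` is null for rejections, so any consumer should treat it as optional.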
- "Config file not found"
  - Ensure config.yaml is in the project root
  - Check file permissions

- "Ollama connection failed"
  - Make sure Ollama is running: ollama serve
  - Check if the model is downloaded: ollama list
  - Pull the required model: ollama pull llama3

- "No documents found"
  - Place PDF/DOCX files in data/raw_documents/
  - Check file permissions and formats

- "ChromaDB collection not found"
  - Run Phase 2 to build the search index
  - Check that the data/vector_db/ directory exists

- Memory issues with large documents
  - Reduce chunk_size in config.yaml
  - Process documents in smaller batches
Enable detailed logging:
import logging
logging.getLogger().setLevel(logging.DEBUG)

Or set it in .env:
LOG_LEVEL=DEBUG
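One way to wire the LOG_LEVEL variable into Python logging, as a sketch (the logger name and format string are illustrative):

```python
import logging
import os

def configure_logging() -> logging.Logger:
    """Configure root logging from the LOG_LEVEL env var (defaults to INFO)."""
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)  # fall back on unknown names
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    return logging.getLogger("semantic_document_processor")

log = configure_logging()
log.debug("only visible when LOG_LEVEL=DEBUG")
```

Calling this once at startup keeps the log level configurable without code changes.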
- For large document collections:
  - Use smaller chunk sizes (150-200 words)
  - Enable embedding caching
  - Consider using a GPU for embeddings

- For faster queries:
  - Reduce top_k in semantic search
  - Use lighter LLM models for development
  - Implement response caching

- For production deployment:
  - Use async processing for batch queries
  - Implement proper error handling and retries
  - Monitor API rate limits
- Never commit API keys to version control
  - Use .env files (add them to .gitignore)
  - Use environment variables in production

- Secure document storage
  - Encrypt sensitive documents at rest
  - Implement access controls

- API security
  - Add authentication/authorization
  - Implement rate limiting
  - Validate all inputs
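The "validate all inputs" point can be as simple as a guard function in front of the /query handler. A stdlib-only sketch with illustrative limits (the length cap and function name are assumptions, not project code):

```python
import re

MAX_QUERY_LEN = 1000  # assumed limit; tune to your needs

def validate_query(raw: object) -> str:
    """Reject non-string, empty, oversized, or control-character queries."""
    if not isinstance(raw, str):
        raise ValueError("query must be a string")
    query = raw.strip()
    if not query:
        raise ValueError("query must not be empty")
    if len(query) > MAX_QUERY_LEN:
        raise ValueError(f"query longer than {MAX_QUERY_LEN} characters")
    if re.search(r"[\x00-\x08\x0b-\x1f]", query):
        raise ValueError("query contains control characters")
    return query
```

Returning the stripped query means the handler always works with a cleaned value; FastAPI users could express the same constraints in a Pydantic request model instead.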
# Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "src/main_pipeline_ollama.py"]

# Build and run
docker build -t semantic-doc-processor .
docker run -p 8000:8000 semantic-doc-processor
# Note: For production, you may want to run Ollama separately
# or use a multi-container setup with docker-compose

For issues or questions:
- Check the troubleshooting section above
- Review the logs in logs/app.log
- Test individual components separately
- Verify all dependencies are installed correctly
- Add more document types (Excel, PowerPoint, etc.)
- Implement user authentication for API
- Add support for multiple languages
- Create web interface for easier interaction
- Add document versioning and change tracking
- Implement advanced analytics and reporting
Congratulations! 🎉 Your LLM Document Processing System is ready to use!
This system provides you with a complete, production-ready solution for processing insurance queries against policy documents. The modular design allows you to extend and customize each component as needed.