Advanced OCR system for SynthoraAI - Extracting text from government documents, images, and PDFs to enhance article curation and accessibility.
The SynthoraAI OCR system is a production-ready, multi-engine Optical Character Recognition solution designed to extract text from:
- Scanned government documents
- PDF files (searchable and non-searchable)
- Images (JPG, PNG, TIFF, WebP)
- Screenshots and infographics
- Historical documents with degraded quality
This system integrates seamlessly with the SynthoraAI AI-Powered Article Content Curator to enable text extraction from non-digital sources, making government information more accessible.
- Tesseract OCR - Industry standard, 100+ languages
- EasyOCR - Deep learning-based, excellent accuracy
- PaddleOCR - Ultra-fast, multilingual support
- TrOCR (Microsoft) - Transformer-based for handwriting
- Ensemble Mode - Combines results from multiple engines
- PDF Processing - Extract text from both searchable and scanned PDFs
- Image Preprocessing - Deskewing, noise removal, contrast enhancement
- Layout Analysis - Preserve document structure and formatting
- Table Extraction - Extract data from tables in documents
- Handwriting Recognition - Process handwritten government forms
- RESTful API - Easy integration with SynthoraAI backend
- WebSocket Support - Real-time OCR progress updates
- Batch Processing - Process multiple documents simultaneously
- Cloud Storage - S3, Azure Blob, Google Cloud Storage support
- Webhook Notifications - Get notified when OCR completes
- Confidence Scoring - Get reliability metrics for extracted text
- Language Detection - Automatic language identification
- Post-Processing - Spell checking and text correction
- GPU Acceleration - CUDA support for faster processing
- Caching - Redis-based caching for processed documents
- Quick Start
- Installation
- Usage
- API Reference
- Integration with SynthoraAI
- Architecture
- Configuration
- Performance
- Deployment
- Testing
- Contributing
- License
# Clone the repository
git clone https://github.com/SynthoraAI-AI-News-Content-Curator/Optical-Character-Recognition.git
cd Optical-Character-Recognition
# Start all services
docker-compose up -d
# Test the OCR API
curl -X POST http://localhost:8000/api/ocr/image \
-F "[email protected]" \
-F "engine=ensemble"# Install Python dependencies
cd ocr_backend
pip install -r requirements.txt
python -m spacy download en_core_web_sm
# Install Node.js dependencies
cd ../ocr_api
npm install
# Set up environment variables
cp .env.example .env
# Edit .env with your configuration
# Start the services
# Terminal 1: Python OCR backend
cd ocr_backend
python main.py
# Terminal 2: Node.js API
cd ocr_api
npm run dev

Prerequisites:

- Python 3.8+ with pip
- Node.js 18+ with npm
- Tesseract OCR 4.0+
- Redis (optional, for caching)
- MongoDB (optional, for document storage)
sudo apt-get update
sudo apt-get install -y \
tesseract-ocr \
tesseract-ocr-eng \
libtesseract-dev \
poppler-utils \
libsm6 \
libxext6 \
libxrender-dev \
  libgomp1

brew install tesseract
brew install poppler

# Install via chocolatey
choco install tesseract
choco install poppler

cd ocr_backend
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Download language models
python -c "import easyocr; easyocr.Reader(['en'])"
python -m spacy download en_core_web_sm

cd ocr_api
# Install dependencies
npm install
# For development
npm install -D nodemon @types/node

# Process a single image
python ocr_backend/cli.py process --input document.jpg --engine tesseract
# Process a PDF
python ocr_backend/cli.py process --input document.pdf --engine paddleocr --output result.txt
# Batch process a directory
python ocr_backend/cli.py batch --input ./documents --output ./results --engine ensemble
# Compare OCR engines
python ocr_backend/cli.py compare --input document.jpg

from ocr_backend.processors import OCRProcessor
# Initialize processor
processor = OCRProcessor(engine='ensemble')
# Process an image
result = processor.process_image('document.jpg')
print(f"Extracted text: {result['text']}")
print(f"Confidence: {result['confidence']}")
print(f"Language: {result['language']}")
# Process a PDF
result = processor.process_pdf('document.pdf')
for page_num, page_result in enumerate(result['pages']):
    print(f"Page {page_num + 1}: {page_result['text']}")

# Process an image
curl -X POST http://localhost:8000/api/ocr/image \
-F "[email protected]" \
-F "engine=tesseract" \
-F "languages=eng"
# Process a PDF
curl -X POST http://localhost:8000/api/ocr/pdf \
-F "[email protected]" \
-F "engine=paddleocr"
# Get processing status
curl http://localhost:8000/api/ocr/status/job_id_here
# List supported languages
curl http://localhost:8000/api/ocr/languages

const OCRClient = require('./ocr_api/client');
const client = new OCRClient('http://localhost:8000');
// Process an image
const result = await client.processImage('document.jpg', {
  engine: 'ensemble',
  languages: ['eng'],
  preprocessing: true
});

console.log('Text:', result.text);
console.log('Confidence:', result.confidence);

`POST /api/ocr/image`

Process a single image file.
Parameters:
- `file` (multipart/form-data) - Image file (JPG, PNG, TIFF, WebP)
- `engine` (string, optional) - OCR engine: `tesseract`, `easyocr`, `paddleocr`, `trocr`, `ensemble` (default: `tesseract`)
- `languages` (string[], optional) - Language codes (default: `['eng']`)
- `preprocessing` (boolean, optional) - Apply image preprocessing (default: `true`)
Response:
{
  "success": true,
  "text": "Extracted text content",
  "confidence": 0.95,
  "language": "eng",
  "processing_time": 1.23,
  "engine": "tesseract",
  "metadata": {
    "image_size": [1920, 1080],
    "dpi": 300
  }
}

`POST /api/ocr/pdf`

Process a PDF document.
Parameters:
- `file` (multipart/form-data) - PDF file
- `engine` (string, optional) - OCR engine
- `extract_images` (boolean, optional) - Extract and OCR embedded images
Response:
{
  "success": true,
  "pages": [
    {
      "page_number": 1,
      "text": "Page 1 content",
      "confidence": 0.92
    }
  ],
  "total_pages": 10,
  "processing_time": 15.67
}

Process multiple files.
Parameters:
- `files` (multipart/form-data[]) - Array of files
- `engine` (string, optional) - OCR engine
`GET /api/ocr/status/:job_id`

Get processing status for async jobs.

`GET /api/ocr/languages`

List supported languages for each engine.
List available OCR engines and their capabilities.
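The async status endpoint above is designed to be polled until a job reaches a terminal state. A minimal Python polling helper is sketched below; the `state` field and its `completed`/`failed` values are assumptions about the response shape, so adjust them to match the actual API:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"

def status_url(job_id: str) -> str:
    """Build the status endpoint URL for a job."""
    return f"{BASE_URL}/api/ocr/status/{job_id}"

def is_terminal(status: dict) -> bool:
    """A job is done once its state is 'completed' or 'failed'
    (assumed state names; adjust to the real API)."""
    return status.get("state") in ("completed", "failed")

def poll_job(job_id: str, interval: float = 2.0, timeout: float = 300.0) -> dict:
    """Poll the status endpoint until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(status_url(job_id)) as resp:
            status = json.loads(resp.read())
        if is_terminal(status):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

A fixed 2-second interval is fine for short jobs; for large batch PDFs, exponential backoff or the WebSocket channel avoids hammering the API.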
Add to your SynthoraAI backend (backend/api/articles.js):
const axios = require('axios');
// OCR integration for article processing
async function processArticleWithOCR(articleUrl) {
  try {
    // Download the article image/PDF
    const response = await axios.get(articleUrl, { responseType: 'arraybuffer' });

    // Send to OCR service
    const formData = new FormData();
    formData.append('file', response.data);
    formData.append('engine', 'ensemble');

    const ocrResult = await axios.post('http://localhost:8000/api/ocr/image', formData);

    return {
      extractedText: ocrResult.data.text,
      confidence: ocrResult.data.confidence
    };
  } catch (error) {
    console.error('OCR processing failed:', error);
    return null;
  }
}

Enhance your crawler to extract text from images in articles:
// In crawler/schedule/fetchAndSummarize.ts
import { OCRProcessor } from '../utils/ocr';
const ocrProcessor = new OCRProcessor();
async function processArticle(article) {
  // Extract images from article
  const images = await extractImagesFromArticle(article.url);

  // OCR each image
  const ocrResults = await Promise.all(
    images.map(img => ocrProcessor.processImage(img))
  );

  // Combine OCR text with article content
  article.content += '\n\n' + ocrResults.map(r => r.text).join('\n\n');

  return article;
}

Add OCR upload component:
// frontend/components/OCRUpload.jsx
import { useState } from 'react';
export default function OCRUpload() {
  const [file, setFile] = useState(null);
  const [result, setResult] = useState(null);
  const [loading, setLoading] = useState(false);

  const handleUpload = async () => {
    setLoading(true);
    const formData = new FormData();
    formData.append('file', file);
    formData.append('engine', 'ensemble');

    const response = await fetch('http://localhost:8000/api/ocr/image', {
      method: 'POST',
      body: formData
    });

    const data = await response.json();
    setResult(data);
    setLoading(false);
  };

  return (
    <div className="ocr-upload">
      <input type="file" onChange={(e) => setFile(e.target.files[0])} />
      <button onClick={handleUpload} disabled={!file || loading}>
        {loading ? 'Processing...' : 'Extract Text'}
      </button>
      {result && (
        <div className="result">
          <p>Confidence: {(result.confidence * 100).toFixed(2)}%</p>
          <pre>{result.text}</pre>
        </div>
      )}
    </div>
  );
}

+----------------------------------------------------------------+
|                      SynthoraAI OCR System                     |
+----------------------------------------------------------------+
|                                                                |
|   +-----------+      +--------------+      +-----------+       |
|   |  Frontend | ---> |  Node.js API | ---> |  Python   |       |
|   |  (React)  |      |  (Express)   |      |  Backend  |       |
|   +-----------+      +--------------+      +-----------+       |
|                             |                    |             |
|                             v                    v             |
|                       +-----------+       +--------------+     |
|                       |   Redis   |       | OCR Engines  |     |
|                       |   Cache   |       | - Tesseract  |     |
|                       +-----------+       | - EasyOCR    |     |
|                             |             | - PaddleOCR  |     |
|                             v             | - TrOCR      |     |
|                       +-----------+       +--------------+     |
|                       |  MongoDB  |                            |
|                       |  Storage  |                            |
|                       +-----------+                            |
|                                                                |
+----------------------------------------------------------------+
|                       Cloud Integrations                       |
|   +--------+    +--------+    +---------+    +------------+    |
|   |  AWS   |    | Azure  |    | Google  |    |   Vercel   |    |
|   |  S3    |    |  Blob  |    |  Cloud  |    |  Functions |    |
|   +--------+    +--------+    +---------+    +------------+    |
+----------------------------------------------------------------+
- **Frontend Layer**
  - React components for file upload
  - Real-time OCR progress display
  - Result visualization and editing

- **API Layer (Node.js/Express)**
  - Request validation and routing
  - File upload handling
  - WebSocket connections for real-time updates
  - Authentication and rate limiting

- **Processing Layer (Python)**
  - Image preprocessing
  - Multi-engine OCR processing
  - Post-processing and text correction
  - Confidence scoring

- **Storage Layer**
  - Redis: Caching and job queue
  - MongoDB: Document metadata and results
  - Cloud storage: Original files and processed outputs
Create a .env file:
# OCR Backend
PYTHON_PORT=5000
OCR_ENGINES=tesseract,easyocr,paddleocr,trocr
DEFAULT_ENGINE=tesseract
ENABLE_GPU=false
MAX_FILE_SIZE=50MB
# Node.js API
NODE_PORT=8000
API_BASE_URL=http://localhost:8000
ENABLE_WEBSOCKET=true
RATE_LIMIT_REQUESTS=100
RATE_LIMIT_WINDOW=15min
# Database
MONGODB_URI=mongodb://localhost:27017/synthoraai_ocr
REDIS_URL=redis://localhost:6379
# Cloud Storage (Optional)
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
AWS_REGION=us-east-1
AWS_S3_BUCKET=synthoraai-ocr
AZURE_STORAGE_CONNECTION_STRING=your_connection_string
AZURE_CONTAINER_NAME=ocr-documents
# SynthoraAI Integration
SYNTHORAAI_API_URL=https://ai-content-curator-backend.vercel.app
SYNTHORAAI_API_KEY=your_api_key
# Logging
LOG_LEVEL=info
ENABLE_DEBUG=false

Edit ocr_backend/config/engines.yaml:
tesseract:
  enabled: true
  languages: [eng, spa, fra, deu]
  config: --psm 3 --oem 3
  confidence_threshold: 0.6

easyocr:
  enabled: true
  languages: [en, es, fr, de]
  gpu: false
  confidence_threshold: 0.5

paddleocr:
  enabled: true
  lang: en
  use_angle_cls: true
  confidence_threshold: 0.5

trocr:
  enabled: false  # Requires GPU
  model: microsoft/trocr-base-handwritten
  confidence_threshold: 0.7

ensemble:
  enabled: true
  engines: [tesseract, easyocr, paddleocr]
  voting_method: weighted  # or majority
  weights:
    tesseract: 0.3
    easyocr: 0.4
    paddleocr: 0.3

Tested on AWS EC2 t3.xlarge (4 vCPU, 16GB RAM):
| Engine | Speed (pages/min) | Accuracy | GPU Support |
|---|---|---|---|
| Tesseract | 15-20 | 85-90% | No |
| EasyOCR | 8-12 | 88-93% | Yes |
| PaddleOCR | 20-30 | 87-92% | Yes |
| TrOCR | 5-8 | 90-95% | Yes |
| Ensemble | 10-15 | 92-96% | Partial |
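One reason the ensemble row can beat any single engine is confidence-weighted voting over per-engine outputs. The sketch below uses the weights from the ensemble configuration; it is a simplification that picks a whole-result winner, whereas a production implementation might align and merge text at the word level:

```python
# Weights mirror the ensemble section of engines.yaml.
WEIGHTS = {"tesseract": 0.3, "easyocr": 0.4, "paddleocr": 0.3}

def ensemble_vote(results: dict, weights: dict = WEIGHTS) -> dict:
    """Pick the engine result with the highest weight-adjusted confidence.

    `results` maps engine name -> {"text": ..., "confidence": ...}.
    """
    best_engine = max(
        results,
        key=lambda e: weights.get(e, 0.0) * results[e]["confidence"],
    )
    chosen = results[best_engine]
    return {"engine": best_engine,
            "text": chosen["text"],
            "confidence": chosen["confidence"]}

# Example: EasyOCR's higher weight beats PaddleOCR's slightly higher
# raw confidence (0.4 * 0.90 = 0.36 vs 0.3 * 0.91 = 0.273).
outputs = {
    "tesseract": {"text": "Notice of Hearing", "confidence": 0.82},
    "easyocr":   {"text": "Notice of Hearing", "confidence": 0.90},
    "paddleocr": {"text": "Notlce of Hearing", "confidence": 0.91},
}
```

Weighting lets a generally more reliable engine win even when a noisier one reports higher raw confidence, which is why PaddleOCR's misread ("Notlce") loses here.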
- Enable GPU: Set `ENABLE_GPU=true` for a 3-5x speedup
- Use Preprocessing: Improves accuracy by 5-10%
- Enable Caching: Reduces redundant processing
- Batch Processing: Process multiple files in parallel
- Choose the Right Engine:
  - Tesseract: Fast, general purpose
  - EasyOCR: High accuracy, slower
  - PaddleOCR: Best speed/accuracy balance
  - Ensemble: Highest accuracy, slowest
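The batch-processing tip above amounts to fanning files out over a worker pool. A minimal standard-library sketch is shown below; `run_engine` is a stub standing in for a real per-file OCR call:

```python
from concurrent.futures import ThreadPoolExecutor

def run_engine(path: str) -> dict:
    """Stub OCR call; a real worker would invoke the OCR backend here."""
    return {"file": path, "text": f"text from {path}", "confidence": 0.9}

def batch_process(paths, max_workers: int = 4):
    """OCR many files in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_engine, paths))
```

Threads are appropriate when the heavy lifting happens in native OCR libraries or over HTTP; for pure-Python CPU-bound work, `ProcessPoolExecutor` with the same `map` interface sidesteps the GIL.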
# Build images
docker-compose build
# Start all services
docker-compose up -d
# Scale processing workers
docker-compose up -d --scale ocr-worker=4
# View logs
docker-compose logs -f
# Stop services
docker-compose down

# Apply configurations
kubectl apply -f k8s/
# Check status
kubectl get pods -n synthoraai-ocr
# Scale workers
kubectl scale deployment ocr-worker --replicas=5 -n synthoraai-ocr

cd ocr_api
vercel --prod

cd ocr_backend
./scripts/deploy-lambda.sh

# Python backend
cd ocr_backend
pytest tests/ -v --cov=.
# Node.js API
cd ocr_api
npm test
npm run test:coverage

# Test full pipeline
npm run test:integration
# Test specific engine
pytest tests/test_tesseract.py -v

# Using artillery
cd ocr_api
npm run test:load
# Using locust
cd ocr_backend
locust -f tests/locustfile.py

The system exposes Prometheus metrics at /metrics:
- `ocr_requests_total` - Total OCR requests
- `ocr_processing_duration_seconds` - Processing time histogram
- `ocr_confidence_score` - Confidence score distribution
- `ocr_errors_total` - Error count by type
- `ocr_queue_size` - Current queue size
Logs are structured in JSON format:
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "info",
  "service": "ocr-backend",
  "message": "Image processed successfully",
  "metadata": {
    "job_id": "abc123",
    "engine": "tesseract",
    "processing_time": 1.23,
    "confidence": 0.95
  }
}

We welcome contributions! Please see CONTRIBUTING.md for details.
# Fork and clone
git clone https://github.com/YOUR_USERNAME/Optical-Character-Recognition.git
cd Optical-Character-Recognition
# Create a branch
git checkout -b feature/your-feature
# Make changes and test
npm test
# Commit with conventional commits
git commit -m "feat: add new OCR engine support"
# Push and create PR
git push origin feature/your-feature

This project is licensed under the MIT License - see LICENSE file for details.
- Project Lead: David Nguyen
- Email: [email protected]
- SynthoraAI: https://synthoraai.vercel.app
- Issues: https://github.com/SynthoraAI-AI-News-Content-Curator/Optical-Character-Recognition/issues
Built with ❤️ for SynthoraAI - Synthesizing the world's news & information through AI