SynthoraAI OCR - Optical Character Recognition System

License: MIT | Python 3.8+ | Node.js 18+

Advanced OCR system for SynthoraAI - Extracting text from government documents, images, and PDFs to enhance article curation and accessibility.

🚀 Overview

The SynthoraAI OCR system is a production-ready, multi-engine Optical Character Recognition solution designed to extract text from:

  • Scanned government documents
  • PDF files (searchable and non-searchable)
  • Images (JPG, PNG, TIFF, WebP)
  • Screenshots and infographics
  • Historical documents with degraded quality

This system integrates seamlessly with the SynthoraAI AI-Powered Article Content Curator to enable text extraction from non-digital sources, making government information more accessible.

✨ Key Features

🔥 Multi-Engine OCR

  • Tesseract OCR - Industry standard, 100+ languages
  • EasyOCR - Deep learning-based, excellent accuracy
  • PaddleOCR - Ultra-fast, multilingual support
  • TrOCR (Microsoft) - Transformer-based for handwriting
  • Ensemble Mode - Combines results from multiple engines

📄 Document Processing

  • PDF Processing - Extract text from both searchable and scanned PDFs
  • Image Preprocessing - Deskewing, noise removal, contrast enhancement
  • Layout Analysis - Preserve document structure and formatting
  • Table Extraction - Extract data from tables in documents
  • Handwriting Recognition - Process handwritten government forms
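As an illustration of the contrast enhancement step listed above, here is a minimal sketch of linear contrast stretching on a grayscale image held as nested lists. It is not the project's preprocessing code; `stretch_contrast` and the `Gray` alias are hypothetical names, and a real pipeline would more likely use an image library.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values


def stretch_contrast(image: Gray) -> Gray:
    """Linearly rescale pixels so the darkest becomes 0 and the brightest 255."""
    flat = [pixel for row in image for pixel in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A uniform image has no contrast to stretch; return an unchanged copy.
        return [row[:] for row in image]
    scale = 255 / (hi - lo)
    return [[round((pixel - lo) * scale) for pixel in row] for row in image]


# A faded scan whose values only span 100-160.
faded_scan = [[100, 120], [140, 160]]
print(stretch_contrast(faded_scan))  # [[0, 85], [170, 255]]
```

Stretching the value range makes faint text darker relative to the background, which is one reason preprocessing can help on degraded scans.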

🌐 API & Integration

  • RESTful API - Easy integration with SynthoraAI backend
  • WebSocket Support - Real-time OCR progress updates
  • Batch Processing - Process multiple documents simultaneously
  • Cloud Storage - S3, Azure Blob, Google Cloud Storage support
  • Webhook Notifications - Get notified when OCR completes

🎯 Accuracy & Performance

  • Confidence Scoring - Get reliability metrics for extracted text
  • Language Detection - Automatic language identification
  • Post-Processing - Spell checking and text correction
  • GPU Acceleration - CUDA support for faster processing
  • Caching - Redis-based caching for processed documents
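The caching feature can be pictured as a lookup keyed by file contents plus OCR settings. The sketch below is an assumption about how such a key could be built, not the project's implementation; `cache_key` and `cached_ocr` are hypothetical names, and a plain dict stands in for Redis.

```python
import hashlib
from typing import Callable, Dict, Sequence


def cache_key(file_bytes: bytes, engine: str, languages: Sequence[str]) -> str:
    """Derive a deterministic key from the file contents and OCR settings."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    langs = ",".join(sorted(languages))  # order of languages should not matter
    return f"ocr:{engine}:{langs}:{digest}"


def cached_ocr(store: Dict[str, str], file_bytes: bytes, engine: str,
               languages: Sequence[str], run_ocr: Callable[[bytes], str]) -> str:
    """Return cached text if present; otherwise run OCR once and store it."""
    key = cache_key(file_bytes, engine, languages)
    if key not in store:
        store[key] = run_ocr(file_bytes)
    return store[key]


store: Dict[str, str] = {}  # stand-in for a Redis connection
text = cached_ocr(store, b"fake image bytes", "tesseract", ["eng"],
                  run_ocr=lambda data: "Extracted text content")
print(text)
```

Hashing the bytes rather than the filename means a re-uploaded copy of the same document hits the cache, while a different engine or language set produces a separate entry.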

🚀 Quick Start

Using Docker (Recommended)

# Clone the repository
git clone https://github.com/SynthoraAI-AI-News-Content-Curator/Optical-Character-Recognition.git
cd Optical-Character-Recognition

# Start all services
docker-compose up -d

# Test the OCR API (document.jpg is a placeholder for your own file)
curl -X POST http://localhost:8000/api/ocr/image \
  -F "file=@document.jpg" \
  -F "engine=ensemble"

Local Installation

# Install Python dependencies
cd ocr_backend
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# Install Node.js dependencies
cd ../ocr_api
npm install

# Set up environment variables
cp .env.example .env
# Edit .env with your configuration

# Start the services
# Terminal 1: Python OCR backend
cd ocr_backend
python main.py

# Terminal 2: Node.js API
cd ocr_api
npm run dev

📦 Installation

Prerequisites

  • Python 3.8+ with pip
  • Node.js 18+ with npm
  • Tesseract OCR 4.0+
  • Redis (optional, for caching)
  • MongoDB (optional, for document storage)

System Dependencies

Ubuntu/Debian

sudo apt-get update
sudo apt-get install -y \
  tesseract-ocr \
  tesseract-ocr-eng \
  libtesseract-dev \
  poppler-utils \
  libsm6 \
  libxext6 \
  libxrender-dev \
  libgomp1

macOS

brew install tesseract
brew install poppler

Windows

# Install via chocolatey
choco install tesseract
choco install poppler

Python Backend

cd ocr_backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download language models
python -c "import easyocr; easyocr.Reader(['en'])"
python -m spacy download en_core_web_sm

Node.js API

cd ocr_api

# Install dependencies
npm install

# For development
npm install -D nodemon @types/node

🔧 Usage

Command Line Interface

# Process a single image
python ocr_backend/cli.py process --input document.jpg --engine tesseract

# Process a PDF
python ocr_backend/cli.py process --input document.pdf --engine paddleocr --output result.txt

# Batch process a directory
python ocr_backend/cli.py batch --input ./documents --output ./results --engine ensemble

# Compare OCR engines
python ocr_backend/cli.py compare --input document.jpg

Python API

from ocr_backend.processors import OCRProcessor

# Initialize processor
processor = OCRProcessor(engine='ensemble')

# Process an image
result = processor.process_image('document.jpg')
print(f"Extracted text: {result['text']}")
print(f"Confidence: {result['confidence']}")
print(f"Language: {result['language']}")

# Process a PDF
result = processor.process_pdf('document.pdf')
for page_num, page_result in enumerate(result['pages']):
    print(f"Page {page_num + 1}: {page_result['text']}")

REST API

# Process an image
curl -X POST http://localhost:8000/api/ocr/image \
  -F "file=@document.jpg" \
  -F "engine=tesseract" \
  -F "languages=eng"

# Process a PDF
curl -X POST http://localhost:8000/api/ocr/pdf \
  -F "file=@document.pdf" \
  -F "engine=paddleocr"

# Get processing status
curl http://localhost:8000/api/ocr/status/job_id_here

# List supported languages
curl http://localhost:8000/api/ocr/languages

JavaScript/Node.js Integration

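const OCRClient = require('./ocr_api/client');

const client = new OCRClient('http://localhost:8000');

// CommonJS modules do not allow top-level await, so wrap the call in an async function
async function main() {
  // Process an image
  const result = await client.processImage('document.jpg', {
    engine: 'ensemble',
    languages: ['eng'],
    preprocessing: true
  });

  console.log('Text:', result.text);
  console.log('Confidence:', result.confidence);
}

main().catch((error) => console.error('OCR request failed:', error));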

📚 API Reference

Endpoints

POST /api/ocr/image

Process a single image file.

Parameters:

  • file (multipart/form-data) - Image file (JPG, PNG, TIFF, WebP)
  • engine (string, optional) - OCR engine: tesseract, easyocr, paddleocr, trocr, ensemble (default: tesseract)
  • languages (string[], optional) - Language codes (default: ['eng'])
  • preprocessing (boolean, optional) - Apply image preprocessing (default: true)

Response:

{
  "success": true,
  "text": "Extracted text content",
  "confidence": 0.95,
  "language": "eng",
  "processing_time": 1.23,
  "engine": "tesseract",
  "metadata": {
    "image_size": [1920, 1080],
    "dpi": 300
  }
}
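A client will usually want to check `success` and `confidence` before trusting the text. The following is a minimal sketch of that check against the response shape shown above; `ImageResult`, `parse_image_response`, and the 0.6 default cutoff are illustrative assumptions, not part of the API.

```python
import json
from dataclasses import dataclass


@dataclass
class ImageResult:
    text: str
    confidence: float
    engine: str


def parse_image_response(body: str, min_confidence: float = 0.6) -> ImageResult:
    """Parse a /api/ocr/image response and reject failed or low-confidence results."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise ValueError("OCR request did not succeed")
    result = ImageResult(
        text=payload["text"],
        confidence=float(payload["confidence"]),
        engine=payload["engine"],
    )
    if result.confidence < min_confidence:
        raise ValueError(
            f"confidence {result.confidence:.2f} is below {min_confidence:.2f}"
        )
    return result


sample = ('{"success": true, "text": "Extracted text content", '
          '"confidence": 0.95, "engine": "tesseract"}')
print(parse_image_response(sample))
```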

POST /api/ocr/pdf

Process a PDF document.

Parameters:

  • file (multipart/form-data) - PDF file
  • engine (string, optional) - OCR engine
  • extract_images (boolean, optional) - Extract and OCR embedded images

Response:

{
  "success": true,
  "pages": [
    {
      "page_number": 1,
      "text": "Page 1 content",
      "confidence": 0.92
    }
  ],
  "total_pages": 10,
  "processing_time": 15.67
}
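Since the PDF response is per page, callers often need to stitch the pages together and spot weak ones. Here is a small sketch that works on the response shape above; `summarize_pdf` and the 0.8 review cutoff are hypothetical choices for illustration.

```python
from statistics import mean
from typing import Any, Dict, List, Tuple


def summarize_pdf(result: Dict[str, Any],
                  review_below: float = 0.8) -> Tuple[str, float, List[int]]:
    """Join page texts in order and list pages whose confidence looks weak."""
    pages = sorted(result["pages"], key=lambda page: page["page_number"])
    full_text = "\n\n".join(page["text"] for page in pages)
    average_confidence = mean(page["confidence"] for page in pages)
    needs_review = [page["page_number"] for page in pages
                    if page["confidence"] < review_below]
    return full_text, average_confidence, needs_review


response = {
    "success": True,
    "pages": [
        {"page_number": 2, "text": "Page 2 content", "confidence": 0.71},
        {"page_number": 1, "text": "Page 1 content", "confidence": 0.92},
    ],
}
text, avg, review = summarize_pdf(response)
print(review)  # [2]
```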

POST /api/ocr/batch

Process multiple files.

Parameters:

  • files (multipart/form-data[]) - Array of files
  • engine (string, optional) - OCR engine

GET /api/ocr/status/:jobId

Get processing status for async jobs.
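The status response schema is not documented here, so the polling sketch below makes assumptions: it expects a `status` field that eventually becomes `completed` or `failed`. `wait_for_job` and `fetch_status` are hypothetical names; `fetch_status` would wrap a GET to /api/ocr/status/:jobId in real use.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(job_id: str,
                 fetch_status: Callable[[str], Dict[str, Any]],
                 timeout: float = 60.0,
                 initial_delay: float = 0.5,
                 max_delay: float = 8.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll until the job reaches a terminal state, doubling the delay each time."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        status = fetch_status(job_id)
        # "status", "completed" and "failed" are assumed names, not a documented schema.
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"job {job_id} still running after {timeout}s")
        sleep(delay)
        delay = min(delay * 2, max_delay)


# Simulated server replies, so the example runs without a live service.
replies = iter([{"status": "processing"}, {"status": "completed", "text": "ok"}])
print(wait_for_job("abc123", lambda _job_id: next(replies), sleep=lambda _s: None))
```

Exponential backoff keeps short jobs responsive while avoiding a tight polling loop on long PDFs. Where WebSocket updates are enabled, those would replace polling entirely.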

GET /api/ocr/languages

List supported languages for each engine.

GET /api/ocr/engines

List available OCR engines and their capabilities.

🔗 Integration with SynthoraAI

Backend Integration

Add to your SynthoraAI backend (backend/api/articles.js):

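const axios = require('axios');
const FormData = require('form-data'); // assumed extra dependency: npm install form-data

// OCR integration for article processing
async function processArticleWithOCR(articleUrl) {
  try {
    // Download the article image (PDFs would go to /api/ocr/pdf instead)
    const response = await axios.get(articleUrl, { responseType: 'arraybuffer' });

    // Build the multipart body; the filename makes the part upload as a file
    const formData = new FormData();
    formData.append('file', Buffer.from(response.data), { filename: 'article-image' });
    formData.append('engine', 'ensemble');

    // Send to OCR service, including the multipart boundary headers
    const ocrResult = await axios.post('http://localhost:8000/api/ocr/image', formData, {
      headers: formData.getHeaders()
    });

    return {
      extractedText: ocrResult.data.text,
      confidence: ocrResult.data.confidence
    };
  } catch (error) {
    console.error('OCR processing failed:', error);
    return null;
  }
}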

Crawler Integration

Enhance your crawler to extract text from images in articles:

// In crawler/schedule/fetchAndSummarize.ts
import { OCRProcessor } from '../utils/ocr';

const ocrProcessor = new OCRProcessor();

async function processArticle(article) {
  // Extract images from article
  const images = await extractImagesFromArticle(article.url);

  // OCR each image
  const ocrResults = await Promise.all(
    images.map(img => ocrProcessor.processImage(img))
  );

  // Combine OCR text with article content
  article.content += '\n\n' + ocrResults.map(r => r.text).join('\n\n');

  return article;
}

Frontend Integration

Add OCR upload component:

// frontend/components/OCRUpload.jsx
import { useState } from 'react';

export default function OCRUpload() {
  const [file, setFile] = useState(null);
  const [result, setResult] = useState(null);
  const [error, setError] = useState(null);
  const [loading, setLoading] = useState(false);

  const handleUpload = async () => {
    setLoading(true);
    setError(null);
    try {
      const formData = new FormData();
      formData.append('file', file);
      formData.append('engine', 'ensemble');

      const response = await fetch('http://localhost:8000/api/ocr/image', {
        method: 'POST',
        body: formData
      });
      if (!response.ok) {
        throw new Error(`OCR request failed with status ${response.status}`);
      }

      setResult(await response.json());
    } catch (err) {
      // Clear stale results and surface the failure instead of leaving the button stuck
      setResult(null);
      setError(err.message);
    } finally {
      setLoading(false);
    }
  };

  return (
    <div className="ocr-upload">
      <input type="file" onChange={(e) => setFile(e.target.files[0])} />
      <button onClick={handleUpload} disabled={!file || loading}>
        {loading ? 'Processing...' : 'Extract Text'}
      </button>
      {error && <p className="error">{error}</p>}
      {result && (
        <div className="result">
          <p>Confidence: {(result.confidence * 100).toFixed(2)}%</p>
          <pre>{result.text}</pre>
        </div>
      )}
    </div>
  );
}

πŸ—οΈ Architecture

┌─────────────────────────────────────────────────────────────┐
│                    SynthoraAI OCR System                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐      ┌──────────────┐      ┌───────────┐   │
│  │   Frontend  │─────▶│  Node.js API │─────▶│  Python   │   │
│  │  (React)    │      │  (Express)   │      │  Backend  │   │
│  └─────────────┘      └──────────────┘      └───────────┘   │
│                              │                    │         │
│                              ▼                    ▼         │
│                       ┌──────────┐          ┌─────────────┐ │
│                       │  Redis   │          │  OCR        │ │
│                       │  Cache   │          │  Engines    │ │
│                       └──────────┘          │  - Tesseract│ │
│                                             │  - EasyOCR  │ │
│                              │              │  - PaddleOCR│ │
│                              ▼              │  - TrOCR    │ │
│                       ┌──────────┐          └─────────────┘ │
│                       │ MongoDB  │                          │
│                       │ Storage  │                          │
│                       └──────────┘                          │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│                     Cloud Integrations                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐   │
│  │   AWS    │  │  Azure   │  │  Google  │  │  Vercel    │   │
│  │   S3     │  │  Blob    │  │  Cloud   │  │  Functions │   │
│  └──────────┘  └──────────┘  └──────────┘  └────────────┘   │
└─────────────────────────────────────────────────────────────┘

Component Breakdown

  1. Frontend Layer

    • React components for file upload
    • Real-time OCR progress display
    • Result visualization and editing
  2. API Layer (Node.js/Express)

    • Request validation and routing
    • File upload handling
    • WebSocket connections for real-time updates
    • Authentication and rate limiting
  3. Processing Layer (Python)

    • Image preprocessing
    • Multi-engine OCR processing
    • Post-processing and text correction
    • Confidence scoring
  4. Storage Layer

    • Redis: Caching and job queue
    • MongoDB: Document metadata and results
    • Cloud storage: Original files and processed outputs

⚙️ Configuration

Environment Variables

Create a .env file:

# OCR Backend
PYTHON_PORT=5000
OCR_ENGINES=tesseract,easyocr,paddleocr,trocr
DEFAULT_ENGINE=tesseract
ENABLE_GPU=false
MAX_FILE_SIZE=50MB

# Node.js API
NODE_PORT=8000
API_BASE_URL=http://localhost:8000
ENABLE_WEBSOCKET=true
RATE_LIMIT_REQUESTS=100
RATE_LIMIT_WINDOW=15min

# Database
MONGODB_URI=mongodb://localhost:27017/synthoraai_ocr
REDIS_URL=redis://localhost:6379

# Cloud Storage (Optional)
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
AWS_REGION=us-east-1
AWS_S3_BUCKET=synthoraai-ocr

AZURE_STORAGE_CONNECTION_STRING=your_connection_string
AZURE_CONTAINER_NAME=ocr-documents

# SynthoraAI Integration
SYNTHORAAI_API_URL=https://ai-content-curator-backend.vercel.app
SYNTHORAAI_API_KEY=your_api_key

# Logging
LOG_LEVEL=info
ENABLE_DEBUG=false
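Values such as `MAX_FILE_SIZE=50MB` and `RATE_LIMIT_WINDOW=15min` carry unit suffixes, so they need parsing before use. The sketch below shows one way to do that; `parse_size` and `parse_duration` are hypothetical helpers, and treating MB as 1024 * 1024 bytes is an assumption the project may not share.

```python
import re
from typing import Tuple

# Assumption: sizes use binary multiples (1 MB = 1024 * 1024 bytes).
SIZE_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
# Assumption: accepted duration suffixes; only "min" appears in the sample .env.
TIME_UNITS = {"s": 1, "min": 60, "h": 3600}


def _split(value: str) -> Tuple[int, str]:
    """Split a value such as '50MB' into its number and unit suffix."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", value)
    if match is None:
        raise ValueError(f"cannot parse {value!r}")
    return int(match.group(1)), match.group(2)


def parse_size(value: str) -> int:
    """Convert a size string to bytes."""
    number, unit = _split(value)
    if unit.upper() not in SIZE_UNITS:
        raise ValueError(f"unknown size unit {unit!r}")
    return number * SIZE_UNITS[unit.upper()]


def parse_duration(value: str) -> int:
    """Convert a duration string to seconds."""
    number, unit = _split(value)
    if unit.lower() not in TIME_UNITS:
        raise ValueError(f"unknown duration unit {unit!r}")
    return number * TIME_UNITS[unit.lower()]


print(parse_size("50MB"))       # 52428800
print(parse_duration("15min"))  # 900
```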

OCR Engine Configuration

Edit ocr_backend/config/engines.yaml:

tesseract:
  enabled: true
  languages: [eng, spa, fra, deu]
  config: --psm 3 --oem 3
  confidence_threshold: 0.6

easyocr:
  enabled: true
  languages: [en, es, fr, de]
  gpu: false
  confidence_threshold: 0.5

paddleocr:
  enabled: true
  lang: en
  use_angle_cls: true
  confidence_threshold: 0.5

trocr:
  enabled: false  # Requires GPU
  model: microsoft/trocr-base-handwritten
  confidence_threshold: 0.7

ensemble:
  enabled: true
  engines: [tesseract, easyocr, paddleocr]
  voting_method: weighted  # or majority
  weights:
    tesseract: 0.3
    easyocr: 0.4
    paddleocr: 0.3
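To make `voting_method: weighted` concrete, here is a simplified sketch that scores whole-document candidates by engine weight times reported confidence. It is an assumed interpretation for illustration, not the project's algorithm; practical ensembles often vote per word or per line. `weighted_vote` is a hypothetical name, and the weights are copied from the config above.

```python
from collections import defaultdict
from typing import Dict, Tuple

WEIGHTS = {"tesseract": 0.3, "easyocr": 0.4, "paddleocr": 0.3}


def weighted_vote(outputs: Dict[str, Tuple[str, float]],
                  weights: Dict[str, float]) -> str:
    """Pick the candidate text with the highest summed weight * confidence."""
    scores: Dict[str, float] = defaultdict(float)
    originals: Dict[str, str] = {}
    for engine, (text, confidence) in outputs.items():
        # Engines agree if their text matches after case and whitespace normalization.
        key = " ".join(text.split()).lower()
        scores[key] += weights.get(engine, 0.0) * confidence
        originals.setdefault(key, text)  # keep the first spelling seen
    best = max(scores, key=lambda candidate: scores[candidate])
    return originals[best]


outputs = {
    "tesseract": ("Form W-9", 0.90),
    "easyocr": ("Form W-g", 0.95),
    "paddleocr": ("form  w-9", 0.85),
}
print(weighted_vote(outputs, WEIGHTS))  # Form W-9
```

In the example, two lower-weighted engines that agree (0.27 + 0.255 = 0.525) outvote the single highest-weighted engine (0.38), which is the behavior that distinguishes weighted voting from simply trusting the most confident engine.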

🚀 Performance

Benchmarks

Tested on AWS EC2 t3.xlarge (4 vCPU, 16GB RAM):

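| Engine    | Speed (pages/min) | Accuracy | GPU Support |
|-----------|-------------------|----------|-------------|
| Tesseract | 15-20             | 85-90%   | No          |
| EasyOCR   | 8-12              | 88-93%   | Yes         |
| PaddleOCR | 20-30             | 87-92%   | Yes         |
| TrOCR     | 5-8               | 90-95%   | Yes         |
| Ensemble  | 10-15             | 92-96%   | Partial     |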

Optimization Tips

  1. Enable GPU: Set ENABLE_GPU=true for 3-5x speedup
  2. Use Preprocessing: Improves accuracy by 5-10%
  3. Enable Caching: Reduces redundant processing
  4. Batch Processing: Process multiple files in parallel
  5. Choose Right Engine:
    • Tesseract: Fast, general purpose
    • EasyOCR: High accuracy, slower
    • PaddleOCR: Best speed/accuracy balance
    • Ensemble: Highest accuracy, slower than Tesseract or PaddleOCR alone
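The batch processing tip can be sketched with a thread pool. This is an illustrative client-side pattern, not the project's batch implementation; `process_batch` and `ocr_one` are hypothetical names, and `ocr_one` stands in for whatever call performs OCR on a single file.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def process_batch(paths: Iterable[str],
                  ocr_one: Callable[[str], str],
                  max_workers: int = 4) -> Dict[str, str]:
    """Run ocr_one over all paths concurrently and map each path to its text."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip lines them up with the paths.
        texts = pool.map(ocr_one, path_list)
        return dict(zip(path_list, texts))


# A stand-in OCR function so the example runs without an OCR engine installed.
results = process_batch(["a.png", "b.png"], ocr_one=lambda path: f"text from {path}")
print(results)
```

Threads suit the case where `ocr_one` waits on an HTTP call to the OCR service. For CPU-bound OCR running in the same process, `ProcessPoolExecutor` from the same module is usually the better fit.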

🚢 Deployment

Docker Deployment

# Build images
docker-compose build

# Start all services
docker-compose up -d

# Scale processing workers
docker-compose up -d --scale ocr-worker=4

# View logs
docker-compose logs -f

# Stop services
docker-compose down

Kubernetes Deployment

# Apply configurations
kubectl apply -f k8s/

# Check status
kubectl get pods -n synthoraai-ocr

# Scale workers
kubectl scale deployment ocr-worker --replicas=5 -n synthoraai-ocr

Vercel Deployment (API Only)

cd ocr_api
vercel --prod

AWS Lambda Deployment

cd ocr_backend
./scripts/deploy-lambda.sh

🧪 Testing

Unit Tests

# Python backend
cd ocr_backend
pytest tests/ -v --cov=.

# Node.js API
cd ocr_api
npm test
npm run test:coverage

Integration Tests

# Test full pipeline
npm run test:integration

# Test specific engine
pytest tests/test_tesseract.py -v

Load Testing

# Using artillery
cd ocr_api
npm run test:load

# Using locust
cd ocr_backend
locust -f tests/locustfile.py

📊 Monitoring

Metrics

The system exposes Prometheus metrics at /metrics:

  • ocr_requests_total - Total OCR requests
  • ocr_processing_duration_seconds - Processing time histogram
  • ocr_confidence_score - Confidence score distribution
  • ocr_errors_total - Error count by type
  • ocr_queue_size - Current queue size
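For readers unfamiliar with what a /metrics endpoint returns, the sketch below hand-renders one labeled counter in the Prometheus text exposition format. It is only an illustration: a real service would normally use a Prometheus client library, `LabeledCounter` is a hypothetical class, the `type` label name is an assumption, and label values are not escaped here.

```python
from collections import Counter


class LabeledCounter:
    """A counter with a single label, rendered as Prometheus text."""

    def __init__(self, name: str, help_text: str, label: str) -> None:
        self.name = name
        self.help_text = help_text
        self.label = label
        self._values: Counter = Counter()

    def inc(self, label_value: str, amount: int = 1) -> None:
        self._values[label_value] += amount

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for value, count in sorted(self._values.items()):
            lines.append(f'{self.name}{{{self.label}="{value}"}} {count}')
        return "\n".join(lines) + "\n"


errors = LabeledCounter("ocr_errors_total", "Error count by type", "type")
errors.inc("timeout")
errors.inc("timeout")
errors.inc("decode")
print(errors.render(), end="")
```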

Logging

Logs are structured in JSON format:

{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "info",
  "service": "ocr-backend",
  "message": "Image processed successfully",
  "metadata": {
    "job_id": "abc123",
    "engine": "tesseract",
    "processing_time": 1.23,
    "confidence": 0.95
  }
}
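A log line in this shape can be produced with Python's standard `logging` module. The formatter below is a minimal sketch, not the project's logging setup; `JsonFormatter` is a hypothetical class, and passing the payload through `extra={"metadata": ...}` is an assumed convention.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format log records as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
            # Populated when the caller passes extra={"metadata": {...}}.
            "metadata": getattr(record, "metadata", {}),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("ocr-backend"))
logger = logging.getLogger("ocr")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info("Image processed successfully",
            extra={"metadata": {"job_id": "abc123", "engine": "tesseract"}})
```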

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

Development Setup

# Fork and clone
git clone https://github.com/YOUR_USERNAME/Optical-Character-Recognition.git
cd Optical-Character-Recognition

# Create a branch
git checkout -b feature/your-feature

# Make changes and test
npm test

# Commit with conventional commits
git commit -m "feat: add new OCR engine support"

# Push and create PR
git push origin feature/your-feature

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

📧 Contact


Built with ❤️ for SynthoraAI - Synthesizing the world's news & information through AI
