Advanced OCR system for SynthoraAI - Extracting text from government documents, images, and PDFs to enhance article curation and accessibility.
The SynthoraAI OCR system is a production-ready, multi-engine Optical Character Recognition solution designed to extract text from:
- Scanned government documents
- PDF files (searchable and non-searchable)
- Images (JPG, PNG, TIFF, WebP)
- Screenshots and infographics
- Historical documents with degraded quality
This system integrates seamlessly with the SynthoraAI AI-Powered Article Content Curator to enable text extraction from non-digital sources, making government information more accessible.
- Tesseract OCR - Industry standard, 100+ languages
- EasyOCR - Deep learning-based, excellent accuracy
- PaddleOCR - Ultra-fast, multilingual support
- TrOCR (Microsoft) - Transformer-based for handwriting
- Ensemble Mode - Combines results from multiple engines
- PDF Processing - Extract text from both searchable and scanned PDFs
- Image Preprocessing - Deskewing, noise removal, contrast enhancement
- Layout Analysis - Preserve document structure and formatting
- Table Extraction - Extract data from tables in documents
- Handwriting Recognition - Process handwritten government forms
- RESTful API - Easy integration with SynthoraAI backend
- WebSocket Support - Real-time OCR progress updates
- Batch Processing - Process multiple documents simultaneously
- Cloud Storage - S3, Azure Blob, Google Cloud Storage support
- Webhook Notifications - Get notified when OCR completes
- Confidence Scoring - Get reliability metrics for extracted text
- Language Detection - Automatic language identification
- Post-Processing - Spell checking and text correction
- GPU Acceleration - CUDA support for faster processing
- Caching - Redis-based caching for processed documents
- Quick Start
- Installation
- Usage
- API Reference
- Integration with SynthoraAI
- Architecture
- Configuration
- Performance
- Deployment
- Testing
- Contributing
- License
# Clone the repository
git clone https://github.com/SynthoraAI-AI-News-Content-Curator/Optical-Character-Recognition.git
cd Optical-Character-Recognition
# Start all services
docker-compose up -d
# Test the OCR API
curl -X POST http://localhost:8000/api/ocr/image \
-F "[email protected]" \
-F "engine=ensemble"# Install Python dependencies
cd ocr_backend
pip install -r requirements.txt
python -m spacy download en_core_web_sm
# Install Node.js dependencies
cd ../ocr_api
npm install
# Set up environment variables
cp .env.example .env
# Edit .env with your configuration
# Start the services
# Terminal 1: Python OCR backend
cd ocr_backend
python main.py
# Terminal 2: Node.js API
cd ocr_api
npm run dev

Prerequisites:

- Python 3.8+ with pip
- Node.js 18+ with npm
- Tesseract OCR 4.0+
- Redis (optional, for caching)
- MongoDB (optional, for document storage)
sudo apt-get update
sudo apt-get install -y \
tesseract-ocr \
tesseract-ocr-eng \
libtesseract-dev \
poppler-utils \
libsm6 \
libxext6 \
libxrender-dev \
  libgomp1

brew install tesseract
brew install poppler

# Install via chocolatey
choco install tesseract
choco install poppler

cd ocr_backend
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Download language models
python -c "import easyocr; easyocr.Reader(['en'])"
python -m spacy download en_core_web_sm

cd ocr_api
# Install dependencies
npm install
# For development
npm install -D nodemon @types/node

# Process a single image
python ocr_backend/cli.py process --input document.jpg --engine tesseract
# Process a PDF
python ocr_backend/cli.py process --input document.pdf --engine paddleocr --output result.txt
# Batch process a directory
python ocr_backend/cli.py batch --input ./documents --output ./results --engine ensemble
# Compare OCR engines
python ocr_backend/cli.py compare --input document.jpg

from ocr_backend.processors import OCRProcessor
# Initialize processor
processor = OCRProcessor(engine='ensemble')
# Process an image
result = processor.process_image('document.jpg')
print(f"Extracted text: {result['text']}")
print(f"Confidence: {result['confidence']}")
print(f"Language: {result['language']}")
# Process a PDF
result = processor.process_pdf('document.pdf')
for page_num, page_result in enumerate(result['pages']):
    print(f"Page {page_num + 1}: {page_result['text']}")

# Process an image
curl -X POST http://localhost:8000/api/ocr/image \
-F "[email protected]" \
-F "engine=tesseract" \
-F "languages=eng"
# Process a PDF
curl -X POST http://localhost:8000/api/ocr/pdf \
-F "[email protected]" \
-F "engine=paddleocr"
# Get processing status
curl http://localhost:8000/api/ocr/status/job_id_here
# List supported languages
curl http://localhost:8000/api/ocr/languages

const OCRClient = require('./ocr_api/client');
const client = new OCRClient('http://localhost:8000');
// Process an image
const result = await client.processImage('document.jpg', {
  engine: 'ensemble',
  languages: ['eng'],
  preprocessing: true
});

console.log('Text:', result.text);
console.log('Confidence:', result.confidence);

`POST /api/ocr/image`

Process a single image file.
Parameters:
- `file` (multipart/form-data) - Image file (JPG, PNG, TIFF, WebP)
- `engine` (string, optional) - OCR engine: `tesseract`, `easyocr`, `paddleocr`, `trocr`, `ensemble` (default: `tesseract`)
- `languages` (string[], optional) - Language codes (default: `['eng']`)
- `preprocessing` (boolean, optional) - Apply image preprocessing (default: `true`)
Response:
{
  "success": true,
  "text": "Extracted text content",
  "confidence": 0.95,
  "language": "eng",
  "processing_time": 1.23,
  "engine": "tesseract",
  "metadata": {
    "image_size": [1920, 1080],
    "dpi": 300
  }
}

`POST /api/ocr/pdf`

Process a PDF document.
Parameters:
- `file` (multipart/form-data) - PDF file
- `engine` (string, optional) - OCR engine
- `extract_images` (boolean, optional) - Extract and OCR embedded images
Response:
{
  "success": true,
  "pages": [
    {
      "page_number": 1,
      "text": "Page 1 content",
      "confidence": 0.92
    }
  ],
  "total_pages": 10,
  "processing_time": 15.67
}

Process multiple files.
Parameters:
- `files` (multipart/form-data[]) - Array of files
- `engine` (string, optional) - OCR engine
`GET /api/ocr/status/:job_id`

Get processing status for async jobs.

`GET /api/ocr/languages`

List supported languages for each engine.
List available OCR engines and their capabilities.
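The async status endpoint above is designed to be polled until a job reaches a terminal state. A minimal Python polling helper is sketched below; the `state` field and its `completed`/`failed` values are assumptions about the response shape, so adjust them to match the actual API:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"

def status_url(job_id: str) -> str:
    """Build the status endpoint URL for a job."""
    return f"{BASE_URL}/api/ocr/status/{job_id}"

def is_terminal(status: dict) -> bool:
    """A job is done once its state is 'completed' or 'failed'
    (assumed state names; adjust to the real API)."""
    return status.get("state") in ("completed", "failed")

def poll_job(job_id: str, interval: float = 2.0, timeout: float = 300.0) -> dict:
    """Poll the status endpoint until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(status_url(job_id)) as resp:
            status = json.loads(resp.read())
        if is_terminal(status):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

A fixed 2-second interval is fine for short jobs; for large batch PDFs, exponential backoff or the WebSocket channel avoids hammering the API.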
Add to your SynthoraAI backend (backend/api/articles.js):
const axios = require('axios');
// OCR integration for article processing
async function processArticleWithOCR(articleUrl) {
  try {
    // Download the article image/PDF
    const response = await axios.get(articleUrl, { responseType: 'arraybuffer' });

    // Send to OCR service
    const formData = new FormData();
    formData.append('file', response.data);
    formData.append('engine', 'ensemble');

    const ocrResult = await axios.post('http://localhost:8000/api/ocr/image', formData);

    return {
      extractedText: ocrResult.data.text,
      confidence: ocrResult.data.confidence
    };
  } catch (error) {
    console.error('OCR processing failed:', error);
    return null;
  }
}

Enhance your crawler to extract text from images in articles:
// In crawler/schedule/fetchAndSummarize.ts
import { OCRProcessor } from '../utils/ocr';
const ocrProcessor = new OCRProcessor();
async function processArticle(article) {
  // Extract images from article
  const images = await extractImagesFromArticle(article.url);

  // OCR each image
  const ocrResults = await Promise.all(
    images.map(img => ocrProcessor.processImage(img))
  );

  // Combine OCR text with article content
  article.content += '\n\n' + ocrResults.map(r => r.text).join('\n\n');

  return article;
}

Add OCR upload component:
// frontend/components/OCRUpload.jsx
import { useState } from 'react';
export default function OCRUpload() {
  const [file, setFile] = useState(null);
  const [result, setResult] = useState(null);
  const [loading, setLoading] = useState(false);

  const handleUpload = async () => {
    setLoading(true);
    const formData = new FormData();
    formData.append('file', file);
    formData.append('engine', 'ensemble');

    const response = await fetch('http://localhost:8000/api/ocr/image', {
      method: 'POST',
      body: formData
    });

    const data = await response.json();
    setResult(data);
    setLoading(false);
  };

  return (
    <div className="ocr-upload">
      <input type="file" onChange={(e) => setFile(e.target.files[0])} />
      <button onClick={handleUpload} disabled={!file || loading}>
        {loading ? 'Processing...' : 'Extract Text'}
      </button>
      {result && (
        <div className="result">
          <p>Confidence: {(result.confidence * 100).toFixed(2)}%</p>
          <pre>{result.text}</pre>
        </div>
      )}
    </div>
  );
}

+----------------------------------------------------------------+
|                      SynthoraAI OCR System                     |
+----------------------------------------------------------------+
|                                                                |
|   +-----------+      +--------------+      +-----------+       |
|   |  Frontend | ---> |  Node.js API | ---> |  Python   |       |
|   |  (React)  |      |  (Express)   |      |  Backend  |       |
|   +-----------+      +--------------+      +-----------+       |
|                             |                    |             |
|                             v                    v             |
|                       +-----------+       +--------------+     |
|                       |   Redis   |       | OCR Engines  |     |
|                       |   Cache   |       | - Tesseract  |     |
|                       +-----------+       | - EasyOCR    |     |
|                             |             | - PaddleOCR  |     |
|                             v             | - TrOCR      |     |
|                       +-----------+       +--------------+     |
|                       |  MongoDB  |                            |
|                       |  Storage  |                            |
|                       +-----------+                            |
|                                                                |
+----------------------------------------------------------------+
|                       Cloud Integrations                       |
|   +--------+    +--------+    +---------+    +------------+    |
|   |  AWS   |    | Azure  |    | Google  |    |   Vercel   |    |
|   |  S3    |    |  Blob  |    |  Cloud  |    |  Functions |    |
|   +--------+    +--------+    +---------+    +------------+    |
+----------------------------------------------------------------+
- **Frontend Layer**
  - React components for file upload
  - Real-time OCR progress display
  - Result visualization and editing

- **API Layer (Node.js/Express)**
  - Request validation and routing
  - File upload handling
  - WebSocket connections for real-time updates
  - Authentication and rate limiting

- **Processing Layer (Python)**
  - Image preprocessing
  - Multi-engine OCR processing
  - Post-processing and text correction
  - Confidence scoring

- **Storage Layer**
  - Redis: Caching and job queue
  - MongoDB: Document metadata and results
  - Cloud storage: Original files and processed outputs
Create a .env file:
# OCR Backend
PYTHON_PORT=5000
OCR_ENGINES=tesseract,easyocr,paddleocr,trocr
DEFAULT_ENGINE=tesseract
ENABLE_GPU=false
MAX_FILE_SIZE=50MB
# Node.js API
NODE_PORT=8000
API_BASE_URL=http://localhost:8000
ENABLE_WEBSOCKET=true
RATE_LIMIT_REQUESTS=100
RATE_LIMIT_WINDOW=15min
# Database
MONGODB_URI=mongodb://localhost:27017/synthoraai_ocr
REDIS_URL=redis://localhost:6379
# Cloud Storage (Optional)
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
AWS_REGION=us-east-1
AWS_S3_BUCKET=synthoraai-ocr
AZURE_STORAGE_CONNECTION_STRING=your_connection_string
AZURE_CONTAINER_NAME=ocr-documents
# SynthoraAI Integration
SYNTHORAAI_API_URL=https://ai-content-curator-backend.vercel.app
SYNTHORAAI_API_KEY=your_api_key
# Logging
LOG_LEVEL=info
ENABLE_DEBUG=false

Edit ocr_backend/config/engines.yaml:
tesseract:
  enabled: true
  languages: [eng, spa, fra, deu]
  config: --psm 3 --oem 3
  confidence_threshold: 0.6

easyocr:
  enabled: true
  languages: [en, es, fr, de]
  gpu: false
  confidence_threshold: 0.5

paddleocr:
  enabled: true
  lang: en
  use_angle_cls: true
  confidence_threshold: 0.5

trocr:
  enabled: false  # Requires GPU
  model: microsoft/trocr-base-handwritten
  confidence_threshold: 0.7

ensemble:
  enabled: true
  engines: [tesseract, easyocr, paddleocr]
  voting_method: weighted  # or majority
  weights:
    tesseract: 0.3
    easyocr: 0.4
    paddleocr: 0.3

Tested on AWS EC2 t3.xlarge (4 vCPU, 16GB RAM):
| Engine | Speed (pages/min) | Accuracy | GPU Support |
|---|---|---|---|
| Tesseract | 15-20 | 85-90% | No |
| EasyOCR | 8-12 | 88-93% | Yes |
| PaddleOCR | 20-30 | 87-92% | Yes |
| TrOCR | 5-8 | 90-95% | Yes |
| Ensemble | 10-15 | 92-96% | Partial |
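One reason the ensemble row can beat any single engine is confidence-weighted voting over per-engine outputs. The sketch below uses the weights from the ensemble configuration; it is a simplification that picks a whole-result winner, whereas a production implementation might align and merge text at the word level:

```python
# Weights mirror the ensemble section of engines.yaml.
WEIGHTS = {"tesseract": 0.3, "easyocr": 0.4, "paddleocr": 0.3}

def ensemble_vote(results: dict, weights: dict = WEIGHTS) -> dict:
    """Pick the engine result with the highest weight-adjusted confidence.

    `results` maps engine name -> {"text": ..., "confidence": ...}.
    """
    best_engine = max(
        results,
        key=lambda e: weights.get(e, 0.0) * results[e]["confidence"],
    )
    chosen = results[best_engine]
    return {"engine": best_engine,
            "text": chosen["text"],
            "confidence": chosen["confidence"]}

# Example: EasyOCR's higher weight beats PaddleOCR's slightly higher
# raw confidence (0.4 * 0.90 = 0.36 vs 0.3 * 0.91 = 0.273).
outputs = {
    "tesseract": {"text": "Notice of Hearing", "confidence": 0.82},
    "easyocr":   {"text": "Notice of Hearing", "confidence": 0.90},
    "paddleocr": {"text": "Notlce of Hearing", "confidence": 0.91},
}
```

Weighting lets a generally more reliable engine win even when a noisier one reports higher raw confidence, which is why PaddleOCR's misread ("Notlce") loses here.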
- Enable GPU: Set `ENABLE_GPU=true` for a 3-5x speedup
- Use Preprocessing: Improves accuracy by 5-10%
- Enable Caching: Reduces redundant processing
- Batch Processing: Process multiple files in parallel
- Choose the Right Engine:
  - Tesseract: Fast, general purpose
  - EasyOCR: High accuracy, slower
  - PaddleOCR: Best speed/accuracy balance
  - Ensemble: Highest accuracy, slowest
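The batch-processing tip above amounts to fanning files out over a worker pool. A minimal standard-library sketch is shown below; `run_engine` is a stub standing in for a real per-file OCR call:

```python
from concurrent.futures import ThreadPoolExecutor

def run_engine(path: str) -> dict:
    """Stub OCR call; a real worker would invoke the OCR backend here."""
    return {"file": path, "text": f"text from {path}", "confidence": 0.9}

def batch_process(paths, max_workers: int = 4):
    """OCR many files in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_engine, paths))
```

Threads are appropriate when the heavy lifting happens in native OCR libraries or over HTTP; for pure-Python CPU-bound work, `ProcessPoolExecutor` with the same `map` interface sidesteps the GIL.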
# Build images
docker-compose build
# Start all services
docker-compose up -d
# Scale processing workers
docker-compose up -d --scale ocr-worker=4
# View logs
docker-compose logs -f
# Stop services
docker-compose down

# Apply configurations
kubectl apply -f k8s/
# Check status
kubectl get pods -n synthoraai-ocr
# Scale workers
kubectl scale deployment ocr-worker --replicas=5 -n synthoraai-ocr

cd ocr_api
vercel --prod

cd ocr_backend
./scripts/deploy-lambda.sh

# Python backend
cd ocr_backend
pytest tests/ -v --cov=.
# Node.js API
cd ocr_api
npm test
npm run test:coverage

# Test full pipeline
npm run test:integration
# Test specific engine
pytest tests/test_tesseract.py -v

# Using artillery
cd ocr_api
npm run test:load
# Using locust
cd ocr_backend
locust -f tests/locustfile.py

The system exposes Prometheus metrics at /metrics:
- `ocr_requests_total` - Total OCR requests
- `ocr_processing_duration_seconds` - Processing time histogram
- `ocr_confidence_score` - Confidence score distribution
- `ocr_errors_total` - Error count by type
- `ocr_queue_size` - Current queue size
Logs are structured in JSON format:
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "info",
  "service": "ocr-backend",
  "message": "Image processed successfully",
  "metadata": {
    "job_id": "abc123",
    "engine": "tesseract",
    "processing_time": 1.23,
    "confidence": 0.95
  }
}

We welcome contributions! Please see CONTRIBUTING.md for details.
# Fork and clone
git clone https://github.com/YOUR_USERNAME/Optical-Character-Recognition.git
cd Optical-Character-Recognition
# Create a branch
git checkout -b feature/your-feature
# Make changes and test
npm test
# Commit with conventional commits
git commit -m "feat: add new OCR engine support"
# Push and create PR
git push origin feature/your-feature

This project is licensed under the MIT License - see LICENSE file for details.
- Project Lead: David Nguyen
- Email: [email protected]
- SynthoraAI: https://synthoraai.vercel.app
- Issues: https://github.com/SynthoraAI-AI-News-Content-Curator/Optical-Character-Recognition/issues
Built with ❤️ for SynthoraAI - Synthesizing the world's news & information through AI