Skip to content

NotADevIAmaMeatPopsicle/DTAT-OCR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

40 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

DTAT OCR (Ducktape and Twine OCR)

License: MIT Python 3.12+ Code style: black

Enterprise document processor with AWS Textract integration. Handles PDFs (multi-page), Excel, CSV, Word, and images with automatic extraction, quality scoring, and multi-format output.

Drop-in replacement for AWS Textract, Google Cloud Vision, and Azure Computer Vision β€” outputs in industry-standard formats for seamless migration.

Features

  • AWS Textract integration: ~1-3s per page, 90%+ confidence, multi-page PDF support
  • Multi-format output: Textract, Google Vision, Azure OCR, or DTAT native format
  • Multi-format input: PDF, XLSX, CSV, DOCX, JPG, PNG, TIFF, and more
  • Intelligent extraction ladder: Native parsing first, Textract for images/scanned docs
  • High-volume ready: 4 concurrent workers, PostgreSQL storage, async job queue
  • Boomi integration: /ocr endpoint for direct binary passthrough (3-shape WSS process)
  • Async processing: Fire-and-forget /ocr/async with PostgreSQL-backed job tracking
  • Web UI: Drag-and-drop processing, document viewer, and settings
  • REST API: 15+ endpoints with Swagger documentation
  • Profile-based extraction: Built-in templates for invoices, receipts, W-2s, driver's licenses
  • Docker ready: CPU and GPU images available

Architecture

Document In
    β”‚
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Level 1: Native Extraction (FREE)   β”‚  PDF, Excel, CSV, Word
β”‚ pdfplumber, pandas, python-docx     β”‚
β”‚ Confidence check β†’ pass? β†’ Done βœ“   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β”‚ fail/low confidence (or image input)
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Level 2: AWS Textract (DEFAULT)     β”‚  Images, scanned/multi-page PDFs
β”‚ ~1-3 seconds per page               β”‚
β”‚ 90%+ confidence β†’ Done βœ“            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β”‚ fail
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Level 3: LightOnOCR (OPTIONAL)      β”‚  Local GPU/CPU fallback
β”‚ No cloud dependency, offline mode    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β”‚ fail
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Dead Letter Queue                   β”‚  Manual review
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Boomi Integration

DTAT-OCR includes a /ocr endpoint designed for Boomi passthrough β€” accepts raw image binary via POST and returns extracted text directly. This enables a simple 3-shape Boomi process (WSS Listener β†’ REST Connector β†’ Return Documents) with no scripting required.

Browser β†’ Boomi WSS β†’ REST Connector β†’ DTAT-OCR /ocr β†’ AWS Textract β†’ Response

Performance (Single EC2 t3.medium)

Path Throughput Estimated Daily (8hr)
Direct to DTAT 250 docs/min ~120,000 docs
Through Boomi WSS 41 docs/min ~20,000 docs

50-document burst test: 0 failures, avg 2.4s/doc, P95 3.5s/doc.

Quick Start

Installation

git clone https://github.com/MrGriff-Boomi/DTAT-OCR.git
cd DTAT-OCR

python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# .venv\Scripts\activate   # Windows

pip install -r requirements.txt
python worker.py init

Run

# Single worker (development)
python -m uvicorn api:app --host 0.0.0.0 --port 8000

# Multi-worker (production)
python -m uvicorn api:app --host 0.0.0.0 --port 8000 --workers 4

Open http://localhost:8000 in your browser.

Environment Variables

# Required for Textract
AWS_ACCESS_KEY_ID=your-key
AWS_SECRET_ACCESS_KEY=your-secret
AWS_REGION=us-east-1

# Database (SQLite default, PostgreSQL recommended for production)
DATABASE_URL=sqlite:///documents.db
# DATABASE_URL=postgresql://user:pass@localhost:5432/ocr_demo

# Authentication
DTAT_USERNAME=admin
DTAT_PASSWORD=your-secure-password

API Endpoints

Endpoint Method Description
/health GET Health check (no auth)
/stats GET Processing statistics
/process POST Upload & process via multipart form (sync)
/ocr POST Raw binary OCR β€” accepts image bytes, returns text
/ocr/async POST Fire-and-forget OCR β€” returns job ID immediately
/ocr/jobs/{job_id} GET Poll for async job result
/queue/status GET Job queue depth and avg/p95 processing times
/documents GET List all stored documents
/documents/{id} GET Get document metadata
/documents/{id}/content?format={fmt} GET Get content in Textract/Google/Azure/DTAT format
/documents/{id}/retry POST Retry failed document
/docs GET Swagger API documentation

Example Calls

# Health check
curl http://localhost:8000/health

# OCR an image (raw binary β€” for Boomi passthrough)
curl -X POST -u admin:password -H "Content-Type: image/png" \
  --data-binary @receipt.png "http://localhost:8000/ocr?format=text"

# OCR async (fire-and-forget)
curl -X POST -u admin:password -H "Content-Type: image/png" \
  --data-binary @receipt.png http://localhost:8000/ocr/async
# Returns: {"job_id": "abc-123", "status": "processing"}

# Poll for result
curl -u admin:password http://localhost:8000/ocr/jobs/abc-123

# Process document (multipart upload)
curl -X POST -u admin:password -F "file=@invoice.pdf" http://localhost:8000/process

# Get extracted content in different formats
curl -u admin:password "http://localhost:8000/documents/1/content?format=textract"
curl -u admin:password "http://localhost:8000/documents/1/content?format=google"
curl -u admin:password "http://localhost:8000/documents/1/content?format=azure"

# Queue monitoring
curl -u admin:password http://localhost:8000/queue/status

Multi-Format Output

Output OCR results in industry-standard formats for drop-in replacement of commercial OCR services.

Format Description Use Case
Textract AWS Textract-compatible Default, enterprise standard
Google Google Cloud Vision-compatible Migrate from Google Vision
Azure Azure Computer Vision-compatible Migrate from Azure OCR
DTAT Native format Simple text + tables + metadata

Supported Input Formats

Format Method Multi-Page Notes
PDF (digital) Native Yes pdfplumber β€” text + tables
PDF (scanned) Textract Yes (up to 5MB) analyze_document with TABLES+FORMS
Excel (.xlsx) Native All sheets pandas + openpyxl
CSV Native Yes pandas
Word (.docx) Native Full document python-docx
Images Textract Single image JPG, PNG, TIFF, BMP, GIF, WebP

Docker

# CPU image
docker build -t dtat-ocr:cpu .
docker run -p 8000:8000 -v $(pwd)/data:/app/data dtat-ocr:cpu

# Docker Compose (includes PostgreSQL)
docker-compose up --build

Configuration

Edit config.py or use the Web UI at /ui/settings:

Setting Default Description
enable_native_extraction True Level 1: Free parsing (PDF, Excel, Word)
enable_textract True Level 2: AWS Textract (~1-3s per page)
enable_local_ocr False Level 3: LightOnOCR (local, slow on CPU)
min_confidence_score 60 Threshold to escalate to next level
max_retries_per_level 2 Retries before escalating

Project Structure

DTAT-OCR/
β”œβ”€β”€ api.py                    # FastAPI REST endpoints + Web UI + async job system
β”œβ”€β”€ config.py                 # Configuration and feature toggles
β”œβ”€β”€ database.py               # SQLAlchemy models, PostgreSQL/SQLite support
β”œβ”€β”€ extraction_pipeline.py    # Extraction ladder, Textract, quality scoring
β”œβ”€β”€ formatters.py             # Multi-format output converters
β”œβ”€β”€ worker.py                 # CLI for batch processing
β”œβ”€β”€ profiles.py               # Extraction profile system
β”œβ”€β”€ extractors.py             # Field extraction strategies
β”œβ”€β”€ templates/                # Web UI (Tailwind + HTMX)
β”œβ”€β”€ tests/                    # Test suite (8 files)
β”œβ”€β”€ docs/
β”‚   β”œβ”€β”€ adr/                  # Architecture Decision Records
β”‚   β”œβ”€β”€ OCR-API-FORMATS.md    # Format specifications
β”‚   β”œβ”€β”€ PROFILE-TEMPLATES.md  # Built-in extraction profiles
β”‚   └── TASK-HIGH-VOLUME.md   # High-volume optimization task + results
β”œβ”€β”€ Dockerfile                # CPU Docker image (4 workers)
β”œβ”€β”€ docker-compose.yml        # Local dev with PostgreSQL
└── requirements.txt          # Python dependencies

Roadmap

Current Status: Production-Ready (v2.1.0)

Completed:

  • AWS Textract integration (1-3s/page, 90%+ confidence)
  • Multi-page PDF support (up to 5MB via sync Textract API)
  • Large PDF support (>5MB via S3 upload + async Textract API)
  • Password-protected PDF detection with clear user error
  • Textract rate limit retry (adaptive backoff, 5 max attempts)
  • Multi-format output (Textract, Google Vision, Azure OCR)
  • High-volume: 4 workers, PostgreSQL, async job queue
  • Boomi integration via /ocr binary passthrough
  • Load tested: 50-doc burst, 0 failures, 250 docs/min direct
  • Web UI with settings page (AWS credentials, auth, database config)
  • Swagger API docs, profile-based extraction

Planned:

  • File validation (magic bytes, corrupt file detection)
  • Blank page detection
  • Content hash dedup (skip re-processing identical documents)
  • Handwriting detection mode
  • SQS integration for guaranteed message delivery
  • ECS Fargate with auto-scaling
  • Boomi Event Streams decoupling for higher WSS throughput

Architecture Decision Records

ADR Decision
001 Replace PyMuPDF with pdfplumber (licensing)
002 Multi-worker + PostgreSQL + session reuse (10x throughput)
003 Sync Textract for multi-page PDFs, Bridge mode for Boomi

License

MIT License. All dependencies are permissively licensed (MIT/BSD/Apache 2.0) except psycopg2-binary which is LGPL-3.0. LGPL permits commercial use as a library dependency without requiring your code to be open-sourced.

About

Swiss Army Knife document processor with OCR fallback. Handles PDFs, Excel, CSV, Word, and images with automatic retry logic and quality scoring.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors