Enterprise document processor with AWS Textract integration. Handles PDFs (multi-page), Excel, CSV, Word, and images with automatic extraction, quality scoring, and multi-format output.
Drop-in replacement for AWS Textract, Google Cloud Vision, and Azure Computer Vision, with output in industry-standard formats for seamless migration.
- AWS Textract integration: ~1-3s per page, 90%+ confidence, multi-page PDF support
- Multi-format output: Textract, Google Vision, Azure OCR, or DTAT native format
- Multi-format input: PDF, XLSX, CSV, DOCX, JPG, PNG, TIFF, and more
- Intelligent extraction ladder: Native parsing first, Textract for images/scanned docs
- High-volume ready: 4 concurrent workers, PostgreSQL storage, async job queue
- Boomi integration: `/ocr` endpoint for direct binary passthrough (3-shape WSS process)
- Async processing: Fire-and-forget `/ocr/async` with PostgreSQL-backed job tracking
- Web UI: Drag-and-drop processing, document viewer, and settings
- REST API: 15+ endpoints with Swagger documentation
- Profile-based extraction: Built-in templates for invoices, receipts, W-2s, driver's licenses
- Docker ready: CPU and GPU images available
Document In
    │
    ▼
┌─────────────────────────────────────┐
│  Level 1: Native Extraction (FREE)  │  ← PDF, Excel, CSV, Word
│  pdfplumber, pandas, python-docx    │
│  Confidence check → pass? → Done ✓  │
└─────────────────────────────────────┘
    │ fail/low confidence (or image input)
    ▼
┌─────────────────────────────────────┐
│  Level 2: AWS Textract (DEFAULT)    │  ← Images, scanned/multi-page PDFs
│  ~1-3 seconds per page              │
│  90%+ confidence → Done ✓           │
└─────────────────────────────────────┘
    │ fail
    ▼
┌─────────────────────────────────────┐
│  Level 3: LightOnOCR (OPTIONAL)     │  ← Local GPU/CPU fallback
│  No cloud dependency, offline mode  │
└─────────────────────────────────────┘
    │ fail
    ▼
┌─────────────────────────────────────┐
│  Dead Letter Queue                  │  ← Manual review
└─────────────────────────────────────┘
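
The escalation logic in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the code in extraction_pipeline.py: the `Extraction` type, `run_ladder`, and the two stub extractors are hypothetical names, and the threshold of 60 is borrowed from the `min_confidence_score` default listed in the configuration table.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Extraction:
    text: str
    confidence: float  # 0-100, assumed to share the scale of min_confidence_score


# An extractor returns an Extraction, or None when it cannot handle the input.
Extractor = Callable[[bytes], Optional[Extraction]]


def run_ladder(
    doc: bytes,
    levels: Sequence[Tuple[str, Extractor]],
    min_confidence: float = 60.0,
) -> Tuple[str, Optional[Extraction]]:
    """Try each level in order and stop at the first confident result."""
    for name, extract in levels:
        try:
            result = extract(doc)
        except Exception:
            result = None  # a crashed level counts as a failure and escalates
        if result is not None and result.confidence >= min_confidence:
            return name, result
    # Every level failed or scored too low: hand off for manual review.
    return "dead_letter", None


# Stub extractors standing in for the real native and Textract calls.
def native_stub(doc: bytes) -> Optional[Extraction]:
    return Extraction(text="", confidence=10.0)  # e.g. a scanned PDF with no text layer


def textract_stub(doc: bytes) -> Optional[Extraction]:
    return Extraction(text="INVOICE 42", confidence=95.0)


level, result = run_ladder(b"%PDF-", [("native", native_stub), ("textract", textract_stub)])
print(level, result.text if result else None)
```

Keeping the levels as an ordered list of callables makes it easy to add or disable a level without touching the loop.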
DTAT-OCR includes a /ocr endpoint designed for Boomi passthrough: it accepts a raw image binary via POST and returns the extracted text directly. This enables a simple 3-shape Boomi process (WSS Listener → REST Connector → Return Documents) with no scripting required.
Browser → Boomi WSS → REST Connector → DTAT-OCR /ocr → AWS Textract → Response
| Path | Throughput | Estimated Daily (8hr) |
|---|---|---|
| Direct to DTAT | 250 docs/min | ~120,000 docs |
| Through Boomi WSS | 41 docs/min | ~20,000 docs |
50-document burst test: 0 failures, avg 2.4s/doc, P95 3.5s/doc.
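
The daily estimates in the table follow from the per-minute throughput sustained over an 8-hour day. A quick check of that arithmetic (the function name `estimated_daily` is purely illustrative):

```python
def estimated_daily(docs_per_minute: float, hours: float = 8.0) -> int:
    """Documents processed in one working day at a sustained per-minute rate."""
    return round(docs_per_minute * 60 * hours)


print(estimated_daily(250))  # 120000
print(estimated_daily(41))   # 19680, which the table rounds to ~20,000
```

These figures assume the burst rate holds for the full 8 hours; the load test above covers only a 50-document burst.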
git clone https://github.com/MrGriff-Boomi/DTAT-OCR.git
cd DTAT-OCR
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# .venv\Scripts\activate # Windows
pip install -r requirements.txt
python worker.py init
# Single worker (development)
python -m uvicorn api:app --host 0.0.0.0 --port 8000
# Multi-worker (production)
python -m uvicorn api:app --host 0.0.0.0 --port 8000 --workers 4

Open http://localhost:8000 in your browser.

# Required for Textract
AWS_ACCESS_KEY_ID=your-key
AWS_SECRET_ACCESS_KEY=your-secret
AWS_REGION=us-east-1
# Database (SQLite default, PostgreSQL recommended for production)
DATABASE_URL=sqlite:///documents.db
# DATABASE_URL=postgresql://user:pass@localhost:5432/ocr_demo
# Authentication
DTAT_USERNAME=admin
DTAT_PASSWORD=your-secure-password

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check (no auth) |
| `/stats` | GET | Processing statistics |
| `/process` | POST | Upload & process via multipart form (sync) |
| `/ocr` | POST | Raw binary OCR: accepts image bytes, returns text |
| `/ocr/async` | POST | Fire-and-forget OCR: returns job ID immediately |
| `/ocr/jobs/{job_id}` | GET | Poll for async job result |
| `/queue/status` | GET | Job queue depth and avg/p95 processing times |
| `/documents` | GET | List all stored documents |
| `/documents/{id}` | GET | Get document metadata |
| `/documents/{id}/content?format={fmt}` | GET | Get content in Textract/Google/Azure/DTAT format |
| `/documents/{id}/retry` | POST | Retry failed document |
| `/docs` | GET | Swagger API documentation |

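
For callers that prefer Python to curl, the fire-and-forget flow (`/ocr/async`, then `/ocr/jobs/{job_id}`) might be driven by a small polling helper like the one below. It is a sketch under stated assumptions: the helper names are hypothetical, and the only status value documented here is `"processing"`, so any other status is treated as terminal.

```python
import base64
import json
import time
import urllib.request
from typing import Callable, Dict


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Build the HTTP Basic auth header that `curl -u user:pass` would send."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def http_get_json(url: str, headers: Dict[str, str]) -> dict:
    """GET a URL and decode the JSON response body."""
    req = urllib.request.Request(url, headers=headers, method="GET")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict],
    base_url: str = "http://localhost:8000",
    interval: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll /ocr/jobs/{job_id} until the job leaves the "processing" state."""
    url = f"{base_url}/ocr/jobs/{job_id}"
    for _ in range(max_polls):
        job = fetch(url)
        # Assumption: any status other than "processing" is terminal.
        if job.get("status") != "processing":
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} still processing after {max_polls} polls")


# Usage against a running server (not executed here):
#   headers = basic_auth_header("admin", "password")
#   result = wait_for_job("abc-123", lambda url: http_get_json(url, headers))
```

Passing `fetch` and `sleep` as parameters keeps the polling loop testable without a live server.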
# Health check
curl http://localhost:8000/health
# OCR an image (raw binary, for Boomi passthrough)
curl -X POST -u admin:password -H "Content-Type: image/png" \
--data-binary @receipt.png "http://localhost:8000/ocr?format=text"
# OCR async (fire-and-forget)
curl -X POST -u admin:password -H "Content-Type: image/png" \
--data-binary @receipt.png http://localhost:8000/ocr/async
# Returns: {"job_id": "abc-123", "status": "processing"}
# Poll for result
curl -u admin:password http://localhost:8000/ocr/jobs/abc-123
# Process document (multipart upload)
curl -X POST -u admin:password -F "file=@invoice.pdf" http://localhost:8000/process
# Get extracted content in different formats
curl -u admin:password "http://localhost:8000/documents/1/content?format=textract"
curl -u admin:password "http://localhost:8000/documents/1/content?format=google"
curl -u admin:password "http://localhost:8000/documents/1/content?format=azure"
# Queue monitoring
curl -u admin:password http://localhost:8000/queue/status

Output OCR results in industry-standard formats for drop-in replacement of commercial OCR services.

| Format | Description | Use Case |
|---|---|---|
| Textract | AWS Textract-compatible | Default, enterprise standard |
| Google | Google Cloud Vision-compatible | Migrate from Google Vision |
| Azure | Azure Computer Vision-compatible | Migrate from Azure OCR |
| DTAT | Native format | Simple text + tables + metadata |
| Format | Method | Multi-Page | Notes |
|---|---|---|---|
| PDF (digital) | Native | Yes | pdfplumber: text + tables |
| PDF (scanned) | Textract | Yes (up to 5MB) | analyze_document with TABLES+FORMS |
| Excel (.xlsx) | Native | All sheets | pandas + openpyxl |
| CSV | Native | Yes | pandas |
| Word (.docx) | Native | Full document | python-docx |
| Images | Textract | Single image | JPG, PNG, TIFF, BMP, GIF, WebP |
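
The table above implies a simple routing rule: the file type decides which extraction level is tried first. A rough sketch of that rule follows. The function name and the `.jpeg`/`.tif` aliases are assumptions, and a file extension alone cannot tell a digital PDF from a scanned one, so PDFs start at the native level and rely on the confidence check to escalate.

```python
from pathlib import PurePath

# Extensions that start at Level 1 (native parsing).
NATIVE_EXTENSIONS = {".pdf", ".xlsx", ".csv", ".docx"}
# Extensions that go straight to Level 2 (Textract); .jpeg and .tif are assumed aliases.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp", ".gif", ".webp"}


def first_method(filename: str) -> str:
    """Return the extraction level to try first for a given file name."""
    ext = PurePath(filename).suffix.lower()
    if ext in NATIVE_EXTENSIONS:
        return "native"
    if ext in IMAGE_EXTENSIONS:
        return "textract"
    raise ValueError(f"unsupported file type: {ext or filename}")


print(first_method("invoice.pdf"))  # native
print(first_method("receipt.PNG"))  # textract
```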
# CPU image
docker build -t dtat-ocr:cpu .
docker run -p 8000:8000 -v $(pwd)/data:/app/data dtat-ocr:cpu
# Docker Compose (includes PostgreSQL)
docker-compose up --build

Edit config.py or use the Web UI at /ui/settings:

| Setting | Default | Description |
|---|---|---|
| `enable_native_extraction` | `True` | Level 1: Free parsing (PDF, Excel, Word) |
| `enable_textract` | `True` | Level 2: AWS Textract (~1-3s per page) |
| `enable_local_ocr` | `False` | Level 3: LightOnOCR (local, slow on CPU) |
| `min_confidence_score` | `60` | Threshold to escalate to next level |
| `max_retries_per_level` | `2` | Retries before escalating |

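
To make the effect of the toggles concrete, the sketch below models the five settings as a dataclass with the defaults listed above and derives which ladder levels would be active. The class and function names are hypothetical; the real config.py may be organized quite differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LadderSettings:
    # Field names and defaults mirror the settings table; the class itself is illustrative.
    enable_native_extraction: bool = True
    enable_textract: bool = True
    enable_local_ocr: bool = False
    min_confidence_score: int = 60
    max_retries_per_level: int = 2


def enabled_levels(settings: LadderSettings) -> List[str]:
    """List the active extraction levels in escalation order."""
    toggles = [
        ("native", settings.enable_native_extraction),
        ("textract", settings.enable_textract),
        ("local_ocr", settings.enable_local_ocr),
    ]
    return [name for name, enabled in toggles if enabled]


print(enabled_levels(LadderSettings()))                       # ['native', 'textract']
print(enabled_levels(LadderSettings(enable_local_ocr=True)))  # ['native', 'textract', 'local_ocr']
```

With the defaults, a document that fails both native parsing and Textract goes straight to the dead letter queue, because the local OCR level is switched off.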
DTAT-OCR/
├── api.py                    # FastAPI REST endpoints + Web UI + async job system
├── config.py                 # Configuration and feature toggles
├── database.py               # SQLAlchemy models, PostgreSQL/SQLite support
├── extraction_pipeline.py    # Extraction ladder, Textract, quality scoring
├── formatters.py             # Multi-format output converters
├── worker.py                 # CLI for batch processing
├── profiles.py               # Extraction profile system
├── extractors.py             # Field extraction strategies
├── templates/                # Web UI (Tailwind + HTMX)
├── tests/                    # Test suite (8 files)
├── docs/
│   ├── adr/                  # Architecture Decision Records
│   ├── OCR-API-FORMATS.md    # Format specifications
│   ├── PROFILE-TEMPLATES.md  # Built-in extraction profiles
│   └── TASK-HIGH-VOLUME.md   # High-volume optimization task + results
├── Dockerfile                # CPU Docker image (4 workers)
├── docker-compose.yml        # Local dev with PostgreSQL
└── requirements.txt          # Python dependencies
Completed:
- AWS Textract integration (1-3s/page, 90%+ confidence)
- Multi-page PDF support (up to 5MB via sync Textract API)
- Large PDF support (>5MB via S3 upload + async Textract API)
- Password-protected PDF detection with clear user error
- Textract rate limit retry (adaptive backoff, 5 max attempts)
- Multi-format output (Textract, Google Vision, Azure OCR)
- High-volume: 4 workers, PostgreSQL, async job queue
- Boomi integration via `/ocr` binary passthrough
- Load tested: 50-doc burst, 0 failures, 250 docs/min direct
- Web UI with settings page (AWS credentials, auth, database config)
- Swagger API docs, profile-based extraction
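
The Textract rate-limit retry listed above is capped at 5 attempts. The project's actual adaptive strategy is not documented here, so the sketch below substitutes plain exponential backoff with full jitter to show the general shape of such a retry loop. `ThrottledError` and `call_with_backoff` are hypothetical names, and the delay values are placeholders.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a Textract throttling error."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on throttling, waiting longer (with jitter) after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: sleep a random time up to an exponentially growing cap.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Randomizing the delay spreads retries from concurrent workers so they do not hit the rate limit again at the same moment.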
Planned:
- File validation (magic bytes, corrupt file detection)
- Blank page detection
- Content hash dedup (skip re-processing identical documents)
- Handwriting detection mode
- SQS integration for guaranteed message delivery
- ECS Fargate with auto-scaling
- Boomi Event Streams decoupling for higher WSS throughput
| ADR | Decision |
|---|---|
| 001 | Replace PyMuPDF with pdfplumber (licensing) |
| 002 | Multi-worker + PostgreSQL + session reuse (10x throughput) |
| 003 | Sync Textract for multi-page PDFs, Bridge mode for Boomi |
MIT License. All dependencies are permissively licensed (MIT/BSD/Apache 2.0) except psycopg2-binary which is LGPL-3.0. LGPL permits commercial use as a library dependency without requiring your code to be open-sourced.