A FastAPI-based document workflow system for tax preparation automation. Built for the RPG Founding Engineer Challenge.
This system automates the document understanding workflow for accounting firms:
Client → Intake → Upload → Classify → Extract → Checklist
┌──────────────┐
│    Client    │   (complexity: simple/average/complex)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│    Intake    │   (fiscal_year: 2025, status: open/done)
└──────┬───────┘
       │
       ├──► Documents (T4, ID, receipts)
       │        │
       │        ├──► Classification (Qwen3-VL-2B)
       │        │
       │        └──► Extraction (structured fields)
       │
       └──► Checklist (auto-updates on extraction)
- Duplicate Detection - SHA256 hashing prevents re-uploading same document
- Smart Classification - Qwen3-VL-2B vision-language model identifies document types
- Field Extraction - Parses all fields from T4 forms, IDs, and receipts, with 100% accuracy on the provided T4 sample (32/32 fields)
- Auto Checklist - Tracks required documents based on client complexity
- Production-Ready - Async endpoints, proper error handling, comprehensive validation
- poppler-utils is REQUIRED - needed for PDF-to-image conversion (pdf2image library dependency)
  - Install with: `sudo apt-get install -y poppler-utils`
  - On macOS: `brew install poppler`
  - On Windows: download from https://github.com/oschwartz10612/poppler-windows/releases/
- Qwen3-VL-2B model downloads automatically - the first classification downloads a ~4GB model to the Hugging Face cache
  - Model: `Qwen/Qwen2-VL-2B-Instruct`
  - Download URL: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct
  - Cache location: `~/.cache/huggingface/hub/`
  - Requires an internet connection on first run
- Database initialization is REQUIRED
  - Run `python init_db.py` before starting the server
  - Run `python init_db.py` before running tests (creates a fresh database)
  - Database file: `rpg_challenge.db` (SQLite)
- GPU memory requirements
  - Qwen3-VL-2B needs ~14GB of GPU memory
  - Image preprocessing resizes to a maximum of 1536px to prevent OOM errors
  - If you hit CUDA OOM errors, reduce the image size in `app/services/qwen3vl_service.py`
- Python 3.11+
- CUDA 12.6+ compatible GPU (L40S or better recommended)
- 14GB+ GPU memory (for Qwen3-VL-2B model)
- Ubuntu/Linux (for poppler-utils)
# Clone the repository
git clone <repository-url>
cd rpg
# Create and activate virtual environment (recommended Python 3.12)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install system dependencies (REQUIRED - not available via pip)
sudo apt-get update && sudo apt-get install -y poppler-utils
# Install Python dependencies
# Note: PyTorch with CUDA 12.6 will be installed from custom index
pip install -r requirements.txt
# Initialize database (REQUIRED before first run or tests)
python init_db.py
# Run the server
uvicorn main:app --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000
Interactive API docs: http://localhost:8000/docs
# Create a client
curl -X POST http://localhost:8000/clients \
-H "Content-Type: application/json" \
-d '{"name": "Test Client", "email": "[email protected]", "complexity": "simple"}'
# Create an intake (use client_id from above response)
curl -X POST http://localhost:8000/intakes \
-H "Content-Type: application/json" \
-d '{"client_id": "<client_id>", "fiscal_year": 2025}'
# Upload a document (use intake_id from above response)
curl -X POST http://localhost:8000/intakes/<intake_id>/documents \
-F "file=@sample_docs/T4_sample.JPG"
# Classify all documents in intake
curl -X POST http://localhost:8000/intakes/<intake_id>/classify
# Extract data from all classified documents
curl -X POST http://localhost:8000/intakes/<intake_id>/extract
# Check checklist status
curl http://localhost:8000/intakes/<intake_id>/checklist

Choice: SQLite with UUID stored as String(36)
Rationale:
- For Reviewers: Zero-setup evaluation. Clone and run in 60 seconds
- Production Path: SQLAlchemy ORM enables seamless migration to PostgreSQL (Supabase/RDS) by changing one connection string
- UUID Design: Maintains production-quality schema with globally unique identifiers
- All primary keys use: `id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))`
- Provides globally unique IDs, distributed-system compatibility, and no auto-increment collisions
- SQLite doesn't have a native UUID type, so UUIDs are stored as String(36) for portability
Trade-off: Accepted SQLite's limited concurrent writes for demo simplicity. In production, PostgreSQL would provide native UUID support, better concurrency, and advanced features like row-level security.
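For illustration, a model following this pattern looks roughly like the sketch below (a trimmed example with only a few columns, not the exact code in app/models/):

```python
import uuid

from sqlalchemy import Column, DateTime, String, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Client(Base):
    __tablename__ = "clients"

    # UUID stored as a 36-character string: globally unique, portable across SQLite/PostgreSQL
    id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
    name = Column(String, nullable=False)
    email = Column(String, unique=True, nullable=False)
    created_at = Column(DateTime, server_default=func.now())
```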
Choice: Qwen3-VL-2B-Instruct vision-language model
Rationale:
- No paid APIs - Challenge requirement, fully open-source
- 100% extraction accuracy - Achieved 32/32 fields correct on T4 sample
- Vision-language model - Understands both image layout and text semantics
- Single model for both tasks - Classification AND extraction with one inference
- Fast inference - 2B parameters, ~50 seconds per document on L40S GPU
- Proven reliability - Alibaba's production-grade model
- Open-source - Apache 2.0 license, no restrictions. Some open-source OCR models were evaluated, mainly PaddleOCR-VL, but its layout detection was subpar. The MinerU OCR model was not tested because of its restrictive license.
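For reference, a classification call with this model through Hugging Face transformers might look roughly like the sketch below, following the pattern from the model card and the `qwen-vl-utils` helper package (the prompt and file path are illustrative; the actual implementation lives in app/services/qwen3vl_service.py):

```python
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One user turn containing the document image plus a constrained prompt
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "sample_docs/T4_sample.JPG"},
        {"type": "text", "text": "Classify this document as one of: T4, receipt, id, unknown. "
                                 "Answer with the label only."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=16)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
label = processor.batch_decode(trimmed, skip_special_tokens=True)[0].strip()
print(label)  # expected: "T4" for the sample form
```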
Choice: Store files in local directory, metadata in database
Rationale:
- Simple, reliable, no S3 dependency for demo
- Production migration path: swap to S3/GCS with minimal code changes
- SHA256 hash enables content-addressable storage patterns
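A minimal sketch of this upload path, with hypothetical function and variable names (the real handler differs):

```python
import hashlib
from pathlib import Path

BUCKET_DIR = Path("./bucket")


def store_upload(intake_id: str, filename: str, content: bytes) -> tuple[str, str]:
    """Persist an uploaded file locally and return (sha256, stored_path)."""
    sha256 = hashlib.sha256(content).hexdigest()

    # Content-addressable layout: the hash in the path makes duplicates obvious
    stored_path = BUCKET_DIR / intake_id / f"{sha256}_{filename}"
    stored_path.parent.mkdir(parents=True, exist_ok=True)
    stored_path.write_bytes(content)

    # The caller checks (intake_id, sha256) against the documents table and
    # rejects the upload if the same hash already exists in this intake.
    return sha256, str(stored_path)
```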
POST /clients - Create a new client with complexity level

curl -X POST http://localhost:8000/clients \
  -H "Content-Type: application/json" \
  -d '{"name": "Acme Corp", "email": "[email protected]", "complexity": "simple"}'

POST /intakes - Create intake for a client (auto-initializes checklist)

curl -X POST http://localhost:8000/intakes \
  -H "Content-Type: application/json" \
  -d '{"client_id": "<client_id>", "fiscal_year": 2025}'

GET /intakes/{id}/checklist - Get checklist status and intake completion

curl http://localhost:8000/intakes/<intake_id>/checklist

POST /intakes/{id}/documents - Upload document (with duplicate detection)

curl -X POST http://localhost:8000/intakes/<intake_id>/documents \
  -F "file=@sample_docs/T4_sample.JPG"

POST /documents/{id}/classify - Classify single document

curl -X POST http://localhost:8000/documents/<document_id>/classify

POST /intakes/{id}/classify - Classify all unknown documents in intake

curl -X POST http://localhost:8000/intakes/<intake_id>/classify

POST /documents/{id}/extract - Extract fields from single document

curl -X POST http://localhost:8000/documents/<document_id>/extract

POST /intakes/{id}/extract - Extract from all classified documents in intake

curl -X POST http://localhost:8000/intakes/<intake_id>/extract

- `id` (UUID) - Primary key
- `name` (String) - Client name
- `email` (String) - Contact email
- `complexity` (Enum) - simple | average | complex
- `created_at` (DateTime)
Complexity determines expected documents:
- simple: T4 + ID (2 documents)
- average: T4 + ID + 2 receipts (4 documents)
- complex: T4 + ID + 5 receipts (7 documents)
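The complexity-to-checklist mapping could be captured with something like the following sketch (names are illustrative; the actual logic lives in the checklist service):

```python
# Expected document counts per complexity level
EXPECTED_DOCS = {
    "simple":  {"T4": 1, "id": 1},
    "average": {"T4": 1, "id": 1, "receipt": 2},
    "complex": {"T4": 1, "id": 1, "receipt": 5},
}


def build_checklist(complexity: str) -> list[dict]:
    """Return the initial checklist rows for a new intake."""
    return [
        {
            "doc_kind": kind,
            "status": "missing",
            "quantity_expected": expected,
            "quantity_received": 0,
        }
        for kind, expected in EXPECTED_DOCS[complexity].items()
    ]
```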
- `id` (UUID) - Primary key
- `client_id` (UUID) - Foreign key to Client
- `fiscal_year` (Integer) - e.g., 2025
- `status` (Enum) - open | done
- `created_at` (DateTime)
- `id` (UUID) - Primary key
- `intake_id` (UUID) - Foreign key to Intake
- `filename` (String)
- `sha256` (String, 64 chars) - For duplicate detection
- `mime_type` (String)
- `size_bytes` (Integer)
- `stored_path` (String) - Path in /bucket
- `uploaded_at` (DateTime)
- `doc_kind` (Enum) - T4 | receipt | id | unknown
- `extracted_data` (JSON) - Structured fields
Unique constraint: `(intake_id, sha256)` prevents duplicate uploads within the same intake
- `id` (UUID) - Primary key
- `intake_id` (UUID) - Foreign key to Intake
- `doc_kind` (Enum) - T4 | receipt | id
- `status` (Enum) - missing | received
- `quantity_expected` (Integer) - e.g., 5 for receipts in a complex client
- `quantity_received` (Integer) - Increments as documents are extracted
# Delete existing database (if it exists)
rm -f rpg_challenge.db
# Initialize fresh database (REQUIRED before tests)
python init_db.py
# Run all tests
pytest tests/ -v
# Run specific test
pytest tests/test_workflow.py::test_complete_workflow -v
# Run with coverage
pytest tests/ --cov=app --cov-report=html

Why database deletion and initialization is required:
- Tests create clients/intakes with specific emails (e.g., [email protected])
- SQLite enforces unique constraints on client emails
- Running tests multiple times without a fresh DB causes 409 conflicts
- `rm -f rpg_challenge.db` removes the old database; `python init_db.py` creates a fresh database with a clean schema
Test Coverage:
- ✅ Full workflow test (upload → classify → extract → checklist completion)
- ✅ Duplicate detection within same intake (SHA256 hash)
- ✅ Same file allowed in different intakes
- ✅ Different files allowed in same intake
- ✅ SHA256 hash consistency across uploads
- ✅ Partial workflow (intake stays open until all docs received)
- ✅ Classification accuracy on sample documents (T4, ID, receipts)
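For orientation, a duplicate-detection test in this style might look like the sketch below (the `intake_id` fixture and exact status codes are assumptions; the real tests in tests/ may differ):

```python
from fastapi.testclient import TestClient

from main import app

client = TestClient(app)


def _upload(intake_id: str):
    """Upload the same sample file into the given intake."""
    with open("sample_docs/T4_sample.JPG", "rb") as f:
        return client.post(
            f"/intakes/{intake_id}/documents",
            files={"file": ("T4_sample.JPG", f, "image/jpeg")},
        )


def test_duplicate_upload_rejected(intake_id):  # intake_id assumed to come from a fixture
    assert _upload(intake_id).status_code == 200         # first upload accepted
    assert _upload(intake_id).status_code in (400, 409)  # same SHA256 in same intake rejected
```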
Start Server - Run the FastAPI application
uvicorn main:app --host 0.0.0.0 --port 8000
# Or with auto-reload for development:
uvicorn main:app --reload

rpg/
├── app/
│   ├── api/                       # API route handlers
│   │   ├── clients.py
│   │   ├── intakes.py
│   │   ├── documents.py
│   │   └── checklist.py
│   ├── models/                    # SQLAlchemy database models
│   │   ├── client.py
│   │   ├── intake.py
│   │   ├── document.py
│   │   ├── checklist_item.py
│   │   └── enums.py
│   ├── schemas/                   # Pydantic request/response schemas
│   ├── services/                  # Business logic
│   │   ├── classification.py
│   │   ├── extraction.py
│   │   └── checklist.py
│   ├── core/                      # Configuration and database
│   │   ├── config.py
│   │   └── database.py
│   └── utils/                     # Helper functions
├── tests/                         # Test suite
├── extractions/                   # Extraction output folder (qwen3vl_tests; paddleocr_tests is deprecated)
├── bucket/                        # Document storage
├── sample_docs/                   # Provided test documents
├── extract_receipt.py             # Standalone dev scripts for testing ML extraction
├── extract_drivers_license.py
├── test_qwen3vl_t4_extraction.py  # Standalone dev script for ML parameter tuning
├── test_ocr_parameters.py         # Deprecated (PaddleOCR-VL is no longer used)
├── main.py                        # FastAPI application entry point
├── init_db.py                     # Database initialization script
├── requirements.txt               # Python dependencies
└── README.md
- No paid APIs - Challenge constraint required open-source models
- Vision-language model - Understands both image layout and text semantics
- Proven accuracy - Achieved 100% extraction accuracy on T4 sample (32/32 fields)
- Fast inference - 2B parameters fit in 14GB GPU memory with room for processing
- Multimodal - Single model handles both classification AND extraction
- PDF support - Converts PDFs to images automatically (uses first page for multi-page docs)
- Reviewer convenience - No external database setup required
- ACID compliance - Proper transactions for checklist updates
- Async support - aiosqlite enables non-blocking database operations
- Production path - Easy migration to PostgreSQL (same SQLAlchemy models)
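The wiring this implies looks roughly like the following sketch (close in spirit to app/core/database.py, but not copied from it):

```python
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

DATABASE_URL = "sqlite+aiosqlite:///./rpg_challenge.db"

# The aiosqlite driver lets SQLAlchemy run SQLite queries without blocking the event loop
engine = create_async_engine(DATABASE_URL, echo=False)
SessionLocal = async_sessionmaker(engine, expire_on_commit=False)


async def get_db():
    """FastAPI dependency that yields one session per request."""
    async with SessionLocal() as session:
        yield session
```

Swapping `DATABASE_URL` to a PostgreSQL DSN (e.g. via asyncpg) is the single change needed for the production migration described above.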
- Concurrent requests - Handle multiple uploads/classifications simultaneously
- ML inference - Non-blocking while model processes images
- Database I/O - Async queries don't block the event loop
- Scalability - Ready for production load with minimal changes
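As an illustration of the non-blocking pattern, a blocking model call can be pushed to a worker thread so other requests keep flowing; in the sketch below `run_model` is a stand-in for the real inference service, not the actual code:

```python
import asyncio
import time

from fastapi import FastAPI

app = FastAPI()


def run_model(document_id: str) -> str:
    """Placeholder for the blocking Qwen inference call."""
    time.sleep(1)  # stands in for tens of seconds of GPU inference
    return "T4"


@app.post("/documents/{document_id}/classify")
async def classify_document(document_id: str):
    # Offload the blocking model call to a worker thread so the event loop stays free
    doc_kind = await asyncio.to_thread(run_model, document_id)
    return {"document_id": document_id, "doc_kind": doc_kind}
```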
- PDF Conversion - pdf2image converts PDFs to JPG (requires poppler-utils)
- Format Normalization - Convert palette/RGBA images to RGB
- Size Optimization - Resize to max 1536px to prevent GPU OOM
- Quality Preservation - Save as JPG with 95% quality for model input
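Roughly, that pipeline amounts to the following sketch (simplified; the constant and function names are illustrative, not the service code):

```python
from pathlib import Path

from pdf2image import convert_from_path  # requires poppler-utils
from PIL import Image

MAX_SIDE = 1536


def prepare_image(path: str) -> Image.Image:
    """Load a document as an RGB image no larger than MAX_SIDE on its longest edge."""
    if Path(path).suffix.lower() == ".pdf":
        # pdf2image renders pages via poppler; use the first page for multi-page docs
        image = convert_from_path(path, first_page=1, last_page=1)[0]
    else:
        image = Image.open(path)

    # Normalize palette/RGBA images to RGB for the model
    image = image.convert("RGB")

    # Downscale in place to avoid GPU OOM; thumbnail preserves aspect ratio
    image.thumbnail((MAX_SIDE, MAX_SIDE), Image.LANCZOS)
    return image


if __name__ == "__main__":
    img = prepare_image("sample_docs/T4_sample.JPG")
    img.save("/tmp/t4_preprocessed.jpg", quality=95)
```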
If I had more time, here's what I would build next:
- Background Task Queue - Celery/RQ for long-running classification/extraction jobs
- Confidence Scores - Return model confidence with classifications for human review thresholds
- Document Versioning - Track document history and allow re-uploads with version tracking
- Webhook Notifications - Notify external systems when intake status changes to "done"
- Fine-tuned Model - Train Qwen3-VL specifically on Canadian tax forms for even higher accuracy
- Retry Logic - Exponential backoff for transient ML inference failures with circuit breaker
- Audit Logging - Track all document operations for compliance (who, what, when)
- Batch Processing - Process entire intake in one API call with progress tracking
- Multi-tenancy - Isolate data by accounting firm with row-level security policies
- API Rate Limiting - Token bucket algorithm to protect against abuse
- Idempotency Keys - Prevent duplicate requests with client-provided idempotency tokens
- Model Caching - Redis cache for frequently extracted documents
- Horizontal Scaling - Load balancer + multiple API instances sharing model via model server
All configuration is hardcoded in `app/core/config.py` for simplicity:
- Database: `sqlite+aiosqlite:///./rpg_challenge.db`
- File storage: `./bucket/`
- Max file size: 10MB
For production, we would use environment variables via .env file.
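A .env-driven variant could look roughly like this sketch using pydantic-settings (not the current config.py, which hardcodes these values):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    # Defaults mirror the hardcoded demo values; a .env file overrides them in production
    database_url: str = "sqlite+aiosqlite:///./rpg_challenge.db"
    bucket_dir: str = "./bucket"
    max_file_size_bytes: int = 10 * 1024 * 1024  # 10MB


settings = Settings()
```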
Built for the RPG Founding Engineer Challenge.
- Qwen3-VL - Alibaba's open-source vision-language model
- FastAPI - Modern Python web framework
- SQLAlchemy - Powerful ORM with async support
- Transformers - Hugging Face library for ML models
- Lightning.ai - GPU-accelerated inference