A fast command-line tool for extracting specific fields from PDF documents using Docling and Google Gemini AI with Arize Phoenix observability.
- 🚀 Fast PDF Processing: Uses Docling for efficient PDF parsing
- 🤖 AI-Powered Extraction: Google Gemini LLM for intelligent field extraction
- 🔄 LangGraph Native Retries: Built-in retry policies for automatic error recovery
- 📊 Observability: Built-in Arize Phoenix integration for tracing and monitoring
- 🔧 Configurable Fields: Define custom fields and types via JSON configuration
- 📝 Detailed Logging: Comprehensive extraction logs and audit trails
- 🧪 Well Tested: 70% test coverage with comprehensive test suite
- ⚡ Pipeline Ready: Fast test execution without Phoenix dependencies
- Python 3.10+
- Google Cloud Project with Vertex AI API enabled
- Google Cloud credentials configured
1. Clone the repository

   ```bash
   git clone https://github.com/logesh45/QuickExtract.git
   cd QuickExtract
   ```

2. Set up a virtual environment

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Configure the environment

   ```bash
   cp .env.example .env
   # Edit .env with your Google Cloud project ID
   ```

5. Authenticate with Google Cloud

   ```bash
   gcloud auth application-default login
   ```
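To confirm that the application-default credentials and project are picked up, you can run a quick optional check. This uses the standard `google-auth` library (installed alongside the Vertex AI dependencies) and is not part of the repo itself:

```python
import google.auth

# Resolves the application-default credentials configured by
# `gcloud auth application-default login`.
credentials, project = google.auth.default()
print(f"Authenticated against project: {project}")
```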
Run the extractor against the sample invoice:

```bash
python extract.py fields.json invoice-sample.pdf

# Use a custom retry limit (default: 5)
python extract.py fields.json invoice-sample.pdf --max-attempts 3

# Fast processing with a single attempt
python extract.py fields.json invoice-sample.pdf --max-attempts 1

# High reliability with more attempts
python extract.py fields.json invoice-sample.pdf --max-attempts 10
```

Create a `fields.json` file with your desired fields:
```json
{
  "INVOICE NUMBER": "string",
  "Date": "date",
  "Total": "number",
  "BILL TO address": "string",
  "Shipping Charges": "number",
  "Insurance": "number"
}
```

Supported field types:

- `string`: Text values
- `number`: Numeric values (integers and decimals)
- `date`: Date values
Results are saved to `output/extracted_[pdf_name].json`:

```json
{
  "success": true,
  "attempts": 1,
  "extracted_fields": {
    "INVOICE NUMBER": "F1000876/23",
    "Date": "14/08/2023",
    "Total": "$93.50",
    "BILL TO address": "IMPORTING COMPANY\n100 Mighty Bay\n125863 Rome, IT",
    "Shipping Charges": 100,
    "Insurance": 0
  },
  "missing_fields": [],
  "raw_output_files": ["raw_outputs/invoice-sample_attempt_1_20251027_210515.json"],
  "error": null
}
```
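If you consume these results downstream, the fields shown above are all you need. For example, a hypothetical consumer script (not part of the repo, assuming the sample PDF has been processed):

```python
import json
from pathlib import Path

# Load the result written by extract.py for invoice-sample.pdf.
result = json.loads(Path("output/extracted_invoice-sample.json").read_text())

if result["success"] and not result["missing_fields"]:
    print("Invoice total:", result["extracted_fields"]["Total"])
else:
    print("Extraction incomplete:", result["missing_fields"], result["error"])
```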
The tool includes built-in observability with Arize Phoenix:

- Automatic Tracing: All LLM calls are automatically traced
- Local Phoenix UI: Launches locally at http://localhost:6006
- Custom Project: Uses the project name "pdf-extraction" (configurable via `PHOENIX_PROJECT_NAME`)
- Pipeline Compatible: Can be disabled for CI/CD environments
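For reference, this kind of tracing is usually wired up with the Phoenix OpenTelemetry helper plus the OpenInference LangChain instrumentor. The snippet below is a generic sketch of that pattern, not the exact code in `extractor.py`, and assumes the `arize-phoenix` and `openinference-instrumentation-langchain` packages are installed:

```python
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Launch the local Phoenix UI (http://localhost:6006), register a tracer provider
# under the "pdf-extraction" project, then instrument LangChain/LangGraph LLM calls.
px.launch_app()
tracer_provider = register(project_name="pdf-extraction")
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
```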
Optional environment variables for Phoenix:

```bash
PHOENIX_PROJECT_NAME=pdf-extraction
PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
PHOENIX_ENABLED=false  # Disable for faster execution/pipelines
```

```bash
# Run tests without Phoenix (3x faster, pipeline compatible)
python run_tests.py

# Run tests with coverage report
python run_tests.py --coverage

# Verbose output
python run_tests.py --verbose

# Enable Phoenix for debugging tests
python run_tests.py --phoenix

# Run the integration test only
python run_tests.py --integration

# Coverage with Phoenix enabled
python run_tests.py --phoenix --coverage
```

The test suite includes:
- 70% overall coverage (92% for extract.py, 65% for extractor.py)
- Unit tests for all core components
- CLI functionality tests
- Phoenix integration tests
- End-to-end workflow tests
- Mock-based testing to avoid API costs
- Fast execution mode (3.4 seconds vs 11+ seconds with Phoenix)
```bash
# Run all tests with unittest
python -m unittest test_extractor.py -v

# Run specific test classes
python -m unittest test_extractor.TestDoclingExtractor -v
```

Project structure:

```
QuickExtract/
├── extractor.py          # Core extraction logic with Docling + Gemini
├── extract.py            # CLI interface
├── test_extractor.py     # Comprehensive test suite (42 tests)
├── run_tests.py          # Enhanced test runner with Phoenix control
├── fields.json           # Sample field configuration
├── invoice-sample.pdf    # Sample invoice for testing
├── requirements.txt      # Python dependencies
├── .env.example          # Environment variables template
├── htmlcov/              # HTML coverage reports (generated)
├── output/               # Extraction results
├── raw_outputs/          # Raw extraction logs
└── README.md             # This file
```
- DoclingExtractor: Main extraction class with LangGraph workflow
- CLI Interface: Argument parsing and file handling
- Phoenix Integration: Automatic LLM tracing and observability
- Test Suite: Comprehensive testing with mocking and fast execution modes
- Retry Logic: Automatic retry with configurable limits
- Audit Trails: Detailed logging and raw output preservation
The extraction process uses a LangGraph workflow with the following nodes:
- Parse PDF: Extract text and metadata using Docling
- Extract Fields: Use Gemini LLM to extract specified fields
- Validate Extraction: Check for missing fields and determine retry need
- Save Results: Store extracted data and audit logs
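For orientation, here is a minimal sketch of how a four-node workflow like this can be assembled with LangGraph and Docling. The state fields, node names, and retry condition are illustrative assumptions rather than the exact code in `extractor.py`, and the Gemini call is stubbed out:

```python
from typing import TypedDict
from docling.document_converter import DocumentConverter
from langgraph.graph import StateGraph, START, END

class ExtractionState(TypedDict, total=False):
    pdf_path: str
    fields: dict           # field name -> expected type, loaded from fields.json
    document_text: str
    extracted_fields: dict
    missing_fields: list
    attempts: int

def parse_pdf(state: ExtractionState) -> dict:
    # Docling converts the PDF; Markdown export is a convenient LLM-friendly format.
    result = DocumentConverter().convert(state["pdf_path"])
    return {"document_text": result.document.export_to_markdown()}

def extract_fields(state: ExtractionState) -> dict:
    # Call Gemini with the document text and the requested fields (omitted here).
    return {"extracted_fields": {}, "attempts": state.get("attempts", 0) + 1}

def validate_extraction(state: ExtractionState) -> dict:
    # A field counts as missing when the LLM returned nothing (or an empty value) for it.
    extracted = state.get("extracted_fields", {})
    return {"missing_fields": [name for name in state.get("fields", {}) if not extracted.get(name)]}

def save_results(state: ExtractionState) -> dict:
    # Persist extracted data and audit logs to output/ and raw_outputs/ (omitted here).
    return {}

graph = StateGraph(ExtractionState)
graph.add_node("parse_pdf", parse_pdf)
graph.add_node("extract_fields", extract_fields)
graph.add_node("validate_extraction", validate_extraction)
graph.add_node("save_results", save_results)

graph.add_edge(START, "parse_pdf")
graph.add_edge("parse_pdf", "extract_fields")
graph.add_edge("extract_fields", "validate_extraction")
graph.add_conditional_edges(
    "validate_extraction",
    # Retry the extraction node while fields are missing and attempts remain.
    lambda s: "extract_fields" if s.get("missing_fields") and s.get("attempts", 0) < 5 else "save_results",
)
graph.add_edge("save_results", END)

app = graph.compile()
```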
Required:

- `GOOGLE_CLOUD_PROJECT`: Your Google Cloud project ID

Optional:

- `GOOGLE_CLOUD_LOCATION`: GCP region (default: `us-central1`)
- `PHOENIX_PROJECT_NAME`: Phoenix project name (default: `pdf-extraction`)
- `PHOENIX_COLLECTOR_ENDPOINT`: Phoenix endpoint (default: `http://localhost:6006`)
- `PHOENIX_ENABLED`: Enable/disable Phoenix (default: `true`; set to `false` for pipelines)
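Putting these together, a typical `.env` looks like this (the values shown are the documented defaults plus a placeholder project ID):

```bash
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
PHOENIX_PROJECT_NAME=pdf-extraction
PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
PHOENIX_ENABLED=true
```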
- Be Specific: Use exact field names from your documents
- Include Variations: Add multiple similar fields if needed
- Choose Right Types: Use appropriate field types for better extraction
- Test Iteratively: Start with a few fields, then expand
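Applying these tips, a small starting configuration for invoices with inconsistent labels might look like this (an illustrative example, not a file shipped with the repo):

```json
{
  "INVOICE NUMBER": "string",
  "Invoice No": "string",
  "Date": "date",
  "Total": "number"
}
```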
The extractor uses LangGraph native retries combined with custom retry logic for maximum reliability:
```bash
# Default: 5 retry attempts
python extract.py fields.json invoice.pdf

# Custom retry limit
python extract.py fields.json invoice.pdf --max-attempts 3

# Single attempt (fastest)
python extract.py fields.json invoice.pdf --max-attempts 1
```

LangGraph Native Retry Features:

- 🔄 Automatic Exception Handling: Retries on `ValueError`, `KeyError`, and `json.JSONDecodeError`
- 🛡️ Built-in Error Recovery: LangGraph automatically handles retry timing and backoff
- 📊 Retry Tracking: Detailed logging of retry attempts and reasons
- ⚡ Smart Retry Logic: Combines custom field validation with native error handling
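A retry policy of this kind is typically attached to a node when the graph is built. The snippet below sketches that pattern; it is not the exact code in `extractor.py`, and the import path and keyword argument name vary between langgraph releases:

```python
import json
from langgraph.types import RetryPolicy  # older releases expose this as langgraph.pregel.RetryPolicy

# Retry automatically on the exception types listed above, with backoff between attempts.
extraction_retry = RetryPolicy(
    max_attempts=5,
    retry_on=(ValueError, KeyError, json.JSONDecodeError),
)

# Attached when the node is added to the graph, e.g.:
# graph.add_node("extract_fields", extract_fields, retry=extraction_retry)
```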
Retry Behavior:
- Progressive retry delays
- Missing field validation after each attempt
- Comprehensive error logging
- Early termination when all fields are successfully extracted
- LangGraph handles retryable exceptions automatically
- Marks as success after max attempts to avoid infinite loops
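The missing-field check that drives these retries can be pictured as a small helper like the following (hypothetical code, shown only to illustrate the behavior described above):

```python
def find_missing_fields(requested: dict, extracted: dict) -> list:
    """Return the requested field names the LLM left out or returned empty."""
    return [name for name in requested if extracted.get(name) in (None, "", [])]

# Example: "Insurance" was requested but not extracted, so another attempt is needed.
requested = {"INVOICE NUMBER": "string", "Total": "number", "Insurance": "number"}
extracted = {"INVOICE NUMBER": "F1000876/23", "Total": "$93.50"}
assert find_missing_fields(requested, extracted) == ["Insurance"]
```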
1. Google Cloud Authentication

   ```bash
   gcloud auth application-default login
   gcloud config set project YOUR_PROJECT_ID
   ```

2. Vertex AI API Not Enabled
   - Enable the Vertex AI API in your Google Cloud console
   - Ensure proper IAM permissions

3. Phoenix UI Not Accessible
   - Check that port 6006 is available
   - Verify the Phoenix installation: `pip install arize-phoenix`
   - Use `PHOENIX_ENABLED=false` to disable Phoenix if it is not needed

4. Tests Running Slow
   - Use `python run_tests.py` for fast execution (Phoenix disabled)
   - Avoid the `--phoenix` flag unless debugging tracing issues
For detailed debugging, the extractor provides:
- Raw output files in `raw_outputs/`
- Comprehensive audit logs
- Step-by-step extraction logs
- Phoenix tracing (when enabled)
- Fast Test Mode: 3.4 seconds execution time
- Standard Mode: 11+ seconds with Phoenix enabled
- Pipeline Optimized: Phoenix disabled by default in test runner
- Memory Efficient: Streaming PDF processing for large files
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the QuickExtract repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass: `python run_tests.py`
- Check coverage: `python run_tests.py --coverage`
- Submit a pull request
For issues and questions:
- Check the troubleshooting section
- Review the test files for usage examples
- Open an issue on GitHub