
QuickExtract

A fast command-line tool for extracting specific fields from PDF documents using Docling and Google Gemini AI with Arize Phoenix observability.

Features

  • 🚀 Fast PDF Processing: Uses Docling for efficient PDF parsing
  • 🤖 AI-Powered Extraction: Google Gemini LLM for intelligent field extraction
  • 🔄 LangGraph Native Retries: Built-in retry policies for automatic error recovery
  • 📊 Observability: Built-in Arize Phoenix integration for tracing and monitoring
  • 🔧 Configurable Fields: Define custom fields and types via JSON configuration
  • 📝 Detailed Logging: Comprehensive extraction logs and audit trails
  • 🧪 Well Tested: 70% test coverage with comprehensive test suite
  • Pipeline Ready: Fast test execution without Phoenix dependencies

Quick Start

Prerequisites

  • Python 3.10+
  • Google Cloud Project with Vertex AI API enabled
  • Google Cloud credentials configured

Installation

  1. Clone the repository

    git clone https://github.com/logesh45/QuickExtract.git
    cd QuickExtract
  2. Set up virtual environment

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Configure environment

    cp .env.example .env
    # Edit .env with your Google Cloud project ID
  5. Authenticate with Google Cloud

    gcloud auth application-default login

Usage

Basic Extraction

python extract.py fields.json invoice-sample.pdf

Custom Retry Attempts

# Use custom retry limit (default: 5)
python extract.py fields.json invoice-sample.pdf --max-attempts 3

# Fast processing with single attempt
python extract.py fields.json invoice-sample.pdf --max-attempts 1

# High reliability with more attempts
python extract.py fields.json invoice-sample.pdf --max-attempts 10

Custom Fields Configuration

Create a fields.json file with your desired fields:

{
  "INVOICE NUMBER": "string",
  "Date": "date",
  "Total": "number",
  "BILL TO address": "string",
  "Shipping Charges": "number",
  "Insurance": "number"
}

Supported Field Types

  • string: Text values
  • number: Numeric values (integers and decimals)
  • date: Date values
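
The configuration is plain JSON, so it is easy to sanity-check before running an extraction. The sketch below is illustrative (this helper is not part of the CLI); it simply verifies that every declared type is one of the supported ones:

import json

SUPPORTED_TYPES = {"string", "number", "date"}

def load_field_config(path: str) -> dict[str, str]:
    """Load a fields.json mapping of field name -> declared type."""
    with open(path, encoding="utf-8") as f:
        fields = json.load(f)
    # Reject any declared type the extractor does not support
    unknown = {name: t for name, t in fields.items() if t not in SUPPORTED_TYPES}
    if unknown:
        raise ValueError(f"Unsupported field types: {unknown}")
    return fields

fields = load_field_config("fields.json")
print(f"Extracting {len(fields)} fields: {list(fields)}")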

Output

Results are saved to output/extracted_[pdf_name].json:

{
  "success": true,
  "attempts": 1,
  "extracted_fields": {
    "INVOICE NUMBER": "F1000876/23",
    "Date": "14/08/2023",
    "Total": "$93.50",
    "BILL TO address": "IMPORTING COMPANY\n100 Mighty Bay\n125863 Rome, IT",
    "Shipping Charges": 100,
    "Insurance": 0
  },
  "missing_fields": [],
  "raw_output_files": ["raw_outputs/invoice-sample_attempt_1_20251027_210515.json"],
  "error": null
}
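
Because the result is a single JSON document, downstream scripts can consume it directly. A minimal sketch (the file name here follows the output/extracted_[pdf_name].json convention for the sample invoice, and the keys match the example above):

import json
from pathlib import Path

result = json.loads(Path("output/extracted_invoice-sample.json").read_text(encoding="utf-8"))

if result["success"] and not result["missing_fields"]:
    fields = result["extracted_fields"]
    print(f"Invoice {fields['INVOICE NUMBER']} extracted in {result['attempts']} attempt(s)")
else:
    print(f"Extraction incomplete, missing fields: {result['missing_fields']}")
    print(f"Error: {result['error']}")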

Arize Phoenix Integration

The tool includes built-in observability with Arize Phoenix:

  • Automatic Tracing: All LLM calls are automatically traced
  • Local Phoenix UI: Launches locally at http://localhost:6006
  • Custom Project: Uses project name "pdf-extraction" (configurable via PHOENIX_PROJECT_NAME)
  • Pipeline Compatible: Can be disabled for CI/CD environments

Phoenix Configuration

Optional environment variables for Phoenix:

PHOENIX_PROJECT_NAME=pdf-extraction
PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
PHOENIX_ENABLED=false  # Disable for faster execution/pipelines
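
For reference, wiring up this kind of Phoenix tracing in Python generally follows the pattern below. This is a sketch of the common setup using the phoenix package and the OpenInference LangChain instrumentor, not the tool's exact initialization code:

import os
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

if os.getenv("PHOENIX_ENABLED", "true").lower() != "false":
    # Launch the local Phoenix UI (http://localhost:6006 by default)
    px.launch_app()
    # Register a tracer provider; the collector endpoint is typically taken
    # from PHOENIX_COLLECTOR_ENDPOINT when that variable is set
    tracer_provider = register(project_name=os.getenv("PHOENIX_PROJECT_NAME", "pdf-extraction"))
    # Instrument LangChain/LangGraph so LLM calls are traced automatically
    LangChainInstrumentor().instrument(tracer_provider=tracer_provider)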

Testing

Fast Test Execution (Recommended)

# Run tests without Phoenix (3x faster, pipeline compatible)
python run_tests.py

# Run tests with coverage report
python run_tests.py --coverage

# Verbose output
python run_tests.py --verbose

Test Options

# Enable Phoenix for debugging tests
python run_tests.py --phoenix

# Run integration test only
python run_tests.py --integration

# Coverage with Phoenix enabled
python run_tests.py --phoenix --coverage

Test Coverage

The test suite includes:

  • 70% overall coverage (92% for extract.py, 65% for extractor.py)
  • Unit tests for all core components
  • CLI functionality tests
  • Phoenix integration tests
  • End-to-end workflow tests
  • Mock-based testing to avoid API costs
  • Fast execution mode (3.4 seconds vs 11+ seconds with Phoenix)
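
The "mock-based testing" above refers to the standard unittest.mock pattern: the LLM client is replaced with a mock so no Vertex AI call (or cost) is incurred. The following self-contained sketch illustrates the idea with a toy extraction function; it is not the repository's actual test code:

import json
import unittest
from unittest.mock import MagicMock

def extract_fields(llm, document_text: str, fields: dict) -> dict:
    """Toy stand-in for an LLM-backed extraction step (illustrative only)."""
    response = llm.invoke(f"Extract {list(fields)} from: {document_text}")
    return json.loads(response.content)

class TestExtractFieldsWithMockLLM(unittest.TestCase):
    def test_no_real_api_call_is_made(self):
        # The mock returns a canned JSON payload instead of calling Gemini
        mock_llm = MagicMock()
        mock_llm.invoke.return_value = MagicMock(content='{"INVOICE NUMBER": "F1000876/23"}')

        result = extract_fields(mock_llm, "Invoice F1000876/23 ...", {"INVOICE NUMBER": "string"})

        self.assertEqual(result["INVOICE NUMBER"], "F1000876/23")
        mock_llm.invoke.assert_called_once()

if __name__ == "__main__":
    unittest.main()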

Running Tests Directly

# Run all tests with unittest
python -m unittest test_extractor.py -v

# Run specific test classes
python -m unittest test_extractor.TestDoclingExtractor -v

Development

Project Structure

QuickExtract/
├── extractor.py          # Core extraction logic with Docling + Gemini
├── extract.py            # CLI interface
├── test_extractor.py     # Comprehensive test suite (42 tests)
├── run_tests.py          # Enhanced test runner with Phoenix control
├── fields.json           # Sample field configuration
├── invoice-sample.pdf    # Sample invoice for testing
├── requirements.txt      # Python dependencies
├── .env.example         # Environment variables template
├── htmlcov/             # HTML coverage reports (generated)
├── output/              # Extraction results
├── raw_outputs/         # Raw extraction logs
└── README.md            # This file

Key Components

  • DoclingExtractor: Main extraction class with LangGraph workflow
  • CLI Interface: Argument parsing and file handling
  • Phoenix Integration: Automatic LLM tracing and observability
  • Test Suite: Comprehensive testing with mocking and fast execution modes
  • Retry Logic: Automatic retry with configurable limits
  • Audit Trails: Detailed logging and raw output preservation

Workflow Architecture

The extraction process uses a LangGraph workflow with the following nodes:

  1. Parse PDF: Extract text and metadata using Docling
  2. Extract Fields: Use Gemini LLM to extract specified fields
  3. Validate Extraction: Check for missing fields and determine retry need
  4. Save Results: Store extracted data and audit logs
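
A condensed sketch of how a four-node LangGraph workflow like this is typically wired together is shown below. The state fields and node bodies are placeholders standing in for the project's Docling and Gemini logic, not the actual implementation:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ExtractionState(TypedDict):
    pdf_path: str
    document_text: str
    requested_fields: dict
    extracted_fields: dict
    missing_fields: list
    attempts: int

def parse_pdf(state: ExtractionState) -> dict:
    # Placeholder: Docling would convert the PDF to text/markdown here
    return {"document_text": f"<parsed {state['pdf_path']}>"}

def extract_fields(state: ExtractionState) -> dict:
    # Placeholder: Gemini would return the requested fields as JSON here
    return {"extracted_fields": {}, "attempts": state["attempts"] + 1}

def validate_extraction(state: ExtractionState) -> dict:
    # Placeholder: compare extracted fields against the requested field list
    missing = [f for f in state["requested_fields"] if f not in state["extracted_fields"]]
    return {"missing_fields": missing}

def save_results(state: ExtractionState) -> dict:
    # Placeholder: write the output/ and raw_outputs/ artifacts here
    return {}

def should_retry(state: ExtractionState) -> str:
    # Loop back to extraction while fields are missing and attempts remain
    return "extract_fields" if state["missing_fields"] and state["attempts"] < 5 else "save_results"

graph = StateGraph(ExtractionState)
graph.add_node("parse_pdf", parse_pdf)
graph.add_node("extract_fields", extract_fields)
graph.add_node("validate_extraction", validate_extraction)
graph.add_node("save_results", save_results)

graph.add_edge(START, "parse_pdf")
graph.add_edge("parse_pdf", "extract_fields")
graph.add_edge("extract_fields", "validate_extraction")
graph.add_conditional_edges("validate_extraction", should_retry)
graph.add_edge("save_results", END)

app = graph.compile()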

Configuration

Environment Variables

Required:

  • GOOGLE_CLOUD_PROJECT: Your Google Cloud project ID

Optional:

  • GOOGLE_CLOUD_LOCATION: GCP region (default: us-central1)
  • PHOENIX_PROJECT_NAME: Phoenix project name (default: pdf-extraction)
  • PHOENIX_COLLECTOR_ENDPOINT: Phoenix endpoint (default: http://localhost:6006)
  • PHOENIX_ENABLED: Enable/disable Phoenix (default: true, set to false for pipelines)
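
A small sketch of reading these variables with their documented defaults (illustrative, not the tool's exact configuration code):

import os

# Required: fail fast if the Google Cloud project is not configured
project_id = os.environ["GOOGLE_CLOUD_PROJECT"]

# Optional settings, falling back to the defaults listed above
location = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")
phoenix_project = os.getenv("PHOENIX_PROJECT_NAME", "pdf-extraction")
phoenix_endpoint = os.getenv("PHOENIX_COLLECTOR_ENDPOINT", "http://localhost:6006")
phoenix_enabled = os.getenv("PHOENIX_ENABLED", "true").lower() == "true"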

Field Configuration Tips

  1. Be Specific: Use exact field names from your documents
  2. Include Variations: Add multiple similar fields if needed
  3. Choose the Right Types: Use appropriate field types for better extraction
  4. Test Iteratively: Start with a few fields, then expand

Retry Configuration

The extractor uses LangGraph native retries combined with custom retry logic for maximum reliability:

# Default: 5 retry attempts
python extract.py fields.json invoice.pdf

# Custom retry limit
python extract.py fields.json invoice.pdf --max-attempts 3

# Single attempt (fastest)
python extract.py fields.json invoice.pdf --max-attempts 1

LangGraph Native Retry Features:

  • 🔄 Automatic Exception Handling: Retries on ValueError, KeyError, and json.JSONDecodeError
  • 🛡️ Built-in Error Recovery: LangGraph automatically handles retry timing and backoff
  • 📊 Retry Tracking: Detailed logging of retry attempts and reasons
  • Smart Retry Logic: Combines custom field validation with native error handling
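
As an illustration of what a native retry policy looks like, a LangGraph node can be registered with a RetryPolicy targeting the exception types listed above. The values below are illustrative, and the exact import path and keyword argument vary between LangGraph releases:

import json
from langgraph.types import RetryPolicy  # older releases expose this from langgraph.pregel

policy = RetryPolicy(
    max_attempts=5,
    initial_interval=0.5,   # seconds before the first retry
    backoff_factor=2.0,     # exponential backoff between attempts
    retry_on=(ValueError, KeyError, json.JSONDecodeError),
)

# Attached when building the graph, e.g.:
# graph.add_node("extract_fields", extract_fields, retry=policy)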

Retry Behavior:

  • Progressive retry delays
  • Missing field validation after each attempt
  • Comprehensive error logging
  • Early termination when all fields are successfully extracted
  • LangGraph handles retryable exceptions automatically
  • Marks the extraction as successful once the maximum attempts are reached, avoiding infinite loops

Troubleshooting

Common Issues

  1. Google Cloud Authentication

    gcloud auth application-default login
    gcloud config set project YOUR_PROJECT_ID
  2. Vertex AI API Not Enabled

    • Enable Vertex AI API in your Google Cloud console
    • Ensure proper IAM permissions
  3. Phoenix UI Not Accessible

    • Check if port 6006 is available
    • Verify Phoenix installation: pip install arize-phoenix
    • Use PHOENIX_ENABLED=false to disable if not needed
  4. Tests Running Slow

    • Use python run_tests.py for fast execution (Phoenix disabled)
    • Avoid --phoenix flag unless debugging tracing issues

Debug Mode

For detailed debugging, the extractor provides:

  • Raw output files in raw_outputs/
  • Comprehensive audit logs
  • Step-by-step extraction logs
  • Phoenix tracing (when enabled)

Performance

  • Fast Test Mode: 3.4 seconds execution time
  • Standard Mode: 11+ seconds with Phoenix enabled
  • Pipeline Optimized: Phoenix disabled by default in test runner
  • Memory Efficient: Streaming PDF processing for large files

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

  1. Fork the QuickExtract repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass: python run_tests.py
  5. Check coverage: python run_tests.py --coverage
  6. Submit a pull request

Support

For issues and questions:

  • Check the troubleshooting section
  • Review the test files for usage examples
  • Open an issue on GitHub
