A lightweight offline application for PDF text extraction, OCR processing, and rule-based document analysis.
- Offline Operation: No external APIs required - everything runs in your browser
- Client-Side OCR: Uses Tesseract.js for browser-based OCR processing
- PDF Text Extraction: Automatic text layer detection with PDF.js
- Rules Engine: Apply custom JSON rules to extracted text
- Demo Mode: Built-in sample rules for testing
- CSV Export: Export results for Excel/Google Sheets
- Single Binary: Everything embedded in one executable (frontend, rules, dependencies)
- Local Processing: All PDF processing happens client-side for privacy
No additional prerequisites required! The application uses a pure Go SQLite driver (github.com/glebarez/sqlite) that doesn't require CGO or external dependencies.
Note: Tesseract is no longer required as OCR processing is handled client-side in the browser using Tesseract.js.
- Clone or download this project
- Install dependencies: go mod tidy
- Build the application: go build -o ocr-app.exe main.go
- Run the application: ./ocr-app.exe
- Open in browser: http://localhost:8080
- Click "Choose File" and select a PDF
- Click "Upload & Process" to extract text/perform OCR
- The app will automatically detect if OCR is needed
- Select "Demo Mode"
- Click "Load Demo Rules" to use built-in sample rules
- Click "Apply Rules" to process the document
- Select "Test Mode"
- Paste your custom JSON rules in the text area
- Click "Apply Rules"
[
  {
    "id": "unique_id",
    "name": "Rule Name",
    "pattern": "search_text",
    "description": "What this rule finds"
  }
]
- Click "Export to CSV" to download results
- Compatible with Excel and Google Sheets
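The rule format shown above can be loaded and applied with a few lines of Go. The following is a minimal sketch, not the application's actual engine: the `Rule` fields mirror the JSON keys, but the `Match` type, the function names, and the choice of case-insensitive substring matching are assumptions made for illustration. The real matcher may treat `pattern` differently (for example, as a regular expression).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Rule mirrors the JSON rule format from this README.
type Rule struct {
	ID          string `json:"id"`
	Name        string `json:"name"`
	Pattern     string `json:"pattern"`
	Description string `json:"description"`
}

// Match is a hypothetical result type: a rule and its hit count.
type Match struct {
	RuleID string
	Name   string
	Count  int
}

// ParseRules decodes a JSON array of rules.
func ParseRules(data []byte) ([]Rule, error) {
	var rules []Rule
	if err := json.Unmarshal(data, &rules); err != nil {
		return nil, fmt.Errorf("invalid rules JSON: %w", err)
	}
	return rules, nil
}

// ApplyRules counts non-overlapping, case-insensitive occurrences of each
// rule's pattern in text. Substring matching is an assumption of this sketch.
func ApplyRules(rules []Rule, text string) []Match {
	lower := strings.ToLower(text)
	var matches []Match
	for _, r := range rules {
		if r.Pattern == "" {
			continue // an empty pattern would match everywhere
		}
		if n := strings.Count(lower, strings.ToLower(r.Pattern)); n > 0 {
			matches = append(matches, Match{RuleID: r.ID, Name: r.Name, Count: n})
		}
	}
	return matches
}

func main() {
	raw := `[{"id":"inv","name":"Invoice","pattern":"invoice","description":"Finds invoices"}]`
	rules, err := ParseRules([]byte(raw))
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, m := range ApplyRules(rules, "Invoice 42. See invoice terms.") {
		fmt.Printf("%s (%s): %d match(es)\n", m.Name, m.RuleID, m.Count)
	}
	// Prints: Invoice (inv): 2 match(es)
}
```

Rules with an empty `pattern` are skipped here so that a malformed rule does not produce a match on every document.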
/ocr-app
    main.go                 # Main application server
    go.mod                  # Go dependencies
    go.sum                  # Go dependency checksums
    .gitignore              # Git ignore rules
    /frontend               # Embedded web interface
        index.html          # Main web interface
        tesseract.min.js    # Client-side OCR library
        pdf.min.js          # Client-side PDF processing
    /rules                  # Embedded rule definitions
        demo.json           # Sample rules for testing
    ocr_app.db              # SQLite database (created at runtime)
- GET / - Web interface
- POST /upload - PDF upload and text processing
- GET /rules/demo - Get demo rules
- POST /rules/test - Apply custom rules
- GET /results?document_id=X - Get processing results
- GET /export?document_id=X - Export CSV
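The endpoints listed above can also be called from a script instead of the browser. Below is a rough sketch of downloading the CSV export with Go's standard library. It assumes the server is running at http://localhost:8080, that `document_id` accepts the value `1`, and that the export responds with 200 OK and the CSV as the response body. None of these details are confirmed by this README, and the helper names `endpointURL` and `fetch` are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

// baseURL is the address from the quick-start steps in this README.
const baseURL = "http://localhost:8080"

// endpointURL builds a URL such as /export?document_id=1 and
// query-escapes the document ID.
func endpointURL(base, path, documentID string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = path
	q := url.Values{}
	q.Set("document_id", documentID)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// fetch performs a GET request and returns the response body.
// Treating only 200 OK as success is an assumption of this sketch.
func fetch(rawURL string) ([]byte, error) {
	resp, err := http.Get(rawURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// "1" is a placeholder document ID.
	u, err := endpointURL(baseURL, "/export", "1")
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad URL:", err)
		return
	}
	body, err := fetch(u)
	if err != nil {
		fmt.Fprintln(os.Stderr, "export failed:", err)
		return
	}
	if err := os.WriteFile("results.csv", body, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		return
	}
	fmt.Println("saved results.csv")
}
```

The same `endpointURL` helper works for `/results` by changing the path argument.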
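- A C compiler (TDM-GCC or MinGW-w64) is only needed if you switch to a CGO-based SQLite driver
- The default pure Go driver (github.com/glebarez/sqlite) builds without one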
# Clean and rebuild
go clean
go mod tidy
go build -o ocr-app.exe main.go
- The app processes PDFs in browser memory
- For very large files, consider splitting them first or using a more powerful device
- OCR processing happens entirely in the browser using Tesseract.js
- Ensure JavaScript is enabled in your browser
- For better OCR accuracy, use high-quality scanned documents
- PDFs are processed locally and not stored permanently
- SHA-256 hashes are computed for integrity verification
- No data is sent to external services
This project is provided as-is for demonstration purposes.