syawqy/ocr-app

OCR App - Offline PDF Processing

A lightweight offline application for PDF text extraction, OCR processing, and rule-based document analysis.

Features

  • Offline Operation: No external APIs required; the local server and all browser-side processing run on your own machine
  • Client-Side OCR: Uses Tesseract.js for browser-based OCR processing
  • PDF Text Extraction: Automatic text layer detection with PDF.js
  • Rules Engine: Apply custom JSON rules to extracted text
  • Demo Mode: Built-in sample rules for testing
  • CSV Export: Export results for Excel/Google Sheets
  • Single Binary: Everything embedded in one executable (frontend, rules, dependencies)
  • Local Processing: All PDF processing happens client-side for privacy
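As a rough illustration of the "Single Binary" idea, the sketch below serves a frontend from an `fs.FS`. It is not taken from `main.go`: the helper names (`demoFrontend`, `newFrontendHandler`, `fetch`) are hypothetical, and an in-memory file system stands in for what would more likely be an `embed.FS` filled by a `//go:embed frontend` directive.

```go
package main

import (
	"fmt"
	"io"
	"io/fs"
	"net/http"
	"net/http/httptest"
	"testing/fstest"
)

// demoFrontend returns an in-memory file system standing in for the
// embedded frontend directory. A real build would more likely use an
// embed.FS filled by a `//go:embed frontend` directive (assumption).
func demoFrontend() fs.FS {
	return fstest.MapFS{
		"index.html": &fstest.MapFile{Data: []byte("<h1>OCR App</h1>")},
	}
}

// newFrontendHandler serves static files from any fs.FS, so the same
// code works for an embed.FS, a directory on disk, or a test fixture.
func newFrontendHandler(fsys fs.FS) http.Handler {
	return http.FileServer(http.FS(fsys))
}

// fetch sends an in-process GET request to h and returns status and body.
func fetch(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	res := rec.Result()
	defer res.Body.Close()
	body, _ := io.ReadAll(res.Body)
	return res.StatusCode, string(body)
}

func main() {
	// Requesting "/" makes the file server look for index.html.
	status, body := fetch(newFrontendHandler(demoFrontend()), "/")
	fmt.Println(status, body)
}
```

Because the handler only depends on `fs.FS`, swapping the in-memory fixture for an embedded directory should not require changes to the serving code.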

Prerequisites

Windows

No additional prerequisites are required beyond the Go toolchain used to build the app. The application uses a pure Go SQLite driver (github.com/glebarez/sqlite) that doesn't require CGO or external dependencies.

Note: Tesseract is no longer required as OCR processing is handled client-side in the browser using Tesseract.js.

Quick Start

  1. Clone/Download this project

  2. Install dependencies:

    go mod tidy

  3. Build the application:

    go build -o ocr-app.exe main.go

  4. Run the application:

    ./ocr-app.exe

  5. Open in browser: http://localhost:8080

Usage

Upload PDF

  1. Click "Choose File" and select a PDF
  2. Click "Upload & Process" to extract text/perform OCR
  3. The app automatically detects whether OCR is needed: PDFs that already have a text layer are read directly

Apply Rules

Demo Mode

  1. Select "Demo Mode"
  2. Click "Load Demo Rules" to use built-in sample rules
  3. Click "Apply Rules" to process the document

Test Mode

  1. Select "Test Mode"
  2. Paste your custom JSON rules in the text area
  3. Click "Apply Rules"

Rule Format

[
  {
    "id": "unique_id",
    "name": "Rule Name",
    "pattern": "search_text",
    "description": "What this rule finds"
  }
]
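The rule format above maps naturally onto a small Go struct. The sketch below is only an illustration: the README does not say how `pattern` is matched, so treating it as a case-insensitive literal substring is an assumption, and `parseRules` and `matchRules` are hypothetical names rather than functions from this project.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Rule mirrors the four fields of the JSON rule format.
type Rule struct {
	ID          string `json:"id"`
	Name        string `json:"name"`
	Pattern     string `json:"pattern"`
	Description string `json:"description"`
}

// parseRules decodes a JSON array of rules.
func parseRules(data []byte) ([]Rule, error) {
	var rules []Rule
	if err := json.Unmarshal(data, &rules); err != nil {
		return nil, fmt.Errorf("parse rules: %w", err)
	}
	return rules, nil
}

// matchRules returns the IDs of rules whose pattern occurs in text.
// Assumption: patterns are case-insensitive literal substrings.
func matchRules(rules []Rule, text string) []string {
	lower := strings.ToLower(text)
	var ids []string
	for _, r := range rules {
		if r.Pattern != "" && strings.Contains(lower, strings.ToLower(r.Pattern)) {
			ids = append(ids, r.ID)
		}
	}
	return ids
}

func main() {
	raw := []byte(`[
	  {"id": "inv", "name": "Invoice", "pattern": "invoice", "description": "Finds invoices"},
	  {"id": "po", "name": "Purchase Order", "pattern": "purchase order", "description": "Finds POs"}
	]`)
	rules, err := parseRules(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(matchRules(rules, "INVOICE #1001 for services rendered"))
}
```

If the real engine treats `pattern` as a regular expression instead, `strings.Contains` would need to be replaced with the `regexp` package.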

Export Results

  • Click "Export to CSV" to download results
  • Compatible with Excel and Google Sheets
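For readers curious how a CSV export of this kind can be produced in Go, here is a minimal sketch using the standard `encoding/csv` package. The `Result` type and the column names (`rule_id`, `rule_name`, `matches`) are hypothetical; the app's real export columns are not documented here.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// Result is a hypothetical row of rule output; the real schema may differ.
type Result struct {
	RuleID   string
	RuleName string
	Matches  int
}

// toCSV renders results as CSV text with a header row.
// encoding/csv quotes fields that contain commas or quotes, which is
// what keeps the output readable by Excel and Google Sheets.
func toCSV(results []Result) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	if err := w.Write([]string{"rule_id", "rule_name", "matches"}); err != nil {
		return "", err
	}
	for _, r := range results {
		row := []string{r.RuleID, r.RuleName, strconv.Itoa(r.Matches)}
		if err := w.Write(row); err != nil {
			return "", err
		}
	}
	w.Flush()
	return sb.String(), w.Error()
}

func main() {
	out, err := toCSV([]Result{{RuleID: "r1", RuleName: "Invoice, total", Matches: 2}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```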

Project Structure

/ocr-app
  main.go           # Main application server
  go.mod            # Go dependencies  
  go.sum            # Go dependency checksums
  .gitignore        # Git ignore rules
  /frontend         # Embedded web interface
    index.html      # Main web interface
    tesseract.min.js # Client-side OCR library
    pdf.min.js      # Client-side PDF processing
  /rules            # Embedded rule definitions
    demo.json       # Sample rules for testing
  ocr_app.db        # SQLite database (created at runtime)

API Endpoints

  • GET / - Web interface
  • POST /upload - PDF upload and text processing
  • GET /rules/demo - Get demo rules
  • POST /rules/test - Apply custom rules
  • GET /results?document_id=X - Get processing results
  • GET /export?document_id=X - Export CSV
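The endpoints that take a `document_id` query parameter can also be called from a script. The sketch below builds such a URL and downloads the response. `endpointURL` and `download` are hypothetical helpers, and treating `document_id` as an opaque string is an assumption, since the README does not specify its type.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// endpointURL builds a URL such as
// http://localhost:8080/export?document_id=42 with proper query escaping.
func endpointURL(base, path, documentID string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	u.Path = path
	q := url.Values{}
	q.Set("document_id", documentID)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// download performs a GET request and returns the body on HTTP 200.
func download(rawURL string) ([]byte, error) {
	resp, err := http.Get(rawURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	u, err := endpointURL("http://localhost:8080", "/export", "1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	data, err := download(u)
	if err != nil {
		fmt.Println("download failed (is the server running?):", err)
		return
	}
	fmt.Printf("downloaded %d bytes from %s\n", len(data), u)
}
```

The same `endpointURL` helper works for `/results` by changing the path argument.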

Troubleshooting

"gcc not found"

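  • The default build uses the pure Go SQLite driver (github.com/glebarez/sqlite) and should not need gcc
  • If this error still appears, a CGO-based SQLite driver is probably being compiled; switch back to the pure Go driver, or install TDM-GCC or MinGW-w64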

Build Issues

# Clean and rebuild
go clean
go mod tidy
go build -o ocr-app.exe main.go

Large PDFs

  • The app processes PDFs in browser memory
  • For very large files, consider splitting them first or using a more powerful device

OCR Issues

  • OCR processing happens entirely in the browser using Tesseract.js
  • Ensure JavaScript is enabled in your browser
  • For better OCR accuracy, use high-quality scanned documents

Security Notes

  • PDFs are processed locally and not stored permanently
  • SHA-256 hashes are computed for integrity verification
  • No data is sent to external services
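Computing a SHA-256 hash for integrity verification takes only a few lines with Go's standard library. This is a minimal sketch, not the app's actual code; `hashReader` and `hashString` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// hashReader streams r through SHA-256 and returns the hex digest.
// Streaming avoids loading a whole PDF into memory at once.
func hashReader(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", fmt.Errorf("hash input: %w", err)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// hashString is a convenience wrapper for in-memory data.
func hashString(s string) string {
	digest, _ := hashReader(strings.NewReader(s)) // strings.Reader never fails
	return digest
}

func main() {
	// "abc" is the standard SHA-256 test vector input.
	fmt.Println(hashString("abc"))
}
```

Comparing the digest of an uploaded file against a stored digest is enough to detect whether the same document has been modified or re-uploaded.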

License

This project is provided as-is for demonstration purposes.
