A pipeline to identify, replace, or remove copyrighted material in educational content for open-access publishing.

mitsoul/copyright_handling

PDF Image Copyright Analyzer

A Python tool that analyzes images in PDF documents for copyright considerations using Claude AI. The tool:

  • Classifies each image into a copyright category
  • Finds potential source citations through reverse image search
  • Generates alternative images for replaceable content
  • Produces a comprehensive report

Classification Categories

Images are classified into one of four categories:

  • N: Not copyrighted/Creative Commons/Public Domain
  • P: Permission needed (academic/university/federal/non-profit content)
  • F: Fair use applicable
  • R: Needs recreation/replacement
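
The four labels above map naturally onto an enumeration. The following is a minimal sketch, not the repository's actual code: the names `CopyrightCategory` and `parse_category` are hypothetical, and the descriptions simply paraphrase the list above.

```python
from enum import Enum


class CopyrightCategory(Enum):
    """Hypothetical enum mirroring the four labels listed above."""

    N = "Not copyrighted / Creative Commons / public domain"
    P = "Permission needed"
    F = "Fair use applicable"
    R = "Needs recreation or replacement"


def parse_category(label: str) -> CopyrightCategory:
    """Map a raw label such as ' r ' to a category; reject anything else."""
    key = label.strip().upper()
    try:
        return CopyrightCategory[key]
    except KeyError:
        raise ValueError(f"unknown classification label: {label!r}") from None
```

Validating labels this way would catch a malformed model response (for example, a label outside N/P/F/R) before it reaches the report.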

Features

  • Extracts and analyzes all images from PDF files
  • Performs reverse image search using Google Images
  • Attempts to extract citations from source websites
  • Annotates the original PDF with classification labels
  • Generates a JSON report with detailed classifications
  • Uses Claude 3.5 Sonnet for image analysis
  • Automatically generates alternative images using Stable Diffusion (optional)

Requirements

  • Python 3.x
  • Chrome WebDriver (for Selenium)
  • API keys:
    • Anthropic API key (for Claude)
    • Stability AI API key (for Stable Diffusion image generation)
    • 2captcha API key (optional, for solving captchas)

Installation

pip install anthropic python-dotenv PyMuPDF stability-sdk selenium beautifulsoup4 trafilatura 2captcha-python

Environment Setup

Create a .env file with your API keys:

CLAUDE_API_KEY=your_claude_api_key
STABILITY_API_KEY=your_stability_api_key
CAPTCHA_API_KEY=your_2captcha_api_key  # Optional
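
A quick way to confirm the keys are visible before a long run is to check the environment up front. This is a sketch under assumptions: the helper `missing_keys` is hypothetical, and treating `STABILITY_API_KEY` as required only when alternatives are generated is an assumption based on image generation being optional, not something the tool is known to enforce.

```python
import os
from typing import List, Mapping

# Variable names come from the .env example above.
REQUIRED_KEYS = ("CLAUDE_API_KEY",)
# Assumption: the Stability key only matters when alternatives are generated.
GENERATION_KEYS = ("STABILITY_API_KEY",)


def missing_keys(env: Mapping[str, str], generate_alternatives: bool = False) -> List[str]:
    """Return the names of expected keys that are absent or blank in env."""
    needed = REQUIRED_KEYS + (GENERATION_KEYS if generate_alternatives else ())
    return [name for name in needed if not env.get(name, "").strip()]


if __name__ == "__main__":
    # With python-dotenv installed, calling dotenv.load_dotenv() first
    # would copy the values from .env into os.environ.
    missing = missing_keys(os.environ)
    print("Missing keys:", ", ".join(missing) or "none")
```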

Usage

Basic usage:

python main.py path/to/your.pdf

With alternative image generation:

python main.py path/to/your.pdf --generate-alternatives

Show browser during citation search (helpful for debugging):

python main.py path/to/your.pdf --show-browser
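
For readers who want to see how the three invocations above fit together, here is a sketch of a command-line interface that accepts them. It is an illustration built with `argparse`, not the actual contents of `main.py`; only the flag names `--generate-alternatives` and `--show-browser` are taken from the commands above, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the invocations shown above."""
    parser = argparse.ArgumentParser(
        description="Analyze images in a PDF for copyright considerations."
    )
    parser.add_argument("pdf_path", help="path to the PDF to analyze")
    parser.add_argument(
        "--generate-alternatives",
        action="store_true",
        help="also generate replacement images",
    )
    parser.add_argument(
        "--show-browser",
        action="store_true",
        help="show the browser window during citation search",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.pdf_path, args.generate_alternatives, args.show_browser)
```

Both flags are plain switches, so they can be combined in a single run.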

Output Files

  1. Annotated PDF (*_annotated.pdf):
    • Original PDF with classification labels (N/P/F/R) next to each image

  2. Classification Report (*_classifications.json):
    • Detailed information for each image
    • Includes classifications, citations, and source URLs
    • Alternative image paths (if generated)

  3. Alternative Images (optional):
    • Generated images for content marked as 'R' or 'N'
    • Saved as separate PNG files
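
The output names follow the `*_annotated.pdf` and `*_classifications.json` patterns listed above. The sketch below derives them from an input path. It assumes the `*` stands for the input file's name without its extension and that outputs are written next to the input; neither detail is confirmed here, and `output_paths` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict


def output_paths(pdf_path: str) -> Dict[str, Path]:
    """Derive the expected output locations for a given input PDF.

    Assumption: outputs sit beside the input and reuse its stem.
    """
    src = Path(pdf_path)
    return {
        "annotated_pdf": src.with_name(f"{src.stem}_annotated.pdf"),
        "report": src.with_name(f"{src.stem}_classifications.json"),
    }


if __name__ == "__main__":
    for name, path in output_paths("path/to/your.pdf").items():
        print(f"{name}: {path}")
```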

Known Issues and Tips

  1. Citation Search:
    • The --show-browser flag lets you monitor the citation search as it runs
    • Useful for debugging when citations aren't being found correctly
    • Allows visual confirmation that Google Images searches succeed

  2. Captcha Handling (Work in Progress):
    • The current captcha-solving implementation has limitations
    • The URL sent to the 2captcha API may not match the actual captcha site
    • Manual intervention may be needed for sites with strict captcha protection

  3. Performance Considerations:
    • Processing large PDFs with many images may take significant time
    • Image search and text extraction are the most time-consuming operations
