PDF Image Copyright Analyzer

A Python tool that analyzes images in PDF documents for copyright considerations using Claude AI. The tool processes each image and:

Classifies it into a copyright category
Finds potential source citations through reverse image search
Generates alternative images for replaceable content
Produces a comprehensive report

Classification Categories

Images are classified into one of four categories:

N: Not copyrighted/Creative Commons/Public Domain
P: Permission needed (academic/university/federal/non-profit content)
F: Fair use applicable
R: Needs recreation/replacement
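
The four labels above map naturally onto an enumeration. The sketch below is illustrative only: the names `CopyrightCategory` and `parse_label` are hypothetical and are not taken from the project's source. It shows one way a single-letter label could be validated before it is written to the report.

```python
from enum import Enum


class CopyrightCategory(Enum):
    """The four labels listed above (hypothetical type, not from the repo)."""
    N = "Not copyrighted / Creative Commons / public domain"
    P = "Permission needed"
    F = "Fair use applicable"
    R = "Needs recreation or replacement"


def parse_label(raw: str) -> CopyrightCategory:
    """Normalize a reply such as ' p ' to a category, or raise ValueError."""
    label = raw.strip().upper()
    try:
        return CopyrightCategory[label]
    except KeyError:
        raise ValueError(f"unknown classification label: {raw!r}") from None
```

Rejecting anything outside N/P/F/R early keeps malformed labels out of the annotated PDF and the JSON report.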

Features

Extracts and analyzes all images from PDF files
Performs reverse image search using Google Images
Attempts to extract citations from source websites
Annotates the original PDF with classification labels
Generates a JSON report with detailed classifications
Uses Claude 3.5 Sonnet for image analysis
Automatically generates alternative images using Stable Diffusion (optional)

Requirements

Python 3.x
Chrome WebDriver (for Selenium)
API Keys:
- Anthropic API key (for Claude)
- Stability AI API key (for Stable Diffusion image generation, optional)
- 2captcha API key (for solving captchas, optional)

Installation

pip install anthropic python-dotenv PyMuPDF stability-sdk selenium beautifulsoup4 trafilatura 2captcha-python

Environment Setup

Create a .env file with your API keys:

CLAUDE_API_KEY=your_claude_api_key
STABILITY_API_KEY=your_stability_api_key
CAPTCHA_API_KEY=your_2captcha_api_key  # Optional
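
A minimal sketch of how these variables might be checked at startup. The function name `load_api_keys` is hypothetical, and treating `STABILITY_API_KEY` as required only when alternatives are generated is an assumption, not confirmed behavior of the tool. With python-dotenv installed, calling `load_dotenv()` first would copy the values from `.env` into `os.environ`.

```python
import os

KEY_NAMES = ("CLAUDE_API_KEY", "STABILITY_API_KEY", "CAPTCHA_API_KEY")


def load_api_keys(env=None, generate_alternatives=False):
    """Return the three keys as a dict, raising if a required one is missing.

    Assumption: the Stability key is only needed for image generation,
    and the 2captcha key is always optional.
    """
    env = os.environ if env is None else env
    required = ["CLAUDE_API_KEY"]
    if generate_alternatives:
        required.append("STABILITY_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing API keys: " + ", ".join(missing))
    # Optional keys that are absent come back as None.
    return {name: env.get(name) for name in KEY_NAMES}
```

Failing fast here avoids discovering a missing key only after the PDF has been partly processed.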

Usage

Basic usage:

python main.py path/to/your.pdf

With alternative image generation:

python main.py path/to/your.pdf --generate-alternatives

Show browser during citation search (helpful for debugging):

python main.py path/to/your.pdf --show-browser
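
For reference, the command-line interface shown above could be declared with `argparse` roughly as follows. This is a sketch, not the project's actual `main.py`: the flag names come from the usage examples, while `build_parser`, the positional name `pdf_path`, and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser matching the usage examples (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Analyze images in a PDF for copyright considerations.",
    )
    parser.add_argument("pdf_path", help="path to the PDF to analyze")
    parser.add_argument(
        "--generate-alternatives",
        action="store_true",
        help="generate replacement images with Stable Diffusion",
    )
    parser.add_argument(
        "--show-browser",
        action="store_true",
        help="show the browser window during citation search",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Both flags default to off, so the basic invocation runs without image generation and with the browser hidden.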

Output Files

Annotated PDF (*_annotated.pdf):
- Original PDF with classification labels (N/P/F/R) next to each image
Classification Report (*_classifications.json):
- Detailed information for each image
- Includes classifications, citations, source URLs
- Alternative image paths (if generated)
Alternative Images (optional):
- Generated images for content marked as 'R' or 'N'
- Saved as separate PNG files
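
The output names follow the input file's base name. A small sketch of that naming rule is below; the helper `output_paths` is hypothetical, and placing the outputs in the same directory as the input is an assumption.

```python
from pathlib import Path


def output_paths(pdf_path):
    """Derive the annotated-PDF and report paths from the input PDF path.

    Assumption: outputs are written next to the input file.
    """
    pdf = Path(pdf_path)
    return {
        "annotated_pdf": pdf.with_name(f"{pdf.stem}_annotated.pdf"),
        "report": pdf.with_name(f"{pdf.stem}_classifications.json"),
    }
```

Under this rule, an input named `lecture.pdf` would yield `lecture_annotated.pdf` and `lecture_classifications.json`.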

Known Issues and Tips

Citation Search:
- Using --show-browser flag helps monitor the citation search process
- Useful for debugging when citations aren't being found correctly
- Allows visual confirmation of successful Google Image searches
Captcha Handling (Work in Progress):
- Current implementation of captcha solving has limitations
- The URL being sent to 2captcha API may not match the actual captcha site
- Manual intervention might be needed for sites with strict captcha protection
Performance Considerations:
- Processing large PDFs with many images may take significant time
- Image search and text extraction are the most time-consuming operations

mitsoul/copyright_handling
