PDFToolkit

PDF Extraction and Analysis CLI

Overview • Installation • Usage • Providers • References

Overview

PDFToolkit is a CLI for extracting, analyzing, and benchmarking PDF content, with a focus on charts and visualizations. It provides a unified interface to multiple conversion and analysis backends, plus a harness for comparing parsers on a single document.

Installation

git clone https://github.com/amadad/pdftoolkit.git
cd pdftoolkit

# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create environment and install
uv venv && source .venv/bin/activate
uv sync

Set up API keys:

export OPENAI_API_KEY="..."      # For marker --describe, markitdown
export MISTRAL_API_KEY="..."     # For mistral provider
export TOGETHER_API_KEY="..."    # For together provider
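Before running a provider, it can help to confirm that its key is actually exported. The following is a minimal sketch and not part of PDFToolkit: the provider-to-variable mapping is copied from the comments above, and the names `REQUIRED_KEYS` and `missing_keys` are hypothetical.

```python
import os

# Provider -> environment variable, taken from the export comments above.
# Note: "marker" only needs its key when --describe is used.
REQUIRED_KEYS = {
    "marker": "OPENAI_API_KEY",
    "markitdown": "OPENAI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "together": "TOGETHER_API_KEY",
}


def missing_keys(providers, env=None):
    """Return the sorted env var names the given providers need but that are unset or empty."""
    env = os.environ if env is None else env
    # Providers without an entry (e.g. docling, ollama) are treated as needing no key.
    needed = {REQUIRED_KEYS[p] for p in providers if p in REQUIRED_KEYS}
    return sorted(name for name in needed if not env.get(name))


if __name__ == "__main__":
    print(missing_keys(["mistral", "together", "docling"]))
```

An empty list means every key needed by the listed providers is present in the current shell.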

Optional installs:

uv pip install megaparse "unstructured[all-docs]==0.15.0"  # megaparse provider (quoted so zsh does not glob the brackets)
uv pip install together                                     # together provider

Usage

Convert PDF to Markdown

# Default provider (docling)
pdftoolkit convert document.pdf

# Choose provider
pdftoolkit convert document.pdf -p marker
pdftoolkit convert document.pdf -p mistral
pdftoolkit convert document.pdf -p markitdown
pdftoolkit convert document.pdf -p megaparse

# With options
pdftoolkit convert document.pdf -p marker --describe  # Add AI image descriptions (marker only)
pdftoolkit convert document.pdf -o custom_output/     # Custom output directory

Benchmark a PDF Across Parsers

# Benchmark the default runnable commercial-safe tools
pdftoolkit benchmark document.pdf

# Benchmark an explicit commercial-friendly subset
pdftoolkit benchmark document.pdf -t docling -t markitdown -t mistral

# Benchmark optional research tools (if installed)
pdftoolkit benchmark document.pdf -t mineru -t olmocr -t paddleocr

By default, benchmark runs only the commercial-safe, low-setup tools that are currently runnable in your environment. Outputs are written under output/benchmark/<document-stem>/, which contains a results.json summary and one output directory per tool.
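To inspect a finished run from a script, the output location can be derived from the PDF name. This is a minimal sketch based only on the path layout described above; the helper names `benchmark_dir` and `load_summary` are hypothetical, and no assumption is made about the fields inside results.json, which is simply parsed and returned.

```python
import json
from pathlib import Path


def benchmark_dir(pdf_path, root="output/benchmark"):
    """Directory for a document's benchmark outputs: <root>/<document-stem>/."""
    return Path(root) / Path(pdf_path).stem


def load_summary(pdf_path, root="output/benchmark"):
    """Parse results.json if it exists; return None when no benchmark has been run."""
    summary = benchmark_dir(pdf_path, root) / "results.json"
    if not summary.is_file():
        return None
    with summary.open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    out = benchmark_dir("document.pdf")
    print(out)
    if out.is_dir():
        # Subdirectories are the per-tool outputs.
        print(sorted(p.name for p in out.iterdir() if p.is_dir()))
    print(load_summary("document.pdf"))
```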

Analyze Images/Charts

# Default provider (ollama - local)
pdftoolkit analyze chart.jpg

# Choose provider
pdftoolkit analyze chart.jpg -p ollama
pdftoolkit analyze chart.jpg -p together
pdftoolkit analyze chart.jpg -p colqwen

# With options
pdftoolkit analyze chart.jpg -q "What trends does this show?"
pdftoolkit analyze images/ --threshold 0.6  # Batch with confidence filter

# ColQwen returns relevance scores for queries
pdftoolkit analyze chart.jpg -p colqwen -q "chart showing growth"
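The idea behind a confidence filter such as --threshold 0.6 can be illustrated as follows. This is a sketch of the general technique, not PDFToolkit's implementation: the record shape (a dict with a "confidence" key), the inclusive comparison, and the function name `filter_by_confidence` are all assumptions.

```python
def filter_by_confidence(results, threshold):
    """Keep records whose confidence is at or above the threshold (assumed inclusive)."""
    return [r for r in results if r["confidence"] >= threshold]


if __name__ == "__main__":
    # Made-up records for illustration only; not real PDFToolkit output.
    sample = [
        {"image": "a.jpg", "confidence": 0.82},
        {"image": "b.jpg", "confidence": 0.41},
        {"image": "c.jpg", "confidence": 0.60},
    ]
    for record in filter_by_confidence(sample, 0.6):
        print(record["image"])
```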

Help

pdftoolkit --help
pdftoolkit convert --help
pdftoolkit benchmark --help
pdftoolkit analyze --help

Providers

Convert Providers

Provider	Description	Requirements
`docling`	IBM's document toolkit, basic extraction	Default
`marker`	PDF extraction with image support	`--describe` needs OPENAI_API_KEY
`mistral`	Mistral OCR API	MISTRAL_API_KEY
`markitdown`	Microsoft's converter	OPENAI_API_KEY
`megaparse`	Advanced structure parsing	Separate install

Benchmark Tools

pdftoolkit benchmark runs the integrated convert providers, plus optional evaluation tools when they are installed.

Tool	What it is	Commercial use
`docling`	IBM document parser	Yes
`markitdown`	Microsoft converter	Yes
`mistral`	Mistral OCR API	Yes
`megaparse`	Structural parser	Yes
`marker`	Layout-focused parser	Review license/weights
`paddleocr`	PP-Structure parser	Yes
`olmocr`	Technical-doc OCR	Yes
`mineru`	Strong open parser	No (AGPL)
`got-ocr`, `qwen-vl`, `internvl`, `nanonets`	VLM eval tools	Review model licenses
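The table above can be used to build an explicit commercial-friendly benchmark command. This is a minimal sketch: the license column is copied from the table, the -t flag comes from the usage examples, and the names `COMMERCIAL_USE`, `commercial_safe`, and `benchmark_command` are hypothetical. The grouped VLM eval tools are left out because their row only says to review model licenses.

```python
import shlex

# "Commercial use" column copied from the table above.
COMMERCIAL_USE = {
    "docling": "Yes",
    "markitdown": "Yes",
    "mistral": "Yes",
    "megaparse": "Yes",
    "marker": "Review license/weights",
    "paddleocr": "Yes",
    "olmocr": "Yes",
    "mineru": "No (AGPL)",
}


def commercial_safe(table=COMMERCIAL_USE):
    """Tools whose commercial-use entry is an unqualified 'Yes', in table order."""
    return [tool for tool, status in table.items() if status == "Yes"]


def benchmark_command(pdf, tools):
    """Argument list for `pdftoolkit benchmark`, with one -t flag per tool."""
    args = ["pdftoolkit", "benchmark", pdf]
    for tool in tools:
        args += ["-t", tool]
    return args


if __name__ == "__main__":
    print(shlex.join(benchmark_command("document.pdf", commercial_safe())))
```

Tools in the resulting list still need to be installed, and the table is not a substitute for checking each license yourself.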

Analyze Providers

Provider	Description	Requirements
`ollama`	Local Llama Vision	Ollama running locally
`together`	Together API with confidence scoring	TOGETHER_API_KEY
`colqwen`	Visual similarity/relevance scores	Local GPU recommended

Project Structure

pdftoolkit/
├── cli.py              # Typer CLI
├── providers/
│   ├── convert.py      # PDF conversion providers
│   └── analyze.py      # Image analysis providers
├── benchmark.py        # Benchmark harness and tool registry
├── clients.py          # API client singletons
└── utils.py            # Shared utilities
src/                    # Standalone scripts (reference implementations)
tests/                  # Test suite

References

ColQwen2 - Visual retrieval model
Docling - IBM's document toolkit
Marker - PDF extraction
MegaParse - Advanced parsing
MarkItDown - Microsoft's converter
Mistral OCR - Mistral Document AI
Ollama - Local LLM inference
Together - Cloud LLM inference
