thepi.pe

Extract clean data from tricky documents ⚡

thepi.pe is a package that can scrape clean markdown or accurately extract structured data from complex documents. It uses vision-language models (VLMs) under the hood, and works out-of-the-box with any LLM, VLM, or vector database. It can be used right away on a hosted cloud, or it can be run locally.

Features 🌟

Scrape clean markdown, tables, and images from any document or webpage
Works out-of-the-box with LLMs, vector databases, and RAG frameworks
AI-native filetype detection, layout analysis, and structured data extraction
Accepts a wide range of sources, including PDFs, URLs, Word docs, PowerPoint files, Python notebooks, GitHub repos, videos, audio, and more

Get started in 5 minutes 🚀

thepi.pe can read a wide range of filetypes and web sources, so it requires a few dependencies. It also requires vision-language model inference for AI extraction features. For these reasons, we host an API that works out-of-the-box. For more detailed setup instructions, view the docs.

pip install thepipe-api

Hosted API (Python)

You can get an API key by signing up for a free account at thepi.pe. It is completely free to try out. Then, simply set the THEPIPE_API_KEY environment variable to your API key.

Scrape Function

from thepipe.scraper import scrape_file
from thepipe.core import chunks_to_messages
from openai import OpenAI

# scrape clean markdown
chunks = scrape_file(filepath="paper.pdf", ai_extraction=False)

# call LLM with scraped chunks
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=chunks_to_messages(chunks),
)

The output from thepi.pe is a list of chunks containing all content within the source document. These chunks can easily be converted to a prompt format that is compatible with any LLM or multimodal model with thepipe.core.chunks_to_messages, which gives the following format:

[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "..."
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/jpeg;base64,..."
        }
      }
    ]
  }
]
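
For illustration, a message of this shape can be assembled by hand with only the standard library. This is a minimal sketch of the format, not thepi.pe's implementation; the helper name `build_user_message` and the placeholder image bytes are hypothetical.

```python
import base64

def build_user_message(text: str, jpeg_bytes: bytes) -> list[dict]:
    """Build a one-message prompt in the format shown above (hypothetical helper)."""
    # Images are embedded inline as base64 data URLs.
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
                },
            ],
        }
    ]

# Placeholder bytes stand in for real JPEG data.
messages = build_user_message("Page 1 text", b"\xff\xd8\xff")
```

In normal use you would not need this helper, since chunks_to_messages produces the list for you.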

Extract Function

The extract function allows you to extract structured data from documents. You can use it as follows:

from thepipe.extract import extract_from_file

# Define your schema
schema = {
    "name": "string",
    "age": "int",
    "is_student": "bool"
}

# Extract data from the file
result = extract_from_file(
    file_path="document.pdf",
    schema=schema,
    ai_model="gpt-4o-mini",
    multiple_extractions=True
)

print(result)
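
The shape of `result` is not specified in this section. As a sketch, assuming each extraction is a flat dict keyed by the schema's field names, a small sanity check against the schema could look like the following. The helper `matches_schema`, the type-name mapping, and the sample record are illustrative and not part of thepi.pe.

```python
# Assumed mapping from the schema's type names to Python types.
TYPE_NAMES = {"string": str, "int": int, "bool": bool}

def matches_schema(record: dict, schema: dict) -> bool:
    """Return True if record has exactly the schema's fields with matching types."""
    if set(record) != set(schema):
        return False
    for field, type_name in schema.items():
        expected = TYPE_NAMES[type_name]
        value = record[field]
        # bool is a subclass of int in Python, so reject True/False for "int" fields.
        if expected is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
    return True

schema = {"name": "string", "age": "int", "is_student": "bool"}
sample = {"name": "Ada", "age": 36, "is_student": False}  # made-up record
print(matches_schema(sample, schema))
```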

Local Installation (Python)

For a local installation, you can use the following command:

pip install "thepipe-api[local]"

You must have a local LLM server set up and running for AI extraction features. You can use any local LLM server that follows the OpenAI format (such as LiteLLM) or a provider (such as OpenRouter or OpenAI). Next, set the LLM_SERVER_BASE_URL environment variable to your LLM server's endpoint URL and set LLM_SERVER_API_KEY. The DEFAULT_AI_MODEL environment variable can be set to your VLM of choice. For example, you would use openai/gpt-4o-mini if using OpenRouter or gpt-4o-mini if using OpenAI.
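
As a sketch, these variables can also be set from Python before thepi.pe is used. The base URL below is OpenRouter's OpenAI-compatible endpoint and the key is a placeholder; both are assumptions for illustration. When the package reads these variables is not stated here, so setting them before importing it is the cautious choice.

```python
import os

# Example values for an OpenRouter setup; replace the key with your own.
os.environ["LLM_SERVER_BASE_URL"] = "https://openrouter.ai/api/v1"
os.environ["LLM_SERVER_API_KEY"] = "sk-your-key-here"  # placeholder
os.environ["DEFAULT_AI_MODEL"] = "openai/gpt-4o-mini"

# Fail early if anything required is missing or empty.
required = ["LLM_SERVER_BASE_URL", "LLM_SERVER_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```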

For full functionality with media-rich sources, you will need to install the following dependencies:

apt-get update && apt-get install -y git ffmpeg tesseract-ocr
python -m playwright install --with-deps chromium

When using thepi.pe locally, be sure to append local=True to your function calls:

from thepipe.scraper import scrape_url
chunks = scrape_url(url="https://example.com", local=True)

You can also use thepi.pe from the command line:

thepipe path/to/folder --include_regex '.*\.tsx' --local
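
To preview which files a pattern such as `.*\.tsx` would select, you can try it with Python's `re` module. This is only a sketch: the file paths are made up, and whether thepi.pe applies the pattern as a full match or a search is an assumption here.

```python
import re

# Same pattern as the command above; the raw string keeps the backslash intact.
pattern = re.compile(r".*\.tsx")

paths = ["src/App.tsx", "src/index.ts", "README.md", "components/Button.tsx"]
# fullmatch requires the whole path to match (an assumption about matching mode).
included = [p for p in paths if pattern.fullmatch(p)]
print(included)
```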

Supported File Types 📚

Source	Input types	Multimodal	Notes
Webpage	URLs starting with `http`, `https`, `ftp`	✔️	Scrapes markdown, images, and tables from web pages. `ai_extraction` available for AI content extraction from the webpage's screenshot
PDF	`.pdf`	✔️	Extracts page markdown and page images. `ai_extraction` available to use a VLM for complex or scanned documents
Word Document	`.docx`	✔️	Extracts text, tables, and images
PowerPoint	`.pptx`	✔️	Extracts text and images from slides
Video	`.mp4`, `.mov`, `.wmv`	✔️	Uses Whisper for transcription and extracts frames
Audio	`.mp3`, `.wav`	✔️	Uses Whisper for transcription
Jupyter Notebook	`.ipynb`	✔️	Extracts markdown, code, outputs, and images
Spreadsheet	`.csv`, `.xls`, `.xlsx`	❌	Converts each row to JSON format, including row index for each
Plaintext	`.txt`, `.md`, `.rtf`, etc	❌	Simple text extraction
Image	`.jpg`, `.jpeg`, `.png`	✔️	Uses pytesseract for OCR in text-only mode
ZIP File	`.zip`	✔️	Extracts and processes contained files
Directory	any `path/to/folder`	✔️	Recursively processes all files in directory
YouTube Video (known issues)	YouTube video URLs starting with `https://youtube.com` or `https://www.youtube.com`.	✔️	Uses pytube for video download and Whisper for transcription. For consistent extraction, you may need to modify your `pytube` installation to send a valid user agent header (see this issue).
Tweet	URLs starting with `https://twitter.com` or `https://x.com`	✔️	Uses unofficial API, may break unexpectedly
GitHub Repository	GitHub repo URLs starting with `https://github.com` or `https://www.github.com`	✔️	Requires GITHUB_TOKEN environment variable
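
To make the spreadsheet row concrete, here is a minimal standard-library sketch of converting each row to JSON with its row index. It is not thepi.pe's implementation; the function name `rows_to_json`, the `row_index` key, and the sample data are hypothetical.

```python
import csv
import io
import json

def rows_to_json(csv_text: str) -> list[str]:
    """Convert each CSV row to a JSON string that includes its row index."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        json.dumps({"row_index": index, **row})
        for index, row in enumerate(reader)
    ]

sample = "name,age\nAda,36\nAlan,41\n"
for line in rows_to_json(sample):
    print(line)
```

Note that the `csv` module reads every cell as a string, so numeric columns stay as strings unless converted explicitly.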

How it works 🛠️

thepi.pe uses computer vision models and heuristics to extract clean content from the source and process it for downstream use with language models or vision transformers. You can feed these messages directly into the model, or you can use chunker.chunk_by_document, chunker.chunk_by_page, chunker.chunk_by_section, or chunker.chunk_semantic to chunk these messages for a vector database such as ChromaDB or a RAG framework. A chunk can be converted to a LlamaIndex Document/ImageDocument with .to_llamaindex.

⚠️ It is important to be mindful of your model's token limit. GPT-4o does not work with too many images in the prompt (see discussion here). To remedy this issue, either use an LLM with a larger context window, extract larger documents with text_only=True, or embed the chunks into a vector database.
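
As a related sketch, images can also be capped before a prompt is sent. The code below assumes messages in the OpenAI-style format produced by chunks_to_messages, with `content` as a list of parts; the helper `limit_images` and the sample messages are hypothetical and not part of thepi.pe.

```python
def limit_images(messages: list[dict], max_images: int) -> list[dict]:
    """Return a copy of messages keeping only the first max_images image parts."""
    kept = 0
    trimmed = []
    for message in messages:
        parts = []
        for part in message["content"]:
            if part["type"] == "image_url":
                if kept >= max_images:
                    continue  # drop images beyond the budget
                kept += 1
            parts.append(part)
        trimmed.append({**message, "content": parts})
    return trimmed

# Made-up messages: one text part and three images.
image = {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
messages = [{"role": "user", "content": [{"type": "text", "text": "..."}, image, image, image]}]
trimmed = limit_images(messages, max_images=2)
print(len(trimmed[0]["content"]))
```

Text parts are always kept, and the original list is left unmodified.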
