AI-Agent PDF Extractor

This repository contains a Python-based automation script that processes PDF files by extracting their text content and performing natural language processing tasks, such as summarization or email extraction. The AI processing is powered by the Ollama language model, with an option to integrate Azure OpenAI for more advanced or scalable NLP tasks.

Project Overview

The AI-Agent PDF Extractor allows you to automate the extraction and processing of PDF documents using AI. It is designed to support various operations, such as:

Summarizing the content of PDFs.
Extracting email addresses from PDFs.

The project integrates with the Ollama language model, and it also provides an option to switch to Azure OpenAI for cases where large models, such as Phi-3, may be too large for your system.

Features

PDF Text Extraction: Uses pdfplumber to extract text from each page of the PDF.
Summarization: Summarizes the extracted text using the Ollama model or Azure OpenAI.
Email Extraction: Extracts email addresses from the text content.
Customizable Processing: Choose between summarization or email extraction with an easy-to-configure setting.

Requirements

Python 3.7 or higher
pdfplumber for PDF processing
requests for making HTTP requests to the Ollama API

External Services

Ollama for language model API
Optional: Azure OpenAI API for large-scale AI processing

Installation

Clone the repository:

git clone https://github.com/Morteningemann86/AI-Agent-PDF-Extractor.git
cd AI-Agent-PDF-Extractor

Create a virtual environment and activate it:

python -m venv venv
source venv/bin/activate   # On Windows use: venv\Scripts�ctivate

Install the required Python dependencies:
```
pip install -r requirements.txt
```
Install the Ollama language model (optional):
```
ollama run llama3.2:1b
```
If you're using Azure OpenAI, follow their setup guide.

Configuration

All configurations are defined inside the script. You need to set the following variables:

PDF_DIR: Path to the folder where your PDF files are located (e.g., pdf_input).
OUTPUT_DIR: Path to the folder where the output files (summaries or extracted emails) will be saved (e.g., pdf_output).
OLLAMA_MODEL: The model you want to use (default is "phi3:latest").
OLLAMA_ENDPOINT: The URL for the Ollama API (default: http://localhost:11434/api/generate).
PROCESSING_TYPE: Choose between "summarize" or "extract_emails" to set the desired processing type.

Example:

PDF_DIR = "pdf_input"              # Directory for input PDF files
OUTPUT_DIR = "pdf_output"          # Directory for processed output
OLLAMA_MODEL = "phi3:latest"       # Ollama model version
OLLAMA_ENDPOINT = "http://localhost:11434/api/generate"
PROCESSING_TYPE = "summarize"      # Choose 'summarize' or 'extract_emails'

Usage

Running the Script

Place your PDF files inside the folder specified by the PDF_DIR configuration.
Run the script with Robocorp:
```
rcc run
```

The script will iterate through all the PDF files, extract the text, and either summarize it or extract email addresses, depending on the PROCESSING_TYPE setting.

Customizing the Script

To modify the processing behavior, update the following configurations directly in the script:

To summarize the text, set:
```
PROCESSING_TYPE = "summarize"
```
To extract email addresses, set:
```
PROCESSING_TYPE = "extract_emails"
```

Example Output

For summarization, the output will be saved as a text file with the suffix _summary.txt. For email extraction, the output will be saved with the suffix _emails.txt.

Examples

Summarization

If you choose to summarize a PDF, the extracted text from each page will be processed through the Ollama model, and a concise summary will be generated.

Example summary output (_summary.txt):

--- Page 1 ---
This document provides an overview of the project...

--- Page 2 ---
In conclusion, the key points discussed include...

Email Extraction

If you opt to extract emails, the script will detect and list all email addresses found in the document.

Example email output (_emails.txt):

[email protected]
[email protected]

Limitations

Model Size: Large models, such as phi3, may require a machine with higher specs. Consider using a smaller Ollama model like llama3.2:1b for testing or switch to Azure OpenAI.
PDF Content: The script works best with text-based PDFs. Scanned or image-based PDFs will not yield useful results without OCR preprocessing.

Contributing

Contributions are welcome! Please follow these steps to contribute:

Fork the repository.
Create a new branch for your feature or bug fix.
Submit a pull request with a detailed explanation of your changes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
pdf_input		pdf_input
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
conda.yaml		conda.yaml
robot.yaml		robot.yaml
tasks.py		tasks.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

AI-Agent PDF Extractor

Table of Contents

Project Overview

Features

Requirements

External Services

Installation

Configuration

Usage

Running the Script

Customizing the Script

Example Output

Examples

Summarization

Email Extraction

Limitations

Contributing

License

About

Uh oh!

Releases

Packages

Languages

License

Morteningemann86/AI-Agent-PDF-Extractor

Folders and files

Latest commit

History

Repository files navigation

AI-Agent PDF Extractor

Table of Contents

Project Overview

Features

Requirements

External Services

Installation

Configuration

Usage

Running the Script

Customizing the Script

Example Output

Examples

Summarization

Email Extraction

Limitations

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages