Azure PDF Parser

Context

This repo provides a Python wrapper class for calling text extraction on local or URL-accessible PDF documents.

Utility code is then provided to convert this API response object into a Parser Output object.

Setup

Before using this wrapper class you will need an Azure Form Recognizer processor provisioned in the Microsoft Azure cloud.

You will then need the endpoint and key for that resource in order to authenticate.
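
For example, the key and endpoint can be read from environment variables before instantiating the wrapper. This is a minimal sketch with illustrative variable names; check .env.example for the names the CLI actually expects:

import os

# Illustrative environment variable names; adjust to match your own configuration
AZURE_KEY = os.environ["AZURE_PROCESSOR_KEY"]
AZURE_ENDPOINT = os.environ["AZURE_PROCESSOR_ENDPOINT"]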

Usage

Via CLI

The CLI takes as input a directory of PDFs and outputs a directory of 'blank' parser output JSON files, with only the document_id, document_name, text blocks and page metadata fields populated.

  1. Install the extra CLI dependency group: poetry install --with cli
  2. Populate the environment variables (see .env.example)
  3. Run the CLI: poetry run python -m src.cli --pdf-dir <path to pdf directory> --output-dir <path to output directory>

To run the CLI on a document from a source URL:

poetry run python -m src.cli --output-dir output --source_url cclw.executive.1.1 https://source.pdf

To run the CLI on multiple documents from source URLs:

poetry run python -m src.cli --output-dir output --source_url cclw.executive.1.1 https://source.pdf --source_url cclw.executive.2.2 https://source.pdf

The CLI can also be invoked programmatically; the CLI command above is a shortcut for the following:

from pathlib import Path

from azure_pdf_parser.run import run_parser

# (document ID, source URL) pairs for the documents to parse
ids_and_source_urls = [("cclw.executive.1.1", "https://source.pdf")]

# Saves JSONs named by document ID to output_dir
run_parser(
    output_dir=Path("./data/"),
    ids_and_source_urls=ids_and_source_urls,
)
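
Each (id, url) tuple pairs a document ID with the source URL of the PDF to download and parse, mirroring the --source_url <id> <url> pairs passed to the CLI above.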

Programmatically

Install dependencies and enter the Python shell:

poetry install
python3

Import the wrapper class and conversion function:

from azure_pdf_parser import AzureApiWrapper
from azure_pdf_parser import azure_api_response_to_parser_output

Instantiate a client connection and call text extraction on a PDF accessible via a URL, then convert the response to a parser output object:

# AZURE_KEY and AZURE_ENDPOINT are the key and endpoint from the Setup section
azure_client = AzureApiWrapper(AZURE_KEY, AZURE_ENDPOINT)

api_response = azure_client.analyze_document_from_url(
    doc_url="https://example.com/file.pdf"
)

parser_output = azure_api_response_to_parser_output(
    parser_input=parser_input,  # parser input object describing the document (from the CPR data model)
    md5_sum=md5_sum,  # md5 hash of the pdf content
    api_response=api_response,
    experimental_extract_tables=True,
)

There are four options for calling the text extraction API:

  1. analyze_document_from_url - Pass a URL to a PDF document.
  2. analyze_document_from_bytes - Pass a byte string of a PDF document.
  3. analyze_large_document_from_url - Pass a URL to a PDF document that's greater than ~1500 pages.
  4. analyze_large_document_from_bytes - Pass a byte string of a PDF document that's greater than ~1500 pages.
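
For example, to analyse a local PDF with the bytes-based method. This is a minimal sketch; the bytes are passed positionally here, so check the method signature for the exact parameter name:

from pathlib import Path

from azure_pdf_parser import AzureApiWrapper

azure_client = AzureApiWrapper(AZURE_KEY, AZURE_ENDPOINT)

# Read the pdf from disk and pass its raw bytes to the wrapper
pdf_bytes = Path("local-file.pdf").read_bytes()
api_response = azure_client.analyze_document_from_bytes(pdf_bytes)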

There are two separate methods for large documents so that a user can provide either the bytes of a document or its URL. For the analyze_large_document_from_url method, the wrapper also handles downloading the document from source, as well as splitting the document and calling the API.
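
A minimal sketch of the large-document path, assuming the call takes the same doc_url argument as analyze_document_from_url; whether the batched responses are merged into a single response object is not covered here, so check the method's docstring:

api_response = azure_client.analyze_large_document_from_url(
    doc_url="https://example.com/large-file.pdf"
)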

The package also provides functionality to extract tables from the PDF document. This is an experimental feature and is not recommended for use in production. It can be enabled by setting the experimental_extract_tables flag to True when calling azure_api_response_to_parser_output; it defaults to False.
