GitHub - gawati/pdf-to-xml: PDF to XML converter

This is a fork of the pdfminer tool, with a specific focus on extracting semantic XML from OCR-ed PDFs.

It extracts PDF content page by page, and also marks up words and lines using distinct tags.
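
This README does not list the tag names used in the output, so a quick way to see which tags a converted file contains is to count them. The sketch below is only an illustration: the helper name `tag_counts` and the sample document, including its `pages`, `page`, `line` and `word` tags, are hypothetical and not taken from the tool's real output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; the real tag names produced by the tool may differ.
SAMPLE = (
    '<pages>'
    '<page id="1"><line><word>Hello</word><word>world</word></line></page>'
    '<page id="2"><line><word>Bye</word></line></page>'
    '</pages>'
)


def tag_counts(xml_text):
    """Count how often each element tag occurs, including the root element."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


if __name__ == "__main__":
    # Print the tags from most to least frequent.
    for tag, count in tag_counts(SAMPLE).most_common():
        print(tag, count)
```

To inspect a real output file, read it as text and pass the contents to `tag_counts` in place of `SAMPLE`.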

Installation

python lc_setup.py install

You can also install it within a virtualenv.

python lc_pdf2txt.py

The script provides various options. Of interest here are the XML-specific options that have been added:

-B make_brief

Disables character-level font glyph output, in case that is too verbose for you.

-t xml

Outputs XML

lc_pdf2txt.py -B -t xml -o test.xml ./akn_mu_act_1923-10-13_act_14-1923_eng_main.pdf

will convert akn_mu_act_1923-10-13_act_14-1923_eng_main.pdf to test.xml.

We typically don't need character-level font glyphs, which is why the example above passes -B.
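
For batch conversions, the same invocation can be driven from Python. This is a minimal sketch, assuming `lc_pdf2txt.py` sits in the current working directory and accepts exactly the options shown above; the helper names `build_command` and `convert` are made up for this illustration and are not part of the tool.

```python
import subprocess
import sys


def build_command(pdf_path, xml_path, brief=True, script="lc_pdf2txt.py"):
    """Build the argument list for the command line shown in this README."""
    cmd = [sys.executable, script]
    if brief:
        cmd.append("-B")  # leave out character-level font glyphs
    cmd += ["-t", "xml", "-o", str(xml_path), str(pdf_path)]
    return cmd


def convert(pdf_path, xml_path, brief=True, script="lc_pdf2txt.py"):
    """Run the converter and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(pdf_path, xml_path, brief, script), check=True)
```

Calling `convert("input.pdf", "output.xml")` would run the equivalent of `python lc_pdf2txt.py -B -t xml -o output.xml input.pdf`. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.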
