PDF-to-Text-Conversion-with-Layout-Extraction

This notebook helps process the output generated by PyMuPDF, a Python module for extracting text from PDF files. When converting a PDF, PyMuPDF can be used to automatically identify and label strings based on font size, font weight and the most frequently used font. The generated output file contains HTML-style tags such as <h1>, <h2>, <p>, <s1>, <s2>.
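As a rough illustration of the labeling idea, the sketch below reads text spans through PyMuPDF's `page.get_text("dict")` output and assigns tags by font size alone. It is a minimal sketch under stated assumptions, not the notebook's actual code: the helper names (`extract_spans`, `build_tag_map`, `tag_spans`) are hypothetical, the most common size is chosen by character count, and font weight is ignored for brevity.

```python
from collections import Counter


def extract_spans(pdf_path):
    """Collect (text, font size) pairs from a PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers below work without it

    spans = []
    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                # Image blocks have no "lines" key, so fall back to an empty list.
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if text:
                            spans.append((text, round(span["size"], 1)))
    finally:
        doc.close()
    return spans


def build_tag_map(spans):
    """Map font sizes to tags; the most used size (by characters) becomes "p"."""
    counts = Counter()
    for text, size in spans:
        counts[size] += len(text)
    body_size = counts.most_common(1)[0][0]
    # Assumption: larger sizes are headers, smaller sizes are subscripts.
    larger = sorted((s for s in counts if s > body_size), reverse=True)
    smaller = sorted((s for s in counts if s < body_size), reverse=True)
    tag_map = {body_size: "p"}
    tag_map.update({s: f"h{i}" for i, s in enumerate(larger, start=1)})
    tag_map.update({s: f"s{i}" for i, s in enumerate(smaller, start=1)})
    return tag_map


def tag_spans(spans):
    """Return (tag, text) pairs for every span."""
    tag_map = build_tag_map(spans)
    return [(tag_map[size], text) for text, size in spans]
```

With a real file, `tag_spans(extract_spans("some.pdf"))` would yield pairs such as `("h1", ...)` for the largest text and `("p", ...)` for body text; the file name here is a placeholder.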

This notebook provides scripts and interactive widgets for:

Loading the PDF file
Processing the PDF file with PyMuPDF to extract headers, paragraphs and subscripts
Inspecting the auto-generated HTML-style tags (select any tag and view random samples of paragraphs carrying that tag)
Renaming the labels to a desired naming scheme
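The inspection and renaming steps can be approximated without widgets. The sketch below assumes the tagged output is available as a list of `(tag, text)` pairs; that representation and the helper names `sample_by_tag` and `rename_tags` are hypothetical and may differ from what the notebook actually uses.

```python
import random
from collections import defaultdict


def sample_by_tag(tagged, tag, k=3, seed=None):
    """Return up to k random text samples that carry the given tag."""
    by_tag = defaultdict(list)
    for t, text in tagged:
        by_tag[t].append(text)
    candidates = by_tag.get(tag, [])
    rng = random.Random(seed)  # pass a seed for reproducible samples
    return rng.sample(candidates, min(k, len(candidates)))


def rename_tags(tagged, mapping):
    """Rename tags according to mapping; tags not in mapping stay unchanged."""
    return [(mapping.get(t, t), text) for t, text in tagged]


tagged = [("h1", "Introduction"), ("p", "First paragraph."),
          ("p", "Second paragraph."), ("s1", "A footnote.")]
print(sample_by_tag(tagged, "p", k=1))
print(rename_tags(tagged, {"h1": "title", "p": "paragraph"}))
```

Leaving unmapped tags untouched makes it possible to rename a few labels at a time while checking samples in between.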

Launch the notebook here:
