lesser-panda/PDF-to-Text-Conversion-with-Layout-Extraction

PDF-to-Text-Conversion-with-Layout-Extraction

This notebook helps process the output generated by PyMuPDF, a Python module for converting PDF to text. When converting a PDF file, PyMuPDF can be used to identify and label text strings automatically based on font size, font weight and the most used font. The generated output file contains HTML-style tags such as <h1>, <h2>, <p>, <s1>, <s2>.
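As a rough illustration of size-based labeling, the sketch below treats the font size that covers the most characters as body text (`<p>`), ranks larger sizes as `<h1>`, `<h2>`, ... and smaller sizes as `<s1>`, `<s2>`, .... This rule is an assumption made for illustration, not necessarily the exact logic in the notebook, and it ignores font weight. The helper names (`build_size_tags`, `tag_spans`, `read_spans`) are hypothetical; only `fitz.open` and `page.get_text("dict")` are PyMuPDF calls.

```python
from collections import Counter


def build_size_tags(spans):
    """Map each font size to a tag (assumed rule, for illustration only)."""
    # Count characters per font size so long body text outweighs short titles.
    counts = Counter()
    for span in spans:
        counts[span["size"]] += len(span["text"])
    body_size = counts.most_common(1)[0][0]

    # Larger-than-body sizes become headers, smaller ones become subscripts.
    larger = sorted((s for s in counts if s > body_size), reverse=True)
    smaller = sorted((s for s in counts if s < body_size), reverse=True)

    tags = {body_size: "<p>"}
    for rank, size in enumerate(larger, start=1):
        tags[size] = f"<h{rank}>"
    for rank, size in enumerate(smaller, start=1):
        tags[size] = f"<s{rank}>"
    return tags


def tag_spans(spans):
    """Prefix every non-empty span's text with the tag for its font size."""
    tags = build_size_tags(spans)
    return [tags[s["size"]] + s["text"] for s in spans if s["text"].strip()]


def read_spans(pdf_path):
    """Collect text spans from a PDF with PyMuPDF (requires `pip install pymupdf`)."""
    import fitz  # PyMuPDF

    spans = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                # Image blocks carry no "lines" key, so default to an empty list.
                for line in block.get("lines", []):
                    spans.extend(line["spans"])
    return spans
```

Calling `tag_spans(read_spans("example.pdf"))` (the file name is a placeholder) would return a list of strings, each starting with its tag.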

This notebook provides scripts and interactive widgets for:

  • Loading the PDF file
  • Processing the PDF file with PyMuPDF to extract headers, paragraphs and subscripts
  • Inspecting the auto-generated HTML tags (select any tags and view random samples of paragraphs with the selected tag)
  • Renaming the labels to a desired naming scheme
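The inspection and renaming steps can be sketched without widgets, assuming the tagged output is a list of strings that each begin with a tag such as `<h1>` or `<p>`. That storage format and the function names (`split_tag`, `sample_by_tag`, `rename_tags`) are assumptions for illustration; the notebook itself may organize this differently.

```python
import random
import re

# Matches a leading tag like <h1> and captures the tag name and the remaining text.
TAG_RE = re.compile(r"^<(\w+)>(.*)$", re.DOTALL)


def split_tag(item):
    """Return (tag, text); tag is None when the string has no leading tag."""
    match = TAG_RE.match(item)
    return (match.group(1), match.group(2)) if match else (None, item)


def sample_by_tag(items, tag, k=3, seed=None):
    """Return up to k random texts labeled with the given tag."""
    pool = [text for t, text in map(split_tag, items) if t == tag]
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


def rename_tags(items, mapping):
    """Replace tag names according to mapping; unmapped tags stay unchanged."""
    renamed = []
    for item in items:
        tag, text = split_tag(item)
        if tag is None:
            renamed.append(item)
        else:
            renamed.append(f"<{mapping.get(tag, tag)}>{text}")
    return renamed
```

For example, `rename_tags(tagged, {"h1": "title", "s1": "footnote"})` would relabel top-level headers and the largest subscript size while leaving `<p>` entries as they are.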

Launch the notebook here: Binder
