This notebook helps process the output generated by PyMuPDF, a PDF-to-text Python module. When converting PDF files, PyMuPDF can be used to automatically identify and label strings based on font size, font weight and the most used font. The generated output file contains HTML-style tags such as `<h1>`, `<h2>`, `<p>`, `<s1>` and `<s2>`.
This notebook provides scripts and interactive widgets for:
- Loading the PDF file
- Processing the PDF file with PyMuPDF to extract headers, paragraphs and subscripts
- Inspecting the auto-generated HTML tags (select any tags and view random samples of paragraphs with the selected tag)
- Renaming the labels to a desired naming scheme
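
As a rough illustration of the steps listed above, here is a minimal sketch of font-size-based tagging and tag renaming. It is not the notebook's actual code. The function names (`extract_spans`, `build_size_tags`, `tag_spans`, `rename_tags`) are hypothetical, and the rule used here is an assumption: the most used font size becomes `p`, larger sizes become `h1`, `h2`, ... and smaller sizes become `s1`, `s2`, .... Font weight is ignored for simplicity.

```python
from collections import Counter


def extract_spans(pdf_path):
    """Collect (font_size, text) pairs from a PDF using PyMuPDF.

    Hypothetical helper: assumes a PyMuPDF version that provides
    page.get_text("dict").
    """
    import fitz  # PyMuPDF; imported here so the rest works without it

    spans = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                # Image blocks carry no "lines" key, so default to empty.
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if text:
                            spans.append((round(span["size"], 1), text))
    return spans


def build_size_tags(spans):
    """Map each font size to a tag name (assumed rule, see lead-in)."""
    # Weight each size by the number of characters set in it.
    usage = Counter()
    for size, text in spans:
        usage[size] += len(text)
    body_size = usage.most_common(1)[0][0]

    larger = sorted((s for s in usage if s > body_size), reverse=True)
    smaller = sorted((s for s in usage if s < body_size), reverse=True)

    tags = {body_size: "p"}
    for rank, size in enumerate(larger, start=1):
        tags[size] = f"h{rank}"
    for rank, size in enumerate(smaller, start=1):
        tags[size] = f"s{rank}"
    return tags


def tag_spans(spans):
    """Return (tag, text) pairs for every span."""
    tags = build_size_tags(spans)
    return [(tags[size], text) for size, text in spans]


def rename_tags(tagged, mapping):
    """Rename tags according to mapping; unmapped tags are kept as-is."""
    return [(mapping.get(tag, tag), text) for tag, text in tagged]


# Small in-memory example, so no PDF file is needed to try the logic.
sample = [
    (24.0, "Title"),
    (11.0, "Body text one."),
    (11.0, "More body text."),
    (16.0, "Section"),
    (8.0, "1"),
]
tagged = tag_spans(sample)
renamed = rename_tags(tagged, {"h1": "title", "s1": "footnote"})
```

With a real file, `tag_spans(extract_spans("some.pdf"))` would produce the same kind of list, where `"some.pdf"` is a placeholder path. Keeping the tagging logic separate from the PDF reading makes it easy to inspect or rename tags without re-parsing the document.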