GitHub - gatlanit/PDF-To-Formatted-Text: A formatted PDF to Text program utilizing PyTorch, pytesseract, and RE (for formatting) as well as the pdf2image Python library

PDF To Formatted Text

This is a PDF-to-text converter whose output comes close to conforming to LexMed's formatting requirements. This repo holds the sample PDF and the output my script produces.

Dependencies/libraries used:

pytesseract
pdf2image
re (regular expressions)
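
As a rough illustration of how these three libraries usually fit together, here is a minimal sketch; it is not the code in main.py. The function names `ocr_pdf` and `tidy_text`, the 300 DPI setting, and the cleanup rules are all assumptions made for the example, and the rules do not represent LexMed's actual formatting requirements. Running `ocr_pdf` also requires the Tesseract and Poppler binaries to be installed on the system.

```python
import re


def ocr_pdf(pdf_path, dpi=300):
    """Render each PDF page to an image and OCR it; returns one string per page."""
    # Imported lazily so tidy_text() can be used without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    return [pytesseract.image_to_string(page) for page in pages]


def tidy_text(raw):
    """Apply a few hypothetical cleanup rules to raw OCR text."""
    text = raw.replace("\r\n", "\n")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # allow at most one blank line
    return text.strip()
```

With the binaries installed, something like `"\n\n".join(tidy_text(p) for p in ocr_pdf("Test.pdf"))` would produce a single cleaned string for the sample PDF.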

Current status

55% error rate
- Due to typos introduced by OCR quirks
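
The README does not say how this error rate is measured. One common option is a word error rate: the word-level edit distance between a hand-checked reference text and the OCR output, divided by the number of reference words. The sketch below assumes that definition; `edit_distance` and `word_error_rate` are hypothetical helper names, not functions from this repo.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a single misread word in a four-word reference counts as a 25% error rate, so a handful of OCR typos per line can add up quickly.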

Repository contents (19 commits):

outputs
outputsTEST
pdfs
test
.gitignore
README.md
Test.pdf
main.py
output.txt
