
A formatted PDF-to-text program utilizing PyTorch, pytesseract, and re (for formatting), as well as the pdf2image Python library


gatlanit/PDF-To-Formatted-Text


PDF To Formatted Text

This is a PDF-to-text converter whose output is close to conforming to LexMed's formatting requirements. This repo holds the sample PDF and the output my script produces for it.


Dependencies/libraries used:

  • pytesseract
  • pdf2image
  • re (regular expressions)
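
As a rough illustration of how these three libraries can fit together, here is a minimal sketch, not the script in this repo. The function names `clean_ocr_text` and `pdf_to_text` are hypothetical, and the cleanup rules are generic whitespace fixes chosen for illustration; they are not LexMed's formatting requirements. The OCR step assumes pdf2image (with Poppler) and pytesseract (with the Tesseract binary) are installed.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR output with a few generic, illustrative regex rules."""
    text = raw.replace("\r\n", "\n")
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces left dangling at the end of a line.
    text = re.sub(r" +\n", "\n", text)
    # Reduce three or more newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def pdf_to_text(pdf_path: str, dpi: int = 300) -> str:
    """Render each PDF page to an image, OCR it, and clean the result."""
    # Imported here so clean_ocr_text works without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return "\n\n".join(
        clean_ocr_text(pytesseract.image_to_string(page)) for page in pages
    )
```

Calling `pdf_to_text("sample.pdf")` (a placeholder path) would return the cleaned text of every page, separated by blank lines.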

Current status

  • 55% error rate
    • Caused by typos that the OCR step introduces
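
This README does not say how the error rate is measured. One common way to score OCR output against a hand-checked reference is word error rate: the word-level edit distance divided by the number of reference words. The sketch below assumes that definition, which may differ from how the figure above was obtained; `edit_distance` and `word_error_rate` are hypothetical helper names.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, ocr_output: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = ocr_output.split()
    if not ref_words:
        raise ValueError("reference text is empty")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Under this definition a single OCR typo counts as one whole-word error, so a document with many small typos can score a high rate even when most characters are correct.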
