Optical-character-recognition

The notebook in this repository uses pytesseract to extract text from a PDF document. The same approach can automate text acquisition from large collections of printed material, such as books. The extracted text can then be used for downstream tasks such as training language models, topic modeling, and document summarization. For more details, see the notebook.
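The general idea can be sketched as follows. This is a minimal illustration and not the notebook's actual code: it assumes pages are rendered to images with `pdf2image.convert_from_path` (an assumption, since this README does not say how the PDF is rasterized) and then passed to `pytesseract.image_to_string`. The function name `ocr_pdf` and its parameters are hypothetical. Running it for real would require the Tesseract binary and Poppler to be installed on the system.

```python
def ocr_pdf(pdf_path, dpi=300, lang="eng", render_pages=None, recognize=None):
    """Return the OCR text of a PDF, one page after another.

    Hypothetical helper for illustration only. `render_pages` and
    `recognize` can be overridden, e.g. to swap in a different
    rasterizer or OCR engine.
    """
    if render_pages is None:
        # Assumption: pdf2image (a Poppler wrapper) turns each page into a PIL image.
        from pdf2image import convert_from_path

        def render_pages(path):
            return convert_from_path(path, dpi=dpi)

    if recognize is None:
        # pytesseract wraps the Tesseract OCR engine.
        import pytesseract

        def recognize(image):
            return pytesseract.image_to_string(image, lang=lang)

    # OCR each page image and strip surrounding whitespace.
    page_texts = [recognize(image).strip() for image in render_pages(pdf_path)]

    # Separate pages with a blank line so page boundaries stay visible.
    return "\n\n".join(page_texts)
```

With the dependencies installed, calling `ocr_pdf("book.pdf")` (where `book.pdf` is a placeholder path) would return the recognized text as a single string. A higher `dpi` generally gives Tesseract more detail to work with, at the cost of slower rendering.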

Directorman9/Optical-character-recognition

Folders and files

Latest commit

History

Repository files navigation

Optical-character-recognition

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages