OCR application for Sailfish OS. Based on Tesseract OCR engine and Leptonica image processing library.

skvark/Textractor

Text Extractor

Work in progress. However, most of the core functionality is implemented.

Documentation and Help

Textractor Documentation

Environment and building

To build this, follow this Gist to set up the environment correctly: https://gist.github.com/skvark/49a2f1904192b6db311a

In short:

Add my repositories containing Tesseract OCR and Leptonica to the build machine targets.

Preprocessing

Tesseract OCR is just a plain recognition engine, so Leptonica is used to preprocess the image.

Currently, the following steps are performed before the image is passed to the engine for recognition:

  1. The image is first opened with QImage, its DPI is set to 300, it is rotated according to the device angle, and it is saved in JPG format.
  2. The JPG image is loaded with Leptonica and converted from 32 bpp to 8 bpp grayscale.
  3. An unsharp mask is applied.
  4. Local background normalization with Otsu's algorithm.
  5. Skew angle detection and rotation (Leptonica decides whether the image needs to be rotated).

After these steps, the image is passed to Tesseract.
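
To give a feel for what the thresholding in step 4 does, here is a minimal, self-contained sketch of a plain global Otsu threshold on an 8 bpp grayscale buffer. It is an illustration only: the application uses Leptonica for this step, and the sketch leaves out the local background normalization entirely. The function names `otsuThreshold` and `binarize` are made up for this example and do not exist in Textractor or Leptonica.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative only: a global Otsu threshold over an 8 bpp grayscale buffer.
// Returns the threshold t that maximizes the between-class variance, where
// pixels <= t form the dark class. Returns 0 for empty or single-valued input.
int otsuThreshold(const std::vector<std::uint8_t>& gray)
{
    // Histogram of the 256 possible gray levels.
    std::array<double, 256> hist{};
    for (std::uint8_t v : gray) {
        hist[v] += 1.0;
    }

    const double total = static_cast<double>(gray.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) {
        sumAll += i * hist[i];
    }

    double weightDark = 0.0;   // number of pixels with value <= t
    double sumDark = 0.0;      // sum of their gray values
    double bestVariance = 0.0;
    int bestThreshold = 0;

    for (int t = 0; t < 256; ++t) {
        weightDark += hist[t];
        sumDark += t * hist[t];
        if (weightDark == 0.0) continue;       // no dark pixels yet
        const double weightLight = total - weightDark;
        if (weightLight == 0.0) break;         // no light pixels left

        const double meanDark = sumDark / weightDark;
        const double meanLight = (sumAll - sumDark) / weightLight;
        const double diff = meanDark - meanLight;
        const double variance = weightDark * weightLight * diff * diff;
        if (variance > bestVariance) {
            bestVariance = variance;
            bestThreshold = t;
        }
    }
    return bestThreshold;
}

// Maps dark pixels (<= threshold) to 0 and light pixels to 255.
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& gray,
                                   int threshold)
{
    assert(threshold >= 0 && threshold <= 255);
    std::vector<std::uint8_t> out;
    out.reserve(gray.size());
    for (std::uint8_t v : gray) {
        out.push_back(static_cast<std::uint8_t>(v <= threshold ? 0 : 255));
    }
    return out;
}
```

Calling `binarize(gray, otsuThreshold(gray))` on a buffer with dark text on a light background yields a black-and-white buffer. A single global threshold like this breaks down under uneven lighting, which is presumably why the real pipeline normalizes the local background as part of this step.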

Postprocessing

The results are filtered based on the word confidence value. The confidence value is a number between 0 and 100: 0 means that Tesseract was not at all sure about the detected word, and 100 means that Tesseract is certain that the word is correct.
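
The filtering idea can be sketched in a few lines of C++. This is a minimal sketch, not the application's actual code: the `RecognizedWord` struct, the `filterByConfidence` and `joinWords` functions, and the cutoff of 60 used below are all assumptions made for illustration. It assumes each recognized word has already been paired with its confidence value.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical container for one recognized word; not a Tesseract type.
struct RecognizedWord {
    std::string text;
    float confidence;  // 0 (very unsure) to 100 (certain)
};

// Keeps only the words whose confidence is at least minConfidence.
std::vector<RecognizedWord> filterByConfidence(
    const std::vector<RecognizedWord>& words, float minConfidence)
{
    assert(minConfidence >= 0.0f && minConfidence <= 100.0f);
    std::vector<RecognizedWord> kept;
    std::copy_if(words.begin(), words.end(), std::back_inserter(kept),
                 [minConfidence](const RecognizedWord& w) {
                     return w.confidence >= minConfidence;
                 });
    return kept;
}

// Joins the surviving words with single spaces.
std::string joinWords(const std::vector<RecognizedWord>& words)
{
    std::string result;
    for (const RecognizedWord& w : words) {
        if (!result.empty()) {
            result += ' ';
        }
        result += w.text;
    }
    return result;
}
```

For example, filtering the words `quick` (91), `br0wn` (32.5) and `dog` (78) with a minimum confidence of 60 keeps `quick` and `dog`. A higher cutoff removes more misrecognized words but also drops more correct ones.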

Settings

I will make some kind of informative page which explains the parameters at some point.

Test image and result

Original:

preview0

Preprocessed:

preview01

Extracted text:

This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.

The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.






 D R I N K  COFFEE
L Do Stupid Faster
 With More Energy

Screenshots

preview1 preview2 preview3 preview4 preview5
