Nutrient image recognition experiment

When looking for product data for Questionmark's sustainability and health scores, we also consider online sources. While most online data is textual, some of it is only available in images.

This is an experiment to find out how hard it would be to extract the data from such an image. It is also a good opportunity for some image processing and recognition.

As a sample dataset, we're looking at some images from the Dutch supermarket webshop Hoogvliet, which has nutritional values in an image.

The initial approach was to do word-based recognition using a couple of custom-trained kNN classifiers (see nutrient-ocr-knn), but in the end just using tesseract was more convenient. That's what you're seeing here.

Running

Needs Python 2.5+ with PIL and tesseract 3.04 (or higher).

Since the images are quite low-resolution, the program scales them up three times, applies thresholding, and then calls tesseract. Common misdetections in the recognized text are corrected afterwards.
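The steps above can be sketched roughly as follows. This is not the code from tesseract.py, just a minimal illustration under a few assumptions: Python 3 with Pillow, "three times" read as a 3x enlargement, a threshold of 128, Dutch language data (`nld`) installed for tesseract, and a made-up `FIXES` table standing in for the real list of corrections.

```python
import os
import subprocess
import tempfile

from PIL import Image

SCALE = 3        # assumption: enlarge by a factor of three
THRESHOLD = 128  # assumption: cut-off between black and white

# Hypothetical corrections, for illustration only.
FIXES = {
    "Kiloca1orie": "Kilocalorie",
    "Grarn": "Gram",
}


def preprocess(img):
    """Upscale a low-resolution image and reduce it to pure black and white."""
    gray = img.convert("L")
    big = gray.resize((gray.width * SCALE, gray.height * SCALE), Image.BICUBIC)
    return big.point(lambda p: 255 if p >= THRESHOLD else 0)


def fix_misdetections(text):
    """Replace known OCR mistakes with the intended words."""
    for wrong, right in FIXES.items():
        text = text.replace(wrong, right)
    return text


def ocr(path, lang="nld"):
    """Preprocess the image, run the tesseract CLI on it, and clean up the text."""
    with tempfile.TemporaryDirectory() as tmpdir:
        prepared = os.path.join(tmpdir, "prepared.png")
        preprocess(Image.open(path)).save(prepared)
        # "tesseract <image> stdout" writes the recognized text to standard output.
        result = subprocess.run(
            ["tesseract", prepared, "stdout", "-l", lang],
            capture_output=True, text=True, check=True,
        )
    return fix_misdetections(result.stdout)
```

Calling `ocr("some-image.png")` would then return the cleaned-up text. The scale factor and threshold probably need tuning for each image source.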

The following example returns the nutritional values from an image.

$ ./tesseract.py ../nutrient-ocr-knn/imgstest/VOED665279000.png | grep -v '^$'
Voedingswaarde per 100 Gram
Energie 1050 Kilojoule
Energie 251 Kilocalorie
Vetten 12.3 Gram
Vetzuren, totaal verzadigd 8.2 Gram
Koolhydraten 31.4 Gram
Suikers 30.2 Gram
Eiwitten 3.3 Gram
Zout 0.32 Gram

That's looking great, and pretty easy.
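To use this as product data, the text lines still need to be split into a name, an amount and a unit. The sketch below shows one possible way to do that. It is not part of tesseract.py: the function name `parse_nutrients` is hypothetical, and it assumes every value line looks like the output above (name, number, unit) and that the header line starts with "Voedingswaarde".

```python
import re

# name, whitespace, number (optionally with decimals), whitespace, unit
LINE = re.compile(r"^(?P<name>.+?)\s+(?P<amount>\d+(?:\.\d+)?)\s+(?P<unit>\S+)$")


def parse_nutrients(text):
    """Turn OCR output into a list of (name, amount, unit) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip the "Voedingswaarde per 100 Gram" header (assumed format).
        if line.startswith("Voedingswaarde"):
            continue
        match = LINE.match(line)
        if match:
            rows.append(
                (match.group("name"), float(match.group("amount")), match.group("unit"))
            )
    return rows
```

Lines that do not match the pattern are silently dropped, so comparing the number of parsed rows with the number of non-empty input lines gives a quick sanity check.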
