Bindings to Tesseract-OCR: a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results.
- Upstream Tesseract-OCR documentation: https://tesseract-ocr.github.io/tessdoc/
- Introduction: https://docs.ropensci.org/tesseract/articles/intro.html
- Reference: https://docs.ropensci.org/tesseract/reference/ocr.html
Simple example
# Simple example
library(tesseract)
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)
# Get XML HOCR output
xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
cat(xml)
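The configurability mentioned at the top can be illustrated with a short sketch. It assumes the `options` argument of `tesseract()` and the upstream parameter `tessedit_char_whitelist`; the file name `receipt.png` is hypothetical, and whether a given parameter has any effect may depend on your libtesseract version.

```r
library(tesseract)

# Build an engine that is restricted to digits and a few symbols.
# tessedit_char_whitelist is an upstream Tesseract parameter; use
# tesseract_params("whitelist") to look up its description and default.
numbers <- tesseract(options = list(tessedit_char_whitelist = "$.0123456789"))

# "receipt.png" is a hypothetical local file; substitute your own image.
if (file.exists("receipt.png")) {
  cat(ocr("receipt.png", engine = numbers))
}
```

Passing the engine explicitly via `engine =` leaves the default English engine untouched, so the two can be compared on the same image.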
Roundtrip test: render PDF to image and OCR it back to text
# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]
# Render the first page of the PDF to a TIFF image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)
# Extract text from the rendered image
text <- ocr(img_file)
unlink(img_file)
cat(text)
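To judge how well the roundtrip worked, one option is to compare the OCR result with the original text. The helper below is a minimal sketch using base R's `utils::adist()` (Levenshtein distance); `text_similarity` is a hypothetical name and not part of the tesseract package, and the toy strings merely stand in for `orig` and `text`.

```r
# Normalised similarity in [0, 1] based on Levenshtein edit distance.
# Whitespace is collapsed first because OCR rarely reproduces layout exactly.
text_similarity <- function(a, b) {
  squash <- function(x) trimws(gsub("\\s+", " ", paste(x, collapse = " ")))
  a <- squash(a)
  b <- squash(b)
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - as.numeric(utils::adist(a, b)) / longest
}

# Toy strings: one substituted character ("I" read as "l") out of 20.
text_similarity("An Introduction to R", "An  lntroduction to R\n")
#> [1] 0.95
```

After running the roundtrip, `text_similarity(orig, text)` gives a rough score for the page; values close to 1 indicate that the OCR output matches the source text closely.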
On Windows and macOS the binary package can be installed from CRAN:
install.packages("tesseract")
Installation from source on Linux or macOS requires the Tesseract
library (see below).
On Debian or Ubuntu install libtesseract-dev and libleptonica-dev. Also install tesseract-ocr-eng to run examples.
sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng
On Ubuntu you can optionally use this PPA to get the latest version of Tesseract:
sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt-get install -y libtesseract-dev tesseract-ocr-eng
On Fedora we need tesseract-devel and leptonica-devel:
sudo yum install tesseract-devel leptonica-devel
On RHEL and CentOS we need tesseract-devel and leptonica-devel from EPEL:
sudo yum install epel-release
sudo yum install tesseract-devel leptonica-devel
On macOS use Tesseract from Homebrew:
brew install tesseract
Tesseract uses training data to perform OCR. Most systems default to English
training data. To improve OCR results for other languages you can install the
appropriate training data. On Windows and macOS you can do this in R using
tesseract_download():
tesseract_download('fra')
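A slightly fuller sketch of the same step on Windows or macOS: check whether the language is already installed, download it only if needed, then create an engine for it. It assumes that `tesseract_info()` returns a list with an `available` element naming the installed languages.

```r
library(tesseract)

# Download the French training data only if it is not installed yet.
if (!"fra" %in% tesseract_info()$available) {
  tesseract_download("fra")
}

# Create an engine for French and pass it to ocr() via the engine argument.
french <- tesseract("fra")
print(french)
```

The resulting engine is then used as `ocr(image, engine = french)`.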
On Linux you need to install the appropriate training data from your distribution. For example, to install the Spanish training data:
- tesseract-ocr-spa (Debian, Ubuntu)
- tesseract-langpack-spa (Fedora, EPEL)
Alternatively you can manually download training data from GitHub
and store it in a path on disk that you pass in the datapath parameter,
or set a default path via the TESSDATA_PREFIX environment variable.
Note that Tesseract 4 and Tesseract 3 use different training data formats.
Make sure to download training data from the branch that matches your libtesseract version.
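The manual route can be sketched as follows. The directory `~/tessdata` is a hypothetical location for the downloaded files, and the sketch assumes that `tesseract_info()` exposes `version` and `datapath` elements; training data files are named `<language>.traineddata`.

```r
library(tesseract)

# Which libtesseract is R linked against? This tells you which
# training data branch to download from.
info <- tesseract_info()
cat("libtesseract version:", info$version, "\n")
cat("default datapath:", info$datapath, "\n")

# Hypothetical directory holding manually downloaded *.traineddata files.
tessdata_dir <- path.expand("~/tessdata")

# Only build the engine if the Spanish training data is actually there.
if (file.exists(file.path(tessdata_dir, "spa.traineddata"))) {
  spanish <- tesseract("spa", datapath = tessdata_dir)
  print(spanish)
}

# Alternatively, set TESSDATA_PREFIX in the environment (for example in
# ~/.Renviron) before starting R so the directory becomes the default.
```

Passing `datapath` affects only that engine, which is convenient when experimenting with training data from different sources.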