Custom Apache Tika image for tika.populate.tools with OCR enabled for Spanish, Catalan and Basque.
- Base: `apache/tika:latest-full` (Tika + Tesseract + English).
- Adds: `tesseract-ocr-spa`, `tesseract-ocr-cat`, `tesseract-ocr-eus`.
- Loads `tika-config.xml` so the PDF parser uses `ocrStrategy=auto`: OCR runs only when a page has no text layer, so text-PDF extraction stays fast.
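
One way to confirm that the extra language packs made it into an image is to compare Tesseract's language list against the expected codes. The helper below is a minimal sketch: `missing_langs` is a hypothetical name, and it assumes the list arrives on stdin with one code per line, which is how `tesseract --list-langs` prints it.

```shell
# Hypothetical helper: print every expected Tesseract language code that is
# absent from the list read on stdin (one code per line).
# Any header line in the input is ignored because only exact matches count.
missing_langs() {
  installed="$(cat)"
  for lang in eng spa cat eus; do
    printf '%s\n' "$installed" | grep -qx "$lang" || echo "$lang"
  done
}

# Possible usage, assuming the tesseract binary is on the image's PATH:
#   docker run --rm --entrypoint tesseract ghcr.io/populatetools/tika-ocr:latest --list-langs | missing_langs
```

No output means all four languages are present.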
- Per-task ceiling is 30 minutes (`taskTimeoutMillis=1800000`).
- Tesseract per-OCR-call ceiling is 30 minutes (`timeoutSeconds=1800`).
- Clients can shorten both per request via `X-Tika-Timeout-Millis` and `X-Tika-OCRtimeoutSeconds`. They cannot raise them above these defaults.
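
As a sketch of how a client might shorten both timeouts for a single request, the function below wraps `curl` and sets the two headers named above. The function name `tika_extract` and the `TIKA_URL` variable are assumptions for illustration, as is the default of `http://localhost:9998`.

```shell
# Hypothetical helper: send one file to Tika with shortened timeouts.
# TIKA_URL is an assumed variable; it defaults to a server on localhost:9998.
tika_extract() {
  file="$1"      # path to the document to extract
  task_ms="$2"   # per-task timeout in milliseconds (at most 1800000)
  ocr_s="$3"     # per-OCR-call timeout in seconds (at most 1800)
  curl -sX PUT --data-binary "@${file}" \
    -H "Accept: text/plain" \
    -H "X-Tika-Timeout-Millis: ${task_ms}" \
    -H "X-Tika-OCRtimeoutSeconds: ${ocr_s}" \
    "${TIKA_URL:-http://localhost:9998}/tika"
}

# Possible usage: allow 2 minutes per task and 60 seconds per OCR call.
#   tika_extract /tmp/scan.pdf 120000 60
```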
ghcr.io/populatetools/tika-ocr:latest
GitHub Actions rebuilds the image:
- on every push to `main`,
- on every `v*` tag push (versioned releases),
- nightly at 04:00 UTC, so the upstream `apache/tika:latest-full` security updates get picked up,
- on demand via `workflow_dispatch`.
Tags published:
- `latest`: rolling, follows `main`.
- `vX.Y.Z`, `X.Y`, `X`: published when a `vX.Y.Z` git tag is pushed.
- `sha-<short>`: every build.
- `nightly`: the scheduled rebuild.
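
To make the release tag scheme concrete, the sketch below derives the three image tags from a git tag using plain shell parameter expansion. It only illustrates the naming; how the workflow actually computes its tags is not shown in this README, and `release_tags` is a hypothetical name.

```shell
# Hypothetical helper: list the image tags published for a vX.Y.Z git tag.
release_tags() {
  tag="$1"                      # e.g. v1.4.2
  version="${tag#v}"            # strip the leading "v": 1.4.2
  major="${version%%.*}"        # everything before the first dot: 1
  rest="${version#*.}"          # everything after the first dot: 4.2
  minor="${rest%%.*}"           # 4
  printf '%s\n' "$tag" "${major}.${minor}" "$major"
}

# Possible usage:
#   release_tags v1.4.2
```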
docker build -t tika-ocr:dev .
docker run --rm -p 9998:9998 tika-ocr:dev
curl -sL -o /tmp/anexos.pdf \
"https://contratos-files.gobierto.es/documents/tenders/6a89c71d8756f85720bb40f2c631b13d/Anexos.pdf"
curl -sX PUT --data-binary @/tmp/anexos.pdf \
-H "Content-Type: application/pdf" \
-H "Accept: text/plain" \
http://localhost:9998/tika | wc -c

A successful run prints a byte count well over 20; without the final `| wc -c`, the body contains recognisable Spanish text. The same request against vanilla `apache/tika:latest-full` returns ~20 bytes of newlines for image-only PDFs.
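
The check above can be turned into a pass/fail smoke test. The following is a minimal sketch: `check_ocr_bytes` is a hypothetical name, and the default threshold of 20 bytes comes from the figure quoted above, so a larger value may give a safer margin.

```shell
# Hypothetical helper: succeed only if the extracted byte count exceeds a
# minimum (default 20). Whitespace is stripped first because some `wc`
# implementations pad the count with leading spaces.
check_ocr_bytes() {
  bytes="$(printf '%s' "$1" | tr -d '[:space:]')"
  min="${2:-20}"
  if [ "$bytes" -gt "$min" ]; then
    echo "ok: ${bytes} bytes extracted"
  else
    echo "fail: only ${bytes} bytes, OCR probably did not run" >&2
    return 1
  fi
}

# Possible usage with the request shown above:
#   check_ocr_bytes "$(curl -sX PUT --data-binary @/tmp/anexos.pdf \
#     -H 'Accept: text/plain' http://localhost:9998/tika | wc -c)"
```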
Pull the published image and run it however you deploy containers:
docker pull ghcr.io/populatetools/tika-ocr:latest
docker run -d -p 9998:9998 ghcr.io/populatetools/tika-ocr:latest

`apache/tika:latest-full` ships Tesseract + English only. Image-only PDFs from Spanish public administrations (scanner output, Hewlett-Packard MFP, etc.) silently extracted to empty strings, masking ~22% of indexed documents in contratos.gobierto.es. See the parent investigation in the contratos repo.