I have a shell script that invokes a basic ocrmypdf command on a local PDF. I've been using this script successfully with the macOS automation program Hazel. The Hazel rule simply watches a folder and, when a new PDF is detected, calls the following shell script (where "$1" is the full path to the PDF file being acted upon):
This automation was working fine until recently, but now the script fails while executing the ocrmypdf command. I believe this may be due to a recent change in OCRmyPDF or one of its dependencies.
Specifically, Leptonica complains about a file not found: 000001_ocr.png. When I invoke this script directly from the terminal it runs fine; it fails only when called from Hazel. It looks like a path issue, but I'm trying to determine what changed, since it was working fine before.
Where are you installing from?
Homebrew
What operating system are you working on?
macOS
Relevant log output
2024-01-09 10:54:34.476 hazelworker[14105] DEBUG: == script output ==
ocrmypdf 16.0.4
Running: ['tesseract', '--version']
Found tesseract 5.3.3
Running: ['tesseract', '--version']
Running: ['gs', '--version']
Found gs 10.2.1
Running: ['gs', '--version']
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages in "/opt/homebrew/share/tessdata/" (3):
eng
osd
snum
pikepdf mmap enabled
os.symlink(/Users/sam/Documents/inbox/hazel-scan-ocr/test.pdf, /tmp/ocrmypdf.io.40xr8ti6/origin)
os.symlink(/tmp/ocrmypdf.io.40xr8ti6/origin, /tmp/ocrmypdf.io.40xr8ti6/origin.pdf)
Gathering info with 1 thread workers
pikepdf mmap enabled
Using Tesseract OpenMP thread limit 3
pikepdf mmap enabled
1 Rasterize with pngmono, rotation 0
1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=1', '-dLastPage=1', '-r299.909110x299.909110', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.40xr8ti6/origin.pdf']
1 Rotating output by 0
1 resolution (299.89779999999996, 299.89779999999996)
1 Running: ['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.40xr8ti6/000001_ocr.png', '/tmp/ocrmypdf.io.40xr8ti6/000001_ocr_tess', 'pdf', 'txt']
1 [tesseract] Leptonica Error in fopenReadStream: file not found: 000001_ocr.png
1 [tesseract] Leptonica Error in findFileFormat: image file not found: /tmp/ocrmypdf.io.40xr8ti6/000001_ocr.png
1 [tesseract] Leptonica Error in fopenReadStream: file not found: PNG
1 [tesseract] Leptonica Error in pixRead: image file not found: PNG
1 [tesseract] Image file PNG cannot be read!
1 [tesseract] Error during processing.
ExitCodeException
Traceback (most recent call last):
File "/opt/homebrew/Cellar/ocrmypdf/16.0.4/libexec/lib/python3.12/site-packages/ocrmypdf/_exec/tesseract.py", line 385, in generate_pdf
p = run(args_tesseract, stdout=PIPE, stderr=STDOUT, timeout=timeout, check=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/ocrmypdf/16.0.4/libexec/lib/python3.12/site-packages/ocrmypdf/subprocess/__init__.py", line 63, in run
proc = subprocess_run(args, env=env, check=check, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.40xr8ti6/000001_ocr.png', '/tmp/ocrmypdf.io.40xr8ti6/000001_ocr_tess', 'pdf', 'txt']' returned non-zero exit status 1.
My guess would be that macOS or Hazel is virtualizing /tmp in some way.
The missing file is supposed to be created by Ghostscript (the gs -dQUIET command just above the tesseract call in the log). That command completes without error, but the expected file is not there afterwards, so it is apparently being written somewhere else.
You could try setting the TMPDIR environment variable to redirect temporary files to somewhere like /Users/you/tmp, since I think that location is less likely to be affected.
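As a minimal sketch of that suggestion, the wrapper script could export TMPDIR before running OCR. The scratch location ($HOME/tmp) and the ocrmypdf arguments below are assumptions for illustration, since the original script is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical wrapper: the scratch location is an assumption, pick any
# directory under your home folder that the calling process can write to.
scratch_dir="${HOME}/tmp"
mkdir -p "$scratch_dir"

# ocrmypdf and the tools it launches will place temporary files here.
export TMPDIR="$scratch_dir"

# Placeholder invocation (in-place OCR of the file Hazel passes as "$1").
# It is skipped when no argument is given or ocrmypdf is not on PATH.
if [ -n "${1:-}" ] && command -v ocrmypdf >/dev/null 2>&1; then
    ocrmypdf "$1" "$1"
fi
```

Exporting the variable inside the script keeps the change local to this automation rather than altering the environment of every Hazel rule.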
I had the same issue; see DanBloomberg/leptonica#735. The problem is that Leptonica 1.84.0 (23 Dec 2023), a shared library used by Tesseract, cannot open files in /tmp on macOS. On macOS, TMPDIR is set to something under /var/folders by default, but some programs change it to /tmp. If it is set to /tmp, you have to change it to something else, for example with export TMPDIR=/private/tmp, before calling ocrmypdf, which writes its temporary files to $TMPDIR.
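The workaround can be sketched as a small guard placed near the top of the script. On macOS, /tmp is a symlink to /private/tmp, so the rewritten path refers to the same directory. The helper name normalize_tmpdir is hypothetical, and the macOS check is an added precaution, not something from the Leptonica report.

```shell
#!/bin/sh
# Hypothetical helper: map a path under /tmp to the same location under
# /private/tmp; any other value is printed unchanged.
normalize_tmpdir() {
    case "$1" in
        /tmp|/tmp/*) printf '%s\n' "/private$1" ;;
        *)           printf '%s\n' "$1" ;;
    esac
}

# Apply the rewrite only on macOS, and only when TMPDIR is already set.
if [ "$(uname -s)" = "Darwin" ] && [ -n "${TMPDIR:-}" ]; then
    TMPDIR="$(normalize_tmpdir "$TMPDIR")"
    export TMPDIR
fi
```

With this in place, a caller that sets TMPDIR=/tmp ends up running ocrmypdf with TMPDIR=/private/tmp, while the default /var/folders value is left alone.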