[Bug]: 'File not found' error in latest versions #1231

Closed
templeman opened this issue Jan 9, 2024 · 4 comments

@templeman

What were you trying to do?

I have a shell script that invokes a basic ocrmypdf command on a local PDF. I've been using this script successfully with the macOS automation program Hazel. The Hazel automation simply watches a folder and, when a new PDF is detected, calls the following shell script (where "$1" is the full path to the PDF file being acted upon):

PATH=$PATH:/opt/homebrew/bin
export PATH

filename=$(basename "$1")
converting_directory=~/Documents/processed/

ocrmypdf --no-progress-bar -v 1 "$1" "$converting_directory""$filename"

This automation was working fine until recently, but now the script fails while executing the ocrmypdf command. I believe this may be due to a recent change in OCRmyPDF or one of its dependencies.

Specifically, Leptonica complains about a file not found: 000001_ocr.png. When I invoke this script directly from the terminal it runs fine - only when called from Hazel does it fail. It seems like a pathing issue, but I'm trying to determine what changed since it was working fine before.

Where are you installing from?

Homebrew

What operating system are you working on?

macOS

Relevant log output

2024-01-09 10:54:34.476 hazelworker[14105] DEBUG: == script output ==
ocrmypdf 16.0.4
Running: ['tesseract', '--version']
Found tesseract 5.3.3
Running: ['tesseract', '--version']
Running: ['gs', '--version']
Found gs 10.2.1
Running: ['gs', '--version']
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages in "/opt/homebrew/share/tessdata/" (3):
eng
osd
snum

pikepdf mmap enabled
os.symlink(/Users/sam/Documents/inbox/hazel-scan-ocr/test.pdf, /tmp/ocrmypdf.io.40xr8ti6/origin)
os.symlink(/tmp/ocrmypdf.io.40xr8ti6/origin, /tmp/ocrmypdf.io.40xr8ti6/origin.pdf)
Gathering info with 1 thread workers
pikepdf mmap enabled

Using Tesseract OpenMP thread limit 3
pikepdf mmap enabled
    1 Rasterize with pngmono, rotation 0
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=1', '-dLastPage=1', '-r299.909110x299.909110', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.40xr8ti6/origin.pdf']
    1 Rotating output by 0
    1 resolution (299.89779999999996, 299.89779999999996)
    1 Running: ['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.40xr8ti6/000001_ocr.png', '/tmp/ocrmypdf.io.40xr8ti6/000001_ocr_tess', 'pdf', 'txt']
    1 [tesseract] Leptonica Error in fopenReadStream: file not found: 000001_ocr.png
    1 [tesseract] Leptonica Error in findFileFormat: image file not found: /tmp/ocrmypdf.io.40xr8ti6/000001_ocr.png
    1 [tesseract] Leptonica Error in fopenReadStream: file not found: PNG
    1 [tesseract] Leptonica Error in pixRead: image file not found: PNG
    1 [tesseract] Image file PNG cannot be read!
    1 [tesseract] Error during processing.

ExitCodeException
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/ocrmypdf/16.0.4/libexec/lib/python3.12/site-packages/ocrmypdf/_exec/tesseract.py", line 385, in generate_pdf
    p = run(args_tesseract, stdout=PIPE, stderr=STDOUT, timeout=timeout, check=True)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ocrmypdf/16.0.4/libexec/lib/python3.12/site-packages/ocrmypdf/subprocess/__init__.py", line 63, in run
    proc = subprocess_run(args, env=env, check=check, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.40xr8ti6/000001_ocr.png', '/tmp/ocrmypdf.io.40xr8ti6/000001_ocr_tess', 'pdf', 'txt']' returned non-zero exit status 1.

@jbarlow83
Collaborator

My guess would be that macOS or Hazel is virtualizing /tmp in some way.
The file that is missing is supposed to be created by Ghostscript (the gs -dQUIET command above the tesseract one). That command completes without error, but the expected file is not created, so the output must be going somewhere else.

You could try setting the TMPDIR environment variable to redirect temporary files to somewhere like /Users/you/tmp, since I think that is less likely to be defeated.
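
A minimal sketch of that suggestion applied to the script from the report. The scratch directory `$HOME/tmp` is only an assumption for illustration; any per-user directory the script can write to should serve the same purpose.

```shell
#!/bin/sh
# Assumption: "$HOME/tmp" is an arbitrary per-user scratch directory,
# chosen for illustration; it is not required by OCRmyPDF.
scratch_dir="$HOME/tmp"
mkdir -p "$scratch_dir"

# Point temporary files away from /tmp for everything started below.
TMPDIR="$scratch_dir"
export TMPDIR

# Same invocation as in the report, run only when an input path was
# passed and ocrmypdf is actually on PATH.
if [ -n "${1:-}" ] && command -v ocrmypdf >/dev/null 2>&1; then
    ocrmypdf --no-progress-bar -v 1 "$1" "$HOME/Documents/processed/$(basename "$1")"
fi
```

With this in place, the `/tmp/ocrmypdf.io.*` paths in the verbose log should instead appear under the scratch directory, which makes it easy to confirm that the variable took effect.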

@jbarlow83
Collaborator

I'm closing the issue because it's mainly a user configuration problem, and it's unlikely there's a change I could make that would resolve it.

@templeman
Author

Thank you for the suggestion @jbarlow83, I'll give it a go.

@yonran

yonran commented Feb 20, 2024

I had the same issue: DanBloomberg/leptonica#735. The problem is that leptonica 1.84.0 (23 Dec 2023), a shared library used by tesseract, cannot open files in /tmp on macOS. On macOS, TMPDIR is set to something in /var/folders by default, but some programs change it to /tmp. If it is set to /tmp, you have to change it to something else, such as export TMPDIR=/private/tmp, before calling ocrmypdf, which writes its temporary files to $TMPDIR.
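
As a rough sketch, the workaround could be made conditional so that TMPDIR is only rewritten when it has the problematic value. The helper name `fix_tmpdir` is hypothetical, and the check covers only the exact values `/tmp` and `/tmp/`, not subdirectories of `/tmp`.

```shell
#!/bin/sh
# Hypothetical helper: map a TMPDIR of /tmp (or /tmp/) to /private/tmp,
# the directory that /tmp is a symlink to on macOS. Any other value is
# printed unchanged.
fix_tmpdir() {
    case "${1:-}" in
        /tmp|/tmp/) printf '%s\n' /private/tmp ;;
        *)          printf '%s\n' "${1:-}" ;;
    esac
}

# Apply the rewrite only on macOS and only when TMPDIR is set.
if [ "$(uname -s)" = "Darwin" ] && [ -n "${TMPDIR:-}" ]; then
    TMPDIR=$(fix_tmpdir "$TMPDIR")
    export TMPDIR
fi
```

Placing this near the top of the wrapper script, before the ocrmypdf call, leaves the default /var/folders value untouched and only steps in when the calling program has switched TMPDIR to /tmp.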
