[Feature]: Process hOCR textangle attribute in hOCR to PDF transform #1467

Open
0dinD opened this issue Jan 31, 2025 · 0 comments
Labels: enhancement, triage (issue needs triage)

0dinD commented Jan 31, 2025

Describe the proposed feature

Currently, the hOCR to PDF transform ignores the hOCR textangle attribute and assumes that textangle is 0 degrees (the default when textangle is not specified). This causes problems in the PDF output whenever textangle is present with a non-zero value, which frequently happens when processing documents that mix text orientations in 90-degree steps (example provided below). Note that Tesseract is not perfect either and sometimes produces garbled hOCR output, but that is a separate issue, which I've reported as tesseract-ocr/tesseract#4387. In case it is of interest, here is the Tesseract code that sets textangle: https://github.com/tesseract-ocr/tesseract/blob/3157ff0e741ea5c85e16fbd1c6edf20f30eccbd3/src/api/hocrrenderer.cpp#L43-L58
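To illustrate what reading the attribute could look like, here is a minimal sketch that extracts textangle from the `title` string of an hOCR line element and falls back to 0 when it is absent. The function name `parse_textangle` and the sample title strings are hypothetical and are not taken from OCRmyPDF's code base.

```python
import re

# Matches a "textangle <number>" property inside an hOCR title attribute,
# where properties are separated by semicolons.
_TEXTANGLE_RE = re.compile(r"(?:^|;)\s*textangle\s+(-?\d+(?:\.\d+)?)")


def parse_textangle(title: str) -> float:
    """Return the textangle in degrees, or 0.0 if the property is missing.

    Hypothetical helper for illustration; not OCRmyPDF's actual API.
    """
    match = _TEXTANGLE_RE.search(title)
    return float(match.group(1)) if match else 0.0


# Sample title strings (made up for this sketch).
rotated = parse_textangle("bbox 10 20 30 200; textangle 90; x_size 25")
upright = parse_textangle("bbox 1 2 3 4; baseline 0 -5")
```

Treating a missing property as 0 matches the default described above, so lines without textangle would behave exactly as they do today.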

Below is a case where Tesseract does produce valid hOCR, with a non-zero textangle in some parts of the document:

Command used: ocrmypdf text-mixed-orientation.png text-mixed-orientation.pdf

Input image:

[Image: text-mixed-orientation.png]

In the output PDF, the 90-degree text is incorrect: OCRmyPDF treats it as 0-degree text even though it has a textangle of 90 in the hOCR (I used the -k option to see this in the hOCR output).
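One possible way to honor the angle, sketched here under assumptions, is to build a rotated PDF text matrix from it instead of an unrotated one. This assumes that textangle is a counter-clockwise rotation in degrees and that the origin of the line has already been converted to PDF coordinates. The function `text_matrix` is hypothetical and does not reflect how OCRmyPDF is actually structured.

```python
import math


def text_matrix(angle_deg: float, x: float, y: float) -> tuple:
    """Return the six numbers (a, b, c, d, e, f) of a PDF text matrix.

    The matrix rotates text counter-clockwise by angle_deg and places its
    origin at (x, y) in PDF coordinates. Hypothetical helper for
    illustration only; locating the origin from the hOCR bbox is left out.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (cos_t, sin_t, -sin_t, cos_t, x, y)


# With an angle of 0 this reduces to a plain translation, i.e. the
# behaviour the transform has today.
upright = text_matrix(0, 72.0, 700.0)
# With an angle of 90 the text direction points up the page.
vertical = text_matrix(90, 72.0, 100.0)
```

The harder part in practice is probably choosing the origin and baseline for rotated lines from the hOCR bbox, which this sketch deliberately leaves out.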
