Add hOCR option to Text Extraction Media Attachment action and IIIF Manifest #897

Merged: 11 commits merged on Oct 21, 2022

alxp

@alxp alxp commented Sep 7, 2022

GitHub Issue:

Islandora/documentation#1580

What does this Pull Request do?

Adds an option to generate hOCR derivatives to the text-extraction file field derivative action.

What's new?

The action type "Generate hOCR Extracted Text for Media Attachment" now has a drop-down to select Plain Text or hOCR.

Also added a select box for the field on the media that stores the resulting hOCR.

IIIF manifest is now populated with hOCR stream locations if they exist, in a seeAlso section as per https://github.com/dbmdz/mirador-textoverlay

  • Does this change add any new dependencies? No
  • Does this change require any other modifications to be made to the repository
    (i.e. Regeneration activity, etc.)? No
  • Could this change impact execution of existing code? No

How should this be tested?

Test on either playbook or ISLE. For ISLE I've just successfully tested using a fresh site built with make starter_dev

  1. Go to Admin -> Structure -> Media Types and go to Manage Fields on the File type.
  2. Add a File field called hOCR extracted Text which can hold files with extension xml
  3. Ensure the new field is enabled in the Manage Display tab
  4. Go to Admin -> Actions and click Create New Advanced Action with the Generate Extracted Text for Media Attachment action type.
  5. Give the new action a name that mentions hOCR.
  6. In Format field select hOCR Extracted Text with Positional Data
  7. For Destination File Field Name select the field you just created. Keep None for the destination text field, and save the action.
  8. Go to Admin -> Context UI -> Context and edit the Page Derivatives context
  9. Click Add Reaction and choose 'Derive File for Existing Media'.
  10. In the select box choose the action you created above and save.
  11. Add a new Repository Item with type Paged Content
  12. Add a child object of type Page
  13. On that child object, add a Media of type File and populate it with a TIFF that has text on it. Check the Original File checkbox in Media Use. Do not add anything to the hOCR Extracted Text field you created.
  14. Save the media.
  15. After about a minute, the extracted text with positional data field should be populated. You can verify this directly.
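
One way to verify the derivative directly is to download the file from the new field and check that it really contains hOCR markup. The following is a minimal Python sketch, not part of this PR. It assumes the file is well-formed XHTML (as Tesseract normally emits) and relies on the hOCR convention that pages and words carry the classes `ocr_page` and `ocrx_word`, with a `bbox` in the `title` attribute. The function name and the sample markup are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a downloaded hOCR derivative.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" id="page_1" title="bbox 0 0 1000 1500">
   <span class="ocr_line" id="line_1_1" title="bbox 100 100 600 150">
    <span class="ocrx_word" id="word_1_1" title="bbox 100 100 300 150; x_wconf 95">Hello</span>
    <span class="ocrx_word" id="word_1_2" title="bbox 320 100 600 150; x_wconf 91">world</span>
   </span>
  </div>
 </body>
</html>"""

# hOCR stores coordinates as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def summarize_hocr(xml_text):
    """Return (page_count, [(word, (x0, y0, x1, y1)), ...]) for an hOCR document."""
    root = ET.fromstring(xml_text)
    pages = 0
    words = []
    for el in root.iter():
        classes = el.get("class", "").split()
        if "ocr_page" in classes:
            pages += 1
        elif "ocrx_word" in classes:
            match = BBOX.search(el.get("title", ""))
            if match:
                box = tuple(int(v) for v in match.groups())
                words.append(((el.text or "").strip(), box))
    return pages, words


pages, words = summarize_hocr(SAMPLE_HOCR)
print(pages, words)
```

If the field holds plain text rather than hOCR, `ET.fromstring` will most likely raise a `ParseError`, which is itself a quick signal that the wrong format was generated.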

Testing the Manifest additions:

  1. Assuming above is done, go to Admin -> Structure -> Views and edit the IIIF Manifest view.
  2. Click the Settings for the IIIF Manifest Display
  3. Choose the field you created under Structured OCR Data file Field
    image
  4. Save the view.
  5. Go to the Paged Content node you created and add "/book-manifest" to the end of the URL.
  6. Look for a seeAlso section in the manifest JSON that should contain a reference to the hOCR field with appropriate MIME Type and Description
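
For larger books it can be easier to check the manifest with a script than by eye. The sketch below is a rough Python illustration, not code from this PR: it walks a parsed manifest and collects every `seeAlso` entry wherever it appears. The sample manifest, its URLs, and its label are invented; the exact keys and values this PR emits may differ.

```python
import json

# Hypothetical, trimmed manifest; real URLs and labels will differ.
SAMPLE_MANIFEST = json.loads("""
{
  "@type": "sc:Manifest",
  "sequences": [{
    "canvases": [{
      "@id": "https://example.org/node/2/canvas/1",
      "seeAlso": {
        "@id": "https://example.org/sites/default/files/page1-hOCR.xml",
        "format": "text/vnd.hocr+html",
        "label": "hOCR embedded text"
      }
    }]
  }]
}
""")


def find_see_also(node):
    """Recursively collect every seeAlso entry in a parsed manifest."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "seeAlso":
                # seeAlso may be a single object or a list of objects.
                found.extend(value if isinstance(value, list) else [value])
            else:
                found.extend(find_see_also(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_see_also(item))
    return found


for entry in find_see_also(SAMPLE_MANIFEST):
    print(entry.get("format"), entry.get("@id"))
```

To run it against a real site, replace `SAMPLE_MANIFEST` with the JSON fetched from the node's `/book-manifest` URL.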

Documentation Status

  • Does this change existing behaviour that's currently documented? No
  • Does this change require new pages or sections of documentation? No (Forthcoming PR may require it)

Additional Notes:

Any additional information that you think would be helpful when reviewing this PR.

Working on supporting Mirador Text Overlay and eventually search results highlighting. This supports that initiative.

Interested parties

Tag (@ mention) interested parties or, if unsure, @Islandora/8-x-committers

@alxp alxp changed the title from "Hocr" to "Add hOCR option to Text Extraction Media Attachment action and IIIF Manifest" on Sep 7, 2022
@wgilling

wgilling commented Sep 7, 2022

I will test this before the end of the week. Thanks! :)

@wgilling

I brought the code into an isle-dc project, got through step 11, and ran into an error making a repository item. I am certain that the error was not due to this PR, because I saw the same error when trying to edit an existing node before I made the file's media field, context rule, etc. I will sort this error out after today's I8 call.

For the record, the error I got was on create or edit of a Repository Item node:

Drupal\Component\Plugin\Exception\ContextException: The context is not a valid context. in Drupal\Core\Executable\ExecutablePluginBase->getContextDefinition() (line 184 of /var/www/drupal/web/core/lib/Drupal/Core/Plugin/ContextAwarePluginTrait.php).

@wgilling

wgilling commented Oct 5, 2022

I did finally get around to testing this again (it was blocked by an unrelated bug in my isle code that was preventing me from making any new content). I might have something misconfigured, but the TIFF file I used did not yield an hOCR derivative. I'll repeat all of the steps again and follow up this afternoon.

@alxp
Author

alxp commented Oct 5, 2022 via email

@alxp
Author

alxp commented Oct 6, 2022

Hi @wgilling ,

I set up a fresh ISLE-DC with make demo and went through the above steps.

The one thing I missed was that you should check the Media Use checkboxes when adding media. I checked 'Original File' and 'Service File', uploaded a JP2, and the hOCR file field was populated.

Hope this was the missing step for you to be able to reproduce it 🙏🏼

Screen Shot 2022-10-06 at 1 02 33 PM
Screen Shot 2022-10-06 at 12 54 28 PM

@wgilling

I did finally get around to testing this again (it was blocked by an unrelated bug in my isle code that was preventing me from making any new content). The TIFF file I used did not yield an hOCR derivative. I repeated the steps and had the same result -- no hOCR derivative is ever created for me. I even used the node's media screen to trigger that system action. I saw no errors in my watchdog log, but perhaps there is a clue in the hypercube log -- when I run the action from the node's media screen, I only get this:

hypercube_1 | 172.23.0.5 - - [12/Oct/2022:14:41:27 +0000] "GET / HTTP/1.1" 499 0 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_275)" "-"

@alxp
Author

alxp commented Oct 12, 2022

Thanks for trying again. Can you send me the TIFF you are using?

I converted a JP2 in my test data to a TIFF to test and it works the same as the JP2.

One thing that might be a clue is that out of the box and with the above testing instructions you should still get the unrelated Extracted Text media generated which should have the plain text OCR. Is this being generated?

Here's the file I used:
download.8.tiff.zip

@wgilling

Neither derivative is created with my TIF. I'd attach it but it is 28MB in size. I will try the TIF file that you have zipped up.

@wgilling

Ok -- I double-checked a couple things and then it occurred to me that sometimes the entire docker stack needs to reset - after running make down;make up, I do get the extracted text OCR but I still did not get an hOCR. I ran the system action from the node's media screen by selecting the TIFF original file and running the "Make hOCR Extracted Text" action I created. The hypercube output attempted this loop 10 times:
hypercube_1 | [2022-10-12 15:18:40] app.DEBUG: < 200 [] []
hypercube_1 | 172.25.0.9 - - [12/Oct/2022:15:18:40 +0000] "GET / HTTP/1.1" 200 6558 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_275)" "-"
hypercube_1 | [2022-10-12 15:18:41] app.INFO: Matched route "{route}". {"route":"GET_","route_parameters":{"_controller":"hypercube.controller:get","_route":"GET_"},"request_uri":"http://hypercube:8000/","method":"GET"} []
hypercube_1 | [2022-10-12 15:18:41] app.DEBUG: Checking for guard authentication credentials. {"firewall_key":"default","authenticators":1} []
hypercube_1 | [2022-10-12 15:18:41] app.DEBUG: Checking support on guard authenticator. {"firewall_key":"default","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
hypercube_1 | [2022-10-12 15:18:41] app.DEBUG: Calling getCredentials() on guard authenticator. {"firewall_key":"default","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
hypercube_1 | [2022-10-12 15:18:41] crayfish.syn.jwt_authentication.DEBUG: Token: eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJpYXQiOjE2NjU1ODc4NjksImV4cCI6MTY2NTU5NTA2OSwid2ViaWQiOiIxIiwiaXNzIjoiaHR0cHM6XC9cL2lzbGFuZG9yYS50cmFlZmlrLm1lIiwic3ViIjoiYWRtaW4iLCJyb2xlcyI6WyJhdXRoZW50aWNhdGVkIiwiYWRtaW5pc3RyYXRvciIsImZlZG9yYWFkbWluIl0sImF1ZCI6WyJpc2xhbmRvcmEiXX0.PMvhHJ80FA4rr5oe-2UaRYDYMm-oe0OvKLCfLtQupJv02QoiDnPd2k7_twXIljbyn2b1tmuZSxi_qV47GxkProlotynrJXMxtwyJ4id91pA90AXbX7-k8ZxtT1-SLxRyxU41bP-Tisy5IPfQjNVh_hE5MY462Vg_J_kK3wqHYhtfEheo0P3y_4isK8B0jNQgRKXTtZ022k9TL8yXq0JqbCPrlECq9WfkPUjam3Bmvks-u0iM65gbJixAzch5WYf1G06I6oy6nLuyjhkQ-F3If0uUbj73zzJ7xka8Nsy-yappocWSMIFosSf4W5KCKVBYQs9CU8g4zfxgyNIC28075A [] []
hypercube_1 | [2022-10-12 15:18:41] app.DEBUG: Passing guard token information to the GuardAuthenticationProvider {"firewall_key":"default","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
hypercube_1 | [2022-10-12 15:18:41] app.INFO: Guard authentication successful! {"token":"[object] (Symfony\\Component\\Security\\Guard\\Token\\PostAuthenticationGuardToken: PostAuthenticationGuardToken(user=\"admin\", authenticated=true, roles=\"authenticated, administrator, fedoraadmin\"))","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
hypercube_1 | [2022-10-12 15:18:41] app.DEBUG: Guard authenticator set no success response: request continues. {"authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
hypercube_1 | [2022-10-12 15:18:41] app.DEBUG: Remember me skipped: it is not configured for the firewall. {"authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
hypercube_1 | [2022-10-12 15:18:41] app.DEBUG: > GET / [] []
hypercube_1 | [2022-10-12 15:18:41] app.DEBUG: Got Content-Type: {"type":"image/tiff"} []
hypercube_1 | [2022-10-12 15:18:41] app.DEBUG: Executing command: {"cmd":"tesseract stdin stdout "} []

@wgilling

The system action seems to be set up right -- the destination is supposed to be that file field named field_hocr (in the screenshot below). Is there supposed to be a text field as the OCR uses field_extracted_text for the system action?
image

@alxp
Author

alxp commented Oct 12, 2022

That action looks correct. The Tesseract log above does not look like the event was ever fired for hOCR: the tesseract command line should have an option to generate hOCR, like this from my log:

{"cmd":"tesseract stdin stdout -c tessedit_create_hocr=1 -c hocr_font_info=0"} []
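
To make the difference between the two log lines concrete, here is a rough Python sketch of how a format choice could map to the Tesseract arguments, plus a helper for spotting the hOCR flag in a logged command. This is not Hypercube's actual code (which is PHP); the function names and the format keys `plain_text` and `hocr` are invented for illustration, and only the two `-c` options are taken from the log line above.

```python
def tesseract_args(fmt):
    """Build a tesseract argv that reads an image on stdin and writes to stdout."""
    args = ["tesseract", "stdin", "stdout"]
    if fmt == "hocr":
        # Options copied from the logged command for hOCR output.
        args += ["-c", "tessedit_create_hocr=1", "-c", "hocr_font_info=0"]
    elif fmt != "plain_text":
        raise ValueError(f"unknown format: {fmt}")
    return args


def requested_hocr(cmd):
    """Return True if a logged command string asked Tesseract for hOCR."""
    return "tessedit_create_hocr=1" in cmd.split()


print(" ".join(tesseract_args("hocr")))
print(requested_hocr("tesseract stdin stdout "))
```

Running `requested_hocr` over the `"cmd"` values in a Hypercube log is a quick way to tell whether the hOCR action ever reached Tesseract.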

Can you post the config page for the context reaction for Page Derivatives? It should have a second set of derivatives in the left sidebar near the bottom, since the reaction is not in the normal Derivatives group:

image

@wgilling

Yeah - I might have a borked isle.

image

@alxp
Author

alxp commented Oct 12, 2022

Hm I don't have the Media Type context condition, what is in there?

@wgilling

I hope that somebody else can test this because it seems like this should be working - especially since the OCR derivative is created. Could it be the version of tesseract I am running?

bash-5.1# tesseract --version
tesseract 4.1.1
 leptonica-1.80.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 1.1.0

@wgilling

wgilling commented Oct 12, 2022

> Hm I don't have the Media Type context condition, what is in there?

It was that the media type is not "Remote Video" -- I removed that condition after I saw your screenshot and ran it again but still did not get an hOCR.

@wgilling

I did not intend to link this to the "[TECH DEBT] Revisit why we are using Features for Islandora Core Feature #902" ticket. I thought I was in my terminal and typed "exit", but the browser had focus: the "x" key opened the Development dialog, the "it" ran the search for this task, and Enter submitted that as a linked PR for this issue. I cannot see how to remove the link between the two.

@alxp
Author

alxp commented Oct 13, 2022

I set up ISLE-DC on my home Mac and followed the testing steps. I needed to select a value for Resource Type but otherwise everything was the same.

The hOCR file field is not getting populated, and I see some timeout errors in the Hypercube log which look strange to me:

isle-dc-hypercube-1  | 2022/10/13 02:20:28 [error] 837#837: *375 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.20.0.14, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm7/php-fpm7.sock", host: "hypercube:8000"
isle-dc-hypercube-1  | 172.20.0.14 - - [13/Oct/2022:02:20:28 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1  | [13-Oct-2022 02:20:40] WARNING: [pool www] child 29644, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (72.300542 sec), terminating
isle-dc-hypercube-1  | [13-Oct-2022 02:20:40] WARNING: [pool www] child 29644 exited on signal 15 (SIGTERM) after 180.010716 seconds from start
isle-dc-hypercube-1  | [2022-10-13 02:19:28] app.INFO: Matched route "{route}". {"route":"GET_","route_parameters":{"_controller":"hypercube.controller:get","_route":"GET_"},"request_uri":"http://hypercube:8000/","method":"GET"} []
isle-dc-hypercube-1  | [2022-10-13 02:19:28] app.INFO: Guard authentication successful! {"token":"[object] (Symfony\\Component\\Security\\Guard\\Token\\PostAuthenticationGuardToken: PostAuthenticationGuardToken(user=\"admin\", authenticated=true, roles=\"authenticated, administrator, fedoraadmin\"))","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
isle-dc-hypercube-1  | [13-Oct-2022 02:20:40] NOTICE: [pool www] child 30377 started
  isle-dc-hypercube-1  | 172.20.0.14 - - [13/Oct/2022:02:21:29 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1  | 2022/10/13 02:21:29 [error] 837#837: *377 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.20.0.14, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm7/php-fpm7.sock", host: "hypercube:8000"
isle-dc-hypercube-1  | [2022-10-13 02:21:30] app.INFO: Matched route "{route}". {"route":"GET_","route_parameters":{"_controller":"hypercube.controller:get","_route":"GET_"},"request_uri":"http://hypercube:8000/","method":"GET"} []
isle-dc-hypercube-1  | [2022-10-13 02:21:30] app.INFO: Guard authentication successful! {"token":"[object] (Symfony\\Component\\Security\\Guard\\Token\\PostAuthenticationGuardToken: PostAuthenticationGuardToken(user=\"admin\", authenticated=true, roles=\"authenticated, administrator, fedoraadmin\"))","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
isle-dc-hypercube-1  | [13-Oct-2022 02:21:40] WARNING: [pool www] child 29890, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (71.248324 sec), terminating
isle-dc-hypercube-1  | [13-Oct-2022 02:21:40] WARNING: [pool www] child 29890 exited on signal 15 (SIGTERM) after 180.029522 seconds from start
isle-dc-hypercube-1  | [2022-10-13 02:20:29] app.INFO: Matched route "{route}". {"route":"GET_","route_parameters":{"_controller":"hypercube.controller:get","_route":"GET_"},"request_uri":"http://hypercube:8000/","method":"GET"} []
isle-dc-hypercube-1  | [2022-10-13 02:20:29] app.INFO: Guard authentication successful! {"token":"[object] (Symfony\\Component\\Security\\Guard\\Token\\PostAuthenticationGuardToken: PostAuthenticationGuardToken(user=\"admin\", authenticated=true, roles=\"authenticated, administrator, fedoraadmin\"))","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
isle-dc-hypercube-1  | [13-Oct-2022 02:21:40] NOTICE: [pool www] child 30621 started
 isle-dc-hypercube-1  | 2022/10/13 02:22:30 [error] 837#837: *379 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.20.0.14, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm7/php-fpm7.sock", host: "hypercube:8000"
isle-dc-hypercube-1  | 172.20.0.14 - - [13/Oct/2022:02:22:30 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1  | [2022-10-13 02:22:31] app.INFO: Matched route "{route}". {"route":"GET_","route_parameters":{"_controller":"hypercube.controller:get","_route":"GET_"},"request_uri":"http://hypercube:8000/","method":"GET"} []
isle-dc-hypercube-1  | [2022-10-13 02:22:31] app.INFO: Guard authentication successful! {"token":"[object] (Symfony\\Component\\Security\\Guard\\Token\\PostAuthenticationGuardToken: PostAuthenticationGuardToken(user=\"admin\", authenticated=true, roles=\"authenticated, administrator, fedoraadmin\"))","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
isle-dc-hypercube-1  | [13-Oct-2022 02:22:40] WARNING: [pool www] child 30134, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (70.172754 sec), terminating
isle-dc-hypercube-1  | [13-Oct-2022 02:22:40] WARNING: [pool www] child 30134 exited on signal 15 (SIGTERM) after 180.044018 seconds from start
isle-dc-hypercube-1  | [13-Oct-2022 02:22:40] NOTICE: [pool www] child 30867 started
isle-dc-hypercube-1  | 2022/10/13 02:23:31 [error] 837#837: *381 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.20.0.14, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm7/php-fpm7.sock", host: "hypercube:8000"
isle-dc-hypercube-1  | 172.20.0.14 - - [13/Oct/2022:02:23:31 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1  | [2022-10-13 02:23:32] app.INFO: Matched route "{route}". {"route":"GET_","route_parameters":{"_controller":"hypercube.controller:get","_route":"GET_"},"request_uri":"http://hypercube:8000/","method":"GET"} []
isle-dc-hypercube-1  | [2022-10-13 02:23:32] app.INFO: Guard authentication successful! {"token":"[object] (Symfony\\Component\\Security\\Guard\\Token\\PostAuthenticationGuardToken: PostAuthenticationGuardToken(user=\"admin\", authenticated=true, roles=\"authenticated, administrator, fedoraadmin\"))","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
isle-dc-hypercube-1  | [13-Oct-2022 02:23:40] WARNING: [pool www] child 30377, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (69.114683 sec), terminating
isle-dc-hypercube-1  | [13-Oct-2022 02:23:40] WARNING: [pool www] child 30377 exited on signal 15 (SIGTERM) after 180.028184 seconds from start
isle-dc-hypercube-1  | [13-Oct-2022 02:23:40] NOTICE: [pool www] child 31111 started
isle-dc-hypercube-1  | 2022/10/13 02:24:32 [error] 837#837: *383 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.20.0.14, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm7/php-fpm7.sock", host: "hypercube:8000"
isle-dc-hypercube-1  | 172.20.0.14 - - [13/Oct/2022:02:24:32 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1  | [13-Oct-2022 02:24:40] WARNING: [pool www] child 30621, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (68.080092 sec), terminating
isle-dc-hypercube-1  | [13-Oct-2022 02:24:40] WARNING: [pool www] child 30621 exited on signal 15 (SIGTERM) after 180.048519 seconds from start
isle-dc-hypercube-1  | [13-Oct-2022 02:24:40] NOTICE: [pool www] child 31355 started
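
A log like this is easier to read once it is tallied. The following Python sketch is purely illustrative and not part of the PR: it counts HTTP status codes from the access-log lines and collects the PHP-FPM "execution timed out" durations. The regular expressions are assumptions based only on the line shapes visible above, and the sample is three lines copied from this log.

```python
import re
from collections import Counter

# Matches the status code that follows the quoted request in an access-log line.
STATUS = re.compile(r'"[A-Z]+ \S+ HTTP/[\d.]+" (\d{3}) ')
# Matches the duration reported when PHP-FPM kills a slow script.
FPM_TIMEOUT = re.compile(r"execution timed out \(([\d.]+) sec\)")

SAMPLE_LOG = r'''isle-dc-hypercube-1  | 2022/10/13 02:20:28 [error] 837#837: *375 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.20.0.14, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm7/php-fpm7.sock", host: "hypercube:8000"
isle-dc-hypercube-1  | 172.20.0.14 - - [13/Oct/2022:02:20:28 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1  | [13-Oct-2022 02:20:40] WARNING: [pool www] child 29644, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (72.300542 sec), terminating'''


def summarize(lines):
    """Return (Counter of HTTP statuses, list of PHP-FPM timeout durations)."""
    statuses = Counter()
    timeouts = []
    for line in lines:
        status = STATUS.search(line)
        if status:
            statuses[status.group(1)] += 1
        timeout = FPM_TIMEOUT.search(line)
        if timeout:
            timeouts.append(float(timeout.group(1)))
    return statuses, timeouts


statuses, timeouts = summarize(SAMPLE_LOG.splitlines())
print(dict(statuses), timeouts)
```

Repeated 504s paired with PHP-FPM terminations would suggest the request is being retried against a worker that never finishes, rather than failing on a single bad call.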

Will look further

@alxp
Author

alxp commented Oct 13, 2022

Just back at my work computer and confirming that ISLE-DC does generate hOCR file field derivatives.

Attaching an exported site config so I can test the exact Drupal config when I get home again.

islandora-hocr-isle-demo-working-sync.zip

@alxp
Author

alxp commented Oct 18, 2022

Hi @wgilling ,

Figuring out why your attempts weren't working was productive: it has uncovered two bugs so far, #903 and Islandora-Devops/isle-dc#298.

If you or anyone else has some time, I've edited the testing instructions so they don't trigger the problems noted in those bug reports, which are unrelated to this PR.

TL;DR: ISLE-DC doesn't support JP2 for Hypercube, and ISLE seems to kick off an endless loop if you check both Original File and Service File for a File so just click 'Original File.'

Replicated this as working on my home and work Macs with latest ISLE-DC running make starter_dev.

@wgilling wgilling left a comment

Yay!! 🎉 - now that I have removed the additional media use, I see the extracted text on my TIFF file media.

image

I'd say this can be merged.
