Add hOCR option to Text Extraction Media Attachment action and IIIF Manifest #897
I will test this before the end of the week. Thanks! :) |
I brought the code into an isle-dc project, got through step 11, and ran into an error making a repository item. I am certain that the error was not due to this PR, because I saw the same error when trying to edit an existing node before I made the file's media field, context rule, etc. I will sort this error out after today's I8 call. For the record, the error I got was on create or edit of a Repository Item node:
|
I did finally get around to testing this again (it was blocked by an unrelated bug in my isle code that was preventing me from making any new content). I might have something misconfigured, but the TIFF file I used did not yield an hOCR derivative. I'll repeat all of the steps again and follow up this afternoon. |
Thanks, I will double check the steps
|
Hi @wgilling , I set up a fresh ISLE-DC with make demo and went through the above steps. The one thing I missed was that you should check the media use checkboxes when adding media. I checked 'Original File' and 'Service File', uploaded a JP2, and the hOCR file field was populated. Hope this was the missing step for you to be able to reproduce it 🙏🏼 |
I did finally get around to testing this again (it was blocked by an unrelated bug in my isle code that was preventing me from making any new content). The TIFF file I used did not yield an hOCR derivative. I repeated the steps and had the same result -- no hOCR derivative is ever created for me. I even used the node's media screen to trigger that system action. I saw no errors in my watchdog log, but perhaps there is a clue in the Hypercube log -- when I run the action from the node's media screen, I only get this:
|
Thanks for trying again. Can you send me the TIFF you are using? I converted a JP2 in my test data to a TIFF to test and it works the same as the JP2. One thing that might be a clue is that out of the box and with the above testing instructions you should still get the unrelated Extracted Text media generated which should have the plain text OCR. Is this being generated? Here's the file I used: |
Neither derivative is created with my TIF. I'd attach it, but it is 28 MB in size. I will try the TIF file that you have zipped up. |
Ok -- I double-checked a couple things and then it occurred to me that sometimes the entire docker stack needs to reset - after running |
That action looks correct. The Tesseract log above does not look like the event was ever fired for hOCR - the tesseract command line should have an option to generate hOCR, like this from my log: {"cmd":"tesseract stdin stdout -c tessedit_create_hocr=1 -c hocr_font_info=0"} [] Can you post the config page for the context reaction for Page Derivatives? It should have a second set of derivatives in the left sidebar near the bottom, since the reaction is not in the normal Derivatives group: |
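A quick way to check a log for this is sketched below. It is only an illustration: it assumes the command is logged as a JSON object with a "cmd" key, as in the line quoted above, and the function name `hocr_requested` and the second sample line are made up for the example.

```python
import json
import re

# Matches the JSON context object carrying the command, e.g. {"cmd":"tesseract ..."}
CMD_PATTERN = re.compile(r'\{"cmd":"(?:[^"\\]|\\.)*"\}')


def hocr_requested(log_line: str) -> bool:
    """Return True if the line logs a tesseract command that asks for hOCR output."""
    match = CMD_PATTERN.search(log_line)
    if match is None:
        return False
    cmd = json.loads(match.group(0))["cmd"]
    return cmd.startswith("tesseract") and "tessedit_create_hocr=1" in cmd


# Log fragment quoted in the comment above.
sample = '{"cmd":"tesseract stdin stdout -c tessedit_create_hocr=1 -c hocr_font_info=0"} []'
# Hypothetical plain-text OCR line, invented here for contrast.
plain = '{"cmd":"tesseract stdin stdout"} []'

print(hocr_requested(sample))  # True
print(hocr_requested(plain))   # False
```

If no line in the Hypercube log passes this check, the hOCR derivative event most likely never reached Hypercube, which points at the Drupal context configuration rather than at Tesseract.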
Hm I don't have the Media Type context condition, what is in there? |
I hope that somebody else can test this because it seems like this should be working - especially since the OCR derivative is created. Could it be the version of tesseract I am running?
|
It was that the media type is not "Remote Video" -- I removed that condition after I saw your screenshot and ran it again but still did not get an hOCR. |
I did not intend to link this to the "[TECH DEBT] Revisit why we are using Features for Islandora Core Feature #902" ticket... I was in my terminal and typed "exit", but the browser had focus: the "x" key opened the Development dialog, the "it" searched for this task, and Enter submitted that as a linked PR for this issue. I cannot see how to remove the link between these two. |
I set up ISLE-DC on my home Mac and followed the testing steps. I needed to select a value for Resource Type, but otherwise everything was the same. The hOCR file field is not getting populated, and I see some timeout errors in the Hypercube log, which look strange to me:
isle-dc-hypercube-1 | 2022/10/13 02:20:28 [error] 837#837: *375 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.20.0.14, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm7/php-fpm7.sock", host: "hypercube:8000"
isle-dc-hypercube-1 | 172.20.0.14 - - [13/Oct/2022:02:20:28 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1 | [13-Oct-2022 02:20:40] WARNING: [pool www] child 29644, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (72.300542 sec), terminating
isle-dc-hypercube-1 | [13-Oct-2022 02:20:40] WARNING: [pool www] child 29644 exited on signal 15 (SIGTERM) after 180.010716 seconds from start
isle-dc-hypercube-1 | [2022-10-13 02:19:28] app.INFO: Matched route "{route}". {"route":"GET_","route_parameters":{"_controller":"hypercube.controller:get","_route":"GET_"},"request_uri":"http://hypercube:8000/","method":"GET"} []
isle-dc-hypercube-1 | [2022-10-13 02:19:28] app.INFO: Guard authentication successful! {"token":"[object] (Symfony\\Component\\Security\\Guard\\Token\\PostAuthenticationGuardToken: PostAuthenticationGuardToken(user=\"admin\", authenticated=true, roles=\"authenticated, administrator, fedoraadmin\"))","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
isle-dc-hypercube-1 | [13-Oct-2022 02:20:40] NOTICE: [pool www] child 30377 started
isle-dc-hypercube-1 | 172.20.0.14 - - [13/Oct/2022:02:21:29 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1 | 2022/10/13 02:21:29 [error] 837#837: *377 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.20.0.14, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm7/php-fpm7.sock", host: "hypercube:8000"
isle-dc-hypercube-1 | [2022-10-13 02:21:30] app.INFO: Matched route "{route}". {"route":"GET_","route_parameters":{"_controller":"hypercube.controller:get","_route":"GET_"},"request_uri":"http://hypercube:8000/","method":"GET"} []
isle-dc-hypercube-1 | [2022-10-13 02:21:30] app.INFO: Guard authentication successful! {"token":"[object] (Symfony\\Component\\Security\\Guard\\Token\\PostAuthenticationGuardToken: PostAuthenticationGuardToken(user=\"admin\", authenticated=true, roles=\"authenticated, administrator, fedoraadmin\"))","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
isle-dc-hypercube-1 | [13-Oct-2022 02:21:40] WARNING: [pool www] child 29890, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (71.248324 sec), terminating
isle-dc-hypercube-1 | [13-Oct-2022 02:21:40] WARNING: [pool www] child 29890 exited on signal 15 (SIGTERM) after 180.029522 seconds from start
isle-dc-hypercube-1 | [2022-10-13 02:20:29] app.INFO: Matched route "{route}". {"route":"GET_","route_parameters":{"_controller":"hypercube.controller:get","_route":"GET_"},"request_uri":"http://hypercube:8000/","method":"GET"} []
isle-dc-hypercube-1 | [2022-10-13 02:20:29] app.INFO: Guard authentication successful! {"token":"[object] (Symfony\\Component\\Security\\Guard\\Token\\PostAuthenticationGuardToken: PostAuthenticationGuardToken(user=\"admin\", authenticated=true, roles=\"authenticated, administrator, fedoraadmin\"))","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
isle-dc-hypercube-1 | [13-Oct-2022 02:21:40] NOTICE: [pool www] child 30621 started
isle-dc-hypercube-1 | 2022/10/13 02:22:30 [error] 837#837: *379 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.20.0.14, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm7/php-fpm7.sock", host: "hypercube:8000"
isle-dc-hypercube-1 | 172.20.0.14 - - [13/Oct/2022:02:22:30 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1 | [2022-10-13 02:22:31] app.INFO: Matched route "{route}". {"route":"GET_","route_parameters":{"_controller":"hypercube.controller:get","_route":"GET_"},"request_uri":"http://hypercube:8000/","method":"GET"} []
isle-dc-hypercube-1 | [2022-10-13 02:22:31] app.INFO: Guard authentication successful! {"token":"[object] (Symfony\\Component\\Security\\Guard\\Token\\PostAuthenticationGuardToken: PostAuthenticationGuardToken(user=\"admin\", authenticated=true, roles=\"authenticated, administrator, fedoraadmin\"))","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
isle-dc-hypercube-1 | [13-Oct-2022 02:22:40] WARNING: [pool www] child 30134, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (70.172754 sec), terminating
isle-dc-hypercube-1 | [13-Oct-2022 02:22:40] WARNING: [pool www] child 30134 exited on signal 15 (SIGTERM) after 180.044018 seconds from start
isle-dc-hypercube-1 | [13-Oct-2022 02:22:40] NOTICE: [pool www] child 30867 started
isle-dc-hypercube-1 | 2022/10/13 02:23:31 [error] 837#837: *381 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.20.0.14, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm7/php-fpm7.sock", host: "hypercube:8000"
isle-dc-hypercube-1 | 172.20.0.14 - - [13/Oct/2022:02:23:31 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1 | [2022-10-13 02:23:32] app.INFO: Matched route "{route}". {"route":"GET_","route_parameters":{"_controller":"hypercube.controller:get","_route":"GET_"},"request_uri":"http://hypercube:8000/","method":"GET"} []
isle-dc-hypercube-1 | [2022-10-13 02:23:32] app.INFO: Guard authentication successful! {"token":"[object] (Symfony\\Component\\Security\\Guard\\Token\\PostAuthenticationGuardToken: PostAuthenticationGuardToken(user=\"admin\", authenticated=true, roles=\"authenticated, administrator, fedoraadmin\"))","authenticator":"Islandora\\Crayfish\\Commons\\Syn\\JwtAuthenticator"} []
isle-dc-hypercube-1 | [13-Oct-2022 02:23:40] WARNING: [pool www] child 30377, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (69.114683 sec), terminating
isle-dc-hypercube-1 | [13-Oct-2022 02:23:40] WARNING: [pool www] child 30377 exited on signal 15 (SIGTERM) after 180.028184 seconds from start
isle-dc-hypercube-1 | [13-Oct-2022 02:23:40] NOTICE: [pool www] child 31111 started
isle-dc-hypercube-1 | 2022/10/13 02:24:32 [error] 837#837: *383 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.20.0.14, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm7/php-fpm7.sock", host: "hypercube:8000"
isle-dc-hypercube-1 | 172.20.0.14 - - [13/Oct/2022:02:24:32 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1 | [13-Oct-2022 02:24:40] WARNING: [pool www] child 30621, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (68.080092 sec), terminating
isle-dc-hypercube-1 | [13-Oct-2022 02:24:40] WARNING: [pool www] child 30621 exited on signal 15 (SIGTERM) after 180.048519 seconds from start
isle-dc-hypercube-1 | [13-Oct-2022 02:24:40] NOTICE: [pool www] child 31355 started
I will look further |
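For anyone else comparing logs, here is a rough sketch for summarising output like the above. It assumes the same nginx access-log and php-fpm warning formats shown in this comment; the function name `summarize` is illustrative, and the two sample lines are copied from the log.

```python
import re

# nginx access-log entries that ended in a 504 for the Hypercube root route
GATEWAY_TIMEOUT = re.compile(r'"GET / HTTP/1\.1" 504 ')
# php-fpm warnings emitted when a worker is terminated for running too long
FPM_TIMEOUT = re.compile(
    r"child (\d+), script '([^']+)'.*execution timed out \(([\d.]+) sec\)"
)


def summarize(lines):
    """Return (number of 504 responses, number of terminated workers, longest run in seconds)."""
    http_504 = 0
    terminated = 0
    slowest = 0.0
    for line in lines:
        if GATEWAY_TIMEOUT.search(line):
            http_504 += 1
        match = FPM_TIMEOUT.search(line)
        if match:
            terminated += 1
            slowest = max(slowest, float(match.group(3)))
    return http_504, terminated, slowest


sample_log = """\
isle-dc-hypercube-1 | 172.20.0.14 - - [13/Oct/2022:02:20:28 +0000] "GET / HTTP/1.1" 504 160 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_322)" "-"
isle-dc-hypercube-1 | [13-Oct-2022 02:20:40] WARNING: [pool www] child 29644, script '/var/www/crayfish/Hypercube/src/index.php' (request: "GET /index.php") execution timed out (72.300542 sec), terminating
"""

print(summarize(sample_log.splitlines()))  # (1, 1, 72.300542)
```

A 504 count that keeps growing at a steady interval, as in the log above, suggests the same request is being retried rather than many different files failing.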
Just back at my work computer and confirming that ISLE-DC does generate hOCR file field derivatives. Attaching an exported site config so I can test the exact Drupal config when I get home again. |
Hi @wgilling , Figuring out why your attempts weren't working was productive: it uncovered two bugs so far, #903 and Islandora-Devops/isle-dc#298. If you or anyone else has some time, I've edited the testing instructions so they don't cause the problems noted in the bug reports, which are unrelated to this PR. TL;DR: ISLE-DC doesn't support JP2 for Hypercube, and ISLE seems to kick off an endless loop if you check both Original File and Service File for a File, so just check 'Original File.' Replicated this as working on my home and work Macs with the latest ISLE-DC running make starter_dev. |
GitHub Issue:
Islandora/documentation#1580
What does this Pull Request do?
Adds an option to generate hOCR derivatives to the text extraction file field derivative action.
What's new?
The action type "Generate hOCR Extracted Text for Media Attachment" now has a drop down to select Plain Text or hOCR
Also added a select box for the field on the media to store resulting hOCR.
IIIF manifest is now populated with hOCR stream locations if they exist, in a seeAlso section as per https://github.com/dbmdz/mirador-textoverlay
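As a rough illustration of the manifest change, the sketch below attaches a seeAlso entry to a canvas. It is not the module's implementation: it assumes a IIIF Presentation 2 style canvas, the `format` and `profile` values are recalled from the mirador-textoverlay README and should be verified there, and the function name and all URLs are placeholders.

```python
import json

# Assumed values, recalled from the mirador-textoverlay README; verify before relying on them.
HOCR_FORMAT = "text/vnd.hocr+html"
HOCR_PROFILE = "http://kba.cloud/hocr-spec/1.2/"


def add_hocr_see_also(canvas: dict, hocr_url: str) -> dict:
    """Return a copy of a canvas with a seeAlso entry pointing at its hOCR file."""
    updated = dict(canvas)  # shallow copy so the caller's canvas is left unchanged
    updated["seeAlso"] = {
        "@id": hocr_url,
        "format": HOCR_FORMAT,
        "profile": HOCR_PROFILE,
    }
    return updated


# Placeholder canvas and file URL, for illustration only.
canvas = {
    "@id": "https://example.org/iiif/canvas/1",
    "@type": "sc:Canvas",
    "label": "Page 1",
}
result = add_hocr_see_also(canvas, "https://example.org/files/page1.hocr")
print(json.dumps(result, indent=2))
```

A canvas with no hOCR file would simply be left without a seeAlso entry, matching the "if they exist" wording above.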
(i.e. Regeneration activity, etc.)? No
How should this be tested?
Test on either playbook or ISLE. For ISLE I've just successfully tested using a fresh site built with
make starter_dev
Testing the Manifest additions:
Documentation Status
Additional Notes:
Any additional information that you think would be helpful when reviewing this PR.
Working on supporting Mirador Text Overlay and eventually search results highlighting. This supports that initiative.
Interested parties
Tag (@ mention) interested parties or, if unsure, @Islandora/8-x-committers