Ingesting HOCR derivatives as a media attachment #592

Open
dmer opened this issue Mar 31, 2023 · 2 comments

dmer commented Mar 31, 2023

This work supports the plans to provide search term highlighting in Mirador, started by @alxp in Islandora/islandora#897

And continued by @patdunlavey here:
Islandora/islandora_mirador#17 (comment)

This would function like the OCR, where the contents of the extracted text file are copied into a text field on the media (original_file) for indexing purposes.

USE CASE: I have an Islandora 7 repository with a very large amount of textual content in TIF file format. Each page (TIF) has an associated HOCR file. I want to migrate the pages WITH their HOCR into Islandora.

I'd like to be able to batch in the HOCR files (either as part of the node-creating CSV or as an add_media job) and have them attached to the appropriate file field on the media object.

Ideally I could pull these HOCR files directly from the Islandora 7 datastream with a URL, like I do for the OBJ (TIF) files.
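
To make the ask more concrete, here is a rough sketch of how the input CSV for such a job might be generated. It is only an illustration under several assumptions: the `hocr_file` column name is hypothetical (it is what this issue is asking for, not something Workbench is known to support), the helper names and `example.org` host are made up, and the `/islandora/object/{PID}/datastream/{DSID}/view` URL pattern is assumed to apply to the source Islandora 7 site.

```python
import csv
import io
from urllib.parse import quote


def datastream_url(base_url: str, pid: str, dsid: str) -> str:
    """Build an Islandora 7 datastream URL (assumed '/view' pattern)."""
    # Percent-encode the PID so the colon in e.g. 'islandora:1' is URL-safe.
    return f"{base_url.rstrip('/')}/islandora/object/{quote(pid, safe='')}/datastream/{dsid}/view"


def build_rows(base_url: str, pages: list[tuple[str, str]]) -> list[dict[str, str]]:
    """Pair each target node ID with the OBJ and HOCR URLs of its source page."""
    rows = []
    for node_id, pid in pages:
        rows.append({
            "node_id": node_id,
            "file": datastream_url(base_url, pid, "OBJ"),
            # Hypothetical column: the feature requested in this issue.
            "hocr_file": datastream_url(base_url, pid, "HOCR"),
        })
    return rows


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["node_id", "file", "hocr_file"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical node IDs and PIDs, for illustration only.
    print(to_csv(build_rows("https://example.org", [("101", "islandora:1")])))
```

Each row would then carry both the TIF and its HOCR, so the two files could be attached to the same media in one pass.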

Hopefully this is a clear definition of the ask - I'm happy to answer questions or add more details if requested.

mjordan (Owner) commented Mar 31, 2023

Related - #572.

dmer (Author) commented Apr 13, 2023

Hi Mark - just checking in on this. I'm expecting to need this within a month or so. I can definitely volunteer some testing.
