One of the holes in Tesseract's ability to do quality OCR on historical texts is that it's missing just three characters, and their absence prevents it from reasonably handling Old English texts written in Latin characters.
If you compare the characters available in the Tesseract trained data for Middle English with the character set of the Old English Latin alphabet, you'll see the omissions "Æ æ" (ash), "Ð ð" (eth), and "Ƿ ƿ" (wynn).
Admittedly these three will require quite a bit of training to distinguish them from an AE digraph, D d, and P p, respectively, but of course, that's what we do here!
As you can see on their respective Wikipedia pages, we may already have trained data for ash and eth in other languages (Danish and Norwegian, and Icelandic, Faroese, and Khmer, respectively), but there are other letter forms that may need to be accounted for, especially for wynn.
If we were able to make these changes, we could rename the Middle English trained data so that it covers both Middle and Old English (Latin), clearly differentiating it from the current "enm" three-letter code. That would especially help those who associate "Old English" primarily with blackletter script, which this trained data would not be suitable to handle. (Blackletter OCR can be handled by the tools at https://emop.tamu.edu/, although they target older versions of Tesseract.)
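For anyone who wants to reproduce the character-set comparison described above, here is a minimal Python sketch. It assumes the unicharset file has an entry count on its first line and that every later line begins with the character followed by a space; the function names and the `SAMPLE` text are made up for illustration and are not the real enm.unicharset.

```python
# Characters reported as missing in this issue: ash, eth, and wynn (both cases).
REQUIRED = ["Æ", "æ", "Ð", "ð", "Ƿ", "ƿ"]


def parse_unicharset(text):
    """Collect the characters listed in unicharset-style text.

    Assumption: line 1 is an entry count, and each later line starts
    with the character itself, followed by a space and its properties.
    """
    lines = text.splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def missing_chars(unicharset_text, required=REQUIRED):
    """Return the required characters that the unicharset does not list."""
    present = parse_unicharset(unicharset_text)
    return [ch for ch in required if ch not in present]


# Hypothetical sample for demonstration only, not the real file contents.
SAMPLE = "4\nNULL 0 ...\na 3 ...\nþ 3 ...\næ 3 ...\n"

if __name__ == "__main__":
    print(missing_chars(SAMPLE))
```

To check the real file, read a local copy of enm.unicharset with `encoding="utf-8"` and pass its text to `missing_chars`.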
It is possible to enhance the existing model with those additional glyphs. The original training was done with artificial training data, but I think that you will get better results with transcribed scans from historic books.
https://github.com/tesseract-ocr/langdata_lstm/blob/main/enm/enm.unicharset
https://en.wikipedia.org/wiki/Old_English_Latin_alphabet
https://en.wikipedia.org/wiki/Eth
https://en.wikipedia.org/wiki/Wynn