Add Wynn, Eth, and Ash to Middle English script so it can also be used for Old English (Latin) #298

grantbarrett · 2022-09-02T19:23:29Z

One of the holes in Tesseract's ability to do quality OCR on historical texts is that just three missing characters prevent it from reasonably handling Old English texts written in Latin characters.

If you compare the characters available in the Tesseract trained data for Middle English with the Old English Latin alphabet, you'll see that "Æ æ" (ash), "Ð ð" (eth), and "Ƿ ƿ" (wynn) are omitted.

https://github.com/tesseract-ocr/langdata_lstm/blob/main/enm/enm.unicharset
https://en.wikipedia.org/wiki/Old_English_Latin_alphabet

https://en.wikipedia.org/wiki/Eth
https://en.wikipedia.org/wiki/Wynn
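The comparison described above can be sketched in a few lines of Python. This is a rough sketch, not part of Tesseract: it assumes a unicharset file starts with an entry count and that every later line begins with the symbol followed by space-separated properties. The names `TARGETS`, `unicharset_symbols`, `missing_letters`, and the tiny `SAMPLE` are made up for illustration, and the sample is far simpler than the real enm.unicharset.

```python
# Letters the issue reports as missing, in upper and lower case.
TARGETS = {
    "ash": ("Æ", "æ"),
    "eth": ("Ð", "ð"),
    "wynn": ("Ƿ", "ƿ"),
}


def unicharset_symbols(text):
    """Return the set of symbols listed in unicharset-style text.

    Assumption: the first line is an entry count and each later line
    starts with the symbol, followed by space-separated properties.
    """
    lines = text.splitlines()[1:]  # skip the count line
    return {line.split(" ", 1)[0] for line in lines if line.strip()}


def missing_letters(text):
    """Map each target letter name to the forms absent from the text."""
    symbols = unicharset_symbols(text)
    result = {}
    for name, forms in TARGETS.items():
        absent = [ch for ch in forms if ch not in symbols]
        if absent:
            result[name] = absent
    return result


# Hypothetical, simplified sample; not the real enm.unicharset contents.
SAMPLE = "4\nNULL 0 Common 0\na 3 Latin 1\nþ 3 Latin 2\næ 3 Latin 3\n"

if __name__ == "__main__":
    print(missing_letters(SAMPLE))
```

To try it on a real file, download the unicharset linked above and pass its contents in, for example `missing_letters(open("enm.unicharset", encoding="utf-8").read())`.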

Admittedly, these three will require quite a bit of training to distinguish them from the AE digraph, D d, and P p, respectively, but of course, that's what we do here!

As you can see on their respective Wikipedia pages, we may already have trained data for ash and eth in other languages (Danish and Norwegian for ash; Icelandic, Faroese, and Khmer for eth), but there are other letter forms that may need to be accounted for, especially for wynn.

If we were able to make these changes, then we could rename the Middle English trained data to indicate that it covers Middle and Old English (Latin), differentiating it clearly from the "enm" three-letter code, especially for those who associate "Old English" primarily with blackletter script, which this trained data would not be suitable to handle. (Blackletter OCR can be handled by the tools at https://emop.tamu.edu/, although they are for older versions of Tesseract.)

stweil · 2022-11-25T16:12:41Z

It is possible to enhance the existing model with those additional glyphs. The original training was done with artificial training data, but I think that you will get better results with transcribed scans from historic books.
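Before fine-tuning on transcribed scans, it may help to check how often the new glyphs actually occur in the transcriptions, since a glyph that barely appears is unlikely to be learned well. The following is only a sketch under that assumption: `TARGET_GLYPHS`, `glyph_coverage`, and the sample lines are hypothetical and not taken from any existing training set.

```python
# Glyphs proposed in this issue: ash, eth, and wynn in both cases.
TARGET_GLYPHS = "ÆæÐðǷƿ"


def glyph_coverage(lines):
    """Count how many times each target glyph occurs across the lines."""
    counts = {glyph: 0 for glyph in TARGET_GLYPHS}
    for line in lines:
        for ch in line:
            if ch in counts:
                counts[ch] += 1
    return counts


# Made-up transcription lines, for illustration only.
SAMPLE_LINES = [
    "Hƿæt ƿe Gardena",
    "in geardagum þeodcyninga",
    "ðæt ƿæs god cyning",
]

if __name__ == "__main__":
    for glyph, count in glyph_coverage(SAMPLE_LINES).items():
        print(glyph, count)
```

Glyphs that come back with a count of zero (the capitals, in this sample) would need more transcribed lines before training.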

stweil added the enhancement label Nov 25, 2022
