How to Add or Edit [script].unicharset in langdata folder? #99

Open
sethleech opened this issue Oct 24, 2017 · 3 comments

@sethleech

How to Add or Edit [script].unicharset in langdata folder?

  • I want to know how to get the 'glyph_metrics' data from a font (or from several fonts).

Dear all,

I have been trying Tesseract recently and it is a very good product. Is there a tutorial, or a set of steps, on how to add to or edit a [script].unicharset file, for example han.unicharset?

I want to add the missing characters from CJK Unified Ideographs Extensions B, C, D, E and F:
CJK Unified Ideographs Extension B: U+20000–U+2A6D6
CJK Unified Ideographs Extension C: U+2A700–U+2B734
CJK Unified Ideographs Extension D: U+2B740–U+2B81D
CJK Unified Ideographs Extension E: U+2B820–U+2CEA1
CJK Unified Ideographs Extension F: U+2CEB0–U+2EBE0
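
To make the ranges above concrete, here is a minimal Python sketch that checks which extension a character falls in and lists the extension characters that a given set of known characters lacks. The names `CJK_EXTENSIONS`, `cjk_extension` and `missing_extension_chars` are made up for this illustration; the ranges are copied from the list above and not from any Tesseract file.

```python
# Ranges exactly as listed in the post (inclusive on both ends).
CJK_EXTENSIONS = {
    "B": (0x20000, 0x2A6D6),
    "C": (0x2A700, 0x2B734),
    "D": (0x2B740, 0x2B81D),
    "E": (0x2B820, 0x2CEA1),
    "F": (0x2CEB0, 0x2EBE0),
}


def cjk_extension(ch):
    """Return the extension letter for a single character, or None."""
    cp = ord(ch)
    for name, (lo, hi) in CJK_EXTENSIONS.items():
        if lo <= cp <= hi:
            return name
    return None


def missing_extension_chars(text, known):
    """Characters of `text` that lie in Extensions B-F and are not in `known`."""
    return sorted({ch for ch in text if cjk_extension(ch) and ch not in known})


# U+25B97 is the character used in the examples in this post.
print(cjk_extension("\U00025B97"))
```

`missing_extension_chars` could be fed the training text and the set of characters already present in han.unicharset to see which entries would have to be added.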

For reference, this is what I tried when training Tesseract.

First attempt:
** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box

Output (unicharset):
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights

Warning: properties incomplete for index 4 = 𥮗

Output ([lang].unicharset):
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
=> not changed
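
A quick way to spot entries like this one is to scan the unicharset for lines that still carry the metrics and script seen in the output above. This is only a heuristic sketch in Python based on that output: `PLACEHOLDER_METRICS` and `placeholder_entries` are hypothetical names, and this is not the check Tesseract itself uses to print the "properties incomplete" warning.

```python
# Metrics string as it appears in the unchanged entry above.
PLACEHOLDER_METRICS = "0,255,0,255,0,0,0,0,0,0"


def placeholder_entries(lines):
    """Return the characters whose entry has the placeholder metrics and script NULL."""
    flagged = []
    for line in lines:
        fields = line.split()
        # Expected layout: character, properties, glyph_metrics, script, ...
        if len(fields) >= 4 and fields[2] == PLACEHOLDER_METRICS and fields[3] == "NULL":
            flagged.append(fields[0])
    return flagged


sample = [
    "\U00025B97 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # \U00025B97 [25b97 ]x",
    "\U00025B97 1 61,64,255,255,188,200,6,11,205,224 Han 4 0 4 \U00025B97 # \U00025B97 [25b97 ]x",
]
print(placeholder_entries(sample))
```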

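Second attempt:
I edited the file langdata/han.unicharset:
  • First line (the entry count): 23514 -> 23515
  • Appended a new line at the end of the file: 𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 23514 0 23514 𥮗 # 𥮗 [25b97 ]x
  • The metrics 61,64,255,255,188,200,6,11,205,224 were simply copied from another, arbitrary line (e.g. line 67), so they are not measured values for this character.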
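
The manual edit just described can be sketched in Python as follows. It assumes, as in the edit, that the first line of the file holds the entry count and that the new entry goes at the end; `append_entry` and `append_entry_to_file` are hypothetical helper names, not Tesseract tools, and the entry line itself (including the ids inside it) still has to be written by hand.

```python
from pathlib import Path


def append_entry(lines, entry):
    """Return new unicharset lines with `entry` appended and the count bumped.

    Assumes lines[0] is the entry count, as in the manual edit described above.
    """
    count = int(lines[0].strip())
    return [str(count + 1)] + list(lines[1:]) + [entry]


def append_entry_to_file(path, entry):
    """Apply append_entry to a unicharset file in place (UTF-8)."""
    p = Path(path)
    lines = p.read_text(encoding="utf-8").splitlines()
    p.write_text("\n".join(append_entry(lines, entry)) + "\n", encoding="utf-8")


# Tiny in-memory demonstration with two made-up placeholder entries.
demo = ["2", "entry-one", "entry-two"]
print(append_entry(demo, "entry-three"))
```

Working on a copy of han.unicharset rather than the original seems safer, since the sketch does no validation of the entry line.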

** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box

Output (unicharset):
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights
No warning this time.

Output ([lang].unicharset):
𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 4 0 4 𥮗 # 𥮗 [25b97 ]x
=> changed

What I found out:

  1. A [script].unicharset file is officially supported.
  2. Each entry has these properties: 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
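
To illustrate that entry layout, here is a small Python sketch that splits one unicharset line into the named properties. `Entry`, `parse_entry` and `METRIC_NAMES` are names invented for this example. The ten metric names follow my reading of the unicharset(5) man page and should be treated as an assumption, and the parser only handles the two line shapes shown in this post (with and without a trailing normed_form).

```python
from typing import NamedTuple, Optional, Tuple

# Assumed meaning of the ten glyph_metrics values (per the unicharset(5) man page).
METRIC_NAMES = (
    "min_bottom", "max_bottom", "min_top", "max_top",
    "min_width", "max_width", "min_bearing", "max_bearing",
    "min_advance", "max_advance",
)


class Entry(NamedTuple):
    character: str
    properties: str
    glyph_metrics: Tuple[int, ...]
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: Optional[str]


def parse_entry(line):
    """Parse one unicharset entry line; the trailing '# ...' comment is dropped."""
    body = line.split(" # ", 1)[0]
    f = body.split()
    if len(f) not in (7, 8):
        raise ValueError("unexpected number of fields: %d" % len(f))
    metrics = tuple(int(v) for v in f[2].split(","))
    normed = f[7] if len(f) == 8 else None
    return Entry(f[0], f[1], metrics, f[3], int(f[4]), int(f[5]), int(f[6]), normed)


line = "\U00025B97 1 61,64,255,255,188,200,6,11,205,224 Han 23514 0 23514 \U00025B97 # \U00025B97 [25b97 ]x"
entry = parse_entry(line)
print(dict(zip(METRIC_NAMES, entry.glyph_metrics)))
```

This only reads the metrics that are already in a file. It does not answer the question of how to derive them from a font.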

How can I get the 'glyph_metrics' data from a font, or from several fonts?

Thank you in advance.

Regards,

@Shreeshrii
Contributor

Shreeshrii commented Oct 24, 2017 via email

@sethleech
Author

sethleech commented Oct 25, 2017

My project runs on Android devices.
At the moment Tesseract 4.0 can't be used on Android because of build issues related to "AVX" and "SSE",
so I am limited to Tesseract 3.05.01.

Is there any information that applies to that version?

@baishuangcheng

I have the same question.
