How to Add or Edit [script].unicharset in langdata folder? #99
Comments
Based on comments by @theraysmith, none of the other properties are required for LSTM training.
On 24-Oct-2017 12:34 PM, "sethleech" wrote:
How to Add or Edit [script].unicharset in langdata folder?
- I want to know *how to get 'glyph_metrics' data from a font or several fonts*.
Dear all,
I have been trying Tesseract recently, and it is a very good product. Is there a tutorial, or a set of steps, for adding entries to or editing [script].unicharset, for example han.unicharset?
I want to add missing characters, namely the Unicode characters in CJK Extensions B, C, D, E and F:
CJK Unified Ideographs Extension B: U+20000–U+2A6D6
CJK Unified Ideographs Extension C: U+2A700–U+2B734
CJK Unified Ideographs Extension D: U+2B740–U+2B81D
CJK Unified Ideographs Extension E: U+2B820–U+2CEA1
CJK Unified Ideographs Extension F: U+2CEB0–U+2EBE0
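To check which characters in a training text actually fall into these ranges, something like the following could be used. It is only a sketch: the range table copies the code points listed above verbatim, and the names `CJK_EXTENSIONS`, `cjk_extension` and `chars_in_extensions` are made up for illustration.

```python
# Inclusive code point ranges, copied from the list above.
CJK_EXTENSIONS = {
    "B": (0x20000, 0x2A6D6),
    "C": (0x2A700, 0x2B734),
    "D": (0x2B740, 0x2B81D),
    "E": (0x2B820, 0x2CEA1),
    "F": (0x2CEB0, 0x2EBE0),
}


def cjk_extension(ch):
    """Return the extension letter for a single character, or None."""
    cp = ord(ch)
    for name, (lo, hi) in CJK_EXTENSIONS.items():
        if lo <= cp <= hi:
            return name
    return None


def chars_in_extensions(text):
    """Return the sorted unique characters of text that fall in a listed range."""
    return sorted({ch for ch in text if cjk_extension(ch) is not None})
```

For the character used in the examples in this post, `cjk_extension("\U00025B97")` falls in the Extension B range.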
For reference, here is what I tried when training Tesseract.
1st try :
** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box
output is unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights
Warning: properties incomplete for index 4 = 𥮗
output is [lang].unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
=> not changed
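A quick way to see which entries came out unchanged, as in the 1st try, might be to scan the output for lines that still carry the placeholder metrics `0,255,0,255,0,0,0,0,0,0` shown above. This is a heuristic sketch only: whether that placeholder is exactly what triggers the "properties incomplete" warning is an assumption, and the function name is hypothetical.

```python
# Placeholder metrics as they appear in the unchanged output above.
DEFAULT_METRICS = "0,255,0,255,0,0,0,0,0,0"


def entries_with_default_metrics(unicharset_text):
    """Return the characters whose third field is still the placeholder."""
    flagged = []
    lines = unicharset_text.splitlines()
    for line in lines[1:]:  # the first line holds the entry count
        fields = line.split()
        if len(fields) >= 3 and fields[2] == DEFAULT_METRICS:
            flagged.append(fields[0])
    return flagged
```

It takes the file contents as a string, e.g. `entries_with_default_metrics(open("unicharset", encoding="utf-8").read())`.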
2nd try :
I edited the file langdata/han.unicharset:
line 0 (the entry count) : 23514 -> 23515
added a new line at the end of the file : 𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 23514 0 23514 𥮗 # 𥮗 [25b97 ]x
I copied the data 61,64,255,255,188,200,6,11,205,224 from an arbitrary existing line, e.g. line 67.
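The manual edit described here (bump the count on the first line, append one entry at the end) could be scripted roughly as follows. This is a sketch under assumptions: `append_entry` is a hypothetical helper, it works on the file contents as a string, and it does not compute or validate the id fields (23514 in the entry above) or the glyph metrics.

```python
def append_entry(unicharset_text, entry_line):
    """Return new unicharset text with entry_line appended and the count bumped."""
    lines = unicharset_text.splitlines()
    if not lines:
        raise ValueError("empty unicharset text")
    count = int(lines[0].strip())  # first line holds the entry count

    # Refuse to add a character that already has an entry.
    new_char = entry_line.split()[0]
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == new_char:
            raise ValueError(f"{new_char!r} is already present")

    lines[0] = str(count + 1)
    lines.append(entry_line.rstrip("\n"))
    return "\n".join(lines) + "\n"
```

Writing the result to a copy of han.unicharset rather than overwriting the original makes it easy to diff the two files afterwards.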
** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box
output is unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights
no warning
output is [lang].unicharset :
𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 4 0 4 𥮗 # 𥮗 [25b97 ]x
=> changed
I found out :
1. The [script].unicharset file is officially supported.
2. Each entry has these properties : 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
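To make the field order concrete, here is a small sketch that splits one unicharset line into the properties listed above. It assumes the fields are whitespace-separated in that order and that a trailing `# ...` part is a comment, as in the sample lines in this post. The meaning of the ten glyph_metrics numbers is not interpreted here, and `Entry` and `parse_entry` are illustrative names, not part of Tesseract.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    character: str
    properties: str
    glyph_metrics: Tuple[float, ...]
    script: str
    other_case: str
    direction: str
    mirror: str
    normed_form: Optional[str]  # absent in some lines (see the 1st try output)


def parse_entry(line):
    """Split one unicharset entry line into its named fields."""
    # Drop the trailing comment, which starts at a whitespace-delimited '#'.
    body = re.split(r"\s#\s", line.rstrip("\n"), maxsplit=1)[0]
    fields = body.split()
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(fields)}")
    metrics = tuple(float(v) for v in fields[2].split(","))
    normed = fields[7] if len(fields) > 7 else None
    return Entry(fields[0], fields[1], metrics, fields[3],
                 fields[4], fields[5], fields[6], normed)
```

Applied to the two outputs above, the 1st try line yields `script == "NULL"` with no normed_form, while the 2nd try line yields `script == "Han"` with the copied metrics.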
How can I get 'glyph_metrics' data from a font or several fonts?
Thank you in advance.
Regards,
My project is running on an Android device. Is there any information on this, please?
I have the same question.