character order normalization #17

Open
eroux opened this issue Aug 20, 2018 · 0 comments

Comments

Collaborator

eroux commented Aug 20, 2018

It would be useful to apply some optional normalization to input strings so that the search gets more accurate:

  • 0f7a or 0f7c repeated several times -> 0f7b or 0f7d, respectively
  • any other vowel repeated several times -> ignore repetition
  • vowel + 0f71 -> 0f71 + vowel
  • vowel + subscript -> subscript + vowel
  • \u0F65\u0FB1 -> \u0F62\u0FB1
  • \u0F62\u0FBB -> \u0F65\u0FBB
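
The rules above could be sketched roughly as follows. This is only an illustration in Python, not the project's implementation: the function name `normalize`, the set of code points treated as vowels (0f72, 0f74, 0f7a-0f7d, 0f80) and the range treated as subscripts (0f90-0fbc) are assumptions, since the list does not enumerate them. The last two replacements use the code points exactly as listed.

```python
import re

# Assumption: which code points count as "vowel" (0f71 is handled separately).
VOWELS = "\u0f72\u0f74\u0f7a-\u0f7d\u0f80"
# Assumption: "subscript" means the subjoined-letter range.
SUBSCRIPTS = "\u0f90-\u0fbc"

_VOWEL_THEN_0F71 = re.compile(f"([{VOWELS}]+)(\u0f71)")
_VOWEL_THEN_SUBSCRIPT = re.compile(f"([{VOWELS}]+)([{SUBSCRIPTS}]+)")
_REPEATED_VOWEL = re.compile(f"([{VOWELS}])\\1+")


def normalize(text: str) -> str:
    """Apply the reordering and de-duplication rules from the list above."""
    # Move vowels after 0f71 and after subscripts; repeat until nothing changes
    # so that mixed sequences (vowel + 0f71 + subscript) settle as well.
    previous = None
    while previous != text:
        previous = text
        text = _VOWEL_THEN_0F71.sub(r"\2\1", text)
        text = _VOWEL_THEN_SUBSCRIPT.sub(r"\2\1", text)

    # Repeated 0f7a / 0f7c become 0f7b / 0f7d.
    text = re.sub("\u0f7a{2,}", "\u0f7b", text)
    text = re.sub("\u0f7c{2,}", "\u0f7d", text)

    # Any other repeated vowel is reduced to a single occurrence.
    text = _REPEATED_VOWEL.sub(r"\1", text)

    # Literal replacements, code points copied from the list above.
    text = text.replace("\u0f65\u0fb1", "\u0f62\u0fb1")
    text = text.replace("\u0f62\u0fbb", "\u0f65\u0fbb")
    return text
```

Reordering runs before de-duplication so that vowels separated only by a misplaced 0f71 or subscript become adjacent and are then merged. Since the normalization is meant to be optional, a caller would presumably apply `normalize` to both indexed text and queries, or to neither.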

Some of these problems are often ignored by etext producers because layout engines reorder characters without the user knowing (see here), and some fonts combine 0f7a + 0f7a -> 0f7b (which is not really correct).

eroux added a commit that referenced this issue Jun 14, 2022
eroux added a commit that referenced this issue Jun 16, 2022