
Why is the TibSyllableTokenizer char code limit less than 255? #4

Open
haochun opened this issue Oct 26, 2017 · 3 comments


haochun commented Oct 26, 2017

The Tibetan character code range is 0x0F00–0x0FFF, even though some of the characters in it are unused. So why did you limit the check to

`('\u0F40' <= c && c <= '\u0FBC') || ('\u0F20' <= c && c <= '\u0F33') || (c == '\u0F00');`

Why remove head marks and the other characters?
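
For context, that condition is the body of the `isTokenChar` hook that Lucene's `CharTokenizer` exposes. A minimal sketch of such a tokenizer, with an illustrative class name and the import path used by the Lucene versions current at the time (it may differ in newer releases):

```java
import org.apache.lucene.analysis.util.CharTokenizer;

// Minimal sketch, not the project's actual source: a syllable tokenizer that
// keeps only the ranges quoted above and splits on everything else.
public class TibSyllableTokenizerSketch extends CharTokenizer {

    @Override
    protected boolean isTokenChar(int c) {
        // Tibetan letters, vowel signs and subjoined letters (U+0F40–U+0FBC),
        // Tibetan digits and half digits (U+0F20–U+0F33), and U+0F00;
        // head marks, shad, tsheg and whitespace all end the current token.
        return ('\u0F40' <= c && c <= '\u0FBC')
            || ('\u0F20' <= c && c <= '\u0F33')
            || (c == '\u0F00');
    }
}
```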

eroux (Collaborator) commented Oct 26, 2017

Well, it seems to me that it's the usual behavior of Lucene analyzers, and I can't see any use case on our website where it would be relevant to look for punctuation. That said, we're not building the analyzer for our website only, so if you want to add an option to keep punctuation please go ahead!

Out of curiosity, in what context will you be using the analyzer?
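
One hypothetical shape the option eroux mentions could take, building on the sketch above (the subclass and its name are illustrative, not part of the project):

```java
// Hypothetical opt-in variant, illustrating the "keep punctuation" idea:
// accept everything the base sketch accepts, plus the rest of the Tibetan
// block (U+0F00–U+0FFF), which includes head marks, tsheg and shad.
public class TibSyllableWithPunctTokenizerSketch extends TibSyllableTokenizerSketch {

    @Override
    protected boolean isTokenChar(int c) {
        return super.isTokenChar(c) || ('\u0F00' <= c && c <= '\u0FFF');
    }
}
```

A constructor flag on the original tokenizer would work just as well; the point is only that the decision lives entirely inside `isTokenChar`.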

drupchen (Collaborator) commented
In our use cases, we are not interested in indexing any Tibetan punctuation, only words like the ones we would find in dictionaries. This is also what StandardAnalyzer does for English, for example.

Could you give us an example of a situation where this behavior is problematic?

drupchen (Collaborator) commented Oct 26, 2017

As for the word-length limit of 255 chars, it is simply inherited from CharTokenizer (TibWordTokenizer is derived from it).

Have you run into a problem because of it?
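
A minimal usage sketch to make that limit concrete, reusing the illustrative sketch class from above (the real tokenizer's constructor may differ, and the exact behavior at the limit should be checked against the CharTokenizer version in use):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeDemo {
    public static void main(String[] args) throws Exception {
        Tokenizer tok = new TibSyllableTokenizerSketch();
        tok.setReader(new StringReader("བཀྲ་ཤིས་བདེ་ལེགས།"));
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
            // Prints one syllable per line; CharTokenizer's default buffer
            // caps a single token at 255 chars, so an unbroken run of token
            // characters longer than that comes out in pieces.
            System.out.println(term.toString());
        }
        tok.end();
        tok.close();
    }
}
```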
