
Why is the TibSyllableTokenizer char code limit less than 255? #4

Open
haochun opened this issue Oct 26, 2017 · 3 comments


haochun commented Oct 26, 2017

The Tibetan character code range is 0x0F00–0x0FFF, even though some of the characters in it are unused. So why did you limit the check to

`('\u0F40' <= c && c <= '\u0FBC') || ('\u0F20' <= c && c <= '\u0F33') || (c == '\u0F00');`

Why remove head marks and the other characters?
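
For context, that condition is the body of the `isTokenChar` hook that Lucene's `CharTokenizer` exposes. A minimal sketch of such a tokenizer, with an illustrative class name and the import path used by the Lucene versions current at the time (it may differ in newer releases):

```java
import org.apache.lucene.analysis.util.CharTokenizer;

// Minimal sketch, not the project's actual source: a syllable tokenizer that
// keeps only the ranges quoted above and splits on everything else.
public class TibSyllableTokenizerSketch extends CharTokenizer {

    @Override
    protected boolean isTokenChar(int c) {
        // Tibetan letters, vowel signs and subjoined letters (U+0F40–U+0FBC),
        // Tibetan digits and half digits (U+0F20–U+0F33), and U+0F00;
        // head marks, shad, tsheg and whitespace all end the current token.
        return ('\u0F40' <= c && c <= '\u0FBC')
            || ('\u0F20' <= c && c <= '\u0F33')
            || (c == '\u0F00');
    }
}
```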

eroux (Collaborator) commented Oct 26, 2017

Well, it seems to me that it's the usual behavior of Lucene analyzers, and I can't see any use case on our website where it would be relevant to look for punctuation. That said, we're not building the analyzer for our website only, so if you want to add an option to keep punctuation please go ahead!

Out of curiosity, in what context will you be using the analyzer?
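
One hypothetical shape the option eroux mentions could take, building on the sketch above (the subclass and its name are illustrative, not part of the project):

```java
// Hypothetical opt-in variant, illustrating the "keep punctuation" idea:
// accept everything the base sketch accepts, plus the rest of the Tibetan
// block (U+0F00–U+0FFF), which includes head marks, tsheg and shad.
public class TibSyllableWithPunctTokenizerSketch extends TibSyllableTokenizerSketch {

    @Override
    protected boolean isTokenChar(int c) {
        return super.isTokenChar(c) || ('\u0F00' <= c && c <= '\u0FFF');
    }
}
```

A constructor flag on the original tokenizer would work just as well; the point is only that the decision lives entirely inside `isTokenChar`.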

drupchen (Collaborator) commented
In our use cases, we are not interested in indexing any Tibetan punctuation, only words like the ones we would find in dictionaries. This is also what StandardAnalyzer does for English, for example.

Could you give us an example of a situation where this behavior is problematic?

drupchen (Collaborator) commented Oct 26, 2017

As for the word-length limit of 255 chars, it is simply inherited from CharTokenizer (TibWordTokenizer is derived from it).

Have you run into a problem because of it?
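
A minimal usage sketch to make that limit concrete, reusing the illustrative sketch class from above (the real tokenizer's constructor may differ, and the exact behavior at the limit should be checked against the CharTokenizer version in use):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeDemo {
    public static void main(String[] args) throws Exception {
        Tokenizer tok = new TibSyllableTokenizerSketch();
        tok.setReader(new StringReader("བཀྲ་ཤིས་བདེ་ལེགས།"));
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
            // Prints one syllable per line; CharTokenizer's default buffer
            // caps a single token at 255 chars, so an unbroken run of token
            // characters longer than that comes out in pieces.
            System.out.println(term.toString());
        }
        tok.end();
        tok.close();
    }
}
```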
