Tibetan character codes range from 0x0F00 to 0x0FFF. Even though some code points in that range are unassigned or problematic, why did you limit the accepted characters to ('\u0F40' <= c && c <= '\u0FBC') || ('\u0F20' <= c && c <= '\u0F33') || (c == '\u0F00')?
Why remove head marks and the other characters?
Well, it seems to me that it's the usual behavior of Lucene analyzers, and I can't see any use case on our website where it would be relevant to look for punctuation. That said, we're not building the analyzer for our website only, so if you want to add an option to keep punctuation please go ahead!
Out of curiosity, in what context will you be using the analyzer?
In our use cases, we are not interested in indexing any tibetan punctuation, but only words like the ones we would find in dictionaries. This is also what StandardAnalyzer does for English, for example.
Could you give us an example of a situation where this behavior is problematic?
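If someone does want to add the option the maintainer suggests, one possible shape is a flag that widens the accepted range to the full Tibetan block. This is only an illustrative sketch, not the analyzer's actual API: the class name, constructor, and `keepPunctuation` flag are all hypothetical; only the narrow range check in the `else` branch comes from the code quoted above.

```java
// Hypothetical sketch of a "keep punctuation" option for a Tibetan
// token-character check. Names here are illustrative, not the real API.
public final class TibetanCharPredicate {
    private final boolean keepPunctuation;

    public TibetanCharPredicate(boolean keepPunctuation) {
        this.keepPunctuation = keepPunctuation;
    }

    public boolean isTokenChar(char c) {
        if (keepPunctuation) {
            // Accept the whole Tibetan block (U+0F00..U+0FFF), which
            // includes head marks and punctuation such as shad (U+0F0D).
            return '\u0F00' <= c && c <= '\u0FFF';
        }
        // The restriction quoted in the issue: letters and subjoined
        // letters, digits, and the syllable om (U+0F00).
        return ('\u0F40' <= c && c <= '\u0FBC')
            || ('\u0F20' <= c && c <= '\u0F33')
            || (c == '\u0F00');
    }
}
```

With the flag off, punctuation such as shad (U+0F0D) is dropped as before; with it on, the whole block passes through and downstream filters can decide what to index.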