Is there any way to use a custom tokenizer? #791
Is your feature request related to a problem? Please describe.
One of the features tantivy provides is support for custom tokenizers (https://github.com/tantivy-search/tantivy#features), for example tantivy-jieba. Is it possible for Toshi to support this feature?
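For context, this is roughly how a custom tokenizer such as tantivy-jieba is wired into tantivy directly; a minimal sketch, assuming recent tantivy/tantivy-jieba APIs (exact signatures vary across versions):

```rust
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::Index;

fn build_index() -> Index {
    let mut schema_builder = Schema::builder();
    // Point the field at a tokenizer registered under the name "jieba".
    let text_indexing = TextFieldIndexing::default()
        .set_tokenizer("jieba")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let text_options = TextOptions::default()
        .set_indexing_options(text_indexing)
        .set_stored();
    schema_builder.add_text_field("body", text_options);

    let index = Index::create_in_ram(schema_builder.build());
    // Register the Chinese tokenizer under the name the schema refers to.
    index
        .tokenizers()
        .register("jieba", tantivy_jieba::JiebaTokenizer {});
    index
}
```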
Comments

That would require you to build Toshi with that support, right? I suppose we could start conditionally including them and have releases that bundle tokenizers. Do you think that would solve your use case?
@hntd187 Yes, that would be very helpful. Thanks for your awesome work; this project seems very promising.
Which tokenizers specifically would you like to see included? The one you linked hasn't been updated in some time and is two tantivy versions behind, so I don't know whether it still works.
In https://github.com/toshi-search/Toshi/blob/master/toshi-server/src/lib.rs#L55 I added the ability to conditionally include the cang_jie tokenizer when Toshi is built with that feature. If you want, we can add more tokenizers; I'll probably come up with some more general traits to make this kind of implementation easier in the future.
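A rough sketch of what that feature-gated registration can look like; the feature name cang_jie matches the comment above, but the exact construction of CangJieTokenizer and the surrounding code in toshi-server/src/lib.rs may differ:

```rust
use tantivy::Index;

#[cfg(feature = "cang_jie")]
use cang_jie::{CangJieTokenizer, TokenizerOption, CANG_JIE};
#[cfg(feature = "cang_jie")]
use jieba_rs::Jieba;
#[cfg(feature = "cang_jie")]
use std::sync::Arc;

// Compiled in only when Toshi is built with `--features cang_jie`
// (Cargo.toml would mark the cang-jie dependency as optional).
#[cfg(feature = "cang_jie")]
pub fn register_tokenizers(index: &Index) {
    index.tokenizers().register(
        CANG_JIE,
        CangJieTokenizer {
            worker: Arc::new(Jieba::new()),
            option: TokenizerOption::Unicode,
        },
    );
}

// Without the feature this is a no-op, so callers need no cfg logic of their own.
#[cfg(not(feature = "cang_jie"))]
pub fn register_tokenizers(_index: &Index) {}
```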
@hntd187 Thanks very much, that's pretty much what I need. In the future people might want other tokenizers, such as Japanese or Korean ones, but I only need Chinese.