Is there any way to use a custom tokenizer? #791
Is your feature request related to a problem? Please describe.
One of the features tantivy provides is support for custom tokenizers (https://github.com/tantivy-search/tantivy#features), for example tantivy-jieba. Is it possible for Toshi to support this feature?
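For context, this is roughly how a custom tokenizer such as tantivy-jieba is wired into tantivy directly; a minimal sketch, assuming recent tantivy/tantivy-jieba APIs (exact signatures vary across versions):

```rust
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::Index;

fn build_index() -> Index {
    let mut schema_builder = Schema::builder();
    // Point the field at a tokenizer registered under the name "jieba".
    let text_indexing = TextFieldIndexing::default()
        .set_tokenizer("jieba")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let text_options = TextOptions::default()
        .set_indexing_options(text_indexing)
        .set_stored();
    schema_builder.add_text_field("body", text_options);

    let index = Index::create_in_ram(schema_builder.build());
    // Register the Chinese tokenizer under the name the schema refers to.
    index
        .tokenizers()
        .register("jieba", tantivy_jieba::JiebaTokenizer {});
    index
}
```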
Comments

That would require you to build Toshi with that support, right? I suppose we could start conditionally including them and have releases that bundle tokenizers. Do you think that would solve your use case?
@hntd187 Yes, that would be very helpful. Thanks for your awesome work; this project seems very promising.
Which tokenizers specifically would you like to see included? The one you linked hasn't been updated in some time and is two tantivy versions behind, so I don't know whether it still works.
In https://github.com/toshi-search/Toshi/blob/master/toshi-server/src/lib.rs#L55 I added the ability to conditionally include the cang_jie tokenizer when Toshi is built with that feature. If you want, we can add more tokenizers; I'll probably come up with some more general traits to make this kind of implementation easier in the future.
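A rough sketch of what that feature-gated registration can look like; the feature name cang_jie matches the comment above, but the exact construction of CangJieTokenizer and the surrounding code in toshi-server/src/lib.rs may differ:

```rust
use tantivy::Index;

#[cfg(feature = "cang_jie")]
use cang_jie::{CangJieTokenizer, TokenizerOption, CANG_JIE};
#[cfg(feature = "cang_jie")]
use jieba_rs::Jieba;
#[cfg(feature = "cang_jie")]
use std::sync::Arc;

// Compiled in only when Toshi is built with `--features cang_jie`
// (Cargo.toml would mark the cang-jie dependency as optional).
#[cfg(feature = "cang_jie")]
pub fn register_tokenizers(index: &Index) {
    index.tokenizers().register(
        CANG_JIE,
        CangJieTokenizer {
            worker: Arc::new(Jieba::new()),
            option: TokenizerOption::Unicode,
        },
    );
}

// Without the feature this is a no-op, so callers need no cfg logic of their own.
#[cfg(not(feature = "cang_jie"))]
pub fn register_tokenizers(_index: &Index) {}
```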
@hntd187 Thanks very much, that's pretty much what I need. In the future people might want other tokenizers, such as Japanese or Korean ones, but I only need Chinese.