[Proposal][RFC] Support analyzer-based neural sparse query & build BERT tokenizer as pre-defined tokenizer #1052
Comments
It's good to see this RFC. I just wonder:
Hi @yuye-aws,
Yes, we'll build the BERT tokenizer as a built-in tokenizer. For the other supported tokenizers, see https://opensearch.org/docs/latest/analyzers/tokenizers/index/
Users only need to configure the analyzer in the index mappings; there is no need to register a model.
I don't see an overlap between the tokenizer and the neural dense query. A tokenizer can't work alone for dense retrieval, and the text embedding model already contains its own tokenizer.
After further investigation on this issue, we found that the
What/Why
What problems are you trying to solve?
Currently, for neural sparse query, users need to register a sparse_encoding/sparse_tokenize model in advance and provide the model id in the query body. For bi-encoder mode, we do need the ml-commons suite to manage the lifecycle of sparse encoding models. But for doc-only mode, we only use a tokenizer at query time, and it is somewhat heavyweight to manage that with the ml-commons suite. There are several drawbacks:
What are you proposing?
Build the analyzer-based neural sparse query. The sparse_tokenize model will be wrapped as a Lucene Analyzer. Users bind the analyzer to an index field, and the neural sparse query will call the analyzer to encode the query text.
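To make the doc-only flow concrete, here is a minimal Python sketch of the idea (not the plugin's actual Java code): at query time only a tokenizer plus fixed per-token weights are needed, while documents were expanded offline by a full sparse encoding model. The whitespace tokenizer and the weight table below are placeholder assumptions standing in for the BERT tokenizer and its shipped weights.

```python
# Hypothetical per-token weights (e.g. IDF-like values shipped with the tokenizer).
TOKEN_WEIGHTS = {"open": 1.2, "search": 1.5, "neural": 2.1, "sparse": 2.4}

def tokenize(text):
    """Stand-in for the BERT tokenizer: lowercase whitespace split."""
    return text.lower().split()

def encode_query(text):
    """Query-side encoding: token -> fixed weight, no model inference needed."""
    return {tok: TOKEN_WEIGHTS.get(tok, 1.0) for tok in tokenize(text)}

def score(query_vec, doc_vec):
    """Dot product over shared tokens, as in neural sparse scoring."""
    return sum(w * doc_vec.get(tok, 0.0) for tok, w in query_vec.items())

doc_vec = {"neural": 1.8, "sparse": 2.0, "query": 0.7}   # produced offline by the model
query_vec = encode_query("neural sparse")
print(round(score(query_vec, doc_vec), 2))  # → 8.58
```

The point of the sketch: in doc-only mode the query side is pure lookup, which is why a Lucene Analyzer is sufficient and no model inference is required at search time.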
The pretrained amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 will be supported as a pre-defined tokenizer. The token weights are encoded in the payload attribute. Besides being used for the neural sparse query, the analyzer can also be invoked like any other analyzer, e.g. by the analyze API or the chunking processor.
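As an illustration of carrying a per-token weight in a payload, here is a small Python sketch of one plausible encoding, mirroring how Lucene's PayloadHelper stores a float as its four big-endian IEEE-754 bytes (whether the plugin uses exactly this layout is an assumption here):

```python
import struct

def encode_weight(weight: float) -> bytes:
    """Encode a token weight as a 4-byte big-endian IEEE-754 float payload."""
    return struct.pack(">f", weight)

def decode_weight(payload: bytes) -> float:
    """Decode the payload bytes back into the token weight."""
    return struct.unpack(">f", payload)[0]

payload = encode_weight(2.5)
print(payload.hex())           # → 40200000
print(decode_weight(payload))  # → 2.5
```

Because the weight rides along in the token stream's payload attribute, downstream consumers (the neural sparse query, the analyze API, the chunking processor) can all reuse the same analyzer output.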
What is the developer experience going to be?
We will alter the model_id verification logic in the neural sparse query builder and add the pre-defined BERT analyzer.
Are there any security considerations?
N/A
Are there any breaking changes to the API?
We'll support a new query type for neural sparse query: users can bind the analyzer to an index field instead of providing the model id in the query body.
What is the user experience going to be?
create index
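A hedged sketch of what the create-index request could look like (the index name, field name, and the analyzer binding shown here are illustrative assumptions, not a finalized API):

```json
PUT /my-nlp-index
{
  "mappings": {
    "properties": {
      "passage_embedding": {
        "type": "rank_features",
        "search_analyzer": "bert-uncased"
      }
    }
  }
}
```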
search
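The corresponding search could then omit the model id entirely, since the analyzer bound to the field handles query encoding (again, names and exact syntax are illustrative assumptions):

```json
GET /my-nlp-index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "what is neural sparse search"
      }
    }
  }
}
```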
What will it take to execute?
- Wrap the HuggingFaceTokenizer implementation from the DJL library. DJL is already a dependency in ml-commons.
- Add the amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 tokenizer files to the plugin resource directory.