Feature Description
OpenSearch implemented "efficient filters", which apply filtering dynamically, in an iterative fashion, during the kNN search itself rather than only before or after it (see https://opensearch.org/blog/efficient-filters-in-knn/). I think it would be nice if LlamaIndex used this search by default whenever a supported engine is in use (compatibility can be seen here). At this time, efficient filters are supported for Lucene (HNSW) and Faiss (HNSW, IVF).
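To make the request concrete, here is a minimal sketch (my own hand-written query builder, not LlamaIndex code) of what an "efficient filter" kNN query body looks like per the blog post: the filter is nested *inside* the `knn` clause, so it is applied during graph traversal rather than before or after the search.

```python
# Sketch (assumption, not LlamaIndex code): an OpenSearch kNN query body
# with an "efficient filter" embedded inside the knn clause.
from typing import Any, Dict, List


def build_efficient_filter_query(
    vector_field: str,
    query_vector: List[float],
    k: int,
    filter_query: Dict[str, Any],
) -> Dict[str, Any]:
    """Build a kNN query with the filter nested inside the knn clause."""
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": query_vector,
                    "k": k,
                    # Applied *during* the ANN search, not before/after it.
                    "filter": filter_query,
                }
            }
        },
    }


query = build_efficient_filter_query(
    "embedding",
    [0.1, 0.2, 0.3],
    k=5,
    filter_query={"bool": {"must": [{"term": {"genre": "drama"}}]}},
)
```

The key difference from the current painless-scripting approach is that no `script_score` wrapper is needed; the engine handles filtering natively inside the ANN search.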
Reason
In the current implementation, a "painless script" is used whenever filters are present. If I understand correctly, this does not allow ANN search and should therefore scale much worse on large databases, since it has to compute a score for every item that matches the filter.
As described in the current implementation of llama_index.vector_stores.opensearch.OpensearchVectorClient._knn_search_query():
- If there are no filters, do approximate kNN search.
- If there are (pre-)filters, do an exhaustive exact kNN search using "painless scripting".
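The two branches above can be paraphrased as follows (a simplified sketch of the described behaviour, not the actual LlamaIndex source; `l2Squared` is used as the example space type):

```python
# Simplified sketch (hypothetical, paraphrasing the described dispatch):
# the query builder switches strategy based on whether filters are present.
from typing import Any, Dict, List, Optional


def knn_search_query(
    vector_field: str,
    query_vector: List[float],
    k: int,
    pre_filter: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    if pre_filter is None:
        # No filters: fast approximate kNN against the vector index.
        return {
            "size": k,
            "query": {"knn": {vector_field: {"vector": query_vector, "k": k}}},
        }
    # Filters present: exhaustive script_score ("painless scripting") query,
    # which scores every document that matches the filter.
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": pre_filter,
                "script": {
                    "source": (
                        f"1/(1.0 + l2Squared(params.query_value,"
                        f" doc['{vector_field}']))"
                    ),
                    "params": {"query_value": query_vector},
                },
            }
        },
    }
```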
See the below implementation in llama_index.vector_stores.opensearch.OpensearchVectorClient as well:
```python
def __get_painless_scripting_source(
    self, space_type: str, vector_field: str = "embedding"
) -> str:
    """For Painless Scripting, it returns the script source based on space type."""
    source_value = (
        f"(1.0 + {space_type}(params.query_value, doc['{vector_field}']))"
    )
    if space_type == "cosineSimilarity":
        return source_value
    else:
        return f"1/{source_value}"

def _default_painless_scripting_query(
    self,
    query_vector: List[float],
    k: int = 4,
    space_type: str = "l2Squared",
    pre_filter: Optional[Union[Dict, List]] = None,
    vector_field: str = "embedding",
) -> Dict:
    """For Painless Scripting Search, this is the default query."""
    if not pre_filter:
        pre_filter = MATCH_ALL_QUERY
    source = self.__get_painless_scripting_source(space_type, vector_field)
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": pre_filter,
                "script": {
                    "source": source,
                    "params": {
                        "field": vector_field,
                        "query_value": query_vector,
                    },
                },
            }
        },
    }
```
Value of Feature
If I understand this correctly, the current implementation makes vector queries with filters infeasible on large datasets. You would then have to rely on post-filtering instead, which cannot guarantee the number of returned results.
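A toy illustration (synthetic data) of why post-filtering cannot guarantee the number of results: the filter is applied *after* the top-k retrieval, so any filtered-out hits simply shrink the result set below k.

```python
# Toy example (made-up data): post-filtering the top-k hits of an ANN
# search can return fewer than the k results the caller asked for.
from typing import Dict, List

# Pretend these are the top-5 nearest neighbours returned by ANN search.
top_k_hits: List[Dict] = [
    {"id": 1, "genre": "drama"},
    {"id": 2, "genre": "comedy"},
    {"id": 3, "genre": "comedy"},
    {"id": 4, "genre": "drama"},
    {"id": 5, "genre": "comedy"},
]

# Post-filter for genre == "drama": only 2 of the requested 5 survive.
post_filtered = [hit for hit in top_k_hits if hit["genre"] == "drama"]
print(len(post_filtered))  # → 2, fewer than the requested k=5
```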
I really like LlamaIndex and would be happy to see this implemented here. :) I am not an OpenSearch expert, but if I find the time I could also try to implement it myself and create a PR.
I had a look at Haystack's implementation of their OpenSearchDocumentStore, and they implement it as follows. But if I see this correctly, it does simple post-search filtering, which is not ideal, although they state that
Filters are applied during the approximate kNN search to ensure that top_k matching documents are returned.
But according to the OpenSearch blog post linked above, the filters should be inside the knn query, no?
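To spell out the distinction I mean, here is a hand-written sketch (my own assumption about the two DSL shapes, not Haystack or LlamaIndex code): OpenSearch's top-level `post_filter` prunes hits only *after* the search has returned its top-k, while a `filter` nested inside the `knn` clause is applied during the search itself.

```python
# Sketch (assumption): the two possible filter placements in the
# OpenSearch query DSL, using a hypothetical "genre" term filter.
from typing import Any, Dict, List

term_filter: Dict[str, Any] = {"term": {"genre": "drama"}}
vector: List[float] = [0.1, 0.2, 0.3]

# Post-search filtering: k hits are found first, then filtered down,
# so fewer than k results may come back.
post_filtered_query = {
    "size": 10,
    "query": {"knn": {"embedding": {"vector": vector, "k": 10}}},
    "post_filter": term_filter,
}

# Efficient filtering: the filter travels inside the knn clause and is
# applied during the ANN search, so up to k matching results are returned.
efficient_filter_query = {
    "size": 10,
    "query": {
        "knn": {"embedding": {"vector": vector, "k": 10, "filter": term_filter}}
    },
}
```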