Feature Description
OpenSearch implemented "efficient filters", which apply filtering dynamically, in an iterative fashion, during the kNN search itself rather than only before or after it (see https://opensearch.org/blog/efficient-filters-in-knn/). I think it would be nice if LlamaIndex used this search by default whenever a supported engine is in use (compatibility can be seen here). At this time, efficient filters are supported for Lucene (HNSW) and Faiss (HNSW, IVF).
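To make the request concrete, here is a minimal sketch (my own hand-written query builder, not LlamaIndex code) of what an "efficient filter" kNN query body looks like per the blog post: the filter is nested *inside* the `knn` clause, so it is applied during graph traversal rather than before or after the search.

```python
# Sketch (assumption, not LlamaIndex code): an OpenSearch kNN query body
# with an "efficient filter" embedded inside the knn clause.
from typing import Any, Dict, List


def build_efficient_filter_query(
    vector_field: str,
    query_vector: List[float],
    k: int,
    filter_query: Dict[str, Any],
) -> Dict[str, Any]:
    """Build a kNN query with the filter nested inside the knn clause."""
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": query_vector,
                    "k": k,
                    # Applied *during* the ANN search, not before/after it.
                    "filter": filter_query,
                }
            }
        },
    }


query = build_efficient_filter_query(
    "embedding",
    [0.1, 0.2, 0.3],
    k=5,
    filter_query={"bool": {"must": [{"term": {"genre": "drama"}}]}},
)
```

The key difference from the current painless-scripting approach is that no `script_score` wrapper is needed; the engine handles filtering natively inside the ANN search.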
Reason
In the current implementation, a "painless script" is used whenever filters are present. If I understand correctly, this does not allow ANN search and should therefore scale much worse on large databases, since it has to compute a score for every item that matches the filter.
As described in the current implementation of llama_index.vector_stores.opensearch.OpensearchVectorClient._knn_search_query():
- If there are no filters, do approximate kNN search.
- If there are (pre-)filters, do an exhaustive exact kNN search using "painless scripting".
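The two branches above can be paraphrased as follows (a simplified sketch of the described behaviour, not the actual LlamaIndex source; `l2Squared` is used as the example space type):

```python
# Simplified sketch (hypothetical, paraphrasing the described dispatch):
# the query builder switches strategy based on whether filters are present.
from typing import Any, Dict, List, Optional


def knn_search_query(
    vector_field: str,
    query_vector: List[float],
    k: int,
    pre_filter: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    if pre_filter is None:
        # No filters: fast approximate kNN against the vector index.
        return {
            "size": k,
            "query": {"knn": {vector_field: {"vector": query_vector, "k": k}}},
        }
    # Filters present: exhaustive script_score ("painless scripting") query,
    # which scores every document that matches the filter.
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": pre_filter,
                "script": {
                    "source": (
                        f"1/(1.0 + l2Squared(params.query_value,"
                        f" doc['{vector_field}']))"
                    ),
                    "params": {"query_value": query_vector},
                },
            }
        },
    }
```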
See the below implementation in llama_index.vector_stores.opensearch.OpensearchVectorClient as well:
```python
def __get_painless_scripting_source(
    self, space_type: str, vector_field: str = "embedding"
) -> str:
    """For Painless Scripting, it returns the script source based on space type."""
    source_value = (
        f"(1.0 + {space_type}(params.query_value, doc['{vector_field}']))"
    )
    if space_type == "cosineSimilarity":
        return source_value
    else:
        return f"1/{source_value}"

def _default_painless_scripting_query(
    self,
    query_vector: List[float],
    k: int = 4,
    space_type: str = "l2Squared",
    pre_filter: Optional[Union[Dict, List]] = None,
    vector_field: str = "embedding",
) -> Dict:
    """For Painless Scripting Search, this is the default query."""
    if not pre_filter:
        pre_filter = MATCH_ALL_QUERY
    source = self.__get_painless_scripting_source(space_type, vector_field)
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": pre_filter,
                "script": {
                    "source": source,
                    "params": {
                        "field": vector_field,
                        "query_value": query_vector,
                    },
                },
            }
        },
    }
```
Value of Feature
If I understand this correctly, the current implementation makes vector queries with filters infeasible on large datasets. You would then have to rely on post-filtering instead, which cannot guarantee the number of returned results.
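A toy illustration (synthetic data) of why post-filtering cannot guarantee the number of results: the filter is applied *after* the top-k retrieval, so any filtered-out hits simply shrink the result set below k.

```python
# Toy example (made-up data): post-filtering the top-k hits of an ANN
# search can return fewer than the k results the caller asked for.
from typing import Dict, List

# Pretend these are the top-5 nearest neighbours returned by ANN search.
top_k_hits: List[Dict] = [
    {"id": 1, "genre": "drama"},
    {"id": 2, "genre": "comedy"},
    {"id": 3, "genre": "comedy"},
    {"id": 4, "genre": "drama"},
    {"id": 5, "genre": "comedy"},
]

# Post-filter for genre == "drama": only 2 of the requested 5 survive.
post_filtered = [hit for hit in top_k_hits if hit["genre"] == "drama"]
print(len(post_filtered))  # → 2, fewer than the requested k=5
```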
I really like LlamaIndex and would be happy to see this implemented here. :) I am not an OpenSearch expert, but if I find the time I could also try to implement it myself and create a PR.
I had a look at Haystack's implementation of their OpenSearchDocumentStore, and they implement it as follows. But if I see this correctly, it does simple post-search filtering, which is not ideal, although they state that
Filters are applied during the approximate kNN search to ensure that top_k matching documents are returned.
But according to the OpenSearch blog post linked above, the filters should be inside the knn query, no?
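To spell out the distinction I mean, here is a hand-written sketch (my own assumption about the two DSL shapes, not Haystack or LlamaIndex code): OpenSearch's top-level `post_filter` prunes hits only *after* the search has returned its top-k, while a `filter` nested inside the `knn` clause is applied during the search itself.

```python
# Sketch (assumption): the two possible filter placements in the
# OpenSearch query DSL, using a hypothetical "genre" term filter.
from typing import Any, Dict, List

term_filter: Dict[str, Any] = {"term": {"genre": "drama"}}
vector: List[float] = [0.1, 0.2, 0.3]

# Post-search filtering: k hits are found first, then filtered down,
# so fewer than k results may come back.
post_filtered_query = {
    "size": 10,
    "query": {"knn": {"embedding": {"vector": vector, "k": 10}}},
    "post_filter": term_filter,
}

# Efficient filtering: the filter travels inside the knn clause and is
# applied during the ANN search, so up to k matching results are returned.
efficient_filter_query = {
    "size": 10,
    "query": {
        "knn": {"embedding": {"vector": vector, "k": 10, "filter": term_filter}}
    },
}
```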