
Support score type threshold in radial search #1589


Member

@junqiu-lei commented Apr 3, 2024

Description

This PR lets users search vectors with a score threshold. For ANN search with the Lucene and Faiss engines, the search results contain all documents whose score is higher than the given threshold.

Usage

There is no behavior change during the index mapping and indexing stages. Some query examples:

  • Faiss engine with L2 space type
curl --location --request GET 'http://localhost:9200/target-index-faiss/_search' \
--header 'Content-Type: application/json' \
--data '{
  "size":12,  
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [7.5, 8.5],
        "score": 0.5
      }
    }
  }
}'

// Query results
{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.7462686,
        "hits": [
            {
                "_index": "target-index-faiss",
                "_id": "1",
                "_score": 0.7462686,
                "_source": {
                    "my_vector1": [
                        7.0,
                        8.2
                    ],
                    "price": 4.4
                }
            }
        ]
    }
}
  • Lucene engine with Inner product space type
curl --location --request GET 'http://localhost:9200/target-index-lucene/_search' \
--header 'Content-Type: application/json' \
--data '{
  "size":12,  
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [7.0, 8.5],
        "score": 25
      }
    }
  }
}'
 
// Query results
{
    "took": 303,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 69.45,
        "hits": [
            {
                "_index": "target-index-lucene",
                "_id": "13",
                "_score": 69.45,
                "_source": {
                    "my_vector1": [
                        -2.0,
                        9.7
                    ],
                    "price": 18.0
                }
            },
            {
                "_index": "target-index-lucene",
                "_id": "17",
                "_score": 39.3,
                "_source": {
                    "my_vector1": [
                        -5.7,
                        9.2
                    ],
                    "price": 3.1
                }
            },
            {
                "_index": "target-index-lucene",
                "_id": "7",
                "_score": 34.45,
                "_source": {
                    "my_vector1": [
                        -1.9,
                        5.5
                    ],
                    "price": 11.0
                }
            },
            {
                "_index": "target-index-lucene",
                "_id": "3",
                "_score": 27.5,
                "_source": {
                    "my_vector1": [
                        -3.5,
                        6.0
                    ],
                    "price": 19.1
                }
            }
        ]
    }
}
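
As a sanity check on the two responses above, the returned scores can be reproduced by hand. This is a minimal sketch, not the plugin's code: the class and method names are hypothetical, and the formulas are assumptions inferred from the example data (score = 1 / (1 + squared L2 distance) for Faiss L2, and score = dot + 1 for non-negative inner products in Lucene; the negative branch is likewise an assumption).

```java
// Sketch only: hypothetical names, formulas assumed from the example responses.
final class RadialScoreExamples {

    // Squared Euclidean distance mapped to a score via 1 / (1 + d).
    static float l2Score(float[] query, float[] doc) {
        float squaredDistance = 0f;
        for (int i = 0; i < query.length; i++) {
            float diff = query[i] - doc[i];
            squaredDistance += diff * diff;
        }
        return 1f / (1f + squaredDistance);
    }

    // Inner product mapped to a score: dot + 1 when dot >= 0,
    // and 1 / (1 - dot) when dot < 0 (assumed scaling for negative products).
    static float innerProductScore(float[] query, float[] doc) {
        float dot = 0f;
        for (int i = 0; i < query.length; i++) {
            dot += query[i] * doc[i];
        }
        return dot < 0 ? 1f / (1f - dot) : dot + 1f;
    }

    public static void main(String[] args) {
        // Faiss / L2 example: query [7.5, 8.5] against document 1.
        System.out.println(l2Score(new float[] {7.5f, 8.5f}, new float[] {7.0f, 8.2f}));
        // Lucene / inner product example: query [7.0, 8.5] against document 13.
        System.out.println(innerProductScore(new float[] {7.0f, 8.5f}, new float[] {-2.0f, 9.7f}));
    }
}
```

Under these assumptions the two calls give roughly 0.7463 and 69.45, matching the `_score` values of the first hit in each response, and both exceed the thresholds passed in the queries (0.5 and 25).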

Issues Resolved

Part of #814

Check List

  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@junqiu-lei added the Features, feature branch, and v2.14.0 labels Apr 3, 2024
@junqiu-lei self-assigned this Apr 3, 2024

codecov bot commented Apr 3, 2024

Codecov Report

Attention: Patch coverage is 75.00000%, with 15 lines in your changes missing coverage. Please review.

Project coverage is 84.85%. Comparing base (3701835) to head (d4f70de).
Report is 6 commits behind head on feature/radius-search.

Files                                                   Patch %   Lines
...rg/opensearch/knn/index/query/KNNQueryBuilder.java   79.06%    4 Missing and 5 partials ⚠️
.../main/java/org/opensearch/knn/index/SpaceType.java   25.00%    2 Missing and 1 partial ⚠️
...main/java/org/opensearch/knn/index/util/Faiss.java   83.33%    0 Missing and 1 partial ⚠️
...ain/java/org/opensearch/knn/index/util/Lucene.java   80.00%    1 Missing ⚠️
...ain/java/org/opensearch/knn/index/util/Nmslib.java   0.00%     1 Missing ⚠️
Additional details and impacted files
@@                     Coverage Diff                     @@
##             feature/radius-search    #1589      +/-   ##
===========================================================
+ Coverage                    84.82%   84.85%   +0.02%     
- Complexity                    1400     1429      +29     
===========================================================
  Files                          174      174              
  Lines                         5713     5802      +89     
  Branches                       569      588      +19     
===========================================================
+ Hits                          4846     4923      +77     
- Misses                         621      630       +9     
- Partials                       246      249       +3     


SpaceType,
Function<Float, Float>>builder()
.put(SpaceType.INNER_PRODUCT, score -> score > 1 ? 1 - score : 1 / score - 1)
.put(SpaceType.L2, score -> 1 / score - 1)
Collaborator

For the L2 space type, can we put this translation in the SpaceType enum class?

Member Author

Yes, updated.

private final static Map<SpaceType, Function<Float, Float>> SCORE_TO_DISTANCE_TRANSFORMATIONS = ImmutableMap.<
SpaceType,
Function<Float, Float>>builder()
.put(SpaceType.INNER_PRODUCT, score -> score > 1 ? 1 - score : 1 / score - 1)
Collaborator

Let's add some Javadoc here explaining why this conversion makes sense.
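
One way to justify the conversion is that each score-to-distance function is the algebraic inverse of a distance-to-score translation. The sketch below round-trips a few values to show this. The forward translations are assumptions (1 / (1 + d) for L2, and a two-branch form for inner product in which the distance is the negated inner product), and the class and field names are hypothetical.

```java
import java.util.function.Function;

// Sketch only: hypothetical names; the forward translations are assumptions.
final class ScoreDistanceRoundTrip {

    // Assumed forward translation for L2 (distance >= 0): score = 1 / (1 + distance).
    static final Function<Float, Float> L2_DISTANCE_TO_SCORE = d -> 1 / (1 + d);

    // Inverse from the PR: distance = 1 / score - 1.
    static final Function<Float, Float> L2_SCORE_TO_DISTANCE = s -> 1 / s - 1;

    // Assumed forward translation for inner product, with distance = -(inner product):
    // non-negative distances give scores in (0, 1], negative distances give scores > 1.
    static final Function<Float, Float> IP_DISTANCE_TO_SCORE = d -> d >= 0 ? 1 / (1 + d) : 1 - d;

    // Inverse from the PR: scores above 1 map back through 1 - score,
    // the rest through 1 / score - 1.
    static final Function<Float, Float> IP_SCORE_TO_DISTANCE = s -> s > 1 ? 1 - s : 1 / s - 1;

    public static void main(String[] args) {
        for (float score : new float[] {0.25f, 0.5f, 1f, 25f}) {
            float distance = IP_SCORE_TO_DISTANCE.apply(score);
            float roundTrip = IP_DISTANCE_TO_SCORE.apply(distance);
            System.out.println(score + " -> " + distance + " -> " + roundTrip);
        }
    }
}
```

The branch at score > 1 mirrors the branch at distance < 0 in the assumed forward function, which is why the inverse needs two cases for inner product but only one for L2.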

Member Author

@navneet1v
Collaborator

Minor comments; overall, the code looks good to me.

@jmazanec15
Member

In general, I'm not in favor of the "score" naming. Users are going to expect it to match the final score produced by the search results, regardless of the query type. So they will be confused and open issues when a more complex query returns a hit whose score is below the score they passed in the filter. I prefer the name "similarity", which is what Lucene calls it (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/FloatVectorSimilarityQuery.java). It will disambiguate this case in the user experience.

@junqiu-lei
Member Author

I also thought about "similarity". But I think users can easily understand what value the name "score" refers to in k-NN. The name "similarity" might be confusing, especially for users new to the feature, about what value they need to provide: it could mean score similarity or distance similarity. Lucene only has a score threshold, so the "similarity" naming works fine there. In OpenSearch k-NN we support both a score threshold and a distance threshold, so I think "score", as the lower-level term, might be more suitable for us?

cc: @vamshin @navneet1v @dylan-tong-aws

Member

@jmazanec15 left a comment

Per our offline sync, I think the "score" param is okay if we explain it in detail in the documentation. The key is that score here is scoped to this query context.

@@ -499,4 +541,24 @@ protected int doHashCode() {
    public String getWriteableName() {
        return NAME;
    }

    private static void validSingleQueryType(Integer k, Float distance, Float score) {
Member

nit: valid -> validate

Member Author

ack

Member Author

updated
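
For readers without the diff at hand, a check of this kind could look roughly like the sketch below. Only the parameter list comes from the hunk above; the class name, the "exactly one of the three" rule, and the error message are assumptions made for illustration.

```java
// Sketch only: hypothetical class; the validation rule is assumed from the method name.
final class QueryTypeValidation {

    // Throws unless exactly one of k, distance, or score is provided.
    static void validateSingleQueryType(Integer k, Float distance, Float score) {
        int provided = 0;
        if (k != null) {
            provided++;
        }
        if (distance != null) {
            provided++;
        }
        if (score != null) {
            provided++;
        }
        if (provided != 1) {
            throw new IllegalArgumentException("Exactly one of k, distance, or score must be set, but got " + provided);
        }
    }

    public static void main(String[] args) {
        validateSingleQueryType(null, null, 0.5f); // accepted: score only
        try {
            validateSingleQueryType(10, null, 0.5f); // rejected: both k and score
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```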

Signed-off-by: Junqiu Lei <[email protected]>
Member

@jmazanec15 left a comment

LGTM - thanks @junqiu-lei

@@ -36,6 +36,14 @@ public float scoreTranslation(float rawScore) {
    public VectorSimilarityFunction getVectorSimilarityFunction() {
        return VectorSimilarityFunction.EUCLIDEAN;
    }

    @Override
    public float scoreToDistanceTranslation(float score) {
Collaborator

Are we adding the score-to-distance translation for L2 only, and not for cosine and dot product?

Member Author

@navneet1v IMO there is no need for cosine and dot product here. The SpaceType enum class holds the translations that are common across engines:

  • A score-to-distance function for cosine isn't needed: Lucene's radial search API takes the score parameter directly, and the Faiss engine in k-NN doesn't support the cosine space type.
  • A score-to-distance function for dot product is only needed in the Faiss engine, where the translation differs from Lucene's, so I just put the function inside Lucene.java where it's needed.
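
The split described above can be sketched as an enum in which only the constants with a common translation override the method. This is an illustration, not the plugin's SpaceType: the enum name, the constant list, and the default that throws are all assumptions; only the L2 formula (1 / score - 1) comes from this PR.

```java
// Sketch only: SketchSpaceType is hypothetical and stands in for the real enum.
enum SketchSpaceType {
    L2 {
        // Shared across engines: distance = 1 / score - 1.
        @Override
        float scoreToDistanceTranslation(float score) {
            return 1 / score - 1;
        }
    },
    COSINESIMIL,
    INNER_PRODUCT;

    // Assumed default: space types without a common translation reject the call,
    // leaving engine-specific conversions to the engine classes.
    float scoreToDistanceTranslation(float score) {
        throw new UnsupportedOperationException("No common score-to-distance translation for " + name());
    }

    public static void main(String[] args) {
        System.out.println(L2.scoreToDistanceTranslation(0.5f));
        try {
            INNER_PRODUCT.scoreToDistanceTranslation(0.5f);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the engine-specific inner product conversion out of the enum avoids implying that one formula holds for every engine.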

@junqiu-lei merged commit d334fa4 into opensearch-project:feature/radius-search Apr 5, 2024
50 of 53 checks passed
@junqiu-lei deleted the score-threshold-radial-1 branch April 5, 2024 18:16
junqiu-lei added a commit that referenced this pull request Apr 6, 2024
* Support score type threshold in radial search

Signed-off-by: Junqiu Lei <[email protected]>
junqiu-lei added a commit that referenced this pull request Apr 11, 2024
* Support score type threshold in radial search

Signed-off-by: Junqiu Lei <[email protected]>
junqiu-lei added a commit to junqiu-lei/k-NN that referenced this pull request Apr 17, 2024
* Support score type threshold in radial search

Signed-off-by: Junqiu Lei <[email protected]>