Investigate migrating from native ANN search at weight.scorer create level to query.rewrite level #1845

jmazanec15 · 2024-07-17T20:25:24Z

Description

Currently, the ANN search is executed at the KNNWeight.scorer() stage. Scorers are created at the per-segment level. The scorer returned more or less just iterates over the matches that the search returned.

In Lucene, the ANN search is done in the query rewrite phase. This still builds scorers for each segment, but limits the number of hits for the shard to k.
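To make the shard-level limit concrete, here is a minimal sketch of reducing per-segment hits to k hits for the shard. It is an illustration only: `ShardTopKSketch`, `Hit`, and `mergeTopK` are hypothetical names, and this is not the Lucene or plugin implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: not Lucene or k-NN plugin code.
final class ShardTopKSketch {

    /** A (docId, score) pair; a higher score means a better match. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Reduces per-segment hit lists (each holding at most k hits) to the best
     * k hits for the whole shard, ordered from best to worst.
     */
    static List<Hit> mergeTopK(List<List<Hit>> perSegmentHits, int k) {
        // Min-heap on score: the head is the weakest hit currently kept.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (List<Hit> segmentHits : perSegmentHits) {
            for (Hit hit : segmentHits) {
                heap.offer(hit);
                if (heap.size() > k) {
                    heap.poll(); // evict the weakest so at most k remain
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return merged;
    }
}
```

With two segments and k = 2, up to four candidate hits go in and only the two best come out, so everything downstream sees k hits per shard regardless of the segment count.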

I want to investigate the feasibility of migrating to ANN search execution at the rewrite stage in the plugin. Off the top of my head, it has the following benefits:

The number of results returned per shard is always k. In the plugin today, we produce num_segments*k results for matching. Returning just k results would therefore reduce the dependency of search behavior on the number of segments. It would also let us limit the number of re-score operations that take place as part of [RFC] Optimized Disk-Based Vector Search #1779.
Ability to share state between segments during search. Lucene introduced a really cool feature in https://github.com/apache/lucene/pull/12962/files that relies on collectors in order to share best hit information between segments. With this, segment count can be less of a factor for performance. The shared state across segments is implemented via the KnnCollector interface. This may be possible by creating something new in the KNNWeight class, but would require duplicate effort.
On duplicate effort: following Lucene's convention of executing the search at rewrite would make it easier to migrate fully to the KnnVectorsFormat provided by Lucene (see Investigate migrating custom codec from BinaryDocValuesFormat to KnnVectorsFormat #1087).
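As a rough illustration of sharing best-hit information between segments, the sketch below keeps one shard-wide set of the k best scores that every per-segment search consults. It assumes higher scores are better. `SharedBestHitsSketch` and its methods are hypothetical and do not reproduce Lucene's KnnCollector API or the logic in the linked pull request.

```java
import java.util.PriorityQueue;

// Hypothetical sketch: illustrates the idea only, not Lucene's KnnCollector API.
final class SharedBestHitsSketch {
    private final int k;
    // Min-heap of the k best scores seen so far across all segments.
    private final PriorityQueue<Float> bestScores = new PriorityQueue<>();

    SharedBestHitsSketch(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    /** Score a candidate must beat to enter the shard-wide top k. */
    synchronized float minCompetitiveScore() {
        return bestScores.size() < k ? Float.NEGATIVE_INFINITY : bestScores.peek();
    }

    /** Offers a score; returns true if it entered the shard-wide top k. */
    synchronized boolean collect(float score) {
        if (bestScores.size() < k) {
            bestScores.offer(score);
            return true;
        }
        if (score > bestScores.peek()) {
            bestScores.poll(); // drop the weakest of the current top k
            bestScores.offer(score);
            return true;
        }
        return false;
    }

    /** Scans one segment's candidate scores; returns how many were collected. */
    static int searchSegment(float[] candidateScores, SharedBestHitsSketch shared) {
        int collected = 0;
        for (float score : candidateScores) {
            if (shared.collect(score)) {
                collected++;
            }
        }
        return collected;
    }
}
```

Because the second segment searched already sees the threshold set by the first, it can reject weak candidates immediately. In a real graph search that threshold could also be used to stop exploring early, which is where the performance benefit would come from.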

That being said, I still need to figure out what the downsides may be. At the moment, I cannot think of any, but will update here if I find any.

In order to do this migration, I think we would need to:

Build PoC - see what tests break
A/B performance test the PoC
Feature compatibility deep dive (concurrent segment search, neural search, hybrid search, etc.)

jmazanec15 added the Enhancements label Jul 17, 2024

github-actions bot added the untriaged label Jul 17, 2024

jmazanec15 removed the untriaged label Jul 17, 2024

jmazanec15 mentioned this issue Jul 22, 2024

[RFC] Two-phased Search Re-score Design #1861

Closed

shatejas mentioned this issue Aug 1, 2024

Introduces NativeEngineKNNQuery which executes ANN on rewrite #1877

Merged


naveentatikonda closed this as completed Sep 18, 2024
