[META] Improve performance for exact search with script scoring #1709

Open · 4 tasks done
jmazanec15 opened this issue May 20, 2024 · 0 comments
Labels: Enhancements (increases software capabilities beyond original client specifications)

Comments

jmazanec15 (Member) commented May 20, 2024

Description

Meta-issue for improving performance of exact search with script scoring.

AIs (action items)

Single Node Experiments

Overview

I ran several experiments showcasing the performance of exact scoring on a single node. In addition, I captured several profiling examples.

The testing code can be found in https://github.com/jmazanec15/opensearch-knn-rescore-experiments/. The code that was benchmarked can be found in https://github.com/jmazanec15/k-NN-1/tree/exact-scoring-exps. The Cohere dataset with 1M 768-dimensional vectors, 10k queries, and the innerproduct space type was used.
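
For reference, the path being measured is exact k-NN search through the script score API. Below is a minimal sketch of the kind of request involved, sent via the Java low-level REST client; the index name, field name, and query vector are placeholders, and the actual benchmark requests live in the repos linked above.

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

// Sketch: run an exact k-NN script-score query against a hypothetical index
// "target-index" with a vector field "target-field" (placeholder values throughout).
public class ExactSearchQuerySketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("POST", "/target-index/_search");
            request.setJsonEntity("""
                {
                  "size": 100,
                  "query": {
                    "script_score": {
                      "query": { "match_all": {} },
                      "script": {
                        "source": "knn_score",
                        "lang": "knn",
                        "params": {
                          "field": "target-field",
                          "query_value": [0.1, 0.2, 0.3],
                          "space_type": "innerproduct"
                        }
                      }
                    }
                  }
                }
                """);
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```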

Result Summary

1. For 1M 768-dimensional vectors, the fastest exact search achieved is a p50/p90/p99 latency of 102 ms/119 ms/124 ms, with Lucene-backed storage and SIMD enabled.
2. There is roughly a 2x difference in performance between the Lucene and plugin formats. When all vectors fit in memory, the p50/p90/p99 latency is 323 ms/325 ms/344 ms (no SIMD) for Lucene-backed storage and 568 ms/584 ms/594 ms (no SIMD) for plugin-backed storage. This indicates that Lucene's vector format is almost 2x faster than the plugin's for script scoring. The cause appears to be that Lucene can map float vectors directly into the JVM via Panama, whereas the plugin has to copy the bytes in and then deserialize them (see the read-path sketch after this list). There is overlap with [Enhancement] Optimize the de-serialization of vector when reading from Doc Values #1050.
3. SIMD gave a 3x improvement over non-SIMD. Without SIMD, for Lucene-backed storage, the p50/p90/p99 latency is 323 ms/325 ms/344 ms. With SIMD, it is 102 ms/119 ms/124 ms (see the inner-product sketch after this list).
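
To make point 2 concrete, here is an illustrative comparison of the two read paths. This is a sketch, not the plugin's actual serialization code: the plugin-backed path copies bytes out of binary doc values and deserializes them into a fresh float[] per document, while Lucene's vector format hands back a float[] read off memory-mapped storage with no per-query deserialization step.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

// Illustrative sketch only; not the plugin's real serialization format.
public final class ReadPathSketch {

    // Plugin-backed storage (sketch): vectors live in binary doc values, so every
    // score call copies the bytes out and converts them element by element.
    static float[] readFromDocValues(BinaryDocValues values, int docId, int dim) throws IOException {
        if (!values.advanceExact(docId)) {
            return null;
        }
        BytesRef bytes = values.binaryValue();
        ByteBuffer buffer = ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length);
        float[] vector = new float[dim];
        for (int i = 0; i < dim; i++) {
            vector[i] = buffer.getFloat(); // per-element copy + conversion
        }
        return vector;
    }

    // Lucene-backed storage (sketch): the vectors format exposes float[]s directly,
    // so there is no deserialization step inside the scoring loop.
    static float[] readFromVectorValues(LeafReader reader, String field, int docId) throws IOException {
        FloatVectorValues values = reader.getFloatVectorValues(field);
        if (values == null || values.advance(docId) != docId) {
            return null;
        }
        return values.vectorValue();
    }
}
```

For point 3, the hot loop of exact scoring is the distance kernel itself. Below is a sketch of a scalar inner product next to a Panama Vector API (jdk.incubator.vector) version, which is the style of kernel Lucene uses when SIMD is enabled. This is the idea, not Lucene's exact code, and it needs --add-modules jdk.incubator.vector to compile.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Scalar vs. SIMD inner-product kernels (sketch).
public final class InnerProductSketch {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Scalar baseline: one multiply-add per element.
    static float scalarDot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Panama version: processes SPECIES.length() floats per iteration, then reduces.
    static float simdDot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // acc += a * b, lane-wise
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) { // scalar tail
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```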

Test Configuration

The following configurations were used to execute these tests:

| Config Key | Value |
| --- | --- |
| Base OS version | 2.14 |
| Instance type | r5.4xlarge |
| Data set | cohere |
| Num index vecs | 1M |
| Dimension | 768 |
| OSB Mem | 64 |
| OSB CPU | 4 |
| OS-metrics Mem | 1 GB |
| OS-metrics JVM | 512 MB |
| OS-metrics CPUs | 2 |
| OS-test Mem | 12 GB |
| OS-test JVM | 4 GB |
| OS-test CPUs | 8 |
| Primary shards | 8 |
| Replica shards | 0 |
| Segment count | 1 |
| k | 100 |
| Space type | innerproduct |
| Disk | gp3 |
| Disk size | 500 GB |
| Query clients | 1 |
| Index clients | 1 |

Results - Lucene w/o SIMD

| Run # | p50 latency (ms) | p90 latency (ms) | p99 latency (ms) | Recall |
| --- | --- | --- | --- | --- |
| 1 | 324 | 326 | 344 | 0.99998 |
| 2 | 324 | 325 | 328 | 0.99998 |


Results - Lucene w/ SIMD

| Run # | p50 latency (ms) | p90 latency (ms) | p99 latency (ms) | Recall |
| --- | --- | --- | --- | --- |
| 1 | 102 | 119 | 124 | 0.999999 |
| 2 | 103 | 119 | 125 | 0.999999 |


Results - Plugin w/o SIMD

| Run # | p50 latency (ms) | p90 latency (ms) | p99 latency (ms) | Recall |
| --- | --- | --- | --- | --- |
| 1 | 674 | 684 | 692 | 0.99998 |
| 2 | 674 | 684 | 692 | 0.99998 |


Results - Plugin w/o SIMD (with optimizations)

| Run # | p50 latency (ms) | p90 latency (ms) | p99 latency (ms) | Recall |
| --- | --- | --- | --- | --- |
| 1 | 568 | 584 | 594 | 0.99998 |
| 2 | 568 | 584 | 596 | 0.99998 |

