[FEATURE] Creating Vector data structures Greedily #1942

navneet1v · 2024-08-08T18:31:24Z

Description

As of OpenSearch 2.13, whenever a segment is created we also create the data structures required for vector search (graphs for the HNSW algorithm, buckets for the IVF algorithm, etc.). When segments get merged, unlike the inverted file index and BKD trees, these data structures are not merged; we build them again from scratch (true for the native engines, and for Lucene if deletes are present). Example: if we merge 2 segments with 1k documents each, the graphs created in both segments are ignored and a new graph with 2k documents is built. This wastes compute (building vector search data structures is very expensive) and slows down the build time for vector indices.
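To put rough numbers on the example above, here is a small back-of-the-envelope sketch. It only counts how many vectors get inserted into a graph; the pairwise merge order and both function names are assumptions made for illustration, not how the merge policy actually picks segments.

```python
def eager_insertions(segment_sizes):
    """Vectors inserted into graphs when every flush and every merge builds a graph."""
    sizes = list(segment_sizes)
    total = sum(sizes)  # one graph per freshly created segment
    while len(sizes) > 1:
        merged = sizes.pop() + sizes.pop()  # merge two segments (assumed pairwise)
        total += merged                     # merged graph is rebuilt from scratch
        sizes.append(merged)
    return total


def deferred_insertions(segment_sizes):
    """Vectors inserted when the graph is built once, after indexing and merges."""
    return sum(segment_sizes)


if __name__ == "__main__":
    sizes = [1000, 1000]
    print(eager_insertions(sizes))     # 1000 + 1000 + 2000 = 4000
    print(deferred_insertions(sizes))  # 2000
```

For the 2 x 1k case the eager path does twice the graph insertions of the deferred path, and the gap grows with every extra round of merges.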

Hence the idea is we should build these data structures greedily.

Delaying Vector Data structures Creation:
1. For bulk ingestion, the proposal is to completely stop creating vector data structures during segment creation ~~and merges of segments~~. We should only create vector data structures once the whole indexing ~~and merges(including force merge if possible)~~ is completed. This ensures that we create the vector data structures only once. See Appendix A of the GH issue [META] [Build-Time] Improving Build time for Vector Indices #1599 for how this can be done.
2. If a user is constantly indexing and searching, not building graphs will lead to high search latency, and the user might not have time to run a force merge to trigger graph creation. Hence we should build a capability where, based on a certain threshold, OpenSearch can decide for which segments a graph needs to be created. This will improve ingestion speed. See this community request: [FEATURE] Separate refresh interval for KNN segment merge #1577
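A minimal sketch of what the threshold decision in item 2 could look like. The helper name and the exact semantics (negative value disables graph creation, otherwise build once a segment holds at least `threshold` vectors) are assumptions for illustration, not the plugin's actual code.

```python
def should_build_graph(num_vectors: int, threshold: int) -> bool:
    """Decide whether a segment gets a graph (hypothetical helper).

    Assumed semantics, not taken from the plugin:
      threshold < 0  -> never build (graph creation disabled)
      threshold >= 0 -> build only when the segment holds at least `threshold` vectors
    """
    if threshold < 0:
        return False
    return num_vectors >= threshold


if __name__ == "__main__":
    for docs in (500, 20_000):
        print(docs, should_build_graph(docs, threshold=1_000))
```

Small segments that are likely to be merged away soon skip the expensive build, while large segments still get a graph.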

Having the capability to disable graph creation is an extreme option and will be used for cases where we need high-speed indexing, index rebuilds, etc. On top of this feature, the next capability to be added is threshold-based graph builds. This will ensure that greedy graph building can be used for more general use cases, with search still possible when a graph is not present.

Incremental graph creation: For native engines (Nmslib and Faiss), the vector data structures are built during segment creation (i.e. OS refresh), which leads to CPU spikes and sometimes throttling due to high CPU (we saw this in various benchmarks). At other times, when segments are not being created, CPU utilization stays very low. This leads to uneven CPU usage (basically a sawtooth curve of CPU utilization), and users are not able to push more documents to OpenSearch at steady state. Specifically for streaming ingestion use cases, we can create graphs incrementally during ingestion (the Lucene engine already implements this). This spreads the CPU utilization peak over a period of time, so that when an OS refresh happens the graph is already created. The graph is then available for search right away, which reduces the search-after-indexing time.
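The sawtooth effect can be illustrated with a toy model. This is only a sketch: one unit of work per vector, a fixed refresh interval, and the function name are all assumptions, and real graph insertion cost is not constant per vector.

```python
def work_profile(arrivals, refresh_every, incremental):
    """Graph-build work per tick, counted as number of vectors inserted.

    arrivals: vectors arriving in each tick
    refresh_every: a refresh happens every N ticks
    incremental: insert on arrival (True) or all at refresh time (False)
    """
    profile = []
    buffered = 0
    for tick, n in enumerate(arrivals, start=1):
        if incremental:
            profile.append(n)  # work is spread evenly over ingestion
        else:
            buffered += n
            if tick % refresh_every == 0:
                profile.append(buffered)  # spike: whole buffer built at refresh
                buffered = 0
            else:
                profile.append(0)         # idle between refreshes
    # Note: vectors still buffered after the last refresh are not counted.
    return profile


if __name__ == "__main__":
    arrivals = [100] * 6
    print(work_profile(arrivals, refresh_every=3, incremental=False))
    print(work_profile(arrivals, refresh_every=3, incremental=True))
```

Both profiles do the same total work here; the incremental one just has a lower peak, which is the point of spreading the build over ingestion.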

References:

Meta Issue: [META] [Build-Time] Improving Build time for Vector Indices #1599
Community request: [FEATURE] Separate refresh interval for KNN segment merge #1577

navneet1v · 2024-08-08T18:35:55Z

With the changes done as part of #1938 and #1853, we will have the groundwork for incremental graph creation. Once these issues are resolved we can start working on this feature.

navneet1v · 2024-08-28T01:33:49Z

So for 1.i) the process will be:

1. The user disables graph creation for the index and indexes all the vectors.
2. The user enables graph creation for the index.
3. The user runs a force merge to 1 segment to ensure that all KNN data structures are created.
4. Perform search.

The idea is that with #2007 there will be a speedup in 2, and overall a reduction in build time.
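The flow above could be scripted roughly as below. This sketch only builds the list of REST calls and sends nothing. The setting name `index.knn.advanced.approximate_threshold` and the values `-1` (skip graph builds) and `0` (always build) are my assumption of what came out of #2007, so please verify them against the k-NN documentation; `bulk_build_plan` and the index name are hypothetical.

```python
import json

# Assumed setting name and values; verify against the k-NN plugin docs.
THRESHOLD_SETTING = "index.knn.advanced.approximate_threshold"
DISABLE_GRAPHS = -1   # assumed: never build graphs at segment creation
ALWAYS_BUILD = 0      # assumed: always build graphs


def bulk_build_plan(index):
    """Return the REST calls for the disable -> index -> enable -> force merge flow."""
    return [
        ("PUT", f"/{index}/_settings", {THRESHOLD_SETTING: DISABLE_GRAPHS}),
        ("POST", f"/{index}/_bulk", None),  # index all vectors (payload omitted)
        ("PUT", f"/{index}/_settings", {THRESHOLD_SETTING: ALWAYS_BUILD}),
        ("POST", f"/{index}/_forcemerge?max_num_segments=1", None),
        ("POST", f"/{index}/_search", None),  # k-NN query body omitted
    ]


if __name__ == "__main__":
    for method, path, body in bulk_build_plan("my-vector-index"):
        print(method, path, json.dumps(body) if body is not None else "")
```

The calls could then be sent with any HTTP client; the force merge is the step where the graph for the single final segment actually gets built.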

Having the capability to disable graph creation is an extreme option and will be used for cases where we need high-speed indexing, index rebuilds, etc. On top of this feature, the next capability to be added is threshold-based graph builds. This will ensure that greedy graph building can be used for more general use cases, with search still possible when a graph is not present.

navneet1v added untriaged enhancement labels Aug 8, 2024

navneet1v assigned VijayanB Aug 8, 2024

navneet1v removed the untriaged label Aug 8, 2024

navneet1v mentioned this issue Aug 8, 2024

[META] [Build-Time] Improving Build time for Vector Indices #1599

Open

7 tasks

navneet1v added indexing-improvements Features labels Aug 8, 2024

VijayanB mentioned this issue Aug 27, 2024

Introduce new setting to configure when to build graph during segment creation #2007

Merged

5 tasks

This was referenced Sep 24, 2024

Add exact search if no native engine files are available #2136

Merged

Add integration test for binary vector values #2142

Merged

KNNIterators should support with and without filters #2155

Merged

This was referenced Oct 1, 2024

Add support for radial search in exact search #2174

Merged

Allow build graph greedily for quantization scenarios #2175

Merged

Add support to build vector data structures greedily and perform exact search when there are no engine files #2188

Open

VijayanB added v2.18.0 Roadmap:Vector Database/GenAI labels Oct 4, 2024

VijayanB mentioned this issue Oct 9, 2024

[DOC] Vector Search now supports new advanced setting to improve indexing performance opensearch-project/documentation-website#8482

Open

4 tasks
