[FEATURE] Creating Vector data structures Greedily #1942
Labels
enhancement
Features
indexing-improvements
Roadmap:Vector Database/GenAI
v2.18.0
Description
As of OpenSearch 2.13, whenever a segment is created we build the data structures required for vector search (for example, graphs for the HNSW algorithm, buckets for the IVF algorithm). When segments are merged, these data structures, unlike the inverted index and BKD trees, are not merged; they are rebuilt from scratch (true for native engines, and for Lucene if deletes are present). Example: if we merge 2 segments with 1k documents each, the graphs created in both segments are discarded and a new graph with 2k documents is built. This wastes compute (building vector search data structures is very expensive) and slows down the build time for vector indices.
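To make the wasted work concrete, here is a minimal sketch in Python that counts how many vectors get inserted into a graph under the current behaviour versus a single deferred build. It is a simplified model, not k-NN plugin code: the function names are hypothetical, segments are assumed to merge pairwise until one remains, and every flush and merge is assumed to rebuild the graph from scratch (the native-engine case described above). Insertion count is only a proxy for real build cost.

```python
def eager_insertions(segment_sizes):
    """Count graph insertions when every flush and every merge rebuilds the graph.

    Assumption: segments merge pairwise, round by round, until one remains.
    """
    total = sum(segment_sizes)  # one graph per flushed segment
    segments = list(segment_sizes)
    while len(segments) > 1:
        merged = []
        for i in range(0, len(segments) - 1, 2):
            size = segments[i] + segments[i + 1]
            total += size  # merged segment's graph is rebuilt from scratch
            merged.append(size)
        if len(segments) % 2 == 1:
            merged.append(segments[-1])  # odd segment carries over unmerged
        segments = merged
    return total


def deferred_insertions(segment_sizes):
    """Count graph insertions when the graph is built once, after all merges."""
    return sum(segment_sizes)


if __name__ == "__main__":
    sizes = [1000, 1000]  # the 2 x 1k example from the description
    print(eager_insertions(sizes), deferred_insertions(sizes))
```

For the two 1k-document segments in the example, this model counts 4000 insertions with eager builds against 2000 with a single deferred build, and the gap widens as the number of merge rounds grows.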
Hence the idea is that we should build these data structures greedily: skip creating them during indexing and merges of segments. We should only create vector data structures once all indexing and merges (including force merge, if possible) are complete. This ensures that vector data structures are created only once. Refer to Appendix A of the GH issue [META] [Build-Time] Improving Build time for Vector Indices #1599 for how this can be done.
Having the capability to disable graph creation entirely is extreme and will be used for cases that need high-speed indexing, index rebuilds, etc. The next capability to be added on top of this feature is threshold-based graph builds. This will make the greedy graph build capability usable for more general use cases, with search still possible when a graph is not present.
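The threshold-based build can be sketched roughly as follows. This is an illustrative Python model only: `BUILD_DISABLED`, `Segment`, `should_build_graph`, `write_segment`, and `search_strategy` are hypothetical names, not part of the k-NN plugin. Falling back to exact search when a segment has no graph is an assumption about how search remains possible; the issue itself does not specify the mechanism.

```python
from dataclasses import dataclass

# Hypothetical sentinel meaning "never build a graph" (the extreme case).
BUILD_DISABLED = -1


@dataclass
class Segment:
    num_vectors: int
    has_graph: bool = False


def should_build_graph(num_vectors, threshold):
    """Build a graph only if building is enabled and the segment is large enough."""
    if threshold == BUILD_DISABLED:
        return False
    return num_vectors >= threshold


def write_segment(num_vectors, threshold):
    """Model a flush or merge that decides whether to build the graph."""
    return Segment(num_vectors, should_build_graph(num_vectors, threshold))


def search_strategy(segment):
    """Assumed fallback: exact search when the segment has no graph."""
    return "approximate" if segment.has_graph else "exact"


if __name__ == "__main__":
    for n in (500, 2000):
        seg = write_segment(n, threshold=1000)
        print(n, seg.has_graph, search_strategy(seg))
```

With a threshold in place, small segments skip the expensive build while large segments still get a graph, which is what would let the feature serve general workloads and not only bulk-indexing scenarios.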