Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Creating Vector data structures Greedily #1942

Open
navneet1v opened this issue Aug 8, 2024 · 2 comments
Open

[FEATURE] Creating Vector data structures Greedily #1942

navneet1v opened this issue Aug 8, 2024 · 2 comments
Assignees
Labels
enhancement Features Introduces a new unit of functionality that satisfies a requirement indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. Roadmap:Vector Database/GenAI Project-wide roadmap label v2.18.0

Comments

@navneet1v
Copy link
Collaborator

navneet1v commented Aug 8, 2024

Description

As of version 2.13 of Opensearch, whenever a segment is created we create the data structures which are required to do vector search(aka graphs for HNSW algorithm, buckets for IVF algorithm etc.). When the segments gets merged unlink inverted file index, BKDs these data structures are not merged, rather we create them from scratch(true for native engines, and Lucene(if deletes are there)). Example: if we are merging 2 segments with 1k documents each, the graphs which are created in both the segments are ignored and a new graph with 2K documents will newly be created. This leads to waste of compute(as build vector search data structures is very expensive) and slows down the build time for Vector indices.

Hence the idea is we should build these data structures greedily.

  1. Delaying Vector Data structures Creation:
    1. For bulk ingestion the proposal is we should completely stop creating vector data structures during segment creation and merges of segments. We should only create vector data structures once the whole indexing and merges(including force merge if possible) is completed. This will ensure that we are creating vector data structures once. Refer Appendix A on how we can do this of GH issue: [META] [Build-Time] Improving Build time for Vector Indices #1599.
    2. If a user is constantly indexing and searching on data then not building graphs will lead to high latency + user might not have time to do force merge to trigger the graph creation, hence, we should build the capability where based on a certain threshold we can Opensearch can decide for which segment graph needs to be created. This will improve the ingestion speed. Ref this community request: [FEATURE] Separate refresh interval for KNN segment merge #1577

Having the capability to disable graph creation is extreme and will be used for cases where we need high speed indexing, index re-builds etc. On top of this feature next capability will be added is threshold based graph builds. This will ensure that this greed graph build based capability is used for more general use-cases with search also possible if graph not present.

  1. Incremental graph creation: For native engines(Nmslib and Faiss) the vector data structures are build during segment creation(aka OS refresh), which leads to spikes in CPU and sometimes throttling due to high CPU(we saw this in various benchmarks). On the other time when segments are not getting created the CPU util stays very low. This leads uneven usage of CPU(basically a SAW tooth curve of CPU utilization is created) and users are not able to push more documents at Opensearch at steady state. Specifically for Streaming Ingestion use cases we can create graphs incrementally during ingestion(Lucene engine already implements this) this will ensure that we spread out the CPU utilization peak over a period of time and when OS refresh happens we already have graph created. Which will ensure that it is available for search for users hence reduce the Search after Indexing time.

References:

  1. Meta Issue: [META] [Build-Time] Improving Build time for Vector Indices #1599
  2. Community request: [FEATURE] Separate refresh interval for KNN segment merge #1577
@navneet1v
Copy link
Collaborator Author

With the changes that are done as part of #1938, and #1853 we will have the ground work to do the Incremental graph creation. Once these issues are resolved we can start working on this feature.

@navneet1v navneet1v added indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. Features Introduces a new unit of functionality that satisfies a requirement labels Aug 8, 2024
@navneet1v
Copy link
Collaborator Author

So for 1.i) the process will be

  1. user disable the graph creation for the index and index all the vectors
  2. user enables the graph creation for the index.
  3. User now run force merge to 1 segment to ensure that all KNN DS are created.
  4. Perform search.

The idea is with #2007, there will be speed up in 2, and overall there will be reduction in build time.

Having the capability to disable graph creation is extreme and will be used for cases where we need high speed indexing, index re-builds etc. On top of this feature next capability will be added is threshold based graph builds. This will ensure that this greed graph build based capability is used for more general use-cases with search also possible if graph not present.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Features Introduces a new unit of functionality that satisfies a requirement indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. Roadmap:Vector Database/GenAI Project-wide roadmap label v2.18.0
Projects
Status: 2.18.0
Development

No branches or pull requests

2 participants