Use JVector to index Vectors of floats - POC #814

Draft
wants to merge 3 commits into
base: master


eolivelli
Contributor

@eolivelli eolivelli commented Oct 13, 2023

This is a POC that uses JVector to build an index over vectors of floats.

JVector is the most advanced library to build indexes over this data type and it will be used in Cassandra 5.0.

Please note that when using the index you won't be doing a full table scan; on the other hand, the results will be an "approximation". That is fine for most use cases, especially Vector Search for Generative AI.
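For comparison, this is roughly what the exact (non-approximate) answer costs: a full scan that scores every row against the query and keeps the best k. It is a minimal sketch in plain Java, not code from this PR and not the JVector API; the class name `ExactVectorScan`, the choice of cosine similarity, and the in-memory `List<float[]>` standing in for the table are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper for illustration only (not part of HerdDB or JVector).
final class ExactVectorScan {

    /** Cosine similarity between two vectors of the same length. */
    static float cosine(float[] a, float[] b) {
        float dot = 0f, normA = 0f, normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }

    /** Returns the positions of the k rows most similar to the query, best first. */
    static List<Integer> topK(List<float[]> rows, float[] query, int k) {
        float[] scores = new float[rows.size()];
        // Min-heap on similarity: the head is the worst candidate kept so far.
        PriorityQueue<Integer> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Integer i) -> scores[i]));
        for (int i = 0; i < rows.size(); i++) {
            scores[i] = cosine(rows.get(i), query); // every row is visited: O(n) per query
            heap.add(i);
            if (heap.size() > k) {
                heap.poll(); // drop the current worst candidate
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort((x, y) -> Float.compare(scores[y], scores[x])); // best first
        return result;
    }
}
```

Calling `ExactVectorScan.topK(rows, query, 10)` always returns the true 10 nearest rows, but it touches every vector. A graph index visits only a small part of the data set per query, which is where both the speedup and the approximation come from.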

This is currently a POC.

Easy things to implement:

  • Integrate with the DDL (we need more space in the index metadata for all the side parameters of the index)
  • Integrate with the Planner (detect ORDER BY .... and decide to use the index)

Hard things:

  • Find a way to avoid keeping the whole JVector index in memory
  • Implement persistent data storage
  • Implement checkpointing
  • Implement a mapping from the "nodeId" (integer) to the primary key (byte array)
  • Implement DELETE (not supported yet in JVector)

The main issue is that, when the index is open for writing, it seems to be always fully stored in memory, and we can flush it to disk periodically.

I cannot find a good way to avoid flushing the whole index to disk; the only option I can see with the current version of JVector is to flush the index during a checkpoint.
I guess that in Cassandra this is not a problem, because they flush the index when the SSTable is flushed to disk, and then it becomes immutable.
In HerdDB we have long-lived, table-wide indexes and the paging mechanism is handled in another way: we still have immutable pages once they are flushed to disk, we have pages for indexes, and index pages are flushed next to the data pages.
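To make the "flush at checkpoint, then immutable" idea concrete, here is a minimal sketch of a write buffer that accumulates vectors in memory and turns them into one immutable page at each checkpoint. It only serializes the raw vectors, not a JVector graph, and the class `VectorWriteBuffer`, its methods, and the page layout (a count, then length-prefixed floats) are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; this is not HerdDB's page format or JVector's on-disk format.
final class VectorWriteBuffer {
    private final List<float[]> pending = new ArrayList<>();

    /** Buffers a vector in memory until the next checkpoint. */
    void add(float[] vector) {
        pending.add(vector.clone());
    }

    int pendingCount() {
        return pending.size();
    }

    /** Serializes all pending vectors into one page and empties the buffer. */
    byte[] checkpoint() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(pending.size());
            for (float[] vector : pending) {
                out.writeInt(vector.length);
                for (float value : vector) {
                    out.writeFloat(value);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        pending.clear();
        return bytes.toByteArray(); // treated as immutable from here on
    }

    /** Reads back the vectors stored in a page written by checkpoint(). */
    static List<float[]> readPage(byte[] page) {
        List<float[]> vectors = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(page))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                float[] vector = new float[in.readInt()];
                for (int j = 0; j < vector.length; j++) {
                    vector[j] = in.readFloat();
                }
                vectors.add(vector);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return vectors;
    }
}
```

Memory use is bounded by the amount written between two checkpoints, but a query then has to consult every flushed page plus the pending buffer, which is the coordination cost a long-lived table-wide index would have to pay.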

We will have to be creative or work with JVector folks to have more support there.

It is also awkward that we need to store the mapping between a "nodeId" and the PK of the record outside the JVector data set. Currently we can do it with the usual BLink, as we do for the PK (the PK index stores a mapping bytes -> long), but if we could store the PK inside JVector we would save some coordination (and very likely also disk accesses).

To make clear that you license your contribution under
the Apache License Version 2.0, January 2004
you have to acknowledge this by using the following check-box.

@eolivelli
Contributor Author

This is the PR that adds JVector to Cassandra:
https://github.com/apache/cassandra/pull/2673/files
