Is there a way to build an index while keeping it on disk (GraphIndexBuilder + OnDiskGraphIndex ?) #125

Closed
eolivelli opened this issue Oct 13, 2023 · 4 comments

Comments

@eolivelli

I am writing a POC to integrate JVector into HerdDB.

This is my work, for reference: diennea/herddb#814

This issue is to ask whether there is a good way to have a GraphIndexBuilder backed by an OnDiskGraphIndex.
In HerdDB the index is always "open for writes", and it seems that GraphIndexBuilder currently keeps everything on the heap.

My current plan is to "flush" the index to disk periodically (during a checkpoint), but that doesn't seem efficient, and it will cause unwanted behaviour in the service (big writes to disk). Usually a checkpoint in HerdDB just flushes a bunch of metadata with the list of "active pages".

@eolivelli eolivelli changed the title Is there a way to build and index while keeping it on disk (GraphIndexBuilder + OnDiskGraphIndex ?) Is there a way to build an index while keeping it on disk (GraphIndexBuilder + OnDiskGraphIndex ?) Oct 13, 2023
@jbellis
Owner

jbellis commented Oct 13, 2023

Mostly, yes. This PR adds save() and load() methods to OnHeapGraphIndex so that you can checkpoint to disk but also continue modifying it: #117
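
For readers following along, here is a minimal sketch of the checkpoint-and-keep-writing pattern described above. It does not use JVector: `ToyHeapGraph`, its `save`/`load` methods, the file layout, and `checkpoint` are all hypothetical stand-ins, and the real signatures in #117 may differ.

```java
import java.io.BufferedOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for an on-heap graph index; NOT a JVector class.
class ToyHeapGraph {
    // node id -> neighbor ids, all held in memory
    private final Map<Integer, List<Integer>> neighbors = new TreeMap<>();

    void addNode(int id, int... edges) {
        List<Integer> list = new ArrayList<>();
        for (int e : edges) list.add(e);
        neighbors.put(id, list);
    }

    int size() { return neighbors.size(); }

    List<Integer> neighborsOf(int id) { return neighbors.get(id); }

    // Assumed layout: node count, then (id, degree, neighbor ids...) per node.
    void save(DataOutput out) throws IOException {
        out.writeInt(neighbors.size());
        for (Map.Entry<Integer, List<Integer>> e : neighbors.entrySet()) {
            out.writeInt(e.getKey());
            out.writeInt(e.getValue().size());
            for (int n : e.getValue()) out.writeInt(n);
        }
    }

    // Rebuilds a mutable in-memory graph from the layout written by save().
    static ToyHeapGraph load(DataInput in) throws IOException {
        ToyHeapGraph g = new ToyHeapGraph();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            int id = in.readInt();
            int degree = in.readInt();
            int[] edges = new int[degree];
            for (int j = 0; j < degree; j++) edges[j] = in.readInt();
            g.addNode(id, edges);
        }
        return g;
    }

    // Writes the whole graph to a temporary file, then renames it over the target.
    void checkpoint(Path target) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(tmp)))) {
            save(out);
        }
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Note that every checkpoint in this sketch rewrites the full graph, which is exactly the "big writes to disk" concern raised in the original question; the graph stays writable afterwards.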

@eolivelli
Author

Great, #117 also unblocks DELETEs (and UPDATEs).

@jbellis
Owner

jbellis commented Oct 13, 2023

Technically yes, although updates are still expensive, since you have to call cleanup() before reusing the node id, which is O(N). Better to use a new id if possible.
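
To illustrate the cost difference mentioned here, the following is a toy model, not JVector's implementation. The class `ToyGraphWithDeletes` and all of its methods are hypothetical; it only shows why marking a node deleted and allocating a fresh id is cheap, while a cleanup pass that makes an id safe to reuse has to touch every remaining node.

```java
import java.util.BitSet;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only; NOT how JVector stores or cleans up its graph.
class ToyGraphWithDeletes {
    private final Map<Integer, Set<Integer>> neighbors = new HashMap<>();
    private final BitSet deleted = new BitSet();
    private int nextId = 0;

    // Always hands out a fresh id, so no scan of existing nodes is needed.
    int addNode(Collection<Integer> edges) {
        int id = nextId++;
        neighbors.put(id, new HashSet<>(edges));
        return id;
    }

    // O(1): only sets a tombstone; edges pointing at the node stay in place.
    void markDeleted(int id) {
        deleted.set(id);
    }

    // Update = tombstone the old node and insert the new value under a new id.
    int replace(int oldId, Collection<Integer> edges) {
        markDeleted(oldId);
        return addNode(edges);
    }

    // O(N): drops tombstoned nodes, then visits every remaining node to
    // remove edges that point at them. Returns the number of nodes visited.
    int cleanup() {
        neighbors.keySet().removeIf(deleted::get);
        int visited = 0;
        for (Set<Integer> edges : neighbors.values()) {
            edges.removeIf(deleted::get);
            visited++;
        }
        deleted.clear();
        return visited;
    }

    boolean hasEdge(int from, int to) {
        Set<Integer> edges = neighbors.get(from);
        return edges != null && edges.contains(to);
    }
}
```

In this model an old id only becomes safe to hand out again after `cleanup()` has removed the stale edges, which is why a fresh id avoids the scan.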

@eolivelli
Author

Sorry, I wasn't clear: by UPDATE I meant updating the value of the vector in a database row. In that case I would unregister the previous value and create a new node id with the new vector.

I have another problem (one that deserves its own GH issue): linking the "node id" to the physical id of the row in the DB (actually the primary key of the record). For now I am going to use a separate struct to keep track of this link.
It would be great to have a "metadata" (byte array) field attached to the "node" and have the GraphSearcher return it (together with the node id).
I will open a new discussion for this.
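
As a rough sketch of the "separate struct" approach mentioned above: the class below is a hypothetical side table kept next to the vector index, not part of JVector or HerdDB. The name `NodeIdToKeyMap` and its methods are assumptions made for illustration, and the primary key is treated as an opaque byte array.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical side table linking graph node ids to row primary keys.
class NodeIdToKeyMap {
    private final Map<Integer, byte[]> keyByNode = new HashMap<>();
    // ByteBuffer gives content-based equals/hashCode for byte[] keys.
    private final Map<ByteBuffer, Integer> nodeByKey = new HashMap<>();
    private int nextNodeId = 0;

    // Links a row to a brand-new node id. If the row was already linked,
    // the old link is dropped first (an UPDATE of the row's vector).
    int register(byte[] primaryKey) {
        unregister(primaryKey);
        int nodeId = nextNodeId++;
        byte[] copy = primaryKey.clone(); // defensive copy against later mutation
        keyByNode.put(nodeId, copy);
        nodeByKey.put(ByteBuffer.wrap(copy), nodeId);
        return nodeId;
    }

    // Drops the link for a row and returns the node id it used, if any,
    // so the caller can mark that node as deleted in the index.
    OptionalInt unregister(byte[] primaryKey) {
        Integer old = nodeByKey.remove(ByteBuffer.wrap(primaryKey));
        if (old == null) return OptionalInt.empty();
        keyByNode.remove(old);
        return OptionalInt.of(old);
    }

    // Translates node ids returned by a search into primary keys,
    // skipping ids that no longer belong to a live row.
    List<byte[]> resolve(int... nodeIds) {
        List<byte[]> keys = new ArrayList<>();
        for (int id : nodeIds) {
            byte[] key = keyByNode.get(id);
            if (key != null) keys.add(key);
        }
        return keys;
    }
}
```

A per-node metadata field returned by the searcher, as requested above, would make the `resolve` step unnecessary; this table is the workaround in the meantime.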

@jbellis jbellis closed this as completed Oct 13, 2023