Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add: Add index_gt::merge() #572

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Add: Add index_gt::merge() #572

wants to merge 1 commit into from

Conversation

kou
Copy link

@kou kou commented Mar 4, 2025

#84

This focus only on index_gt. index_dense_gt is out of scope of this PR. If we get consensus of implementation approach, we'll be able to implement index_dense_gt::merge() too.

This adds mutable memory_mapped_file_t and you can create a mutable memory-mapped index with it. You can merge multiple indexes to the mutable memory-mapped index without allocating all data on RAM.

index_gt::merge() does just the followings:

  1. Reserve enough size
  2. Add all values in the source index to destination index

The important changes in this PR are the followings:

  • Mutable memory_mapped_file_t
  • Mutable index_gt with memory_mapped_file_t

For the former: memory_mapped_file_t::is_writable and memory_mapped_file_t::reserve() are added. memory_mapped_file_t::open_if_not() opens a file with write mode when is_writable.

For the latter: There are some change in the following methods:

  • index_gt::reset(): Write header and levels before the memory-mapped file is closd
  • index_gt::reserve(): Extend the memory-mapped file and adjust nodes_ after it
  • index_gt::node_malloc_(): Allocate a memory from the memory-mapped file

This just focus on index_gt. index_dense_gt is out of scope of this
PR. If we get consensus of implementation approach, we'll be able to
implement index_dense_gt::merge() too.

This adds mutable `memory_mapped_file_t` and you can create a mutable
memory-mapped index with it. You can merge multiple indexes to the
mutable memory-mapped index without allocating all data on RAM.
@@ -3272,7 +3449,6 @@ class index_gt {

// We are loading an empty index, no more work to do
if (!header.size) {
reset();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not related to this PR but I add this PR because I need similar change in view(). If this should be a separated PR, I'll open a separated PR.

We don't need to call reset() here because we don't change anything after the above reset().

Comment on lines -3423 to +3619
serialization_result_t stream_result = load_from_stream(
[&](void* buffer, std::size_t length) {
if (offset + length > file.size())
return false;
std::memcpy(buffer, file.data() + offset, length);
offset += length;
return true;
},
std::forward<progress_at>(progress));

return stream_result;
is_mutable_ = true;
return {};
} else {
serialization_result_t io_result = file.open_if_not();
if (!io_result)
return io_result;

serialization_result_t stream_result = load_from_stream(
[&](void* buffer, std::size_t length) {
if (offset + length > file.size())
return false;
std::memcpy(buffer, file.data() + offset, length);
offset += length;
return true;
},
std::forward<progress_at>(progress));
return stream_result;
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation is only changed because this is moved to else.

if (!header.size) {
reset();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to call reset() because we don't change anything after the above reset().

@kou kou mentioned this pull request Mar 4, 2025
3 tasks
// Add all values in `index` to this index.
add_result_t merge_result;
for (const auto& member : index) {
auto& value = get_value(member);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to filter the target values as suggested in #84 (comment) , we can do it here.

@ashvardanian
Copy link
Contributor

Thank you, @kou! That's a good start! I think our key goal for the "merge" operation is to reduce the asymptotic complexity from ~$O(N \log N)$ to something closer to $O(N)$.

@kou
Copy link
Author

kou commented Mar 5, 2025

OK. Should we work on the key goal in this PR? Or can we work on it as a follow-up task?

For $O(N)$, we may need to merge two HNSWs directly instead of adding values in an index_gt to another. I'm not sure whether we can keep efficient HNSW by it... Do you have any idea how to implement it or do you know any prior work? I haven't read carefully but I found an open issue in apache/lucene: apache/lucene#12440

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants