
Narek/external index storage #335

Open: wants to merge 80 commits into base: main-dev

Conversation

Ngalstyan4 (Contributor)

Hi Ashot!

tl;dr:

This is an attempt to add external storage support to USearch, to help with our upgrades at Lantern!
It allows swapping the storage format (usearch-v2, usearch-v3, lantern-postgres) without touching the core index structures.

As far as I can tell, it has no runtime performance impact.

Would you be open to merging this kind of interface into upstream USearch, or should we maintain it outside?


We have been using a fork of USearch with this kind of external storage at Lantern for about half a year. This is an attempt to upstream it. We have chatted about this before, so some of what follows may be repetition, but I am including it for completeness.

Motivation

Currently, the core high-performance implementation of vector search is interwoven with the storage, serialization, and file-IO interfaces. This makes it harder to:

  1. Change the underlying storage.
  2. Change the serialization format (e.g. usearch's planned v2->v3 transition).
  3. Add storage-level features such as neighbor-list compression (motivation: when m in HNSW is large, neighbor lists become a significant portion of the index's memory footprint).

One might argue that (1) can be achieved by passing a custom allocator to index_gt or index_dense_gt. That approach has limitations and did not work for us, for two reasons:

  1. (Most important) Allocators tie the lifetime of the index memory to the lifetime of the index_gt object. In Lantern we deal with a persistent index: all changes are saved to Postgres data files and replicated if needed, so the index memory needs to outlive any USearch data structures.
  2. Existing allocator interfaces allow defining allocation logic at memory-type granularity (memory for vectors, memory for nodes, etc.). We needed a different kind of partitioning: memory for all components of node i, node i+1, and so on.

The storage interface proposed here helps us achieve the goals above.

Design

This PR adds a storage_at template parameter to usearch index types which implements:

  1. Node and vector allocation and reset.
  2. Access management for concurrent node and vector access.
  3. Index save/load from a stream.
  4. Viewing a memory-mapped index.
  5. Compile-time exhaustive API type-checking for storage providers.

The exact storage layout is opaque to the rest of usearch: all serialization/deserialization logic lives in storage_at, so new storage formats can be implemented without touching the rest of the code.
As an example, I implemented a new storage provider in std_storage.hpp that uses C++ standard library containers and stores nodes and vectors adjacent to each other when serializing to a file (similar to the usearch v1 format, but with padding added between the node tape and the vector tape so that view() does not result in unaligned memory accesses).

The Storage API

I designed the storage API around how the current usearch v2 storage works, and tried to minimize the number of changes in index.hpp and index_dense.hpp to hopefully make reviewing easier. I think the storage interface can be simplified and improved in many ways, especially after a usearch v3 format transition. I am open to changing the full API, so long as some kind of storage API remains.

NOTE: There is no new logic in this PR; most of it just factors storage-related interfaces and functions out into a separate header.

The storage API is defined at the beginning of storage.hpp and implemented by several storage backends.
index_gt and index_dense_gt were modified to use this storage API.
I added a helper type-enforcer macro that runs compile-time checks to make sure a provided type meets the interface requirements of a usearch storage provider.

Next?

This has some rough edges, most of which are listed below; I will update the list if more come up.
Before putting time into those, however, I wanted to check whether you would be open to merging this into mainline usearch. It would help us at Lantern a lot and would be a big step toward upstream-usearch compatibility for us.

We will likely start using a simplified version of this API in Lantern soon, so we can report back on how well it works for our case.

TODOs

  • Fix comments around view + view_internal + reset
  • Figure out whether (and how) the storage layer should maintain the number of vectors it stores
    • Needed for save/restore/reset
    • Hard with the set_at interface, which does not say whether an old slot was updated or a new slot was created
  • Implement swap and move for index_dense.hpp
  • Add tests around swapping/moving/copying indexes
  • Implement swap for index.hpp
  • Add node_copy to the storage API
  • (Maybe) Move config_, nodes_count_, etc. and other serialization state into storage_ as well
  • Implement compact in index.hpp
  • Add a memory_usage() interface to storage
  • Check that my choices of taking a reference vs. an r-value reference are correct
  • Save serialization_config in the stored file
  • Store punned type info in the index binary to prevent accidentally loading the index with the wrong code
  • Split precomputed_constants so storage-related constants live in the storage layer
  • (Maybe) Move slot_lookup into storage
  • (Maybe) Move or copy nodes_count_ into storage so clear() and reset() can have more intuitive implementations
  • Make all serialization interfaces take and use progress& (I copied the API from current usearch, and some APIs there do not take progress&)
  • Get rid of matrix_rows_ and matrix_cols_ in storage_v2

Ngalstyan4 left a comment:

Added some comments to hopefully help with the review process.

@@ -77,7 +77,7 @@ void test_cosine(index_at& index, std::vector<std::vector<scalar_at>> const& vec
expect((index.stats(0).nodes == 3));

// Check if clustering endpoint compiles
index.cluster(vector_first, 0, args...);
// index.cluster(vector_first, 0, args...);

Ngalstyan4:

A storage interface has not yet been added for this endpoint, so I removed the test.
I will add it back when it is implemented.

auto compaction_result = index.compact();
expect(bool(compaction_result));
// auto compaction_result = index.compact();
// expect(bool(compaction_result));

Ngalstyan4:

Same as above.

using key_t = std::int64_t;
{
using slot_t = std::uint32_t;
using storage_v2_t = storage_v2_at<key_t, slot_t>;

Ngalstyan4:

Runs the tests for the two storage provider APIs.

storage_v2_t is the current storage interface, rearranged into a separate API.
std_storage_t is an example storage provider that demonstrates use of the API.
It stores all data in std:: containers and serializes data to disk similarly to usearch v1.
It does not do error handling (it asserts on all errors).

Comment on lines +1608 to +1614
using level_t = std::int16_t;

struct precomputed_constants_t {
double inverse_log_connectivity{};
std::size_t neighbors_bytes{};
std::size_t neighbors_base_bytes{};
};

Ngalstyan4:

These were moved up from later in this file so I can refer to them from the node abstract type below.

precomputed_constants_t is used by the storage layer to figure out the sizes of node_t structs.
I think it would make sense to split this struct: move neighbors_* to storage.hpp and keep inverse_log_connectivity as part of index_gt.

* then the { `neighbors_count_t`, `compressed_slot_t`, `compressed_slot_t` ... } sequences
* for @b each-level.
*/
template <typename key_at, typename slot_at> class node_at {

Ngalstyan4:

Mostly the same as the node_t structure from before.

Below are all the changes:

  1. Moved static constexpr std::size_t node_head_bytes_() from a private member of index_gt to a public member here, called node_t::head_size_bytes().
  2. Moved it out of index_gt for global visibility, as it is now used from storage.hpp.
  3. Moved node_t-related functions such as node_bytes_ to be member functions here, so all node_t APIs are grouped together.
     NOTE: The only node-related APIs outside of node_t now are the neighbor iterator and retriever functions. I could move those here as well, but this diff was already becoming very large, so I postponed that for now.
  4. Moved precompute_ to be a static member here so it can use the template arguments of node_t. It used to be a private member of index_gt. As already noted, inverse_log_connectivity of precomputed_constants_t does not really belong here; happy to address that.


other.nodes_count_ = nodes_count_.load();
other.max_level_ = max_level_;
other.entry_slot_ = entry_slot_;

Ngalstyan4:

copy not implemented

*/
template <typename input_callback_at, typename progress_at = dummy_progress_t>
serialization_result_t load_from_stream(input_callback_at&& input, progress_at&& progress = {}) noexcept {

serialization_result_t result;

// Remove previously stored objects
reset();

Ngalstyan4:

This is done at a higher level of the API. We cannot do it here because the higher level in index_dense could have already loaded vectors into storage; calling reset on the inner index would call reset on storage and wipe out the newly loaded vectors.

This is somewhat bad and tricky, and I have not found a better design around it.
So far, I think this trickiness is fundamental to usearch v2 storage (separate vectors and nodes), where index_dense owns and takes care of vectors while index takes care of nodes.

I think this division requires that any storage which stores both have shared ownership between index_dense and index.

Assuming in usearch v3 we move to a format that stores vectors and nodes together, this problem will go away.

Open to other suggestions in the meantime.

static_assert( //
sizeof(typename tape_allocator_traits_t::value_type) == 1, //
"Tape allocator must allocate separate addressable bytes");
using span_bytes_t = span_gt<byte_t>;

Ngalstyan4:

For the result of a call to node_bytes().


// Load metadata and choose the right metric
{
index_dense_head_buffer_t buffer;
if (!input(buffer, sizeof(buffer)))

Ngalstyan4:

storage_at::load_vectors_from_stream takes a generic buffer and reads bytes into it from the specified section of the storage buffer, per the storage format spec.

@@ -748,11 +750,10 @@ class index_dense_gt {
unique_lock_t lookup_lock(slot_lookup_mutex_);

std::unique_lock<std::mutex> free_lock(free_keys_mutex_);
// storage_ cleared by typed_ todo:: is this confusing?
typed_->clear();

Ngalstyan4:

storage_ is reset by typed_->reset().
Nothing bad would happen if I reset it here again, but this seemed clearer.

ashvardanian pushed a commit that referenced this pull request Jan 14, 2024
* Add move construction tests and fix an issue caused by them

* Only consider zero length IO an error if input buffer was larger than zero

* Move option-override policy opt-in before policy definitions so overrides actually take effect
SIMSIMD-, OPENMP-, and FP16-related cmake options are not properly propagated
to compiler header definitions when they are set to non-default values.

This commit fixes the compile definitions so those values are always
propagated properly.

E.g., by default, simsimd usage is turned off and, as we see in the
commands below, the correct default `#define`s (i.e.
`-DUSEARCH_USE_SIMSIMD=0`) are passed to the compiler:

cmake ..
make VERBOSE=1
> cd /home/ngalstyan/lantern/lantern/third_party/usearch/build/cpp &&
/usr/bin/c++ -DUSEARCH_USE_OPENMP=0 -DUSEARCH_USE_SIMSIMD=0
...
 -o CMakeFiles/bench_cpp.dir/bench.cpp.o -c .../bench.cpp

But, if we try to enable simsimd via cmake for benchmarking and shared C
libraries, we do not get the corresponding -DUSEARCH_USE_SIMSIMD=1
definition.

cmake .. -DUSEARCH_USE_SIMSIMD=1
make VERBOSE=1
cd /home/ngalstyan/lantern/lantern/third_party/usearch/build/cpp &&
/usr/bin/c++ -DUSEARCH_USE_OPENMP=0
...
-o CMakeFiles/bench_cpp.dir/bench.cpp.o -c .../bench.cpp

Note that no definition for `USEARCH_USE_SIMSIMD` was passed to the
compiler.
Internally, the lack of a simsimd config definition is treated as
-DUSEARCH_USE_SIMSIMD=0. (see [1_simsimd_logic_in_plugins])

When compiling with this commit, we see that we can successfully
enable simsimd via the cmake option:
cmake .. -DUSEARCH_USE_SIMSIMD=1
make VERBOSE=1
cd /home/ngalstyan/lantern/lantern/third_party/usearch/build/cpp &&
/usr/bin/c++ -DUSEARCH_USE_FP16LIB=1 -DUSEARCH_USE_OPENMP=0
-DUSEARCH_USE_SIMSIMD=1
-o CMakeFiles/bench_cpp.dir/bench.cpp.o -c .../bench.cpp

[1_simsimd_logic_in_plugins]:
https://github.com/unum-cloud/usearch/blob/4747ef42f4140a1fde16118f25f079f9af79649e/include/usearch/index_plugins.hpp#L43-L45
Copied the logic from simsimd. Alternatively, the whole block could
be dropped to offload detection to simsimd.
index_plugins configures simsimd, and if simsimd.h is included
before this configuration gets a chance to run during compilation,
simsimd.h may be misconfigured.

In particular, index_plugins propagates the USEARCH_FP16LIB cmake
option as !SIMSIMD_NATIVE_FP16 (see [1]), and if simsimd.h
is included before index_plugins, the wrong value of
SIMSIMD_NATIVE_FP16 may be chosen.

[1]:
https://github.com/unum-cloud/usearch/blob/ce54b814a8a10f4c0c32fee7aad9451231b63f75/include/usearch/index_plugins.hpp#L50
passing all functional tests, but there are memory leaks
Ngalstyan4 force-pushed the narek/external-index-storage branch from 03ced4c to ec3ed82 on January 30, 2024.
Ngalstyan4 added a commit to Ngalstyan4/usearch that referenced this pull request Feb 5, 2024
* Add move construction tests and fix an issue caused by them

* Only consider zero length IO an error if input buffer was larger than zero

* Move option-override policy opt-in before policy definitions so overrides actually take effect
ashvardanian pushed a commit that referenced this pull request Feb 22, 2024
# [2.9.0](v2.8.16...v2.9.0) (2024-02-22)

### Add

* SQLite binding ([222de55](222de55))
* String distances to SQLite ([ae4d0f0](ae4d0f0))

### Docs

* Header refreshed ([7465c29](7465c29))
* Py and SQLite extensions ([550624b](550624b))
* README.md link to Joins (#327) ([1279c54](1279c54)), closes [#327](#327)

### Fix

* bug reports were immediately marked invalid ([c5fc825](c5fc825))
* Error handling, mem safety bugs #335 (#339) ([4747ef4](4747ef4)), closes [#335](#335) [#339](#339)
* Passing SQLite tests ([6334983](6334983))
* Reported number of levels ([9b1a06a](9b1a06a))
* Skip non-Linux SQLite tests ([b02d262](b02d262))
* SQLite cosine function + tests ([55464fb](55464fb))
* undefined var error in `remove` api ([8d86a9e](8d86a9e))

### Improve

* Multi property lookup ([e8bf02c](e8bf02c))
* Support multi-column vectors ([66f1716](66f1716))

### Make

* `npi ci` (#330) ([5680920](5680920)), closes [#330](#330)
* Add 3.12 wheels ([d66f697](d66f697))
* Change include paths ([21db294](21db294))
* invalid C++17 Clang arg ([2a6d779](2a6d779))
* Link libpthread for older Linux GCC builds (#324) ([6f1e5dd](6f1e5dd)), closes [#324](#324)
* Parallel CI for Python wheels ([a9ad89e](a9ad89e))
* Upgrade SimSIMD & StringZilla ([5481bdf](5481bdf))

### Revert

* Postpone Apache Arrow integration ([5d040ca](5d040ca))