Docs: Clustering in C++ API
Closes #296
ashvardanian committed Nov 6, 2023
1 parent 299aaf2 commit 3e13d40
Showing 2 changed files with 100 additions and 28 deletions.
110 changes: 89 additions & 21 deletions cpp/README.md
@@ -25,14 +25,20 @@ metric_punned_t metric(256, metric_kind_t::l2sq_k, scalar_kind_t::f32_k);
index_dense_t index = index_dense_t::make(metric);
float vec[3] = {0.1, 0.3, 0.2};

index.reserve(10);
index.add(/* key: */ 42, /* vector: */ {&vec[0], 3});
auto results = index.search(/* query: */ {&vec[0], 3}, 5 /* neighbors */);
index.reserve(10); // Pre-allocate memory for 10 vectors
index.add(42, &vec[0]); // Pass a key and a vector
auto results = index.search(&vec[0], 5); // Pass a query and limit number of results

for (std::size_t i = 0; i != results.size(); ++i)
results[i].element.key, results[i].element.vector, results[i].distance;
```
Here we:
- define a metric of kind [`metric_kind_t::l2sq_k`](https://unum-cloud.github.io/usearch/cpp/reference.html#_CPPv413metric_kind_t),
- to be applied to [`scalar_kind_t::f32_k`](https://unum-cloud.github.io/usearch/cpp/reference.html#_CPPv413scalar_kind_t) floating-point vectors,
- instantiate an [`index_dense_t`](https://unum-cloud.github.io/usearch/cpp/reference.html#_CPPv4I00EN4unum7usearch14index_dense_gtE) index.
The `add` function is thread-safe, so the index can be constructed concurrently.
It is also overloaded for different vector types, casting them under the hood.
The same applies to the `search`, `get`, `cluster`, and `distance_between` functions.
@@ -52,20 +58,78 @@ index.load("index.usearch"); // Copying from disk
index.view("index.usearch"); // Memory-mapping from disk
```

## User-Defined Metrics in C++
## Multi-Threading

For advanced users, more compile-time abstractions are available.
Most AI, HPC, or Big Data packages use some form of a thread pool.
Instead of spawning additional threads within USearch, we focus on the thread safety of the `add()` function, simplifying resource management.

```cpp
template <typename distance_at = default_distance_t, // `float`
typename key_at = default_key_t, // `int64_t`, `uuid_t`
typename compressed_slot_at = default_slot_t, // `uint32_t`, `uint40_t`
typename dynamic_allocator_at = std::allocator<byte_t>, //
typename tape_allocator_at = dynamic_allocator_at> //
class index_gt;
#pragma omp parallel for
for (std::size_t i = 0; i < n; ++i)
native.add(key, span_t{vector, dims});
```
During initialization, we allocate enough temporary memory for all the cores on the machine.
On each call, the caller can pass the identifier of the current thread, which makes the library easy to integrate with OpenMP and similar tools.
Moreover, you can use one of the provided "executors" to parallelize the search:
- `executor_openmp_t`, which uses OpenMP under the hood.
- `executor_stl_t`, which spawns `std::thread` instances.
- `dummy_executor_t`, which runs everything sequentially.
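
The idea of an executor that hands each task a thread index can be sketched without USearch at all. The snippet below is only an illustration under stated assumptions: `sequential_executor_t`, `threaded_executor_t`, their `fixed` method, and `sum_of_squares` are hypothetical names invented for this sketch, not the library's actual interface. It shows how a thread index lets every worker write into its own pre-allocated slot, so no locking is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sequential executor: every task runs on "thread" 0.
struct sequential_executor_t {
    std::size_t size() const { return 1; }

    template <typename callback_at>
    void fixed(std::size_t tasks, callback_at&& callback) {
        for (std::size_t task = 0; task != tasks; ++task)
            callback(0, task);
    }
};

// Hypothetical `std::thread`-based executor.
// Tasks are split into contiguous chunks, one chunk per thread.
struct threaded_executor_t {
    std::size_t threads_count;

    explicit threaded_executor_t(std::size_t threads) : threads_count(threads) { assert(threads > 0); }
    std::size_t size() const { return threads_count; }

    template <typename callback_at>
    void fixed(std::size_t tasks, callback_at&& callback) {
        std::size_t per_thread = (tasks + threads_count - 1) / threads_count;
        std::vector<std::thread> workers;
        for (std::size_t thread = 0; thread != threads_count; ++thread) {
            std::size_t begin = std::min(thread * per_thread, tasks);
            std::size_t end = std::min(begin + per_thread, tasks);
            workers.emplace_back([=, &callback] {
                for (std::size_t task = begin; task != end; ++task)
                    callback(thread, task);
            });
        }
        for (std::thread& worker : workers)
            worker.join();
    }
};

// Sums the squares of `values`. Each thread accumulates into its own slot,
// the same way a thread index can select pre-allocated scratch memory.
template <typename executor_at>
double sum_of_squares(std::vector<double> const& values, executor_at&& executor) {
    std::vector<double> partial_sums(executor.size(), 0.0);
    executor.fixed(values.size(), [&](std::size_t thread, std::size_t task) {
        partial_sums[thread] += values[task] * values[task];
    });
    double total = 0;
    for (double partial : partial_sums)
        total += partial;
    return total;
}
```

Calling `sum_of_squares(values, threaded_executor_t{4})` and `sum_of_squares(values, sequential_executor_t{})` should give the same result; only the scheduling differs.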
## Clustering
Aside from basic Create-Read-Update-Delete (CRUD) operations and search, USearch also supports clustering.
Once the index is constructed, you can either:
- Identify the cluster to which an external vector belongs, once it is mapped onto the index.
- Split the entire index into a set of clusters, each with its own centroid.
For the first, the interface accepts a vector and a "clustering level", which is essentially the index of the HNSW graph layer in which to search.
If you pass zero, the traversal will happen in every level except the bottom one.
Otherwise, the search will be limited to the specified level.
```cpp
some_scalar_t vector[3] = {0.1, 0.3, 0.2};
cluster_result_t result = index.cluster(&vector[0], index.max_level() / 2);
match_t cluster = result.cluster;
member_cref_t member = cluster.member;
distance_t distance = cluster.distance;
```

The following distances are pre-packaged:
If you wish to split the whole structure into clusters, you must provide an iterator over a range of vectors, which will be processed in parallel using the previously described function.
Unlike with the previous function, you don't have to specify the level manually: the algorithm picks one for you, depending on the number of clusters you want.
In addition to that auto-tuning, this function regroups clusters that are too small and returns the final number of clusters.

```cpp
std::size_t queries_count = queries_end - queries_begin;
index_dense_clustering_config_t config;
config.min_clusters = 1000;
config.max_clusters = 2000;
config.mode = index_dense_clustering_config_t::merge_smallest_k;

// Outputs, one entry per query:
std::vector<vector_key_t> cluster_centroids_keys(queries_count);
std::vector<distance_t> distances_to_cluster_centroids(queries_count);
executor_default_t thread_pool;
dummy_progress_t progress_bar;

clustering_result_t result = index.cluster(
queries_begin, queries_end,
config,
cluster_centroids_keys.data(), distances_to_cluster_centroids.data(),
thread_pool, progress_bar);
```

This approach requires a basic understanding of template meta-programming to implement the `queries_begin` and `queries_end` smart-iterators.
On the bright side, it lets you iteratively drill down into a specific cluster.

As in many other bulk-processing APIs, the `executor` and `progress` arguments are optional.
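
As a rough illustration of what such a smart-iterator might look like, here is a minimal sketch that walks the rows of a dense row-major matrix. The name `rows_iterator_gt` is hypothetical, and the assumption that dereferencing should yield a pointer to the first scalar of a vector is ours; the exact requirements of `cluster` should be checked against the header.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical iterator over the rows of a dense row-major matrix.
// Dereferencing yields a pointer to the first scalar of the current row.
template <typename scalar_at>
class rows_iterator_gt {
    scalar_at const* row_ = nullptr;
    std::size_t dimensions_ = 0;

    std::ptrdiff_t stride() const { return static_cast<std::ptrdiff_t>(dimensions_); }

  public:
    using value_type = scalar_at const*;
    using difference_type = std::ptrdiff_t;

    rows_iterator_gt() = default;
    rows_iterator_gt(scalar_at const* row, std::size_t dimensions) : row_(row), dimensions_(dimensions) {
        assert(dimensions > 0);
    }

    // Access the current row, or the row `i` positions ahead.
    value_type operator*() const { return row_; }
    value_type operator[](difference_type i) const { return row_ + i * stride(); }

    // Move forward by one row, or produce an iterator `n` rows ahead.
    rows_iterator_gt& operator++() {
        row_ += dimensions_;
        return *this;
    }
    rows_iterator_gt operator+(difference_type n) const { return {row_ + n * stride(), dimensions_}; }

    // Number of rows between two iterators over the same matrix.
    difference_type operator-(rows_iterator_gt const& other) const { return (row_ - other.row_) / stride(); }

    bool operator==(rows_iterator_gt const& other) const { return row_ == other.row_; }
    bool operator!=(rows_iterator_gt const& other) const { return row_ != other.row_; }
};
```

With this in place, `queries_end - queries_begin` gives the number of queries, matching how `queries_count` is computed in the snippet above.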

## User-Defined Metrics

In its high-level interface, USearch supports a variety of metrics, including the most popular ones:

- `metric_cos_gt<scalar_t>` for "Cosine" or "Angular" distance.
- `metric_ip_gt<scalar_t>` for "Inner Product" or "Dot Product" distance.
@@ -78,16 +142,20 @@ The following distances are pre-packaged:
- `metric_haversine_gt<scalar_t>` for "Haversine" or "Great Circle" distance between coordinates used in GIS applications.
- `metric_divergence_gt<scalar_t>` for the "Jensen Shannon" similarity between probability distributions.

## Multi-Threading
In practice, for the most common scalar types, one of the [SimSIMD](https://github.com/ashvardanian/SimSIMD) backends will be used, providing hardware acceleration on most common CPUs.

Most AI, HPC, or Big Data packages use some form of a thread pool.
Instead of spawning additional threads within USearch, we focus on the thread safety of `add()` function, simplifying resource management.
If you need a different metric, you can implement it yourself and wrap it into a `metric_punned_t`, which is our alternative to `std::function`.
Unlike `std::function`, it is a trivial type, which is important for performance.
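
To make the "trivial type" point concrete, here is a minimal sketch of a type-erased metric built from a plain function pointer. It does not use `metric_punned_t` itself, whose constructor is not shown here; `punned_metric_sketch_t` and `manhattan_f32` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical type-erased metric: a function pointer plus the dimension count.
// With no heap allocation and no virtual calls, it can be copied like a plain struct.
struct punned_metric_sketch_t {
    using function_t = float (*)(void const*, void const*, std::size_t);
    function_t function;
    std::size_t dimensions;

    float operator()(void const* a, void const* b) const {
        assert(function != nullptr);
        return function(a, b, dimensions);
    }
};

static_assert(std::is_trivial<punned_metric_sketch_t>::value, "expected a trivial type");
static_assert(std::is_trivially_copyable<punned_metric_sketch_t>::value, "expected a trivially copyable type");

// A custom Manhattan (L1) distance over `float` vectors behind type-punned pointers.
inline float manhattan_f32(void const* a_punned, void const* b_punned, std::size_t dimensions) {
    float const* a = static_cast<float const*>(a_punned);
    float const* b = static_cast<float const*>(b_punned);
    float sum = 0;
    for (std::size_t i = 0; i != dimensions; ++i)
        sum += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    return sum;
}
```

A wrapper like this is constructed with aggregate initialization, for example `punned_metric_sketch_t metric{&manhattan_f32, 3};`, and then called as `metric(a, b)`.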

## Advanced Interface

If you are proficient in C++ and ready to get your hands dirty, you can use the low-level interface.

```cpp
#pragma omp parallel for
for (std::size_t i = 0; i < n; ++i)
native.add(key, span_t{vector, dims});
template <typename distance_at = default_distance_t, // `float`
typename key_at = default_key_t, // `int64_t`, `uuid_t`
typename compressed_slot_at = default_slot_t, // `uint32_t`, `uint40_t`
typename dynamic_allocator_at = std::allocator<byte_t>, //
typename tape_allocator_at = dynamic_allocator_at> //
class index_gt;
```
During initialization, we allocate enough temporary memory for all the cores on the machine.
On the call, the user can supply the identifier of the current thread, making this library easy to integrate with OpenMP and similar tools.
18 changes: 11 additions & 7 deletions include/usearch/index_dense.hpp
@@ -323,7 +323,7 @@ class index_dense_gt {
/// @brief Schema: input buffer, bytes in input buffer, output buffer.
using cast_t = std::function<bool(byte_t const*, std::size_t, byte_t*)>;
/// @brief Punned index.
using index_t = index_gt< //
using index_t = index_gt< //
distance_t, vector_key_t, compressed_slot_t, //
dynamic_allocator_t, tape_allocator_t>;
using index_allocator_t = aligned_allocator_gt<index_t, 64>;
@@ -1512,10 +1512,14 @@ class index_dense_gt {
* @brief Implements clustering, classifying the given objects (vectors of member keys)
* into a given number of clusters.
*
* @param[in] queries_begin Iterator targeting the fiest query.
* @param[in] queries_end
* @param[in] queries_begin Iterator pointing to the first query.
* @param[in] queries_end Iterator pointing to the last query.
* @param[in] executor Thread-pool to execute the job in parallel.
* @param[in] progress Callback to report the execution progress.
* @param[in] config Configuration parameters for clustering.
*
* @param[out] cluster_keys Pointer to the array where the cluster keys will be exported.
* @param[out] cluster_distances Pointer to the array where the distances to those centroids will be exported.
*/
template < //
typename queries_iterator_at, //
@@ -1526,7 +1530,7 @@ class index_dense_gt {
queries_iterator_at queries_begin, //
queries_iterator_at queries_end, //
index_dense_clustering_config_t config, //
vector_key_t* cluster_keys, //
vector_key_t* cluster_keys, //
distance_t* cluster_distances, //
executor_at&& executor = executor_at{}, //
progress_at&& progress = progress_at{}) {
@@ -1715,7 +1719,7 @@ class index_dense_gt {
}

template <typename scalar_at>
add_result_t add_( //
add_result_t add_( //
vector_key_t key, scalar_at const* vector, //
std::size_t thread, bool force_vector_copy, cast_t const& cast) {

@@ -1811,8 +1815,8 @@ class index_dense_gt {
}

template <typename scalar_at>
aggregated_distances_t distance_between_( //
vector_key_t key, scalar_at const* vector, //
aggregated_distances_t distance_between_( //
vector_key_t key, scalar_at const* vector, //
std::size_t thread, cast_t const& cast) const {

// Cast the vector, if needed for compatibility with `metric_`