Docs: Clustering in C++ API

Closes #296
unum-cloud · Nov 6, 2023 · 3e13d40 · 3e13d40
1 parent 299aaf2
commit 3e13d40
Showing 2 changed files with 100 additions and 28 deletions.
diff --git a/cpp/README.md b/cpp/README.md
@@ -25,14 +25,20 @@ metric_punned_t metric(256, metric_kind_t::l2sq_k, scalar_kind_t::f32_k);
 index_dense_t index = index_dense_t::make(metric);
 float vec[3] = {0.1, 0.3, 0.2};
 
-index.reserve(10);
-index.add(/* key: */ 42, /* vector: */ {&vec[0], 3});
-auto results = index.search(/* query: */ {&vec[0], 3}, 5 /* neighbors */);
+index.reserve(10); // Pre-allocate memory for 10 vectors
+index.add(42, &vec[0]); // Pass a key and a vector
+auto results = index.search(&vec[0], 5); // Pass a query and limit number of results
 
 for (std::size_t i = 0; i != results.size(); ++i)
     results[i].element.key, results[i].element.vector, results[i].distance;
 ```
 
+Here we:
+
+- define a metric of kind [`metric_kind_t::l2sq_k`](https://unum-cloud.github.io/usearch/cpp/reference.html#_CPPv413metric_kind_t),
+- to be applied to [`scalar_kind_t::f32_k`](https://unum-cloud.github.io/usearch/cpp/reference.html#_CPPv413scalar_kind_t) floating-point vectors,
+- instantiate an [`index_dense_t`](https://unum-cloud.github.io/usearch/cpp/reference.html#_CPPv4I00EN4unum7usearch14index_dense_gtE) index.
+
 The `add` is thread-safe for concurrent index construction.
 It also has an overload for different vector types, casting them under the hood.
 The same applies to the `search`, `get`, `cluster`, and `distance_between` functions.
@@ -52,20 +58,78 @@ index.load("index.usearch"); // Copying from disk
 index.view("index.usearch"); // Memory-mapping from disk
 ```
 
-## User-Defined Metrics in C++
+## Multi-Threading
 
-For advanced users, more compile-time abstractions are available.
+Most AI, HPC, or Big Data packages use some form of a thread pool.
+Instead of spawning additional threads within USearch, we focus on the thread safety of the `add()` function, simplifying resource management.
 
 ```cpp
-template <typename distance_at = default_distance_t,              // `float`
-          typename key_at = default_key_t,                        // `int64_t`, `uuid_t`
-          typename compressed_slot_at = default_slot_t,           // `uint32_t`, `uint40_t`
-          typename dynamic_allocator_at = std::allocator<byte_t>, //
-          typename tape_allocator_at = dynamic_allocator_at>      //
-class index_gt;
+#pragma omp parallel for
+    for (std::size_t i = 0; i < n; ++i)
+        native.add(key, span_t{vector, dims});
+```
+
+During initialization, we allocate enough temporary memory for all the cores on the machine.
+On the call, the user can supply the identifier of the current thread, making this library easy to integrate with OpenMP and similar tools.
+
+Moreover, you can take advantage of one of the provided "executors" to parallelize the search:
+
+- `executor_openmp_t`, which uses OpenMP under the hood.
+- `executor_stl_t`, which spawns `std::thread` instances.
+- `dummy_executor_t`, which runs everything sequentially.
+
+## Clustering
+
+Aside from basic Create-Read-Update-Delete (CRUD) operations and search, USearch also supports clustering.
+Once the index is constructed, you can either:
+
+- Identify a cluster to which any external vector belongs, once mapped onto the index.
+- Split the entire index into a set of clusters, each with its own centroid.
+
+For the first, the interface accepts a vector and a "clustering level", which is essentially the index of the HNSW graph layer, in which to search.
+If you pass zero, the traversal will happen in every level except the bottom one.
+Otherwise, the search will be limited to the specified level.
+
+```cpp
+some_scalar_t vector[3] = {0.1, 0.3, 0.2};
+cluster_result_t result = index.cluster(&vector[0], index.max_level() / 2);
+match_t cluster = result.cluster;
+member_cref_t member = cluster.member;
+distance_t distance = cluster.distance;
 ```
 
-The following distances are pre-packaged:
+If you wish to split the whole structure into clusters, you must provide an iterator over a range of vectors, which will be processed in parallel using the previously described function.
+Unlike the previous function, you don't have to specify the level manually: the algorithm picks the best one for you, depending on the number of clusters you request.
+Aside from that auto-tuning, this function will regroup clusters that are too small and return the final number of clusters.
+
+```cpp
+std::size_t queries_count = queries_end - queries_begin;
+index_dense_clustering_config_t config;
+config.min_clusters = 1000;
+config.max_clusters = 2000;
+config.mode = index_dense_clustering_config_t::merge_smallest_k;
+
+// Outputs:
+vector_key_t cluster_centroids_keys[queries_count];
+distance_t distances_to_cluster_centroids[queries_count];
+executor_default_t thread_pool;
+dummy_progress_t progress_bar;
+
+clustering_result_t result = index.cluster(
+        queries_begin, queries_end,
+        config,
+        &cluster_centroids_keys[0], &distances_to_cluster_centroids[0],
+        thread_pool, progress_bar);
+```
+
+This approach requires a basic understanding of template meta-programming to implement the `queries_begin` and `queries_end` smart iterators.
+On the bright side, it allows you to iteratively drill down into a specific cluster.
+
+As in many other bulk-processing APIs, the `executor` and `progress` arguments are optional.
+
+## User-Defined Metrics
+
+In its high-level interface, USearch supports a variety of metrics, including the most popular ones:
 
 - `metric_cos_gt<scalar_t>` for "Cosine" or "Angular" distance.
 - `metric_ip_gt<scalar_t>` for "Inner Product" or "Dot Product" distance.
@@ -78,16 +142,20 @@ The following distances are pre-packaged:
 - `metric_haversine_gt<scalar_t>` for "Haversine" or "Great Circle" distance between coordinates used in GIS applications.
 - `metric_divergence_gt<scalar_t>` for the "Jensen Shannon" similarity between probability distributions.
 
-## Multi-Threading
+In reality, for the most common scalar types, one of the [SimSIMD](https://github.com/ashvardanian/SimSIMD) backends will be triggered, providing hardware acceleration on most common CPUs.
 
-Most AI, HPC, or Big Data packages use some form of a thread pool.
-Instead of spawning additional threads within USearch, we focus on the thread safety of `add()` function, simplifying resource management.
+If you need a different metric, you can implement it yourself and wrap it into a `metric_punned_t`, which is our alternative to `std::function`.
+Unlike `std::function`, it is a trivial type, which is important for performance.
+
+## Advanced Interface
+
+If you are proficient in C++ and ready to get your hands dirty, you can use the low-level interface.
 
 ```cpp
-#pragma omp parallel for
-    for (std::size_t i = 0; i < n; ++i)
-        native.add(key, span_t{vector, dims});
+template <typename distance_at = default_distance_t,              // `float`
+          typename key_at = default_key_t,                        // `int64_t`, `uuid_t`
+          typename compressed_slot_at = default_slot_t,           // `uint32_t`, `uint40_t`
+          typename dynamic_allocator_at = std::allocator<byte_t>, //
+          typename tape_allocator_at = dynamic_allocator_at>      //
+class index_gt;
 ```
-
-During initialization, we allocate enough temporary memory for all the cores on the machine.
-On the call, the user can supply the identifier of the current thread, making this library easy to integrate with OpenMP and similar tools.
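A note on the clustering outputs described in the README changes above: the bulk call fills two parallel arrays, one with the key of the centroid chosen for each query and one with the distance to it. The sketch below illustrates only that output layout with a brute-force scan and a squared Euclidean distance. It does not use USearch, does not traverse an HNSW graph, and every name in it (`toy_key_t`, `toy_centroid_t`, `toy_l2sq`, `toy_assign`) is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT USearch types.
using toy_key_t = std::uint64_t;

struct toy_centroid_t {
    toy_key_t key;             // Key of the index member acting as centroid
    std::vector<float> vector; // Its coordinates
};

// Squared Euclidean distance between two `dims`-dimensional vectors.
inline float toy_l2sq(float const* a, float const* b, std::size_t dims) {
    float sum = 0;
    for (std::size_t i = 0; i != dims; ++i) {
        float delta = a[i] - b[i];
        sum += delta * delta;
    }
    return sum;
}

// For every query, export the key of the closest centroid and the distance
// to it into two caller-provided arrays of `queries.size()` elements each.
// Assumes `centroids` is non-empty and all vectors have `dims` elements.
inline void toy_assign(std::vector<std::vector<float>> const& queries,
                       std::vector<toy_centroid_t> const& centroids,
                       std::size_t dims,
                       toy_key_t* cluster_keys,
                       float* cluster_distances) {
    for (std::size_t q = 0; q != queries.size(); ++q) {
        float best_distance = std::numeric_limits<float>::max();
        toy_key_t best_key = 0;
        for (toy_centroid_t const& centroid : centroids) {
            float distance = toy_l2sq(queries[q].data(), centroid.vector.data(), dims);
            if (distance < best_distance) {
                best_distance = distance;
                best_key = centroid.key;
            }
        }
        cluster_keys[q] = best_key;
        cluster_distances[q] = best_distance;
    }
}
```

Sizing the output buffers with `std::vector<toy_key_t> keys(queries.size())` and passing `keys.data()` avoids relying on variable-length arrays, which are a compiler extension in C++.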
diff --git a/include/usearch/index_dense.hpp b/include/usearch/index_dense.hpp
@@ -323,7 +323,7 @@ class index_dense_gt {
     /// @brief Schema: input buffer, bytes in input buffer, output buffer.
     using cast_t = std::function<bool(byte_t const*, std::size_t, byte_t*)>;
     /// @brief Punned index.
-    using index_t = index_gt<                 //
+    using index_t = index_gt<                        //
         distance_t, vector_key_t, compressed_slot_t, //
         dynamic_allocator_t, tape_allocator_t>;
     using index_allocator_t = aligned_allocator_gt<index_t, 64>;
@@ -1512,10 +1512,14 @@ class index_dense_gt {
      *  @brief  Implements clustering, classifying the given objects (vectors of member keys)
      *          into a given number of clusters.
      *
-     *  @param[in] queries_begin Iterator targeting the fiest query.
-     *  @param[in] queries_end
+     *  @param[in] queries_begin Iterator pointing to the first query.
+     *  @param[in] queries_end Iterator pointing past the last query.
      *  @param[in] executor Thread-pool to execute the job in parallel.
      *  @param[in] progress Callback to report the execution progress.
+     *  @param[in] config Configuration parameters for clustering.
+     *
+     *  @param[out] cluster_keys Pointer to the array where the keys of the cluster centroids will be exported.
+     *  @param[out] cluster_distances Pointer to the array where the distances to those centroids will be exported.
      */
     template <                                   //
         typename queries_iterator_at,            //
@@ -1526,7 +1530,7 @@ class index_dense_gt {
         queries_iterator_at queries_begin,      //
         queries_iterator_at queries_end,        //
         index_dense_clustering_config_t config, //
-        vector_key_t* cluster_keys,                    //
+        vector_key_t* cluster_keys,             //
         distance_t* cluster_distances,          //
         executor_at&& executor = executor_at{}, //
         progress_at&& progress = progress_at{}) {
@@ -1715,7 +1719,7 @@ class index_dense_gt {
     }
 
     template <typename scalar_at>
-    add_result_t add_(                      //
+    add_result_t add_(                             //
         vector_key_t key, scalar_at const* vector, //
         std::size_t thread, bool force_vector_copy, cast_t const& cast) {
 
@@ -1811,8 +1815,8 @@ class index_dense_gt {
     }
 
     template <typename scalar_at>
-    aggregated_distances_t distance_between_( //
-        vector_key_t key, scalar_at const* vector,   //
+    aggregated_distances_t distance_between_(      //
+        vector_key_t key, scalar_at const* vector, //
         std::size_t thread, cast_t const& cast) const {
 
         // Cast the vector, if needed for compatibility with `metric_`