diff --git a/CHANGELOG.md b/CHANGELOG.md index 1330990c6..7ce4a14c3 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,76 @@ +# cuvs 24.10.00 (9 Oct 2024) + +## šŸ› Bug Fixes + +- Use 64 bit types for dataset size calculation in CAGRA graph optimizer ([#380](https://github.com/rapidsai/cuvs/pull/380)) [@tfeher](https://github.com/tfeher) +- Remove EXPLICIT_INSTANTIATE_ONLY macros ([#358](https://github.com/rapidsai/cuvs/pull/358)) [@achirkin](https://github.com/achirkin) +- Fix order of operations for cosine IVF Flat ([#329](https://github.com/rapidsai/cuvs/pull/329)) [@lowener](https://github.com/lowener) +- Exclude any kernel symbol that uses cutlass ([#314](https://github.com/rapidsai/cuvs/pull/314)) [@benfred](https://github.com/benfred) +- [Fix] pin raft dependent to rapidsai ([#299](https://github.com/rapidsai/cuvs/pull/299)) [@rhdong](https://github.com/rhdong) +- Fix dataset dimension in IVF-PQ C wrappers ([#292](https://github.com/rapidsai/cuvs/pull/292)) [@tfeher](https://github.com/tfeher) +- Fix python ivf-pq for int8/uint8 dtypes ([#271](https://github.com/rapidsai/cuvs/pull/271)) [@benfred](https://github.com/benfred) +- FP16 API for CAGRA and IVF-PQ ([#264](https://github.com/rapidsai/cuvs/pull/264)) [@tfeher](https://github.com/tfeher) + +## šŸ“– Documentation + +- More doc updates for 24.10 ([#396](https://github.com/rapidsai/cuvs/pull/396)) [@cjnolet](https://github.com/cjnolet) +- fix 404 in documentation link in readme ([#395](https://github.com/rapidsai/cuvs/pull/395)) [@benfred](https://github.com/benfred) +- Improving getting started materials ([#342](https://github.com/rapidsai/cuvs/pull/342)) [@cjnolet](https://github.com/cjnolet) +- Fix broken examples link in README. ([#326](https://github.com/rapidsai/cuvs/pull/326)) [@bdice](https://github.com/bdice) +- Recommend `miniforge` for conda install ([#325](https://github.com/rapidsai/cuvs/pull/325)) [@bdice](https://github.com/bdice) + +## šŸš€ New Features + +- Port remaining scripts to `cuvs_bench` ([#368](https://github.com/rapidsai/cuvs/pull/368)) [@divyegala](https://github.com/divyegala) +- [Feat] Relative change with `bitset` API feature #2439 in raft ([#350](https://github.com/rapidsai/cuvs/pull/350)) [@rhdong](https://github.com/rhdong) +- cuvs_bench plotting functions ([#347](https://github.com/rapidsai/cuvs/pull/347)) [@dantegd](https://github.com/dantegd) +- CosineExpanded Metric for IVF-PQ (normalize inputs) ([#346](https://github.com/rapidsai/cuvs/pull/346)) [@tarang-jain](https://github.com/tarang-jain) +- Python API for CAGRA+HNSW ([#246](https://github.com/rapidsai/cuvs/pull/246)) [@divyegala](https://github.com/divyegala) +- C API for CAGRA+HNSW ([#240](https://github.com/rapidsai/cuvs/pull/240)) [@divyegala](https://github.com/divyegala) +- SNMG ANN ([#231](https://github.com/rapidsai/cuvs/pull/231)) [@viclafargue](https://github.com/viclafargue) +- [FEA] Support for half-float mixed precise in brute-force ([#225](https://github.com/rapidsai/cuvs/pull/225)) [@rhdong](https://github.com/rhdong) + +## šŸ› ļø Improvements + +- Remove cuvs-cagra-search from cuvs_static link ([#388](https://github.com/rapidsai/cuvs/pull/388)) [@benfred](https://github.com/benfred) +- Add a static library for cuvs ([#382](https://github.com/rapidsai/cuvs/pull/382)) [@benfred](https://github.com/benfred) +- Put the ann-bench large_workspace_resource in managed memory ([#372](https://github.com/rapidsai/cuvs/pull/372)) [@achirkin](https://github.com/achirkin) +- Add multigpu kmeans fit function 
([#348](https://github.com/rapidsai/cuvs/pull/348)) [@benfred](https://github.com/benfred) +- Update update-version.sh to use packaging lib ([#344](https://github.com/rapidsai/cuvs/pull/344)) [@AyodeAwe](https://github.com/AyodeAwe) +- remove NCCL pins in build and test environments ([#341](https://github.com/rapidsai/cuvs/pull/341)) [@jameslamb](https://github.com/jameslamb) +- Vamana/DiskANN index build ([#339](https://github.com/rapidsai/cuvs/pull/339)) [@bkarsin](https://github.com/bkarsin) +- Use CI workflow branch 'branch-24.10' again ([#331](https://github.com/rapidsai/cuvs/pull/331)) [@jameslamb](https://github.com/jameslamb) +- fix style checks on Python 3.12 ([#328](https://github.com/rapidsai/cuvs/pull/328)) [@jameslamb](https://github.com/jameslamb) +- Update flake8 to 7.1.1. ([#327](https://github.com/rapidsai/cuvs/pull/327)) [@bdice](https://github.com/bdice) +- Add function for calculating the mutual_reachability_graph ([#323](https://github.com/rapidsai/cuvs/pull/323)) [@benfred](https://github.com/benfred) +- Simplify libcuvs conda recipe. ([#322](https://github.com/rapidsai/cuvs/pull/322)) [@bdice](https://github.com/bdice) +- Refactor dependencies.yaml to use depends-on pattern. ([#321](https://github.com/rapidsai/cuvs/pull/321)) [@bdice](https://github.com/bdice) +- Update Python versions in cuvs_bench pyproject.toml. ([#318](https://github.com/rapidsai/cuvs/pull/318)) [@bdice](https://github.com/bdice) +- Brute force knn tile size heuristic ([#316](https://github.com/rapidsai/cuvs/pull/316)) [@mfoerste4](https://github.com/mfoerste4) +- Euclidean distance example ([#315](https://github.com/rapidsai/cuvs/pull/315)) [@abner-ma](https://github.com/abner-ma) +- Migrate trustworthiness and silhouette_score stats from RAFT ([#313](https://github.com/rapidsai/cuvs/pull/313)) [@benfred](https://github.com/benfred) +- Add support for Python 3.12 ([#312](https://github.com/rapidsai/cuvs/pull/312)) [@jameslamb](https://github.com/jameslamb) +- Add `managed` option for RMM Pool memory resource to C API ([#305](https://github.com/rapidsai/cuvs/pull/305)) [@ajit283](https://github.com/ajit283) +- Update rapidsai/pre-commit-hooks ([#303](https://github.com/rapidsai/cuvs/pull/303)) [@KyleFromNVIDIA](https://github.com/KyleFromNVIDIA) +- Expose search function with pre-filter for ANN ([#302](https://github.com/rapidsai/cuvs/pull/302)) [@lowener](https://github.com/lowener) +- Drop Python 3.9 support ([#301](https://github.com/rapidsai/cuvs/pull/301)) [@jameslamb](https://github.com/jameslamb) +- Use CUDA math wheels ([#298](https://github.com/rapidsai/cuvs/pull/298)) [@KyleFromNVIDIA](https://github.com/KyleFromNVIDIA) +- Remove NumPy <2 pin ([#297](https://github.com/rapidsai/cuvs/pull/297)) [@seberg](https://github.com/seberg) +- CAGRA - separable compilation for distance computation ([#296](https://github.com/rapidsai/cuvs/pull/296)) [@achirkin](https://github.com/achirkin) +- Updating example notebooks ([#294](https://github.com/rapidsai/cuvs/pull/294)) [@cjnolet](https://github.com/cjnolet) +- Add RMM Pool memory resource to C API ([#285](https://github.com/rapidsai/cuvs/pull/285)) [@ajit283](https://github.com/ajit283) +- Update pre-commit hooks ([#283](https://github.com/rapidsai/cuvs/pull/283)) [@KyleFromNVIDIA](https://github.com/KyleFromNVIDIA) +- Improve update-version.sh ([#282](https://github.com/rapidsai/cuvs/pull/282)) [@bdice](https://github.com/bdice) +- Use tool.scikit-build.cmake.version, set scikit-build-core minimum-version 
([#280](https://github.com/rapidsai/cuvs/pull/280)) [@jameslamb](https://github.com/jameslamb) +- Add cuvs_bench.run python code and build ([#279](https://github.com/rapidsai/cuvs/pull/279)) [@dantegd](https://github.com/dantegd) +- Add cuvs-bench to dependencies and conda environments ([#275](https://github.com/rapidsai/cuvs/pull/275)) [@dantegd](https://github.com/dantegd) +- Update pip devcontainers to UCX v1.17.0 ([#262](https://github.com/rapidsai/cuvs/pull/262)) [@jameslamb](https://github.com/jameslamb) +- Adding example for tuning build and search params using Optuna ([#257](https://github.com/rapidsai/cuvs/pull/257)) [@dpadmanabhan03](https://github.com/dpadmanabhan03) +- Fixed link to build docs and corrected ivf_flat_example ([#255](https://github.com/rapidsai/cuvs/pull/255)) [@mmccarty](https://github.com/mmccarty) +- Merge branch-24.08 into branch-24.10 ([#254](https://github.com/rapidsai/cuvs/pull/254)) [@jameslamb](https://github.com/jameslamb) +- Persistent CAGRA kernel ([#215](https://github.com/rapidsai/cuvs/pull/215)) [@achirkin](https://github.com/achirkin) +- [FEA] Support for Cosine distance in IVF-Flat ([#179](https://github.com/rapidsai/cuvs/pull/179)) [@lowener](https://github.com/lowener) + # cuvs 24.08.00 (7 Aug 2024) ## šŸšØ Breaking Changes diff --git a/README.md b/README.md index e23b94616..c1b74a9e8 100755 --- a/README.md +++ b/README.md @@ -1,11 +1,7 @@ #
 cuVS: Vector Search and Clustering on the GPU
> [!note] -> cuVS is a new library mostly derived from the approximate nearest neighbors and clustering algorithms in the [RAPIDS RAFT](https://github.com/rapidsai/raft) library of data mining primitives. RAPIDS RAFT currently contains the most fully-featured versions of the approximate nearest neighbors and clustering algorithms in cuVS. We are in the process of migrating the algorithms from RAFT to cuVS, but if you are unsure of which to use, please consider the following: -> 1. RAFT contains C++ and Python APIs for all of the approximate nearest neighbors and clustering algorithms. -> 2. cuVS contains a growing support for different languages, including C, C++, Python, and Rust. We will be adding more language support to cuVS in the future but will not be improving the language support for RAFT. -> 3. Once all of RAFT's approximate nearest neighbors and clustering algorithms are moved to cuVS, the RAFT APIs will be deprecated and eventually removed altogether. Once removed, RAFT will become a lightweight header-only library. In the meantime, there's no harm in using RAFT if support for additional languages is not needed. - +> cuVS is a new library mostly derived from the approximate nearest neighbors and clustering algorithms in the [RAPIDS RAFT](https://github.com/rapidsai/raft) library of machine learning and data mining primitives. As of version 24.10 (released in October 2024), cuVS contains the most fully-featured versions of the approximate nearest neighbors and clustering algorithms from RAFT. The algorithms which have been migrated over to cuVS will be removed from RAFT in version 24.12 (to be released in December 2024). ## Contents @@ -18,10 +14,11 @@ ## Useful Resources +- [Documentation](https://docs.rapids.ai/api/cuvs/nightly/): Library documentation. +- [Build and Install Guide](https://docs.rapids.ai/api/cuvs/nightly/build): Instructions for installing and building cuVS. +- [Getting Started Guide](https://docs.rapids.ai/api/cuvs/nightly/getting_started): Guide to getting started with cuVS. - [Code Examples](https://github.com/rapidsai/cuvs/tree/HEAD/examples): Self-contained Code Examples. - [API Reference Documentation](https://docs.rapids.ai/api/cuvs/nightly/api_docs): API Documentation. -- [Getting Started Guide](https://docs.rapids.ai/api/cuvs/nightly/getting_started): Getting started with RAFT. -- [Build and Install Guide](https://docs.rapids.ai/api/cuvs/nightly/build): Instructions for installing and building cuVS. - [RAPIDS Community](https://rapids.ai/community.html): Get help, contribute, and collaborate. - [GitHub repository](https://github.com/rapidsai/cuvs): Download the cuVS source code. - [Issue tracker](https://github.com/rapidsai/cuvs/issues): Report issues or request features. @@ -30,34 +27,87 @@ cuVS contains state-of-the-art implementations of several algorithms for running approximate nearest neighbors and clustering on the GPU. It can be used directly or through the various databases and other libraries that have integrated it. The primary goal of cuVS is to simplify the use of GPUs for vector similarity search and clustering. +Vector search is an information retrieval method that has been growing in popularity over the past few years, partly because of the rising importance of multimedia embeddings created from unstructured data and the need to perform semantic search on the embeddings to find items which are semantically similar to each other. 
+ +Vector search is also used in _data mining and machine learning_ tasks and comprises an important step in many _clustering_ and _visualization_ algorithms like [UMAP](https://arxiv.org/abs/2008.00325), [t-SNE](https://lvdmaaten.github.io/tsne/), K-means, and [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html). + +Finally, faster vector search enables interactions between dense vectors and graphs. Converting a pile of dense vectors into nearest neighbors graphs unlocks the entire world of graph analysis algorithms, such as those found in [GraphBLAS](https://graphblas.org/) and [cuGraph](https://github.com/rapidsai/cugraph). + +Below are some common use-cases for vector search: + +- ### Semantic search + - Generative AI & Retrieval augmented generation (RAG) + - Recommender systems + - Computer vision + - Image search + - Text search + - Audio search + - Molecular search + - Model training + + +- ### Data mining + - Clustering algorithms + - Visualization algorithms + - Sampling algorithms + - Class balancing + - Ensemble methods + - k-NN graph construction + +## Why cuVS? + +There are several benefits to using cuVS and GPUs for vector search, including: + +1. Fast index build +2. Latency-critical and high-throughput search +3. Parameter tuning +4. Cost savings +5. Interoperability (build on GPU, deploy on CPU) +6. Multiple language support +7. Building blocks for composing new or accelerating existing algorithms + +In addition to the items above, cuVS takes on the burden of keeping non-trivial accelerated code up to date as new NVIDIA architectures and CUDA versions are released. This provides a delightful development experience, guaranteeing that any libraries, databases, or applications built on top of it will always get the best performance and scale. + ## Installing cuVS -cuVS comes with pre-built packages that can be installed through [conda](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-python). Different packages are available for the different languages supported by cuVS: +cuVS comes with pre-built packages that can be installed through [conda](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-python) and [pip](https://pip.pypa.io/en/stable/). Different packages are available for the different languages supported by cuVS: -| Python | C/C++ | -|--------|-----------------------------| -| `cuvs` | `libcuvs`, `libcuvs-static` | +| Python | C/C++ | +|--------|-----------| +| `cuvs` | `libcuvs` | ### Stable release -It is recommended to use [mamba](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html) to install the desired packages. The following command will install the Python package. You can substitute `cuvs` for any of the packages in the table above: +It is recommended to use [mamba](https://conda.github.io/conda-libmamba-solver/user-guide/) to install the desired packages. The following command will install the Python package. You can substitute `cuvs` for any of the packages in the table above: ```bash -mamba install -c conda-forge -c nvidia -c rapidsai cuvs +conda install -c conda-forge -c nvidia -c rapidsai cuvs ``` +The cuVS Python package can also be installed through pip. 
+ +For CUDA 11 packages: +```bash +pip install cuvs-cu11 --extra-index-url=https://pypi.nvidia.com +``` + +And CUDA 12 packages: +```bash +pip install cuvs-cu12 --extra-index-url=https://pypi.nvidia.com +``` + ### Nightlies If installing a version that has not yet been released, the `rapidsai` channel can be replaced with `rapidsai-nightly`: ```bash -mamba install -c conda-forge -c nvidia -c rapidsai-nightly cuvs=24.12 +conda install -c conda-forge -c nvidia -c rapidsai-nightly cuvs=24.12 ``` -Please see the [Build and Install Guide](https://docs.rapids.ai/api/cuvs/stable/build/) for more information on installing cuVS and building from source. +cuVS also has `pip` wheel packages that can be installed. Please see the [Build and Install Guide](https://docs.rapids.ai/api/cuvs/nightly/build/) for more information on installing the available cuVS packages and building from source. ## Getting Started -The following code snippets train an approximate nearest neighbors index for the CAGRA algorithm. +The following code snippets train an approximate nearest neighbors index for the CAGRA algorithm in each of the languages supported by cuVS. ### Python API @@ -85,7 +135,7 @@ cagra::index_params index_params; auto index = cagra::build(res, index_params, dataset); ``` -For more examples of the C++ APIs, refer to the [examples](https://github.com/rapidsai/cuvs/tree/HEAD/examples) directory in the codebase. +For more code examples of the C++ APIs, including drop-in CMake project templates, please refer to the [C++ examples](https://github.com/rapidsai/cuvs/tree/HEAD/examples) directory in the codebase. ### C API @@ -110,6 +160,8 @@ cuvsCagraIndexParamsDestroy(index_params); cuvsResourcesDestroy(res); ``` +For more code examples of the C APIs, including drop-in CMake project templates, please refer to the [C examples](https://github.com/rapidsai/cuvs/tree/branch-24.10/examples/c). + ### Rust API ```rust @@ -171,6 +223,7 @@ fn cagra_example() -> Result<()> { } ``` +For more code examples of the Rust APIs, including drop-in project templates, please refer to the [Rust examples](https://github.com/rapidsai/cuvs/tree/branch-24.10/examples/rust). ## Contributing @@ -178,60 +231,9 @@ If you are interested in contributing to the cuVS library, please read our [Cont ## References -When citing cuVS generally, please consider referencing this Github repository. 
-```bibtex -@misc{rapidsai, - title={Rapidsai/cuVS: Vector Search and Clustering on the GPU.}, - url={https://github.com/rapidsai/cuvs}, - journal={GitHub}, - publisher={Nvidia RAPIDS}, - author={Rapidsai}, - year={2024} -} -``` - -If citing CAGRA, please consider the following bibtex: -```bibtex -@misc{ootomo2023cagra, - title={CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs}, - author={Hiroyuki Ootomo and Akira Naruse and Corey Nolet and Ray Wang and Tamas Feher and Yong Wang}, - year={2023}, - eprint={2308.15136}, - archivePrefix={arXiv}, - primaryClass={cs.DS} -} -``` - -If citing the k-selection routines, please consider the following bibtex: -```bibtex -@proceedings{10.1145/3581784, - title = {SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, - year = {2023}, - isbn = {9798400701092}, - publisher = {Association for Computing Machinery}, - address = {New York, NY, USA}, - abstract = {Started in 1988, the SC Conference has become the annual nexus for researchers and practitioners from academia, industry and government to share information and foster collaborations to advance the state of the art in High Performance Computing (HPC), Networking, Storage, and Analysis.}, - location = {, Denver, CO, USA, } -} -``` - -If citing the nearest neighbors descent API, please consider the following bibtex: -```bibtex -@inproceedings{10.1145/3459637.3482344, - author = {Wang, Hui and Zhao, Wan-Lei and Zeng, Xiangxiang and Yang, Jianye}, - title = {Fast K-NN Graph Construction by GPU Based NN-Descent}, - year = {2021}, - isbn = {9781450384469}, - publisher = {Association for Computing Machinery}, - address = {New York, NY, USA}, - url = {https://doi.org/10.1145/3459637.3482344}, - doi = {10.1145/3459637.3482344}, - abstract = {NN-Descent is a classic k-NN graph construction approach. It is still widely employed in machine learning, computer vision, and information retrieval tasks due to its efficiency and genericness. However, the current design only works well on CPU. In this paper, NN-Descent has been redesigned to adapt to the GPU architecture. A new graph update strategy called selective update is proposed. It reduces the data exchange between GPU cores and GPU global memory significantly, which is the processing bottleneck under GPU computation architecture. This redesign leads to full exploitation of the parallelism of the GPU hardware. In the meantime, the genericness, as well as the simplicity of NN-Descent, are well-preserved. Moreover, a procedure that allows to k-NN graph to be merged efficiently on GPU is proposed. It makes the construction of high-quality k-NN graphs for out-of-GPU-memory datasets tractable. Our approach is 100-250\texttimes{} faster than the single-thread NN-Descent and is 2.5-5\texttimes{} faster than the existing GPU-based approaches as we tested on million as well as billion scale datasets.}, - booktitle = {Proceedings of the 30th ACM International Conference on Information \& Knowledge Management}, - pages = {1929ā€“1938}, - numpages = {10}, - keywords = {high-dimensional, nn-descent, gpu, k-nearest neighbor graph}, - location = {Virtual Event, Queensland, Australia}, - series = {CIKM '21} -} -``` +For the interested reader, many of the accelerated implementations in cuVS are also based on research papers which can provide a lot more background. We also ask you to please cite the corresponding algorithms by referencing them in your own research. 
+- [CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search](https://arxiv.org/abs/2308.15136) +- [Top-K Algorithms on GPU: A Comprehensive Study and New Methods](https://dl.acm.org/doi/10.1145/3581784.3607062) +- [Fast K-NN Graph Construction by GPU Based NN-Descent](https://dl.acm.org/doi/abs/10.1145/3459637.3482344?casa_token=O_nan1B1F5cAAAAA:QHWDEhh0wmd6UUTLY9_Gv6c3XI-5DXM9mXVaUXOYeStlpxTPmV3nKvABRfoivZAaQ3n8FWyrkWw) +- [cuSLINK: Single-linkage Agglomerative Clustering on the GPU](https://arxiv.org/abs/2306.16354) +- [GPU Semiring Primitives for Sparse Neighborhood Methods](https://arxiv.org/abs/2104.06357) diff --git a/build.sh b/build.sh index b463f0f0d..b787d3a41 100755 --- a/build.sh +++ b/build.sh @@ -18,7 +18,7 @@ ARGS=$* # scripts, and that this script resides in the repo dir! REPODIR=$(cd $(dirname $0); pwd) -VALIDARGS="clean libcuvs python rust docs tests bench-ann examples --uninstall -v -g -n --compile-static-lib --allgpuarch --no-nvtx --show_depr_warn --incl-cache-stats --time -h" +VALIDARGS="clean libcuvs python rust docs tests bench-ann examples --uninstall -v -g -n --compile-static-lib --allgpuarch --no-mg --no-cpu --cpu-only --no-shared-libs --no-nvtx --show_depr_warn --incl-cache-stats --time -h" HELP="$0 [ ...] [ ...] [--cmake-args=\"\"] [--cache-tool=] [--limit-tests=] [--limit-bench-ann=] [--build-metrics=] where is: clean - remove all existing build artifacts and configuration (start over) @@ -37,10 +37,13 @@ HELP="$0 [ ...] [ ...] [--cmake-args=\"\"] [--cache-tool==0.0.0a0 - make +- nccl>=2.19 - ninja - numpy>=1.23,<3.0a0 - numpydoc diff --git a/conda/environments/all_cuda-118_arch-x86_64.yaml b/conda/environments/all_cuda-118_arch-x86_64.yaml index d378f1ca1..494ec394d 100644 --- a/conda/environments/all_cuda-118_arch-x86_64.yaml +++ b/conda/environments/all_cuda-118_arch-x86_64.yaml @@ -37,6 +37,7 @@ dependencies: - libcusparse=11.7.5.86 - librmm==24.12.*,>=0.0.0a0 - make +- nccl>=2.19 - ninja - numpy>=1.23,<3.0a0 - numpydoc diff --git a/conda/environments/all_cuda-125_arch-aarch64.yaml b/conda/environments/all_cuda-125_arch-aarch64.yaml index 030a5ddec..f4f03ccee 100644 --- a/conda/environments/all_cuda-125_arch-aarch64.yaml +++ b/conda/environments/all_cuda-125_arch-aarch64.yaml @@ -34,6 +34,7 @@ dependencies: - libcusparse-dev - librmm==24.12.*,>=0.0.0a0 - make +- nccl>=2.19 - ninja - numpy>=1.23,<3.0a0 - numpydoc diff --git a/conda/environments/all_cuda-125_arch-x86_64.yaml b/conda/environments/all_cuda-125_arch-x86_64.yaml index 40f57f5f6..a295e93f4 100644 --- a/conda/environments/all_cuda-125_arch-x86_64.yaml +++ b/conda/environments/all_cuda-125_arch-x86_64.yaml @@ -34,6 +34,7 @@ dependencies: - libcusparse-dev - librmm==24.12.*,>=0.0.0a0 - make +- nccl>=2.19 - ninja - numpy>=1.23,<3.0a0 - numpydoc diff --git a/conda/environments/bench_ann_cuda-118_arch-aarch64.yaml b/conda/environments/bench_ann_cuda-118_arch-aarch64.yaml index 48fe2925e..a73839457 100644 --- a/conda/environments/bench_ann_cuda-118_arch-aarch64.yaml +++ b/conda/environments/bench_ann_cuda-118_arch-aarch64.yaml @@ -35,6 +35,7 @@ dependencies: - libcusparse=11.7.5.86 - librmm==24.12.*,>=0.0.0a0 - matplotlib +- nccl>=2.19 - ninja - nlohmann_json>=3.11.2 - nvcc_linux-aarch64=11.8 diff --git a/conda/environments/bench_ann_cuda-118_arch-x86_64.yaml b/conda/environments/bench_ann_cuda-118_arch-x86_64.yaml index 7219afb3d..3f869da9a 100644 --- a/conda/environments/bench_ann_cuda-118_arch-x86_64.yaml +++ b/conda/environments/bench_ann_cuda-118_arch-x86_64.yaml @@ -35,6 
+35,7 @@ dependencies: - libcusparse=11.7.5.86 - librmm==24.12.*,>=0.0.0a0 - matplotlib +- nccl>=2.19 - ninja - nlohmann_json>=3.11.2 - nvcc_linux-64=11.8 diff --git a/conda/environments/bench_ann_cuda-125_arch-aarch64.yaml b/conda/environments/bench_ann_cuda-125_arch-aarch64.yaml index 4a6d453a5..407fb6058 100644 --- a/conda/environments/bench_ann_cuda-125_arch-aarch64.yaml +++ b/conda/environments/bench_ann_cuda-125_arch-aarch64.yaml @@ -32,6 +32,7 @@ dependencies: - libcusparse-dev - librmm==24.12.*,>=0.0.0a0 - matplotlib +- nccl>=2.19 - ninja - nlohmann_json>=3.11.2 - openblas diff --git a/conda/environments/bench_ann_cuda-125_arch-x86_64.yaml b/conda/environments/bench_ann_cuda-125_arch-x86_64.yaml index 3f6770fd9..81943b184 100644 --- a/conda/environments/bench_ann_cuda-125_arch-x86_64.yaml +++ b/conda/environments/bench_ann_cuda-125_arch-x86_64.yaml @@ -32,6 +32,7 @@ dependencies: - libcusparse-dev - librmm==24.12.*,>=0.0.0a0 - matplotlib +- nccl>=2.19 - ninja - nlohmann_json>=3.11.2 - openblas diff --git a/conda/recipes/cuvs_bench/build.sh b/conda/recipes/cuvs_bench/build.sh new file mode 100644 index 000000000..05fb7bada --- /dev/null +++ b/conda/recipes/cuvs_bench/build.sh @@ -0,0 +1,5 @@ +#!/usr/bin/env bash +# Copyright (c) 2024, NVIDIA CORPORATION. + +./build.sh bench-ann --allgpuarch --no-nvtx --build-metrics=bench_ann --incl-cache-stats +cmake --install cpp/build --component ann_bench diff --git a/conda/recipes/cuvs_bench/conda_build_config.yaml b/conda/recipes/cuvs_bench/conda_build_config.yaml new file mode 100644 index 000000000..47bd730da --- /dev/null +++ b/conda/recipes/cuvs_bench/conda_build_config.yaml @@ -0,0 +1,70 @@ +c_compiler_version: + - 11 + +cxx_compiler_version: + - 11 + +cuda_compiler: + - cuda-nvcc + +cuda11_compiler: + - nvcc + +c_stdlib: + - sysroot + +c_stdlib_version: + - "2.17" + +cmake_version: + - ">=3.26.4,!=3.30.0" + +nccl_version: + - ">=2.19" + +glog_version: + - ">=0.6.0" + +h5py_version: + - ">=3.8.0" + +nlohmann_json_version: + - ">=3.11.2" + +# The CTK libraries below are missing from the conda-forge::cudatoolkit package +# for CUDA 11. The "*_host_*" version specifiers correspond to `11.8` packages +# and the "*_run_*" version specifiers correspond to `11.x` packages. + +cuda11_libcublas_host_version: + - "=11.11.3.6" + +cuda11_libcublas_run_version: + - ">=11.5.2.43,<12.0.0" + +cuda11_libcurand_host_version: + - "=10.3.0.86" + +cuda11_libcurand_run_version: + - ">=10.2.5.43,<10.3.1" + +cuda11_libcusolver_host_version: + - "=11.4.1.48" + +cuda11_libcusolver_run_version: + - ">=11.2.0.43,<11.4.2" + +cuda11_libcusparse_host_version: + - "=11.7.5.86" + +cuda11_libcusparse_run_version: + - ">=11.6.0.43,<12.0.0" + +# `cuda-profiler-api` only has `11.8.0` and `12.0.0` packages for all +# architectures. The "*_host_*" version specifiers correspond to `11.8` packages and the +# "*_run_*" version specifiers correspond to `11.x` packages. + +cuda11_cuda_profiler_api_host_version: + - "=11.8.86" + +cuda11_cuda_profiler_api_run_version: + - ">=11.4.240,<12" diff --git a/conda/recipes/cuvs_bench/meta.yaml b/conda/recipes/cuvs_bench/meta.yaml new file mode 100644 index 000000000..9ecbf82bb --- /dev/null +++ b/conda/recipes/cuvs_bench/meta.yaml @@ -0,0 +1,105 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. + +# Usage: +# conda build . -c rapidsai -c conda-forge -c nvidia +{% set version = environ['RAPIDS_PACKAGE_VERSION'].lstrip('v') + environ.get('VERSION_SUFFIX', '') %} +{% set minor_version = version.split('.')[0] + '.' 
+ version.split('.')[1] %} +{% set py_version = environ['CONDA_PY'] %} +{% set cuda_version = '.'.join(environ['RAPIDS_CUDA_VERSION'].split('.')[:2]) %} +{% set cuda_major = cuda_version.split('.')[0] %} +{% set date_string = environ['RAPIDS_DATE_STRING'] %} + +package: + name: cuvs_bench + version: {{ version }} + script: build.sh + +source: + path: ../../.. + +build: + script_env: + - AWS_ACCESS_KEY_ID + - AWS_SECRET_ACCESS_KEY + - AWS_SESSION_TOKEN + - CMAKE_C_COMPILER_LAUNCHER + - CMAKE_CUDA_COMPILER_LAUNCHER + - CMAKE_CXX_COMPILER_LAUNCHER + - CMAKE_GENERATOR + - PARALLEL_LEVEL + - RAPIDS_ARTIFACTS_DIR + - SCCACHE_BUCKET + - SCCACHE_IDLE_TIMEOUT + - SCCACHE_REGION + - SCCACHE_S3_KEY_PREFIX=cuvs-bench-aarch64 # [aarch64] + - SCCACHE_S3_KEY_PREFIX=cuvs-bench-linux64 # [linux64] + - SCCACHE_S3_USE_SSL + number: {{ GIT_DESCRIBE_NUMBER }} + string: cuda{{ cuda_major }}_py{{ py_version }}_{{ date_string }}_{{ GIT_DESCRIBE_HASH }}_{{ GIT_DESCRIBE_NUMBER }} + ignore_run_exports_from: + {% if cuda_major == "11" %} + - {{ compiler('cuda11') }} + {% else %} + - {{ compiler('cuda') }} + - cuda-cudart-dev + - libcublas-dev + {% endif %} + +requirements: + build: + - {{ compiler('c') }} + - {{ compiler('cxx') }} + {% if cuda_major == "11" %} + - {{ compiler('cuda11') }} ={{ cuda_version }} + {% else %} + - {{ compiler('cuda') }} + {% endif %} + - cuda-version ={{ cuda_version }} + - cmake {{ cmake_version }} + - ninja + - {{ stdlib("c") }} + + host: + - benchmark + - cuda-version ={{ cuda_version }} + {% if cuda_major == "11" %} + - cuda-profiler-api {{ cuda11_cuda_profiler_api_run_version }} + - libcublas {{ cuda11_libcublas_host_version }} + - libcublas-dev {{ cuda11_libcublas_host_version }} + {% else %} + - cuda-cudart-dev + - cuda-profiler-api + - libcublas-dev + {% endif %} + - glog {{ glog_version }} + - libcuvs {{ version }} + - nlohmann_json {{ nlohmann_json_version }} + - openblas + # rmm is needed to determine if package is gpu-enabled + - python + - rapids-build-backend>=0.3.0,<0.4.0.dev0 + - rmm ={{ minor_version }} + + run: + - benchmark + - {{ pin_compatible('cuda-version', max_pin='x', min_pin='x') }} + {% if cuda_major == "11" %} + - cudatoolkit + {% else %} + - cuda-cudart + - libcublas + {% endif %} + - glog {{ glog_version }} + - libcuvs {{ version }} + - h5py {{ h5py_version }} + - matplotlib + - pandas + - pyyaml + # rmm is needed to determine if package is gpu-enabled + - pylibraft ={{ minor_version }} + - python + - rmm ={{ minor_version }} +about: + home: https://rapids.ai/ + license: Apache-2.0 + summary: cuVS GPU and CPU benchmarks diff --git a/conda/recipes/cuvs_bench_cpu/build.sh b/conda/recipes/cuvs_bench_cpu/build.sh new file mode 100644 index 000000000..163872053 --- /dev/null +++ b/conda/recipes/cuvs_bench_cpu/build.sh @@ -0,0 +1,5 @@ +#!/usr/bin/env bash +# Copyright (c) 2024, NVIDIA CORPORATION. 
+ +./build.sh bench-ann --cpu-only --no-nvtx --build-metrics=bench_ann_cpu --incl-cache-stats +cmake --install cpp/build --component ann_bench diff --git a/conda/recipes/cuvs_bench_cpu/conda_build_config.yaml b/conda/recipes/cuvs_bench_cpu/conda_build_config.yaml new file mode 100644 index 000000000..ed6f708e1 --- /dev/null +++ b/conda/recipes/cuvs_bench_cpu/conda_build_config.yaml @@ -0,0 +1,29 @@ +c_compiler_version: + - 11 + +cxx_compiler_version: + - 11 + +c_stdlib: + - sysroot + +c_stdlib_version: + - "2.17" + +cmake_version: + - ">=3.26.4,!=3.30.0" + +glog_version: + - ">=0.6.0" + +h5py_version: + - ">=3.8.0" + +nlohmann_json_version: + - ">=3.11.2" + +spdlog_version: + - ">=1.14.1,<1.15" + +fmt_version: + - ">=11.0.2,<12" diff --git a/conda/recipes/cuvs_bench_cpu/meta.yaml b/conda/recipes/cuvs_bench_cpu/meta.yaml new file mode 100644 index 000000000..0ce5db744 --- /dev/null +++ b/conda/recipes/cuvs_bench_cpu/meta.yaml @@ -0,0 +1,67 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. + +# Usage: +# conda build . -c rapidsai -c conda-forge -c nvidia +{% set version = environ['RAPIDS_PACKAGE_VERSION'].lstrip('v') + environ.get('VERSION_SUFFIX', '') %} +{% set minor_version = version.split('.')[0] + '.' + version.split('.')[1] %} +{% set py_version = environ['CONDA_PY'] %} +{% set date_string = environ['RAPIDS_DATE_STRING'] %} + +package: + name: cuvs_bench_cpu + version: {{ version }} + script: build.sh + +source: + path: ../../.. + +build: + script_env: + - AWS_ACCESS_KEY_ID + - AWS_SECRET_ACCESS_KEY + - AWS_SESSION_TOKEN + - CMAKE_C_COMPILER_LAUNCHER + - CMAKE_CUDA_COMPILER_LAUNCHER + - CMAKE_CXX_COMPILER_LAUNCHER + - CMAKE_GENERATOR + - PARALLEL_LEVEL + - RAPIDS_ARTIFACTS_DIR + - SCCACHE_BUCKET + - SCCACHE_IDLE_TIMEOUT + - SCCACHE_REGION + - SCCACHE_S3_KEY_PREFIX=cuvs-bench-cpu-aarch64 # [aarch64] + - SCCACHE_S3_KEY_PREFIX=cuvs-bench-cpu-linux64 # [linux64] + - SCCACHE_S3_USE_SSL + number: {{ GIT_DESCRIBE_NUMBER }} + string: py{{ py_version }}_{{ date_string }}_{{ GIT_DESCRIBE_HASH }}_{{ GIT_DESCRIBE_NUMBER }} + +requirements: + build: + - {{ compiler('c') }} + - {{ compiler('cxx') }} + - cmake {{ cmake_version }} + - ninja + - {{ stdlib("c") }} + + host: + - benchmark + - fmt {{ fmt_version }} + - glog {{ glog_version }} + - nlohmann_json {{ nlohmann_json_version }} + - openblas + - python + - rapids-build-backend>=0.3.0,<0.4.0.dev0 + - spdlog {{ spdlog_version }} + + run: + - benchmark + - glog {{ glog_version }} + - h5py {{ h5py_version }} + - matplotlib + - pandas + - pyyaml + - python +about: + home: https://rapids.ai/ + license: Apache-2.0 + summary: cuVS CPU benchmarks diff --git a/conda/recipes/libcuvs/conda_build_config.yaml b/conda/recipes/libcuvs/conda_build_config.yaml index e165f7ed9..b8c49943e 100644 --- a/conda/recipes/libcuvs/conda_build_config.yaml +++ b/conda/recipes/libcuvs/conda_build_config.yaml @@ -22,6 +22,9 @@ cmake_version: h5py_version: - ">=3.8.0" +nccl_version: + - ">=2.19" + # The CTK libraries below are missing from the conda-forge::cudatoolkit package # for CUDA 11. The "*_host_*" version specifiers correspond to `11.8` packages # and the "*_run_*" version specifiers correspond to `11.x` packages. 
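The `nccl>=2.19` pins added above (and in the conda environment files earlier in this changeset) back the new single-node multi-GPU (SNMG) ANN algorithms. As a rough sketch of how the related build knobs introduced here fit together, assuming the new `--no-mg` flag added to `build.sh` simply turns off the `BUILD_MG_ALGOS` CMake option added to `cpp/CMakeLists.txt` below:

```bash
# Sketch only: build libcuvs from source without the multi-GPU (SNMG)
# algorithms, using the --no-mg flag this changeset adds to build.sh's VALIDARGS.
./build.sh libcuvs --no-mg

# Assumed equivalent when configuring CMake directly; BUILD_MG_ALGOS is the
# option this changeset adds to cpp/CMakeLists.txt (default ON).
cmake -S cpp -B cpp/build -DBUILD_MG_ALGOS=OFF
cmake --build cpp/build
```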
diff --git a/conda/recipes/libcuvs/meta.yaml b/conda/recipes/libcuvs/meta.yaml index e154ccf41..46552c397 100644 --- a/conda/recipes/libcuvs/meta.yaml +++ b/conda/recipes/libcuvs/meta.yaml @@ -65,6 +65,7 @@ outputs: host: - librmm ={{ minor_version }} - libraft-headers ={{ minor_version }} + - nccl {{ nccl_version }} - cuda-version ={{ cuda_version }} {% if cuda_major == "11" %} - cuda-profiler-api {{ cuda11_cuda_profiler_api_host_version }} @@ -131,6 +132,7 @@ outputs: host: - librmm ={{ minor_version }} - libraft-headers ={{ minor_version }} + - nccl {{ nccl_version }} - cuda-version ={{ cuda_version }} {% if cuda_major == "11" %} - cuda-profiler-api {{ cuda11_cuda_profiler_api_host_version }} @@ -159,6 +161,7 @@ outputs: - libcusolver - libcusparse {% endif %} + - libraft-headers ={{ minor_version }} about: home: https://rapids.ai/ license: Apache-2.0 @@ -197,6 +200,7 @@ outputs: host: - librmm ={{ minor_version }} - libraft-headers ={{ minor_version }} + - nccl {{ nccl_version }} - {{ pin_subpackage('libcuvs', exact=True) }} - cuda-version ={{ cuda_version }} - openblas # required by some CPU algos in benchmarks @@ -268,6 +272,7 @@ outputs: host: - librmm ={{ minor_version }} - libraft-headers ={{ minor_version }} + - nccl {{ nccl_version }} - {{ pin_subpackage('libcuvs', exact=True) }} - cuda-version ={{ cuda_version }} {% if cuda_major == "11" %} diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index b05030cef..3e98a247e 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -57,6 +57,7 @@ option(BUILD_C_LIBRARY "Build cuVS C API library" OFF) option(BUILD_C_TESTS "Build cuVS C API tests" OFF) option(BUILD_CUVS_BENCH "Build cuVS ann benchmarks" OFF) option(BUILD_CAGRA_HNSWLIB "Build CAGRA+hnswlib interface" ON) +option(BUILD_MG_ALGOS "Build with multi-GPU support" ON) option(CUDA_ENABLE_KERNELINFO "Enable kernel resource usage info" OFF) option(CUDA_ENABLE_LINEINFO "Enable the -lineinfo option for nvcc (useful for cuda-memcheck / profiler)" OFF @@ -85,6 +86,12 @@ if(NOT BUILD_C_LIBRARY) set(BUILD_C_TESTS OFF) endif() +if(NOT BUILD_SHARED_LIBS) + set(BUILD_TESTS OFF) + set(BUILD_C_LIBRARY OFF) + set(BUILD_CAGRA_HNSWLIB OFF) +endif() + # Needed because GoogleBenchmark changes the state of FindThreads.cmake, causing subsequent runs to # have different values for the `Threads::Threads` target. 
Setting this flag ensures # `Threads::Threads` is the same value across all builds so that cache hits occur @@ -175,6 +182,7 @@ rapids_cpm_init() if(NOT BUILD_CPU_ONLY) include(cmake/thirdparty/get_raft.cmake) + include(cmake/thirdparty/get_cutlass.cmake) endif() if(BUILD_C_LIBRARY) @@ -186,8 +194,6 @@ if(BUILD_TESTS OR BUILD_C_TESTS) rapids_cpm_gtest(BUILD_STATIC) endif() -include(cmake/thirdparty/get_cutlass.cmake) - if(BUILD_CUVS_BENCH) include(${rapids-cmake-dir}/cpm/gbench.cmake) rapids_cpm_gbench(BUILD_STATIC) @@ -199,460 +205,568 @@ endif() # ################################################################################################## # * cuvs --------------------------------------------------------------------- -add_library( - cuvs-cagra-search STATIC - src/neighbors/cagra_search_float.cu - src/neighbors/cagra_search_half.cu - src/neighbors/cagra_search_int8.cu - src/neighbors/cagra_search_uint8.cu - src/neighbors/detail/cagra/compute_distance.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_float_uint32_dim128_t8.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_float_uint32_dim256_t16.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_float_uint32_dim512_t32.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_half_uint32_dim128_t8.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_half_uint32_dim256_t16.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_half_uint32_dim512_t32.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_int8_uint32_dim128_t8.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_int8_uint32_dim256_t16.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_int8_uint32_dim512_t32.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_uint8_uint32_dim128_t8.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_uint8_uint32_dim256_t16.cu - src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_uint8_uint32_dim512_t32.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_float_uint32_dim128_t8.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_float_uint32_dim256_t16.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_float_uint32_dim512_t32.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_half_uint32_dim128_t8.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_half_uint32_dim256_t16.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_half_uint32_dim512_t32.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_int8_uint32_dim128_t8.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_int8_uint32_dim256_t16.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_int8_uint32_dim512_t32.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_uint8_uint32_dim128_t8.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_uint8_uint32_dim256_t16.cu - src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_uint8_uint32_dim512_t32.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim128_t8_8pq_2subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim128_t8_8pq_4subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim256_t16_8pq_2subd_half.cu - 
src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim256_t16_8pq_4subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim512_t32_8pq_2subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim512_t32_8pq_4subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim128_t8_8pq_2subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim128_t8_8pq_4subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim256_t16_8pq_2subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim256_t16_8pq_4subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim512_t32_8pq_2subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim512_t32_8pq_4subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim128_t8_8pq_2subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim128_t8_8pq_4subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim256_t16_8pq_2subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim256_t16_8pq_4subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim512_t32_8pq_2subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim512_t32_8pq_4subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim128_t8_8pq_2subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim128_t8_8pq_4subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim256_t16_8pq_2subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim256_t16_8pq_4subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim512_t32_8pq_2subd_half.cu - src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim512_t32_8pq_4subd_half.cu - src/neighbors/detail/cagra/search_multi_cta_float_uint32.cu - src/neighbors/detail/cagra/search_multi_cta_half_uint32.cu - src/neighbors/detail/cagra/search_multi_cta_int8_uint32.cu - src/neighbors/detail/cagra/search_multi_cta_uint8_uint32.cu - src/neighbors/detail/cagra/search_single_cta_float_uint32.cu - src/neighbors/detail/cagra/search_single_cta_half_uint32.cu - src/neighbors/detail/cagra/search_single_cta_int8_uint32.cu - src/neighbors/detail/cagra/search_single_cta_uint8_uint32.cu -) +if(BUILD_SHARED_LIBS) + add_library( + cuvs-cagra-search STATIC + src/neighbors/cagra_search_float.cu + src/neighbors/cagra_search_half.cu + src/neighbors/cagra_search_int8.cu + src/neighbors/cagra_search_uint8.cu + src/neighbors/detail/cagra/compute_distance.cu + src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_float_uint32_dim128_t8.cu + src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_float_uint32_dim256_t16.cu + src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_float_uint32_dim512_t32.cu + src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_half_uint32_dim128_t8.cu + src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_half_uint32_dim256_t16.cu + src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_half_uint32_dim512_t32.cu + 
src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_int8_uint32_dim128_t8.cu + src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_int8_uint32_dim256_t16.cu + src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_int8_uint32_dim512_t32.cu + src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_uint8_uint32_dim128_t8.cu + src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_uint8_uint32_dim256_t16.cu + src/neighbors/detail/cagra/compute_distance_standard_InnerProduct_uint8_uint32_dim512_t32.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_float_uint32_dim128_t8.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_float_uint32_dim256_t16.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_float_uint32_dim512_t32.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_half_uint32_dim128_t8.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_half_uint32_dim256_t16.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_half_uint32_dim512_t32.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_int8_uint32_dim128_t8.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_int8_uint32_dim256_t16.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_int8_uint32_dim512_t32.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_uint8_uint32_dim128_t8.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_uint8_uint32_dim256_t16.cu + src/neighbors/detail/cagra/compute_distance_standard_L2Expanded_uint8_uint32_dim512_t32.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim128_t8_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim128_t8_8pq_4subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim256_t16_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim256_t16_8pq_4subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim512_t32_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_float_uint32_dim512_t32_8pq_4subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim128_t8_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim128_t8_8pq_4subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim256_t16_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim256_t16_8pq_4subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim512_t32_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_half_uint32_dim512_t32_8pq_4subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim128_t8_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim128_t8_8pq_4subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim256_t16_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim256_t16_8pq_4subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim512_t32_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_int8_uint32_dim512_t32_8pq_4subd_half.cu + 
src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim128_t8_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim128_t8_8pq_4subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim256_t16_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim256_t16_8pq_4subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim512_t32_8pq_2subd_half.cu + src/neighbors/detail/cagra/compute_distance_vpq_L2Expanded_uint8_uint32_dim512_t32_8pq_4subd_half.cu + src/neighbors/detail/cagra/search_multi_cta_float_uint32.cu + src/neighbors/detail/cagra/search_multi_cta_half_uint32.cu + src/neighbors/detail/cagra/search_multi_cta_int8_uint32.cu + src/neighbors/detail/cagra/search_multi_cta_uint8_uint32.cu + src/neighbors/detail/cagra/search_single_cta_float_uint32.cu + src/neighbors/detail/cagra/search_single_cta_half_uint32.cu + src/neighbors/detail/cagra/search_single_cta_int8_uint32.cu + src/neighbors/detail/cagra/search_single_cta_uint8_uint32.cu + ) -file(GLOB_RECURSE compute_distance_sources "src/neighbors/detail/cagra/compute_distance_*.cu") -set_source_files_properties(${compute_distance_sources} PROPERTIES COMPILE_FLAGS -maxrregcount=64) - -set_target_properties( - cuvs-cagra-search - PROPERTIES BUILD_RPATH "\$ORIGIN" - CXX_STANDARD 17 - CXX_STANDARD_REQUIRED ON - CUDA_STANDARD 17 - CUDA_STANDARD_REQUIRED ON - CUDA_SEPARABLE_COMPILATION ON - INTERFACE_POSITION_INDEPENDENT_CODE ON - POSITION_INDEPENDENT_CODE ON -) -target_link_libraries(cuvs-cagra-search PRIVATE raft::raft) -target_include_directories( - cuvs-cagra-search PRIVATE "$" -) -target_compile_options( - cuvs-cagra-search PRIVATE "$<$:${CUVS_CXX_FLAGS}>" - "$<$:${CUVS_CUDA_FLAGS}>" -) + file(GLOB_RECURSE compute_distance_sources "src/neighbors/detail/cagra/compute_distance_*.cu") + set_source_files_properties(${compute_distance_sources} PROPERTIES COMPILE_FLAGS -maxrregcount=64) -add_library( - cuvs SHARED - src/cluster/kmeans_balanced_fit_float.cu - src/cluster/kmeans_fit_mg_float.cu - src/cluster/kmeans_fit_mg_double.cu - src/cluster/kmeans_fit_double.cu - src/cluster/kmeans_fit_float.cu - src/cluster/kmeans_auto_find_k_float.cu - src/cluster/kmeans_fit_predict_double.cu - src/cluster/kmeans_fit_predict_float.cu - src/cluster/kmeans_predict_double.cu - src/cluster/kmeans_predict_float.cu - src/cluster/kmeans_balanced_fit_float.cu - src/cluster/kmeans_balanced_fit_predict_float.cu - src/cluster/kmeans_balanced_predict_float.cu - src/cluster/kmeans_balanced_fit_int8.cu - src/cluster/kmeans_balanced_fit_predict_int8.cu - src/cluster/kmeans_balanced_predict_int8.cu - src/cluster/kmeans_transform_double.cu - src/cluster/kmeans_transform_float.cu - src/cluster/single_linkage_float.cu - src/core/bitset.cu - src/distance/detail/pairwise_matrix/dispatch_canberra_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_canberra_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_canberra_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_correlation_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_correlation_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_correlation_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_cosine_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_cosine_half_float_float_int.cu - 
src/distance/detail/pairwise_matrix/dispatch_cosine_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_hamming_unexpanded_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_hamming_unexpanded_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_hamming_unexpanded_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_hellinger_expanded_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_hellinger_expanded_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_hellinger_expanded_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_jensen_shannon_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_jensen_shannon_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_jensen_shannon_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_kl_divergence_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_kl_divergence_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_kl_divergence_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_l1_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_l1_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_l1_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_l2_expanded_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_l2_expanded_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_l2_expanded_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_l2_unexpanded_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_l2_unexpanded_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_l2_unexpanded_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_l_inf_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_l_inf_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_l_inf_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_lp_unexpanded_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_lp_unexpanded_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_lp_unexpanded_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_russel_rao_float_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_russel_rao_half_float_float_int.cu - src/distance/detail/pairwise_matrix/dispatch_russel_rao_double_double_double_int.cu - src/distance/detail/pairwise_matrix/dispatch_rbf.cu - src/distance/detail/pairwise_matrix/dispatch_l2_expanded_double_double_double_int64_t.cu - src/distance/detail/pairwise_matrix/dispatch_l2_expanded_float_float_float_int64_t.cu - src/distance/detail/fused_distance_nn.cu - src/distance/distance.cu - src/distance/pairwise_distance.cu - src/neighbors/brute_force.cu - src/neighbors/cagra_build_float.cu - src/neighbors/cagra_build_half.cu - src/neighbors/cagra_build_int8.cu - src/neighbors/cagra_build_uint8.cu - src/neighbors/cagra_extend_float.cu - src/neighbors/cagra_extend_int8.cu - src/neighbors/cagra_extend_uint8.cu - src/neighbors/cagra_optimize.cu - src/neighbors/cagra_serialize_float.cu - src/neighbors/cagra_serialize_half.cu - src/neighbors/cagra_serialize_int8.cu - src/neighbors/cagra_serialize_uint8.cu - src/neighbors/detail/cagra/cagra_build.cpp - 
src/neighbors/detail/cagra/topk_for_cagra/topk.cu - $<$:src/neighbors/hnsw.cpp> - src/neighbors/ivf_flat_index.cpp - src/neighbors/ivf_flat/ivf_flat_build_extend_float_int64_t.cu - src/neighbors/ivf_flat/ivf_flat_build_extend_int8_t_int64_t.cu - src/neighbors/ivf_flat/ivf_flat_build_extend_uint8_t_int64_t.cu - src/neighbors/ivf_flat/ivf_flat_helpers.cu - src/neighbors/ivf_flat/ivf_flat_search_float_int64_t.cu - src/neighbors/ivf_flat/ivf_flat_search_int8_t_int64_t.cu - src/neighbors/ivf_flat/ivf_flat_search_uint8_t_int64_t.cu - src/neighbors/ivf_flat/ivf_flat_serialize_float_int64_t.cu - src/neighbors/ivf_flat/ivf_flat_serialize_int8_t_int64_t.cu - src/neighbors/ivf_flat/ivf_flat_serialize_uint8_t_int64_t.cu - src/neighbors/ivf_pq_index.cpp - src/neighbors/ivf_pq/ivf_pq_build_common.cu - src/neighbors/ivf_pq/ivf_pq_serialize.cu - src/neighbors/ivf_pq/ivf_pq_deserialize.cu - src/neighbors/ivf_pq/detail/ivf_pq_build_extend_float_int64_t.cu - src/neighbors/ivf_pq/detail/ivf_pq_build_extend_half_int64_t.cu - src/neighbors/ivf_pq/detail/ivf_pq_build_extend_int8_t_int64_t.cu - src/neighbors/ivf_pq/detail/ivf_pq_build_extend_uint8_t_int64_t.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_fp8_false.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_fp8_true.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_half.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_half.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_float.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_fp8_false.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_fp8_true.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_fp8_false_bitset64.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_fp8_true_bitset64.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_half_bitset64.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_half_bitset64.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_float_bitset64.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_fp8_false_bitset64.cu - src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_fp8_true_bitset64.cu - src/neighbors/ivf_pq/detail/ivf_pq_search_float_int64_t.cu - src/neighbors/ivf_pq/detail/ivf_pq_search_half_int64_t.cu - src/neighbors/ivf_pq/detail/ivf_pq_search_int8_t_int64_t.cu - src/neighbors/ivf_pq/detail/ivf_pq_search_uint8_t_int64_t.cu - src/neighbors/nn_descent.cu - src/neighbors/nn_descent_float.cu - src/neighbors/nn_descent_half.cu - src/neighbors/nn_descent_int8.cu - src/neighbors/nn_descent_uint8.cu - src/neighbors/reachability.cu - src/neighbors/refine/detail/refine_device_float_float.cu - src/neighbors/refine/detail/refine_device_half_float.cu - src/neighbors/refine/detail/refine_device_int8_t_float.cu - src/neighbors/refine/detail/refine_device_uint8_t_float.cu - src/neighbors/refine/detail/refine_host_float_float.cpp - src/neighbors/refine/detail/refine_host_half_float.cpp - src/neighbors/refine/detail/refine_host_int8_t_float.cpp - src/neighbors/refine/detail/refine_host_uint8_t_float.cpp - src/neighbors/sample_filter.cu - src/neighbors/vamana_build_float.cu - src/neighbors/vamana_build_uint8.cu - src/neighbors/vamana_build_int8.cu - src/neighbors/vamana_serialize_float.cu - src/neighbors/vamana_serialize_uint8.cu - src/neighbors/vamana_serialize_int8.cu - src/selection/select_k_float_int64_t.cu - src/selection/select_k_float_int32_t.cu - src/selection/select_k_float_uint32_t.cu - 
src/selection/select_k_half_uint32_t.cu - src/stats/silhouette_score.cu - src/stats/trustworthiness_score.cu -) + set_target_properties( + cuvs-cagra-search + PROPERTIES BUILD_RPATH "\$ORIGIN" + CXX_STANDARD 17 + CXX_STANDARD_REQUIRED ON + CUDA_STANDARD 17 + CUDA_STANDARD_REQUIRED ON + CUDA_SEPARABLE_COMPILATION ON + INTERFACE_POSITION_INDEPENDENT_CODE ON + POSITION_INDEPENDENT_CODE ON + ) + target_link_libraries(cuvs-cagra-search PRIVATE raft::raft) + target_include_directories( + cuvs-cagra-search PRIVATE "$" + ) + target_compile_options( + cuvs-cagra-search PRIVATE "$<$:${CUVS_CXX_FLAGS}>" + "$<$:${CUVS_CUDA_FLAGS}>" + ) -target_compile_options( - cuvs INTERFACE $<$:--expt-extended-lambda - --expt-relaxed-constexpr> -) + if(BUILD_MG_ALGOS) + set(CUVS_MG_ALGOS + src/neighbors/mg/mg_flat_float_int64_t.cu + src/neighbors/mg/mg_flat_int8_t_int64_t.cu + src/neighbors/mg/mg_flat_uint8_t_int64_t.cu + src/neighbors/mg/mg_pq_float_int64_t.cu + src/neighbors/mg/mg_pq_half_int64_t.cu + src/neighbors/mg/mg_pq_int8_t_int64_t.cu + src/neighbors/mg/mg_pq_uint8_t_int64_t.cu + src/neighbors/mg/mg_cagra_float_uint32_t.cu + src/neighbors/mg/mg_cagra_half_uint32_t.cu + src/neighbors/mg/mg_cagra_int8_t_uint32_t.cu + src/neighbors/mg/mg_cagra_uint8_t_uint32_t.cu + src/neighbors/mg/omp_checks.cpp + src/neighbors/mg/nccl_comm.cpp + ) + endif() -add_library(cuvs::cuvs ALIAS cuvs) + add_library( + cuvs_objs OBJECT + src/cluster/kmeans_balanced_fit_float.cu + src/cluster/kmeans_fit_mg_float.cu + src/cluster/kmeans_fit_mg_double.cu + src/cluster/kmeans_fit_double.cu + src/cluster/kmeans_fit_float.cu + src/cluster/kmeans_auto_find_k_float.cu + src/cluster/kmeans_fit_predict_double.cu + src/cluster/kmeans_fit_predict_float.cu + src/cluster/kmeans_predict_double.cu + src/cluster/kmeans_predict_float.cu + src/cluster/kmeans_balanced_fit_float.cu + src/cluster/kmeans_balanced_fit_predict_float.cu + src/cluster/kmeans_balanced_predict_float.cu + src/cluster/kmeans_balanced_fit_int8.cu + src/cluster/kmeans_balanced_fit_predict_int8.cu + src/cluster/kmeans_balanced_predict_int8.cu + src/cluster/kmeans_transform_double.cu + src/cluster/kmeans_transform_float.cu + src/cluster/single_linkage_float.cu + src/core/bitset.cu + src/distance/detail/pairwise_matrix/dispatch_canberra_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_canberra_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_canberra_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_correlation_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_correlation_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_correlation_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_cosine_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_cosine_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_cosine_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_hamming_unexpanded_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_hamming_unexpanded_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_hamming_unexpanded_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_hellinger_expanded_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_hellinger_expanded_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_hellinger_expanded_double_double_double_int.cu + 
src/distance/detail/pairwise_matrix/dispatch_jensen_shannon_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_jensen_shannon_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_jensen_shannon_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_kl_divergence_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_kl_divergence_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_kl_divergence_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_l1_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_l1_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_l1_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_l2_expanded_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_l2_expanded_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_l2_expanded_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_l2_unexpanded_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_l2_unexpanded_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_l2_unexpanded_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_l_inf_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_l_inf_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_l_inf_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_lp_unexpanded_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_lp_unexpanded_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_lp_unexpanded_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_russel_rao_float_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_russel_rao_half_float_float_int.cu + src/distance/detail/pairwise_matrix/dispatch_russel_rao_double_double_double_int.cu + src/distance/detail/pairwise_matrix/dispatch_rbf.cu + src/distance/detail/pairwise_matrix/dispatch_l2_expanded_double_double_double_int64_t.cu + src/distance/detail/pairwise_matrix/dispatch_l2_expanded_float_float_float_int64_t.cu + src/distance/detail/fused_distance_nn.cu + src/distance/distance.cu + src/distance/pairwise_distance.cu + src/neighbors/brute_force.cu + src/neighbors/cagra_build_float.cu + src/neighbors/cagra_build_half.cu + src/neighbors/cagra_build_int8.cu + src/neighbors/cagra_build_uint8.cu + src/neighbors/cagra_extend_float.cu + src/neighbors/cagra_extend_int8.cu + src/neighbors/cagra_extend_uint8.cu + src/neighbors/cagra_optimize.cu + src/neighbors/cagra_serialize_float.cu + src/neighbors/cagra_serialize_half.cu + src/neighbors/cagra_serialize_int8.cu + src/neighbors/cagra_serialize_uint8.cu + src/neighbors/iface/iface_cagra_float_uint32_t.cu + src/neighbors/iface/iface_cagra_half_uint32_t.cu + src/neighbors/iface/iface_cagra_int8_t_uint32_t.cu + src/neighbors/iface/iface_cagra_uint8_t_uint32_t.cu + src/neighbors/iface/iface_flat_float_int64_t.cu + src/neighbors/iface/iface_flat_int8_t_int64_t.cu + src/neighbors/iface/iface_flat_uint8_t_int64_t.cu + src/neighbors/iface/iface_pq_float_int64_t.cu + src/neighbors/iface/iface_pq_half_int64_t.cu + src/neighbors/iface/iface_pq_int8_t_int64_t.cu + src/neighbors/iface/iface_pq_uint8_t_int64_t.cu + src/neighbors/detail/cagra/cagra_build.cpp + src/neighbors/detail/cagra/topk_for_cagra/topk.cu + $<$:src/neighbors/hnsw.cpp> + src/neighbors/ivf_flat_index.cpp + 
src/neighbors/ivf_flat/ivf_flat_build_extend_float_int64_t.cu + src/neighbors/ivf_flat/ivf_flat_build_extend_int8_t_int64_t.cu + src/neighbors/ivf_flat/ivf_flat_build_extend_uint8_t_int64_t.cu + src/neighbors/ivf_flat/ivf_flat_helpers.cu + src/neighbors/ivf_flat/ivf_flat_search_float_int64_t.cu + src/neighbors/ivf_flat/ivf_flat_search_int8_t_int64_t.cu + src/neighbors/ivf_flat/ivf_flat_search_uint8_t_int64_t.cu + src/neighbors/ivf_flat/ivf_flat_serialize_float_int64_t.cu + src/neighbors/ivf_flat/ivf_flat_serialize_int8_t_int64_t.cu + src/neighbors/ivf_flat/ivf_flat_serialize_uint8_t_int64_t.cu + src/neighbors/ivf_pq_index.cpp + src/neighbors/ivf_pq/ivf_pq_build_common.cu + src/neighbors/ivf_pq/ivf_pq_serialize.cu + src/neighbors/ivf_pq/ivf_pq_deserialize.cu + src/neighbors/ivf_pq/detail/ivf_pq_build_extend_float_int64_t.cu + src/neighbors/ivf_pq/detail/ivf_pq_build_extend_half_int64_t.cu + src/neighbors/ivf_pq/detail/ivf_pq_build_extend_int8_t_int64_t.cu + src/neighbors/ivf_pq/detail/ivf_pq_build_extend_uint8_t_int64_t.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_fp8_false.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_fp8_true.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_half.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_half.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_float.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_fp8_false.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_fp8_true.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_fp8_false_bitset64.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_fp8_true_bitset64.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_half_half_bitset64.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_half_bitset64.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_float_bitset64.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_fp8_false_bitset64.cu + src/neighbors/ivf_pq/detail/ivf_pq_compute_similarity_float_fp8_true_bitset64.cu + src/neighbors/ivf_pq/detail/ivf_pq_search_float_int64_t.cu + src/neighbors/ivf_pq/detail/ivf_pq_search_half_int64_t.cu + src/neighbors/ivf_pq/detail/ivf_pq_search_int8_t_int64_t.cu + src/neighbors/ivf_pq/detail/ivf_pq_search_uint8_t_int64_t.cu + src/neighbors/nn_descent.cu + src/neighbors/nn_descent_float.cu + src/neighbors/nn_descent_half.cu + src/neighbors/nn_descent_int8.cu + src/neighbors/nn_descent_uint8.cu + src/neighbors/reachability.cu + src/neighbors/refine/detail/refine_device_float_float.cu + src/neighbors/refine/detail/refine_device_half_float.cu + src/neighbors/refine/detail/refine_device_int8_t_float.cu + src/neighbors/refine/detail/refine_device_uint8_t_float.cu + src/neighbors/refine/detail/refine_host_float_float.cpp + src/neighbors/refine/detail/refine_host_half_float.cpp + src/neighbors/refine/detail/refine_host_int8_t_float.cpp + src/neighbors/refine/detail/refine_host_uint8_t_float.cpp + src/neighbors/sample_filter.cu + src/neighbors/vamana_build_float.cu + src/neighbors/vamana_build_uint8.cu + src/neighbors/vamana_build_int8.cu + src/neighbors/vamana_serialize_float.cu + src/neighbors/vamana_serialize_uint8.cu + src/neighbors/vamana_serialize_int8.cu + src/selection/select_k_float_int64_t.cu + src/selection/select_k_float_int32_t.cu + src/selection/select_k_float_uint32_t.cu + src/selection/select_k_half_uint32_t.cu + src/stats/silhouette_score.cu + src/stats/trustworthiness_score.cu + 
${CUVS_MG_ALGOS} + ) -target_include_directories( - cuvs PUBLIC "$" - "$" -) + set_target_properties( + cuvs_objs + PROPERTIES CXX_STANDARD 17 + CXX_STANDARD_REQUIRED ON + CUDA_STANDARD 17 + CUDA_STANDARD_REQUIRED ON + POSITION_INDEPENDENT_CODE ON + ) + target_compile_options( + cuvs_objs PRIVATE "$<$:${CUVS_CXX_FLAGS}>" + "$<$:${CUVS_CUDA_FLAGS}>" + ) + target_link_libraries( + cuvs_objs PUBLIC raft::raft rmm::rmm ${CUVS_CTK_MATH_DEPENDENCIES} + $ + ) -rapids_find_package( - OpenMP REQUIRED - BUILD_EXPORT_SET cuvs-exports - INSTALL_EXPORT_SET cuvs-exports -) + add_library(cuvs SHARED $) + add_library(cuvs_static STATIC $) -if(NOT BUILD_CPU_ONLY) + target_compile_options( + cuvs INTERFACE $<$:--expt-extended-lambda + --expt-relaxed-constexpr> + ) - set(CUVS_CUSOLVER_DEPENDENCY CUDA::cusolver${_ctk_static_suffix}) - set(CUVS_CUBLAS_DEPENDENCY CUDA::cublas${_ctk_static_suffix}) - set(CUVS_CURAND_DEPENDENCY CUDA::curand${_ctk_static_suffix}) - set(CUVS_CUSPARSE_DEPENDENCY CUDA::cusparse${_ctk_static_suffix}) + add_library(cuvs::cuvs ALIAS cuvs) + add_library(cuvs::cuvs_static ALIAS cuvs_static) - set(CUVS_CTK_MATH_DEPENDENCIES ${CUVS_CUBLAS_DEPENDENCY} ${CUVS_CUSOLVER_DEPENDENCY} - ${CUVS_CUSPARSE_DEPENDENCY} ${CUVS_CURAND_DEPENDENCY} + set_target_properties( + cuvs_static + PROPERTIES BUILD_RPATH "\$ORIGIN" + INSTALL_RPATH "\$ORIGIN" + CXX_STANDARD 17 + CXX_STANDARD_REQUIRED ON + POSITION_INDEPENDENT_CODE ON + INTERFACE_POSITION_INDEPENDENT_CODE ON + EXPORT_NAME cuvs_static ) - # Keep cuVS as lightweight as possible. Only CUDA libs and rmm should be used in global target. - target_link_libraries( - cuvs - PUBLIC rmm::rmm raft::raft ${CUVS_CTK_MATH_DEPENDENCIES} - PRIVATE nvidia::cutlass::cutlass $ cuvs-cagra-search + target_compile_options(cuvs_static PRIVATE "$<$:${CUVS_CXX_FLAGS}>") + + target_include_directories( + cuvs_objs + PUBLIC "$" + "$" + INTERFACE "$" ) -endif() -if(BUILD_CAGRA_HNSWLIB) - target_link_libraries(cuvs PRIVATE hnswlib::hnswlib) - target_compile_definitions(cuvs PUBLIC CUVS_BUILD_CAGRA_HNSWLIB) -endif() + target_include_directories( + cuvs_static + PUBLIC "$" + INTERFACE "$" + ) -# Endian detection -include(TestBigEndian) -test_big_endian(BIG_ENDIAN) -if(BIG_ENDIAN) - target_compile_definitions(cuvs PRIVATE CUVS_SYSTEM_LITTLE_ENDIAN=0) -else() - target_compile_definitions(cuvs PRIVATE CUVS_SYSTEM_LITTLE_ENDIAN=1) -endif() + # ensure CUDA symbols aren't relocated to the middle of the debug build binaries + target_link_options(cuvs_static PRIVATE $) + + target_include_directories( + cuvs_static PUBLIC "$" + "$" + ) + + target_include_directories( + cuvs PUBLIC "$" + "$" + ) -file( - WRITE "${CMAKE_CURRENT_BINARY_DIR}/fatbin.ld" - [=[ + rapids_find_package( + OpenMP REQUIRED + BUILD_EXPORT_SET cuvs-exports + INSTALL_EXPORT_SET cuvs-exports + ) + + if(NOT BUILD_CPU_ONLY) + + set(CUVS_CUSOLVER_DEPENDENCY CUDA::cusolver${_ctk_static_suffix}) + set(CUVS_CUBLAS_DEPENDENCY CUDA::cublas${_ctk_static_suffix}) + set(CUVS_CURAND_DEPENDENCY CUDA::curand${_ctk_static_suffix}) + set(CUVS_CUSPARSE_DEPENDENCY CUDA::cusparse${_ctk_static_suffix}) + + set(CUVS_CTK_MATH_DEPENDENCIES ${CUVS_CUBLAS_DEPENDENCY} ${CUVS_CUSOLVER_DEPENDENCY} + ${CUVS_CUSPARSE_DEPENDENCY} ${CUVS_CURAND_DEPENDENCY} + ) + + if(BUILD_MG_ALGOS) + set(CUVS_COMMS_DEPENDENCY nccl) + endif() + + # Keep cuVS as lightweight as possible. Only CUDA libs and rmm should be used in global target. 
+ target_link_libraries( + cuvs + PUBLIC rmm::rmm raft::raft ${CUVS_CTK_MATH_DEPENDENCIES} + PRIVATE nvidia::cutlass::cutlass $ + cuvs-cagra-search ${CUVS_COMMS_DEPENDENCY} + ) + + target_link_libraries( + cuvs_static + PUBLIC rmm::rmm raft::raft ${CUVS_CTK_MATH_DEPENDENCIES} + PRIVATE nvidia::cutlass::cutlass $ + ) + endif() + + if(BUILD_MG_ALGOS) + target_compile_definitions(cuvs PUBLIC CUVS_BUILD_MG_ALGOS) + target_compile_definitions(cuvs_objs PUBLIC CUVS_BUILD_MG_ALGOS) + endif() + + if(BUILD_CAGRA_HNSWLIB) + target_link_libraries(cuvs_objs PRIVATE hnswlib::hnswlib) + target_compile_definitions(cuvs_objs PUBLIC CUVS_BUILD_CAGRA_HNSWLIB) + endif() + + # Endian detection + include(TestBigEndian) + test_big_endian(BIG_ENDIAN) + if(BIG_ENDIAN) + target_compile_definitions(cuvs PRIVATE CUVS_SYSTEM_LITTLE_ENDIAN=0) + else() + target_compile_definitions(cuvs PRIVATE CUVS_SYSTEM_LITTLE_ENDIAN=1) + endif() + + file( + WRITE "${CMAKE_CURRENT_BINARY_DIR}/fatbin.ld" + [=[ SECTIONS { .nvFatBinSegment : { *(.nvFatBinSegment) } .nv_fatbin : { *(.nv_fatbin) } } ]=] -) - -# ################################################################################################## -# * NVTX support in cuvs ----------------------------------------------------- - -if(CUVS_NVTX) - # This enables NVTX within the project with no option to disable it downstream. - target_link_libraries(cuvs PUBLIC CUDA::nvtx3) - target_compile_definitions(cuvs PUBLIC NVTX_ENABLED) -else() - # Allow enable NVTX downstream if not set here. This creates a new option at build/install time, - # which is set by default to OFF, but can be enabled in the dependent project. - get_property( - nvtx_option_help_string - CACHE CUVS_NVTX - PROPERTY HELPSTRING ) - string( - CONCAT - nvtx_export_string - "option(CUVS_NVTX \"" - ${nvtx_option_help_string} - "\" OFF)" - [=[ + + # ################################################################################################ + # * NVTX support in cuvs ----------------------------------------------------- + + if(CUVS_NVTX) + # This enables NVTX within the project with no option to disable it downstream. + target_link_libraries(cuvs PUBLIC CUDA::nvtx3) + target_compile_definitions(cuvs PUBLIC NVTX_ENABLED) + else() + # Allow enable NVTX downstream if not set here. This creates a new option at build/install time, + # which is set by default to OFF, but can be enabled in the dependent project. 
+ get_property( + nvtx_option_help_string + CACHE CUVS_NVTX + PROPERTY HELPSTRING + ) + string( + CONCAT + nvtx_export_string + "option(CUVS_NVTX \"" + ${nvtx_option_help_string} + "\" OFF)" + [=[ target_link_libraries(cuvs::cuvs INTERFACE $<$:CUDA::nvtx3>) target_compile_definitions(cuvs::cuvs INTERFACE $<$:NVTX_ENABLED>) ]=] - ) -endif() - -set_target_properties( - cuvs - PROPERTIES BUILD_RPATH "\$ORIGIN" - INSTALL_RPATH "\$ORIGIN" - CXX_STANDARD 17 - CXX_STANDARD_REQUIRED ON - CUDA_STANDARD 17 - CUDA_STANDARD_REQUIRED ON - INTERFACE_POSITION_INDEPENDENT_CODE ON - POSITION_INDEPENDENT_CODE ON -) - -target_compile_options( - cuvs PRIVATE "$<$:${CUVS_CXX_FLAGS}>" - "$<$:${CUVS_CUDA_FLAGS}>" -) -# ensure CUDA symbols aren't relocated to the middle of the debug build binaries -target_link_options(cuvs PRIVATE $) - -# ################################################################################################## -# * cuvs_c ------------------------------------------------------------------------------- -if(BUILD_C_LIBRARY) - add_library( - cuvs_c SHARED - src/core/c_api.cpp - src/neighbors/brute_force_c.cpp - src/neighbors/ivf_flat_c.cpp - src/neighbors/ivf_pq_c.cpp - src/neighbors/cagra_c.cpp - src/neighbors/hnsw_c.cpp - src/neighbors/refine/refine_c.cpp - src/distance/pairwise_distance_c.cpp - ) - - add_library(cuvs::c_api ALIAS cuvs_c) + ) + endif() set_target_properties( - cuvs_c + cuvs PROPERTIES BUILD_RPATH "\$ORIGIN" INSTALL_RPATH "\$ORIGIN" CXX_STANDARD 17 CXX_STANDARD_REQUIRED ON - POSITION_INDEPENDENT_CODE ON + CUDA_STANDARD 17 + CUDA_STANDARD_REQUIRED ON INTERFACE_POSITION_INDEPENDENT_CODE ON - EXPORT_NAME c_api + POSITION_INDEPENDENT_CODE ON ) - target_compile_options(cuvs_c PRIVATE "$<$:${CUVS_CXX_FLAGS}>") - - target_include_directories( - cuvs_c - PUBLIC "$" - INTERFACE "$" + target_compile_options( + cuvs PRIVATE "$<$:${CUVS_CXX_FLAGS}>" + "$<$:${CUVS_CUDA_FLAGS}>" ) + # ensure CUDA symbols aren't relocated to the middle of the debug build binaries + target_link_options(cuvs PRIVATE $) + + # ################################################################################################ + # * cuvs_c ------------------------------------------------------------------------------- + if(BUILD_C_LIBRARY) + add_library( + cuvs_c SHARED + src/core/c_api.cpp + src/neighbors/brute_force_c.cpp + src/neighbors/ivf_flat_c.cpp + src/neighbors/ivf_pq_c.cpp + src/neighbors/cagra_c.cpp + $<$:src/neighbors/hnsw_c.cpp> + src/neighbors/refine/refine_c.cpp + src/distance/pairwise_distance_c.cpp + ) - target_link_libraries( - cuvs_c - PUBLIC cuvs::cuvs ${CUVS_CTK_MATH_DEPENDENCIES} - PRIVATE raft::raft - ) + if(BUILD_CAGRA_HNSWLIB) + target_link_libraries(cuvs_c PRIVATE hnswlib::hnswlib) + target_compile_definitions(cuvs_c PUBLIC CUVS_BUILD_CAGRA_HNSWLIB) + endif() + + add_library(cuvs::c_api ALIAS cuvs_c) + + set_target_properties( + cuvs_c + PROPERTIES BUILD_RPATH "\$ORIGIN" + INSTALL_RPATH "\$ORIGIN" + CXX_STANDARD 17 + CXX_STANDARD_REQUIRED ON + POSITION_INDEPENDENT_CODE ON + INTERFACE_POSITION_INDEPENDENT_CODE ON + EXPORT_NAME c_api + ) - # ensure CUDA symbols aren't relocated to the middle of the debug build binaries - target_link_options(cuvs_c PRIVATE "${CMAKE_CURRENT_BINARY_DIR}/fatbin.ld") -endif() + target_compile_options(cuvs_c PRIVATE "$<$:${CUVS_CXX_FLAGS}>") -# ################################################################################################## -# * install targets----------------------------------------------------------- -rapids_cmake_install_lib_dir(lib_dir) 
-include(GNUInstallDirs) -include(CPack) - -install( - TARGETS cuvs - DESTINATION ${lib_dir} - COMPONENT cuvs - EXPORT cuvs-exports -) + target_include_directories( + cuvs_c + PUBLIC "$" + INTERFACE "$" + ) -install( - DIRECTORY include/cuvs - COMPONENT cuvs - DESTINATION ${CMAKE_INSTALL_INCLUDEDIR} -) + target_link_libraries( + cuvs_c + PUBLIC cuvs::cuvs ${CUVS_CTK_MATH_DEPENDENCIES} + PRIVATE raft::raft + ) + + # ensure CUDA symbols aren't relocated to the middle of the debug build binaries + target_link_options(cuvs_c PRIVATE "${CMAKE_CURRENT_BINARY_DIR}/fatbin.ld") + endif() + + # ################################################################################################ + # * install targets----------------------------------------------------------- + rapids_cmake_install_lib_dir(lib_dir) + include(GNUInstallDirs) + include(CPack) -if(BUILD_C_LIBRARY) install( - TARGETS cuvs_c + TARGETS cuvs cuvs_static DESTINATION ${lib_dir} - COMPONENT c_api - EXPORT cuvs-c-exports + COMPONENT cuvs + EXPORT cuvs-exports ) -endif() -install( - FILES ${CMAKE_CURRENT_BINARY_DIR}/include/cuvs/version_config.hpp - COMPONENT cuvs - DESTINATION include/cuvs -) + install( + DIRECTORY include/cuvs + COMPONENT cuvs + DESTINATION ${CMAKE_INSTALL_INCLUDEDIR} + ) -if(TARGET cuvs_c) - list(APPEND cuvs_components c_api) - list(APPEND cuvs_export_sets cuvs-c-exports) - set(CUVS_C_TARGET cuvs_c) -endif() + if(BUILD_C_LIBRARY) + install( + TARGETS cuvs_c + DESTINATION ${lib_dir} + COMPONENT c_api + EXPORT cuvs-c-exports + ) + endif() -# Use `rapids_export` for 22.04 as it will have COMPONENT support -rapids_export( - INSTALL cuvs - EXPORT_SET cuvs-exports - COMPONENTS ${cuvs_components} - COMPONENTS_EXPORT_SET ${cuvs_export_sets} - GLOBAL_TARGETS cuvs ${CUVS_C_TARGET} - NAMESPACE cuvs:: -) + install( + FILES ${CMAKE_CURRENT_BINARY_DIR}/include/cuvs/version_config.hpp + COMPONENT cuvs + DESTINATION include/cuvs + ) -# ################################################################################################## -# * build export ------------------------------------------------------------- -rapids_export( - BUILD cuvs - EXPORT_SET cuvs-exports - COMPONENTS ${cuvs_components} - COMPONENTS_EXPORT_SET ${cuvs_export_sets} - GLOBAL_TARGETS cuvs ${CUVS_C_TARGET} - NAMESPACE cuvs:: -) + if(TARGET cuvs_c) + list(APPEND cuvs_components c_api) + list(APPEND cuvs_export_sets cuvs-c-exports) + set(CUVS_C_TARGET cuvs_c) + endif() + + # Use `rapids_export` for 22.04 as it will have COMPONENT support + rapids_export( + INSTALL cuvs + EXPORT_SET cuvs-exports + COMPONENTS ${cuvs_components} + COMPONENTS_EXPORT_SET ${cuvs_export_sets} + GLOBAL_TARGETS cuvs ${CUVS_C_TARGET} + NAMESPACE cuvs:: + ) + + # ################################################################################################ + # * build export ------------------------------------------------------------- + rapids_export( + BUILD cuvs + EXPORT_SET cuvs-exports + COMPONENTS ${cuvs_components} + COMPONENTS_EXPORT_SET ${cuvs_export_sets} + GLOBAL_TARGETS cuvs ${CUVS_C_TARGET} + NAMESPACE cuvs:: + ) +endif() # ################################################################################################## # * build test executable ---------------------------------------------------- diff --git a/cpp/bench/ann/CMakeLists.txt b/cpp/bench/ann/CMakeLists.txt index ac1301221..0f6b42ae9 100644 --- a/cpp/bench/ann/CMakeLists.txt +++ b/cpp/bench/ann/CMakeLists.txt @@ -32,6 +32,7 @@ option(CUVS_ANN_BENCH_USE_CUVS_BRUTE_FORCE "Include cuVS brute force knn in benc 
option(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB "Include cuVS CAGRA with HNSW search in benchmark" ON) option(CUVS_ANN_BENCH_USE_HNSWLIB "Include hnsw algorithm in benchmark" ON) option(CUVS_ANN_BENCH_USE_GGNN "Include ggnn algorithm in benchmark" OFF) +option(CUVS_ANN_BENCH_USE_CUVS_MG "Include cuVS ann mg algorithm in benchmark" ${BUILD_MG_ALGOS}) option(CUVS_ANN_BENCH_SINGLE_EXE "Make a single executable with benchmark as shared library modules" OFF ) @@ -55,6 +56,7 @@ if(BUILD_CPU_ONLY) set(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB OFF) set(CUVS_ANN_BENCH_USE_GGNN OFF) set(CUVS_KNN_BENCH_USE_CUVS_BRUTE_FORCE OFF) + set(CUVS_ANN_BENCH_USE_CUVS_MG OFF) else() set(CUVS_FAISS_ENABLE_GPU ON) endif() @@ -66,6 +68,7 @@ if(CUVS_ANN_BENCH_USE_CUVS_IVF_PQ OR CUVS_ANN_BENCH_USE_CUVS_CAGRA OR CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB OR CUVS_KNN_BENCH_USE_CUVS_BRUTE_FORCE + OR CUVS_ANN_BENCH_USE_CUVS_MG ) set(CUVS_ANN_BENCH_USE_CUVS ON) endif() @@ -227,6 +230,21 @@ if(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB) ) endif() +if(CUVS_ANN_BENCH_USE_CUVS_MG) + ConfigureAnnBench( + NAME + CUVS_MG + PATH + src/cuvs/cuvs_benchmark.cu + $<$:src/cuvs/cuvs_mg_ivf_flat.cu> + $<$:src/cuvs/cuvs_mg_ivf_pq.cu> + $<$:src/cuvs/cuvs_mg_cagra.cu> + LINKS + cuvs + nccl + ) +endif() + message("CUVS_FAISS_TARGETS: ${CUVS_FAISS_TARGETS}") message("CUDAToolkit_LIBRARY_DIR: ${CUDAToolkit_LIBRARY_DIR}") if(CUVS_ANN_BENCH_USE_FAISS_CPU_FLAT) diff --git a/cpp/bench/ann/src/cuvs/cuvs_ann_bench_param_parser.h b/cpp/bench/ann/src/cuvs/cuvs_ann_bench_param_parser.h index 22f0cab6f..57d5b1910 100644 --- a/cpp/bench/ann/src/cuvs/cuvs_ann_bench_param_parser.h +++ b/cpp/bench/ann/src/cuvs/cuvs_ann_bench_param_parser.h @@ -45,7 +45,18 @@ extern template class cuvs::bench::cuvs_cagra; extern template class cuvs::bench::cuvs_cagra; #endif -#ifdef CUVS_ANN_BENCH_USE_CUVS_IVF_FLAT +#ifdef CUVS_ANN_BENCH_USE_CUVS_MG +#include "cuvs_ivf_flat_wrapper.h" +#include "cuvs_mg_ivf_flat_wrapper.h" + +#include "cuvs_ivf_pq_wrapper.h" +#include "cuvs_mg_ivf_pq_wrapper.h" + +#include "cuvs_cagra_wrapper.h" +#include "cuvs_mg_cagra_wrapper.h" +#endif + +#if defined(CUVS_ANN_BENCH_USE_CUVS_IVF_FLAT) || defined(CUVS_ANN_BENCH_USE_CUVS_MG) template void parse_build_param(const nlohmann::json& conf, typename cuvs::bench::cuvs_ivf_flat::build_param& param) @@ -64,7 +75,7 @@ void parse_search_param(const nlohmann::json& conf, #endif #if defined(CUVS_ANN_BENCH_USE_CUVS_IVF_PQ) || defined(CUVS_ANN_BENCH_USE_CUVS_CAGRA) || \ - defined(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB) + defined(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB) || defined(CUVS_ANN_BENCH_USE_CUVS_MG) template void parse_build_param(const nlohmann::json& conf, typename cuvs::bench::cuvs_ivf_pq::build_param& param) @@ -130,7 +141,8 @@ void parse_search_param(const nlohmann::json& conf, } #endif -#if defined(CUVS_ANN_BENCH_USE_CUVS_CAGRA) || defined(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB) +#if defined(CUVS_ANN_BENCH_USE_CUVS_CAGRA) || defined(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB) || \ + defined(CUVS_ANN_BENCH_USE_CUVS_MG) template void parse_build_param(const nlohmann::json& conf, cuvs::neighbors::nn_descent::index_params& param) { diff --git a/cpp/bench/ann/src/cuvs/cuvs_benchmark.cu b/cpp/bench/ann/src/cuvs/cuvs_benchmark.cu index a956ab139..893097236 100644 --- a/cpp/bench/ann/src/cuvs/cuvs_benchmark.cu +++ b/cpp/bench/ann/src/cuvs/cuvs_benchmark.cu @@ -29,6 +29,43 @@ namespace cuvs::bench { +#ifdef CUVS_ANN_BENCH_USE_CUVS_MG +void add_distribution_mode(cuvs::neighbors::mg::distribution_mode* dist_mode, + const 
nlohmann::json& conf) +{ + if (conf.contains("distribution_mode")) { + std::string distribution_mode = conf.at("distribution_mode"); + if (distribution_mode == "replicated") { + *dist_mode = cuvs::neighbors::mg::distribution_mode::REPLICATED; + } else if (distribution_mode == "sharded") { + *dist_mode = cuvs::neighbors::mg::distribution_mode::SHARDED; + } else { + throw std::runtime_error("invalid value for distribution_mode"); + } + } else { + // default + *dist_mode = cuvs::neighbors::mg::distribution_mode::SHARDED; + } +}; + +void add_merge_mode(cuvs::neighbors::mg::sharded_merge_mode* merge_mode, const nlohmann::json& conf) +{ + if (conf.contains("merge_mode")) { + std::string sharded_merge_mode = conf.at("merge_mode"); + if (sharded_merge_mode == "tree_merge") { + *merge_mode = cuvs::neighbors::mg::sharded_merge_mode::TREE_MERGE; + } else if (sharded_merge_mode == "merge_on_root_rank") { + *merge_mode = cuvs::neighbors::mg::sharded_merge_mode::MERGE_ON_ROOT_RANK; + } else { + throw std::runtime_error("invalid value for merge_mode"); + } + } else { + // default + *merge_mode = cuvs::neighbors::mg::sharded_merge_mode::TREE_MERGE; + } +}; +#endif + template auto create_algo(const std::string& algo_name, const std::string& distance, @@ -71,6 +108,32 @@ auto create_algo(const std::string& algo_name, parse_build_param(conf, param); a = std::make_unique>(metric, dim, param); } +#endif +#ifdef CUVS_ANN_BENCH_USE_CUVS_MG + if constexpr (std::is_same_v || std::is_same_v || + std::is_same_v) { + if (algo_name == "raft_mg_ivf_flat" || algo_name == "cuvs_mg_ivf_flat") { + typename cuvs::bench::cuvs_mg_ivf_flat::build_param param; + parse_build_param(conf, param); + add_distribution_mode(¶m.mode, conf); + a = std::make_unique>(metric, dim, param); + } + } + + if (algo_name == "raft_mg_ivf_pq" || algo_name == "cuvs_mg_ivf_pq") { + typename cuvs::bench::cuvs_mg_ivf_pq::build_param param; + parse_build_param(conf, param); + add_distribution_mode(¶m.mode, conf); + a = std::make_unique>(metric, dim, param); + } + + if (algo_name == "raft_mg_cagra" || algo_name == "cuvs_mg_cagra") { + typename cuvs::bench::cuvs_mg_cagra::build_param param; + parse_build_param(conf, param); + add_distribution_mode(¶m.mode, conf); + a = std::make_unique>(metric, dim, param); + } + #endif if (!a) { throw std::runtime_error("invalid algo: '" + algo_name + "'"); } @@ -113,6 +176,32 @@ auto create_search_param(const std::string& algo_name, const nlohmann::json& con return param; } #endif +#ifdef CUVS_ANN_BENCH_USE_CUVS_MG + if constexpr (std::is_same_v || std::is_same_v || + std::is_same_v) { + if (algo_name == "raft_mg_ivf_flat" || algo_name == "cuvs_mg_ivf_flat") { + auto param = + std::make_unique::search_param>(); + parse_search_param(conf, *param); + add_merge_mode(¶m->merge_mode, conf); + return param; + } + } + + if (algo_name == "raft_mg_ivf_pq" || algo_name == "cuvs_mg_ivf_pq") { + auto param = std::make_unique::search_param>(); + parse_search_param(conf, *param); + add_merge_mode(¶m->merge_mode, conf); + return param; + } + + if (algo_name == "raft_mg_cagra" || algo_name == "cuvs_mg_cagra") { + auto param = std::make_unique::search_param>(); + parse_search_param(conf, *param); + add_merge_mode(¶m->merge_mode, conf); + return param; + } +#endif // else throw std::runtime_error("invalid algo: '" + algo_name + "'"); diff --git a/cpp/bench/ann/src/cuvs/cuvs_cagra_wrapper.h b/cpp/bench/ann/src/cuvs/cuvs_cagra_wrapper.h index ff854f890..b2ba35eee 100644 --- a/cpp/bench/ann/src/cuvs/cuvs_cagra_wrapper.h +++ 
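For context, the two helpers above map optional JSON keys of a benchmark configuration onto the SNMG enums, defaulting to `SHARDED` distribution and `TREE_MERGE`. A minimal sketch of how they behave, assuming the helpers and the `cuvs::neighbors::mg` enums from this diff are visible in the translation unit (they are only compiled when `CUVS_ANN_BENCH_USE_CUVS_MG` is defined); the `conf` fragment is hypothetical:

```cpp
#include <nlohmann/json.hpp>

// Sketch: exercise the config helpers introduced above on a hypothetical fragment.
void parse_mg_modes_example()
{
  nlohmann::json conf = {{"distribution_mode", "replicated"},    // omitted -> SHARDED default
                         {"merge_mode", "merge_on_root_rank"}};  // omitted -> TREE_MERGE default

  cuvs::neighbors::mg::distribution_mode dist_mode;
  cuvs::neighbors::mg::sharded_merge_mode merge_mode;
  cuvs::bench::add_distribution_mode(&dist_mode, conf);  // -> REPLICATED
  cuvs::bench::add_merge_mode(&merge_mode, conf);        // -> MERGE_ON_ROOT_RANK; unknown strings throw
}
```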
b/cpp/bench/ann/src/cuvs/cuvs_cagra_wrapper.h @@ -72,6 +72,23 @@ class cuvs_cagra : public algo, public algo_gpu { std::optional ivf_pq_refine_rate = std::nullopt; std::optional ivf_pq_build_params = std::nullopt; std::optional ivf_pq_search_params = std::nullopt; + + void prepare_build_params(const raft::extent_2d& dataset_extents) + { + if (algo == CagraBuildAlgo::kIvfPq) { + auto pq_params = cuvs::neighbors::cagra::graph_build_params::ivf_pq_params( + dataset_extents, cagra_params.metric); + if (ivf_pq_build_params) { pq_params.build_params = *ivf_pq_build_params; } + if (ivf_pq_search_params) { pq_params.search_params = *ivf_pq_search_params; } + if (ivf_pq_refine_rate) { pq_params.refinement_rate = *ivf_pq_refine_rate; } + cagra_params.graph_build_params = pq_params; + } else if (algo == CagraBuildAlgo::kNnDescent) { + auto nn_params = cuvs::neighbors::cagra::graph_build_params::nn_descent_params( + cagra_params.intermediate_graph_degree); + if (nn_descent_params) { nn_params = *nn_descent_params; } + cagra_params.graph_build_params = nn_params; + } + } }; cuvs_cagra(Metric metric, int dim, const build_param& param, int concurrent_searches = 1) @@ -168,28 +185,9 @@ template void cuvs_cagra::build(const T* dataset, size_t nrow) { auto dataset_extents = raft::make_extents(nrow, dimension_); + index_params_.prepare_build_params(dataset_extents); auto& params = index_params_.cagra_params; - - if (index_params_.algo == CagraBuildAlgo::kIvfPq) { - auto pq_params = - cuvs::neighbors::cagra::graph_build_params::ivf_pq_params(dataset_extents, params.metric); - if (index_params_.ivf_pq_build_params) { - pq_params.build_params = *index_params_.ivf_pq_build_params; - } - if (index_params_.ivf_pq_search_params) { - pq_params.search_params = *index_params_.ivf_pq_search_params; - } - if (index_params_.ivf_pq_refine_rate) { - pq_params.refinement_rate = *index_params_.ivf_pq_refine_rate; - } - params.graph_build_params = pq_params; - } else if (index_params_.algo == CagraBuildAlgo::kNnDescent) { - auto nn_params = cuvs::neighbors::cagra::graph_build_params::nn_descent_params( - params.intermediate_graph_degree); - if (index_params_.nn_descent_params) { nn_params = *index_params_.nn_descent_params; } - params.graph_build_params = nn_params; - } auto dataset_view_host = raft::make_mdspan(dataset, dataset_extents); auto dataset_view_device = diff --git a/cpp/bench/ann/src/cuvs/cuvs_mg_cagra.cu b/cpp/bench/ann/src/cuvs/cuvs_mg_cagra.cu new file mode 100644 index 000000000..801caa85f --- /dev/null +++ b/cpp/bench/ann/src/cuvs/cuvs_mg_cagra.cu @@ -0,0 +1,23 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +#include "cuvs_mg_cagra_wrapper.h" + +namespace cuvs::bench { +template class cuvs_mg_cagra; +template class cuvs_mg_cagra; +template class cuvs_mg_cagra; +template class cuvs_mg_cagra; +} // namespace cuvs::bench diff --git a/cpp/bench/ann/src/cuvs/cuvs_mg_cagra_wrapper.h b/cpp/bench/ann/src/cuvs/cuvs_mg_cagra_wrapper.h new file mode 100644 index 000000000..50c1ff4db --- /dev/null +++ b/cpp/bench/ann/src/cuvs/cuvs_mg_cagra_wrapper.h @@ -0,0 +1,183 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#pragma once + +#include "cuvs_ann_bench_utils.h" +#include "cuvs_cagra_wrapper.h" +#include +#include + +namespace cuvs::bench { +using namespace cuvs::neighbors; + +enum class AllocatorType; +enum class CagraBuildAlgo; + +template +class cuvs_mg_cagra : public algo, public algo_gpu { + public: + using search_param_base = typename algo::search_param; + using algo::dim_; + + struct build_param : public cuvs::bench::cuvs_cagra::build_param { + cuvs::neighbors::mg::distribution_mode mode; + }; + + struct search_param : public cuvs::bench::cuvs_cagra::search_param { + cuvs::neighbors::mg::sharded_merge_mode merge_mode; + }; + + cuvs_mg_cagra(Metric metric, int dim, const build_param& param, int concurrent_searches = 1) + : algo(metric, dim), index_params_(param) + { + index_params_.cagra_params.metric = parse_metric_type(metric); + index_params_.ivf_pq_build_params->metric = parse_metric_type(metric); + + // init nccl clique outside as to not affect benchmark + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle_); + } + + void build(const T* dataset, size_t nrow) final; + + void set_search_param(const search_param_base& param) override; + + void set_search_dataset(const T* dataset, size_t nrow) override; + + void search(const T* queries, + int batch_size, + int k, + algo_base::index_type* neighbors, + float* distances) const override; + void search_base(const T* queries, + int batch_size, + int k, + algo_base::index_type* neighbors, + float* distances) const; + + [[nodiscard]] auto get_sync_stream() const noexcept -> cudaStream_t override + { + auto stream = raft::resource::get_cuda_stream(handle_); + return stream; + } + + // to enable dataset access from GPU memory + [[nodiscard]] auto get_preference() const -> algo_property override + { + algo_property property; + property.dataset_memory_type = MemoryType::kHost; + property.query_memory_type = MemoryType::kHost; + return property; + } + void save(const std::string& file) const override; + void load(const std::string&) override; + void save_to_hnswlib(const std::string& file) const; + std::unique_ptr> copy() override; + + private: + raft::device_resources handle_; + float refine_ratio_; + build_param index_params_; + cuvs::neighbors::mg::search_params search_params_; + std::shared_ptr, T, IdxT>> + index_; +}; + +template +void cuvs_mg_cagra::build(const T* dataset, size_t nrow) +{ + auto dataset_extents = raft::make_extents(nrow, dim_); + 
index_params_.prepare_build_params(dataset_extents); + cuvs::neighbors::mg::index_params build_params = index_params_.cagra_params; + build_params.mode = index_params_.mode; + + auto dataset_view = + raft::make_host_matrix_view(dataset, nrow, dim_); + auto idx = cuvs::neighbors::mg::build(handle_, build_params, dataset_view); + index_ = + std::make_shared, T, IdxT>>( + std::move(idx)); +} + +inline auto allocator_to_string(AllocatorType mem_type) -> std::string; + +template +void cuvs_mg_cagra::set_search_param(const search_param_base& param) +{ + auto sp = dynamic_cast(param); + // search_params_ = static_cast>(sp.p); + cagra::search_params* search_params_ptr_ = static_cast(&search_params_); + *search_params_ptr_ = sp.p; + search_params_.merge_mode = sp.merge_mode; + refine_ratio_ = sp.refine_ratio; +} + +template +void cuvs_mg_cagra::set_search_dataset(const T* dataset, size_t nrow) +{ +} + +template +void cuvs_mg_cagra::save(const std::string& file) const +{ + cuvs::neighbors::mg::serialize(handle_, *index_, file); +} + +template +void cuvs_mg_cagra::load(const std::string& file) +{ + index_ = + std::make_shared, T, IdxT>>( + std::move(cuvs::neighbors::mg::deserialize_cagra(handle_, file))); +} + +template +std::unique_ptr> cuvs_mg_cagra::copy() +{ + return std::make_unique>(*this); // use copy constructor +} + +template +void cuvs_mg_cagra::search_base( + const T* queries, int batch_size, int k, algo_base::index_type* neighbors, float* distances) const +{ + static_assert(std::is_integral_v); + static_assert(std::is_integral_v); + + auto queries_view = + raft::make_host_matrix_view(queries, batch_size, dim_); + auto neighbors_view = + raft::make_host_matrix_view((IdxT*)neighbors, batch_size, k); + auto distances_view = + raft::make_host_matrix_view(distances, batch_size, k); + + cuvs::neighbors::mg::search( + handle_, *index_, search_params_, queries_view, neighbors_view, distances_view); +} + +template +void cuvs_mg_cagra::search( + const T* queries, int batch_size, int k, algo_base::index_type* neighbors, float* distances) const +{ + auto k0 = static_cast(refine_ratio_ * k); + const bool disable_refinement = k0 <= static_cast(k); + + if (disable_refinement) { + search_base(queries, batch_size, k, neighbors, distances); + } else { + throw std::runtime_error("refinement not supported"); + } +} +} // namespace cuvs::bench diff --git a/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_flat.cu b/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_flat.cu new file mode 100644 index 000000000..20cdc41e3 --- /dev/null +++ b/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_flat.cu @@ -0,0 +1,23 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +#include "cuvs_mg_ivf_flat_wrapper.h" + +namespace cuvs::bench { +template class cuvs_mg_ivf_flat; +// template class cuvs_mg_ivf_flat; +template class cuvs_mg_ivf_flat; +template class cuvs_mg_ivf_flat; +} // namespace cuvs::bench diff --git a/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_flat_wrapper.h b/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_flat_wrapper.h new file mode 100644 index 000000000..54a0d2fac --- /dev/null +++ b/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_flat_wrapper.h @@ -0,0 +1,140 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include "cuvs_ann_bench_utils.h" +#include "cuvs_ivf_flat_wrapper.h" +#include +#include + +namespace cuvs::bench { +using namespace cuvs::neighbors; + +template +class cuvs_mg_ivf_flat : public algo, public algo_gpu { + public: + using search_param_base = typename algo::search_param; + using algo::dim_; + + using build_param = cuvs::neighbors::mg::index_params; + + struct search_param : public cuvs::bench::cuvs_ivf_flat::search_param { + cuvs::neighbors::mg::sharded_merge_mode merge_mode; + }; + + cuvs_mg_ivf_flat(Metric metric, int dim, const build_param& param) + : algo(metric, dim), index_params_(param) + { + index_params_.metric = parse_metric_type(metric); + // init nccl clique outside as to not affect benchmark + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle_); + } + + void build(const T* dataset, size_t nrow) final; + void set_search_param(const search_param_base& param) override; + void search(const T* queries, + int batch_size, + int k, + algo_base::index_type* neighbors, + float* distances) const override; + + [[nodiscard]] auto get_preference() const -> algo_property override + { + algo_property property; + property.dataset_memory_type = MemoryType::kHost; + property.query_memory_type = MemoryType::kHost; + return property; + } + + [[nodiscard]] auto get_sync_stream() const noexcept -> cudaStream_t override + { + auto stream = raft::resource::get_cuda_stream(handle_); + return stream; + } + + [[nodiscard]] auto uses_stream() const noexcept -> bool override { return false; } + + void save(const std::string& file) const override; + void load(const std::string&) override; + std::unique_ptr> copy() override; + + private: + raft::device_resources handle_; + build_param index_params_; + cuvs::neighbors::mg::search_params search_params_; + std::shared_ptr, T, IdxT>> + index_; +}; + +template +void cuvs_mg_ivf_flat::build(const T* dataset, size_t nrow) +{ + auto dataset_view = + raft::make_host_matrix_view(dataset, IdxT(nrow), IdxT(dim_)); + auto idx = cuvs::neighbors::mg::build(handle_, index_params_, dataset_view); + index_ = std::make_shared< + cuvs::neighbors::mg::index, T, IdxT>>(std::move(idx)); +} + +template +void cuvs_mg_ivf_flat::set_search_param(const search_param_base& param) +{ + auto sp = dynamic_cast(param); + // search_params_ = sp.ivf_flat_params; + ivf_flat::search_params* search_params_ptr_ = + static_cast(&search_params_); + 
*search_params_ptr_ = sp.ivf_flat_params; + search_params_.merge_mode = sp.merge_mode; + assert(search_params_.n_probes <= index_params_.n_lists); +} + +template +void cuvs_mg_ivf_flat::save(const std::string& file) const +{ + cuvs::neighbors::mg::serialize(handle_, *index_, file); +} + +template +void cuvs_mg_ivf_flat::load(const std::string& file) +{ + index_ = std::make_shared< + cuvs::neighbors::mg::index, T, IdxT>>( + std::move(cuvs::neighbors::mg::deserialize_flat(handle_, file))); +} + +template +std::unique_ptr> cuvs_mg_ivf_flat::copy() +{ + return std::make_unique>(*this); // use copy constructor +} + +template +void cuvs_mg_ivf_flat::search( + const T* queries, int batch_size, int k, algo_base::index_type* neighbors, float* distances) const +{ + auto queries_view = raft::make_host_matrix_view( + queries, IdxT(batch_size), IdxT(dim_)); + auto neighbors_view = raft::make_host_matrix_view( + (IdxT*)neighbors, IdxT(batch_size), IdxT(k)); + auto distances_view = raft::make_host_matrix_view( + distances, IdxT(batch_size), IdxT(k)); + + cuvs::neighbors::mg::search( + handle_, *index_, search_params_, queries_view, neighbors_view, distances_view); +} + +} // namespace cuvs::bench \ No newline at end of file diff --git a/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_pq.cu b/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_pq.cu new file mode 100644 index 000000000..a74bab6f5 --- /dev/null +++ b/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_pq.cu @@ -0,0 +1,23 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#include "cuvs_mg_ivf_pq_wrapper.h" + +namespace cuvs::bench { +template class cuvs_mg_ivf_pq; +template class cuvs_mg_ivf_pq; +template class cuvs_mg_ivf_pq; +template class cuvs_mg_ivf_pq; +} // namespace cuvs::bench diff --git a/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_pq_wrapper.h b/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_pq_wrapper.h new file mode 100644 index 000000000..84aea7d4a --- /dev/null +++ b/cpp/bench/ann/src/cuvs/cuvs_mg_ivf_pq_wrapper.h @@ -0,0 +1,139 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
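The IVF-Flat and IVF-PQ multi-GPU wrappers follow the same pattern: indices are built, searched, and (de)serialized through the `cuvs::neighbors::mg` free functions using host-side matrix views, so the benchmark keeps data in host memory and lets the SNMG layer handle per-GPU distribution. A rough sketch of that call pattern outside the benchmark harness — the header name and the `mg::index_params<...>` / `mg::search_params<...>` spellings are assumptions inferred from the wrappers above, not verified against the installed headers:

```cpp
#include <cuvs/neighbors/ivf_flat.hpp>
#include <cuvs/neighbors/mg.hpp>  // assumed header for the SNMG API
#include <raft/core/device_resources.hpp>
#include <raft/core/host_mdspan.hpp>

// Sketch: build and search a sharded IVF-Flat index from host data (float/int64_t assumed).
void snmg_ivf_flat_sketch(const raft::device_resources& handle,
                          const float* dataset, int64_t n_rows, int64_t dim,
                          const float* queries, int64_t n_queries, int64_t k,
                          int64_t* neighbors, float* distances)
{
  namespace mg = cuvs::neighbors::mg;

  mg::index_params<cuvs::neighbors::ivf_flat::index_params> build_params;  // assumed template form
  build_params.mode = mg::distribution_mode::SHARDED;

  auto dataset_view =
    raft::make_host_matrix_view<const float, int64_t>(dataset, n_rows, dim);
  auto index = mg::build(handle, build_params, dataset_view);

  mg::search_params<cuvs::neighbors::ivf_flat::search_params> search_params;  // assumed template form
  search_params.merge_mode = mg::sharded_merge_mode::TREE_MERGE;

  auto queries_view   = raft::make_host_matrix_view<const float, int64_t>(queries, n_queries, dim);
  auto neighbors_view = raft::make_host_matrix_view<int64_t, int64_t>(neighbors, n_queries, k);
  auto distances_view = raft::make_host_matrix_view<float, int64_t>(distances, n_queries, k);
  mg::search(handle, index, search_params, queries_view, neighbors_view, distances_view);
}
```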
+ */ + +#pragma once + +#include "cuvs_ann_bench_utils.h" +#include "cuvs_ivf_pq_wrapper.h" +#include +#include + +namespace cuvs::bench { +using namespace cuvs::neighbors; + +template +class cuvs_mg_ivf_pq : public algo, public algo_gpu { + public: + using search_param_base = typename algo::search_param; + using algo::dim_; + + using build_param = cuvs::neighbors::mg::index_params; + + struct search_param : public cuvs::bench::cuvs_ivf_pq::search_param { + cuvs::neighbors::mg::sharded_merge_mode merge_mode; + }; + + cuvs_mg_ivf_pq(Metric metric, int dim, const build_param& param) + : algo(metric, dim), index_params_(param) + { + index_params_.metric = parse_metric_type(metric); + // init nccl clique outside as to not affect benchmark + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle_); + } + + void build(const T* dataset, size_t nrow) final; + void set_search_param(const search_param_base& param) override; + void search(const T* queries, + int batch_size, + int k, + algo_base::index_type* neighbors, + float* distances) const override; + + [[nodiscard]] auto get_preference() const -> algo_property override + { + algo_property property; + property.dataset_memory_type = MemoryType::kHost; + property.query_memory_type = MemoryType::kHost; + return property; + } + + [[nodiscard]] auto get_sync_stream() const noexcept -> cudaStream_t override + { + auto stream = raft::resource::get_cuda_stream(handle_); + return stream; + } + + [[nodiscard]] auto uses_stream() const noexcept -> bool override { return false; } + + void save(const std::string& file) const override; + void load(const std::string&) override; + std::unique_ptr> copy() override; + + private: + raft::device_resources handle_; + build_param index_params_; + cuvs::neighbors::mg::search_params search_params_; + std::shared_ptr, T, IdxT>> index_; +}; + +template +void cuvs_mg_ivf_pq::build(const T* dataset, size_t nrow) +{ + auto dataset_view = + raft::make_host_matrix_view(dataset, IdxT(nrow), IdxT(dim_)); + auto idx = cuvs::neighbors::mg::build(handle_, index_params_, dataset_view); + index_ = + std::make_shared, T, IdxT>>( + std::move(idx)); +} + +template +void cuvs_mg_ivf_pq::set_search_param(const search_param_base& param) +{ + auto sp = dynamic_cast(param); + // search_params_ = static_cast>(sp.pq_param); + ivf_pq::search_params* search_params_ptr_ = static_cast(&search_params_); + *search_params_ptr_ = sp.pq_param; + search_params_.merge_mode = sp.merge_mode; + assert(search_params_.n_probes <= index_params_.n_lists); +} + +template +void cuvs_mg_ivf_pq::save(const std::string& file) const +{ + cuvs::neighbors::mg::serialize(handle_, *index_, file); +} + +template +void cuvs_mg_ivf_pq::load(const std::string& file) +{ + index_ = + std::make_shared, T, IdxT>>( + std::move(cuvs::neighbors::mg::deserialize_pq(handle_, file))); +} + +template +std::unique_ptr> cuvs_mg_ivf_pq::copy() +{ + return std::make_unique>(*this); // use copy constructor +} + +template +void cuvs_mg_ivf_pq::search( + const T* queries, int batch_size, int k, algo_base::index_type* neighbors, float* distances) const +{ + auto queries_view = raft::make_host_matrix_view( + queries, IdxT(batch_size), IdxT(dim_)); + auto neighbors_view = raft::make_host_matrix_view( + (IdxT*)neighbors, IdxT(batch_size), IdxT(k)); + auto distances_view = raft::make_host_matrix_view( + distances, IdxT(batch_size), IdxT(k)); + + cuvs::neighbors::mg::search( + handle_, *index_, search_params_, queries_view, neighbors_view, distances_view); +} + +} // 
namespace cuvs::bench \ No newline at end of file diff --git a/cpp/cmake/thirdparty/get_cuvs.cmake b/cpp/cmake/thirdparty/get_cuvs.cmake new file mode 100644 index 000000000..c21cccbcc --- /dev/null +++ b/cpp/cmake/thirdparty/get_cuvs.cmake @@ -0,0 +1,64 @@ +# ============================================================================= +# Copyright (c) 2023-2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except +# in compliance with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software distributed under the License +# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express +# or implied. See the License for the specific language governing permissions and limitations under +# the License. + +# Use RAPIDS_VERSION_MAJOR_MINOR from rapids_config.cmake +set(CUVS_VERSION "${RAPIDS_VERSION_MAJOR_MINOR}") +set(CUVS_FORK "rapidsai") +set(CUVS_PINNED_TAG "branch-${RAPIDS_VERSION_MAJOR_MINOR}") + +function(find_and_configure_cuvs) + set(oneValueArgs VERSION FORK PINNED_TAG ENABLE_NVTX CLONE_ON_PIN BUILD_CPU_ONLY BUILD_SHARED_LIBS) + cmake_parse_arguments(PKG "${options}" "${oneValueArgs}" + "${multiValueArgs}" ${ARGN} ) + + if(PKG_CLONE_ON_PIN AND NOT PKG_PINNED_TAG STREQUAL "branch-${CUVS_VERSION}") + message(STATUS "cuVS: pinned tag found: ${PKG_PINNED_TAG}. Cloning cuVS locally.") + set(CPM_DOWNLOAD_cuvs ON) + endif() + + #----------------------------------------------------- + # Invoke CPM find_package() + #----------------------------------------------------- + rapids_cpm_find(cuvs ${PKG_VERSION} + GLOBAL_TARGETS cuvs::cuvs + BUILD_EXPORT_SET cuvs-bench-exports + INSTALL_EXPORT_SET cuvs-bench-exports + COMPONENTS cuvs + CPM_ARGS + GIT_REPOSITORY https://github.com/${PKG_FORK}/cuvs.git + GIT_TAG ${PKG_PINNED_TAG} + SOURCE_SUBDIR cpp + OPTIONS + "BUILD_SHARED_LIBS ${PKG_BUILD_SHARED_LIBS}" + "BUILD_CPU_ONLY ${PKG_BUILD_CPU_ONLY}" + "BUILD_TESTS OFF" + "BUILD_CAGRA_HNSWLIB OFF" + "CUVS_CLONE_ON_PIN ${PKG_CLONE_ON_PIN}" + ) +endfunction() + + +# Change pinned tag here to test a commit in CI +# To use a different cuVS locally, set the CMake variable +# CPM_cuvs_SOURCE=/path/to/local/cuvs +find_and_configure_cuvs(VERSION ${CUVS_VERSION}.00 + FORK ${CUVS_FORK} + PINNED_TAG ${CUVS_PINNED_TAG} + ENABLE_NVTX OFF + # When PINNED_TAG above doesn't match the default rapids branch, + # force local cuvs clone in build directory + # even if it's already installed. + CLONE_ON_PIN ${CUVS_CLONE_ON_PIN} + BUILD_CPU_ONLY ${BUILD_CPU_ONLY} + BUILD_SHARED_LIBS ${BUILD_SHARED_LIBS} +) diff --git a/cpp/doxygen/Doxyfile b/cpp/doxygen/Doxyfile index 2459d521d..e28572457 100644 --- a/cpp/doxygen/Doxyfile +++ b/cpp/doxygen/Doxyfile @@ -2149,7 +2149,7 @@ INCLUDE_FILE_PATTERNS = # recursively expanded use the := operator instead of the = operator. # This tag requires that the tag ENABLE_PREPROCESSING is set to YES. -PREDEFINED = +PREDEFINED = "CUVS_BUILD_MG_ALGOS=1" # If the MACRO_EXPANSION and EXPAND_ONLY_PREDEF tags are set to YES then this # tag can be used to specify a list of macro names that should be expanded. 
The diff --git a/cpp/include/cuvs/neighbors/cagra.h b/cpp/include/cuvs/neighbors/cagra.h index 241f5d8b0..14331ebbc 100644 --- a/cpp/include/cuvs/neighbors/cagra.h +++ b/cpp/include/cuvs/neighbors/cagra.h @@ -267,6 +267,15 @@ cuvsError_t cuvsCagraIndexCreate(cuvsCagraIndex_t* index); */ cuvsError_t cuvsCagraIndexDestroy(cuvsCagraIndex_t index); +/** + * @brief Get dimension of the CAGRA index + * + * @param[in] index CAGRA index + * @param[out] dim return dimension of the index + * @return cuvsError_t + */ +cuvsError_t cuvsCagraIndexGetDims(cuvsCagraIndex_t index, int* dim); + /** * @} */ @@ -338,7 +347,7 @@ cuvsError_t cuvsCagraBuild(cuvsResources_t res, * with the same type of `queries`, such that `index.dtype.code == * queries.dl_tensor.dtype.code` Types for input are: * 1. `queries`: - *` a. kDLDataType.code == kDLFloat` and `kDLDataType.bits = 32` + * a. `kDLDataType.code == kDLFloat` and `kDLDataType.bits = 32` * b. `kDLDataType.code == kDLInt` and `kDLDataType.bits = 8` * c. `kDLDataType.code == kDLUInt` and `kDLDataType.bits = 8` * 2. `neighbors`: `kDLDataType.code == kDLUInt` and `kDLDataType.bits = 32` diff --git a/cpp/include/cuvs/neighbors/common.hpp b/cpp/include/cuvs/neighbors/common.hpp index 73ce80b41..60b8cc122 100644 --- a/cpp/include/cuvs/neighbors/common.hpp +++ b/cpp/include/cuvs/neighbors/common.hpp @@ -19,7 +19,8 @@ #include #include #include -#include +#include +#include #include #include #include // get_device_for_address @@ -636,5 +637,56 @@ enable_if_valid_list_t deserialize_list(const raft::resources& handle, const typename ListT::spec_type& store_spec, const typename ListT::spec_type& device_spec); } // namespace ivf +} // namespace cuvs::neighbors + +namespace cuvs::neighbors { +using namespace raft; + +template +struct iface { + iface() : mutex_(std::make_shared()) {} + + const IdxT size() const { return index_.value().size(); } + + std::optional index_; + std::shared_ptr mutex_; +}; + +template +void build(const raft::device_resources& handle, + cuvs::neighbors::iface& interface, + const cuvs::neighbors::index_params* index_params, + raft::mdspan, row_major, Accessor> index_dataset); + +template +void extend( + const raft::device_resources& handle, + cuvs::neighbors::iface& interface, + raft::mdspan, row_major, Accessor1> new_vectors, + std::optional, layout_c_contiguous, Accessor2>> + new_indices); + +template +void search(const raft::device_resources& handle, + const cuvs::neighbors::iface& interface, + const cuvs::neighbors::search_params* search_params, + raft::device_matrix_view h_queries, + raft::device_matrix_view d_neighbors, + raft::device_matrix_view d_distances); + +template +void serialize(const raft::device_resources& handle, + const cuvs::neighbors::iface& interface, + std::ostream& os); + +template +void deserialize(const raft::device_resources& handle, + cuvs::neighbors::iface& interface, + std::istream& is); + +template +void deserialize(const raft::device_resources& handle, + cuvs::neighbors::iface& interface, + const std::string& filename); }; // namespace cuvs::neighbors diff --git a/cpp/include/cuvs/neighbors/hnsw.h b/cpp/include/cuvs/neighbors/hnsw.h index 5e94de60a..0495c574a 100644 --- a/cpp/include/cuvs/neighbors/hnsw.h +++ b/cpp/include/cuvs/neighbors/hnsw.h @@ -105,8 +105,10 @@ cuvsError_t cuvsHnswIndexDestroy(cuvsHnswIndex_t index); * with the same type of `queries`, such that `index.dtype.code == * queries.dl_tensor.dtype.code` * Supported types for input are: - * 1. 
`queries`: `kDLDataType.code == kDLFloat` or `kDLDataType.code == kDLInt` and - * `kDLDataType.bits = 32` + * 1. `queries`: + * a. `kDLDataType.code == kDLFloat` and `kDLDataType.bits = 32` + * b. `kDLDataType.code == kDLInt` and `kDLDataType.bits = 8` + * c. `kDLDataType.code == kDLUInt` and `kDLDataType.bits = 8` * 2. `neighbors`: `kDLDataType.code == kDLUInt` and `kDLDataType.bits = 64` * 3. `distances`: `kDLDataType.code == kDLFloat` and `kDLDataType.bits = 32` * NOTE: The HNSW index can only be searched by the hnswlib wrapper in cuVS, diff --git a/cpp/include/cuvs/neighbors/hnsw.hpp b/cpp/include/cuvs/neighbors/hnsw.hpp index 007adef0d..d5abd6d55 100644 --- a/cpp/include/cuvs/neighbors/hnsw.hpp +++ b/cpp/include/cuvs/neighbors/hnsw.hpp @@ -173,6 +173,8 @@ std::unique_ptr> from_cagra( /**@}*/ +// TODO: Filtered Search APIs: https://github.com/rapidsai/cuvs/issues/363 + /** * @defgroup hnsw_cpp_index_search Search hnswlib index * @{ @@ -260,7 +262,7 @@ void search(raft::resources const& res, void search(raft::resources const& res, const search_params& params, const index& idx, - raft::host_matrix_view queries, + raft::host_matrix_view queries, raft::host_matrix_view neighbors, raft::host_matrix_view distances); @@ -303,7 +305,7 @@ void search(raft::resources const& res, void search(raft::resources const& res, const search_params& params, const index& idx, - raft::host_matrix_view queries, + raft::host_matrix_view queries, raft::host_matrix_view neighbors, raft::host_matrix_view distances); diff --git a/cpp/include/cuvs/neighbors/ivf_flat.hpp b/cpp/include/cuvs/neighbors/ivf_flat.hpp index 67d1b46c0..7f852d635 100644 --- a/cpp/include/cuvs/neighbors/ivf_flat.hpp +++ b/cpp/include/cuvs/neighbors/ivf_flat.hpp @@ -1168,7 +1168,7 @@ void extend(raft::resources const& handle, */ void search(raft::resources const& handle, const cuvs::neighbors::ivf_flat::search_params& params, - cuvs::neighbors::ivf_flat::index& index, + const cuvs::neighbors::ivf_flat::index& index, raft::device_matrix_view queries, raft::device_matrix_view neighbors, raft::device_matrix_view distances, @@ -1209,7 +1209,7 @@ void search(raft::resources const& handle, */ void search(raft::resources const& handle, const cuvs::neighbors::ivf_flat::search_params& params, - cuvs::neighbors::ivf_flat::index& index, + const cuvs::neighbors::ivf_flat::index& index, raft::device_matrix_view queries, raft::device_matrix_view neighbors, raft::device_matrix_view distances, @@ -1250,7 +1250,7 @@ void search(raft::resources const& handle, */ void search(raft::resources const& handle, const cuvs::neighbors::ivf_flat::search_params& params, - cuvs::neighbors::ivf_flat::index& index, + const cuvs::neighbors::ivf_flat::index& index, raft::device_matrix_view queries, raft::device_matrix_view neighbors, raft::device_matrix_view distances, diff --git a/cpp/include/cuvs/neighbors/ivf_pq.hpp b/cpp/include/cuvs/neighbors/ivf_pq.hpp index 3ce5f382f..ae543c9e9 100644 --- a/cpp/include/cuvs/neighbors/ivf_pq.hpp +++ b/cpp/include/cuvs/neighbors/ivf_pq.hpp @@ -1221,6 +1221,75 @@ void extend(raft::resources const& handle, std::optional> new_indices, cuvs::neighbors::ivf_pq::index* idx); +/** + * @brief Extend the index with the new data. + * + * Note, the user can set a stream pool in the input raft::resource with + * at least one stream to enable kernel and copy overlapping. 
+ * + * Usage example: + * @code{.cpp} + * using namespace cuvs::neighbors; + * ivf_pq::index_params index_params; + * index_params.add_data_on_build = false; // don't populate index on build + * index_params.kmeans_trainset_fraction = 1.0; // use whole dataset for kmeans training + * // train the index from a [N, D] dataset + * auto index_empty = ivf_pq::build(handle, index_params, dataset); + * // optional: create a stream pool with at least one stream to enable kernel and copy + * // overlapping + * raft::resource::set_cuda_stream_pool(handle, std::make_shared(1)); + * // fill the index with the data + * std::optional> no_op = std::nullopt; + * auto index = ivf_pq::extend(handle, new_vectors, no_op, index_empty); + * @endcode + * + * @param[in] handle + * @param[in] new_vectors a host matrix view to a row-major matrix [n_rows, idx.dim()] + * @param[in] new_indices a host vector view to a vector of indices [n_rows]. + * If the original index is empty (`idx.size() == 0`), you can pass `std::nullopt` + * here to imply a continuous range `[0...n_rows)`. + * @param[inout] idx + */ +auto extend(raft::resources const& handle, + raft::host_matrix_view new_vectors, + std::optional> new_indices, + const cuvs::neighbors::ivf_pq::index& idx) + -> cuvs::neighbors::ivf_pq::index; + +/** + * @brief Extend the index with the new data. + * + * Note, the user can set a stream pool in the input raft::resource with + * at least one stream to enable kernel and copy overlapping. + * + * Usage example: + * @code{.cpp} + * using namespace cuvs::neighbors; + * ivf_pq::index_params index_params; + * index_params.add_data_on_build = false; // don't populate index on build + * index_params.kmeans_trainset_fraction = 1.0; // use whole dataset for kmeans training + * // train the index from a [N, D] dataset + * auto index_empty = ivf_pq::build(handle, index_params, dataset); + * // optional: create a stream pool with at least one stream to enable kernel and copy + * // overlapping + * raft::resource::set_cuda_stream_pool(handle, std::make_shared(1)); + * // fill the index with the data + * std::optional> no_op = std::nullopt; + * ivf_pq::extend(handle, new_vectors, no_op, &index_empty); + * @endcode + * + * @param[in] handle + * @param[in] new_vectors a host matrix view to a row-major matrix [n_rows, idx.dim()] + * @param[in] new_indices a host vector view to a vector of indices [n_rows]. + * If the original index is empty (`idx.size() == 0`), you can pass `std::nullopt` + * here to imply a continuous range `[0...n_rows)`. + * @param[inout] idx + */ +void extend(raft::resources const& handle, + raft::host_matrix_view new_vectors, + std::optional> new_indices, + cuvs::neighbors::ivf_pq::index* idx); + /** * @brief Extend the index with the new data. 
* @@ -1405,7 +1474,7 @@ void extend(raft::resources const& handle, */ void search(raft::resources const& handle, const cuvs::neighbors::ivf_pq::search_params& search_params, - cuvs::neighbors::ivf_pq::index& index, + const cuvs::neighbors::ivf_pq::index& index, raft::device_matrix_view queries, raft::device_matrix_view neighbors, raft::device_matrix_view distances, @@ -1450,7 +1519,7 @@ void search(raft::resources const& handle, */ void search(raft::resources const& handle, const cuvs::neighbors::ivf_pq::search_params& search_params, - cuvs::neighbors::ivf_pq::index& index, + const cuvs::neighbors::ivf_pq::index& index, raft::device_matrix_view queries, raft::device_matrix_view neighbors, raft::device_matrix_view distances, @@ -1495,7 +1564,7 @@ void search(raft::resources const& handle, */ void search(raft::resources const& handle, const cuvs::neighbors::ivf_pq::search_params& search_params, - cuvs::neighbors::ivf_pq::index& index, + const cuvs::neighbors::ivf_pq::index& index, raft::device_matrix_view queries, raft::device_matrix_view neighbors, raft::device_matrix_view distances, @@ -1540,7 +1609,7 @@ void search(raft::resources const& handle, */ void search(raft::resources const& handle, const cuvs::neighbors::ivf_pq::search_params& search_params, - cuvs::neighbors::ivf_pq::index& index, + const cuvs::neighbors::ivf_pq::index& index, raft::device_matrix_view queries, raft::device_matrix_view neighbors, raft::device_matrix_view distances, diff --git a/cpp/include/cuvs/neighbors/mg.hpp b/cpp/include/cuvs/neighbors/mg.hpp new file mode 100644 index 000000000..4657fa8fb --- /dev/null +++ b/cpp/include/cuvs/neighbors/mg.hpp @@ -0,0 +1,1367 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#ifdef CUVS_BUILD_MG_ALGOS + +#include +#include + +#include +#include + +#include +#include +#include +#include + +#define DEFAULT_SEARCH_BATCH_SIZE 1 << 20 + +/// \defgroup mg_cpp_index_params ANN MG index build parameters + +namespace cuvs::neighbors::mg { +/** Distribution mode */ +/// \ingroup mg_cpp_index_params +enum distribution_mode { + /** Index is replicated on each device, favors throughput */ + REPLICATED, + /** Index is split on several devices, favors scaling */ + SHARDED +}; + +/// \defgroup mg_cpp_search_params ANN MG search parameters + +/** Search mode when using a replicated index */ +/// \ingroup mg_cpp_search_params +enum replicated_search_mode { + /** Search queries are splited to maintain equal load on GPUs */ + LOAD_BALANCER, + /** Each search query is processed by a single GPU in a round-robin fashion */ + ROUND_ROBIN +}; + +/** Merge mode when using a sharded index */ +/// \ingroup mg_cpp_search_params +enum sharded_merge_mode { + /** Search batches are merged on the root rank */ + MERGE_ON_ROOT_RANK, + /** Search batches are merged in a tree reduction fashion */ + TREE_MERGE +}; + +/** Build parameters */ +/// \ingroup mg_cpp_index_params +template +struct index_params : public Upstream { + index_params() : mode(SHARDED) {} + + index_params(const Upstream& sp) : Upstream(sp), mode(SHARDED) {} + + /** Distribution mode */ + cuvs::neighbors::mg::distribution_mode mode = SHARDED; +}; + +/** Search parameters */ +/// \ingroup mg_cpp_search_params +template +struct search_params : public Upstream { + search_params() : search_mode(LOAD_BALANCER), merge_mode(TREE_MERGE) {} + + search_params(const Upstream& sp) + : Upstream(sp), search_mode(LOAD_BALANCER), merge_mode(TREE_MERGE) + { + } + + /** Replicated search mode */ + cuvs::neighbors::mg::replicated_search_mode search_mode = LOAD_BALANCER; + /** Sharded merge mode */ + cuvs::neighbors::mg::sharded_merge_mode merge_mode = TREE_MERGE; +}; + +} // namespace cuvs::neighbors::mg + +namespace cuvs::neighbors::mg { + +using namespace raft; + +template +struct index { + index(distribution_mode mode, int num_ranks_); + index(const raft::device_resources& handle, const std::string& filename); + + index(const index&) = delete; + index(index&&) = default; + auto operator=(const index&) -> index& = delete; + auto operator=(index&&) -> index& = default; + + distribution_mode mode_; + int num_ranks_; + std::vector> ann_interfaces_; + + // for load balancing mechanism + std::shared_ptr> round_robin_counter_; +}; + +/// \defgroup mg_cpp_index_build ANN MG index build + +/// \ingroup mg_cpp_index_build +/** + * @brief Builds a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * @endcode + * + * @param[in] handle + * @param[in] index_params configure the index building + * @param[in] index_dataset a row-major matrix on host [n_rows, dim] + * + * @return the constructed IVF-Flat MG index + */ +auto build(const raft::device_resources& handle, + const mg::index_params& index_params, + raft::host_matrix_view index_dataset) + -> index, float, int64_t>; + +/// \ingroup mg_cpp_index_build +/** + * @brief Builds a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * @endcode + * + * 
@param[in] handle + * @param[in] index_params configure the index building + * @param[in] index_dataset a row-major matrix on host [n_rows, dim] + * + * @return the constructed IVF-Flat MG index + */ +auto build(const raft::device_resources& handle, + const mg::index_params& index_params, + raft::host_matrix_view index_dataset) + -> index, int8_t, int64_t>; + +/// \ingroup mg_cpp_index_build +/** + * @brief Builds a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * @endcode + * + * @param[in] handle + * @param[in] index_params configure the index building + * @param[in] index_dataset a row-major matrix on host [n_rows, dim] + * + * @return the constructed IVF-Flat MG index + */ +auto build(const raft::device_resources& handle, + const mg::index_params& index_params, + raft::host_matrix_view index_dataset) + -> index, uint8_t, int64_t>; + +/// \ingroup mg_cpp_index_build +/** + * @brief Builds a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * @endcode + * + * @param[in] handle + * @param[in] index_params configure the index building + * @param[in] index_dataset a row-major matrix on host [n_rows, dim] + * + * @return the constructed IVF-PQ MG index + */ +auto build(const raft::device_resources& handle, + const mg::index_params& index_params, + raft::host_matrix_view index_dataset) + -> index, float, int64_t>; + +/// \ingroup mg_cpp_index_build +/** + * @brief Builds a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * @endcode + * + * @param[in] handle + * @param[in] index_params configure the index building + * @param[in] index_dataset a row-major matrix on host [n_rows, dim] + * + * @return the constructed IVF-PQ MG index + */ +auto build(const raft::device_resources& handle, + const mg::index_params& index_params, + raft::host_matrix_view index_dataset) + -> index, half, int64_t>; + +/// \ingroup mg_cpp_index_build +/** + * @brief Builds a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * @endcode + * + * @param[in] handle + * @param[in] index_params configure the index building + * @param[in] index_dataset a row-major matrix on host [n_rows, dim] + * + * @return the constructed IVF-PQ MG index + */ +auto build(const raft::device_resources& handle, + const mg::index_params& index_params, + raft::host_matrix_view index_dataset) + -> index, int8_t, int64_t>; + +/// \ingroup mg_cpp_index_build +/** + * @brief Builds a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * @endcode + * + * @param[in] handle + * @param[in] index_params configure the index building + * @param[in] index_dataset a row-major matrix on host [n_rows, dim] + * + * @return the constructed IVF-PQ MG index + */ +auto build(const raft::device_resources& handle, + const mg::index_params& 
index_params, + raft::host_matrix_view index_dataset) + -> index, uint8_t, int64_t>; + +/// \ingroup mg_cpp_index_build +/** + * @brief Builds a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * @endcode + * + * @param[in] handle + * @param[in] index_params configure the index building + * @param[in] index_dataset a row-major matrix on host [n_rows, dim] + * + * @return the constructed CAGRA MG index + */ +auto build(const raft::device_resources& handle, + const mg::index_params& index_params, + raft::host_matrix_view index_dataset) + -> index, float, uint32_t>; + +/// \ingroup mg_cpp_index_build +/** + * @brief Builds a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * @endcode + * + * @param[in] handle + * @param[in] index_params configure the index building + * @param[in] index_dataset a row-major matrix on host [n_rows, dim] + * + * @return the constructed CAGRA MG index + */ +auto build(const raft::device_resources& handle, + const mg::index_params& index_params, + raft::host_matrix_view index_dataset) + -> index, half, uint32_t>; + +/// \ingroup mg_cpp_index_build +/** + * @brief Builds a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * @endcode + * + * @param[in] handle + * @param[in] index_params configure the index building + * @param[in] index_dataset a row-major matrix on host [n_rows, dim] + * + * @return the constructed CAGRA MG index + */ +auto build(const raft::device_resources& handle, + const mg::index_params& index_params, + raft::host_matrix_view index_dataset) + -> index, int8_t, uint32_t>; + +/// \ingroup mg_cpp_index_build +/** + * @brief Builds a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * @endcode + * + * @param[in] handle + * @param[in] index_params configure the index building + * @param[in] index_dataset a row-major matrix on host [n_rows, dim] + * + * @return the constructed CAGRA MG index + */ +auto build(const raft::device_resources& handle, + const mg::index_params& index_params, + raft::host_matrix_view index_dataset) + -> index, uint8_t, uint32_t>; + +/// \defgroup mg_cpp_index_extend ANN MG index extend + +/// \ingroup mg_cpp_index_extend +/** + * @brief Extends a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::extend(handle, index, new_vectors, std::nullopt); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] new_vectors a row-major matrix on host [n_rows, dim] + * @param[in] new_indices optional vector on host [n_rows], + * `std::nullopt` means default continuous range `[0...n_rows)` + * + */ +void extend(const raft::device_resources& handle, + index, float, int64_t>& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices); + 
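
As a reading aid for the SNMG declarations above, here is a minimal sketch of building a sharded IVF-Flat multi-GPU index from host data and then extending it. The handle, matrix extents, and exact template spellings are assumptions on my part; only the `cuvs::neighbors::mg` entry points, parameter structs, and the `CUVS_BUILD_MG_ALGOS` requirement come from this header.

```cpp
// Hypothetical end-to-end sketch of the sharded SNMG path (build + extend) declared above.
// Matrix extents and contents are placeholders; the header is only usable when cuVS is
// built with CUVS_BUILD_MG_ALGOS=ON.
#include <cuvs/neighbors/ivf_flat.hpp>
#include <cuvs/neighbors/mg.hpp>

#include <raft/core/device_resources.hpp>
#include <raft/core/host_mdarray.hpp>
#include <raft/core/mdspan.hpp>

#include <optional>

void snmg_ivf_flat_sketch(const raft::device_resources& handle)
{
  int64_t n_rows = 100'000, n_new = 10'000, dim = 96;

  // Host-resident inputs (fill omitted); the SNMG API consumes host matrices and
  // distributes the work across the visible GPUs internally.
  auto dataset     = raft::make_host_matrix<float, int64_t>(n_rows, dim);
  auto new_vectors = raft::make_host_matrix<float, int64_t>(n_new, dim);

  // Wrap the single-GPU IVF-Flat build parameters; mode defaults to SHARDED,
  // i.e. the index is split across devices to scale to larger datasets.
  cuvs::neighbors::mg::index_params<cuvs::neighbors::ivf_flat::index_params> index_params;
  index_params.mode = cuvs::neighbors::mg::SHARDED;

  auto index =
    cuvs::neighbors::mg::build(handle, index_params, raft::make_const_mdspan(dataset.view()));

  // std::nullopt lets the library assign the continuous id range [0 .. n_new).
  cuvs::neighbors::mg::extend(
    handle, index, raft::make_const_mdspan(new_vectors.view()), std::nullopt);
}
```

Passing `REPLICATED` instead keeps a full copy of the index on every GPU, trading memory for query throughput; that trade-off is what the `distribution_mode` enum above encodes.
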
+/// \ingroup mg_cpp_index_extend +/** + * @brief Extends a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::extend(handle, index, new_vectors, std::nullopt); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] new_vectors a row-major matrix on host [n_rows, dim] + * @param[in] new_indices optional vector on host [n_rows], + * `std::nullopt` means default continuous range `[0...n_rows)` + * + */ +void extend(const raft::device_resources& handle, + index, int8_t, int64_t>& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices); + +/// \ingroup mg_cpp_index_extend +/** + * @brief Extends a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::extend(handle, index, new_vectors, std::nullopt); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] new_vectors a row-major matrix on host [n_rows, dim] + * @param[in] new_indices optional vector on host [n_rows], + * `std::nullopt` means default continuous range `[0...n_rows)` + * + */ +void extend(const raft::device_resources& handle, + index, uint8_t, int64_t>& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices); + +/// \ingroup mg_cpp_index_extend +/** + * @brief Extends a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::extend(handle, index, new_vectors, std::nullopt); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] new_vectors a row-major matrix on host [n_rows, dim] + * @param[in] new_indices optional vector on host [n_rows], + * `std::nullopt` means default continuous range `[0...n_rows)` + * + */ +void extend(const raft::device_resources& handle, + index, float, int64_t>& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices); + +/// \ingroup mg_cpp_index_extend +/** + * @brief Extends a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::extend(handle, index, new_vectors, std::nullopt); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] new_vectors a row-major matrix on host [n_rows, dim] + * @param[in] new_indices optional vector on host [n_rows], + * `std::nullopt` means default continuous range `[0...n_rows)` + * + */ +void extend(const raft::device_resources& handle, + index, half, int64_t>& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices); + +/// \ingroup mg_cpp_index_extend +/** + * @brief Extends a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::extend(handle, index, new_vectors, std::nullopt); + * @endcode + * + * @param[in] handle + * 
@param[in] index the pre-built index + * @param[in] new_vectors a row-major matrix on host [n_rows, dim] + * @param[in] new_indices optional vector on host [n_rows], + * `std::nullopt` means default continuous range `[0...n_rows)` + * + */ +void extend(const raft::device_resources& handle, + index, int8_t, int64_t>& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices); + +/// \ingroup mg_cpp_index_extend +/** + * @brief Extends a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::extend(handle, index, new_vectors, std::nullopt); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] new_vectors a row-major matrix on host [n_rows, dim] + * @param[in] new_indices optional vector on host [n_rows], + * `std::nullopt` means default continuous range `[0...n_rows)` + * + */ +void extend(const raft::device_resources& handle, + index, uint8_t, int64_t>& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices); + +/// \ingroup mg_cpp_index_extend +/** + * @brief Extends a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::extend(handle, index, new_vectors, std::nullopt); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] new_vectors a row-major matrix on host [n_rows, dim] + * @param[in] new_indices optional vector on host [n_rows], + * `std::nullopt` means default continuous range `[0...n_rows)` + * + */ +void extend(const raft::device_resources& handle, + index, float, uint32_t>& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices); + +/// \ingroup mg_cpp_index_extend +/** + * @brief Extends a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::extend(handle, index, new_vectors, std::nullopt); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] new_vectors a row-major matrix on host [n_rows, dim] + * @param[in] new_indices optional vector on host [n_rows], + * `std::nullopt` means default continuous range `[0...n_rows)` + * + */ +void extend(const raft::device_resources& handle, + index, half, uint32_t>& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices); + +/// \ingroup mg_cpp_index_extend +/** + * @brief Extends a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::extend(handle, index, new_vectors, std::nullopt); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] new_vectors a row-major matrix on host [n_rows, dim] + * @param[in] new_indices optional vector on host [n_rows], + * `std::nullopt` means default continuous range `[0...n_rows)` + * + */ +void extend(const raft::device_resources& handle, + index, int8_t, uint32_t>& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices); 
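
Complementing the sharded sketch above, a hedged example of the replicated path using the IVF-PQ overloads together with the multi-GPU search entry points declared further below. Again, the extents, data, and exact template spellings are illustrative assumptions rather than a definitive implementation.

```cpp
// Hypothetical sketch of a REPLICATED IVF-PQ multi-GPU index with round-robin query
// dispatch, relying on the mg::search overloads declared later in this header.
#include <cuvs/neighbors/ivf_pq.hpp>
#include <cuvs/neighbors/mg.hpp>

#include <raft/core/device_resources.hpp>
#include <raft/core/host_mdarray.hpp>
#include <raft/core/mdspan.hpp>

void snmg_ivf_pq_replicated_sketch(const raft::device_resources& handle)
{
  int64_t n_rows = 1'000'000, n_queries = 4096, dim = 128, k = 10;

  auto dataset   = raft::make_host_matrix<float, int64_t>(n_rows, dim);
  auto queries   = raft::make_host_matrix<float, int64_t>(n_queries, dim);
  auto neighbors = raft::make_host_matrix<int64_t, int64_t>(n_queries, k);
  auto distances = raft::make_host_matrix<float, int64_t>(n_queries, k);

  // REPLICATED keeps a full copy of the index on every GPU, favoring query throughput.
  cuvs::neighbors::mg::index_params<cuvs::neighbors::ivf_pq::index_params> index_params;
  index_params.mode = cuvs::neighbors::mg::REPLICATED;
  auto index = cuvs::neighbors::mg::build(
    handle, index_params, raft::make_const_mdspan(dataset.view()));

  // ROUND_ROBIN sends each query batch to one GPU in turn; LOAD_BALANCER (the default)
  // splits every batch across GPUs. The last argument caps the per-batch row count.
  cuvs::neighbors::mg::search_params<cuvs::neighbors::ivf_pq::search_params> search_params;
  search_params.search_mode = cuvs::neighbors::mg::ROUND_ROBIN;
  cuvs::neighbors::mg::search(handle,
                              index,
                              search_params,
                              raft::make_const_mdspan(queries.view()),
                              neighbors.view(),
                              distances.view(),
                              /*n_rows_per_batch=*/1 << 16);
}
```

For a sharded index, the analogous knob is `merge_mode`, which selects whether per-shard results are merged on the root rank or via a tree reduction.
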
+ +/// \ingroup mg_cpp_index_extend +/** + * @brief Extends a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::extend(handle, index, new_vectors, std::nullopt); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] new_vectors a row-major matrix on host [n_rows, dim] + * @param[in] new_indices optional vector on host [n_rows], + * `std::nullopt` means default continuous range `[0...n_rows)` + * + */ +void extend(const raft::device_resources& handle, + index, uint8_t, uint32_t>& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices); + +/// \defgroup mg_cpp_index_search ANN MG index search + +/// \ingroup mg_cpp_index_search +/** + * @brief Searches a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::search_params search_params; + * cuvs::neighbors::mg::search(handle, index, search_params, queries, neighbors, + * distances); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] search_params configure the index search + * @param[in] queries a row-major matrix on host [n_rows, dim] + * @param[out] neighbors a row-major matrix on host [n_rows, n_neighbors] + * @param[out] distances a row-major matrix on host [n_rows, n_neighbors] + * @param[in] n_rows_per_batch (optional) search batch size + * + */ +void search(const raft::device_resources& handle, + const index, float, int64_t>& index, + const mg::search_params& search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch = DEFAULT_SEARCH_BATCH_SIZE); + +/// \ingroup mg_cpp_index_search +/** + * @brief Searches a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::search_params search_params; + * cuvs::neighbors::mg::search(handle, index, search_params, queries, neighbors, + * distances); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] search_params configure the index search + * @param[in] queries a row-major matrix on host [n_rows, dim] + * @param[out] neighbors a row-major matrix on host [n_rows, n_neighbors] + * @param[out] distances a row-major matrix on host [n_rows, n_neighbors] + * @param[in] n_rows_per_batch (optional) search batch size + * + */ +void search(const raft::device_resources& handle, + const index, int8_t, int64_t>& index, + const mg::search_params& search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch = DEFAULT_SEARCH_BATCH_SIZE); + +/// \ingroup mg_cpp_index_search +/** + * @brief Searches a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::search_params search_params; + * cuvs::neighbors::mg::search(handle, index, search_params, queries, 
neighbors, + * distances); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] search_params configure the index search + * @param[in] queries a row-major matrix on host [n_rows, dim] + * @param[out] neighbors a row-major matrix on host [n_rows, n_neighbors] + * @param[out] distances a row-major matrix on host [n_rows, n_neighbors] + * @param[in] n_rows_per_batch (optional) search batch size + * + */ +void search(const raft::device_resources& handle, + const index, uint8_t, int64_t>& index, + const mg::search_params& search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch = DEFAULT_SEARCH_BATCH_SIZE); + +/// \ingroup mg_cpp_index_search +/** + * @brief Searches a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::search_params search_params; + * cuvs::neighbors::mg::search(handle, index, search_params, queries, neighbors, + * distances); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] search_params configure the index search + * @param[in] queries a row-major matrix on host [n_rows, dim] + * @param[out] neighbors a row-major matrix on host [n_rows, n_neighbors] + * @param[out] distances a row-major matrix on host [n_rows, n_neighbors] + * @param[in] n_rows_per_batch (optional) search batch size + * + */ +void search(const raft::device_resources& handle, + const index, float, int64_t>& index, + const mg::search_params& search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch = DEFAULT_SEARCH_BATCH_SIZE); + +/// \ingroup mg_cpp_index_search +/** + * @brief Searches a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::search_params search_params; + * cuvs::neighbors::mg::search(handle, index, search_params, queries, neighbors, + * distances); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] search_params configure the index search + * @param[in] queries a row-major matrix on host [n_rows, dim] + * @param[out] neighbors a row-major matrix on host [n_rows, n_neighbors] + * @param[out] distances a row-major matrix on host [n_rows, n_neighbors] + * @param[in] n_rows_per_batch (optional) search batch size + * + */ +void search(const raft::device_resources& handle, + const index, half, int64_t>& index, + const mg::search_params& search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch = DEFAULT_SEARCH_BATCH_SIZE); + +/// \ingroup mg_cpp_index_search +/** + * @brief Searches a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::search_params search_params; + * cuvs::neighbors::mg::search(handle, index, search_params, queries, neighbors, + * distances); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * 
@param[in] search_params configure the index search + * @param[in] queries a row-major matrix on host [n_rows, dim] + * @param[out] neighbors a row-major matrix on host [n_rows, n_neighbors] + * @param[out] distances a row-major matrix on host [n_rows, n_neighbors] + * @param[in] n_rows_per_batch (optional) search batch size + * + */ +void search(const raft::device_resources& handle, + const index, int8_t, int64_t>& index, + const mg::search_params& search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch = DEFAULT_SEARCH_BATCH_SIZE); + +/// \ingroup mg_cpp_index_search +/** + * @brief Searches a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::search_params search_params; + * cuvs::neighbors::mg::search(handle, index, search_params, queries, neighbors, + * distances); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] search_params configure the index search + * @param[in] queries a row-major matrix on host [n_rows, dim] + * @param[out] neighbors a row-major matrix on host [n_rows, n_neighbors] + * @param[out] distances a row-major matrix on host [n_rows, n_neighbors] + * @param[in] n_rows_per_batch (optional) search batch size + * + */ +void search(const raft::device_resources& handle, + const index, uint8_t, int64_t>& index, + const mg::search_params& search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch = DEFAULT_SEARCH_BATCH_SIZE); + +/// \ingroup mg_cpp_index_search +/** + * @brief Searches a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::search_params search_params; + * cuvs::neighbors::mg::search(handle, index, search_params, queries, neighbors, + * distances); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] search_params configure the index search + * @param[in] queries a row-major matrix on host [n_rows, dim] + * @param[out] neighbors a row-major matrix on host [n_rows, n_neighbors] + * @param[out] distances a row-major matrix on host [n_rows, n_neighbors] + * @param[in] n_rows_per_batch (optional) search batch size + * + */ +void search(const raft::device_resources& handle, + const index, float, uint32_t>& index, + const mg::search_params& search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch = DEFAULT_SEARCH_BATCH_SIZE); + +/// \ingroup mg_cpp_index_search +/** + * @brief Searches a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::search_params search_params; + * cuvs::neighbors::mg::search(handle, index, search_params, queries, neighbors, + * distances); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] search_params configure the index search + * @param[in] queries a row-major matrix on host [n_rows, 
dim] + * @param[out] neighbors a row-major matrix on host [n_rows, n_neighbors] + * @param[out] distances a row-major matrix on host [n_rows, n_neighbors] + * @param[in] n_rows_per_batch (optional) search batch size + * + */ +void search(const raft::device_resources& handle, + const index, half, uint32_t>& index, + const mg::search_params& search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch = DEFAULT_SEARCH_BATCH_SIZE); + +/// \ingroup mg_cpp_index_search +/** + * @brief Searches a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::search_params search_params; + * cuvs::neighbors::mg::search(handle, index, search_params, queries, neighbors, + * distances); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] search_params configure the index search + * @param[in] queries a row-major matrix on host [n_rows, dim] + * @param[out] neighbors a row-major matrix on host [n_rows, n_neighbors] + * @param[out] distances a row-major matrix on host [n_rows, n_neighbors] + * @param[in] n_rows_per_batch (optional) search batch size + * + */ +void search(const raft::device_resources& handle, + const index, int8_t, uint32_t>& index, + const mg::search_params& search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch = DEFAULT_SEARCH_BATCH_SIZE); + +/// \ingroup mg_cpp_index_search +/** + * @brief Searches a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * cuvs::neighbors::mg::search_params search_params; + * cuvs::neighbors::mg::search(handle, index, search_params, queries, neighbors, + * distances); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] search_params configure the index search + * @param[in] queries a row-major matrix on host [n_rows, dim] + * @param[out] neighbors a row-major matrix on host [n_rows, n_neighbors] + * @param[out] distances a row-major matrix on host [n_rows, n_neighbors] + * @param[in] n_rows_per_batch (optional) search batch size + * + */ +void search(const raft::device_resources& handle, + const index, uint8_t, uint32_t>& index, + const mg::search_params& search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch = DEFAULT_SEARCH_BATCH_SIZE); + +/// \defgroup mg_cpp_serialize ANN MG index serialization + +/// \ingroup mg_cpp_serialize +/** + * @brief Serializes a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] filename path to the file to be serialized + * + */ +void serialize(const raft::device_resources& handle, + const index, float, int64_t>& index, + const std::string& filename); + +/// \ingroup 
mg_cpp_serialize +/** + * @brief Serializes a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] filename path to the file to be serialized + * + */ +void serialize(const raft::device_resources& handle, + const index, int8_t, int64_t>& index, + const std::string& filename); + +/// \ingroup mg_cpp_serialize +/** + * @brief Serializes a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] filename path to the file to be serialized + * + */ +void serialize(const raft::device_resources& handle, + const index, uint8_t, int64_t>& index, + const std::string& filename); + +/// \ingroup mg_cpp_serialize +/** + * @brief Serializes a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] filename path to the file to be serialized + * + */ +void serialize(const raft::device_resources& handle, + const index, float, int64_t>& index, + const std::string& filename); + +/// \ingroup mg_cpp_serialize +/** + * @brief Serializes a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] filename path to the file to be serialized + * + */ +void serialize(const raft::device_resources& handle, + const index, half, int64_t>& index, + const std::string& filename); + +/// \ingroup mg_cpp_serialize +/** + * @brief Serializes a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] filename path to the file to be serialized + * + */ +void serialize(const raft::device_resources& handle, + const index, int8_t, int64_t>& index, + const std::string& filename); + +/// \ingroup mg_cpp_serialize +/** + * @brief Serializes a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, 
index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] filename path to the file to be serialized + * + */ +void serialize(const raft::device_resources& handle, + const index, uint8_t, int64_t>& index, + const std::string& filename); + +/// \ingroup mg_cpp_serialize +/** + * @brief Serializes a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] filename path to the file to be serialized + * + */ +void serialize(const raft::device_resources& handle, + const index, float, uint32_t>& index, + const std::string& filename); + +/// \ingroup mg_cpp_serialize +/** + * @brief Serializes a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] filename path to the file to be serialized + * + */ +void serialize(const raft::device_resources& handle, + const index, half, uint32_t>& index, + const std::string& filename); + +/// \ingroup mg_cpp_serialize +/** + * @brief Serializes a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] filename path to the file to be serialized + * + */ +void serialize(const raft::device_resources& handle, + const index, int8_t, uint32_t>& index, + const std::string& filename); + +/// \ingroup mg_cpp_serialize +/** + * @brief Serializes a multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * @endcode + * + * @param[in] handle + * @param[in] index the pre-built index + * @param[in] filename path to the file to be serialized + * + */ +void serialize(const raft::device_resources& handle, + const index, uint8_t, uint32_t>& index, + const std::string& filename); + +/// \defgroup mg_cpp_deserialize ANN MG index deserialization + +/// \ingroup mg_cpp_deserialize +/** + * @brief Deserializes an IVF-Flat multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * auto new_index = 
cuvs::neighbors::mg::deserialize_flat(handle, filename); + * + * @endcode + * + * @param[in] handle + * @param[in] filename path to the file to be deserialized + * + */ +template +auto deserialize_flat(const raft::device_resources& handle, const std::string& filename) + -> index, T, IdxT>; + +/// \ingroup mg_cpp_deserialize +/** + * @brief Deserializes an IVF-PQ multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * auto new_index = cuvs::neighbors::mg::deserialize_pq(handle, filename); + * @endcode + * + * @param[in] handle + * @param[in] filename path to the file to be deserialized + * + */ +template +auto deserialize_pq(const raft::device_resources& handle, const std::string& filename) + -> index, T, IdxT>; + +/// \ingroup mg_cpp_deserialize +/** + * @brief Deserializes a CAGRA multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::mg::index_params index_params; + * auto index = cuvs::neighbors::mg::build(handle, index_params, index_dataset); + * const std::string filename = "mg_index.cuvs"; + * cuvs::neighbors::mg::serialize(handle, index, filename); + * auto new_index = cuvs::neighbors::mg::deserialize_cagra(handle, filename); + * + * @endcode + * + * @param[in] handle + * @param[in] filename path to the file to be deserialized + * + */ +template +auto deserialize_cagra(const raft::device_resources& handle, const std::string& filename) + -> index, T, IdxT>; + +/// \defgroup mg_cpp_distribute ANN MG local index distribution + +/// \ingroup mg_cpp_distribute +/** + * @brief Replicates a locally built and serialized IVF-Flat index to all GPUs to form a distributed + * multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::ivf_flat::index_params index_params; + * auto index = cuvs::neighbors::ivf_flat::build(handle, index_params, index_dataset); + * const std::string filename = "local_index.cuvs"; + * cuvs::neighbors::ivf_flat::serialize(handle, filename, index); + * auto new_index = cuvs::neighbors::mg::distribute_flat(handle, filename); + * + * @endcode + * + * @param[in] handle + * @param[in] filename path to the file to be deserialized : a local index + * + */ +template +auto distribute_flat(const raft::device_resources& handle, const std::string& filename) + -> index, T, IdxT>; + +/// \ingroup mg_cpp_distribute +/** + * @brief Replicates a locally built and serialized IVF-PQ index to all GPUs to form a distributed + * multi-GPU index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::ivf_pq::index_params index_params; + * auto index = cuvs::neighbors::ivf_pq::build(handle, index_params, index_dataset); + * const std::string filename = "local_index.cuvs"; + * cuvs::neighbors::ivf_pq::serialize(handle, filename, index); + * auto new_index = cuvs::neighbors::mg::distribute_pq(handle, filename); + * @endcode + * + * @param[in] handle + * @param[in] filename path to the file to be deserialized : a local index + * + */ +template +auto distribute_pq(const raft::device_resources& handle, const std::string& filename) + -> index, T, IdxT>; + +/// \ingroup mg_cpp_distribute +/** + * @brief Replicates a locally built and serialized CAGRA index to all GPUs to form a distributed + * multi-GPU 
index + * + * Usage example: + * @code{.cpp} + * raft::handle_t handle; + * cuvs::neighbors::cagra::index_params index_params; + * auto index = cuvs::neighbors::cagra::build(handle, index_params, index_dataset); + * const std::string filename = "local_index.cuvs"; + * cuvs::neighbors::cagra::serialize(handle, filename, index); + * auto new_index = cuvs::neighbors::mg::distribute_cagra(handle, filename); + * + * @endcode + * + * @param[in] handle + * @param[in] filename path to the file to be deserialized : a local index + * + */ +template +auto distribute_cagra(const raft::device_resources& handle, const std::string& filename) + -> index, T, IdxT>; + +} // namespace cuvs::neighbors::mg + +#else + +static_assert(false, + "FORBIDEN_MG_ALGORITHM_IMPORT\n\n" + "Please recompile the cuVS library with MG algorithms BUILD_MG_ALGOS=ON.\n"); + +#endif diff --git a/cpp/src/neighbors/cagra_c.cpp b/cpp/src/neighbors/cagra_c.cpp index 164448f2c..6985ff094 100644 --- a/cpp/src/neighbors/cagra_c.cpp +++ b/cpp/src/neighbors/cagra_c.cpp @@ -176,6 +176,14 @@ extern "C" cuvsError_t cuvsCagraIndexDestroy(cuvsCagraIndex_t index_c_ptr) }); } +extern "C" cuvsError_t cuvsCagraIndexGetDims(cuvsCagraIndex_t index, int* dim) +{ + return cuvs::core::translate_exceptions([=] { + auto index_ptr = reinterpret_cast*>(index->addr); + *dim = index_ptr->dim(); + }); +} + extern "C" cuvsError_t cuvsCagraBuild(cuvsResources_t res, cuvsCagraIndexParams_t params, DLManagedTensor* dataset_tensor, diff --git a/cpp/src/neighbors/detail/cagra/add_nodes.cuh b/cpp/src/neighbors/detail/cagra/add_nodes.cuh index 9694a3e7a..b03b8214b 100644 --- a/cpp/src/neighbors/detail/cagra/add_nodes.cuh +++ b/cpp/src/neighbors/detail/cagra/add_nodes.cuh @@ -29,9 +29,10 @@ #include +namespace cuvs::neighbors::cagra { + static const std::string RAFT_NAME = "raft"; -namespace cuvs::neighbors::cagra { template void add_node_core( raft::resources const& handle, diff --git a/cpp/src/neighbors/detail/cagra/cagra_serialize.cuh b/cpp/src/neighbors/detail/cagra/cagra_serialize.cuh index 4c3fe5e81..0f6cf852f 100644 --- a/cpp/src/neighbors/detail/cagra/cagra_serialize.cuh +++ b/cpp/src/neighbors/detail/cagra/cagra_serialize.cuh @@ -32,9 +32,10 @@ #include #include -static const std::string RAFT_NAME = "raft"; namespace cuvs::neighbors::cagra::detail { +static const std::string RAFT_NAME = "raft"; + constexpr int serialization_version = 4; /** @@ -119,9 +120,9 @@ void serialize_to_hnswlib(raft::resources const& res, os.write(reinterpret_cast(&curr_element_count), sizeof(std::size_t)); // Example:M: 16, dim = 128, data_t = float, index_t = uint32_t, list_size_type = uint32_t, // labeltype: size_t size_data_per_element_ = M * 2 * sizeof(index_t) + sizeof(list_size_type) + - // dim * 4 + sizeof(labeltype) - auto size_data_per_element = - static_cast(index_.graph_degree() * sizeof(IdxT) + 4 + index_.dim() * 4 + 8); + // dim * sizeof(T) + sizeof(labeltype) + auto size_data_per_element = static_cast(index_.graph_degree() * sizeof(IdxT) + 4 + + index_.dim() * sizeof(T) + 8); os.write(reinterpret_cast(&size_data_per_element), sizeof(std::size_t)); // label_offset std::size_t label_offset = size_data_per_element - 8; @@ -184,18 +185,9 @@ void serialize_to_hnswlib(raft::resources const& res, } auto data_row = host_dataset.data_handle() + (index_.dim() * i); - if constexpr (std::is_same_v) { - for (std::size_t j = 0; j < index_.dim(); ++j) { - auto data_elem = static_cast(host_dataset(i, j)); - os.write(reinterpret_cast(&data_elem), sizeof(float)); - } - } else if 
constexpr (std::is_same_v or std::is_same_v) { - for (std::size_t j = 0; j < index_.dim(); ++j) { - auto data_elem = static_cast(host_dataset(i, j)); - os.write(reinterpret_cast(&data_elem), sizeof(int)); - } - } else { - RAFT_FAIL("Unsupported dataset type while saving CAGRA dataset to HNSWlib format"); + for (std::size_t j = 0; j < index_.dim(); ++j) { + auto data_elem = static_cast(host_dataset(i, j)); + os.write(reinterpret_cast(&data_elem), sizeof(T)); } os.write(reinterpret_cast(&i), sizeof(std::size_t)); diff --git a/cpp/src/neighbors/detail/cagra/graph_core.cuh b/cpp/src/neighbors/detail/cagra/graph_core.cuh index 9edbbf5c1..43bf1ba2b 100644 --- a/cpp/src/neighbors/detail/cagra/graph_core.cuh +++ b/cpp/src/neighbors/detail/cagra/graph_core.cuh @@ -475,12 +475,12 @@ void sort_knn_graph( { RAFT_EXPECTS(dataset.extent(0) == knn_graph.extent(0), "dataset size is expected to have the same number of graph index size"); - const uint32_t dataset_size = dataset.extent(0); - const uint32_t dataset_dim = dataset.extent(1); + const uint64_t dataset_size = dataset.extent(0); + const uint64_t dataset_dim = dataset.extent(1); const DataT* dataset_ptr = dataset.data_handle(); const IdxT graph_size = dataset_size; - const uint32_t input_graph_degree = knn_graph.extent(1); + const uint64_t input_graph_degree = knn_graph.extent(1); IdxT* const input_graph_ptr = knn_graph.data_handle(); auto large_tmp_mr = raft::resource::get_large_workspace_resource(res); @@ -528,7 +528,7 @@ void sort_knn_graph( kernel_sort = kern_sort; } else { RAFT_FAIL( - "The degree of input knn graph is too large (%u). " + "The degree of input knn graph is too large (%lu). " "It must be equal to or smaller than %d.", input_graph_degree, 1024); diff --git a/cpp/src/neighbors/detail/hnsw.hpp b/cpp/src/neighbors/detail/hnsw.hpp index 0d1ae4ec9..ce1e03264 100644 --- a/cpp/src/neighbors/detail/hnsw.hpp +++ b/cpp/src/neighbors/detail/hnsw.hpp @@ -110,9 +110,9 @@ std::unique_ptr> from_cagra(raft::resources const& res, return std::unique_ptr>(hnsw_index); } -template -void get_search_knn_results(hnswlib::HierarchicalNSW const* idx, - const QueriesT* query, +template +void get_search_knn_results(hnswlib::HierarchicalNSW::type> const* idx, + const T* query, int k, uint64_t* indices, float* distances) @@ -127,11 +127,11 @@ void get_search_knn_results(hnswlib::HierarchicalNSW const* idx, } } -template +template void search(raft::resources const& res, const search_params& params, const index& idx, - raft::host_matrix_view queries, + raft::host_matrix_view queries, raft::host_matrix_view neighbors, raft::host_matrix_view distances) { @@ -146,7 +146,8 @@ void search(raft::resources const& res, idx.set_ef(params.ef); auto const* hnswlib_index = - reinterpret_cast const*>(idx.get_index()); + reinterpret_cast::type> const*>( + idx.get_index()); // when num_threads == 0, automatically maximize parallelism if (params.num_threads) { diff --git a/cpp/src/neighbors/hnsw.cpp b/cpp/src/neighbors/hnsw.cpp index 36cbb16c9..e6f3fbcc7 100644 --- a/cpp/src/neighbors/hnsw.cpp +++ b/cpp/src/neighbors/hnsw.cpp @@ -34,20 +34,20 @@ CUVS_INST_HNSW_FROM_CAGRA(int8_t); #undef CUVS_INST_HNSW_FROM_CAGRA -#define CUVS_INST_HNSW_SEARCH(T, QueriesT) \ - void search(raft::resources const& res, \ - const search_params& params, \ - const index& idx, \ - raft::host_matrix_view queries, \ - raft::host_matrix_view neighbors, \ - raft::host_matrix_view distances) \ - { \ - detail::search(res, params, idx, queries, neighbors, distances); \ +#define CUVS_INST_HNSW_SEARCH(T) \ + 
void search(raft::resources const& res, \ + const search_params& params, \ + const index& idx, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances) \ + { \ + detail::search(res, params, idx, queries, neighbors, distances); \ } -CUVS_INST_HNSW_SEARCH(float, float); -CUVS_INST_HNSW_SEARCH(uint8_t, int); -CUVS_INST_HNSW_SEARCH(int8_t, int); +CUVS_INST_HNSW_SEARCH(float); +CUVS_INST_HNSW_SEARCH(uint8_t); +CUVS_INST_HNSW_SEARCH(int8_t); #undef CUVS_INST_HNSW_SEARCH diff --git a/cpp/src/neighbors/hnsw_c.cpp b/cpp/src/neighbors/hnsw_c.cpp index ab5268a6d..a19875641 100644 --- a/cpp/src/neighbors/hnsw_c.cpp +++ b/cpp/src/neighbors/hnsw_c.cpp @@ -31,7 +31,7 @@ #include namespace { -template +template void _search(cuvsResources_t res, cuvsHnswSearchParams params, cuvsHnswIndex index, @@ -46,7 +46,7 @@ void _search(cuvsResources_t res, search_params.ef = params.ef; search_params.num_threads = params.numThreads; - using queries_mdspan_type = raft::host_matrix_view; + using queries_mdspan_type = raft::host_matrix_view; using neighbors_mdspan_type = raft::host_matrix_view; using distances_mdspan_type = raft::host_matrix_view; auto queries_mds = cuvs::core::from_dlpack(queries_tensor); @@ -127,16 +127,13 @@ extern "C" cuvsError_t cuvsHnswSearch(cuvsResources_t res, auto index = *index_c_ptr; RAFT_EXPECTS(queries.dtype.code == index.dtype.code, "type mismatch between index and queries"); - RAFT_EXPECTS(queries.dtype.bits == 32, "number of bits in queries dtype should be 32"); if (index.dtype.code == kDLFloat) { - _search( - res, *params, index, queries_tensor, neighbors_tensor, distances_tensor); + _search(res, *params, index, queries_tensor, neighbors_tensor, distances_tensor); } else if (index.dtype.code == kDLUInt) { - _search( - res, *params, index, queries_tensor, neighbors_tensor, distances_tensor); + _search(res, *params, index, queries_tensor, neighbors_tensor, distances_tensor); } else if (index.dtype.code == kDLInt) { - _search(res, *params, index, queries_tensor, neighbors_tensor, distances_tensor); + _search(res, *params, index, queries_tensor, neighbors_tensor, distances_tensor); } else { RAFT_FAIL("Unsupported index dtype: %d and bits: %d", queries.dtype.code, queries.dtype.bits); } @@ -152,13 +149,10 @@ extern "C" cuvsError_t cuvsHnswDeserialize(cuvsResources_t res, return cuvs::core::translate_exceptions([=] { if (index->dtype.code == kDLFloat && index->dtype.bits == 32) { index->addr = reinterpret_cast(_deserialize(res, filename, dim, metric)); - index->dtype.code = kDLFloat; } else if (index->dtype.code == kDLUInt && index->dtype.bits == 8) { index->addr = reinterpret_cast(_deserialize(res, filename, dim, metric)); - index->dtype.code = kDLInt; } else if (index->dtype.code == kDLInt && index->dtype.bits == 8) { index->addr = reinterpret_cast(_deserialize(res, filename, dim, metric)); - index->dtype.code = kDLUInt; } else { RAFT_FAIL("Unsupported dtype in file %s", filename); } diff --git a/cpp/src/neighbors/iface/generate_iface.py b/cpp/src/neighbors/iface/generate_iface.py new file mode 100644 index 000000000..794219bbf --- /dev/null +++ b/cpp/src/neighbors/iface/generate_iface.py @@ -0,0 +1,273 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +header = """/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +""" + +include_macro = """ +#include "iface.hpp" +""" + +namespace_macro = """ +namespace cuvs::neighbors { +""" + +footer = """ +} // namespace cuvs::neighbors +""" + +flat_macro = """ +#define CUVS_INST_MG_FLAT(T, IdxT) \\ + using T_ha = raft::host_device_accessor, raft::memory_type::device>; \\ + using T_da= raft::host_device_accessor, raft::memory_type::host>; \\ + using IdxT_ha = raft::host_device_accessor, raft::memory_type::device>; \\ + using IdxT_da = raft::host_device_accessor, raft::memory_type::host>; \\ + \\ + template void build(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::index_params* index_params, \\ + raft::mdspan, row_major, T_ha> index_dataset); \\ + \\ + template void build(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::index_params* index_params, \\ + raft::mdspan, row_major, T_da> index_dataset); \\ + \\ + template void extend(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + raft::mdspan, row_major, T_ha> new_vectors, \\ + std::optional, layout_c_contiguous, IdxT_ha>> new_indices); \\ + \\ + template void extend(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + raft::mdspan, row_major, T_da> new_vectors, \\ + std::optional, layout_c_contiguous, IdxT_da>> new_indices); \\ + \\ + template void search(const raft::device_resources& handle, \\ + const cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::search_params* search_params, \\ + raft::device_matrix_view queries, \\ + raft::device_matrix_view neighbors, \\ + raft::device_matrix_view distances); \\ + \\ + template void search(const raft::device_resources& handle, \\ + const cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::search_params* search_params, \\ + raft::host_matrix_view h_queries, \\ + raft::device_matrix_view d_neighbors, \\ + raft::device_matrix_view d_distances); \\ + \\ + template void serialize(const raft::device_resources& handle, \\ + const cuvs::neighbors::iface, T, IdxT>& interface, \\ + std::ostream& os); \\ + \\ + template void deserialize(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + 
std::istream& is); \\ + \\ + template void deserialize(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + const std::string& filename); +""" + +pq_macro = """ +#define CUVS_INST_MG_PQ(T, IdxT) \\ + using T_ha = raft::host_device_accessor, raft::memory_type::device>; \\ + using T_da= raft::host_device_accessor, raft::memory_type::host>; \\ + using IdxT_ha = raft::host_device_accessor, raft::memory_type::device>; \\ + using IdxT_da = raft::host_device_accessor, raft::memory_type::host>; \\ + \\ + template void build(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::index_params* index_params, \\ + raft::mdspan, row_major, T_ha> index_dataset); \\ + \\ + template void build(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::index_params* index_params, \\ + raft::mdspan, row_major, T_da> index_dataset); \\ + \\ + template void extend(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + raft::mdspan, row_major, T_ha> new_vectors, \\ + std::optional, layout_c_contiguous, IdxT_ha>> new_indices); \\ + \\ + template void extend(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + raft::mdspan, row_major, T_da> new_vectors, \\ + std::optional, layout_c_contiguous, IdxT_da>> new_indices); \\ + \\ + template void search(const raft::device_resources& handle, \\ + const cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::search_params* search_params, \\ + raft::device_matrix_view queries, \\ + raft::device_matrix_view neighbors, \\ + raft::device_matrix_view distances); \\ + \\ + template void search(const raft::device_resources& handle, \\ + const cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::search_params* search_params, \\ + raft::host_matrix_view h_queries, \\ + raft::device_matrix_view d_neighbors, \\ + raft::device_matrix_view d_distances); \\ + \\ + template void serialize(const raft::device_resources& handle, \\ + const cuvs::neighbors::iface, T, IdxT>& interface, \\ + std::ostream& os); \\ + \\ + template void deserialize(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + std::istream& is); \\ + \\ + template void deserialize(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + const std::string& filename); +""" + +cagra_macro = """ +#define CUVS_INST_MG_CAGRA(T, IdxT) \\ + using T_ha = raft::host_device_accessor, raft::memory_type::device>; \\ + using T_da= raft::host_device_accessor, raft::memory_type::host>; \\ + using IdxT_ha = raft::host_device_accessor, raft::memory_type::device>; \\ + using IdxT_da = raft::host_device_accessor, raft::memory_type::host>; \\ + \\ + template void build(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::index_params* index_params, \\ + raft::mdspan, row_major, T_ha> index_dataset); \\ + \\ + template void build(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::index_params* index_params, \\ + raft::mdspan, row_major, T_da> index_dataset); \\ + \\ + template void extend(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + raft::mdspan, row_major, T_ha> new_vectors, \\ + std::optional, layout_c_contiguous, IdxT_ha>> 
new_indices); \\ + \\ + template void extend(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + raft::mdspan, row_major, T_da> new_vectors, \\ + std::optional, layout_c_contiguous, IdxT_da>> new_indices); \\ + \\ + template void search(const raft::device_resources& handle, \\ + const cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::search_params* search_params, \\ + raft::device_matrix_view queries, \\ + raft::device_matrix_view neighbors, \\ + raft::device_matrix_view distances); \\ + \\ + template void search(const raft::device_resources& handle, \\ + const cuvs::neighbors::iface, T, IdxT>& interface, \\ + const cuvs::neighbors::search_params* search_params, \\ + raft::host_matrix_view h_queries, \\ + raft::device_matrix_view d_neighbors, \\ + raft::device_matrix_view d_distances); \\ + \\ + template void serialize(const raft::device_resources& handle, \\ + const cuvs::neighbors::iface, T, IdxT>& interface, \\ + std::ostream& os); \\ + \\ + template void deserialize(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + std::istream& is); \\ + \\ + template void deserialize(const raft::device_resources& handle, \\ + cuvs::neighbors::iface, T, IdxT>& interface, \\ + const std::string& filename); +""" + +flat_macros = dict ( + flat = dict( + include=include_macro, + definition=flat_macro, + name="CUVS_INST_MG_FLAT", + ) +) + +pq_macros = dict ( + pq = dict( + include=include_macro, + definition=pq_macro, + name="CUVS_INST_MG_PQ", + ) +) + +cagra_macros = dict ( + cagra = dict( + include=include_macro, + definition=cagra_macro, + name="CUVS_INST_MG_CAGRA", + ) +) + +flat_types = dict( + float_int64_t=("float", "int64_t"), + int8_t_int64_t=("int8_t", "int64_t"), + uint8_t_int64_t=("uint8_t", "int64_t"), +) + +pq_types = dict( + float_int64_t=("float", "int64_t"), + half_int64_t=("half", "int64_t"), + int8_t_int64_t=("int8_t", "int64_t"), + uint8_t_int64_t=("uint8_t", "int64_t"), +) + +cagra_types = dict( + float_uint32_t=("float", "uint32_t"), + half_uint32_t=("half", "uint32_t"), + int8_t_uint32_t=("int8_t", "uint32_t"), + uint8_t_uint32_t=("uint8_t", "uint32_t"), +) + +for macros, types in [(flat_macros, flat_types), (pq_macros, pq_types), (cagra_macros, cagra_types)]: + for type_path, (T, IdxT) in types.items(): + for macro_path, macro in macros.items(): + path = f"iface_{macro_path}_{type_path}.cu" + with open(path, "w") as f: + f.write(header) + f.write(macro['include']) + f.write(namespace_macro) + f.write(macro["definition"]) + f.write(f"{macro['name']}({T}, {IdxT});\n\n") + f.write(f"#undef {macro['name']}\n") + f.write(footer) + + print(f"src/neighbors/iface/{path}") diff --git a/cpp/src/neighbors/iface/iface.hpp b/cpp/src/neighbors/iface/iface.hpp new file mode 100644 index 000000000..a329db429 --- /dev/null +++ b/cpp/src/neighbors/iface/iface.hpp @@ -0,0 +1,198 @@ +#include + +#include +#include +#include +#include +#include + +namespace cuvs::neighbors { + +using namespace raft; + +template +void build(const raft::device_resources& handle, + cuvs::neighbors::iface& interface, + const cuvs::neighbors::index_params* index_params, + raft::mdspan, row_major, Accessor> index_dataset) +{ + interface.mutex_->lock(); + + if constexpr (std::is_same>::value) { + auto idx = cuvs::neighbors::ivf_flat::build( + handle, *static_cast(index_params), index_dataset); + interface.index_.emplace(std::move(idx)); + } else if constexpr (std::is_same>::value) { + auto idx = 
cuvs::neighbors::ivf_pq::build( + handle, *static_cast(index_params), index_dataset); + interface.index_.emplace(std::move(idx)); + } else if constexpr (std::is_same>::value) { + auto idx = cuvs::neighbors::cagra::build( + handle, *static_cast(index_params), index_dataset); + interface.index_.emplace(std::move(idx)); + } + resource::sync_stream(handle); + + interface.mutex_->unlock(); +} + +template +void extend( + const raft::device_resources& handle, + cuvs::neighbors::iface& interface, + raft::mdspan, row_major, Accessor1> new_vectors, + std::optional, layout_c_contiguous, Accessor2>> + new_indices) +{ + interface.mutex_->lock(); + + if constexpr (std::is_same>::value) { + auto idx = + cuvs::neighbors::ivf_flat::extend(handle, new_vectors, new_indices, interface.index_.value()); + interface.index_.emplace(std::move(idx)); + } else if constexpr (std::is_same>::value) { + auto idx = + cuvs::neighbors::ivf_pq::extend(handle, new_vectors, new_indices, interface.index_.value()); + interface.index_.emplace(std::move(idx)); + } else if constexpr (std::is_same>::value) { + RAFT_FAIL("CAGRA does not implement the extend method"); + } + resource::sync_stream(handle); + + interface.mutex_->unlock(); +} + +template +void search(const raft::device_resources& handle, + const cuvs::neighbors::iface& interface, + const cuvs::neighbors::search_params* search_params, + raft::device_matrix_view queries, + raft::device_matrix_view neighbors, + raft::device_matrix_view distances) +{ + // interface.mutex_->lock(); + if constexpr (std::is_same>::value) { + cuvs::neighbors::ivf_flat::search( + handle, + *reinterpret_cast(search_params), + interface.index_.value(), + queries, + neighbors, + distances); + } else if constexpr (std::is_same>::value) { + cuvs::neighbors::ivf_pq::search(handle, + *reinterpret_cast(search_params), + interface.index_.value(), + queries, + neighbors, + distances); + } else if constexpr (std::is_same>::value) { + cuvs::neighbors::cagra::search(handle, + *reinterpret_cast(search_params), + interface.index_.value(), + queries, + neighbors, + distances); + } + resource::sync_stream(handle); + + // interface.mutex_->unlock(); +} + +// for MG ANN only +template +void search(const raft::device_resources& handle, + const cuvs::neighbors::iface& interface, + const cuvs::neighbors::search_params* search_params, + raft::host_matrix_view h_queries, + raft::device_matrix_view d_neighbors, + raft::device_matrix_view d_distances) +{ + // interface.mutex_->lock(); + + int64_t n_rows = h_queries.extent(0); + int64_t n_dims = h_queries.extent(1); + auto d_queries = raft::make_device_matrix(handle, n_rows, n_dims); + raft::copy(d_queries.data_handle(), + h_queries.data_handle(), + n_rows * n_dims, + resource::get_cuda_stream(handle)); + auto d_query_view = raft::make_const_mdspan(d_queries.view()); + + search(handle, interface, search_params, d_query_view, d_neighbors, d_distances); + + // interface.mutex_->unlock(); +} + +template +void serialize(const raft::device_resources& handle, + const cuvs::neighbors::iface& interface, + std::ostream& os) +{ + interface.mutex_->lock(); + + if constexpr (std::is_same>::value) { + ivf_flat::serialize(handle, os, interface.index_.value()); + } else if constexpr (std::is_same>::value) { + ivf_pq::serialize(handle, os, interface.index_.value()); + } else if constexpr (std::is_same>::value) { + cagra::serialize(handle, os, interface.index_.value(), true); + } + + interface.mutex_->unlock(); +} + +template +void deserialize(const raft::device_resources& handle, + 
cuvs::neighbors::iface& interface, + std::istream& is) +{ + interface.mutex_->lock(); + + if constexpr (std::is_same>::value) { + ivf_flat::index idx(handle); + ivf_flat::deserialize(handle, is, &idx); + interface.index_.emplace(std::move(idx)); + } else if constexpr (std::is_same>::value) { + ivf_pq::index idx(handle); + ivf_pq::deserialize(handle, is, &idx); + interface.index_.emplace(std::move(idx)); + } else if constexpr (std::is_same>::value) { + cagra::index idx(handle); + cagra::deserialize(handle, is, &idx); + interface.index_.emplace(std::move(idx)); + } + + interface.mutex_->unlock(); +} + +template +void deserialize(const raft::device_resources& handle, + cuvs::neighbors::iface& interface, + const std::string& filename) +{ + interface.mutex_->lock(); + + std::ifstream is(filename, std::ios::in | std::ios::binary); + if (!is) { RAFT_FAIL("Cannot open file %s", filename.c_str()); } + + if constexpr (std::is_same>::value) { + ivf_flat::index idx(handle); + ivf_flat::deserialize(handle, is, &idx); + interface.index_.emplace(std::move(idx)); + } else if constexpr (std::is_same>::value) { + ivf_pq::index idx(handle); + ivf_pq::deserialize(handle, is, &idx); + interface.index_.emplace(std::move(idx)); + } else if constexpr (std::is_same>::value) { + cagra::index idx(handle); + cagra::deserialize(handle, is, &idx); + interface.index_.emplace(std::move(idx)); + } + + is.close(); + + interface.mutex_->unlock(); +} + +}; // namespace cuvs::neighbors \ No newline at end of file diff --git a/cpp/src/neighbors/iface/iface_cagra_float_uint32_t.cu b/cpp/src/neighbors/iface/iface_cagra_float_uint32_t.cu new file mode 100644 index 000000000..b5e329dd8 --- /dev/null +++ b/cpp/src/neighbors/iface/iface_cagra_float_uint32_t.cu @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +#include "iface.hpp" + +namespace cuvs::neighbors { + +#define CUVS_INST_MG_CAGRA(T, IdxT) \ + using T_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using T_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + using IdxT_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using IdxT_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_ha> index_dataset); \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_da> index_dataset); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_ha> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_ha>> \ + new_indices); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_da> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_da>> \ + new_indices); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::device_matrix_view queries, \ + raft::device_matrix_view neighbors, \ + raft::device_matrix_view distances); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::host_matrix_view h_queries, \ + raft::device_matrix_view d_neighbors, \ + raft::device_matrix_view d_distances); \ + \ + template void serialize(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + std::ostream& os); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + std::istream& is); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const std::string& filename); +CUVS_INST_MG_CAGRA(float, uint32_t); + +#undef CUVS_INST_MG_CAGRA + +} // namespace cuvs::neighbors diff --git a/cpp/src/neighbors/iface/iface_cagra_half_uint32_t.cu b/cpp/src/neighbors/iface/iface_cagra_half_uint32_t.cu new file mode 100644 index 000000000..23fcffc59 --- /dev/null +++ b/cpp/src/neighbors/iface/iface_cagra_half_uint32_t.cu @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +#include "iface.hpp" + +namespace cuvs::neighbors { + +#define CUVS_INST_MG_CAGRA(T, IdxT) \ + using T_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using T_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + using IdxT_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using IdxT_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_ha> index_dataset); \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_da> index_dataset); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_ha> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_ha>> \ + new_indices); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_da> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_da>> \ + new_indices); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::device_matrix_view queries, \ + raft::device_matrix_view neighbors, \ + raft::device_matrix_view distances); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::host_matrix_view h_queries, \ + raft::device_matrix_view d_neighbors, \ + raft::device_matrix_view d_distances); \ + \ + template void serialize(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + std::ostream& os); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + std::istream& is); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const std::string& filename); +CUVS_INST_MG_CAGRA(half, uint32_t); + +#undef CUVS_INST_MG_CAGRA + +} // namespace cuvs::neighbors diff --git a/cpp/src/neighbors/iface/iface_cagra_int8_t_uint32_t.cu b/cpp/src/neighbors/iface/iface_cagra_int8_t_uint32_t.cu new file mode 100644 index 000000000..30377ab66 --- /dev/null +++ b/cpp/src/neighbors/iface/iface_cagra_int8_t_uint32_t.cu @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +#include "iface.hpp" + +namespace cuvs::neighbors { + +#define CUVS_INST_MG_CAGRA(T, IdxT) \ + using T_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using T_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + using IdxT_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using IdxT_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_ha> index_dataset); \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_da> index_dataset); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_ha> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_ha>> \ + new_indices); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_da> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_da>> \ + new_indices); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::device_matrix_view queries, \ + raft::device_matrix_view neighbors, \ + raft::device_matrix_view distances); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::host_matrix_view h_queries, \ + raft::device_matrix_view d_neighbors, \ + raft::device_matrix_view d_distances); \ + \ + template void serialize(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + std::ostream& os); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + std::istream& is); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const std::string& filename); +CUVS_INST_MG_CAGRA(int8_t, uint32_t); + +#undef CUVS_INST_MG_CAGRA + +} // namespace cuvs::neighbors diff --git a/cpp/src/neighbors/iface/iface_cagra_uint8_t_uint32_t.cu b/cpp/src/neighbors/iface/iface_cagra_uint8_t_uint32_t.cu new file mode 100644 index 000000000..59a1640e8 --- /dev/null +++ b/cpp/src/neighbors/iface/iface_cagra_uint8_t_uint32_t.cu @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +#include "iface.hpp" + +namespace cuvs::neighbors { + +#define CUVS_INST_MG_CAGRA(T, IdxT) \ + using T_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using T_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + using IdxT_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using IdxT_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_ha> index_dataset); \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_da> index_dataset); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_ha> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_ha>> \ + new_indices); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_da> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_da>> \ + new_indices); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::device_matrix_view queries, \ + raft::device_matrix_view neighbors, \ + raft::device_matrix_view distances); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::host_matrix_view h_queries, \ + raft::device_matrix_view d_neighbors, \ + raft::device_matrix_view d_distances); \ + \ + template void serialize(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + std::ostream& os); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + std::istream& is); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const std::string& filename); +CUVS_INST_MG_CAGRA(uint8_t, uint32_t); + +#undef CUVS_INST_MG_CAGRA + +} // namespace cuvs::neighbors diff --git a/cpp/src/neighbors/iface/iface_flat_float_int64_t.cu b/cpp/src/neighbors/iface/iface_flat_float_int64_t.cu new file mode 100644 index 000000000..a0a455375 --- /dev/null +++ b/cpp/src/neighbors/iface/iface_flat_float_int64_t.cu @@ -0,0 +1,96 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +#include "iface.hpp" + +namespace cuvs::neighbors { + +#define CUVS_INST_MG_FLAT(T, IdxT) \ + using T_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using T_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + using IdxT_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using IdxT_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_ha> index_dataset); \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_da> index_dataset); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_ha> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_ha>> \ + new_indices); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_da> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_da>> \ + new_indices); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::device_matrix_view queries, \ + raft::device_matrix_view neighbors, \ + raft::device_matrix_view distances); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::host_matrix_view h_queries, \ + raft::device_matrix_view d_neighbors, \ + raft::device_matrix_view d_distances); \ + \ + template void serialize( \ + const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + std::ostream& os); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + std::istream& is); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const std::string& filename); +CUVS_INST_MG_FLAT(float, int64_t); + +#undef CUVS_INST_MG_FLAT + +} // namespace cuvs::neighbors diff --git a/cpp/src/neighbors/iface/iface_flat_int8_t_int64_t.cu b/cpp/src/neighbors/iface/iface_flat_int8_t_int64_t.cu new file mode 100644 index 000000000..9fdd6464f --- /dev/null +++ b/cpp/src/neighbors/iface/iface_flat_int8_t_int64_t.cu @@ -0,0 +1,96 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +#include "iface.hpp" + +namespace cuvs::neighbors { + +#define CUVS_INST_MG_FLAT(T, IdxT) \ + using T_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using T_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + using IdxT_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using IdxT_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_ha> index_dataset); \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_da> index_dataset); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_ha> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_ha>> \ + new_indices); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_da> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_da>> \ + new_indices); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::device_matrix_view queries, \ + raft::device_matrix_view neighbors, \ + raft::device_matrix_view distances); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::host_matrix_view h_queries, \ + raft::device_matrix_view d_neighbors, \ + raft::device_matrix_view d_distances); \ + \ + template void serialize( \ + const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + std::ostream& os); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + std::istream& is); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const std::string& filename); +CUVS_INST_MG_FLAT(int8_t, int64_t); + +#undef CUVS_INST_MG_FLAT + +} // namespace cuvs::neighbors diff --git a/cpp/src/neighbors/iface/iface_flat_uint8_t_int64_t.cu b/cpp/src/neighbors/iface/iface_flat_uint8_t_int64_t.cu new file mode 100644 index 000000000..daee59c4a --- /dev/null +++ b/cpp/src/neighbors/iface/iface_flat_uint8_t_int64_t.cu @@ -0,0 +1,96 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +#include "iface.hpp" + +namespace cuvs::neighbors { + +#define CUVS_INST_MG_FLAT(T, IdxT) \ + using T_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using T_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + using IdxT_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using IdxT_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_ha> index_dataset); \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_da> index_dataset); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_ha> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_ha>> \ + new_indices); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_da> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_da>> \ + new_indices); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::device_matrix_view queries, \ + raft::device_matrix_view neighbors, \ + raft::device_matrix_view distances); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::host_matrix_view h_queries, \ + raft::device_matrix_view d_neighbors, \ + raft::device_matrix_view d_distances); \ + \ + template void serialize( \ + const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + std::ostream& os); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + std::istream& is); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const std::string& filename); +CUVS_INST_MG_FLAT(uint8_t, int64_t); + +#undef CUVS_INST_MG_FLAT + +} // namespace cuvs::neighbors diff --git a/cpp/src/neighbors/iface/iface_pq_float_int64_t.cu b/cpp/src/neighbors/iface/iface_pq_float_int64_t.cu new file mode 100644 index 000000000..7282d6bd0 --- /dev/null +++ b/cpp/src/neighbors/iface/iface_pq_float_int64_t.cu @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +#include "iface.hpp" + +namespace cuvs::neighbors { + +#define CUVS_INST_MG_PQ(T, IdxT) \ + using T_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using T_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + using IdxT_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using IdxT_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_ha> index_dataset); \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_da> index_dataset); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_ha> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_ha>> \ + new_indices); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_da> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_da>> \ + new_indices); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::device_matrix_view queries, \ + raft::device_matrix_view neighbors, \ + raft::device_matrix_view distances); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::host_matrix_view h_queries, \ + raft::device_matrix_view d_neighbors, \ + raft::device_matrix_view d_distances); \ + \ + template void serialize(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + std::ostream& os); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + std::istream& is); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const std::string& filename); +CUVS_INST_MG_PQ(float, int64_t); + +#undef CUVS_INST_MG_PQ + +} // namespace cuvs::neighbors diff --git a/cpp/src/neighbors/iface/iface_pq_half_int64_t.cu b/cpp/src/neighbors/iface/iface_pq_half_int64_t.cu new file mode 100644 index 000000000..4d67f9aed --- /dev/null +++ b/cpp/src/neighbors/iface/iface_pq_half_int64_t.cu @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +#include "iface.hpp" + +namespace cuvs::neighbors { + +#define CUVS_INST_MG_PQ(T, IdxT) \ + using T_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using T_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + using IdxT_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using IdxT_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_ha> index_dataset); \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_da> index_dataset); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_ha> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_ha>> \ + new_indices); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_da> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_da>> \ + new_indices); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::device_matrix_view queries, \ + raft::device_matrix_view neighbors, \ + raft::device_matrix_view distances); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::host_matrix_view h_queries, \ + raft::device_matrix_view d_neighbors, \ + raft::device_matrix_view d_distances); \ + \ + template void serialize(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + std::ostream& os); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + std::istream& is); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const std::string& filename); +CUVS_INST_MG_PQ(half, int64_t); + +#undef CUVS_INST_MG_PQ + +} // namespace cuvs::neighbors diff --git a/cpp/src/neighbors/iface/iface_pq_int8_t_int64_t.cu b/cpp/src/neighbors/iface/iface_pq_int8_t_int64_t.cu new file mode 100644 index 000000000..46537b3f9 --- /dev/null +++ b/cpp/src/neighbors/iface/iface_pq_int8_t_int64_t.cu @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +#include "iface.hpp" + +namespace cuvs::neighbors { + +#define CUVS_INST_MG_PQ(T, IdxT) \ + using T_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using T_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + using IdxT_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using IdxT_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_ha> index_dataset); \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_da> index_dataset); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_ha> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_ha>> \ + new_indices); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_da> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_da>> \ + new_indices); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::device_matrix_view queries, \ + raft::device_matrix_view neighbors, \ + raft::device_matrix_view distances); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::host_matrix_view h_queries, \ + raft::device_matrix_view d_neighbors, \ + raft::device_matrix_view d_distances); \ + \ + template void serialize(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + std::ostream& os); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + std::istream& is); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const std::string& filename); +CUVS_INST_MG_PQ(int8_t, int64_t); + +#undef CUVS_INST_MG_PQ + +} // namespace cuvs::neighbors diff --git a/cpp/src/neighbors/iface/iface_pq_uint8_t_int64_t.cu b/cpp/src/neighbors/iface/iface_pq_uint8_t_int64_t.cu new file mode 100644 index 000000000..591ac881a --- /dev/null +++ b/cpp/src/neighbors/iface/iface_pq_uint8_t_int64_t.cu @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_iface.py + * + * Make changes there and run in this directory: + * + * > python generate_iface.py + * + */ + +#include "iface.hpp" + +namespace cuvs::neighbors { + +#define CUVS_INST_MG_PQ(T, IdxT) \ + using T_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using T_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + using IdxT_ha = raft::host_device_accessor, \ + raft::memory_type::device>; \ + using IdxT_da = raft::host_device_accessor, \ + raft::memory_type::host>; \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_ha> index_dataset); \ + \ + template void build( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::index_params* index_params, \ + raft::mdspan, row_major, T_da> index_dataset); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_ha> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_ha>> \ + new_indices); \ + \ + template void extend( \ + const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + raft::mdspan, row_major, T_da> new_vectors, \ + std::optional, layout_c_contiguous, IdxT_da>> \ + new_indices); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::device_matrix_view queries, \ + raft::device_matrix_view neighbors, \ + raft::device_matrix_view distances); \ + \ + template void search(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + const cuvs::neighbors::search_params* search_params, \ + raft::host_matrix_view h_queries, \ + raft::device_matrix_view d_neighbors, \ + raft::device_matrix_view d_distances); \ + \ + template void serialize(const raft::device_resources& handle, \ + const cuvs::neighbors::iface, T, IdxT>& interface, \ + std::ostream& os); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + std::istream& is); \ + \ + template void deserialize(const raft::device_resources& handle, \ + cuvs::neighbors::iface, T, IdxT>& interface, \ + const std::string& filename); +CUVS_INST_MG_PQ(uint8_t, int64_t); + +#undef CUVS_INST_MG_PQ + +} // namespace cuvs::neighbors diff --git a/cpp/src/neighbors/ivf_flat/generate_ivf_flat.py b/cpp/src/neighbors/ivf_flat/generate_ivf_flat.py index 1fabcca8c..c435cc6d9 100644 --- a/cpp/src/neighbors/ivf_flat/generate_ivf_flat.py +++ b/cpp/src/neighbors/ivf_flat/generate_ivf_flat.py @@ -144,7 +144,7 @@ void search( \\ raft::resources const& handle, \\ const cuvs::neighbors::ivf_flat::search_params& params, \\ - cuvs::neighbors::ivf_flat::index& index, \\ + const cuvs::neighbors::ivf_flat::index& index, \\ raft::device_matrix_view queries, \\ raft::device_matrix_view neighbors, \\ raft::device_matrix_view distances, \\ diff --git a/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh b/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh index 9626b2ce5..f5a4267cd 100644 --- a/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh +++ b/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh @@ -1206,8 +1206,8 @@ void 
launch_with_fixed_consts(cuvs::distance::DistanceType metric, Args&&... arg inner_prod_dist>( {}, raft::compose_op(raft::add_const_op{1.0f}, raft::mul_const_op{-1.0f}), - std::forward(args)...); - // NB: update the description of `knn::ivf_flat::build` when adding here a new metric. + std::forward(args)...); // NB: update the description of `knn::ivf_flat::build` when + // adding here a new metric. default: RAFT_FAIL("The chosen distance metric is not supported (%d)", int(metric)); } } diff --git a/cpp/src/neighbors/ivf_flat/ivf_flat_search_float_int64_t.cu b/cpp/src/neighbors/ivf_flat/ivf_flat_search_float_int64_t.cu index 3f262d612..87abc0bc0 100644 --- a/cpp/src/neighbors/ivf_flat/ivf_flat_search_float_int64_t.cu +++ b/cpp/src/neighbors/ivf_flat/ivf_flat_search_float_int64_t.cu @@ -32,7 +32,7 @@ namespace cuvs::neighbors::ivf_flat { #define CUVS_INST_IVF_FLAT_SEARCH(T, IdxT) \ void search(raft::resources const& handle, \ const cuvs::neighbors::ivf_flat::search_params& params, \ - cuvs::neighbors::ivf_flat::index& index, \ + const cuvs::neighbors::ivf_flat::index& index, \ raft::device_matrix_view queries, \ raft::device_matrix_view neighbors, \ raft::device_matrix_view distances, \ diff --git a/cpp/src/neighbors/ivf_flat/ivf_flat_search_int8_t_int64_t.cu b/cpp/src/neighbors/ivf_flat/ivf_flat_search_int8_t_int64_t.cu index 4357afb0a..c1e92ae5b 100644 --- a/cpp/src/neighbors/ivf_flat/ivf_flat_search_int8_t_int64_t.cu +++ b/cpp/src/neighbors/ivf_flat/ivf_flat_search_int8_t_int64_t.cu @@ -32,7 +32,7 @@ namespace cuvs::neighbors::ivf_flat { #define CUVS_INST_IVF_FLAT_SEARCH(T, IdxT) \ void search(raft::resources const& handle, \ const cuvs::neighbors::ivf_flat::search_params& params, \ - cuvs::neighbors::ivf_flat::index& index, \ + const cuvs::neighbors::ivf_flat::index& index, \ raft::device_matrix_view queries, \ raft::device_matrix_view neighbors, \ raft::device_matrix_view distances, \ diff --git a/cpp/src/neighbors/ivf_flat/ivf_flat_search_uint8_t_int64_t.cu b/cpp/src/neighbors/ivf_flat/ivf_flat_search_uint8_t_int64_t.cu index 8265a3e17..4ff8ed770 100644 --- a/cpp/src/neighbors/ivf_flat/ivf_flat_search_uint8_t_int64_t.cu +++ b/cpp/src/neighbors/ivf_flat/ivf_flat_search_uint8_t_int64_t.cu @@ -32,7 +32,7 @@ namespace cuvs::neighbors::ivf_flat { #define CUVS_INST_IVF_FLAT_SEARCH(T, IdxT) \ void search(raft::resources const& handle, \ const cuvs::neighbors::ivf_flat::search_params& params, \ - cuvs::neighbors::ivf_flat::index& index, \ + const cuvs::neighbors::ivf_flat::index& index, \ raft::device_matrix_view queries, \ raft::device_matrix_view neighbors, \ raft::device_matrix_view distances, \ diff --git a/cpp/src/neighbors/ivf_pq/detail/generate_ivf_pq.py b/cpp/src/neighbors/ivf_pq/detail/generate_ivf_pq.py index a5a829967..a2ac048ff 100644 --- a/cpp/src/neighbors/ivf_pq/detail/generate_ivf_pq.py +++ b/cpp/src/neighbors/ivf_pq/detail/generate_ivf_pq.py @@ -68,7 +68,7 @@ #define CUVS_INST_IVF_PQ_SEARCH(T, IdxT) \\ void search(raft::resources const& handle, \\ const cuvs::neighbors::ivf_pq::search_params& params, \\ - cuvs::neighbors::ivf_pq::index& index, \\ + const cuvs::neighbors::ivf_pq::index& index, \\ raft::device_matrix_view queries, \\ raft::device_matrix_view neighbors, \\ raft::device_matrix_view distances, \\ diff --git a/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_float_int64_t.cu b/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_float_int64_t.cu index 07ee110bc..44e9777ba 100644 --- a/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_float_int64_t.cu +++ 
b/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_float_int64_t.cu @@ -32,7 +32,7 @@ namespace cuvs::neighbors::ivf_pq { #define CUVS_INST_IVF_PQ_SEARCH(T, IdxT) \ void search(raft::resources const& handle, \ const cuvs::neighbors::ivf_pq::search_params& params, \ - cuvs::neighbors::ivf_pq::index& index, \ + const cuvs::neighbors::ivf_pq::index& index, \ raft::device_matrix_view queries, \ raft::device_matrix_view neighbors, \ raft::device_matrix_view distances, \ diff --git a/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_half_int64_t.cu b/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_half_int64_t.cu index cf387cb67..d7446e846 100644 --- a/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_half_int64_t.cu +++ b/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_half_int64_t.cu @@ -32,7 +32,7 @@ namespace cuvs::neighbors::ivf_pq { #define CUVS_INST_IVF_PQ_SEARCH(T, IdxT) \ void search(raft::resources const& handle, \ const cuvs::neighbors::ivf_pq::search_params& params, \ - cuvs::neighbors::ivf_pq::index& index, \ + const cuvs::neighbors::ivf_pq::index& index, \ raft::device_matrix_view queries, \ raft::device_matrix_view neighbors, \ raft::device_matrix_view distances, \ diff --git a/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_int8_t_int64_t.cu b/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_int8_t_int64_t.cu index 5ec9093df..c1ffede97 100644 --- a/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_int8_t_int64_t.cu +++ b/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_int8_t_int64_t.cu @@ -32,7 +32,7 @@ namespace cuvs::neighbors::ivf_pq { #define CUVS_INST_IVF_PQ_SEARCH(T, IdxT) \ void search(raft::resources const& handle, \ const cuvs::neighbors::ivf_pq::search_params& params, \ - cuvs::neighbors::ivf_pq::index& index, \ + const cuvs::neighbors::ivf_pq::index& index, \ raft::device_matrix_view queries, \ raft::device_matrix_view neighbors, \ raft::device_matrix_view distances, \ diff --git a/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_uint8_t_int64_t.cu b/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_uint8_t_int64_t.cu index d2e2f3b00..08e4f0536 100644 --- a/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_uint8_t_int64_t.cu +++ b/cpp/src/neighbors/ivf_pq/detail/ivf_pq_search_uint8_t_int64_t.cu @@ -32,7 +32,7 @@ namespace cuvs::neighbors::ivf_pq { #define CUVS_INST_IVF_PQ_SEARCH(T, IdxT) \ void search(raft::resources const& handle, \ const cuvs::neighbors::ivf_pq::search_params& params, \ - cuvs::neighbors::ivf_pq::index& index, \ + const cuvs::neighbors::ivf_pq::index& index, \ raft::device_matrix_view queries, \ raft::device_matrix_view neighbors, \ raft::device_matrix_view distances, \ diff --git a/cpp/src/neighbors/ivf_pq/ivf_pq_build.cuh b/cpp/src/neighbors/ivf_pq/ivf_pq_build.cuh index f0f464950..1d4acea1e 100644 --- a/cpp/src/neighbors/ivf_pq/ivf_pq_build.cuh +++ b/cpp/src/neighbors/ivf_pq/ivf_pq_build.cuh @@ -31,6 +31,7 @@ #include #include +#include #include #include #include @@ -41,6 +42,8 @@ #include #include #include +#include +#include #include #include #include @@ -1466,6 +1469,13 @@ void extend(raft::resources const& handle, std::is_same_v, "Unsupported data type"); + if (index->metric() == distance::DistanceType::CosineExpanded) { + if constexpr (std::is_same_v || std::is_same_v) + RAFT_FAIL( + "CosineExpanded distance metric is currently not supported for uint8_t and int8_t data " + "type"); + } + rmm::device_async_resource_ref device_memory = raft::resource::get_workspace_resource(handle); rmm::device_async_resource_ref large_memory = raft::resource::get_large_workspace_resource(handle); 
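Context for the ivf_pq_build.cuh changes (the extend-time guard above and the row-normalization hunks that follow): CosineExpanded support for IVF-PQ works by L2-normalizing the training set, the cluster centers and each vector/query batch, while int8_t/uint8_t inputs are rejected with RAFT_FAIL. A minimal caller-side sketch of the resulting behaviour is below; the header paths and the exact shape of the public build()/search() overloads are assumptions for illustration, not part of this patch.

// Hypothetical usage sketch: build and search an IVF-PQ index with the
// CosineExpanded metric enabled by this patch (floating-point data only).
#include <cuvs/neighbors/ivf_pq.hpp>
#include <raft/core/device_mdspan.hpp>
#include <raft/core/device_resources.hpp>

void cosine_ivf_pq_example(raft::device_resources const& handle,
                           raft::device_matrix_view<const float, int64_t> dataset,
                           raft::device_matrix_view<const float, int64_t> queries,
                           raft::device_matrix_view<int64_t, int64_t> neighbors,
                           raft::device_matrix_view<float, int64_t> distances)
{
  cuvs::neighbors::ivf_pq::index_params build_params;
  build_params.metric  = cuvs::distance::DistanceType::CosineExpanded;  // rows are L2-normalized internally
  build_params.n_lists = 1024;

  // An int8_t/uint8_t dataset would hit the RAFT_FAIL guards added in this patch.
  auto index = cuvs::neighbors::ivf_pq::build(handle, build_params, dataset);

  cuvs::neighbors::ivf_pq::search_params search_params;
  search_params.n_probes = 64;

  // After this patch the index argument of search() is const-qualified,
  // so a read-only index can be searched directly.
  cuvs::neighbors::ivf_pq::search(handle, search_params, index, queries, neighbors, distances);
}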
@@ -1632,6 +1642,14 @@ void extend(raft::resources const& handle, vec_batches.prefetch_next_batch(); for (const auto& vec_batch : vec_batches) { const auto& idx_batch = *idx_batches++; + if (index->metric() == CosineExpanded) { + auto vec_batch_view = raft::make_device_matrix_view( + const_cast(vec_batch.data()), vec_batch.size(), index->dim()); + raft::linalg::row_normalize(handle, + raft::make_const_mdspan(vec_batch_view), + vec_batch_view, + raft::linalg::NormType::L2Norm); + } process_and_fill_codes(handle, *index, vec_batch.data(), @@ -1683,6 +1701,13 @@ auto build(raft::resources const& handle, << (int)params.pq_dim << std::endl; RAFT_EXPECTS(n_rows > 0 && dim > 0, "empty dataset"); RAFT_EXPECTS(n_rows >= params.n_lists, "number of rows can't be less than n_lists"); + if (params.metric == distance::DistanceType::CosineExpanded) { + // TODO: support int8_t and uint8_t types (https://github.com/rapidsai/cuvs/issues/389) + if constexpr (std::is_same_v || std::is_same_v) + RAFT_FAIL( + "CosineExpanded distance metric is currently not supported for uint8_t and int8_t data " + "type"); + } auto stream = raft::resource::get_cuda_stream(handle); @@ -1761,6 +1786,11 @@ auto build(raft::resources const& handle, cuvs::cluster::kmeans::balanced_params kmeans_params; kmeans_params.n_iters = params.kmeans_n_iters; kmeans_params.metric = static_cast((int)index.metric()); + + if (index.metric() == distance::DistanceType::CosineExpanded) { + raft::linalg::row_normalize( + handle, trainset_const_view, trainset.view(), raft::linalg::NormType::L2Norm); + } cuvs::cluster::kmeans_balanced::fit( handle, kmeans_params, trainset_const_view, centers_view, utils::mapping{}); @@ -1768,6 +1798,10 @@ auto build(raft::resources const& handle, rmm::device_uvector labels(n_rows_train, stream, big_memory_resource); auto centers_const_view = raft::make_device_matrix_view( cluster_centers, index.n_lists(), index.dim()); + if (index.metric() == distance::DistanceType::CosineExpanded) { + raft::linalg::row_normalize( + handle, centers_const_view, centers_view, raft::linalg::NormType::L2Norm); + } auto labels_view = raft::make_device_vector_view(labels.data(), n_rows_train); cuvs::cluster::kmeans_balanced::predict(handle, diff --git a/cpp/src/neighbors/ivf_pq/ivf_pq_compute_similarity_impl.cuh b/cpp/src/neighbors/ivf_pq/ivf_pq_compute_similarity_impl.cuh index 8404ca1f9..fbbdd06c2 100644 --- a/cpp/src/neighbors/ivf_pq/ivf_pq_compute_similarity_impl.cuh +++ b/cpp/src/neighbors/ivf_pq/ivf_pq_compute_similarity_impl.cuh @@ -369,6 +369,7 @@ RAFT_KERNEL compute_similarity_kernel(uint32_t dim, reinterpret_cast(lut_end)[i] = query[i] - cluster_center[i]; } } break; + case distance::DistanceType::CosineExpanded: case distance::DistanceType::InnerProduct: { float2 pvals; for (uint32_t i = threadIdx.x; i < dim; i += blockDim.x) { @@ -408,6 +409,7 @@ RAFT_KERNEL compute_similarity_kernel(uint32_t dim, diff -= pq_c; score += diff * diff; } break; + case distance::DistanceType::CosineExpanded: case distance::DistanceType::InnerProduct: { // NB: we negate the scores as we hardcoded select-topk to always compute the minimum float q; @@ -485,6 +487,7 @@ RAFT_KERNEL compute_similarity_kernel(uint32_t dim, reinterpret_cast(pq_thread_data), lut_scores, early_stop_limit); + if (metric == distance::DistanceType::CosineExpanded) { score = OutT(1) + score; } } if constexpr (kManageLocalTopK) { block_topk.add(score, sample_offset + i); diff --git a/cpp/src/neighbors/ivf_pq/ivf_pq_search.cuh b/cpp/src/neighbors/ivf_pq/ivf_pq_search.cuh index 
e185f18dc..db8f9fbd3 100644 --- a/cpp/src/neighbors/ivf_pq/ivf_pq_search.cuh +++ b/cpp/src/neighbors/ivf_pq/ivf_pq_search.cuh @@ -37,6 +37,9 @@ #include #include #include +#include +#include +#include #include #include #include @@ -104,12 +107,21 @@ void select_clusters(raft::resources const& handle, This is a negative inner-product distance. We minimize it to find the similar clusters. + NB: qc_distances is NOT used further in ivfpq_search. + + Cosine distance: + `qc_distances[i, j] = - (queries[i], cluster_centers[j])` + + This is a negative inner-product distance. The queries and cluster centers are row normalized. + We minimize it to find the similar clusters. + NB: qc_distances is NOT used further in ivfpq_search. */ float norm_factor; switch (metric) { case cuvs::distance::DistanceType::L2SqrtExpanded: case cuvs::distance::DistanceType::L2Expanded: norm_factor = 1.0 / -2.0; break; + case cuvs::distance::DistanceType::CosineExpanded: case cuvs::distance::DistanceType::InnerProduct: norm_factor = 0.0; break; default: RAFT_FAIL("Unsupported distance type %d.", int(metric)); } @@ -133,6 +145,7 @@ void select_clusters(raft::resources const& handle, gemm_k = dim + 1; RAFT_EXPECTS(gemm_k <= dim_ext, "unexpected gemm_k or dim_ext"); } break; + case cuvs::distance::DistanceType::CosineExpanded: case cuvs::distance::DistanceType::InnerProduct: { alpha = -1.0; beta = 0.0; @@ -363,8 +376,9 @@ void ivfpq_search_worker(raft::resources const& handle, // stores basediff (query[i] - center[i]) precomp_data_count = index.rot_dim(); } break; + case distance::DistanceType::CosineExpanded: case distance::DistanceType::InnerProduct: { - // stores two components (query[i] * center[i], query[i] * center[i]) + // stores two components (query[i], query[i] * center[i]) precomp_data_count = index.rot_dim() * 2; } break; default: { @@ -457,8 +471,14 @@ void ivfpq_search_worker(raft::resources const& handle, num_samples_vector); // Postprocessing - ivf::detail::postprocess_distances( - distances, topk_dists.data(), index.metric(), n_queries, topK, scaling_factor, true, stream); + ivf::detail::postprocess_distances(distances, + topk_dists.data(), + index.metric(), + n_queries, + topK, + scaling_factor, + index.metric() != distance::DistanceType::CosineExpanded, + stream); ivf::detail::postprocess_neighbors(neighbors, neighbors_uint32, index.inds_ptrs().data_handle(), @@ -508,6 +528,7 @@ struct ivfpq_search { { bool signed_metric = false; switch (metric) { + case cuvs::distance::DistanceType::CosineExpanded: signed_metric = true; break; case cuvs::distance::DistanceType::InnerProduct: signed_metric = true; break; default: break; } @@ -606,6 +627,12 @@ inline void search(raft::resources const& handle, static_assert(std::is_same_v || std::is_same_v || std::is_same_v || std::is_same_v, "Unsupported element type."); + if (index.metric() == distance::DistanceType::CosineExpanded) { + if constexpr (std::is_same_v || std::is_same_v) + RAFT_FAIL( + "CosineExpanded distance metric is currently not supported for uint8_t and int8_t data " + "type"); + } raft::common::nvtx::range fun_scope( "ivf_pq::search(n_queries = %u, n_probes = %u, k = %u, dim = %zu)", n_queries, @@ -698,7 +725,14 @@ inline void search(raft::resources const& handle, rot_queries.data(), index.rot_dim(), stream); - + if (index.metric() == distance::DistanceType::CosineExpanded) { + auto rot_queries_view = raft::make_device_matrix_view( + rot_queries.data(), max_queries, index.rot_dim()); + raft::linalg::row_normalize(handle, + 
raft::make_const_mdspan(rot_queries_view), + rot_queries_view, + raft::linalg::NormType::L2Norm); + } for (uint32_t offset_b = 0; offset_b < queries_batch; offset_b += max_batch_size) { uint32_t batch_size = min(max_batch_size, queries_batch - offset_b); /* The distance calculation is done in the rotated/transformed space; diff --git a/cpp/src/neighbors/mg/generate_mg.py b/cpp/src/neighbors/mg/generate_mg.py new file mode 100644 index 000000000..af5e60545 --- /dev/null +++ b/cpp/src/neighbors/mg/generate_mg.py @@ -0,0 +1,286 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +header = """/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +""" + +include_macro = """ +#include "mg.cuh" +""" + +namespace_macro = """ +namespace cuvs::neighbors::mg { +""" + +footer = """ +} // namespace cuvs::neighbors::mg +""" + +flat_macro = """ +#define CUVS_INST_MG_FLAT(T, IdxT) \\ + index, T, IdxT> build(const raft::device_resources& handle, \\ + const mg::index_params& index_params, \\ + raft::host_matrix_view index_dataset) \\ + { \\ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \\ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \\ + cuvs::neighbors::mg::detail::build(handle, index, \\ + static_cast(&index_params), \\ + index_dataset); \\ + return index; \\ + } \\ + \\ + void extend(const raft::device_resources& handle, \\ + index, T, IdxT>& index, \\ + raft::host_matrix_view new_vectors, \\ + std::optional> new_indices) \\ + { \\ + cuvs::neighbors::mg::detail::extend(handle, index, new_vectors, new_indices); \\ + } \\ + \\ + void search(const raft::device_resources& handle, \\ + const index, T, IdxT>& index, \\ + const mg::search_params& search_params, \\ + raft::host_matrix_view queries, \\ + raft::host_matrix_view neighbors, \\ + raft::host_matrix_view distances, \\ + int64_t n_rows_per_batch) \\ + { \\ + cuvs::neighbors::mg::detail::search(handle, index, \\ + static_cast(&search_params), \\ + queries, neighbors, distances, n_rows_per_batch); \\ + } \\ + \\ + void serialize(const raft::device_resources& handle, \\ + const index, T, IdxT>& index, \\ + const std::string& filename) \\ + { \\ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \\ + } \\ + \\ + template<> \\ + index, T, IdxT> 
deserialize_flat(const raft::device_resources& handle, \\ + const std::string& filename) \\ + { \\ + auto idx = index, T, IdxT>(handle, filename); \\ + return idx; \\ + } \\ + \\ + template<> \\ + index, T, IdxT> distribute_flat(const raft::device_resources& handle, \\ + const std::string& filename) \\ + { \\ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \\ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \\ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \\ + return idx; \\ + } +""" + +pq_macro = """ +#define CUVS_INST_MG_PQ(T, IdxT) \\ + index, T, IdxT> build(const raft::device_resources& handle, \\ + const mg::index_params& index_params, \\ + raft::host_matrix_view index_dataset) \\ + { \\ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \\ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \\ + cuvs::neighbors::mg::detail::build(handle, index, \\ + static_cast(&index_params), \\ + index_dataset); \\ + return index; \\ + } \\ + \\ + void extend(const raft::device_resources& handle, \\ + index, T, IdxT>& index, \\ + raft::host_matrix_view new_vectors, \\ + std::optional> new_indices) \\ + { \\ + cuvs::neighbors::mg::detail::extend(handle, index, new_vectors, new_indices); \\ + } \\ + \\ + void search(const raft::device_resources& handle, \\ + const index, T, IdxT>& index, \\ + const mg::search_params& search_params, \\ + raft::host_matrix_view queries, \\ + raft::host_matrix_view neighbors, \\ + raft::host_matrix_view distances, \\ + int64_t n_rows_per_batch) \\ + { \\ + cuvs::neighbors::mg::detail::search(handle, index, \\ + static_cast(&search_params), \\ + queries, neighbors, distances, n_rows_per_batch); \\ + } \\ + \\ + void serialize(const raft::device_resources& handle, \\ + const index, T, IdxT>& index, \\ + const std::string& filename) \\ + { \\ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \\ + } \\ + \\ + template<> \\ + index, T, IdxT> deserialize_pq(const raft::device_resources& handle, \\ + const std::string& filename) \\ + { \\ + auto idx = index, T, IdxT>(handle, filename); \\ + return idx; \\ + } \\ + \\ + template<> \\ + index, T, IdxT> distribute_pq(const raft::device_resources& handle, \\ + const std::string& filename) \\ + { \\ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \\ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \\ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \\ + return idx; \\ + } +""" + +cagra_macro = """ +#define CUVS_INST_MG_CAGRA(T, IdxT) \\ + index, T, IdxT> build(const raft::device_resources& handle, \\ + const mg::index_params& index_params, \\ + raft::host_matrix_view index_dataset) \\ + { \\ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \\ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \\ + cuvs::neighbors::mg::detail::build(handle, index, \\ + static_cast(&index_params), \\ + index_dataset); \\ + return index; \\ + } \\ + \\ + void search(const raft::device_resources& handle, \\ + const index, T, IdxT>& index, \\ + const mg::search_params& search_params, \\ + raft::host_matrix_view queries, \\ + raft::host_matrix_view neighbors, \\ + raft::host_matrix_view distances, \\ + int64_t n_rows_per_batch) \\ + { \\ + cuvs::neighbors::mg::detail::search(handle, index, \\ + static_cast(&search_params), \\ + queries, neighbors, distances, n_rows_per_batch); \\ + } \\ 
+ \\ + void serialize(const raft::device_resources& handle, \\ + const index, T, IdxT>& index, \\ + const std::string& filename) \\ + { \\ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \\ + } \\ + \\ + template<> \\ + index, T, IdxT> deserialize_cagra(const raft::device_resources& handle, \\ + const std::string& filename) \\ + { \\ + auto idx = index, T, IdxT>(handle, filename); \\ + return idx; \\ + } \\ + \\ + template<> \\ + index, T, IdxT> distribute_cagra(const raft::device_resources& handle, \\ + const std::string& filename) \\ + { \\ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \\ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \\ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \\ + return idx; \\ + } +""" + +flat_macros = dict ( + flat = dict( + include=include_macro, + definition=flat_macro, + name="CUVS_INST_MG_FLAT", + ) +) + +pq_macros = dict ( + pq = dict( + include=include_macro, + definition=pq_macro, + name="CUVS_INST_MG_PQ", + ) +) + +cagra_macros = dict ( + cagra = dict( + include=include_macro, + definition=cagra_macro, + name="CUVS_INST_MG_CAGRA", + ) +) + +flat_types = dict( + float_int64_t=("float", "int64_t"), + int8_t_int64_t=("int8_t", "int64_t"), + uint8_t_int64_t=("uint8_t", "int64_t"), +) + +pq_types = dict( + float_int64_t=("float", "int64_t"), + half_int64_t=("half", "int64_t"), + int8_t_int64_t=("int8_t", "int64_t"), + uint8_t_int64_t=("uint8_t", "int64_t"), +) + +cagra_types = dict( + float_uint32_t=("float", "uint32_t"), + half_uint32_t=("half", "uint32_t"), + int8_t_uint32_t=("int8_t", "uint32_t"), + uint8_t_uint32_t=("uint8_t", "uint32_t"), +) + +for macros, types in [(flat_macros, flat_types), (pq_macros, pq_types), (cagra_macros, cagra_types)]: + for type_path, (T, IdxT) in types.items(): + for macro_path, macro in macros.items(): + path = f"mg_{macro_path}_{type_path}.cu" + with open(path, "w") as f: + f.write(header) + f.write(macro['include']) + f.write(namespace_macro) + f.write(macro["definition"]) + f.write(f"{macro['name']}({T}, {IdxT});\n\n") + f.write(f"#undef {macro['name']}\n") + f.write(footer) + + print(f"src/neighbors/mg/{path}") diff --git a/cpp/src/neighbors/mg/mg.cuh b/cpp/src/neighbors/mg/mg.cuh new file mode 100644 index 000000000..d3f635bc4 --- /dev/null +++ b/cpp/src/neighbors/mg/mg.cuh @@ -0,0 +1,690 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include "../detail/knn_merge_parts.cuh" +#include +#include +#include +#include + +#include +#include + +namespace cuvs::neighbors { +using namespace raft; + +template +void search(const raft::device_resources& handle, + const cuvs::neighbors::iface& interface, + const cuvs::neighbors::search_params* search_params, + raft::host_matrix_view h_queries, + raft::device_matrix_view d_neighbors, + raft::device_matrix_view d_distances); +} // namespace cuvs::neighbors + +namespace cuvs::neighbors::mg { +void check_omp_threads(const int requirements); +} // namespace cuvs::neighbors::mg + +namespace cuvs::neighbors::mg::detail { +using namespace cuvs::neighbors; +using namespace raft; + +// local index deserialization and distribution +template +void deserialize_and_distribute(const raft::device_resources& handle, + index& index, + const std::string& filename) +{ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); + for (int rank = 0; rank < index.num_ranks_; rank++) { + int dev_id = clique.device_ids_[rank]; + const raft::device_resources& dev_res = clique.device_resources_[rank]; + RAFT_CUDA_TRY(cudaSetDevice(dev_id)); + auto& ann_if = index.ann_interfaces_.emplace_back(); + cuvs::neighbors::deserialize(dev_res, ann_if, filename); + } +} + +// MG index deserialization +template +void deserialize(const raft::device_resources& handle, + index& index, + const std::string& filename) +{ + std::ifstream is(filename, std::ios::in | std::ios::binary); + if (!is) { RAFT_FAIL("Cannot open file %s", filename.c_str()); } + + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); + + index.mode_ = (cuvs::neighbors::mg::distribution_mode)deserialize_scalar(handle, is); + index.num_ranks_ = deserialize_scalar(handle, is); + + if (index.num_ranks_ != clique.num_ranks_) { + RAFT_FAIL("Serialized index has %d ranks whereas NCCL clique has %d ranks", + index.num_ranks_, + clique.num_ranks_); + } + + for (int rank = 0; rank < index.num_ranks_; rank++) { + int dev_id = clique.device_ids_[rank]; + const raft::device_resources& dev_res = clique.device_resources_[rank]; + RAFT_CUDA_TRY(cudaSetDevice(dev_id)); + auto& ann_if = index.ann_interfaces_.emplace_back(); + cuvs::neighbors::deserialize(dev_res, ann_if, is); + } + + is.close(); +} + +template +void build(const raft::device_resources& handle, + index& index, + const cuvs::neighbors::index_params* index_params, + raft::host_matrix_view index_dataset) +{ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); + + if (index.mode_ == REPLICATED) { + int64_t n_rows = index_dataset.extent(0); + RAFT_LOG_INFO("REPLICATED BUILD: %d*%drows", index.num_ranks_, n_rows); + + index.ann_interfaces_.resize(index.num_ranks_); +#pragma omp parallel for + for (int rank = 0; rank < index.num_ranks_; rank++) { + int dev_id = clique.device_ids_[rank]; + const raft::device_resources& dev_res = clique.device_resources_[rank]; + RAFT_CUDA_TRY(cudaSetDevice(dev_id)); + auto& ann_if = index.ann_interfaces_[rank]; + cuvs::neighbors::build(dev_res, ann_if, index_params, index_dataset); + resource::sync_stream(dev_res); + } + } else if (index.mode_ == SHARDED) { + int64_t n_rows = index_dataset.extent(0); + int64_t n_cols = index_dataset.extent(1); + int64_t n_rows_per_shard = raft::ceildiv(n_rows, (int64_t)index.num_ranks_); + + RAFT_LOG_INFO("SHARDED BUILD: %d*%drows", index.num_ranks_, n_rows_per_shard); + + index.ann_interfaces_.resize(index.num_ranks_); +#pragma omp parallel 
for + for (int rank = 0; rank < index.num_ranks_; rank++) { + int dev_id = clique.device_ids_[rank]; + const raft::device_resources& dev_res = clique.device_resources_[rank]; + RAFT_CUDA_TRY(cudaSetDevice(dev_id)); + int64_t offset = rank * n_rows_per_shard; + int64_t n_rows_of_current_shard = std::min(n_rows_per_shard, n_rows - offset); + const T* partition_ptr = index_dataset.data_handle() + (offset * n_cols); + auto partition = raft::make_host_matrix_view( + partition_ptr, n_rows_of_current_shard, n_cols); + auto& ann_if = index.ann_interfaces_[rank]; + cuvs::neighbors::build(dev_res, ann_if, index_params, partition); + resource::sync_stream(dev_res); + } + } +} + +template +void extend(const raft::device_resources& handle, + index& index, + raft::host_matrix_view new_vectors, + std::optional> new_indices) +{ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); + + int64_t n_rows = new_vectors.extent(0); + if (index.mode_ == REPLICATED) { + RAFT_LOG_INFO("REPLICATED EXTEND: %d*%drows", index.num_ranks_, n_rows); + +#pragma omp parallel for + for (int rank = 0; rank < index.num_ranks_; rank++) { + int dev_id = clique.device_ids_[rank]; + const raft::device_resources& dev_res = clique.device_resources_[rank]; + RAFT_CUDA_TRY(cudaSetDevice(dev_id)); + auto& ann_if = index.ann_interfaces_[rank]; + cuvs::neighbors::extend(dev_res, ann_if, new_vectors, new_indices); + resource::sync_stream(dev_res); + } + } else if (index.mode_ == SHARDED) { + int64_t n_cols = new_vectors.extent(1); + int64_t n_rows_per_shard = raft::ceildiv(n_rows, (int64_t)index.num_ranks_); + + RAFT_LOG_INFO("SHARDED EXTEND: %d*%drows", index.num_ranks_, n_rows_per_shard); + +#pragma omp parallel for + for (int rank = 0; rank < index.num_ranks_; rank++) { + int dev_id = clique.device_ids_[rank]; + const raft::device_resources& dev_res = clique.device_resources_[rank]; + RAFT_CUDA_TRY(cudaSetDevice(dev_id)); + int64_t offset = rank * n_rows_per_shard; + int64_t n_rows_of_current_shard = std::min(n_rows_per_shard, n_rows - offset); + const T* new_vectors_ptr = new_vectors.data_handle() + (offset * n_cols); + auto new_vectors_part = raft::make_host_matrix_view( + new_vectors_ptr, n_rows_of_current_shard, n_cols); + + std::optional> new_indices_part = std::nullopt; + if (new_indices.has_value()) { + const IdxT* new_indices_ptr = new_indices.value().data_handle() + offset; + new_indices_part = raft::make_host_vector_view( + new_indices_ptr, n_rows_of_current_shard); + } + auto& ann_if = index.ann_interfaces_[rank]; + cuvs::neighbors::extend(dev_res, ann_if, new_vectors_part, new_indices_part); + resource::sync_stream(dev_res); + } + } +} + +template +void sharded_search_with_direct_merge(const raft::comms::nccl_clique& clique, + const index& index, + const cuvs::neighbors::search_params* search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch, + int64_t n_rows, + int64_t n_cols, + int64_t n_neighbors, + int64_t n_batches) +{ + const auto& root_handle = clique.set_current_device_to_root_rank(); + auto in_neighbors = raft::make_device_matrix( + root_handle, index.num_ranks_ * n_rows_per_batch, n_neighbors); + auto in_distances = raft::make_device_matrix( + root_handle, index.num_ranks_ * n_rows_per_batch, n_neighbors); + auto out_neighbors = + raft::make_device_matrix(root_handle, n_rows_per_batch, n_neighbors); + auto out_distances = + raft::make_device_matrix(root_handle, n_rows_per_batch, n_neighbors); + + 
for (int64_t batch_idx = 0; batch_idx < n_batches; batch_idx++) { + int64_t offset = batch_idx * n_rows_per_batch; + int64_t query_offset = offset * n_cols; + int64_t output_offset = offset * n_neighbors; + int64_t n_rows_of_current_batch = std::min((int64_t)n_rows_per_batch, n_rows - offset); + int64_t part_size = n_rows_of_current_batch * n_neighbors; + auto query_partition = raft::make_host_matrix_view( + queries.data_handle() + query_offset, n_rows_of_current_batch, n_cols); + + const int& requirements = index.num_ranks_; + check_omp_threads(requirements); // should use at least num_ranks_ threads to avoid NCCL hang +#pragma omp parallel for num_threads(index.num_ranks_) + for (int rank = 0; rank < index.num_ranks_; rank++) { + int dev_id = clique.device_ids_[rank]; + const raft::device_resources& dev_res = clique.device_resources_[rank]; + auto& ann_if = index.ann_interfaces_[rank]; + RAFT_CUDA_TRY(cudaSetDevice(dev_id)); + + if (rank == clique.root_rank_) { // root rank + uint64_t batch_offset = clique.root_rank_ * part_size; + auto d_neighbors = raft::make_device_matrix_view( + in_neighbors.data_handle() + batch_offset, n_rows_of_current_batch, n_neighbors); + auto d_distances = raft::make_device_matrix_view( + in_distances.data_handle() + batch_offset, n_rows_of_current_batch, n_neighbors); + cuvs::neighbors::search( + dev_res, ann_if, search_params, query_partition, d_neighbors, d_distances); + + // wait for other ranks + ncclGroupStart(); + for (int from_rank = 0; from_rank < index.num_ranks_; from_rank++) { + if (from_rank == clique.root_rank_) continue; + + batch_offset = from_rank * part_size; + ncclRecv(in_neighbors.data_handle() + batch_offset, + part_size * sizeof(IdxT), + ncclUint8, + from_rank, + clique.nccl_comms_[rank], + resource::get_cuda_stream(dev_res)); + ncclRecv(in_distances.data_handle() + batch_offset, + part_size * sizeof(float), + ncclUint8, + from_rank, + clique.nccl_comms_[rank], + resource::get_cuda_stream(dev_res)); + } + ncclGroupEnd(); + resource::sync_stream(dev_res); + } else { // non-root ranks + auto d_neighbors = raft::make_device_matrix( + dev_res, n_rows_of_current_batch, n_neighbors); + auto d_distances = raft::make_device_matrix( + dev_res, n_rows_of_current_batch, n_neighbors); + cuvs::neighbors::search( + dev_res, ann_if, search_params, query_partition, d_neighbors.view(), d_distances.view()); + + // send results to root rank + ncclGroupStart(); + ncclSend(d_neighbors.data_handle(), + part_size * sizeof(IdxT), + ncclUint8, + clique.root_rank_, + clique.nccl_comms_[rank], + resource::get_cuda_stream(dev_res)); + ncclSend(d_distances.data_handle(), + part_size * sizeof(float), + ncclUint8, + clique.root_rank_, + clique.nccl_comms_[rank], + resource::get_cuda_stream(dev_res)); + ncclGroupEnd(); + resource::sync_stream(dev_res); + } + } + + const auto& root_handle_ = clique.set_current_device_to_root_rank(); + auto h_trans = std::vector(index.num_ranks_); + int64_t translation_offset = 0; + for (int rank = 0; rank < index.num_ranks_; rank++) { + h_trans[rank] = translation_offset; + translation_offset += index.ann_interfaces_[rank].size(); + } + auto d_trans = raft::make_device_vector(root_handle_, index.num_ranks_); + raft::copy(d_trans.data_handle(), + h_trans.data(), + index.num_ranks_, + resource::get_cuda_stream(root_handle_)); + + cuvs::neighbors::detail::knn_merge_parts(in_distances.data_handle(), + in_neighbors.data_handle(), + out_distances.data_handle(), + out_neighbors.data_handle(), + n_rows_of_current_batch, + index.num_ranks_, + 
n_neighbors, + resource::get_cuda_stream(root_handle_), + d_trans.data_handle()); + + raft::copy(neighbors.data_handle() + output_offset, + out_neighbors.data_handle(), + part_size, + resource::get_cuda_stream(root_handle_)); + raft::copy(distances.data_handle() + output_offset, + out_distances.data_handle(), + part_size, + resource::get_cuda_stream(root_handle_)); + + resource::sync_stream(root_handle_); + } +} + +template +void sharded_search_with_tree_merge(const raft::comms::nccl_clique& clique, + const index& index, + const cuvs::neighbors::search_params* search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch, + int64_t n_rows, + int64_t n_cols, + int64_t n_neighbors, + int64_t n_batches) +{ + for (int64_t batch_idx = 0; batch_idx < n_batches; batch_idx++) { + int64_t offset = batch_idx * n_rows_per_batch; + int64_t query_offset = offset * n_cols; + int64_t output_offset = offset * n_neighbors; + int64_t n_rows_of_current_batch = std::min((int64_t)n_rows_per_batch, n_rows - offset); + auto query_partition = raft::make_host_matrix_view( + queries.data_handle() + query_offset, n_rows_of_current_batch, n_cols); + + const int& requirements = index.num_ranks_; + check_omp_threads(requirements); // should use at least num_ranks_ threads to avoid NCCL hang +#pragma omp parallel for num_threads(index.num_ranks_) + for (int rank = 0; rank < index.num_ranks_; rank++) { + int dev_id = clique.device_ids_[rank]; + const raft::device_resources& dev_res = clique.device_resources_[rank]; + auto& ann_if = index.ann_interfaces_[rank]; + RAFT_CUDA_TRY(cudaSetDevice(dev_id)); + + int64_t part_size = n_rows_of_current_batch * n_neighbors; + + auto tmp_neighbors = raft::make_device_matrix( + dev_res, 2 * n_rows_of_current_batch, n_neighbors); + auto tmp_distances = raft::make_device_matrix( + dev_res, 2 * n_rows_of_current_batch, n_neighbors); + auto neighbors_view = raft::make_device_matrix_view( + tmp_neighbors.data_handle(), n_rows_of_current_batch, n_neighbors); + auto distances_view = raft::make_device_matrix_view( + tmp_distances.data_handle(), n_rows_of_current_batch, n_neighbors); + cuvs::neighbors::search( + dev_res, ann_if, search_params, query_partition, neighbors_view, distances_view); + + int64_t translation_offset = 0; + for (int r = 0; r < rank; r++) { + translation_offset += index.ann_interfaces_[r].size(); + } + raft::linalg::addScalar(neighbors_view.data_handle(), + neighbors_view.data_handle(), + (IdxT)translation_offset, + part_size, + resource::get_cuda_stream(dev_res)); + + auto d_trans = raft::make_device_vector(dev_res, 2); + cudaMemsetAsync( + d_trans.data_handle(), 0, 2 * sizeof(IdxT), resource::get_cuda_stream(dev_res)); + + int64_t remaining = index.num_ranks_; + int64_t radix = 2; + + while (remaining > 1) { + bool received_something = false; + int64_t offset = radix / 2; + ncclGroupStart(); + if (rank % radix == 0) // This is one of the receivers + { + int other_id = rank + offset; + if (other_id < index.num_ranks_) // Make sure someone's sending anything + { + ncclRecv(tmp_neighbors.data_handle() + part_size, + part_size * sizeof(IdxT), + ncclUint8, + other_id, + clique.nccl_comms_[rank], + resource::get_cuda_stream(dev_res)); + ncclRecv(tmp_distances.data_handle() + part_size, + part_size * sizeof(float), + ncclUint8, + other_id, + clique.nccl_comms_[rank], + resource::get_cuda_stream(dev_res)); + received_something = true; + } + } else if (rank % radix == offset) // This is one of 
the senders + { + int other_id = rank - offset; + ncclSend(tmp_neighbors.data_handle(), + part_size * sizeof(IdxT), + ncclUint8, + other_id, + clique.nccl_comms_[rank], + resource::get_cuda_stream(dev_res)); + ncclSend(tmp_distances.data_handle(), + part_size * sizeof(float), + ncclUint8, + other_id, + clique.nccl_comms_[rank], + resource::get_cuda_stream(dev_res)); + } + ncclGroupEnd(); + + remaining = (remaining + 1) / 2; + radix *= 2; + + if (received_something) { + // merge inplace + cuvs::neighbors::detail::knn_merge_parts(tmp_distances.data_handle(), + tmp_neighbors.data_handle(), + tmp_distances.data_handle(), + tmp_neighbors.data_handle(), + n_rows_of_current_batch, + 2, + n_neighbors, + resource::get_cuda_stream(dev_res), + d_trans.data_handle()); + + // If done, copy the final result + if (remaining <= 1) { + raft::copy(neighbors.data_handle() + output_offset, + tmp_neighbors.data_handle(), + part_size, + resource::get_cuda_stream(dev_res)); + raft::copy(distances.data_handle() + output_offset, + tmp_distances.data_handle(), + part_size, + resource::get_cuda_stream(dev_res)); + + resource::sync_stream(dev_res); + } + } + } + } + } +} + +template +void run_search_batch(const raft::comms::nccl_clique& clique, + const index& index, + int rank, + const cuvs::neighbors::search_params* search_params, + raft::host_matrix_view& queries, + raft::host_matrix_view& neighbors, + raft::host_matrix_view& distances, + int64_t query_offset, + int64_t output_offset, + int64_t n_rows_of_current_batch, + int64_t n_cols, + int64_t n_neighbors) +{ + int dev_id = clique.device_ids_[rank]; + RAFT_CUDA_TRY(cudaSetDevice(dev_id)); + const raft::device_resources& dev_res = clique.device_resources_[rank]; + auto& ann_if = index.ann_interfaces_[rank]; + + auto query_partition = raft::make_host_matrix_view( + queries.data_handle() + query_offset, n_rows_of_current_batch, n_cols); + auto d_neighbors = raft::make_device_matrix( + dev_res, n_rows_of_current_batch, n_neighbors); + auto d_distances = raft::make_device_matrix( + dev_res, n_rows_of_current_batch, n_neighbors); + + cuvs::neighbors::search( + dev_res, ann_if, search_params, query_partition, d_neighbors.view(), d_distances.view()); + + raft::copy(neighbors.data_handle() + output_offset, + d_neighbors.data_handle(), + n_rows_of_current_batch * n_neighbors, + resource::get_cuda_stream(dev_res)); + raft::copy(distances.data_handle() + output_offset, + d_distances.data_handle(), + n_rows_of_current_batch * n_neighbors, + resource::get_cuda_stream(dev_res)); + + resource::sync_stream(dev_res); +} + +template +void search(const raft::device_resources& handle, + const index& index, + const cuvs::neighbors::search_params* search_params, + raft::host_matrix_view queries, + raft::host_matrix_view neighbors, + raft::host_matrix_view distances, + int64_t n_rows_per_batch) +{ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); + + int64_t n_rows = queries.extent(0); + int64_t n_cols = queries.extent(1); + int64_t n_neighbors = neighbors.extent(1); + + if (index.mode_ == REPLICATED) { + cuvs::neighbors::mg::replicated_search_mode search_mode; + if constexpr (std::is_same>::value) { + const cuvs::neighbors::mg::search_params* mg_search_params = + static_cast*>( + search_params); + search_mode = mg_search_params->search_mode; + } else if constexpr (std::is_same>::value) { + const cuvs::neighbors::mg::search_params* mg_search_params = + static_cast*>( + search_params); + search_mode = mg_search_params->search_mode; + } else if constexpr 
(std::is_same>::value) { + const cuvs::neighbors::mg::search_params* mg_search_params = + static_cast*>(search_params); + search_mode = mg_search_params->search_mode; + } + + if (search_mode == LOAD_BALANCER) { + int64_t n_rows_per_rank = raft::ceildiv(n_rows, (int64_t)index.num_ranks_); + n_rows_per_batch = + std::min(n_rows_per_batch, n_rows_per_rank); // get at least num_ranks_ batches + int64_t n_batches = raft::ceildiv(n_rows, (int64_t)n_rows_per_batch); + if (n_batches <= 1) n_rows_per_batch = n_rows; + + RAFT_LOG_INFO( + "REPLICATED SEARCH IN LOAD BALANCER MODE: %d*%drows", n_batches, n_rows_per_batch); + +#pragma omp parallel for + for (int64_t batch_idx = 0; batch_idx < n_batches; batch_idx++) { + int rank = batch_idx % index.num_ranks_; // alternate GPUs + int64_t offset = batch_idx * n_rows_per_batch; + int64_t query_offset = offset * n_cols; + int64_t output_offset = offset * n_neighbors; + int64_t n_rows_of_current_batch = std::min(n_rows_per_batch, n_rows - offset); + + run_search_batch(clique, + index, + rank, + search_params, + queries, + neighbors, + distances, + query_offset, + output_offset, + n_rows_of_current_batch, + n_cols, + n_neighbors); + } + } else if (search_mode == ROUND_ROBIN) { + RAFT_LOG_INFO("REPLICATED SEARCH IN ROUND ROBIN MODE: %d*%drows", 1, n_rows); + + ASSERT(n_rows <= n_rows_per_batch, + "In round-robin mode, n_rows must lower or equal to n_rows_per_batch"); + + auto& rrc = *index.round_robin_counter_; + int64_t rank = rrc++; + rank %= index.num_ranks_; + + run_search_batch(clique, + index, + rank, + search_params, + queries, + neighbors, + distances, + 0, + 0, + n_rows, + n_cols, + n_neighbors); + } + } else if (index.mode_ == SHARDED) { + cuvs::neighbors::mg::sharded_merge_mode merge_mode; + if constexpr (std::is_same>::value) { + const cuvs::neighbors::mg::search_params* mg_search_params = + static_cast*>( + search_params); + merge_mode = mg_search_params->merge_mode; + } else if constexpr (std::is_same>::value) { + const cuvs::neighbors::mg::search_params* mg_search_params = + static_cast*>( + search_params); + merge_mode = mg_search_params->merge_mode; + } else if constexpr (std::is_same>::value) { + const cuvs::neighbors::mg::search_params* mg_search_params = + static_cast*>(search_params); + merge_mode = mg_search_params->merge_mode; + } + + int64_t n_batches = raft::ceildiv(n_rows, (int64_t)n_rows_per_batch); + if (n_batches <= 1) n_rows_per_batch = n_rows; + + if (merge_mode == MERGE_ON_ROOT_RANK) { + RAFT_LOG_INFO("SHARDED SEARCH WITH MERGE_ON_ROOT_RANK MERGE MODE: %d*%drows", + n_batches, + n_rows_per_batch); + sharded_search_with_direct_merge(clique, + index, + search_params, + queries, + neighbors, + distances, + n_rows_per_batch, + n_rows, + n_cols, + n_neighbors, + n_batches); + } else if (merge_mode == TREE_MERGE) { + RAFT_LOG_INFO( + "SHARDED SEARCH WITH TREE_MERGE MERGE MODE %d*%drows", n_batches, n_rows_per_batch); + sharded_search_with_tree_merge(clique, + index, + search_params, + queries, + neighbors, + distances, + n_rows_per_batch, + n_rows, + n_cols, + n_neighbors, + n_batches); + } + } +} + +template +void serialize(const raft::device_resources& handle, + const index& index, + const std::string& filename) +{ + std::ofstream of(filename, std::ios::out | std::ios::binary); + if (!of) { RAFT_FAIL("Cannot open file %s", filename.c_str()); } + + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); + + serialize_scalar(handle, of, (int)index.mode_); + serialize_scalar(handle, of, index.num_ranks_); + + 
for (int rank = 0; rank < index.num_ranks_; rank++) { + int dev_id = clique.device_ids_[rank]; + const raft::device_resources& dev_res = clique.device_resources_[rank]; + RAFT_CUDA_TRY(cudaSetDevice(dev_id)); + auto& ann_if = index.ann_interfaces_[rank]; + cuvs::neighbors::serialize(dev_res, ann_if, of); + } + + of.close(); + if (!of) { RAFT_FAIL("Error writing output %s", filename.c_str()); } +} + +} // namespace cuvs::neighbors::mg::detail + +namespace cuvs::neighbors::mg { +using namespace cuvs::neighbors; +using namespace raft; + +template +index::index(distribution_mode mode, int num_ranks_) + : mode_(mode), + num_ranks_(num_ranks_), + round_robin_counter_(std::make_shared>(0)) +{ +} + +template +index::index(const raft::device_resources& handle, + const std::string& filename) + : round_robin_counter_(std::make_shared>(0)) +{ + cuvs::neighbors::mg::detail::deserialize(handle, *this, filename); +} +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/mg_cagra_float_uint32_t.cu b/cpp/src/neighbors/mg/mg_cagra_float_uint32_t.cu new file mode 100644 index 000000000..b11610fb4 --- /dev/null +++ b/cpp/src/neighbors/mg/mg_cagra_float_uint32_t.cu @@ -0,0 +1,92 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +#include "mg.cuh" + +namespace cuvs::neighbors::mg { + +#define CUVS_INST_MG_CAGRA(T, IdxT) \ + index, T, IdxT> build( \ + const raft::device_resources& handle, \ + const mg::index_params& index_params, \ + raft::host_matrix_view index_dataset) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::build( \ + handle, \ + index, \ + static_cast(&index_params), \ + index_dataset); \ + return index; \ + } \ + \ + void search(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const mg::search_params& search_params, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances, \ + int64_t n_rows_per_batch) \ + { \ + cuvs::neighbors::mg::detail::search( \ + handle, \ + index, \ + static_cast(&search_params), \ + queries, \ + neighbors, \ + distances, \ + n_rows_per_batch); \ + } \ + \ + void serialize(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const std::string& filename) \ + { \ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \ + } \ + \ + template <> \ + index, T, IdxT> deserialize_cagra( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + auto idx = index, T, IdxT>(handle, filename); \ + return idx; \ + } \ + \ + template <> \ + index, T, IdxT> distribute_cagra( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \ + return idx; \ + } +CUVS_INST_MG_CAGRA(float, uint32_t); + +#undef CUVS_INST_MG_CAGRA + +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/mg_cagra_half_uint32_t.cu b/cpp/src/neighbors/mg/mg_cagra_half_uint32_t.cu new file mode 100644 index 000000000..8f76c69a3 --- /dev/null +++ b/cpp/src/neighbors/mg/mg_cagra_half_uint32_t.cu @@ -0,0 +1,92 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +#include "mg.cuh" + +namespace cuvs::neighbors::mg { + +#define CUVS_INST_MG_CAGRA(T, IdxT) \ + index, T, IdxT> build( \ + const raft::device_resources& handle, \ + const mg::index_params& index_params, \ + raft::host_matrix_view index_dataset) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::build( \ + handle, \ + index, \ + static_cast(&index_params), \ + index_dataset); \ + return index; \ + } \ + \ + void search(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const mg::search_params& search_params, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances, \ + int64_t n_rows_per_batch) \ + { \ + cuvs::neighbors::mg::detail::search( \ + handle, \ + index, \ + static_cast(&search_params), \ + queries, \ + neighbors, \ + distances, \ + n_rows_per_batch); \ + } \ + \ + void serialize(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const std::string& filename) \ + { \ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \ + } \ + \ + template <> \ + index, T, IdxT> deserialize_cagra( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + auto idx = index, T, IdxT>(handle, filename); \ + return idx; \ + } \ + \ + template <> \ + index, T, IdxT> distribute_cagra( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \ + return idx; \ + } +CUVS_INST_MG_CAGRA(half, uint32_t); + +#undef CUVS_INST_MG_CAGRA + +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/mg_cagra_int8_t_uint32_t.cu b/cpp/src/neighbors/mg/mg_cagra_int8_t_uint32_t.cu new file mode 100644 index 000000000..67b88d742 --- /dev/null +++ b/cpp/src/neighbors/mg/mg_cagra_int8_t_uint32_t.cu @@ -0,0 +1,92 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +#include "mg.cuh" + +namespace cuvs::neighbors::mg { + +#define CUVS_INST_MG_CAGRA(T, IdxT) \ + index, T, IdxT> build( \ + const raft::device_resources& handle, \ + const mg::index_params& index_params, \ + raft::host_matrix_view index_dataset) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::build( \ + handle, \ + index, \ + static_cast(&index_params), \ + index_dataset); \ + return index; \ + } \ + \ + void search(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const mg::search_params& search_params, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances, \ + int64_t n_rows_per_batch) \ + { \ + cuvs::neighbors::mg::detail::search( \ + handle, \ + index, \ + static_cast(&search_params), \ + queries, \ + neighbors, \ + distances, \ + n_rows_per_batch); \ + } \ + \ + void serialize(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const std::string& filename) \ + { \ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \ + } \ + \ + template <> \ + index, T, IdxT> deserialize_cagra( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + auto idx = index, T, IdxT>(handle, filename); \ + return idx; \ + } \ + \ + template <> \ + index, T, IdxT> distribute_cagra( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \ + return idx; \ + } +CUVS_INST_MG_CAGRA(int8_t, uint32_t); + +#undef CUVS_INST_MG_CAGRA + +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/mg_cagra_uint8_t_uint32_t.cu b/cpp/src/neighbors/mg/mg_cagra_uint8_t_uint32_t.cu new file mode 100644 index 000000000..f72174923 --- /dev/null +++ b/cpp/src/neighbors/mg/mg_cagra_uint8_t_uint32_t.cu @@ -0,0 +1,92 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +#include "mg.cuh" + +namespace cuvs::neighbors::mg { + +#define CUVS_INST_MG_CAGRA(T, IdxT) \ + index, T, IdxT> build( \ + const raft::device_resources& handle, \ + const mg::index_params& index_params, \ + raft::host_matrix_view index_dataset) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::build( \ + handle, \ + index, \ + static_cast(&index_params), \ + index_dataset); \ + return index; \ + } \ + \ + void search(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const mg::search_params& search_params, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances, \ + int64_t n_rows_per_batch) \ + { \ + cuvs::neighbors::mg::detail::search( \ + handle, \ + index, \ + static_cast(&search_params), \ + queries, \ + neighbors, \ + distances, \ + n_rows_per_batch); \ + } \ + \ + void serialize(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const std::string& filename) \ + { \ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \ + } \ + \ + template <> \ + index, T, IdxT> deserialize_cagra( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + auto idx = index, T, IdxT>(handle, filename); \ + return idx; \ + } \ + \ + template <> \ + index, T, IdxT> distribute_cagra( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \ + return idx; \ + } +CUVS_INST_MG_CAGRA(uint8_t, uint32_t); + +#undef CUVS_INST_MG_CAGRA + +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/mg_flat_float_int64_t.cu b/cpp/src/neighbors/mg/mg_flat_float_int64_t.cu new file mode 100644 index 000000000..4495e2527 --- /dev/null +++ b/cpp/src/neighbors/mg/mg_flat_float_int64_t.cu @@ -0,0 +1,100 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +#include "mg.cuh" + +namespace cuvs::neighbors::mg { + +#define CUVS_INST_MG_FLAT(T, IdxT) \ + index, T, IdxT> build( \ + const raft::device_resources& handle, \ + const mg::index_params& index_params, \ + raft::host_matrix_view index_dataset) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::build( \ + handle, \ + index, \ + static_cast(&index_params), \ + index_dataset); \ + return index; \ + } \ + \ + void extend(const raft::device_resources& handle, \ + index, T, IdxT>& index, \ + raft::host_matrix_view new_vectors, \ + std::optional> new_indices) \ + { \ + cuvs::neighbors::mg::detail::extend(handle, index, new_vectors, new_indices); \ + } \ + \ + void search(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const mg::search_params& search_params, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances, \ + int64_t n_rows_per_batch) \ + { \ + cuvs::neighbors::mg::detail::search( \ + handle, \ + index, \ + static_cast(&search_params), \ + queries, \ + neighbors, \ + distances, \ + n_rows_per_batch); \ + } \ + \ + void serialize(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const std::string& filename) \ + { \ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \ + } \ + \ + template <> \ + index, T, IdxT> deserialize_flat( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + auto idx = index, T, IdxT>(handle, filename); \ + return idx; \ + } \ + \ + template <> \ + index, T, IdxT> distribute_flat( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \ + return idx; \ + } +CUVS_INST_MG_FLAT(float, int64_t); + +#undef CUVS_INST_MG_FLAT + +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/mg_flat_int8_t_int64_t.cu b/cpp/src/neighbors/mg/mg_flat_int8_t_int64_t.cu new file mode 100644 index 000000000..5494414a6 --- /dev/null +++ b/cpp/src/neighbors/mg/mg_flat_int8_t_int64_t.cu @@ -0,0 +1,100 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +#include "mg.cuh" + +namespace cuvs::neighbors::mg { + +#define CUVS_INST_MG_FLAT(T, IdxT) \ + index, T, IdxT> build( \ + const raft::device_resources& handle, \ + const mg::index_params& index_params, \ + raft::host_matrix_view index_dataset) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::build( \ + handle, \ + index, \ + static_cast(&index_params), \ + index_dataset); \ + return index; \ + } \ + \ + void extend(const raft::device_resources& handle, \ + index, T, IdxT>& index, \ + raft::host_matrix_view new_vectors, \ + std::optional> new_indices) \ + { \ + cuvs::neighbors::mg::detail::extend(handle, index, new_vectors, new_indices); \ + } \ + \ + void search(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const mg::search_params& search_params, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances, \ + int64_t n_rows_per_batch) \ + { \ + cuvs::neighbors::mg::detail::search( \ + handle, \ + index, \ + static_cast(&search_params), \ + queries, \ + neighbors, \ + distances, \ + n_rows_per_batch); \ + } \ + \ + void serialize(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const std::string& filename) \ + { \ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \ + } \ + \ + template <> \ + index, T, IdxT> deserialize_flat( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + auto idx = index, T, IdxT>(handle, filename); \ + return idx; \ + } \ + \ + template <> \ + index, T, IdxT> distribute_flat( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \ + return idx; \ + } +CUVS_INST_MG_FLAT(int8_t, int64_t); + +#undef CUVS_INST_MG_FLAT + +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/mg_flat_uint8_t_int64_t.cu b/cpp/src/neighbors/mg/mg_flat_uint8_t_int64_t.cu new file mode 100644 index 000000000..35df2146b --- /dev/null +++ b/cpp/src/neighbors/mg/mg_flat_uint8_t_int64_t.cu @@ -0,0 +1,100 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +#include "mg.cuh" + +namespace cuvs::neighbors::mg { + +#define CUVS_INST_MG_FLAT(T, IdxT) \ + index, T, IdxT> build( \ + const raft::device_resources& handle, \ + const mg::index_params& index_params, \ + raft::host_matrix_view index_dataset) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::build( \ + handle, \ + index, \ + static_cast(&index_params), \ + index_dataset); \ + return index; \ + } \ + \ + void extend(const raft::device_resources& handle, \ + index, T, IdxT>& index, \ + raft::host_matrix_view new_vectors, \ + std::optional> new_indices) \ + { \ + cuvs::neighbors::mg::detail::extend(handle, index, new_vectors, new_indices); \ + } \ + \ + void search(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const mg::search_params& search_params, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances, \ + int64_t n_rows_per_batch) \ + { \ + cuvs::neighbors::mg::detail::search( \ + handle, \ + index, \ + static_cast(&search_params), \ + queries, \ + neighbors, \ + distances, \ + n_rows_per_batch); \ + } \ + \ + void serialize(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const std::string& filename) \ + { \ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \ + } \ + \ + template <> \ + index, T, IdxT> deserialize_flat( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + auto idx = index, T, IdxT>(handle, filename); \ + return idx; \ + } \ + \ + template <> \ + index, T, IdxT> distribute_flat( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \ + return idx; \ + } +CUVS_INST_MG_FLAT(uint8_t, int64_t); + +#undef CUVS_INST_MG_FLAT + +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/mg_pq_float_int64_t.cu b/cpp/src/neighbors/mg/mg_pq_float_int64_t.cu new file mode 100644 index 000000000..c671740e6 --- /dev/null +++ b/cpp/src/neighbors/mg/mg_pq_float_int64_t.cu @@ -0,0 +1,100 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +#include "mg.cuh" + +namespace cuvs::neighbors::mg { + +#define CUVS_INST_MG_PQ(T, IdxT) \ + index, T, IdxT> build( \ + const raft::device_resources& handle, \ + const mg::index_params& index_params, \ + raft::host_matrix_view index_dataset) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::build( \ + handle, \ + index, \ + static_cast(&index_params), \ + index_dataset); \ + return index; \ + } \ + \ + void extend(const raft::device_resources& handle, \ + index, T, IdxT>& index, \ + raft::host_matrix_view new_vectors, \ + std::optional> new_indices) \ + { \ + cuvs::neighbors::mg::detail::extend(handle, index, new_vectors, new_indices); \ + } \ + \ + void search(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const mg::search_params& search_params, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances, \ + int64_t n_rows_per_batch) \ + { \ + cuvs::neighbors::mg::detail::search( \ + handle, \ + index, \ + static_cast(&search_params), \ + queries, \ + neighbors, \ + distances, \ + n_rows_per_batch); \ + } \ + \ + void serialize(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const std::string& filename) \ + { \ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \ + } \ + \ + template <> \ + index, T, IdxT> deserialize_pq( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + auto idx = index, T, IdxT>(handle, filename); \ + return idx; \ + } \ + \ + template <> \ + index, T, IdxT> distribute_pq(const raft::device_resources& handle, \ + const std::string& filename) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \ + return idx; \ + } +CUVS_INST_MG_PQ(float, int64_t); + +#undef CUVS_INST_MG_PQ + +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/mg_pq_half_int64_t.cu b/cpp/src/neighbors/mg/mg_pq_half_int64_t.cu new file mode 100644 index 000000000..b167239c6 --- /dev/null +++ b/cpp/src/neighbors/mg/mg_pq_half_int64_t.cu @@ -0,0 +1,100 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +#include "mg.cuh" + +namespace cuvs::neighbors::mg { + +#define CUVS_INST_MG_PQ(T, IdxT) \ + index, T, IdxT> build( \ + const raft::device_resources& handle, \ + const mg::index_params& index_params, \ + raft::host_matrix_view index_dataset) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::build( \ + handle, \ + index, \ + static_cast(&index_params), \ + index_dataset); \ + return index; \ + } \ + \ + void extend(const raft::device_resources& handle, \ + index, T, IdxT>& index, \ + raft::host_matrix_view new_vectors, \ + std::optional> new_indices) \ + { \ + cuvs::neighbors::mg::detail::extend(handle, index, new_vectors, new_indices); \ + } \ + \ + void search(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const mg::search_params& search_params, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances, \ + int64_t n_rows_per_batch) \ + { \ + cuvs::neighbors::mg::detail::search( \ + handle, \ + index, \ + static_cast(&search_params), \ + queries, \ + neighbors, \ + distances, \ + n_rows_per_batch); \ + } \ + \ + void serialize(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const std::string& filename) \ + { \ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \ + } \ + \ + template <> \ + index, T, IdxT> deserialize_pq( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + auto idx = index, T, IdxT>(handle, filename); \ + return idx; \ + } \ + \ + template <> \ + index, T, IdxT> distribute_pq(const raft::device_resources& handle, \ + const std::string& filename) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \ + return idx; \ + } +CUVS_INST_MG_PQ(half, int64_t); + +#undef CUVS_INST_MG_PQ + +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/mg_pq_int8_t_int64_t.cu b/cpp/src/neighbors/mg/mg_pq_int8_t_int64_t.cu new file mode 100644 index 000000000..127baf8fd --- /dev/null +++ b/cpp/src/neighbors/mg/mg_pq_int8_t_int64_t.cu @@ -0,0 +1,100 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +#include "mg.cuh" + +namespace cuvs::neighbors::mg { + +#define CUVS_INST_MG_PQ(T, IdxT) \ + index, T, IdxT> build( \ + const raft::device_resources& handle, \ + const mg::index_params& index_params, \ + raft::host_matrix_view index_dataset) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::build( \ + handle, \ + index, \ + static_cast(&index_params), \ + index_dataset); \ + return index; \ + } \ + \ + void extend(const raft::device_resources& handle, \ + index, T, IdxT>& index, \ + raft::host_matrix_view new_vectors, \ + std::optional> new_indices) \ + { \ + cuvs::neighbors::mg::detail::extend(handle, index, new_vectors, new_indices); \ + } \ + \ + void search(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const mg::search_params& search_params, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances, \ + int64_t n_rows_per_batch) \ + { \ + cuvs::neighbors::mg::detail::search( \ + handle, \ + index, \ + static_cast(&search_params), \ + queries, \ + neighbors, \ + distances, \ + n_rows_per_batch); \ + } \ + \ + void serialize(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const std::string& filename) \ + { \ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \ + } \ + \ + template <> \ + index, T, IdxT> deserialize_pq( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + auto idx = index, T, IdxT>(handle, filename); \ + return idx; \ + } \ + \ + template <> \ + index, T, IdxT> distribute_pq(const raft::device_resources& handle, \ + const std::string& filename) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \ + return idx; \ + } +CUVS_INST_MG_PQ(int8_t, int64_t); + +#undef CUVS_INST_MG_PQ + +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/mg_pq_uint8_t_int64_t.cu b/cpp/src/neighbors/mg/mg_pq_uint8_t_int64_t.cu new file mode 100644 index 000000000..869e009a5 --- /dev/null +++ b/cpp/src/neighbors/mg/mg_pq_uint8_t_int64_t.cu @@ -0,0 +1,100 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * NOTE: this file is generated by generate_mg.py + * + * Make changes there and run in this directory: + * + * > python generate_mg.py + * + */ + +#include "mg.cuh" + +namespace cuvs::neighbors::mg { + +#define CUVS_INST_MG_PQ(T, IdxT) \ + index, T, IdxT> build( \ + const raft::device_resources& handle, \ + const mg::index_params& index_params, \ + raft::host_matrix_view index_dataset) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + index, T, IdxT> index(index_params.mode, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::build( \ + handle, \ + index, \ + static_cast(&index_params), \ + index_dataset); \ + return index; \ + } \ + \ + void extend(const raft::device_resources& handle, \ + index, T, IdxT>& index, \ + raft::host_matrix_view new_vectors, \ + std::optional> new_indices) \ + { \ + cuvs::neighbors::mg::detail::extend(handle, index, new_vectors, new_indices); \ + } \ + \ + void search(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const mg::search_params& search_params, \ + raft::host_matrix_view queries, \ + raft::host_matrix_view neighbors, \ + raft::host_matrix_view distances, \ + int64_t n_rows_per_batch) \ + { \ + cuvs::neighbors::mg::detail::search( \ + handle, \ + index, \ + static_cast(&search_params), \ + queries, \ + neighbors, \ + distances, \ + n_rows_per_batch); \ + } \ + \ + void serialize(const raft::device_resources& handle, \ + const index, T, IdxT>& index, \ + const std::string& filename) \ + { \ + cuvs::neighbors::mg::detail::serialize(handle, index, filename); \ + } \ + \ + template <> \ + index, T, IdxT> deserialize_pq( \ + const raft::device_resources& handle, const std::string& filename) \ + { \ + auto idx = index, T, IdxT>(handle, filename); \ + return idx; \ + } \ + \ + template <> \ + index, T, IdxT> distribute_pq(const raft::device_resources& handle, \ + const std::string& filename) \ + { \ + const raft::comms::nccl_clique& clique = raft::resource::get_nccl_clique(handle); \ + auto idx = index, T, IdxT>(REPLICATED, clique.num_ranks_); \ + cuvs::neighbors::mg::detail::deserialize_and_distribute(handle, idx, filename); \ + return idx; \ + } +CUVS_INST_MG_PQ(uint8_t, int64_t); + +#undef CUVS_INST_MG_PQ + +} // namespace cuvs::neighbors::mg diff --git a/cpp/src/neighbors/mg/nccl_comm.cpp b/cpp/src/neighbors/mg/nccl_comm.cpp new file mode 100644 index 000000000..c4556957a --- /dev/null +++ b/cpp/src/neighbors/mg/nccl_comm.cpp @@ -0,0 +1,8 @@ +#include +#include + +namespace raft::comms { +void build_comms_nccl_only(raft::resources* handle, ncclComm_t nccl_comm, int num_ranks, int rank) +{ +} +} // namespace raft::comms diff --git a/cpp/src/neighbors/mg/omp_checks.cpp b/cpp/src/neighbors/mg/omp_checks.cpp new file mode 100644 index 000000000..e09182dfe --- /dev/null +++ b/cpp/src/neighbors/mg/omp_checks.cpp @@ -0,0 +1,34 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include +#include + +namespace cuvs::neighbors::mg { +using raft::RAFT_NAME; + +void check_omp_threads(const int requirements) +{ + const int max_threads = omp_get_max_threads(); + if (max_threads < requirements) + RAFT_LOG_WARN( + "OpenMP is only allowed %d threads to run %d GPUs. Please increase the number of OpenMP " + "threads to avoid NCCL hangs by modifying the environment variable OMP_NUM_THREADS.", + max_threads, + requirements); +} + +} // namespace cuvs::neighbors::mg diff --git a/cpp/test/CMakeLists.txt b/cpp/test/CMakeLists.txt index 58cfc3862..f4d35e438 100644 --- a/cpp/test/CMakeLists.txt +++ b/cpp/test/CMakeLists.txt @@ -22,7 +22,7 @@ rapids_test_init() function(ConfigureTest) set(options OPTIONAL NOCUDA C_LIB) - set(oneValueArgs NAME GPUS PERCENT) + set(oneValueArgs NAME GPUS PERCENT ADDITIONAL_DEP) set(multiValueArgs PATH TARGETS CONFIGURATIONS) cmake_parse_arguments(_CUVS_TEST "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) @@ -56,6 +56,7 @@ function(ConfigureTest) $ $ $<$:cuvs::c_api> + ${_CUVS_TEST_ADDITIONAL_DEP} ) set_target_properties( ${TEST_NAME} @@ -159,7 +160,7 @@ if(BUILD_TESTS) 100 ) - ConfigureTest( + ConfigureTest( NAME NEIGHBORS_ANN_VAMANA_TEST PATH @@ -174,6 +175,14 @@ if(BUILD_TESTS) if(BUILD_CAGRA_HNSWLIB) ConfigureTest(NAME NEIGHBORS_HNSW_TEST PATH neighbors/hnsw.cu GPUS 1 PERCENT 100) + target_link_libraries(NEIGHBORS_HNSW_TEST PRIVATE hnswlib::hnswlib) + target_compile_definitions(NEIGHBORS_HNSW_TEST PUBLIC CUVS_BUILD_CAGRA_HNSWLIB) + endif() + + if(BUILD_MG_ALGOS) + ConfigureTest( + NAME NEIGHBORS_MG_TEST PATH neighbors/mg/test_float.cu GPUS 1 PERCENT 100 ADDITIONAL_DEP nccl + ) endif() ConfigureTest( @@ -227,6 +236,8 @@ if(BUILD_C_TESTS) if(BUILD_CAGRA_HNSWLIB) ConfigureTest(NAME HNSW_C_TEST PATH neighbors/ann_hnsw_c.cu C_LIB) + target_link_libraries(NEIGHBORS_HNSW_TEST PRIVATE hnswlib::hnswlib) + target_compile_definitions(NEIGHBORS_HNSW_TEST PUBLIC CUVS_BUILD_CAGRA_HNSWLIB) endif() endif() diff --git a/cpp/test/neighbors/ann_ivf_pq.cuh b/cpp/test/neighbors/ann_ivf_pq.cuh index f02568b74..fd4e330db 100644 --- a/cpp/test/neighbors/ann_ivf_pq.cuh +++ b/cpp/test/neighbors/ann_ivf_pq.cuh @@ -282,6 +282,8 @@ class ivf_pq_test : public ::testing::TestWithParam { uint32_t n_take, uint32_t n_skip) { + // the original data cannot be reconstructed since the dataset was normalized + if (index.metric() == cuvs::distance::DistanceType::CosineExpanded) { return; } auto& rec_list = index.lists()[label]; auto dim = index.dim(); n_take = std::min(n_take, rec_list->size.load()); @@ -313,6 +315,7 @@ class ivf_pq_test : public ::testing::TestWithParam { auto old_list = index->lists()[label]; auto n_rows = old_list->size.load(); if (n_rows == 0) { return; } + if (index->metric() == cuvs::distance::DistanceType::CosineExpanded) { return; } auto vectors_1 = raft::make_device_matrix(handle_, n_rows, index->dim()); auto indices = raft::make_device_vector(handle_, n_rows); @@ -374,7 +377,7 @@ class ivf_pq_test : public ::testing::TestWithParam { cuvs::Compare{})); // Pack a few vectors back to the list. 
- int row_offset = 9; + int row_offset = 5; int n_vec = 3; ASSERT_TRUE(row_offset + n_vec < n_rows); size_t offset = row_offset * index->pq_dim(); @@ -884,6 +887,25 @@ inline auto enum_variety_l2sqrt() -> test_cases_t }); } +inline auto enum_variety_cosine() -> test_cases_t +{ + return map(enum_variety(), [](const ivf_pq_inputs& x) { + ivf_pq_inputs y(x); + if (y.min_recall.has_value()) { + if (y.search_params.lut_dtype == CUDA_R_8U) { + // TODO: Increase this recall threshold for 8 bit lut + // (https://github.com/rapidsai/cuvs/issues/390) + y.min_recall = y.min_recall.value() * 0.70; + } else { + // In other cases it seems to perform a little bit better, still worse than L2 + y.min_recall = y.min_recall.value() * 0.94; + } + } + y.index_params.metric = distance::DistanceType::CosineExpanded; + return y; + }); +} + /** * Try different number of n_probes, some of which may trigger the non-fused version of the search * kernel. diff --git a/cpp/test/neighbors/ann_ivf_pq/test_float_int64_t.cu b/cpp/test/neighbors/ann_ivf_pq/test_float_int64_t.cu index cdc6c1b7e..834fdb3d0 100644 --- a/cpp/test/neighbors/ann_ivf_pq/test_float_int64_t.cu +++ b/cpp/test/neighbors/ann_ivf_pq/test_float_int64_t.cu @@ -25,9 +25,13 @@ TEST_BUILD_HOST_INPUT_SEARCH(f32_f32_i64) TEST_BUILD_HOST_INPUT_OVERLAP_SEARCH(f32_f32_i64) TEST_BUILD_EXTEND_SEARCH(f32_f32_i64) TEST_BUILD_SERIALIZE_SEARCH(f32_f32_i64) -INSTANTIATE(f32_f32_i64, defaults() + small_dims() + big_dims_moderate_lut()); +INSTANTIATE(f32_f32_i64, + defaults() + small_dims() + big_dims_moderate_lut() + enum_variety_l2() + + enum_variety_l2sqrt() + enum_variety_ip() + enum_variety_cosine()); TEST_BUILD_SEARCH(f32_f32_i64_filter) -INSTANTIATE(f32_f32_i64_filter, defaults() + small_dims() + big_dims_moderate_lut()); +INSTANTIATE(f32_f32_i64_filter, + defaults() + small_dims() + big_dims_moderate_lut() + enum_variety_l2() + + enum_variety_l2sqrt() + enum_variety_ip() + enum_variety_cosine()); } // namespace cuvs::neighbors::ivf_pq diff --git a/cpp/test/neighbors/ann_ivf_pq/test_int8_t_int64_t.cu b/cpp/test/neighbors/ann_ivf_pq/test_int8_t_int64_t.cu index 80b0e2ccb..c9e5d4f01 100644 --- a/cpp/test/neighbors/ann_ivf_pq/test_int8_t_int64_t.cu +++ b/cpp/test/neighbors/ann_ivf_pq/test_int8_t_int64_t.cu @@ -25,8 +25,9 @@ TEST_BUILD_SEARCH(f32_i08_i64) TEST_BUILD_HOST_INPUT_SEARCH(f32_i08_i64) TEST_BUILD_HOST_INPUT_OVERLAP_SEARCH(f32_i08_i64) TEST_BUILD_SERIALIZE_SEARCH(f32_i08_i64) -INSTANTIATE(f32_i08_i64, defaults() + big_dims() + var_k()); +INSTANTIATE(f32_i08_i64, defaults() + big_dims() + var_k() + enum_variety_l2() + enum_variety_ip()); TEST_BUILD_SEARCH(f32_i08_i64_filter) -INSTANTIATE(f32_i08_i64_filter, defaults() + big_dims() + var_k()); +INSTANTIATE(f32_i08_i64_filter, + defaults() + big_dims() + var_k() + enum_variety_l2() + enum_variety_ip()); } // namespace cuvs::neighbors::ivf_pq diff --git a/cpp/test/neighbors/ann_ivf_pq/test_uint8_t_int64_t.cu b/cpp/test/neighbors/ann_ivf_pq/test_uint8_t_int64_t.cu index 0216a1e80..6e0732227 100644 --- a/cpp/test/neighbors/ann_ivf_pq/test_uint8_t_int64_t.cu +++ b/cpp/test/neighbors/ann_ivf_pq/test_uint8_t_int64_t.cu @@ -25,8 +25,12 @@ TEST_BUILD_SEARCH(f32_u08_i64) TEST_BUILD_HOST_INPUT_SEARCH(f32_u08_i64) TEST_BUILD_HOST_INPUT_OVERLAP_SEARCH(f32_u08_i64) TEST_BUILD_EXTEND_SEARCH(f32_u08_i64) -INSTANTIATE(f32_u08_i64, small_dims_per_cluster() + enum_variety()); +INSTANTIATE(f32_u08_i64, + small_dims_per_cluster() + enum_variety() + enum_variety_l2() + enum_variety_l2sqrt() + + enum_variety_ip()); 
TEST_BUILD_SEARCH(f32_u08_i64_filter) -INSTANTIATE(f32_u08_i64_filter, small_dims_per_cluster() + enum_variety()); +INSTANTIATE(f32_u08_i64_filter, + small_dims_per_cluster() + enum_variety() + enum_variety_l2() + enum_variety_l2sqrt() + + enum_variety_ip()); } // namespace cuvs::neighbors::ivf_pq diff --git a/cpp/test/neighbors/mg.cuh b/cpp/test/neighbors/mg.cuh new file mode 100644 index 000000000..be30ca615 --- /dev/null +++ b/cpp/test/neighbors/mg.cuh @@ -0,0 +1,825 @@ +/* + * Copyright (c) 2023, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#pragma once + +#include "../test_utils.cuh" +#include "ann_utils.cuh" +#include "naive_knn.cuh" + +#include +#include + +namespace cuvs::neighbors::mg { + +enum class algo_t { IVF_FLAT, IVF_PQ, CAGRA }; +enum class d_mode_t { REPLICATED, SHARDED, LOCAL_THEN_DISTRIBUTED, ROUND_ROBIN }; +enum class m_mode_t { MERGE_ON_ROOT_RANK, TREE_MERGE, UNDEFINED }; + +struct AnnMGInputs { + int64_t num_queries; + int64_t num_db_vecs; + int64_t dim; + int64_t k; + d_mode_t d_mode; + m_mode_t m_mode; + algo_t algo; + int64_t nprobe; + int64_t nlist; + cuvs::distance::DistanceType metric; + bool adaptive_centers; +}; + +template +class AnnMGTest : public ::testing::TestWithParam { + public: + AnnMGTest() + : stream_(resource::get_cuda_stream(handle_)), + clique_(raft::resource::get_nccl_clique(handle_)), + ps(::testing::TestWithParam::GetParam()), + d_index_dataset(0, stream_), + d_queries(0, stream_), + h_index_dataset(0), + h_queries(0) + { + } + + void testAnnMG() + { + size_t queries_size = ps.num_queries * ps.k; + std::vector neighbors_ref(queries_size); + std::vector distances_ref(queries_size); + std::vector neighbors_snmg_ann(queries_size); + std::vector distances_snmg_ann(queries_size); + std::vector neighbors_ref_32bits(queries_size); + std::vector neighbors_snmg_ann_32bits(queries_size); + + { + rmm::device_uvector distances_ref_dev(queries_size, stream_); + rmm::device_uvector neighbors_ref_dev(queries_size, stream_); + cuvs::neighbors::naive_knn(handle_, + distances_ref_dev.data(), + neighbors_ref_dev.data(), + d_queries.data(), + d_index_dataset.data(), + ps.num_queries, + ps.num_db_vecs, + ps.dim, + ps.k, + ps.metric); + update_host(distances_ref.data(), distances_ref_dev.data(), queries_size, stream_); + update_host(neighbors_ref.data(), neighbors_ref_dev.data(), queries_size, stream_); + resource::sync_stream(handle_); + } + + int64_t n_rows_per_search_batch = 3000; // [3000, 3000, 1000] == 7000 rows + + // IVF-Flat + if (ps.algo == algo_t::IVF_FLAT && + (ps.d_mode == d_mode_t::REPLICATED || ps.d_mode == d_mode_t::SHARDED)) { + distribution_mode d_mode; + if (ps.d_mode == d_mode_t::REPLICATED) + d_mode = distribution_mode::REPLICATED; + else + d_mode = distribution_mode::SHARDED; + + mg::index_params index_params; + index_params.n_lists = ps.nlist; + index_params.metric = ps.metric; + index_params.adaptive_centers = ps.adaptive_centers; + index_params.add_data_on_build = false; + 
index_params.kmeans_trainset_fraction = 1.0; + index_params.metric_arg = 0; + index_params.mode = d_mode; + + mg::search_params search_params; + search_params.n_probes = ps.nprobe; + search_params.search_mode = LOAD_BALANCER; + + auto index_dataset = raft::make_host_matrix_view( + h_index_dataset.data(), ps.num_db_vecs, ps.dim); + auto queries = raft::make_host_matrix_view( + h_queries.data(), ps.num_queries, ps.dim); + auto neighbors = raft::make_host_matrix_view( + neighbors_snmg_ann.data(), ps.num_queries, ps.k); + auto distances = raft::make_host_matrix_view( + distances_snmg_ann.data(), ps.num_queries, ps.k); + + { + auto index = cuvs::neighbors::mg::build(handle_, index_params, index_dataset); + cuvs::neighbors::mg::extend(handle_, index, index_dataset, std::nullopt); + cuvs::neighbors::mg::serialize(handle_, index, "mg_ivf_flat_index"); + } + auto new_index = + cuvs::neighbors::mg::deserialize_flat(handle_, "mg_ivf_flat_index"); + + if (ps.m_mode == m_mode_t::MERGE_ON_ROOT_RANK) + search_params.merge_mode = MERGE_ON_ROOT_RANK; + else + search_params.merge_mode = TREE_MERGE; + cuvs::neighbors::mg::search( + handle_, new_index, search_params, queries, neighbors, distances, n_rows_per_search_batch); + resource::sync_stream(handle_); + + double min_recall = static_cast(ps.nprobe) / static_cast(ps.nlist); + ASSERT_TRUE(eval_neighbours(neighbors_ref, + neighbors_snmg_ann, + distances_ref, + distances_snmg_ann, + ps.num_queries, + ps.k, + 0.001, + min_recall)); + std::fill(neighbors_snmg_ann.begin(), neighbors_snmg_ann.end(), 0); + std::fill(distances_snmg_ann.begin(), distances_snmg_ann.end(), 0); + } + + // IVF-PQ + if (ps.algo == algo_t::IVF_PQ && + (ps.d_mode == d_mode_t::REPLICATED || ps.d_mode == d_mode_t::SHARDED)) { + distribution_mode d_mode; + if (ps.d_mode == d_mode_t::REPLICATED) + d_mode = distribution_mode::REPLICATED; + else + d_mode = distribution_mode::SHARDED; + + mg::index_params index_params; + index_params.n_lists = ps.nlist; + index_params.metric = ps.metric; + index_params.add_data_on_build = false; + index_params.kmeans_trainset_fraction = 1.0; + index_params.metric_arg = 0; + index_params.mode = d_mode; + + mg::search_params search_params; + search_params.n_probes = ps.nprobe; + search_params.search_mode = LOAD_BALANCER; + + auto index_dataset = raft::make_host_matrix_view( + h_index_dataset.data(), ps.num_db_vecs, ps.dim); + auto queries = raft::make_host_matrix_view( + h_queries.data(), ps.num_queries, ps.dim); + auto neighbors = raft::make_host_matrix_view( + neighbors_snmg_ann.data(), ps.num_queries, ps.k); + auto distances = raft::make_host_matrix_view( + distances_snmg_ann.data(), ps.num_queries, ps.k); + + { + auto index = cuvs::neighbors::mg::build(handle_, index_params, index_dataset); + cuvs::neighbors::mg::extend(handle_, index, index_dataset, std::nullopt); + cuvs::neighbors::mg::serialize(handle_, index, "mg_ivf_pq_index"); + } + auto new_index = + cuvs::neighbors::mg::deserialize_pq(handle_, "mg_ivf_pq_index"); + + if (ps.m_mode == m_mode_t::MERGE_ON_ROOT_RANK) + search_params.merge_mode = MERGE_ON_ROOT_RANK; + else + search_params.merge_mode = TREE_MERGE; + cuvs::neighbors::mg::search( + handle_, new_index, search_params, queries, neighbors, distances, n_rows_per_search_batch); + resource::sync_stream(handle_); + + double min_recall = static_cast(ps.nprobe) / static_cast(ps.nlist); + ASSERT_TRUE(eval_neighbours(neighbors_ref, + neighbors_snmg_ann, + distances_ref, + distances_snmg_ann, + ps.num_queries, + ps.k, + 0.001, + min_recall)); + 
std::fill(neighbors_snmg_ann.begin(), neighbors_snmg_ann.end(), 0); + std::fill(distances_snmg_ann.begin(), distances_snmg_ann.end(), 0); + } + + // CAGRA + if (ps.algo == algo_t::CAGRA && + (ps.d_mode == d_mode_t::REPLICATED || ps.d_mode == d_mode_t::SHARDED)) { + distribution_mode d_mode; + if (ps.d_mode == d_mode_t::REPLICATED) + d_mode = distribution_mode::REPLICATED; + else + d_mode = distribution_mode::SHARDED; + + mg::index_params index_params; + index_params.graph_build_params = cagra::graph_build_params::ivf_pq_params( + raft::matrix_extent(ps.num_db_vecs, ps.dim)); + index_params.mode = d_mode; + + mg::search_params search_params; + + auto index_dataset = raft::make_host_matrix_view( + h_index_dataset.data(), ps.num_db_vecs, ps.dim); + auto queries = raft::make_host_matrix_view( + h_queries.data(), ps.num_queries, ps.dim); + auto neighbors = raft::make_host_matrix_view( + neighbors_snmg_ann_32bits.data(), ps.num_queries, ps.k); + auto distances = raft::make_host_matrix_view( + distances_snmg_ann.data(), ps.num_queries, ps.k); + + { + auto index = cuvs::neighbors::mg::build(handle_, index_params, index_dataset); + cuvs::neighbors::mg::serialize(handle_, index, "mg_cagra_index"); + } + auto new_index = + cuvs::neighbors::mg::deserialize_cagra(handle_, "mg_cagra_index"); + + if (ps.m_mode == m_mode_t::MERGE_ON_ROOT_RANK) + search_params.merge_mode = MERGE_ON_ROOT_RANK; + else + search_params.merge_mode = TREE_MERGE; + cuvs::neighbors::mg::search( + handle_, new_index, search_params, queries, neighbors, distances, n_rows_per_search_batch); + resource::sync_stream(handle_); + + double min_recall = static_cast(ps.nprobe) / static_cast(ps.nlist); + ASSERT_TRUE(eval_neighbours(neighbors_ref_32bits, + neighbors_snmg_ann_32bits, + distances_ref, + distances_snmg_ann, + ps.num_queries, + ps.k, + 0.001, + min_recall)); + std::fill(neighbors_snmg_ann_32bits.begin(), neighbors_snmg_ann_32bits.end(), 0); + std::fill(distances_snmg_ann.begin(), distances_snmg_ann.end(), 0); + } + + if (ps.algo == algo_t::IVF_FLAT && ps.d_mode == d_mode_t::LOCAL_THEN_DISTRIBUTED) { + ivf_flat::index_params index_params; + index_params.n_lists = ps.nlist; + index_params.metric = ps.metric; + index_params.adaptive_centers = ps.adaptive_centers; + index_params.add_data_on_build = true; + index_params.kmeans_trainset_fraction = 1.0; + index_params.metric_arg = 0; + + mg::search_params search_params; + search_params.n_probes = ps.nprobe; + search_params.search_mode = LOAD_BALANCER; + + { + auto index_dataset = raft::make_device_matrix_view( + d_index_dataset.data(), ps.num_db_vecs, ps.dim); + auto index = cuvs::neighbors::ivf_flat::build(handle_, index_params, index_dataset); + ivf_flat::serialize(handle_, "local_ivf_flat_index", index); + } + + auto queries = raft::make_host_matrix_view( + h_queries.data(), ps.num_queries, ps.dim); + auto neighbors = raft::make_host_matrix_view( + neighbors_snmg_ann.data(), ps.num_queries, ps.k); + auto distances = raft::make_host_matrix_view( + distances_snmg_ann.data(), ps.num_queries, ps.k); + + auto distributed_index = + cuvs::neighbors::mg::distribute_flat(handle_, "local_ivf_flat_index"); + search_params.merge_mode = TREE_MERGE; + cuvs::neighbors::mg::search(handle_, + distributed_index, + search_params, + queries, + neighbors, + distances, + n_rows_per_search_batch); + + resource::sync_stream(handle_); + + double min_recall = static_cast(ps.nprobe) / static_cast(ps.nlist); + ASSERT_TRUE(eval_neighbours(neighbors_ref, + neighbors_snmg_ann, + distances_ref, + distances_snmg_ann, 
+ ps.num_queries, + ps.k, + 0.001, + min_recall)); + std::fill(neighbors_snmg_ann.begin(), neighbors_snmg_ann.end(), 0); + std::fill(distances_snmg_ann.begin(), distances_snmg_ann.end(), 0); + } + + if (ps.algo == algo_t::IVF_PQ && ps.d_mode == d_mode_t::LOCAL_THEN_DISTRIBUTED) { + ivf_pq::index_params index_params; + index_params.n_lists = ps.nlist; + index_params.metric = ps.metric; + index_params.add_data_on_build = true; + index_params.kmeans_trainset_fraction = 1.0; + index_params.metric_arg = 0; + + mg::search_params search_params; + search_params.n_probes = ps.nprobe; + search_params.search_mode = LOAD_BALANCER; + + { + auto index_dataset = raft::make_device_matrix_view( + d_index_dataset.data(), ps.num_db_vecs, ps.dim); + auto index = cuvs::neighbors::ivf_pq::build(handle_, index_params, index_dataset); + ivf_pq::serialize(handle_, "local_ivf_pq_index", index); + } + + auto queries = raft::make_host_matrix_view( + h_queries.data(), ps.num_queries, ps.dim); + auto neighbors = raft::make_host_matrix_view( + neighbors_snmg_ann.data(), ps.num_queries, ps.k); + auto distances = raft::make_host_matrix_view( + distances_snmg_ann.data(), ps.num_queries, ps.k); + + auto distributed_index = + cuvs::neighbors::mg::distribute_pq(handle_, "local_ivf_pq_index"); + search_params.merge_mode = TREE_MERGE; + cuvs::neighbors::mg::search(handle_, + distributed_index, + search_params, + queries, + neighbors, + distances, + n_rows_per_search_batch); + + resource::sync_stream(handle_); + + double min_recall = static_cast(ps.nprobe) / static_cast(ps.nlist); + ASSERT_TRUE(eval_neighbours(neighbors_ref, + neighbors_snmg_ann, + distances_ref, + distances_snmg_ann, + ps.num_queries, + ps.k, + 0.001, + min_recall)); + std::fill(neighbors_snmg_ann.begin(), neighbors_snmg_ann.end(), 0); + std::fill(distances_snmg_ann.begin(), distances_snmg_ann.end(), 0); + } + + if (ps.algo == algo_t::CAGRA && ps.d_mode == d_mode_t::LOCAL_THEN_DISTRIBUTED) { + cagra::index_params index_params; + index_params.graph_build_params = cagra::graph_build_params::ivf_pq_params( + raft::matrix_extent(ps.num_db_vecs, ps.dim)); + + mg::search_params search_params; + + { + auto index_dataset = raft::make_device_matrix_view( + d_index_dataset.data(), ps.num_db_vecs, ps.dim); + auto index = cuvs::neighbors::cagra::build(handle_, index_params, index_dataset); + cuvs::neighbors::cagra::serialize(handle_, "local_cagra_index", index); + } + + auto queries = raft::make_host_matrix_view( + h_queries.data(), ps.num_queries, ps.dim); + auto neighbors = raft::make_host_matrix_view( + neighbors_snmg_ann_32bits.data(), ps.num_queries, ps.k); + auto distances = raft::make_host_matrix_view( + distances_snmg_ann.data(), ps.num_queries, ps.k); + + auto distributed_index = + cuvs::neighbors::mg::distribute_cagra(handle_, "local_cagra_index"); + + search_params.merge_mode = TREE_MERGE; + cuvs::neighbors::mg::search(handle_, + distributed_index, + search_params, + queries, + neighbors, + distances, + n_rows_per_search_batch); + + resource::sync_stream(handle_); + + double min_recall = static_cast(ps.nprobe) / static_cast(ps.nlist); + ASSERT_TRUE(eval_neighbours(neighbors_ref_32bits, + neighbors_snmg_ann_32bits, + distances_ref, + distances_snmg_ann, + ps.num_queries, + ps.k, + 0.001, + min_recall)); + std::fill(neighbors_snmg_ann_32bits.begin(), neighbors_snmg_ann_32bits.end(), 0); + std::fill(distances_snmg_ann.begin(), distances_snmg_ann.end(), 0); + } + + if (ps.algo == algo_t::IVF_FLAT && ps.d_mode == d_mode_t::ROUND_ROBIN) { + ASSERT_TRUE(ps.num_queries 
<= 4); + + mg::index_params index_params; + index_params.n_lists = ps.nlist; + index_params.metric = ps.metric; + index_params.adaptive_centers = ps.adaptive_centers; + index_params.add_data_on_build = false; + index_params.kmeans_trainset_fraction = 1.0; + index_params.metric_arg = 0; + index_params.mode = REPLICATED; + + mg::search_params search_params; + search_params.n_probes = ps.nprobe; + search_params.search_mode = ROUND_ROBIN; + + auto index_dataset = raft::make_host_matrix_view( + h_index_dataset.data(), ps.num_db_vecs, ps.dim); + auto small_batch_query = raft::make_host_matrix_view( + h_queries.data(), ps.num_queries, ps.dim); + + auto index = cuvs::neighbors::mg::build(handle_, index_params, index_dataset); + cuvs::neighbors::mg::extend(handle_, index, index_dataset, std::nullopt); + + int n_parallel_searches = 16; + std::vector searches_correctness(n_parallel_searches); + std::vector load_balancer_neighbors_snmg_ann(n_parallel_searches * ps.num_queries * + ps.k); + std::vector load_balancer_distances_snmg_ann(n_parallel_searches * ps.num_queries * + ps.k); +#pragma omp parallel for + for (uint64_t search_idx = 0; search_idx < searches_correctness.size(); search_idx++) { + uint64_t offset = search_idx * ps.num_queries * ps.k; + auto small_batch_neighbors = raft::make_host_matrix_view( + load_balancer_neighbors_snmg_ann.data() + offset, ps.num_queries, ps.k); + auto small_batch_distances = raft::make_host_matrix_view( + load_balancer_distances_snmg_ann.data() + offset, ps.num_queries, ps.k); + cuvs::neighbors::mg::search(handle_, + index, + search_params, + small_batch_query, + small_batch_neighbors, + small_batch_distances, + n_rows_per_search_batch); + + std::vector small_batch_neighbors_vec( + small_batch_neighbors.data_handle(), + small_batch_neighbors.data_handle() + small_batch_neighbors.size()); + std::vector small_batch_distances_vec( + small_batch_distances.data_handle(), + small_batch_distances.data_handle() + small_batch_distances.size()); + searches_correctness[search_idx] = eval_neighbours(neighbors_ref, + small_batch_neighbors_vec, + distances_ref, + small_batch_distances_vec, + ps.num_queries, + ps.k, + 0.001, + 0.9); + } + ASSERT_TRUE(std::all_of(searches_correctness.begin(), + searches_correctness.end(), + [](char val) { return val != 0; })); + } + + if (ps.algo == algo_t::IVF_PQ && ps.d_mode == d_mode_t::ROUND_ROBIN) { + ASSERT_TRUE(ps.num_queries <= 4); + + mg::index_params index_params; + index_params.n_lists = ps.nlist; + index_params.metric = ps.metric; + index_params.add_data_on_build = false; + index_params.kmeans_trainset_fraction = 1.0; + index_params.metric_arg = 0; + index_params.mode = REPLICATED; + + mg::search_params search_params; + search_params.n_probes = ps.nprobe; + search_params.search_mode = ROUND_ROBIN; + + auto index_dataset = raft::make_host_matrix_view( + h_index_dataset.data(), ps.num_db_vecs, ps.dim); + auto small_batch_query = raft::make_host_matrix_view( + h_queries.data(), ps.num_queries, ps.dim); + + auto index = cuvs::neighbors::mg::build(handle_, index_params, index_dataset); + cuvs::neighbors::mg::extend(handle_, index, index_dataset, std::nullopt); + + int n_parallel_searches = 16; + std::vector searches_correctness(n_parallel_searches); + std::vector load_balancer_neighbors_snmg_ann(n_parallel_searches * ps.num_queries * + ps.k); + std::vector load_balancer_distances_snmg_ann(n_parallel_searches * ps.num_queries * + ps.k); +#pragma omp parallel for + for (uint64_t search_idx = 0; search_idx < searches_correctness.size(); 
search_idx++) { + uint64_t offset = search_idx * ps.num_queries * ps.k; + auto small_batch_neighbors = raft::make_host_matrix_view( + load_balancer_neighbors_snmg_ann.data() + offset, ps.num_queries, ps.k); + auto small_batch_distances = raft::make_host_matrix_view( + load_balancer_distances_snmg_ann.data() + offset, ps.num_queries, ps.k); + cuvs::neighbors::mg::search(handle_, + index, + search_params, + small_batch_query, + small_batch_neighbors, + small_batch_distances, + n_rows_per_search_batch); + + std::vector small_batch_neighbors_vec( + small_batch_neighbors.data_handle(), + small_batch_neighbors.data_handle() + small_batch_neighbors.size()); + std::vector small_batch_distances_vec( + small_batch_distances.data_handle(), + small_batch_distances.data_handle() + small_batch_distances.size()); + searches_correctness[search_idx] = eval_neighbours(neighbors_ref, + small_batch_neighbors_vec, + distances_ref, + small_batch_distances_vec, + ps.num_queries, + ps.k, + 0.001, + 0.9); + } + ASSERT_TRUE(std::all_of(searches_correctness.begin(), + searches_correctness.end(), + [](char val) { return val != 0; })); + } + + if (ps.algo == algo_t::CAGRA && ps.d_mode == d_mode_t::ROUND_ROBIN) { + ASSERT_TRUE(ps.num_queries <= 4); + + mg::index_params index_params; + index_params.graph_build_params = cagra::graph_build_params::ivf_pq_params( + raft::matrix_extent(ps.num_db_vecs, ps.dim)); + index_params.mode = REPLICATED; + + mg::search_params search_params; + search_params.search_mode = ROUND_ROBIN; + + auto index_dataset = raft::make_host_matrix_view( + h_index_dataset.data(), ps.num_db_vecs, ps.dim); + auto small_batch_query = raft::make_host_matrix_view( + h_queries.data(), ps.num_queries, ps.dim); + + auto index = cuvs::neighbors::mg::build(handle_, index_params, index_dataset); + + int n_parallel_searches = 16; + std::vector searches_correctness(n_parallel_searches); + std::vector load_balancer_neighbors_snmg_ann(n_parallel_searches * ps.num_queries * + ps.k); + std::vector load_balancer_distances_snmg_ann(n_parallel_searches * ps.num_queries * + ps.k); +#pragma omp parallel for + for (uint64_t search_idx = 0; search_idx < searches_correctness.size(); search_idx++) { + uint64_t offset = search_idx * ps.num_queries * ps.k; + auto small_batch_neighbors = raft::make_host_matrix_view( + load_balancer_neighbors_snmg_ann.data() + offset, ps.num_queries, ps.k); + auto small_batch_distances = raft::make_host_matrix_view( + load_balancer_distances_snmg_ann.data() + offset, ps.num_queries, ps.k); + cuvs::neighbors::mg::search(handle_, + index, + search_params, + small_batch_query, + small_batch_neighbors, + small_batch_distances, + n_rows_per_search_batch); + + std::vector small_batch_neighbors_vec( + small_batch_neighbors.data_handle(), + small_batch_neighbors.data_handle() + small_batch_neighbors.size()); + std::vector small_batch_distances_vec( + small_batch_distances.data_handle(), + small_batch_distances.data_handle() + small_batch_distances.size()); + searches_correctness[search_idx] = eval_neighbours(neighbors_ref_32bits, + small_batch_neighbors_vec, + distances_ref, + small_batch_distances_vec, + ps.num_queries, + ps.k, + 0.001, + 0.9); + } + ASSERT_TRUE(std::all_of(searches_correctness.begin(), + searches_correctness.end(), + [](char val) { return val != 0; })); + } + } + + void SetUp() override + { + d_index_dataset.resize(ps.num_db_vecs * ps.dim, stream_); + d_queries.resize(ps.num_queries * ps.dim, stream_); + h_index_dataset.resize(ps.num_db_vecs * ps.dim); + h_queries.resize(ps.num_queries * 
ps.dim); + + raft::random::RngState r(1234ULL); + if constexpr (std::is_same{}) { + raft::random::uniform( + handle_, r, d_index_dataset.data(), d_index_dataset.size(), DataT(0.1), DataT(2.0)); + raft::random::uniform(handle_, r, d_queries.data(), d_queries.size(), DataT(0.1), DataT(2.0)); + } else { + raft::random::uniformInt( + handle_, r, d_index_dataset.data(), d_index_dataset.size(), DataT(1), DataT(20)); + raft::random::uniformInt(handle_, r, d_queries.data(), d_queries.size(), DataT(1), DataT(20)); + } + + raft::copy(h_index_dataset.data(), + d_index_dataset.data(), + d_index_dataset.size(), + resource::get_cuda_stream(handle_)); + raft::copy( + h_queries.data(), d_queries.data(), d_queries.size(), resource::get_cuda_stream(handle_)); + resource::sync_stream(handle_); + } + + void TearDown() override {} + + private: + raft::device_resources handle_; + rmm::cuda_stream_view stream_; + raft::comms::nccl_clique clique_; + AnnMGInputs ps; + std::vector h_index_dataset; + std::vector h_queries; + rmm::device_uvector d_index_dataset; + rmm::device_uvector d_queries; +}; + +const std::vector inputs = { + {7000, + 10000, + 8, + 16, + d_mode_t::REPLICATED, + m_mode_t::UNDEFINED, + algo_t::IVF_FLAT, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + {7000, + 10000, + 8, + 16, + d_mode_t::REPLICATED, + m_mode_t::UNDEFINED, + algo_t::IVF_PQ, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + + /* + {7000, + 10000, + 8, + 16, + d_mode_t::REPLICATED, + m_mode_t::UNDEFINED, + algo_t::CAGRA, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + */ + + /* + {7000, + 10000, + 8, + 16, + d_mode_t::SHARDED, + m_mode_t::MERGE_ON_ROOT_RANK, + algo_t::IVF_FLAT, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + {7000, + 10000, + 8, + 16, + d_mode_t::SHARDED, + m_mode_t::MERGE_ON_ROOT_RANK, + algo_t::IVF_PQ, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + {7000, + 10000, + 8, + 16, + d_mode_t::SHARDED, + m_mode_t::MERGE_ON_ROOT_RANK, + algo_t::CAGRA, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + + {7000, + 10000, + 8, + 16, + d_mode_t::SHARDED, + m_mode_t::TREE_MERGE, + algo_t::IVF_FLAT, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + {7000, + 10000, + 8, + 16, + d_mode_t::SHARDED, + m_mode_t::TREE_MERGE, + algo_t::IVF_PQ, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + {7000, + 10000, + 8, + 16, + d_mode_t::SHARDED, + m_mode_t::TREE_MERGE, + algo_t::CAGRA, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + */ + + {7000, + 10000, + 8, + 16, + d_mode_t::LOCAL_THEN_DISTRIBUTED, + m_mode_t::UNDEFINED, + algo_t::IVF_FLAT, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + {7000, + 10000, + 8, + 16, + d_mode_t::LOCAL_THEN_DISTRIBUTED, + m_mode_t::UNDEFINED, + algo_t::IVF_PQ, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + + /* + {7000, + 10000, + 8, + 16, + d_mode_t::LOCAL_THEN_DISTRIBUTED, + m_mode_t::UNDEFINED, + algo_t::CAGRA, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + */ + + {3, + 10000, + 8, + 16, + d_mode_t::ROUND_ROBIN, + m_mode_t::UNDEFINED, + algo_t::IVF_FLAT, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + {3, + 10000, + 8, + 16, + d_mode_t::ROUND_ROBIN, + m_mode_t::UNDEFINED, + algo_t::IVF_PQ, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + + /* + {3, + 10000, + 8, + 16, + d_mode_t::ROUND_ROBIN, + m_mode_t::UNDEFINED, + 
algo_t::CAGRA, + 40, + 1024, + cuvs::distance::DistanceType::L2Expanded, + true}, + */ +}; +} // namespace cuvs::neighbors::mg diff --git a/cpp/test/neighbors/mg/test_float.cu b/cpp/test/neighbors/mg/test_float.cu new file mode 100644 index 000000000..ef9c9a043 --- /dev/null +++ b/cpp/test/neighbors/mg/test_float.cu @@ -0,0 +1,28 @@ +/* + * Copyright (c) 2024, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include + +#include "../mg.cuh" + +namespace cuvs::neighbors::mg { + +typedef AnnMGTest AnnMGTestF_float; +TEST_P(AnnMGTestF_float, AnnMG) { this->testAnnMG(); } + +INSTANTIATE_TEST_CASE_P(AnnMGTest, AnnMGTestF_float, ::testing::ValuesIn(inputs)); + +} // namespace cuvs::neighbors::mg diff --git a/dependencies.yaml b/dependencies.yaml index cc8f5985c..a68a550bb 100644 --- a/dependencies.yaml +++ b/dependencies.yaml @@ -159,6 +159,7 @@ dependencies: packages: - c-compiler - cxx-compiler + - nccl>=2.19 specific: - output_types: conda matrices: diff --git a/docs/source/basics.rst b/docs/source/api_basics.rst similarity index 100% rename from docs/source/basics.rst rename to docs/source/api_basics.rst diff --git a/docs/source/api_docs.rst b/docs/source/api_docs.rst index fe71ff313..f4deef506 100644 --- a/docs/source/api_docs.rst +++ b/docs/source/api_docs.rst @@ -2,10 +2,12 @@ API Reference ============= .. toctree:: - :maxdepth: 1 - :caption: Contents: + :maxdepth: 3 c_api.rst cpp_api.rst python_api.rst rust_api/index.rst + +* :ref:`genindex` +* :ref:`search` \ No newline at end of file diff --git a/docs/source/interoperability.rst b/docs/source/api_interoperability.rst similarity index 100% rename from docs/source/interoperability.rst rename to docs/source/api_interoperability.rst diff --git a/docs/source/build.rst b/docs/source/build.rst index 9c7c98989..a6b8ccd13 100644 --- a/docs/source/build.rst +++ b/docs/source/build.rst @@ -9,8 +9,6 @@ The cuVS software development kit provides APIs for C, C++, Python, and Rust lan * `Python through Pip`_ - * `Rust through crates.io`_ - - `Build from source`_ * `Prerequisites`_ @@ -38,21 +36,21 @@ C, C++, and Python through Conda The easiest way to install the pre-compiled C, C++, and Python packages is through conda. You can get a minimal conda installation with `miniforge `__. -Use the following commands, depending on your CUDA version, to install cuVS packages (replace `rapidsai` with `rapidsai-nightly` to install more up-to-date but less stable nightly packages). `mamba` is preferred over the `conda` command. +Use the following commands, depending on your CUDA version, to install cuVS packages (replace `rapidsai` with `rapidsai-nightly` to install more up-to-date but less stable nightly packages). `mamba` is preferred over the `conda` command and can be enabled using `this guide `_. C/C++ Package ~~~~~~~~~~~~~ .. 
code-block:: bash - mamba install -c rapidsai -c conda-forge -c nvidia libcuvs cuda-version=12.5 + conda install -c rapidsai -c conda-forge -c nvidia libcuvs cuda-version=12.5 Python Package ~~~~~~~~~~~~~~ .. code-block:: bash - mamba install -c rapidsai -c conda-forge -c nvidia cuvs cuda-version=12.5 + conda install -c rapidsai -c conda-forge -c nvidia cuvs cuda-version=12.5 Python through Pip ^^^^^^^^^^^^^^^^^^ @@ -73,9 +71,6 @@ And CUDA 12 packages: Note: these packages statically link the C and C++ libraries so the `libcuvs` and `libcuvs_c` shared libraries won't be readily available to use in your code. -Rust through crates.io -^^^^^^^^^^^^^^^^^^^^^^ - Build from source ----------------- @@ -97,15 +92,15 @@ Conda environment scripts are provided for installing the necessary dependencies .. code-block:: bash - mamba env create --name cuvs -f conda/environments/all_cuda-125_arch-x86_64.yaml - mamba activate cuvs + conda env create --name cuvs -f conda/environments/all_cuda-125_arch-x86_64.yaml + conda activate cuvs The process for building from source with CUDA 11 differs slightly in that your host system will also need to have CUDA toolkit installed which is greater than, or equal to, the version you install into you conda environment. Installing CUDA toolkit into your host system is necessary because `nvcc` is not provided with Conda's cudatoolkit dependencies for CUDA 11. The following example will install create and install dependencies for a CUDA 11.8 conda environment .. code-block:: bash - mamba env create --name cuvs -f conda/environments/all_cuda-118_arch-x86_64.yaml - mamba activate cuvs + conda env create --name cuvs -f conda/environments/all_cuda-118_arch-x86_64.yaml + conda activate cuvs The recommended way to build and install cuVS from source is to use the `build.sh` script in the root of the repository. This script can build both the C++ and Python artifacts and provides CMake options for building and installing the headers, tests, benchmarks, and the pre-compiled shared library. @@ -128,6 +123,16 @@ Once installed, the shared libraries, headers (and any dependencies downloaded a ./build.sh libcuvs --uninstall +Multi-GPU features +^^^^^^^^^^^^^^^^^^ + +To disable the multi-gpu features run : + +.. code-block:: bash + + ./build.sh libcuvs --no-mg + + Building the Googletests ~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/source/c_api/neighbors.rst b/docs/source/c_api/neighbors.rst index dc55a74dc..9c3fce672 100644 --- a/docs/source/c_api/neighbors.rst +++ b/docs/source/c_api/neighbors.rst @@ -13,3 +13,4 @@ Nearest Neighbors neighbors_ivf_flat_c.rst neighbors_ivf_pq_c.rst neighbors_cagra_c.rst + neighbors_hnsw_c.rst diff --git a/docs/source/choosing_and_configuring_indexes.rst b/docs/source/choosing_and_configuring_indexes.rst new file mode 100644 index 000000000..b4c140f29 --- /dev/null +++ b/docs/source/choosing_and_configuring_indexes.rst @@ -0,0 +1,98 @@ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Primer on vector search indexes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Vector search indexes often use approximations to trade-off accuracy of the results for speed, either through lowering latency (end-to-end single query speed) or by increasing throughput (the number of query vectors that can be satisfied in a short period of time). Vector search indexes, especially ones that use approximations, are very closely related to machine learning models but they are optimized for fast search and accuracy of results. 
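+
+As a point of reference for what those approximations trade away, an exact (brute-force, or "flat") search simply computes the distance from every query to every vector in the dataset and keeps the k closest. The sketch below is illustrative only: plain NumPy with made-up sizes, not the cuVS API, but it shows what the exact baseline has to do:
+
+.. code-block:: python
+
+    import numpy as np
+
+    def brute_force_knn(dataset: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
+        # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2; the ||q||^2 term is constant per
+        # query and does not change the ranking, so it is omitted.
+        dists = (dataset ** 2).sum(axis=1) - 2.0 * queries @ dataset.T
+        # Indices of the k smallest distances per query (order within the top k is not guaranteed).
+        return np.argpartition(dists, k, axis=1)[:, :k]
+
+    rng = np.random.default_rng(0)
+    index_vectors = rng.random((100_000, 96), dtype=np.float32)  # hypothetical dataset
+    query_vectors = rng.random((10, 96), dtype=np.float32)       # hypothetical queries
+    print(brute_force_knn(index_vectors, query_vectors, k=10).shape)  # (10, 10)
+
+The index types discussed below aim to avoid, or at least cheapen, this exhaustive computation.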
+ +When the number of vectors is very small, such as less than 100 thousand vectors, it could be fast enough to use a brute-force (also known as a flat index), which returns exact results but at the expense of exhaustively searching all possible neighbors. + +Objectives +========== + +This primer addresses the challenge of configuring vector search indexes, but its primary goal is to get a user up and running quickly with acceptable results for a good choice of index type and a small and manageable tuning knob, rather than providing a comprehensive guide to tuning each and every hyper-parameter. + +For this reason, we focus on 4 primary data sizes: + +#. Tiny datasets where GPU is likely not needed (< 100 thousand vectors) +#. Small datasets where GPU might not be needed (< 1 million vectors) +#. Large datasets (> 1 million vectors), goal is fast index creation at the expense of search quality +#. Large datasets where high quality is preferred at the expense of fast index creation + +Like other machine learning algorithms, vector search indexes generally have a training step – which means building the index – and an inference – or search step. The hyper-parameters also tend to be broken down into build and search parameters. + +While not always the case, a general trend is often observed where the search speed decreases as the quality increases. This also tends to be the case with the index build performance, though different algorithms have different relationships between build time, quality, and search time. It's important to understand that there's no free lunch so there will always be trade-offs for each index type. + +Definition of quality +===================== + +What do we mean when we say quality of an index? In machine learning terminology, we measure this using recall, which is sometimes used interchangeably to mean accuracy, even though the two are slightly different measures. Recall, when used in vector search, essentially means "out of all of my results, which results would have been included in the exact results?" In vector search, the objective is to find some number of vectors that are closest to a given query vector, so recall tends to be more relaxed than accuracy, discriminating only on set inclusion, rather than on exact ordered list matching, which would be closer to an accuracy measure. + +Choosing vector search indexes +============================== + +Many vector search algorithms improve scalability while reducing the number of distance computations by partitioning the vector space into smaller pieces, often through the use of clustering, hashing, trees, and other techniques. Another popular technique is to reduce the width or dimensionality of the space in order to decrease the cost of computing each distance. + +Tiny datasets (< 100 thousand vectors) +-------------------------------------- + +These datasets are very small and it's questionable whether or not the GPU would provide any value at all. If the dimensionality is also relatively small (< 1024), you could just use brute-force or HNSW on the CPU and get great performance. If the dimensionality is relatively large (1536, 2048, 4096), you should consider using HNSW. If build time performance is critical, you should consider using CAGRA to build the graph and convert it to an HNSW graph for search (this capability exists today in the standalone cuVS/RAFT libraries and will soon be added to Milvus). 
An IVF flat index can also be a great candidate here, as it can improve the search performance over brute-force by partitioning the vector space and thus reducing the search space. + +Small datasets where GPU might not be needed (< 1 million vectors) +------------------------------------------------------------------ + +For smaller dimensionality, such as 1024 or below, you could consider using a brute-force (aka flat) index on GPU and get very good search performance with exact results. You could also use a graph-based index like HNSW on the CPU or CAGRA on the GPU. If build time is critical, you could even build a CAGRA graph on the GPU and convert it to HNSW graph on the CPU. + +For larger dimensionality (1536, 2048, 4096), you will start to see lower build-time performance with HNSW for higher quality search settings, and so it becomes more clear that building a CAGRA graph can be useful instead. + +Large datasets (> 1 million vectors), goal is fast index creation at the expense of search quality +-------------------------------------------------------------------------------------------------- + +For fast ingest where slightly lower search quality is acceptable (85% recall and above), the IVF (inverted file index) methods can be very useful, as they can be very fast to build and still have acceptable search performance. IVF-flat index will partition the vectors into some number of clusters (specified by the user as n_lists) and at search time, some number of closest clusters (defined by n_probes) will be searched with brute-force for each query vector. + +IVF-PQ is similar to IVF-flat with the major difference that the vectors are compressed using a lossy product quantized compression so the index can have a much smaller footprint on the GPU. In general, itā€™s advised to set n_lists = sqrt(n_vectors) and set n_probes to some percentage of n_lists (e.g. 1%, 2%, 4%, 8%, 16%). Because IVF-PQ is a lossy compression, a refinement step can be performed by initially increasing the number of neighbors (by some multiple factor) and using the raw vectors to compute the exact distances, ultimately reducing the neighborhoods down to size k. Even a refinement of 2x (which would query initially for k*2) can be quite effective in making up for recall lost by the PQ compression, but it does come at the expense of having to keep the raw vectors around (keeping in mind many databases store the raw vectors anyways). + +Large datasets (> 1 million vectors), goal is high quality search at the expense of fast index creation +------------------------------------------------------------------------------------------------------- + +By trading off index creation performance, an extremely high quality search model can be built. Generally, all of the vector search index types have hyperparameters that have a direct correlation with the search accuracy and so they can be cranked up to yield better recall. Unfortunately, this can also significantly increase the index build time and reduce the search throughput. The trick here is to find the fastest build time that can achieve the best recall with the lowest latency or highest throughput possible. + +As for suggested index types, graph-based algorithms like HNSW and CAGRA tend to scale very well to larger datasets while having superior search performance with respect to quality. The challenge is that graph-based indexes require learning a graph and so, as the subtitle of this section suggests, have a tendency to be slower to build than other options. 
Using the CAGRA algorithm on the GPU can reduce the build time significantly over HNSW, while also having a superior throughput (and lower latency) than searching on the CPU. Currently, the downside to using CAGRA on the GPU is that it requires both the graph and the raw vectors to fit into GPU memory. A middle-ground can be reached by building a CAGRA graph on the GPU and converting it to an HNSW for high quality (and moderately fast) search on the CPU. + + +Tuning and hyperparameter optimization +====================================== + +Unfortunately, for large datasets, doing a hyper-parameter optimization on the whole dataset is not always feasible. It is possible, however, to perform a hyper-parameter optimization on the smaller subsets and find reasonably acceptable parameters that should generalize fairly well to the entire dataset. Generally this hyper-parameter optimization will require computing a ground truth on the subset with an exact method like brute-force and then using it to evaluate several searches on randomly sampled vectors. + +Full hyper-parameter optimization may also not always be necessary- for example, once you have built a ground truth dataset on a subset, many times you can start by building an index with the default build parameters and then playing around with different search parameters until you get the desired quality and search performance. For massive indexes that might be multiple terabytes, you could also take this subsampling of, say, 10M vectors, train an index and then tune the search parameters from there. While there might be a small margin of error, the chosen build/search parameters should generalize fairly well for the databases that build locally partitioned indexes. + + +Summary of vector search index types +==================================== + +.. list-table:: + :widths: 25 25 50 + :header-rows: 1 + + * - Name + - Trade-offs + - Best to use with... + * - Brute-force (aka flat) + - Exact search but requires exhaustive distance computations + - Tiny datasets (< 100k vectors) + * - IVF-Flat + - Partitions the vector space to reduce distance computations for brute-force search at the expense of recall + - Small datasets (<1M vectors) or larger datasets (>1M vectors) where fast index build time is prioritized over quality. + * - IVF-PQ + - Adds product quantization to IVF-Flat to achieve scale at the expense of recall + - Large datasets (>>1M vectors) where fast index build is prioritized over quality + * - HNSW + - Significantly reduces distance computations at the expense of longer build times + - Small datasets (<1M vectors) or large datasets (>1M vectors) where quality and speed of search are prioritized over index build times + * - CAGRA + - Significantly reduces distance computations at the expense of longer build times (though build times improve over HNSW) + - Large datasets (>>1M vectors) where quality and speed of search are prioritized over index build times but index build times are still important. + * - CAGRA build +HNSW search + - (coming soon to Milvus) + - Significantly reduces distance computations and improves build times at the expense of higher search latency / lower throughput. + Large datasets (>>1M vectors) where index build times and quality of search is important but GPU resources are limited and latency of search is not. 
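+
+As a practical companion to the quality and tuning discussion above, the snippet below shows two pieces that come up repeatedly in this guide: the IVF sizing rule of thumb (n_lists roughly sqrt(n_vectors), n_probes a small percentage of n_lists) and the recall computation used to compare configurations. It is a plain Python/NumPy sketch under assumed inputs, not the cuVS API; `ann_neighbors` and `true_neighbors` stand in for the output of an index being tuned and an exact ground truth computed on a subsample.
+
+.. code-block:: python
+
+    import numpy as np
+
+    def suggested_ivf_params(n_vectors: int, probe_fraction: float = 0.02):
+        # Rule of thumb from the IVF discussion above: n_lists ~ sqrt(n_vectors),
+        # n_probes as a small percentage of n_lists (2% here; try 1%-16%).
+        n_lists = int(np.sqrt(n_vectors))
+        n_probes = max(1, int(n_lists * probe_fraction))
+        return n_lists, n_probes
+
+    def recall_at_k(ann_neighbors: np.ndarray, true_neighbors: np.ndarray) -> float:
+        # "Out of all of my results, which would have been included in the exact results?"
+        n_queries, k = true_neighbors.shape
+        hits = sum(len(set(a[:k]) & set(t)) for a, t in zip(ann_neighbors, true_neighbors))
+        return hits / (n_queries * k)
+
+    print(suggested_ivf_params(10_000_000))  # -> (3162, 63)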
diff --git a/docs/source/comparing_indexes.rst b/docs/source/comparing_indexes.rst new file mode 100644 index 000000000..221aab6d7 --- /dev/null +++ b/docs/source/comparing_indexes.rst @@ -0,0 +1,60 @@ +.. _comparing_indexes: + +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Comparing performance of vector indexes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This document provides a brief overview of the methodology for comparing vector search indexes and models. For guidance on how to choose and configure an index type, please refer to :doc:`this ` guide. + +Unlike traditional database indexes, which will generally return correct results even without performance tuning, vector search indexes are more closely related to ML models and they can return absolutely garbage results if they have not been tuned. + +For this reason, it's important to consider the parameters that an index is built upon, both for its potential quality and throughput/latency, when comparing two trained indexes. While it is easier to build an index with its default parameters than to tune them, a well tuned index can have significantly better search quality AND perform within search performance constraints like maximal throughput and minimal latency. + + +What is recall? +=============== + +Recall is a measure of model quality. Imagine for a particular vector, we know the exact nearest neighbors because we computed them already. The recall for a query result can be computed by taking the set intersection between the exact nearest neighbors and the neighbors actually returned by the index. The number of neighbors in that intersection list gets divided by k, the number of neighbors being requested. To really give a fair estimate of the recall of a model, we use several query vectors, all with ground truth computed, and we take the total neighbors across all intersected neighbor lists and divide by n_queries * k. + +Parameter settings dictate the quality of an index. The graph below shows eight indexes from the same data but with different tuning parameters. Generally speaking, the indexes with higher average recall took longer to build. Which index is fair to report? + +.. image:: images/index_recalls.png + + +How do I compare models or indexing algorithms? +=============================================== + +In order to fairly compare the performance (e.g. latency and throughput) of an indexing algorithm or model against another, we always need to do so with respect to its potential recall. This is important and draws from the ML roots of vector search, but is often confusing to newcomers who might be more familiar with the database world. + +Best practice: Latency and throughput can only be compared at similar levels of recall. If you measure the performance of two indexes at different levels of recall, you are making an unfair comparison. + +Because recall levels can vary quite a bit across parameter settings, we tend to compare recall within a small set of potential buckets, so that parameter settings that perform within each bucket can be fairly compared. + +We suggest averaging performance within a range of recall. For general guidance, we tend to use the following buckets: + +#. 85% - 89% +#. 90% - 94% +#. 95% - 99% +#. >99% + +.. image:: images/recall_buckets.png + + +This allows us to make observations such as "at a 95% recall level, model A can be built 3x faster than model B, but model B has 2x lower latency than model A." + +.. 
image:: images/build_benchmarks.png + + +Another important detail is that we compare these models against their best-case search performance within each recall window. This means that we aim to find models that not only have great recall quality but also have either the highest throughput or lowest latency within the window of interest. These best-cases are most often computed by doing a parameter sweep in a grid search (or other types of search optimizers) and looking at the best cases for each level of recall. + +The resulting data points will construct a curve known as a Pareto optimum. Please note that this process is specifically for showing best-case across recall and throughput/latency, but when we care about finding the parameters that yield the best recall and search performance, we are essentially performing a hyperparameter optimization, which is common in machine learning. + + +How do I do this on large vector databases? +=========================================== + +It turns out that most vector databases, like Milvus for example, make many smaller vector search indexing models for a single ā€œindexā€, and the distribution of the vectors across the smaller index models are assumed to be completely uniform. This means we can use subsampling to our benefit, and tune on smaller sub-samples of the overall dataset. + +Please note, however, that there are often caps on the size of each of these smaller indexes, and that needs to be taken into consideration when choosing the size of the sub sample to tune. + +Please see :doc:`this guide ` for more information on the steps one would take to do this subsampling and tuning process. \ No newline at end of file diff --git a/docs/source/cpp_api/neighbors.rst b/docs/source/cpp_api/neighbors.rst index e5a9fc977..d55d58eb0 100644 --- a/docs/source/cpp_api/neighbors.rst +++ b/docs/source/cpp_api/neighbors.rst @@ -11,7 +11,9 @@ Nearest Neighbors neighbors_bruteforce.rst neighbors_cagra.rst + neighbors_hnsw.rst neighbors_ivf_flat.rst neighbors_ivf_pq.rst neighbors_nn_descent.rst neighbors_refine.rst + neighbors_mg.rst diff --git a/docs/source/cpp_api/neighbors_mg.rst b/docs/source/cpp_api/neighbors_mg.rst new file mode 100644 index 000000000..b68defec9 --- /dev/null +++ b/docs/source/cpp_api/neighbors_mg.rst @@ -0,0 +1,76 @@ +Distributed ANN +=============== + +The SNMG (single-node multi-GPUs) ANN API provides a set of functions to deploy ANN indexes on multiple GPUs. + +.. role:: py(code) + :language: c++ + :class: highlight + +``#include `` + +namespace *cuvs::neighbors::mg* + +Index build parameters +---------------------- + +.. doxygengroup:: mg_cpp_index_params + :project: cuvs + :members: + :content-only: + +Search parameters +---------------------- + +.. doxygengroup:: mg_cpp_search_params + :project: cuvs + :members: + :content-only: + +Index build +----------- + +.. doxygengroup:: mg_cpp_index_build + :project: cuvs + :members: + :content-only: + +Index extend +------------ + +.. doxygengroup:: mg_cpp_index_extend + :project: cuvs + :members: + :content-only: + +Index search +------------ + +.. doxygengroup:: mg_cpp_index_search + :project: cuvs + :members: + :content-only: + +Index serialize +--------------- + +.. doxygengroup:: mg_cpp_serialize + :project: cuvs + :members: + :content-only: + +Index deserialize +----------------- + +.. doxygengroup:: mg_cpp_deserialize + :project: cuvs + :members: + :content-only: + +Distribute pre-built local index +-------------------------------- + +.. 
doxygengroup:: mg_cpp_distribute + :project: cuvs + :members: + :content-only: diff --git a/docs/source/cuvs_bench/build.rst b/docs/source/cuvs_bench/build.rst new file mode 100644 index 000000000..de8432691 --- /dev/null +++ b/docs/source/cuvs_bench/build.rst @@ -0,0 +1,59 @@ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Build cuVS Bench From Source +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Dependencies +============ + +CUDA 11 and a GPU with Volta architecture or later are required to run the benchmarks. + +Please refer to the :doc:`installation docs <../build>` for the base requirements to build cuVS. + +In addition to the base requirements for building cuVS, additional dependencies needed to build the ANN benchmarks include: +1. FAISS GPU >= 1.7.1 +2. Google Logging (GLog) +3. H5Py +4. HNSWLib +5. nlohmann_json +6. GGNN + +`rapids-cmake `_ is used to build the ANN benchmarks so the code for dependencies not already supplied in the CUDA toolkit will be downloaded and built automatically. + +The easiest (and most reproducible) way to install the dependencies needed to build the ANN benchmarks is to use the conda environment file located in the `conda/environments` directory of the cuVS repository. The following command will use `mamba` (which is preferred over `conda`) to build and activate a new environment for compiling the benchmarks: + +.. code-block:: bash + + conda env create --name cuvs_benchmarks -f conda/environments/cuvs_bench_cuda-118_arch-x86_64.yaml + conda activate cuvs_benchmarks + +The above conda environment will also reduce the compile times as dependencies like FAISS will already be installed and not need to be compiled with `rapids-cmake`. + +Compiling the Benchmarks +======================== + +After the needed dependencies are satisfied, the easiest way to compile the ANN benchmarks is through the `build.sh` script in the root of the cuVS source code repository. The following will build the executables for all the supported algorithms: + +.. code-block:: bash + + ./build.sh bench-ann + +You can limit the algorithms that are built by providing a semicolon-delimited list of executable names (each algorithm is suffixed with `_ANN_BENCH`): + +.. code-block:: bash + + ./build.sh bench-ann -n --limit-bench-ann=HNSWLIB_ANN_BENCH;CUVS_IVF_PQ_ANN_BENCH + +Available targets to use with `--limit-bench-ann` are: +- FAISS_GPU_IVF_FLAT_ANN_BENCH +- FAISS_GPU_IVF_PQ_ANN_BENCH +- FAISS_CPU_IVF_FLAT_ANN_BENCH +- FAISS_CPU_IVF_PQ_ANN_BENCH +- FAISS_GPU_FLAT_ANN_BENCH +- FAISS_CPU_FLAT_ANN_BENCH +- GGNN_ANN_BENCH +- HNSWLIB_ANN_BENCH +- CUVS_CAGRA_ANN_BENCH +- CUVS_IVF_PQ_ANN_BENCH +- CUVS_IVF_FLAT_ANN_BENCH + +By default, the `*_ANN_BENCH` executables infer the dataset's datatype from the filename's extension. For example, an extension of `fbin` uses a `float` datatype, `f16bin` uses a `float16` datatype, an extension of `i8bin` uses an `int8_t` datatype, and `u8bin` uses a `uint8_t` type. Currently, only `float`, `float16`, `int8_t`, and `uint8_t` are supported. \ No newline at end of file diff --git a/docs/source/cuvs_bench/datasets.rst b/docs/source/cuvs_bench/datasets.rst new file mode 100644 index 000000000..0f6539e56 --- /dev/null +++ b/docs/source/cuvs_bench/datasets.rst @@ -0,0 +1,62 @@ +~~~~~~~~~~~~~~~~~~~ +cuVS Bench Datasets +~~~~~~~~~~~~~~~~~~~ + +A dataset usually has 4 binary files containing database vectors, query vectors, ground truth neighbors and their corresponding distances. 
For example, Glove-100 dataset has files `base.fbin` (database vectors), `query.fbin` (query vectors), `groundtruth.neighbors.ibin` (ground truth neighbors), and `groundtruth.distances.fbin` (ground truth distances). The first two files are for index building and searching, while the other two are associated with a particular distance and are used for evaluation. + +The file suffixes `.fbin`, `.f16bin`, `.ibin`, `.u8bin`, and `.i8bin` denote that the data types of vectors stored in the file are `float32`, `float16` (a.k.a. `half`), `int`, `uint8`, and `int8`, respectively. +These binary files are little-endian and the format is: the first 8 bytes are `num_vectors` (`uint32_t`) and `num_dimensions` (`uint32_t`), and the following `num_vectors * num_dimensions * sizeof(type)` bytes are vectors stored in row-major order. + +Some implementations can take `float16` database and query vectors as inputs and will have better performance. Use `python/cuvs_bench/cuvs_bench/get_dataset/fbin_to_f16bin.py` to transform a dataset from `float32` to `float16` type. + +Commonly used datasets can be downloaded from two websites: +#. Million-scale datasets can be found at the `Data sets `_ section of `ann-benchmarks `_. + + However, these datasets are in HDF5 format. Use `python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py` to transform the format. The usage of this script is: + + .. code-block:: bash + + $ python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py + usage: hdf5_to_fbin.py [-n] .hdf5 + -n: normalize base/query set + outputs: .base.fbin + .query.fbin + .groundtruth.neighbors.ibin + .groundtruth.distances.fbin + + So for an input `.hdf5` file, four output binary files will be produced. See the previous section for an example of preprocessing the GloVe dataset. + + Most datasets provided by `ann-benchmarks` use `Angular` or `Euclidean` distance. `Angular` denotes cosine distance. However, computing cosine distance reduces to computing inner product by normalizing vectors beforehand. In practice, we can always do the normalization to decrease computation cost, so it's better to measure the performance of inner product rather than cosine distance. The `-n` option of `hdf5_to_fbin.py` can be used to normalize the dataset. + +#. Billion-scale datasets can be found at `big-ann-benchmarks `_. The ground truth file contains both neighbors and distances, and thus should be split. A script is provided for this: + + Take the Deep-1B dataset as an example: + + .. code-block:: bash + + mkdir -p data/deep-1B && cd data/deep-1B + + # download manually "Ground Truth" file of "Yandex DEEP" + # suppose the file name is deep_new_groundtruth.public.10K.bin + python -m cuvs_bench.split_groundtruth deep_new_groundtruth.public.10K.bin groundtruth + + # two files 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin' should be produced + + Besides ground truth files for the whole billion-scale datasets, this site also provides ground truth files for the first 10M or 100M vectors of the base sets. This means we can use these billion-scale datasets as million-scale datasets. To facilitate this, an optional parameter `subset_size` for the dataset can be used. See the next step for further explanation. + +Generate ground truth +===================== + +If you have a dataset, but no corresponding ground truth file, then you can generate ground truth using the `generate_groundtruth` utility. Example usage: + +.. 
code-block:: bash + + # With existing query file + python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --output=groundtruth_dir --queries=/dataset/query.public.10K.fbin + + # With randomly generated queries + python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --output=groundtruth_dir --queries=random --n_queries=10000 + + # Using only a subset of the dataset. Define queries by randomly + # selecting vectors from the (subset of the) dataset. + python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --nrows=2000000 --output=groundtruth_dir --queries=random-choice --n_queries=10000 \ No newline at end of file diff --git a/docs/source/cuvs_bench/index.rst b/docs/source/cuvs_bench/index.rst new file mode 100644 index 000000000..61ac622d2 --- /dev/null +++ b/docs/source/cuvs_bench/index.rst @@ -0,0 +1,770 @@ +~~~~~~~~~~ +cuVS Bench +~~~~~~~~~~ + +cuVS bench provides a reproducible benchmarking tool for various ANN search implementations. It's especially suitable for comparing GPU implementations as well as comparing GPU against CPU. One of the primary goals of cuVS is to capture ideal index configurations for a variety of important usage patterns so the results can be reproduced easily on different hardware environments, such as on-prem and cloud. + +This tool offers several benefits, including + +#. Making fair comparisons of index build times + +#. Making fair comparisons of index search throughput and/or latency + +#. Finding the optimal parameter settings for a range of recall buckets + +#. Easily generating consistently styled plots for index build and search + +#. Profiling blind spots and potential for algorithm optimization + +#. Investigating the relationship between different parameter settings, index build times, and search performance. + +- `Installing the benchmarks`_ + + * `Conda`_ + + * `Docker`_ + +- `How to run the benchmarks`_ + + * `Step 1: Prepare the dataset`_ + + * `Step 2: Build and search index`_ + + * `Step 3: Data export`_ + + * `Step 4: Plot the results`_ + +- `Running the benchmarks`_ + + * `End-to-end: smaller-scale benchmarks (<1M to 10M)`_ + + * `End-to-end: large-scale benchmarks (>10M vectors)`_ + + * `Running with Docker containers`_ + + * `End-to-end run on GPU`_ + + * `Manually run the scripts inside the container`_ + + * `Evaluating the results`_ + +- `Creating and customizing dataset configurations`_ + + * `Multi-GPU benchmarks`_ + +- `Adding a new index algorithm`_ + + * `Implementation and configuration`_ + + * `Adding a Cmake target`_ + +Installing the benchmarks +========================= + +There are two main ways pre-compiled benchmarks are distributed: + +- `Conda`_ For users not using containers but want an easy to install and use Python package. Pip wheels are planned to be added as an alternative for users that cannot use conda and prefer to not use containers. +- `Docker`_ Only needs docker and [NVIDIA docker](https://github.com/NVIDIA/nvidia-docker) to use. Provides a single docker run command for basic dataset benchmarking, as well as all the functionality of the conda solution inside the containers. + +Conda +----- + +.. 
code-block:: bash + + conda create --name cuvs_benchmarks + conda activate cuvs_benchmarks + + # to install GPU package: + conda install -c rapidsai -c conda-forge -c nvidia cuvs-ann-bench= cuda-version=11.8* + + # to install CPU package for usage in CPU-only systems: + conda install -c rapidsai -c conda-forge cuvs-bench-cpu + +The channel `rapidsai` can easily be substituted `rapidsai-nightly` if nightly benchmarks are desired. The CPU package currently allows to run the HNSW benchmarks. + +Please see the :doc:`build instructions ` to build the benchmarks from source. + +Docker +------ + +We provide images for GPU enabled systems, as well as systems without a GPU. The following images are available: + +- `cuvs-bench`: Contains GPU and CPU benchmarks, can run all algorithms supported. Will download million-scale datasets as required. Best suited for users that prefer a smaller container size for GPU based systems. Requires the NVIDIA Container Toolkit to run GPU algorithms, can run CPU algorithms without it. +- `cuvs-bench-datasets`: Contains the GPU and CPU benchmarks with million-scale datasets already included in the container. Best suited for users that want to run multiple million scale datasets already included in the image. +- `cuvs-bench-cpu`: Contains only CPU benchmarks with minimal size. Best suited for users that want the smallest containers to reproduce benchmarks on systems without a GPU. + +Nightly images are located in `dockerhub `_, meanwhile release (stable) versions are located in `NGC `_, starting with release 24.10. + +The following command pulls the nightly container for python version 10, cuda version 12, and CUVS version 23.10: + +.. code-block:: bash + + docker pull rapidsai/cuvs_bench:24.10a-cuda12.0-py3.10 #substitute cuvs_bench for the exact desired container. + +The CUDA and python versions can be changed for the supported values: +- Supported CUDA versions: 11.4 and 12.x +- Supported Python versions: 3.9 and 3.10. + +You can see the exact versions as well in the dockerhub site: +- `cuVS bench images `_ +- `cuVS bench with datasets preloaded images `_ +- `cuVS bench CPU only images `_ + +**Note:** GPU containers use the CUDA toolkit from inside the container, the only requirement is a driver installed on the host machine that supports that version. So, for example, CUDA 11.8 containers can run in systems with a CUDA 12.x capable driver. Please also note that the Nvidia-Docker runtime from the `Nvidia Container Toolkit `_ is required to use GPUs inside docker containers. + +How to run the benchmarks +========================= + +We provide a collection of lightweight Python scripts to run the benchmarks. There are 4 general steps to running the benchmarks and visualizing the results. +#. Prepare Dataset +#. Build Index and Search Index +#. Data Export +#. Plot Results + +Step 1: Prepare the dataset +--------------------------- + +The script `cuvs_bench.get_dataset` will download and unpack the dataset in directory that the user provides. As of now, only million-scale datasets are supported by this script. For more information on :doc:`datasets and formats `. + +The usage of this script is: + +.. 
code-block:: bash + + usage: get_dataset.py [-h] [--name NAME] [--dataset-path DATASET_PATH] [--normalize] + + options: + -h, --help show this help message and exit + --dataset DATASET dataset to download (default: glove-100-angular) + --dataset-path DATASET_PATH + path to download dataset (default: ${RAPIDS_DATASET_ROOT_DIR}) + --normalize normalize cosine distance to inner product (default: False) + +When option `normalize` is provided to the script, any dataset that has cosine distances +will be normalized to inner product. So, for example, the dataset `glove-100-angular` +will be written at location `datasets/glove-100-inner/`. + +Step 2: Build and search index +------------------------------ + +The script `cuvs_bench.run` will build and search indices for a given dataset and its +specified configuration. + +The usage of the script `cuvs_bench.run` is: + +.. code-block:: bash + + usage: __main__.py [-h] [--subset-size SUBSET_SIZE] [-k COUNT] [-bs BATCH_SIZE] [--dataset-configuration DATASET_CONFIGURATION] [--configuration CONFIGURATION] [--dataset DATASET] + [--dataset-path DATASET_PATH] [--build] [--search] [--algorithms ALGORITHMS] [--groups GROUPS] [--algo-groups ALGO_GROUPS] [-f] [-m SEARCH_MODE] + + options: + -h, --help show this help message and exit + --subset-size SUBSET_SIZE + the number of subset rows of the dataset to build the index (default: None) + -k COUNT, --count COUNT + the number of nearest neighbors to search for (default: 10) + -bs BATCH_SIZE, --batch-size BATCH_SIZE + number of query vectors to use in each query trial (default: 10000) + --dataset-configuration DATASET_CONFIGURATION + path to YAML configuration file for datasets (default: None) + --configuration CONFIGURATION + path to YAML configuration file or directory for algorithms Any run groups found in the specified file/directory will automatically override groups of the same name + present in the default configurations, including `base` (default: None) + --dataset DATASET name of dataset (default: glove-100-inner) + --dataset-path DATASET_PATH + path to dataset folder, by default will look in RAPIDS_DATASET_ROOT_DIR if defined, otherwise a datasets subdirectory from the calling directory (default: + os.getcwd()/datasets/) + --build + --search + --algorithms ALGORITHMS + run only comma separated list of named algorithms. If parameters `groups` and `algo-groups` are both undefined, then group `base` is run by default (default: None) + --groups GROUPS run only comma separated groups of parameters (default: base) + --algo-groups ALGO_GROUPS + add comma separated . to run. Example usage: "--algo-groups=cuvs_cagra.large,hnswlib.large" (default: None) + -f, --force re-run algorithms even if their results already exist (default: False) + -m SEARCH_MODE, --search-mode SEARCH_MODE + run search in 'latency' (measure individual batches) or 'throughput' (pipeline batches and measure end-to-end) mode (default: throughput) + -t SEARCH_THREADS, --search-threads SEARCH_THREADS + specify the number threads to use for throughput benchmark. Single value or a pair of min and max separated by ':'. Example --search-threads=1:4. Power of 2 values between 'min' and 'max' will be used. If only 'min' is + specified, then a single test is run with 'min' threads. By default min=1, max=. 
(default: None) + -r, --dry-run dry-run mode will convert the yaml config for the specified algorithms and datasets to the json format that's consumed by the lower-level c++ binaries and then print the command to execute the benchmarks but + will not actually execute the command. (default: False) + +`dataset`: name of the dataset to be searched in `datasets.yaml`_ + +`dataset-configuration`: optional filepath to custom dataset YAML config which has an entry for arg `dataset` + +`configuration`: optional filepath to YAML configuration for an algorithm or to directory that contains YAML configurations for several algorithms. Refer to `Dataset.yaml config`_ for more info. + +`algorithms`: runs all algorithms that it can find in YAML configs found by `configuration`. By default, only the `base` group will be run. + +`groups`: run only specific groups of parameter configurations for an algorithm. Groups are defined in YAML configs (see `configuration`), and by default the `base` group is run + +`algo-groups`: this parameter is helpful to append any specific algorithm+group combination to run the benchmark for in addition to all the arguments from `algorithms` and `groups`. It is of the format `<algorithm>.<group>`, for example `cuvs_cagra.large` + +For every algorithm run by this script, it outputs an index build statistics JSON file in `<dataset-path>/<dataset>/result/build/<{algo},{group}.json>` +and an index search statistics JSON file in `<dataset-path>/<dataset>/result/search/<{algo},{group},k{k},bs{batch_size}.json>`. NOTE: The filenames will not have ",{group}" if `group = "base"`. + +`dataset-path` : +#. data is read from `<dataset-path>/<dataset>` +#. indices are built in `<dataset-path>/<dataset>/index` +#. build/search results are stored in `<dataset-path>/<dataset>/result` + +`build` and `search` : if neither parameter is supplied to the script then it is assumed both are `True`. + +`indices` and `algorithms` : these parameters ensure that the algorithm specified for an index is available in `algos.yaml` and not disabled, as well as having an associated executable. + +Step 3: Data export +------------------- + +The script `cuvs_bench.data_export` will convert the intermediate JSON outputs produced by `cuvs_bench.run` to more easily readable CSV files, which are needed to build charts made by `cuvs_bench.plot`. + +.. code-block:: bash + + usage: data_export.py [-h] [--dataset DATASET] [--dataset-path DATASET_PATH] + + options: + -h, --help show this help message and exit + --dataset DATASET dataset to download (default: glove-100-inner) + --dataset-path DATASET_PATH + path to dataset folder (default: ${RAPIDS_DATASET_ROOT_DIR}) + +The build statistics CSV file is stored in `<dataset-path>/<dataset>/result/build/<{algo},{group}.csv>` +and the index search statistics CSV file in `<dataset-path>/<dataset>/result/search/<{algo},{group},k{k},bs{batch_size},{suffix}.csv>`, where suffix has three values: +#. `raw`: All search results are exported +#. `throughput`: Pareto frontier of throughput results is exported +#. `latency`: Pareto frontier of latency results is exported + +Step 4: Plot the results +------------------------ + +The script `cuvs_bench.plot` will plot results for all algorithms found in the index search statistics CSV files `<dataset-path>/<dataset>/result/search/*.csv`. + +The usage of this script is: + +.. 
code-block:: bash + + usage: [-h] [--dataset DATASET] [--dataset-path DATASET_PATH] [--output-filepath OUTPUT_FILEPATH] [--algorithms ALGORITHMS] [--groups GROUPS] [--algo-groups ALGO_GROUPS] + [-k COUNT] [-bs BATCH_SIZE] [--build] [--search] [--x-scale X_SCALE] [--y-scale {linear,log,symlog,logit}] [--x-start X_START] [--mode {throughput,latency}] + [--time-unit {s,ms,us}] [--raw] + + options: + -h, --help show this help message and exit + --dataset DATASET dataset to plot (default: glove-100-inner) + --dataset-path DATASET_PATH + path to dataset folder (default: /home/coder/cuvs/datasets/) + --output-filepath OUTPUT_FILEPATH + directory for PNG to be saved (default: /home/coder/cuvs) + --algorithms ALGORITHMS + plot only comma separated list of named algorithms. If parameters `groups` and `algo-groups are both undefined, then group `base` is plot by default + (default: None) + --groups GROUPS plot only comma separated groups of parameters (default: base) + --algo-groups ALGO_GROUPS, --algo-groups ALGO_GROUPS + add comma separated . to plot. Example usage: "--algo-groups=cuvs_cagra.large,hnswlib.large" (default: None) + -k COUNT, --count COUNT + the number of nearest neighbors to search for (default: 10) + -bs BATCH_SIZE, --batch-size BATCH_SIZE + number of query vectors to use in each query trial (default: 10000) + --build + --search + --x-scale X_SCALE Scale to use when drawing the X-axis. Typically linear, logit or a2 (default: linear) + --y-scale {linear,log,symlog,logit} + Scale to use when drawing the Y-axis (default: linear) + --x-start X_START Recall values to start the x-axis from (default: 0.8) + --mode {throughput,latency} + search mode whose Pareto frontier is used on the y-axis (default: throughput) + --time-unit {s,ms,us} + time unit to plot when mode is latency (default: ms) + --raw Show raw results (not just Pareto frontier) of mode arg (default: False) + +`mode`: plots pareto frontier of `throughput` or `latency` results exported in the previous step + +`algorithms`: plots all algorithms that it can find results for the specified `dataset`. By default, only `base` group will be plotted. + +`groups`: plot only specific groups of parameters configurations for an algorithm. Groups are defined in YAML configs (see `configuration`), and by default run `base` group + +`algo-groups`: this parameter is helpful to append any specific algorithm+group combination to plot results for in addition to all the arguments from `algorithms` and `groups`. It is of the format `.`, or for example, `cuvs_cagra.large` + +Running the benchmarks +====================== + +End-to-end: smaller-scale benchmarks (<1M to 10M) +------------------------------------------------- + +The steps below demonstrate how to download, install, and run benchmarks on a subset of 10M vectors from the Yandex Deep-1B dataset By default the datasets will be stored and used from the folder indicated by the `RAPIDS_DATASET_ROOT_DIR` environment variable if defined, otherwise a datasets sub-folder from where the script is being called: + +.. code-block:: bash + + + # (1) prepare dataset. + python -m cuvs_bench.get_dataset --dataset deep-image-96-angular --normalize + + # (2) build and search index + python -m cuvs_bench.run --dataset deep-image-96-inner --algorithms cuvs_cagra --batch-size 10 -k 10 + + # (3) export data + python -m cuvs_bench.data_export --dataset deep-image-96-inner + + # (4) plot results + python -m cuvs_bench.plot --dataset deep-image-96-inner + + +.. 
list-table:: + + * - Dataset name + - Train rows + - Columns + - Test rows + - Distance + + * - `deep-image-96-angular` + - 10M + - 96 + - 10K + - Angular + + * - `fashion-mnist-784-euclidean` + - 60K + - 784 + - 10K + - Euclidean + + * - `glove-50-angular` + - 1.1M + - 50 + - 10K + - Angular + + * - `glove-100-angular` + - 1.1M + - 100 + - 10K + - Angular + + * - `mnist-784-euclidean` + - 60K + - 784 + - 10K + - Euclidean + + * - `nytimes-256-angular` + - 290K + - 256 + - 10K + - Angular + + * - `sift-128-euclidean` + - 1M + - 128 + - 10K + - Euclidean + +All of the datasets above contain ground test datasets with 100 neighbors. Thus `k` for these datasets must be less than or equal to 100. + +End-to-end: large-scale benchmarks (>10M vectors) +------------------------------------------------- + +`cuvs_bench.get_dataset` cannot be used to download the `billion-scale datasets`_ due to their size. You should instead use our billion-scale datasets guide to download and prepare them. +All other python commands mentioned below work as intended once the billion-scale dataset has been downloaded. + +To download billion-scale datasets, visit `big-ann-benchmarks `_ + +We also provide a new dataset called `wiki-all` containing 88 million 768-dimensional vectors. This dataset is meant for benchmarking a realistic retrieval-augmented generation (RAG)/LLM embedding size at scale. It also contains 1M and 10M vector subsets for smaller-scale experiments. See our :doc:`Wiki-all Dataset Guide ` for more information and to download the dataset. + + +The steps below demonstrate how to download, install, and run benchmarks on a subset of 100M vectors from the Yandex Deep-1B dataset. Please note that datasets of this scale are recommended for GPUs with larger amounts of memory, such as the A100 or H100. + +.. code-block:: bash + + mkdir -p datasets/deep-1B + # (1) prepare dataset + # download manually "Ground Truth" file of "Yandex DEEP" + # suppose the file name is deep_new_groundtruth.public.10K.bin + python -m cuvs_bench.split_groundtruth --groundtruth datasets/deep-1B/deep_new_groundtruth.public.10K.bin + # two files 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin' should be produced + + # (2) build and search index + python -m cuvs_bench.run --dataset deep-1B --algorithms cuvs_cagra --batch-size 10 -k 10 + + # (3) export data + python -m cuvs_bench.data_export --dataset deep-1B + + # (4) plot results + python -m cuvs_bench.plot --dataset deep-1B + +The usage of `python -m cuvs_bench.split_groundtruth` is: + +.. code-block:: bash + usage: split_groundtruth.py [-h] --groundtruth GROUNDTRUTH + + options: + -h, --help show this help message and exit + --groundtruth GROUNDTRUTH + Path to billion-scale dataset groundtruth file (default: None) + +Running with Docker containers +------------------------------ + +Two methods are provided for running the benchmarks with the Docker containers. + +End-to-end run on GPU +~~~~~~~~~~~~~~~~~~~~~ + +When no other entrypoint is provided, an end-to-end script will run through all the steps in `Running the benchmarks`_ above. + +For GPU-enabled systems, the `DATA_FOLDER` variable should be a local folder where you want datasets stored in `$DATA_FOLDER/datasets` and results in `$DATA_FOLDER/result` (we highly recommend `$DATA_FOLDER` to be a dedicated folder for the datasets and results of the containers): + +.. 
code-block:: bash + + export DATA_FOLDER=path/to/store/datasets/and/results + docker run --gpus all --rm -it -u $(id -u) \ + -v $DATA_FOLDER:/data/benchmarks \ + rapidsai/cuvs-bench:24.10a-cuda11.8-py3.10 \ + "--dataset deep-image-96-angular" \ + "--normalize" \ + "--algorithms cuvs_cagra,cuvs_ivf_pq --batch-size 10 -k 10" \ + "" + +Usage of the above command is as follows: + +.. list-table:: + + * - Argument + - Description + + * - `rapidsai/cuvs-bench:24.10a-cuda11.8-py3.10` + - Image to use. Can be either `cuvs-bench` or `cuvs-bench-datasets` + + * - `"--dataset deep-image-96-angular"` + - Dataset name + + * - `"--normalize"` + - Whether to normalize the dataset + + * - `"--algorithms cuvs_cagra,hnswlib --batch-size 10 -k 10"` + - Arguments passed to the `run` script, such as the algorithms to benchmark, the batch size, and `k` + + * - `""` + - Additional (optional) arguments that will be passed to the `plot` script. + +***Note about user and file permissions:*** The flag `-u $(id -u)` allows the user inside the container to match the `uid` of the user outside the container, allowing the container to read and write to the mounted volume indicated by the `$DATA_FOLDER` variable. + +End-to-end run on CPU +~~~~~~~~~~~~~~~~~~~~~ + +The container arguments in the above section can also be used for the CPU-only container, which can be used on systems that don't have a GPU installed. + +***Note:*** the image changes to the `cuvs-bench-cpu` container and the `--gpus all` argument is no longer used: + +.. code-block:: bash + + export DATA_FOLDER=path/to/store/datasets/and/results + docker run --rm -it -u $(id -u) \ + -v $DATA_FOLDER:/data/benchmarks \ + rapidsai/cuvs-bench-cpu:24.10a-py3.10 \ + "--dataset deep-image-96-angular" \ + "--normalize" \ + "--algorithms hnswlib --batch-size 10 -k 10" \ + "" + +Manually run the scripts inside the container +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +All of the `cuvs-bench` images contain the Conda packages, so they can be used directly by logging into the container itself: + +.. code-block:: bash + + export DATA_FOLDER=path/to/store/datasets/and/results + docker run --gpus all --rm -it -u $(id -u) \ + --entrypoint /bin/bash \ + --workdir /data/benchmarks \ + -v $DATA_FOLDER:/data/benchmarks \ + rapidsai/cuvs-bench:24.10a-cuda11.8-py3.10 + +This will drop you into a command line in the container, with the `cuvs-bench` python package ready to use, as described in the `Running the benchmarks`_ section above: + +.. code-block:: bash + + (base) root@00b068fbb862:/data/benchmarks# python -m cuvs_bench.get_dataset --dataset deep-image-96-angular --normalize + +Additionally, the containers can be run in detached mode without any issue. + +Evaluating the results +---------------------- + +The benchmarks capture several different measurements. The table below describes each of the measurements for index build benchmarks: + +.. list-table:: + + * - Name + - Description + + * - Benchmark + - A name that uniquely identifies the benchmark instance + + * - Time + - Wall-time spent training the index + + * - CPU + - CPU time spent training the index + + * - Iterations + - Number of iterations (this is usually 1) + + * - GPU + - GPU time spent building + + * - index_size + - Number of vectors used to train index + +The table below describes each of the measurements for the index search benchmarks. The most important measurements are `Latency`, `items_per_second`, and `end_to_end`. + +.. 
list-table:: + + * - Name + - Description + + * - Benchmark + - A name that uniquely identifies the benchmark instance + + * - Time + - The wall-clock time of a single iteration (batch) divided by the number of threads. + + * - CPU + - The average CPU time (user + sys time). This does not include idle time (which can also happen while waiting for GPU sync). + + * - Iterations + - Total number of batches. This is going to be `total_queries` / `n_queries`. + + * - GPU + - GPU latency of a single batch (seconds). In throughput mode this is averaged over multiple threads. + + * - Latency + - Latency of a single batch (seconds), calculated from wall-clock time. In throughput mode this is averaged over multiple threads. + + * - Recall + - Proportion of correct neighbors to ground truth neighbors. Note this column is only present if groundtruth file is specified in dataset configuration. + + * - items_per_second + - Total throughput, a.k.a Queries per second (QPS). This is approximately `total_queries` / `end_to_end`. + + * - k + - Number of neighbors being queried in each iteration + + * - end_to_end + - Total time taken to run all batches for all iterations + + * - n_queries + - Total number of query vectors in each batch + + * - total_queries + - Total number of query vectors across all iterations ( = `iterations` * `n_queries`) + +Note the following: +- A slightly different method is used to measure `Time` and `end_to_end`. That is why `end_to_end` = `Time` * `Iterations` holds only approximately. +- The actual table displayed on the screen may differ slightly as the hyper-parameters will also be displayed for each different combination being benchmarked. +- Recall calculation: the number of queries processed per test depends on the number of iterations. Because of this, recall can show slight fluctuations if fewer neighbors are processed than are available for the benchmark. + +Creating and customizing dataset configurations +=============================================== + +A single configuration will often define a set of algorithms, with associated index and search parameters, that can be generalized across datasets. We use YAML to define dataset specific and algorithm specific configurations. + +A default `datasets.yaml` is provided by cuVS in `${CUVS_HOME}/python/cuvs-ann-bench/src/cuvs_bench/run/conf` with configurations available for several datasets. Here's a simple example entry for the `sift-128-euclidean` dataset: + +.. code-block:: yaml + + - name: sift-128-euclidean + base_file: sift-128-euclidean/base.fbin + query_file: sift-128-euclidean/query.fbin + groundtruth_neighbors_file: sift-128-euclidean/groundtruth.neighbors.ibin + dims: 128 + distance: euclidean + +Configuration files for ANN algorithms supported by `cuvs-bench` are provided in `${CUVS_HOME}/python/cuvs-bench/src/cuvs_bench/run/conf`. The `cuvs_cagra` algorithm configuration looks like: + +.. code-block:: yaml + + name: cuvs_cagra + groups: + base: + build: + graph_degree: [32, 64] + intermediate_graph_degree: [64, 96] + graph_build_algo: ["NN_DESCENT"] + search: + itopk: [32, 64, 128] + + large: + build: + graph_degree: [32, 64] + search: + itopk: [32, 64, 128] + +The default parameters for which the benchmarks are run can be overridden by creating a custom YAML file for algorithms with a `base` group. + +The config above has 2 fields: +1. `name` - defines the name of the algorithm for which the parameters are being specified. +2. `groups` - defines a run group which has a particular set of parameters. 
Each group helps create a cross-product of all hyper-parameter fields for `build` and `search`. + +The table below contains all algorithms supported by cuVS. Each unique algorithm will have its own set of `build` and `search` settings. The :doc:`ANN Algorithm Parameter Tuning Guide ` contains detailed instructions on choosing build and search parameters for each supported algorithm. + +.. list-table:: + + * - Library + - Algorithms + + * - FAISS_GPU + - `faiss_gpu_flat`, `faiss_gpu_ivf_flat`, `faiss_gpu_ivf_pq` + + * - FAISS_CPU + - `faiss_cpu_flat`, `faiss_cpu_ivf_flat`, `faiss_cpu_ivf_pq` + + * - GGNN + - `ggnn` + + * - HNSWLIB + - `hnswlib` + + * - cuVS + - `cuvs_brute_force`, `cuvs_cagra`, `cuvs_ivf_flat`, `cuvs_ivf_pq`, `cuvs_cagra_hnswlib` + + +Multi-GPU benchmarks +-------------------- + +cuVS implements single node multi-GPU versions of IVF-Flat, IVF-PQ and CAGRA indexes. + +.. list-table:: + + * - Index type + - Multi-GPU algo name + + * - IVF-Flat + - `cuvs_mg_ivf_flat` + + * - IVF-PQ + - `cuvs_mg_ivf_pq` + + * - CAGRA + - `cuvs_mg_cagra` + + +Adding a new index algorithm +============================ + +Implementation and configuration +-------------------------------- + +Implementation of a new algorithm should be a C++ class that inherits `class ANN` (defined in `cpp/bench/ann/src/ann.h`) and implements all the pure virtual functions. + +In addition, it should define two `struct`s for building and searching parameters. The searching parameter class should inherit `struct ANN::AnnSearchParam`. Take `class HnswLib` as an example, its definition is: + +.. code-block:: c++ + template + class HnswLib : public ANN { + public: + struct BuildParam { + int M; + int ef_construction; + int num_threads; + }; + + using typename ANN::AnnSearchParam; + struct SearchParam : public AnnSearchParam { + int ef; + int num_threads; + }; + + // ... + }; + + +The benchmark program uses JSON format natively in a configuration file to specify indexes to build, along with the build and search parameters. However the JSON config files are overly verbose and are not meant to be used directly. Instead, the Python scripts parse YAML and create these json files automatically. It's important to realize that these json objects align with the yaml objects for `build_param`, whose value is a JSON object, and `search_param`, whose value is an array of JSON objects. Take the json configuration for `HnswLib` as an example of the json after it's been parsed from yaml: + +.. code-block:: json + { + "name" : "hnswlib.M12.ef500.th32", + "algo" : "hnswlib", + "build_param": {"M":12, "efConstruction":500, "numThreads":32}, + "file" : "/path/to/file", + "search_params" : [ + {"ef":10, "numThreads":1}, + {"ef":20, "numThreads":1}, + {"ef":40, "numThreads":1}, + ], + "search_result_file" : "/path/to/file" + }, + +The build and search params are ultimately passed to the C++ layer as json objects for each param configuration to benchmark. The code below shows how to parse these params for `Hnswlib`: + +1. First, add two functions for parsing JSON object to `struct BuildParam` and `struct SearchParam`, respectively: + +.. 
code-block:: c++ + + template + void parse_build_param(const nlohmann::json& conf, + typename cuann::HnswLib::BuildParam& param) { + param.ef_construction = conf.at("efConstruction"); + param.M = conf.at("M"); + if (conf.contains("numThreads")) { + param.num_threads = conf.at("numThreads"); + } + } + + template + void parse_search_param(const nlohmann::json& conf, + typename cuann::HnswLib::SearchParam& param) { + param.ef = conf.at("ef"); + if (conf.contains("numThreads")) { + param.num_threads = conf.at("numThreads"); + } + } + + + +2. Next, add corresponding `if` case to functions `create_algo()` (in `cpp/bench/ann/) and `create_search_param()` by calling parsing functions. The string literal in `if` condition statement must be the same as the value of `algo` in configuration file. For example, + +.. code-block:: c++ + // JSON configuration file contains a line like: "algo" : "hnswlib" + if (algo == "hnswlib") { + // ... + } + +Adding a Cmake target +--------------------- + +In `cuvs/cpp/bench/ann/CMakeLists.txt`, we provide a `CMake` function to configure a new Benchmark target with the following signature: + + +.. code-block:: cmake + ConfigureAnnBench( + NAME + PATH + INCLUDES + CXXFLAGS + LINKS + ) + +To add a target for `HNSWLIB`, we would call the function as: + +.. code-block:: cmake + + ConfigureAnnBench( + NAME HNSWLIB PATH bench/ann/src/hnswlib/hnswlib_benchmark.cpp INCLUDES + ${CMAKE_CURRENT_BINARY_DIR}/_deps/hnswlib-src/hnswlib CXXFLAGS "${HNSW_CXX_FLAGS}" + ) + +This will create an executable called `HNSWLIB_ANN_BENCH`, which can then be used to run `HNSWLIB` benchmarks. + +Add a new entry to `algos.yaml` to map the name of the algorithm to its binary executable and specify whether the algorithm requires GPU support. + +.. code-block:: yaml + cuvs_ivf_pq: + executable: CUVS_IVF_PQ_ANN_BENCH + requires_gpu: true + +`executable` : specifies the name of the binary that will build/search the index. It is assumed to be available in `cuvs/cpp/build/`. +`requires_gpu` : denotes whether an algorithm requires GPU to run. \ No newline at end of file diff --git a/docs/source/cuvs_bench/param_tuning.rst b/docs/source/cuvs_bench/param_tuning.rst new file mode 100644 index 000000000..faffa9daf --- /dev/null +++ b/docs/source/cuvs_bench/param_tuning.rst @@ -0,0 +1,674 @@ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +cuVS Bench Parameter Tuning Guide +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This guide outlines the various parameter settings that can be specified in :doc:`cuVS Benchmarks ` yaml configuration files and explains the impact they have on corresponding algorithms to help inform their settings for benchmarking across desired levels of recall. + +cuVS Indexes +============ + +cuvs_brute_force +---------------- + +Use cuVS brute-force index for exact search. Brute-force has no further build or search parameters. + +cuvs_ivf_flat +------------- + +IVF-flat uses an inverted-file index, which partitions the vectors into a series of clusters, or lists, storing them in an interleaved format which is optimized for fast distance computation. The searching of an IVF-flat index reduces the total vectors in the index to those within some user-specified nearest clusters called probes. + +IVF-flat is a simple algorithm which won't save any space, but it provides competitive search times even at higher levels of recall. + +.. 
list-table:: + + * - Parameter + - Type + - Required + - Data Type + - Default + - Description + + * - `nlist` + - `build` + - Y + - Positive integer >0 + - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained. + + * - `niter` + - `build` + - N + - Positive integer >0 + - 20 + - Number of kmeans iterations to use when training the ivf clusters + + * - `ratio` + - `build` + - N + - Positive integer >0 + - 2 + - `1/ratio` is the number of training points which should be used to train the clusters. + + * - `dataset_memory_type` + - `build` + - N + - [`device`, `host`, `mmap`] + - `mmap` + - Where should the dataset reside? + + * - `query_memory_type` + - `search` + - [`device`, `host`, `mmap`] + - `device` + - Where should the queries reside? + + * - `nprobe` + - `search` + - Y + - Positive integer >0 + - + - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index. + +cuvs_ivf_pq +----------- + +IVF-pq is an inverted-file index, which partitions the vectors into a series of clusters, or lists, in a similar way to IVF-flat above. The difference is that IVF-PQ uses product quantization to also compress the vectors, giving the index a smaller memory footprint. Unfortunately, higher levels of compression can also shrink recall, which a refinement step can improve when the original vectors are still available. + +.. list-table:: + + * - Parameter + - Type + - Required + - Data Type + - Default + - Description + + * - `nlist` + - `build` + - Y + - Positive integer >0 + - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained. + + * - `niter` + - `build` + - N + - Positive integer >0 + - 20 + - Number of kmeans iterations to use when training the ivf clusters + + * - `ratio` + - `build` + - N + - Positive integer >0 + - 2 + - `1/ratio` is the number of training points which should be used to train the clusters. + + * - `pq_dim` + - `build` + - N + - Positive integer. Multiple of 8. + - 0 + - Dimensionality of the vector after product quantization. When 0, a heuristic is used to select this value. + + * - `pq_bits` + - `build` + - N + - Positive integer [4-8] + - 8 + - Bit length of the vector element after quantization. + + * - `codebook_kind` + - `build` + - N + - [`cluster`, `subspace`] + - `subspace` + - Type of codebook. See :doc:`IVF-PQ index overview <../indexes/ivfpq>` for more detail + + * - `dataset_memory_type` + - `build` + - N + - [`device`, `host`, `mmap`] + - `mmap` + - Where should the dataset reside? + + * - `query_memory_type` + - `search` + - [`device`, `host`, `mmap`] + - `device` + - Where should the queries reside? + + * - `nprobe` + - `search` + - Y + - Positive integer >0 + - + - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index. + + * - `internalDistanceDtype` + - `search` + - N + - [`float`, `half`] + - `half` + - The precision to use for the distance computations. Lower precision can increase performance at the cost of accuracy. + + * - `smemLutDtype` + - `search` + - N + - [`float`, `half`, `fp8`] + - `half` + - The precision to use for the lookup table in shared memory. Lower precision can increase performance at the cost of accuracy. 
+ + * - `refine_ratio` + - `search` + - N + - Positive integer >0 + - 1 + - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors. + + +cuvs_cagra +---------- + +CAGRA uses a graph-based index, which creates an intermediate, approximate kNN graph using IVF-PQ and then further refining and optimizing to create a final kNN graph. This kNN graph is used by CAGRA as an index for search. + +.. list-table:: + + * - Parameter + - Type + - Required + - Data Type + - Default + - Description + + * - `graph_degree` + - `build` + - N + - Positive integer >0 + - 64 + - Degree of the final kNN graph index. + + * - `intermediate_graph_degree` + - `build` + - N + - Positive integer >0 + - 128 + - Degree of the intermediate kNN graph before the CAGRA graph is optimized + + * - `graph_build_algo` + - `build` + - `N + - [`IVF_PQ`, NN_DESCENT`] + - `IVF_PQ` + - Algorithm to use for building the initial kNN graph, from which CAGRA will optimize into the navigable CAGRA graph + + * - `dataset_memory_type` + - `build` + - N + - [`device`, `host`, `mmap`] + - `mmap` + - Where should the dataset reside? + + * - `query_memory_type` + - `search` + - [`device`, `host`, `mmap`] + - `device` + - Where should the queries reside? + + * - `itopk` + - `search` + - N + - Positive integer >0 + - 64 + - Number of intermediate search results retained during the search. Higher values improve search accuracy at the cost of speed + + * - `search_width` + - `search` + - N + - Positive integer >0 + - 1 + - Number of graph nodes to select as the starting point for the search in each iteration. + + * - `max_iterations` + - `search` + - N + - Positive integer >=0 + - 0 + - Upper limit of search iterations. Auto select when 0 + + * - `algo` + - `search` + - N + - [`auto`, `single_cta`, `multi_cta`, `multi_kernel`] + - `auto` + - Algorithm to use for search. It's usually best to leave this to `auto`. + + * - `graph_memory_type` + - `search` + - N + - [`device`, `host_pinned`, `host_huge_page`] + - `device` + - Memory type to store graph + + * - `internal_dataset_memory_type` + - `search` + - N + - [`device`, `host_pinned`, `host_huge_page`] + - `device` + - Memory type to store dataset + +The `graph_memory_type` or `internal_dataset_memory_type` options can be useful for large datasets that do not fit the device memory. Setting `internal_dataset_memory_type` other than `device` has negative impact on search speed. Using `host_huge_page` option is only supported on systems with Heterogeneous Memory Management or on platforms that natively support GPU access to system allocated memory, for example Grace Hopper. + +To fine tune CAGRA index building we can customize IVF-PQ index builder options using the following settings. These take effect only if `graph_build_algo == "IVF_PQ"`. It is recommended to experiment using a separate IVF-PQ index to find the config that gives the largest QPS for large batch. Recall does not need to be very high, since CAGRA further optimizes the kNN neighbor graph. Some of the default values are derived from the dataset size which is assumed to be [n_vecs, dim]. + +.. list-table:: + + * - Parameter + - Type + - Required + - Data Type + - Default + - Description + + * - `ivf_pq_build_nlist` + - `build` + - N + - Positive integer >0 + - sqrt(n_vecs) + - Number of clusters to partition the vectors into. 
Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained. + + * - `ivf_pq_build_niter` + - `build` + - N + - Positive integer >0 + - 25 + - Number of k-means iterations to use when training the clusters. + + * - `ivf_pq_build_ratio` + - `build` + - N + - Positive integer >0 + - 10 + - `1/ratio` is the number of training points which should be used to train the clusters. + + * - `ivf_pq_pq_dim` + - `build` + - N + - Positive integer. Multiple of 8 + - dim/2 rounded up to 8 + - Dimensionality of the vector after product quantization. When 0, a heuristic is used to select this value. `pq_dim` * `pq_bits` must be a multiple of 8. + + * - `ivf_pq_build_pq_bits` + - `build` + - N + - Positive integer [4-8] + - 8 + - Bit length of the vector element after quantization. + + * - `ivf_pq_build_codebook_kind` + - `build` + - N + - [`cluster`, `subspace`] + - `subspace` + - Type of codebook. See :doc:`IVF-PQ index overview <../indexes/ivfpq>` for more detail + + * - `ivf_pq_build_nprobe` + - `search` + - N + - Positive integer >0 + - min(2*dim, nlist) + - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index. + + * - `ivf_pq_build_internalDistanceDtype` + - `search` + - N + - [`float`, `half`] + - `half` + - The precision to use for the distance computations. Lower precision can increase performance at the cost of accuracy. + + * - `ivf_pq_build_smemLutDtype` + - `search` + - N + - [`float`, `half`, `fp8`] + - `fp8` + - The precision to use for the lookup table in shared memory. Lower precision can increase performance at the cost of accuracy. + + * - `ivf_pq_build_refine_ratio` + - `search` + - N + - Positive integer >0 + - 2 + - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors. + +Alternatively, if `graph_build_algo == "NN_DESCENT"`, then we can customize the following parameters + +.. list-table:: + + * - Parameter + - Type + - Required + - Data Type + - Default + - Description + + * - `nn_descent_niter` + - `build` + - N + - Positive integer >0 + - 20 + - Number of nn-descent iterations + + * - `nn_descent_intermediate_graph_degree + - `build` + - N + - Positive integer >0 + - `cagra.intermediate_graph_degree` * 1.5 + - Intermadiate graph degree during nn-descent iterations + + * - nn_descent_termination_threshold + - `build` + - N + - Positive float >0 + - 1e-4 + - Early stopping threshold for nn-descent convergence + +cuvs_cagra_hnswlib +------------------ + +This is a benchmark that enables interoperability between `CAGRA` built `HNSW` search. It uses the `CAGRA` built graph as the base layer of an `hnswlib` index to search queries only within the base layer (this is enabled with a simple patch to `hnswlib`). + +`build` : Same as `build` of CAGRA + +`search` : Same as `search` of Hnswlib + +FAISS Indexes +============= + +faiss_gpu_flat +-------------- + +Use FAISS flat index on the GPU, which performs an exact search using brute-force and doesn't have any further build or search parameters. + +faiss_gpu_ivf_flat +------------------ + +IVF-flat uses an inverted-file index, which partitions the vectors into a series of clusters, or lists, storing them in an interleaved format which is optimized for fast distance computation. 
The searching of an IVF-flat index reduces the total vectors in the index to those within some user-specified nearest clusters called probes. + +IVF-flat is a simple algorithm which won't save any space, but it provides competitive search times even at higher levels of recall. + +.. list-table:: + + * - Parameter + - Type + - Required + - Data Type + - Default + - Description + + * - `nlists` + - `build` + - Y + - Positive integer >0 + - + - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained + + * - `ratio` + - `build` + - N + - Positive integer >0 + - 2 + - `1/ratio` is the number of training points which should be used to train the clusters. + + * - `nprobe` + - `search` + - Y + - Positive integer >0 + - + - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index. + +faiss_gpu_ivf_pq +---------------- + +IVF-pq is an inverted-file index, which partitions the vectors into a series of clusters, or lists, in a similar way to IVF-flat above. The difference is that IVF-PQ uses product quantization to also compress the vectors, giving the index a smaller memory footprint. Unfortunately, higher levels of compression can also shrink recall, which a refinement step can improve when the original vectors are still available. + +.. list-table:: + + * - Parameter + - Type + - Required + - Data Type + - Default + - Description + + * - `nlist` + - `build` + - Y + - Positive integer >0 + - + - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained. + + * - `ratio` + - `build` + - N + - Positive integer >0 + - 2 + - `1/ratio` is the number of training points which should be used to train the clusters. + + * - `M_ratio` + - `build` + - Y + - Positive integer. Power of 2 [8-64] + - + - Ratio of numbeer of chunks or subquantizers for each vector. Computed by `dims` / `M_ratio` + + * - `usePrecomputed` + - `build` + - N + - Boolean + - `false` + - Use pre-computed lookup tables to speed up search at the cost of increased memory usage. + + * - `useFloat16` + - `build` + - N + - Boolean + - `false` + - Use half-precision floats for clustering step. + + * - `nprobe` + - `search` + - Y + - Positive integer >0 + - + - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index. + + * - `refine_ratio` + - `search` + - N + - Positive number >=1 + - 1 + - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors. + + +faiss_cpu_flat +-------------- + +Use FAISS flat index on the CPU, which performs an exact search using brute-force and doesn't have any further build or search parameters. + +.. list-table:: + + * - Parameter + - Type + - Required + - Data Type + - Default + - Description + + * - `numThreads` + - `search` + - N + - Positive integer >0 + - 1 + - Number of threads to use for queries. + +faiss_cpu_ivf_flat +------------------ + +Use FAISS IVF-Flat index on CPU + +.. list-table:: + + * - Parameter + - Type + - Required + - Data Type + - Default + - Description + + * - `nlists` + - `build` + - Y + - Positive integer >0 + - + - Number of clusters to partition the vectors into. 
Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained + + * - `ratio` + - `build` + - N + - Positive integer >0 + - 2 + - `1/ratio` is the number of training points which should be used to train the clusters. + + * - `nprobe` + - `search` + - Y + - Positive integer >0 + - + - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index. + + * - `numThreads` + - `search` + - N + - Positive integer >0 + - 1 + - Number of threads to use for queries. + +faiss_cpu_ivf_pq +---------------- + +Use FAISS IVF-PQ index on CPU + +.. list-table:: + + * - Parameter + - Type + - Required + - Data Type + - Default + - Description + + * - `nlist` + - `build` + - Y + - Positive integer >0 + - + - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained. + + * - `ratio` + - `build` + - N + - Positive integer >0 + - 2 + - `1/ratio` is the number of training points which should be used to train the clusters. + + * - `M` + - `build` + - Y + - Positive integer. Power of 2 [8-64] + - + - Ratio of number of chunks or subquantizers for each vector. Computed by `dims` / `M_ratio` + + * - `usePrecomputed` + - `build` + - N + - Boolean + - `false` + - Use pre-computed lookup tables to speed up search at the cost of increased memory usage. + + * - `bitsPerCode` + - `build` + - N + - Positive integer [4-8] + - 8 + - Number of bits for representing each quantized code. + + * - `nprobe` + - `search` + - Y + - Positive integer >0 + - + - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index. + + * - `refine_ratio` + - `search` + - N + - Positive number >=1 + - 1 + - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors. + + * - `numThreads` + - `search` + - N + - Positive integer >0 + - 1 + - Number of threads to use for queries. + +HNSW +==== + +hnswlib +------- + +.. list-table:: + + * - Parameter + - Type + - Required + - Data Type + - Default + - Description + + * - `efConstruction` + - `build` + - Y + - Positive integer >0 + - + - Controls index time and accuracy. Bigger values increase the index quality. At some point, increasing this will no longer improve the quality. + + * - `M` + - `build` + - Y + - Positive integer. Often between 2-100 + - + - umber of bi-directional links create for every new element during construction. Higher values work for higher intrinsic dimensionality and/or high recall, low values can work for datasets with low intrinsic dimensionality and/or low recalls. Also affects the algorithm's memory consumption. + + * - `numThreads` + - `build` + - N + - Positive integer >0 + - 1 + - Number of threads to use to build the index. + + * - `ef` + - `search` + - Y + - Positive integer >0 + - + - Size of the dynamic list for the nearest neighbors used for search. Higher value leads to more accurate but slower search. Cannot be lower than `k`. + + * - `numThreads` + - `search` + - N + - Positive integer >0 + - 1 + - Number of threads to use for queries. + +Please refer to `HNSW algorithm parameters guide `_ from `hnswlib` to learn more about these arguments. 
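To tie these parameters back to the YAML configuration format described earlier in the benchmarking guide, a sweep over the hnswlib settings above might look like the following sketch. The `groups`/`base` structure mirrors the `cuvs_cagra` example shown earlier; the specific values are illustrative only and not recommended defaults.

.. code-block:: yaml

    name: hnswlib
    groups:
      base:
        build:
          M: [12, 24, 36]
          efConstruction: [128, 256, 512]
          numThreads: [32]
        search:
          ef: [10, 20, 40, 80, 160]
          numThreads: [1]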
\ No newline at end of file diff --git a/docs/source/cuvs_bench/wiki_all_dataset.rst b/docs/source/cuvs_bench/wiki_all_dataset.rst new file mode 100644 index 000000000..04ac7d9a4 --- /dev/null +++ b/docs/source/cuvs_bench/wiki_all_dataset.rst @@ -0,0 +1,55 @@ +~~~~~~~~~~~~~~~~ +Wiki-all Dataset +~~~~~~~~~~~~~~~~ + + +The `wiki-all` dataset was created to stress vector search algorithms at scale with both a large number of vectors and dimensions. The entire dataset contains 88M vectors with 768 dimensions and is meant for testing the types of vectors one would typically encounter in retrieval augmented generation (RAG) workloads. The full dataset is ~251GB in size, which is intentionally larger than the typical memory of GPUs. The massive scale is intended to promote the use of compression and efficient out-of-core methods for both indexing and search. + +The dataset is composed of English wiki texts from `Kaggle `_ and multi-lingual wiki texts from `Cohere Wikipedia `_. + +Cohere's English texts are older (2022) and smaller than the Kaggle English Wiki texts (2023), so the English texts have been removed from Cohere completely. The final Wiki texts include English Wiki from Kaggle and the other languages from Cohere. The English texts constitute 50% of the total text size. + +To form the final dataset, the Wiki texts were chunked into 85 million 128-token pieces. For reference, Cohere chunks Wiki texts into 104-token pieces. Finally, the embeddings of each chunk were computed using the `paraphrase-multilingual-mpnet-base-v2 `_ embedding model. The resulting dataset is an embedding matrix of size 88 million by 768. Also included with the dataset is a query file containing 10k query vectors and a groundtruth file to evaluate nearest neighbors algorithms. + +Getting the dataset +=================== + +Full dataset +------------ + +A version of the dataset is made available in the binary format that can be used directly by the :doc:`cuvs-bench ` tool. The full 88M dataset is ~251GB and the download link below contains tarballs that have been split into multiple parts. + +The following will download all 10 parts and untar them to a `wiki_all_88M` directory: + +.. code-block:: bash + + curl -s https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.{00..9} | tar -xf - -C wiki_all_88M/ + +The above has the unfortunate drawback that if the command should fail for any reason, all the parts need to be re-downloaded. The files can also be downloaded individually and then untarred to the directory. Each file is ~27GB and there are 10 of them. + +.. code-block:: bash + + curl -s -O https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.00 + ... + curl -s -O https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.09 + + cat wiki_all.tar.* | tar -xf - -C wiki_all_88M/ + +1M and 10M subsets +------------------ + +Also available are 1M and 10M subsets of the full dataset which are 2.9GB and 29GB, respectively. These subsets also include query sets of 10k vectors and corresponding groundtruth files. + +.. code-block:: bash + + curl -s -O https://data.rapids.ai/raft/datasets/wiki_all_1M/wiki_all_1M.tar + curl -s -O https://data.rapids.ai/raft/datasets/wiki_all_10M/wiki_all_10M.tar + +Using the dataset +================= + +After the dataset is downloaded and extracted to the `wiki_all_88M` directory (or `wiki_all_1M`/`wiki_all_10M` depending on whether the subsets are used), the files can be used in the benchmarking tool.
The dataset name is `wiki_all` (or `wiki_all_1M`/`wiki_all_10M`), and the benchmarking tool can be used by specifying the appropriate name `--dataset wiki_all_88M` in the scripts. + +License info +============ + +The English wiki texts available on Kaggle come with the `CC BY-NC-SA 4.0 `_ license and the Cohere Wikipedia dataset comes with the `Apache 2.0 `_ license. \ No newline at end of file diff --git a/docs/source/getting_started.rst b/docs/source/getting_started.rst index 79b35c2d5..fa38c45ee 100644 --- a/docs/source/getting_started.rst +++ b/docs/source/getting_started.rst @@ -1,12 +1,120 @@ +~~~~~~~~~~~~~~~ Getting Started +~~~~~~~~~~~~~~~ + +- `New to vector search?`_ + + * :doc:`Primer on vector search ` + + * :doc:`Vector search indexes vs vector databases ` + + * :doc:`Index tuning guide ` + + * :doc:`Comparing vector search index performance ` + +- `Supported indexes`_ + + * :doc:`Vector search index guide ` + +- `Using cuVS APIs`_ + + * :doc:`C API Docs ` + + * :doc:`C++ API Docs ` + + * :doc:`Python API Docs ` + + * :doc:`Rust API Docs ` + + * :doc:`API basics ` + + * :doc:`API interoperability ` + +- `Where to next?`_ + + * `Social media`_ + + * `Blogs`_ + + * `Research`_ + + * `Get involved`_ + +New to vector search? +===================== + +If you are unfamiliar with the basics of vector search or how vector search differs from vector databases, then :doc:`this primer on vector search ` should provide some good insight. Another good resource for the uninitiated is our :doc:`vector databases vs vector search ` guide. As outlined in the primer, vector search as used in vector databases is often closer to machine learning than to traditional databases. This means that while traditional databases can often be slow without any performance tuning, they will usually still yield the correct results. Unfortunately, vector search indexes, like other machine learning models, can yield garbage results if not tuned correctly. + +Fortunately, this opens up the whole world of hyperparameter optimization to improve vector search performance and quality. Please see our :doc:`index tuning guide ` for more information. + +When comparing the performance of vector search indexes, it is important to consider three main dimensions: + +#. Build time +#. Search quality +#. Search performance + +Please see the :doc:`primer on comparing vector search index performance ` for more information on methodologies and how to make a fair apples-to-apples comparison during your evaluations. + +Supported indexes +================= + +cuVS supports many of the standard index types, with the list continuing to grow and stay current with the state-of-the-art. Please refer to our :doc:`vector search index guide ` to learn more about each individual index type, when it can be useful on the GPU, and the tuning knobs it offers to trade off performance and quality. + +The primary goal of cuVS is to enable speed, scale, and flexibility (in that order), and one of its important value propositions is to enhance existing software deployments with extensible GPU capabilities that improve pain points without disrupting the parts of the system that already work well on the CPU. + + +Using cuVS APIs =============== -This guide provides an initial starting point of the basic concepts and using the various APIs in the cuVS software development kit. +cuVS is a C++ library at its core, which is wrapped with a C library and exposed further through various different languages. 
cuVS currently provides APIs and documentation for :doc:`C `, :doc:`C++ `, :doc:`Python `, and :doc:`Rust ` with more languages in the works. our :doc:`API basics ` provides some background and context about the important paradigms and vocabulary types you'll encounter when working with cuVS types. + +Please refer to the :doc:`guide on API interoperability ` for more information on how cuVS can work seamlessly with other libraries like numpy, cupy, tensorflow, and pytorch, even without having to copy device memory. + + +Where to next? +============== + +cuVS is free and open source software, licesed under Apache 2.0 Once you are familiar with and/or have used cuVS, you can access the developer community most easily through `Github `_. Please open Github issues for any bugs, questions or feature requests. + +Social media +------------ + +You can access the RAPIDS community through `Slack `_ , `Stack Overflow `_ and `X `_ + +Blogs +----- + +We frequently publish blogs on GPU-enabled vector search, which can provide great deep dives into various important topics and breakthroughs: + +#. `Accelerated Vector Search: Approximating with cuVS IVF-Flat `_ +#. `Accelerating Vector Search with cuVS IVF-PQ `_ + +Research +-------- + +For the interested reader, many of the accelerated implementations in cuVS are also based on research papers which can provide a lot more background. We also ask you to please cite the corresponding algorithms by referencing them in your own research. + +#. `CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search `_ +#. `Top-K Algorithms on GPU: A Comprehensive Study and New Methods `_ +#. `Fast K-NN Graph Construction by GPU Based NN-Descent `_ +#. `cuSLINK: Single-linkage Agglomerative Clustering on the GPU `_ +#. `GPU Semiring Primitives for Sparse Neighborhood Methods `_ + + +Get involved +------------ + +We always welcome patches for new features and bug fixes. Please read our `contributing guide `_ for more information on contributing patches to cuVS. + + .. toctree:: - :maxdepth: 1 - :caption: Contents: - - basics.rst - interoperability.rst - working_with_ann_indexes.rst \ No newline at end of file + :hidden: + + choosing_and_configuring_indexes.rst + vector_databases_vs_vector_search.rst + tuning_guide.rst + comparing_indexes.rst + indexes/indexes.rst + api_basics.rst + api_interoperability.rst \ No newline at end of file diff --git a/docs/source/images/build_benchmarks.png b/docs/source/images/build_benchmarks.png new file mode 100644 index 000000000..e9596b489 Binary files /dev/null and b/docs/source/images/build_benchmarks.png differ diff --git a/docs/source/images/index_recalls.png b/docs/source/images/index_recalls.png new file mode 100644 index 000000000..5e1bbda55 Binary files /dev/null and b/docs/source/images/index_recalls.png differ diff --git a/docs/source/images/recall_buckets.png b/docs/source/images/recall_buckets.png new file mode 100644 index 000000000..3589e4e6e Binary files /dev/null and b/docs/source/images/recall_buckets.png differ diff --git a/docs/source/index.rst b/docs/source/index.rst index 88f361243..647061ae5 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,36 +1,41 @@ cuVS: Vector Search and Clustering on the GPU ============================================= + +Welcome to cuVS, the premier library for GPU-accelerated vector search and clustering! 
cuVS provides several core building blocks for constructing new algorithms, as well as end-to-end vector search and clustering algorithms for use either standalone or through a growing list of :doc:`integrations `. + +There are several benefits to using cuVS and GPUs for vector search, including + +#. Fast index build +#. Latency critical and high throughput search +#. Parameter tuning +#. Cost savings +#. Interoperability (build on GPU, deploy on CPU) +#. Multiple language support +#. Building blocks for composing new or accelerating existing algorithms + Useful Resources ################ .. _cuvs_reference: https://docs.rapids.ai/api/cuvs/stable/ -- `Example Notebooks `_: Self-contained Code examples +- `Example Notebooks `_: Example notebooks +- `Code Examples `_: Self-contained code examples - `RAPIDS Community `_: Get help, contribute, and collaborate. - `GitHub repository `_: Download the cuVS source code. - `Issue tracker `_: Report issues or request features. -What is cuVS? -############# - -cuVS is a library for vector search and clustering on the GPU. +Contents +######## .. toctree:: - :maxdepth: 1 - :caption: Contents: + :maxdepth: 4 + build.rst getting_started.rst - build.md integrations.rst + cuvs_bench/index.rst api_docs.rst contributing.md - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` diff --git a/docs/source/indexes/bruteforce.rst b/docs/source/indexes/bruteforce.rst new file mode 100644 index 000000000..0bd17dbf1 --- /dev/null +++ b/docs/source/indexes/bruteforce.rst @@ -0,0 +1,62 @@ +Brute-force +=========== + +Brute-force, or flat index, is the most simple index type, as it ultimately boils down to an exhaustive matrix multiplication. + +While it scales with :math:`O(N^2*D)`, brute-force can be a great choice when + +1. exact nearest neighbors are required, and +2. when the number of vectors is relatively small (a few thousand to a few million) + +Brute-force can also be a good choice for heavily filtered queries where other algorithms might struggle returning the expected results. For example, +when filtering out 90%-95% of the vectors from a search, the IVF methods could struggle to return anything at all with smaller number of probes and +graph-based algorithms with limited hash table memory could end up skipping over important unfiltered entries. + +[ :doc:`C API <../c_api/neighbors_bruteforce_c>` | :doc:`C++ API <../cpp_api/neighbors_bruteforce>` | :doc:`Python API <../python_api/neighbors_bruteforce>` | :doc:`Rust API <../rust_api/index>` ] + +Filtering considerations +------------------------ + +Because it is exhaustive, brute-force can quickly become the slowest, albeit most accurate form of search. However, even +when the number of vectors in an index are very large, brute-force can still be used to search vectors efficiently with a filter. + +This is especially true for cases where the filter is excluding 90%-99% of the vectors in the index where the partitioning +inherent in other approximate algorithms would simply not include expected vectors in the results. In the case of pre-filtered +brute-force, the computation is inverted so distances are only computed between vectors that pass the filter, significantly reducing +the amount of computation required. 
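As a quick illustration of how exact search fits into a workflow, below is a minimal Python sketch using the `cuvs.neighbors.brute_force` module. Treat it as a sketch only: exact function signatures can vary between releases, and the filtered variants described above are exposed through the same search entry points, so consult the Python API documentation linked above for the authoritative interface.

.. code-block:: python

    import cupy as cp
    from cuvs.neighbors import brute_force

    n_samples, n_queries, dim, k = 100_000, 1_000, 128, 10
    dataset = cp.random.random((n_samples, dim), dtype=cp.float32)
    queries = cp.random.random((n_queries, dim), dtype=cp.float32)

    # Build the flat index; for brute-force this mostly amounts to precomputing norms.
    index = brute_force.build(dataset, metric="sqeuclidean")

    # Exhaustive search: exact k nearest neighbors for every query vector.
    distances, neighbors = brute_force.search(index, queries, k)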
+ +Configuration parameters +------------------------ + +Build parameters +~~~~~~~~~~~~~~~~ + +None + +Search Parameters +~~~~~~~~~~~~~~~~~ + +None + + +Tuning Considerations +--------------------- + +Brute-force is exact, but that doesn't always mean it's deterministic. For example, when there are many nearest neighbors with +the same distances it's possible they might be ordered differently across different runs. This especially becomes apparent in +cases where there are points with the same distance right near the cutoff of `k`, which can cause the final list of neighbors +to differ from ground truth. This is not often a problem in practice and can usually be mitigated by increasing `k`. + + +Memory footprint +---------------- + +:math:`precision` is the number of bytes in each element of each vector (e.g. 32-bit = 4 bytes) + + +Index footprint +~~~~~~~~~~~~~~~ + +Raw vectors: :math:`n_vectors * n_dimensions * precision` + +Vector norms (for distances which require them): :math:`n_vectors * precision` diff --git a/docs/source/indexes/cagra.rst b/docs/source/indexes/cagra.rst new file mode 100644 index 000000000..de8821e74 --- /dev/null +++ b/docs/source/indexes/cagra.rst @@ -0,0 +1,149 @@ +CAGRA +===== + +CAGRA, or (C)UDA (A)NN (GRA)ph-based, is a graph-based index that is based loosely on the popular navigable small-world graph (NSG) algorithm, but which has been +built from the ground up specifically for the GPU. CAGRA constructs a flat graph representation by first building a kNN graph +of the training points and then removing redundant paths between neighbors. + +The CAGRA algorithm has two basic steps: +* 1. Construct a kNN graph +* 2. Prune redundant routes from the kNN graph. + +Brute-force could be used to construct the initial kNN graph. This would yield the most accurate graph but would be very slow, and +we find that in practice the kNN graph does not need to be very accurate since the pruning step helps to boost the overall recall of +the index. cuVS provides IVF-PQ and NN-Descent strategies for building the initial kNN graph, and these can be selected in the index params object during index construction. + +[ :doc:`C API <../c_api/neighbors_cagra_c>` | :doc:`C++ API <../cpp_api/neighbors_cagra>` | :doc:`Python API <../python_api/neighbors_cagra>` | :doc:`Rust API <../rust_api/index>` ] + +Interoperability with HNSW +-------------------------- + +cuVS provides the capability to convert a CAGRA graph to an HNSW graph, which enables the GPU to be used only for building the index +while the CPU can be leveraged for search. + +Filtering considerations +------------------------ + +CAGRA supports filtered search, which can work well for moderately small filters (such as filtering out only a small percentage of the vectors in the index, e.g. <<50%). + +When a filter is expected to remove 80%-99% of the vectors in the index, it is preferred to use brute-force with pre-filtering instead, as that will compute only those distances +between the vectors not being filtered out. By default, CAGRA will pass the filter to the pre-filtered brute-force when the number of vectors being filtered out is >90% of the vectors in the index. + +Configuration parameters +------------------------ + +Build parameters +~~~~~~~~~~~~~~~~ + +.. list-table:: + :widths: 25 25 50 + :header-rows: 1 + + * - Name + - Default + - Description + * - compression + - None + - For large datasets, the raw vectors can be compressed using product quantization so they can be placed on device.
This comes at the cost of lowering recall, though a refinement reranking step can be used to make up the lost recall after search. + * - graph_build_algo + - 'IVF_PQ' + - The graph build algorithm to use for building + * - graph_build_params + - None + - Specify explicit build parameters for the corresponding graph build algorithms + * - graph_degree + - 32 + - The degree of the final CAGRA graph. All vertices in the graph will have this degree. During search, a larger graph degree allows for more exploration of the search space and improves recall but at the expense of searching more vertices. + * - intermediate_graph_degree + - 64 + - The degree of the initial knn graph before it is optimized into the final CAGRA graph. A larger value increases connectivity of the initial graph so that it performs better once pruned. Larger values come at the cost of increased device memory usage and increases the time of initial knn graph construction. + * - guarantee_connectivity + - False + - Uses a degree-constrained minimum spanning tree to guarantee the initial knn graph is connected. This can improve recall on some datasets. + * - attach_data_on_build + - True + - Should the dataset be attached to the index after the index is built? Setting this to `False` can improve memory usage and performance, for example if the graph is being serialized to disk or converted to HNSW right after building it. + +Search parameters +~~~~~~~~~~~~~~~~~ + +.. list-table:: + :widths: 25 25 50 + :header-rows: 1 + + * - Name + - Default + - Description + * - itopk_size + - 64 + - Number of intermediate search results retained during search. This value needs to be >=k. This is the main knob to tweak search performance. + * - max_iterations + - 0 + - The maximum number of iterations during search. Default is to auto-select. + * - max_queries + - 0 + - Max number of search queries to perform concurrently (batch size). Default is to auto-select. + * - team_size + - 0 + - Number of CUDA threads for calculating each distance. Can be 4, 8, 16, or 32. Default is to auto-select. + * - search_width + - 1 + - Number of vertices to select as the starting point for the search in each iteration. + * - min_iterations + - 0 + - Minimum number of search iterations to perform + +Tuning Considerations +--------------------- + +The 3 hyper-parameters that are most often tuned are `graph_degree`, `intermediate_graph_degree`, and `itopk_size`. + +Memory footprint +---------------- + +CAGRA builds a graph that ultimately ends up on the host while it needs to keep the original dataset around (can be on host or device). + +IVFPQ or NN-DESCENT can be used to build the graph (additions to the peak memory usage calculated as in the respective build algo above). + +Dataset on device (graph on host): +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Index memory footprint (device): :math:`n_index_vectors * n_dims * sizeof(T)` + +Index memory footprint (host): :math:`graph_degree * n_index_vectors * sizeof(T)`` + +Dataset on host (graph on host): +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Index memory footprint (host): :math:`n_index_vectors * n_dims * sizeof(T) + graph_degree * n_index_vectors * sizeof(T)` + +Build peak memory usage: +~~~~~~~~~~~~~~~~~~~~~~~~ + +When built using NN-descent / IVF-PQ, the build process consists of two phases: (1) building an initial/(intermediate) graph and then (2) optimizing the graph. Key input parameters are n_vectors, intermediate_graph_degree, graph_degree. 
+The memory usage in the first phase (building) depends on the chosen method. The biggest allocation is the graph (n_vectors*intermediate_graph_degree), but it's stored in the host memory. +Usually, the second phase (optimize) uses the most device memory. The peak memory usage is achieved during the pruning step (graph_core.cuh/optimize). +Optimize: formula for peak memory usage (device): :math:`n_vectors * (4 + (sizeof(IdxT) + 1) * intermediate_degree)` + +Build with out-of-core IVF-PQ peak memory usage: +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Out-of-core CAGRA build consists of IVF-PQ build, IVF-PQ search, CAGRA optimization. Note that these steps are performed sequentially, so they are not additive. + +IVF-PQ Build: + +.. math:: + + n_vectors / train_set_ratio * dim * sizeof(float) // trainset, may be in managed mem + + n_vectors / train_set_ratio * sizeof(uint32_t) // labels, may be in managed mem + + n_clusters * n_dim * sizeof(float) // cluster centers + +IVF-PQ Search (max batch size 1024 vectors on device at a time): + +.. math:: + + [n_vectors * (pq_dim * pq_bits / 8 + sizeof(int64_t)) + O(n_clusters)] + + [batch_size * n_dim * sizeof(float)] + [batch_size * intermediate_degree * sizeof(uint32_t)] + + [batch_size * intermediate_degree * sizeof(float)] + + diff --git a/docs/source/indexes/indexes.rst b/docs/source/indexes/indexes.rst new file mode 100644 index 000000000..a2fb1434a --- /dev/null +++ b/docs/source/indexes/indexes.rst @@ -0,0 +1,19 @@ +Nearest Neighbor Indexes +======================== + +.. toctree:: + :maxdepth: 3 + :caption: Contents: + + bruteforce.rst + cagra.rst + ivfflat.rst + ivfpq.rst + + +Indices and tables +================== + +* :ref:`genindex` +* :ref:`modindex` +* :ref:`search` \ No newline at end of file diff --git a/docs/source/indexes/ivfflat.rst b/docs/source/indexes/ivfflat.rst new file mode 100644 index 000000000..14dd1798c --- /dev/null +++ b/docs/source/indexes/ivfflat.rst @@ -0,0 +1,116 @@ +IVF-Flat +======== + +IVF-Flat is an inverted file index (IVF) algorithm, which in the context of nearest neighbors means that data points are +partitioned into clusters. At search time, brute-force is performed only in a (user-defined) subset of the closest clusters. +In practice, this algorithm can search the index much faster than brute-force and often still maintain an acceptable +recall, though this comes with the drawback that the index itself copies the original training vectors into a memory layout +that is optimized for fast memory reads and adds some additional memory storage overheads. Once the index is trained, +this algorithm no longer requires the original raw training vectors. + +IVF-Flat tends to be a great choice when + +1. like brute-force, there is enough device memory available to fit all of the vectors +in the index, and +2. exact recall is not needed. As with the other index types, the tuning parameters are used to trade off recall for search latency / throughput. + +[ :doc:`C API <../c_api/neighbors_ivf_flat_c>` | :doc:`C++ API <../cpp_api/neighbors_ivf_flat>` | :doc:`Python API <../python_api/neighbors_ivf_flat>` | :doc:`Rust API <../rust_api/index>` ] + +Filtering considerations +------------------------ + +IVF methods only apply filters to the lists which are probed for each query point. As a result, the results of a filtered query will likely differ significantly from the results of the same filter applied to an exact method like brute-force. For example,
imagine you have 3 IVF lists each containing 2 vectors and you perform a query against only the closest 2 lists but you filter out all but 1 element. If that remaining element happens to be in one of the lists which was not proved, it will not be considered at all in the search results. It's important to consider this when using any of the IVF methods in your applications. + + +Configuration parameters +------------------------ + +Build parameters +~~~~~~~~~~~~~~~~ + +.. list-table:: + :widths: 25 25 50 + :header-rows: 1 + + * - Name + - Default + - Description + * - n_lists + - sqrt(n) + - Number of coarse clusters used to partition the index. A good heuristic for this value is sqrt(n_vectors_in_index) + * - add_data_on_build + - True + - Should the training points be added to the index after the index is built? + * - kmeans_train_iters + - 20 + - Max number of iterations for k-means training before convergence is assumed. Note that convergence could happen before this number of iterations. + * - kmeans_trainset_fraction + - 0.5 + - Fraction of points that should be subsampled from the original dataset to train the k-means clusters. Default is 1/2 the training dataset. This can often be reduced for very large datasets to improve both cluster quality and the build time. + * - adaptive_centers + - false + - Should the existing trained centroids adapt to new points that are added to the index? This provides a trade-off between improving recall at the expense of having to compute new centroids for clusters when new points are added. When points are added in large batches, the performance cost may not be noticeable. + * - conservative_memory_allocation + - false + - To support dynamic indexes, where points are expected to be added later, the individual IVF lists can be imtentionally overallocated up front to reduce the amount and impact of increasing list sizes, which requires allocating more memory and copying the old list to the new, larger, list. + + +Search parameters +~~~~~~~~~~~~~~~~~ + +.. list-table:: + :widths: 25 25 50 + :header-rows: 1 + + * - Name + - Default + - Description + * - n_probes + - 20 + - Number of closest IVF lists to scan for each query point. + +Tuning Considerations +--------------------- + +Since IVF methods use clustering to establish spatial locality and partition data points into individual lists, there's an inherent +assumption that the number of lists, and thus the max size of the data in the index is known up front. For some use-cases, this +might not matter. For example, most vector databases build many smaller physical approximate nearest neighbors indexes, each from +fixed-size or maximum-sized immutable segments and so the number of lists can be tuned based on the number of vectors in the indexes. + +Empirically, we've found :math:`\sqrt{n\_index\_vectors}` to be a good starting point for the :math:`n\_lists` hyper-parameter. Remember, having more +lists means less points to search within each list, but it could also mean more :math:`n\_probes` are needed at search time to reach an acceptable +recall. + + +Memory footprint +---------------- + +Each cluster is padded to at least 32 vectors (but potentially up to 1024). Assuming uniform random distribution of vectors/list, we would have +:math:`cluster\_overhead = (conservative\_memory\_allocation ? 16 : 512 ) * dim * sizeof_{float})` + +Note that each cluster is allocated as a separate allocation. 
If we use a `cuda_memory_resource`, that would grab memory in 1 MiB chunks, so on average we might have 0.5 MiB overhead per cluster. If we us 10s of thousands of clusters, it becomes essential to use pool allocator to avoid this overhead. + +:math:`cluster\_overhead = 0.5 MiB` // if we do not use pool allocator + + +Index (device memory): +~~~~~~~~~~~~~~~~~~~~~~ + +.. math:: + + n\_vectors * n\_dimensions * sizeof(T) + + + n\_vectors * sizeof(int_type) + + + n\_clusters * n\_dimensions * sizeof(T) + + + n\_clusters * cluster_overhead` + + +Peak device memory usage for index build: +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:math:`workspace = min(1GB, n\_queries * [(n\_lists + 1 + n\_probes * (k + 1)) * sizeof_{float}) + n\_probes * k * sizeof_{idx}])` + +:math:`index\_size + workspace` + diff --git a/docs/source/indexes/ivfpq.rst b/docs/source/indexes/ivfpq.rst new file mode 100644 index 000000000..e4bd81395 --- /dev/null +++ b/docs/source/indexes/ivfpq.rst @@ -0,0 +1,140 @@ +IVF-PQ +====== + +IVF-PQ is an inverted file index (IVF) algorithm, which is an extension to the IVF-Flat algorithm (e.g. data points are first +partitioned into clusters) where product quantization is performed within each cluster in order to shrink the memory footprint +of the index. Product quantization is a lossy compression method and it is capable of storing larger number of vectors +on the GPU by offloading the original vectors to main memory, however higher compression levels often lead to reduced recall. +Often a strategy called refinement reranking is employed to make up for the lost recall by querying the IVF-PQ index for a larger +`k` than desired and performing a reordering and reduction to `k` based on the distances from the unquantized vectors. Unfortunately, +this does mean that the unquantized raw vectors need to be available and often this can be done efficiently using multiple CPU threads. + +[ :doc:`C API <../c_api/neighbors_ivf_pq_c>` | :doc:`C++ API <../cpp_api/neighbors_ivf_pq>` | :doc:`Python API <../python_api/neighbors_ivf_pq>` | :doc:`Rust API <../rust_api/index>` ] + + +Configuration parameters +------------------------ + +Build parameters +~~~~~~~~~~~~~~~~ + +.. list-table:: + :widths: 25 25 50 + :header-rows: 1 + + * - Name + - Default + - Description + * - n_lists + - sqrt(n) + - Number of coarse clusters used to partition the index. A good heuristic for this value is sqrt(n_vectors_in_index) + * - kmeans_n_iters + - 20 + - The number of iterations when searching for k-means centers + * - kmeans_trainset_fraction + - 0.5 + - The fraction of training data to use for iterative k-means building + * - pq_bits + - 8 + - The bit length of each vector element after compressing with PQ. Possible values are any integer between 4 and 8. + * - pq_dim + - 0 + - The dimensionality of each vector after compressing with PQ. When 0, the dim is set heuristically. + * - codebook_kind + - per_subspace + - How codebooks are created. `per_subspace` trains kmeans on some number of sub-dimensions while `per_cluster` + * - force_random_rotation + - false + - Apply a random rotation matrix on the input data and queries even if `dim % pq_dim == 0` + * - conservative_memory_allocation + - false + - To support dynamic indexes, where points are expected to be added later, the individual IVF lists can be imtentionally overallocated up front to reduce the amount and impact of increasing list sizes, which requires allocating more memory and copying the old list to the new, larger, list. 
+ * - add_data_on_build + - True + - Should the training points be added to the index after the index is built? + * - max_train_points_per_pq_code + - 256 + - The max number of data points to use per PQ code during PQ codebook training. + + +Search parameters +~~~~~~~~~~~~~~~~ + +.. list-table:: + :widths: 25 25 50 + :header-rows: 1 + + * - Name + - Default + - Description + * - n_probes + - 20 + - Number of closest IVF lists to scan for each query point. + * - lut_dtype + - cuda_r_32f + - Datatype to store the pq lookup tables. Can also use cuda_r_16f for half-precision and cuda_r_8u for 8-bit precision. Smaller lookup tables can fit into shared memory and significantly improve search times. + * - internal_distance_dtype + - cuda_r_32f + - Storage data type for distance/similarity computed at search time. Can also use cuda_r_16f for half-precision. + * - preferred_smem_carveout + - 1.0 + - Preferred fraction of SM's unified memory / L1 cache to be used as shared memory. Default is 100% + +Tuning Considerations +--------------------- + +IVF-PQ has similar tuning considerations to IVF-flat, though the PQ compression ratio adds an additional variable to trade-off index size for search quality. + +It's important to note that IVF-PQ becomes very lossy very quickly, and so refinement reranking is often needed to get a reasonable recall. This step usually consists of searching initially for more k-neighbors than needed and then reducing the resulting neighborhoods down to k by computing exact distances. This step can be performed efficiently on CPU or GPU and generally has only a marginal impact on search latency. + +Memory footprint +---------------- + +Index (device memory): +~~~~~~~~~~~~~~~~~~~~~~ + +Simple approximate formula: :math:`n\_vectors * (pq\_dim * \frac{pq\_bits}{8} + sizeof_{idx}) + n\_clusters` + +The IVF lists end up being represented by a sparse data structure that stores the pointers to each list, an indices array that contains the indexes of each vector in each list, and an array with the encoded (and interleaved) data for each list. + +IVF list pointers: :math:`n\_clusters * sizeof_{uint32_t}` + +Indices: :math:`n\_vectors * sizeof_{idx}`` + +Encoded data (interleaved): :math:`n\_vectors * pq\_dim * \frac{pq\_bits}{8}` + +Per subspace method: :math:`4 * pq\_dim * pq\_len * 2^pq\_bits` + +Per cluster method: :math:`4 * n\_clusters * pq\_len * 2^pq\_bits` + +Extras: :math:`n\_clusters * (20 + 8 * dim)` + +Index (host memory): +~~~~~~~~~~~~~~~~~~~~ + +When refinement is used with the dataset on host, the original raw vectors are needed: :math:`n\_vectors * dims * sizeof_{Tloat}` + +Search peak memory usage (device); +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Total usage: :math:`index + queries + output\_indices + output\_distances + workspace` + +Workspace size is not trivial, a heuristic controls the batch size to make sure the workspace fits the `raft::resource::get_workspace_free_bytes(res)``. + +Build peak memory usage (device): +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. math:: + + \frac{n\_vectors}{trainset\_ratio * dims * sizeof_{float}} + + + \frac{n\_vectors}{trainset\_ratio * sizeof_{uint32_t}} + + + n\_clusters * dim * sizeof_{float} + +Note, if thereā€™s not enough space left in the workspace memory resource, IVF-PQ build automatically switches to the managed memory for the training set and labels. 
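As a rough worked example of the simple approximate index-size formula above (illustrative numbers only, assuming 8-byte indices): an index of 10 million vectors with `pq_dim = 64` and `pq_bits = 8` occupies approximately

.. math::

    10^7 \times (64 \times \frac{8}{8} + 8) \text{ bytes} \approx 720 \text{ MB}

for the encoded data and indices, with the codebooks, list pointers and per-cluster extras adding a comparatively small amount on top.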
+ + + + + diff --git a/docs/source/integrations.rst b/docs/source/integrations.rst index 19d72fd90..760892a98 100644 --- a/docs/source/integrations.rst +++ b/docs/source/integrations.rst @@ -1,50 +1,13 @@ +============ Integrations ============ -Aside from using cuVS directly, it can be consumed through a number of sdk and vector database integrations. - -- `FAISS`_ -- `Milvus`_ -- `Lucene`_ -- `Kinetica`_ - - -FAISS ------ - -FAISS v1.8 provides a special conda package that enables a RAFT backend for the Flat, IVF-Flat and IVF-PQ indexes on the GPU. Like the classical FAISS GPU indexes, the RAFT backend also enables interoperability between FAISS CPU indexes, allowing an index to be trained on GPU, searched on CPU, and vice versa. - -The RAFT backend can be enabled by building FAISS from source with the `FAISS_USE_RAFT` cmake flag enabled and setting the `use_raft` configuration option for the RAFT-enabled GPU indexes. - -A pre-compiled conda package can also be installed using the following command: - -.. code-block:: bash - - conda install -c conda-forge -c pytorch -c rapidsai -c nvidia -c "nvidia/label/cuda-11.8.0" faiss-gpu-raft - -The next release of FAISS will feature cuVS support as we continue to migrate the vector search algorithms from RAFT to cuVS. - -Milvus ------- - -In version 2.3, Milvus released support for IVF-Flat and IVF-PQ indexes on the GPU through RAFT. Version 2.4 adds support for brute-force and the graph-based CAGRA index on the GPU. Please refer to the `Milvus documentation `_ to install Milvus with GPU support. - -The GPU indexes can be enabled by using the index types prefixed with `GPU_`, as outlined in the `Milvus index build guide `_. - -Milvus will be migrating their GPU support from RAFT to cuVS as we continue to move the vector search algorithms out of RAFT and into cuVS. - - -Lucene ------- - -An experimental Lucene connector for cuVS enables GPU-accelerated vector search indexes through Lucene. Initial benchmarks are showing that this connector can drastically improve the performance of both indexing and search in Lucene. This connector will continue to be improved over time and any interested developers are encouraged to contribute. - -Install and evaluate the `lucene-cuvs` connector on `Github `_. - - -Kinetica --------- +Aside from using cuVS standalone, it can be consumed through a number of sdk and vector database integrations. -Starting with release 7.2, Kinetica supports the graph-based the CAGRA algorithm from RAFT. Kinetica will continue to improve its support over coming versions, while also migrating to cuVS as we work to move the vector search algorithms out of RAFT and into cuVS. +.. toctree:: + :maxdepth: 4 -Kinetica currently offers the ability to create a CAGRA index in a SQL `CREATE_TABLE` statement, as outlined in their `vector search indexing docs `_. Kinetica is not open source, but the RAFT indexes can be enabled in the developer edition, which can be installed `here `_. + integrations/faiss.rst + integrations/milvus.rst + integrations/lucene.rst + integrations/kinetica.rst diff --git a/docs/source/integrations/faiss.rst b/docs/source/integrations/faiss.rst new file mode 100644 index 000000000..bf8be8225 --- /dev/null +++ b/docs/source/integrations/faiss.rst @@ -0,0 +1,14 @@ +FAISS +----- + +FAISS v1.8 provides a special conda package that enables a RAFT backend for the Flat, IVF-Flat and IVF-PQ indexes on the GPU. 
Like the classical FAISS GPU indexes, the RAFT backend also enables interoperability with FAISS CPU indexes, allowing an index to be trained on the GPU, searched on the CPU, and vice versa. + +The RAFT backend can be enabled by building FAISS from source with the `FAISS_USE_RAFT` CMake flag enabled and setting the `use_raft` configuration option for the RAFT-enabled GPU indexes. + +A pre-compiled conda package can also be installed using the following command: + +.. code-block:: bash + + conda install -c conda-forge -c pytorch -c rapidsai -c nvidia -c "nvidia/label/cuda-11.8.0" faiss-gpu-raft + +The next release of FAISS will feature cuVS support as we continue to migrate the vector search algorithms from RAFT to cuVS.
diff --git a/docs/source/integrations/kinetica.rst b/docs/source/integrations/kinetica.rst new file mode 100644 index 000000000..e74cfe82f --- /dev/null +++ b/docs/source/integrations/kinetica.rst @@ -0,0 +1,6 @@ +Kinetica +-------- + +Starting with release 7.2, Kinetica supports the graph-based CAGRA algorithm from RAFT. Kinetica will continue to improve its support over the coming versions, while also migrating to cuVS as we work to move the vector search algorithms out of RAFT and into cuVS. + +Kinetica currently offers the ability to create a CAGRA index in a SQL `CREATE_TABLE` statement, as outlined in their `vector search indexing docs `_. Kinetica is not open source, but the RAFT indexes can be enabled in the developer edition, which can be installed `here `_.
diff --git a/docs/source/integrations/lucene.rst b/docs/source/integrations/lucene.rst new file mode 100644 index 000000000..d20052545 --- /dev/null +++ b/docs/source/integrations/lucene.rst @@ -0,0 +1,6 @@ +Lucene +------ + +An experimental Lucene connector for cuVS enables GPU-accelerated vector search indexes through Lucene. Initial benchmarks show that this connector can drastically improve the performance of both indexing and search in Lucene. This connector will continue to be improved over time, and any interested developers are encouraged to contribute. + +Install and evaluate the `lucene-cuvs` connector on `GitHub `_.
diff --git a/docs/source/integrations/milvus.rst b/docs/source/integrations/milvus.rst new file mode 100644 index 000000000..4139cca52 --- /dev/null +++ b/docs/source/integrations/milvus.rst @@ -0,0 +1,8 @@ +Milvus +------ + +In version 2.3, Milvus released support for IVF-Flat and IVF-PQ indexes on the GPU through RAFT. Version 2.4 adds support for brute-force and the graph-based CAGRA index on the GPU. Please refer to the `Milvus documentation `_ to install Milvus with GPU support. + +The GPU indexes can be enabled by using the index types prefixed with `GPU_`, as outlined in the `Milvus index build guide `_. + +Milvus will be migrating its GPU support from RAFT to cuVS as we continue to move the vector search algorithms out of RAFT and into cuVS.
diff --git a/docs/source/python_api/neighbors.rst b/docs/source/python_api/neighbors.rst index 022c50de3..cd4f2609c 100644 --- a/docs/source/python_api/neighbors.rst +++ b/docs/source/python_api/neighbors.rst @@ -11,5 +11,6 @@ Nearest Neighbors neighbors_brute_force.rst neighbors_cagra.rst + neighbors_hnsw.rst neighbors_ivf_flat.rst neighbors_ivf_pq.rst
diff --git a/docs/source/python_api/neighbors_hnsw.rst b/docs/source/python_api/neighbors_hnsw.rst new file mode 100644 index 000000000..9922805b3 --- /dev/null +++ b/docs/source/python_api/neighbors_hnsw.rst @@ -0,0 +1,30 @@ +HNSW +==== + +This is a wrapper for hnswlib that loads a CAGRA index as an immutable HNSW index. The loaded HNSW index is only compatible with cuVS and can be searched using the wrapper functions. + +.. role:: py(code) + :language: python + :class: highlight + +Index search parameters +####################### + +.. autoclass:: cuvs.neighbors.hnsw.SearchParams + :members: + +Index +##### + +.. autoclass:: cuvs.neighbors.hnsw.Index + :members: + +Index Conversion +################ + +.. autofunction:: cuvs.neighbors.hnsw.from_cagra + +Index search +############ + +.. autofunction:: cuvs.neighbors.hnsw.search
diff --git a/docs/source/tuning_guide.rst b/docs/source/tuning_guide.rst new file mode 100644 index 000000000..26cb2a701 --- /dev/null +++ b/docs/source/tuning_guide.rst @@ -0,0 +1,56 @@ +~~~~~~~~~~~~~~~~~~~~~~ +Automated Tuning Guide +~~~~~~~~~~~~~~~~~~~~~~ + +Introduction +============ + +This guide describes a method for tuning and evaluating vector search indexes at scale in locally indexed vector databases. For more information on the differences between locally and globally indexed vector databases, please see :doc:`this guide `. The goal of this guide is to give users a scalable and effective approach for tuning a vector search index, no matter how large. It evaluates a vector search index "model" by measuring recall in proportion to build time, penalizing recall when the build time is very high, so that the optimization ultimately favors lower build times together with higher recall. + +For more information on the various types of vector search indexes, please see our :doc:`guide to choosing vector search indexes `. + +Why automated tuning? +===================== + +As many as 75% of users have told us they will not be able to tune a vector database beyond one or two simple knobs, and we suggest that an ideal "knob" would balance training time and search time against search quality. The more time spent tuning, the higher the quality, and the more time is needed to reach acceptable search performance. Even the 25% of users who do want to tune are still asking for simple tools for doing so. These users also ask for simple guidelines for setting tuning parameters, like :doc:`this guide `. + +Since vector search indexes are more closely related to machine learning models than to traditional database indexes, one option for easing the parameter tuning burden is to use hyper-parameter optimization tools like `Ray Tune `_ and `Optuna `_. + +How to tune? +============ + +But how would this work when we have an index that is massively large, like 1TB? + +One benefit of locally indexed vector databases is that they often scale by breaking the larger set of vectors down into smaller sets through uniformly random subsampling, then training smaller vector search index models on the sub-samples.
Most often, the same set of tuning parameters is applied to all of the smaller sub-index models, rather than trying to set them individually for each one. During search, the query vectors are often sent to all of the sub-indexes and the resulting neighbor lists are reduced down to `k` based on the closest distances (or similarities). + +Because many databases use this sub-sampling trick, it's possible to perform an automated parameter tuning on the larger index just by randomly sampling some number of vectors from it, splitting them into disjoint train/test/eval datasets, computing ground truth with brute-force, and then performing a hyper-parameter optimization on them. This procedure can also be repeated multiple times to simulate a Monte Carlo cross-validation. + +GPUs are naturally great at performing massively parallel tasks, especially when they are largely independent, such as training and evaluating models with different hyper-parameter settings in parallel. Hyper-parameter optimization also lends itself well to distributed processing, such as multi-node multi-GPU operation. + +Steps to achieve automated tuning +================================= + +More formally, an automated parameter tuning workflow with Monte Carlo cross-validation looks something like this: + +#. Ingest a large dataset into the vector database of your choice. + +#. Choose an index size based on the number of vectors. This should usually align with the average number of vectors the database will end up putting in a single ANN sub-index model. + +#. Uniformly sample, at random, the number of vectors specified above from the database for a training set. This is often accomplished by generating some number of random (unique) numbers up to the dataset size. + +#. Uniformly sample some number of vectors for a test set, and do this again for an evaluation set; typically around 1-10% of the size of the training set. + +#. Use the test set to compute ground truth on the vectors from the prior step against all vectors in the training set. + +#. Start the HPO tuning process for the training set, using the test vectors for the query set. It's important to make sure your HPO is multi-objective and optimizes for: a) low build time, b) high throughput or low-latency search (depending on needs), and c) acceptable recall (a minimal sketch of this step follows the conclusion below). + +#. Use the evaluation dataset to test that the optimal hyper-parameters generalize to unseen points that were not used in the optimization process. + +#. Optionally, repeat the above steps multiple times on different uniform sub-samplings. The optimal parameters can then be combined over the multiple Monte Carlo optimization iterations. For example, many hyper-parameters can simply be averaged, but others may require more care. + +#. Create a new index in the database using the ideal parameters from above that meet the target constraints (e.g. build time vs. search performance vs. quality). + +Conclusion +========== + +By the end of this process, you should have a set of parameters that meet your target constraints while demonstrating how well the optimal hyper-parameters generalize across the dataset. The major benefit of this approach is that it breaks a potentially unbounded dataset size down into manageable chunks and accelerates tuning on those chunks. We see this process as a major value add for vector search on the GPU.
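As a companion to the steps above, here is a minimal, illustrative sketch of the multi-objective HPO step using Optuna together with the cuVS IVF-PQ Python API. The dataset, parameter ranges, and recall computation are assumptions made only for the example, and the cuVS parameter names should be checked against the installed version.

.. code-block:: python

    # Illustrative sketch only: multi-objective tuning of an IVF-PQ sub-index on
    # a random sub-sample, optimizing for low build time and high recall.
    import time

    import cupy as cp
    import optuna

    from cuvs.neighbors import ivf_pq
    from cuvs.neighbors.brute_force import knn

    # Hypothetical sub-sample standing in for a single local partition (steps 3-4).
    train = cp.random.random_sample((100_000, 64), dtype=cp.float32)
    queries = cp.random.random_sample((1_000, 64), dtype=cp.float32)
    k = 10

    # Step 5: exact ground truth on the sub-sample with brute force.
    _, gt = knn(train, queries, k)
    gt = cp.asarray(gt)


    def objective(trial):
        n_lists = trial.suggest_int("n_lists", 256, 4096, log=True)
        pq_dim = trial.suggest_categorical("pq_dim", [16, 32, 64])
        n_probes = trial.suggest_int("n_probes", 10, 200, log=True)

        start = time.time()
        index = ivf_pq.build(
            ivf_pq.IndexParams(n_lists=n_lists, pq_dim=pq_dim), train
        )
        build_time = time.time() - start

        _, nbrs = ivf_pq.search(
            ivf_pq.SearchParams(n_probes=n_probes), index, queries, k
        )
        nbrs = cp.asarray(nbrs)
        # Fraction of returned neighbors that appear in the exact ground truth.
        recall = float(cp.mean(cp.any(nbrs[:, :, None] == gt[:, None, :], axis=2)))
        return build_time, recall


    study = optuna.create_study(directions=["minimize", "maximize"])
    study.optimize(objective, n_trials=50)
    print(study.best_trials)  # Pareto front: build time vs. recall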
diff --git a/docs/source/vector_databases_vs_vector_search.rst b/docs/source/vector_databases_vs_vector_search.rst new file mode 100644 index 000000000..446737c11 --- /dev/null +++ b/docs/source/vector_databases_vs_vector_search.rst @@ -0,0 +1,57 @@ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Vector search indexes vs vector databases +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This guide provides information on the differences between vector search indexes and fully-fledged vector databases. For more information on selecting and configuring vector search indexes, please refer to our :doc:`guide on choosing and configuring indexes `. + +One of the primary differences between vector database indexes and traditional database indexes is that vector search often uses approximations to trade off result accuracy for speed. Because of this, while many mature databases offer mechanisms to tune their indexes and achieve better performance, vector database indexes can return garbage results if they aren't tuned for a reasonable level of search quality in addition to performance. This is because vector database indexes are more closely related to machine learning models than they are to traditional database indexes. + +What are the differences between vector databases and vector search indexes? +============================================================================= + +Vector search in and of itself refers to the objective of finding the closest vectors in an index for a given set of query vectors. At the lowest level, vector search indexes are just machine learning models, which have build performance, search performance, and recall that can be traded off, depending on the algorithm and various hyper-parameters. + +Vector search indexes alone are building blocks that enable, but do not by themselves constitute, a fully-fledged vector database. Vector databases provide more production-level features that often use vector search algorithms in concert with other popular database design techniques to add important capabilities like durability, fault tolerance, vertical scalability, partition tolerance, and horizontal scalability. + +In the world of vector databases, there are special purpose-built databases that focus primarily on vector search but might also provide some small subset of the capabilities of more general-purpose databases, like being able to perform a hybrid search across both vectors and metadata. Many general-purpose databases, both relational and NoSQL / document databases for example, are also beginning to add first-class vector types. + +So what does all this mean to you? Sometimes a simple standalone vector search index is enough. Usually these indexes can be trained and serialized to a file for later use, and they often provide a capability to filter out specific vectors during search. Sometimes they even provide a mechanism to scale up to utilize multiple GPUs, for example, but they generally stop there, suggesting either your own distributed system (like Spark or Dask) or a fully-fledged vector database to scale out. + +FAISS and cuVS are examples of standalone vector search libraries, which again are more closely related to machine learning libraries than to fully-fledged databases. Milvus is an example of a special-purpose vector database, and Elastic, MongoDB, and OpenSearch are examples of general-purpose databases that have added vector search capabilities. + +How is vector search used by vector databases?
+============================================== + +Within the context of vector databases, there are two primary ways in which vector search indexes are used, and it's important to understand which one you are working with, because it can affect the behavior of the parameters with respect to the data. + +Many vector search algorithms improve scalability while reducing the number of distance computations by partitioning the vector space into smaller pieces, often through the use of clustering, hashing, trees, and other techniques. Another popular technique is to reduce the width or dimensionality of the space in order to decrease the cost of computing each distance. In contrast, databases often partition the data, but may only do so to improve things like I/O performance, partition tolerance, or scale, without regard to the underlying data distributions that are ultimately going to be used for vector search. + +This leads us to two core architectural designs that we encounter in vector databases: + +Locally partitioned vector search indexes +----------------------------------------- + +Most databases follow this design, and vectors are often first written to a write-ahead log for durability. After some number of vectors are written, the write-ahead logs become immutable and may be merged with other write-ahead logs before eventually being converted to a new vector search index. + +The search is generally done over each locally partitioned index and the results are combined. When setting hyperparameters, only the local vector search indexes need to be considered, though the same hyperparameters are going to be used across all of the local partitions. So, for example, if you've ingested 100M vectors but each partition only contains about 10M vectors, the index size only needs to account for its local 10M vectors. Details like the number of vectors in the index are important, for example, when setting the number of clusters in an IVF-based (inverted file index) method, as covered below. + + +Globally partitioned vector search indexes
------------------------------------------ + +Some special-purpose vector databases follow this design, such as Yahoo's Vespa and Google's Spanner. A global index is trained to partition the entire database's vectors up front, as soon as there are enough vectors to do so (usually these databases are at a large enough scale that a significant number of vectors are bootstrapped initially, which avoids the cold-start problem). Ingested vectors are first run through the global index (clustering, for example, but tree- and graph-based methods have also been used) to determine which partition they belong to, and the vectors are then sent to, and written directly to, that partition. The individual partitions can contain a graph, a tree, or a simple IVF list. These types of indexes have been able to scale to hundreds of billions to trillions of vectors, and since the partitions are themselves often implicitly based on neighborhoods, rather than on uniformly randomly distributed vectors like the locally partitioned architectures, the partitions can be grouped together or intentionally separated to support localized searches or load balancing, depending upon the needs of the system. + +The challenge when setting hyper-parameters for globally partitioned indexes is that they need to account for the entire set of vectors, and thus the hyperparameters of the global index generally account for all of the vectors in the database, rather than any local partition.
+ +Of course, the two approaches outlined above can also be used together (e.g. training a global "coarse" index and then creating localized vector search indexes within each of the global partitions), but to my knowledge, no such architecture has implemented this pattern. + +A challenge with GPUs in vector databases today is that the resulting vector indexes are expected to fit into the memory of the available GPUs for fast search. That is to say, there doesn't exist today an efficient mechanism for offloading or swapping GPU indexes so they can be cached from disk or host memory, for example. We are working on mechanisms to do this, and also to utilize technologies like GPUDirect Storage and GPUDirect RDMA to further improve I/O performance. + +Tuning and hyperparameter optimization
====================================== + +Unfortunately, for large datasets, doing a hyper-parameter optimization on the whole dataset is not always feasible. This is actually where locally partitioned vector search indexes have an advantage, because each smaller segment of the larger index can be thought of as a uniform random sample of the total vectors in the dataset. This means it is possible to perform a hyperparameter optimization on the smaller subsets and find reasonably acceptable parameters that should generalize fairly well to the entire dataset. Generally, this hyperparameter optimization will require computing a ground truth on the subset with an exact method like brute-force, and then using it to evaluate several searches on randomly sampled vectors. + +Full hyper-parameter optimization may also not always be necessary. For example, once you have built a ground truth dataset on a subset, you can often start by building an index with the default build parameters and then experimenting with different search parameters until you get the desired quality and search performance. For massive indexes that might be multiple terabytes, you could also take a sub-sample of, say, 10M vectors, train an index on it, and then tune the search parameters from there (a minimal sketch of this approach is shown at the end of this page). While there might be a small margin of error, the chosen build/search parameters should generalize fairly well for databases that build locally partitioned indexes. + +Refer to our :doc:`tuning guide ` for more information and examples on how to efficiently and automatically tune your vector search indexes based on your needs.
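The following is a minimal, purely illustrative sketch of that simpler approach: build once with default parameters on a sub-sample, then sweep only a search parameter against an exact ground truth. It uses the CAGRA Python API; the data sizes, the `itopk_size` values, and the recall computation are assumptions chosen only for the example and should be adapted to the installed cuVS version.

.. code-block:: python

    # Illustrative sketch only: default build parameters on a sub-sample, then a
    # sweep over a single search parameter, measuring recall against brute force.
    import cupy as cp

    from cuvs.neighbors import cagra
    from cuvs.neighbors.brute_force import knn

    sample = cp.random.random_sample((1_000_000, 96), dtype=cp.float32)  # sub-sample
    queries = cp.random.random_sample((10_000, 96), dtype=cp.float32)
    k = 10

    _, gt = knn(sample, queries, k)          # exact ground truth on the sub-sample
    gt = cp.asarray(gt)

    index = cagra.build(cagra.IndexParams(), sample)   # default build parameters

    for itopk in (32, 64, 128, 256):                   # sweep one search knob
        params = cagra.SearchParams(itopk_size=itopk)
        _, nbrs = cagra.search(params, index, queries, k)
        nbrs = cp.asarray(nbrs)
        recall = float(cp.mean(cp.any(nbrs[:, :, None] == gt[:, None, :], axis=2)))
        print(f"itopk_size={itopk}: recall={recall:.3f}")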
\ No newline at end of file diff --git a/python/cuvs/cuvs/neighbors/CMakeLists.txt b/python/cuvs/cuvs/neighbors/CMakeLists.txt index 21c3db5da..f68bbea53 100644 --- a/python/cuvs/cuvs/neighbors/CMakeLists.txt +++ b/python/cuvs/cuvs/neighbors/CMakeLists.txt @@ -14,6 +14,7 @@ add_subdirectory(brute_force) add_subdirectory(cagra) +add_subdirectory(hnsw) add_subdirectory(ivf_flat) add_subdirectory(ivf_pq) add_subdirectory(filters) diff --git a/python/cuvs/cuvs/neighbors/cagra/cagra.pxd b/python/cuvs/cuvs/neighbors/cagra/cagra.pxd index b23c2a4b3..bba5a91a8 100644 --- a/python/cuvs/cuvs/neighbors/cagra/cagra.pxd +++ b/python/cuvs/cuvs/neighbors/cagra/cagra.pxd @@ -17,6 +17,7 @@ from libc.stdint cimport ( int8_t, + int32_t, int64_t, uint8_t, uint32_t, @@ -100,6 +101,8 @@ cdef extern from "cuvs/neighbors/cagra.h" nogil: cuvsError_t cuvsCagraIndexDestroy(cuvsCagraIndex_t index) + cuvsError_t cuvsCagraIndexGetDims(cuvsCagraIndex_t index, int32_t* dim) + cuvsError_t cuvsCagraBuild(cuvsResources_t res, cuvsCagraIndexParams* params, DLManagedTensor* dataset, @@ -117,6 +120,20 @@ cdef extern from "cuvs/neighbors/cagra.h" nogil: cuvsCagraIndex_t index, bool include_dataset) except + + cuvsError_t cuvsCagraSerializeToHnswlib(cuvsResources_t res, + const char * filename, + cuvsCagraIndex_t index) except + + cuvsError_t cuvsCagraDeserialize(cuvsResources_t res, const char * filename, cuvsCagraIndex_t index) except + + +cdef class Index: + """ + CAGRA index object. This object stores the trained CAGRA index state + which can be used to perform nearest neighbors searches. + """ + + cdef cuvsCagraIndex_t index + cdef bool trained + cdef str active_index_type diff --git a/python/cuvs/cuvs/neighbors/cagra/cagra.pyx b/python/cuvs/cuvs/neighbors/cagra/cagra.pyx index f940ab8bf..95209dbeb 100644 --- a/python/cuvs/cuvs/neighbors/cagra/cagra.pyx +++ b/python/cuvs/cuvs/neighbors/cagra/cagra.pyx @@ -36,6 +36,7 @@ from pylibraft.neighbors.common import _check_input_array from libc.stdint cimport ( int8_t, + int32_t, int64_t, uint8_t, uint32_t, @@ -206,16 +207,9 @@ cdef class IndexParams: cdef class Index: - """ - CAGRA index object. This object stores the trained CAGRA index state - which can be used to perform nearest neighbors searches. - """ - - cdef cuvsCagraIndex_t index - cdef bool trained - def __cinit__(self): self.trained = False + self.active_index_type = None check_cuvs(cuvsCagraIndexCreate(&self.index)) def __dealloc__(self): @@ -226,6 +220,12 @@ cdef class Index: def trained(self): return self.trained + @property + def dim(self): + cdef int32_t dim + check_cuvs(cuvsCagraIndexGetDims(self.index, &dim)) + return dim + def __repr__(self): # todo(dgd): update repr as we expose data through C API attr_str = [] @@ -299,6 +299,7 @@ def build(IndexParams index_params, dataset, resources=None): idx.index )) idx.trained = True + idx.active_index_type = dataset_ai.dtype.name return idx diff --git a/python/cuvs/cuvs/neighbors/hnsw/CMakeLists.txt b/python/cuvs/cuvs/neighbors/hnsw/CMakeLists.txt new file mode 100644 index 000000000..1f9c422ca --- /dev/null +++ b/python/cuvs/cuvs/neighbors/hnsw/CMakeLists.txt @@ -0,0 +1,24 @@ +# ============================================================================= +# Copyright (c) 2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except +# in compliance with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software distributed under the License +# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express +# or implied. See the License for the specific language governing permissions and limitations under +# the License. +# ============================================================================= + +# Set the list of Cython files to build +set(cython_sources hnsw.pyx) +set(linked_libraries cuvs::cuvs cuvs::c_api) + +# Build all of the Cython targets +rapids_cython_create_modules( + CXX + SOURCE_FILES "${cython_sources}" + LINKED_LIBRARIES "${linked_libraries}" ASSOCIATED_TARGETS cuvs MODULE_PREFIX neighbors_hnsw_ +) diff --git a/python/cuvs/cuvs/neighbors/hnsw/__init__.pxd b/python/cuvs/cuvs/neighbors/hnsw/__init__.pxd new file mode 100644 index 000000000..e69de29bb diff --git a/python/cuvs/cuvs/neighbors/hnsw/__init__.py b/python/cuvs/cuvs/neighbors/hnsw/__init__.py new file mode 100644 index 000000000..5efcdf68b --- /dev/null +++ b/python/cuvs/cuvs/neighbors/hnsw/__init__.py @@ -0,0 +1,25 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from .hnsw import Index, SearchParams, from_cagra, load, save, search + +__all__ = [ + "Index", + "SearchParams", + "load", + "save", + "search", + "from_cagra", +] diff --git a/python/cuvs/cuvs/neighbors/hnsw/hnsw.pxd b/python/cuvs/cuvs/neighbors/hnsw/hnsw.pxd new file mode 100644 index 000000000..1cdc97406 --- /dev/null +++ b/python/cuvs/cuvs/neighbors/hnsw/hnsw.pxd @@ -0,0 +1,53 @@ +# +# Copyright (c) 2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# +# cython: language_level=3 + +from libc.stdint cimport int32_t, uintptr_t + +from cuvs.common.c_api cimport cuvsError_t, cuvsResources_t +from cuvs.common.cydlpack cimport DLDataType, DLManagedTensor +from cuvs.distance_type cimport cuvsDistanceType + + +cdef extern from "cuvs/neighbors/hnsw.h" nogil: + ctypedef struct cuvsHnswSearchParams: + int32_t ef + int32_t numThreads + + ctypedef cuvsHnswSearchParams* cuvsHnswSearchParams_t + + ctypedef struct cuvsHnswIndex: + uintptr_t addr + DLDataType dtype + + ctypedef cuvsHnswIndex* cuvsHnswIndex_t + + cuvsError_t cuvsHnswIndexCreate(cuvsHnswIndex_t* index) + + cuvsError_t cuvsHnswIndexDestroy(cuvsHnswIndex_t index) + + cuvsError_t cuvsHnswSearch(cuvsResources_t res, + cuvsHnswSearchParams* params, + cuvsHnswIndex_t index, + DLManagedTensor* queries, + DLManagedTensor* neighbors, + DLManagedTensor* distances) except + + + cuvsError_t cuvsHnswDeserialize(cuvsResources_t res, + const char * filename, + int32_t dim, + cuvsDistanceType metric, + cuvsHnswIndex_t index) except + diff --git a/python/cuvs/cuvs/neighbors/hnsw/hnsw.pyx b/python/cuvs/cuvs/neighbors/hnsw/hnsw.pyx new file mode 100644 index 000000000..018fcfef9 --- /dev/null +++ b/python/cuvs/cuvs/neighbors/hnsw/hnsw.pyx @@ -0,0 +1,380 @@ +# +# Copyright (c) 2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# cython: language_level=3 + +from libc.stdint cimport uint32_t +from libcpp cimport bool +from libcpp.string cimport string + +from cuvs.common.exceptions import check_cuvs +from cuvs.common.resources import auto_sync_resources + +from cuvs.common cimport cydlpack + +import numpy as np + +from cuvs.distance import DISTANCE_TYPES + +from cuvs.neighbors.cagra cimport cagra + +import os +import uuid + +from pylibraft.common import auto_convert_output +from pylibraft.common.cai_wrapper import wrap_array +from pylibraft.common.interruptible import cuda_interruptible +from pylibraft.neighbors.common import _check_input_array + + +cdef class SearchParams: + """ + HNSW search parameters + + Parameters + ---------- + ef: int, default = 200 + Maximum number of candidate list size used during search. + num_threads: int, default = 0 + Number of CPU threads used to increase search parallelism. + When set to 0, the number of threads is automatically determined + using OpenMP's `omp_get_max_threads()`. + """ + + cdef cuvsHnswSearchParams params + + def __init__(self, *, + ef=200, + num_threads=0): + self.params.ef = ef + self.params.numThreads = num_threads + + def __repr__(self): + attr_str = [attr + "=" + str(getattr(self, attr)) + for attr in [ + "ef", "num_threads"]] + return "SearchParams(type=HNSW, " + (", ".join(attr_str)) + ")" + + @property + def ef(self): + return self.params.ef + + @property + def num_threads(self): + return self.params.numThreads + + +cdef class Index: + """ + HNSW index object. This object stores the trained HNSW index state + which can be used to perform nearest neighbors searches. 
+ """ + + cdef cuvsHnswIndex_t index + cdef bool trained + + def __cinit__(self): + self.trained = False + check_cuvs(cuvsHnswIndexCreate(&self.index)) + + def __dealloc__(self): + if self.index is not NULL: + check_cuvs(cuvsHnswIndexDestroy(self.index)) + + @property + def trained(self): + return self.trained + + def __repr__(self): + # todo(dgd): update repr as we expose data through C API + attr_str = [] + return "Index(type=HNSW, metric=L2" + (", ".join(attr_str)) + ")" + + +@auto_sync_resources +def save(filename, cagra.Index index, resources=None): + """ + Saves the CAGRA index to a file as an hnswlib index. + The saved index is immutable and can only be searched by the hnswlib + wrapper in cuVS, as the format is not compatible with the original + hnswlib. + + Saving / loading the index is experimental. The serialization format is + subject to change. + + Parameters + ---------- + filename : string + Name of the file. + index : Index + Trained CAGRA index. + {resources_docstring} + + Examples + -------- + >>> import cupy as cp + >>> from cuvs.neighbors import cagra + >>> n_samples = 50000 + >>> n_features = 50 + >>> dataset = cp.random.random_sample((n_samples, n_features), + ... dtype=cp.float32) + >>> # Build index + >>> index = cagra.build(cagra.IndexParams(), dataset) + >>> # Serialize and deserialize the cagra index built + >>> hnsw.save("my_index.bin", index) + """ + cdef string c_filename = filename.encode('utf-8') + cdef cuvsResources_t res = resources.get_c_obj() + check_cuvs(cagra.cuvsCagraSerializeToHnswlib(res, + c_filename.c_str(), + index.index)) + + +@auto_sync_resources +def load(filename, dim, dtype, metric="sqeuclidean", resources=None): + """ + Loads base-layer-only hnswlib index from file, which was originally + saved as a built CAGRA index. The loaded index is immutable and can only + be searched by the hnswlib wrapper in cuVS, as the format is not + compatible with the original hnswlib. + + Saving / loading the index is experimental. The serialization format is + subject to change, therefore loading an index saved with a previous + version of cuVS is not guaranteed to work. + + Parameters + ---------- + filename : string + Name of the file. + dim : int + Dimensions of the training dataest + dtype : np.dtype of the saved index + Valid values for dtype: [np.float32, np.byte, np.ubyte] + metric : string denoting the metric type, default="sqeuclidean" + Valid values for metric: ["sqeuclidean", "inner_product"], where + - sqeuclidean is the euclidean distance without the square root + operation, i.e.: distance(a,b) = \\sum_i (a_i - b_i)^2, + - inner_product distance is defined as + distance(a, b) = \\sum_i a_i * b_i. + {resources_docstring} + + Returns + ------- + index : HnswIndex + + Examples + -------- + >>> import cupy as cp + >>> from cuvs.neighbors import cagra + >>> from cuvs.neighbors import hnsw + >>> n_samples = 50000 + >>> n_features = 50 + >>> dataset = cp.random.random_sample((n_samples, n_features), + ... dtype=cp.float32) + >>> # Build index + >>> index = cagra.build(cagra.IndexParams(), dataset) + >>> # Serialize the CAGRA index to hnswlib base layer only index format + >>> hnsw.save("my_index.bin", index) + >>> index = hnsw.load("my_index.bin", n_features, np.float32, + ... 
"sqeuclidean") + """ + cdef Index idx = Index() + cdef cuvsResources_t res = resources.get_c_obj() + cdef string c_filename = filename.encode('utf-8') + cdef cydlpack.DLDataType dl_dtype + if dtype == np.float32: + dl_dtype.code = cydlpack.kDLFloat + dl_dtype.bits = 32 + dl_dtype.lanes = 1 + elif dtype == np.ubyte: + dl_dtype.code = cydlpack.kDLUInt + dl_dtype.bits = 8 + dl_dtype.lanes = 1 + elif dtype == np.byte: + dl_dtype.code = cydlpack.kDLInt + dl_dtype.bits = 8 + dl_dtype.lanes = 1 + else: + raise ValueError("Only float32 is supported for dtype") + + idx.index.dtype = dl_dtype + cdef cuvsDistanceType distance_type = DISTANCE_TYPES[metric] + + check_cuvs(cuvsHnswDeserialize( + res, + c_filename.c_str(), + dim, + distance_type, + idx.index + )) + idx.trained = True + return idx + + +@auto_sync_resources +def from_cagra(cagra.Index index, temporary_index_path=None, resources=None): + """ + Returns an hnsw base-layer-only index from a CAGRA index. + + NOTE: This method uses the filesystem to write the CAGRA index in + `/tmp/.bin` or the parameter `temporary_index_path` + if not None before reading it as an hnsw index, + then deleting the temporary file. The returned index is immutable + and can only be searched by the hnsw wrapper in cuVS, as the + format is not compatible with the original hnswlib library. + By `base_layer_only`, we mean that the hnsw index is created + without the additional layers that are used for the hierarchical + search in hnswlib. Instead, the base layer is used for the search. + + Saving / loading the index is experimental. The serialization format is + subject to change. + + Parameters + ---------- + index : Index + Trained CAGRA index. + temporary_index_path : string, default = None + Path to save the temporary index file. If None, the temporary file + will be saved in `/tmp/.bin`. + {resources_docstring} + + Examples + -------- + >>> import cupy as cp + >>> from cuvs.neighbors import cagra + >>> from cuvs.neighbors import hnsw + >>> n_samples = 50000 + >>> n_features = 50 + >>> dataset = cp.random.random_sample((n_samples, n_features), + ... dtype=cp.float32) + >>> # Build index + >>> index = cagra.build(cagra.IndexParams(), dataset) + >>> # Serialize the CAGRA index to hnswlib base layer only index format + >>> hnsw_index = hnsw.from_cagra(index) + """ + uuid_num = uuid.uuid4() + filename = temporary_index_path if temporary_index_path else \ + f"/tmp/{uuid_num}.bin" + save(filename, index, resources=resources) + hnsw_index = load(filename, index.dim, np.dtype(index.active_index_type), + "sqeuclidean", resources=resources) + os.remove(filename) + return hnsw_index + + +@auto_sync_resources +@auto_convert_output +def search(SearchParams search_params, + Index index, + queries, + k, + neighbors=None, + distances=None, + resources=None): + """ + Find the k nearest neighbors for each query. + + Parameters + ---------- + search_params : SearchParams + index : Index + Trained CAGRA index. + queries : CUDA array interface compliant matrix shape (n_samples, dim) + Supported dtype [float, int] + k : int + The number of neighbors. + neighbors : Optional CUDA array interface compliant matrix shape + (n_queries, k), dtype uint64_t. If supplied, neighbor + indices will be written here in-place. (default None) + distances : Optional CUDA array interface compliant matrix shape + (n_queries, k) If supplied, the distances to the + neighbors will be written here in-place. 
(default None) + {resources_docstring} + + Examples + -------- + >>> import cupy as cp + >>> from cuvs.neighbors import cagra, hnsw + >>> n_samples = 50000 + >>> n_features = 50 + >>> n_queries = 1000 + >>> dataset = cp.random.random_sample((n_samples, n_features), + ... dtype=cp.float32) + >>> # Build index + >>> index = cagra.build(cagra.IndexParams(), dataset) + >>> # Search using the built index + >>> queries = cp.random.random_sample((n_queries, n_features), + ... dtype=cp.float32) + >>> k = 10 + >>> search_params = hnsw.SearchParams( + ... ef=200, + ... num_threads=0 + ... ) + >>> # Convert CAGRA index to HNSW + >>> hnsw_index = hnsw.from_cagra(index) + >>> # Using a pooling allocator reduces overhead of temporary array + >>> # creation during search. This is useful if multiple searches + >>> # are performed with same query size. + >>> distances, neighbors = hnsw.search(search_params, index, queries, + ... k) + >>> neighbors = cp.asarray(neighbors) + >>> distances = cp.asarray(distances) + """ + if not index.trained: + raise ValueError("Index needs to be built before calling search.") + + # todo(dgd): we can make the check of dtype a parameter of wrap_array + # in RAFT to make this a single call + queries_ai = wrap_array(queries) + _check_input_array(queries_ai, [np.dtype('float32'), + np.dtype('uint8'), + np.dtype('int8')]) + + cdef uint32_t n_queries = queries_ai.shape[0] + + if neighbors is None: + neighbors = np.empty((n_queries, k), dtype='uint64') + + neighbors_ai = wrap_array(neighbors) + _check_input_array(neighbors_ai, [np.dtype('uint64')], + exp_rows=n_queries, exp_cols=k) + + if distances is None: + distances = np.empty((n_queries, k), dtype='float32') + + distances_ai = wrap_array(distances) + _check_input_array(distances_ai, [np.dtype('float32')], + exp_rows=n_queries, exp_cols=k) + + cdef cuvsHnswSearchParams* params = &search_params.params + cdef cydlpack.DLManagedTensor* queries_dlpack = \ + cydlpack.dlpack_c(queries_ai) + cdef cydlpack.DLManagedTensor* neighbors_dlpack = \ + cydlpack.dlpack_c(neighbors_ai) + cdef cydlpack.DLManagedTensor* distances_dlpack = \ + cydlpack.dlpack_c(distances_ai) + cdef cuvsResources_t res = resources.get_c_obj() + + with cuda_interruptible(): + check_cuvs(cuvsHnswSearch( + res, + params, + index.index, + queries_dlpack, + neighbors_dlpack, + distances_dlpack + )) + + return (distances, neighbors) diff --git a/python/cuvs/cuvs/test/test_hnsw.py b/python/cuvs/cuvs/test/test_hnsw.py new file mode 100644 index 000000000..0ae97266b --- /dev/null +++ b/python/cuvs/cuvs/test/test_hnsw.py @@ -0,0 +1,97 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# h ttp://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +import numpy as np +import pytest +from sklearn.neighbors import NearestNeighbors +from sklearn.preprocessing import normalize + +from cuvs.neighbors import cagra, hnsw +from cuvs.test.ann_utils import calc_recall, generate_data + + +def run_hnsw_build_search_test( + n_rows=1000, + n_cols=10, + n_queries=100, + k=10, + dtype=np.float32, + metric="sqeuclidean", + build_algo="ivf_pq", + intermediate_graph_degree=128, + graph_degree=64, + search_params={}, +): + dataset = generate_data((n_rows, n_cols), dtype) + if metric == "inner_product": + dataset = normalize(dataset, norm="l2", axis=1) + if dtype in [np.int8, np.uint8]: + pytest.skip( + "inner_product metric is not supported for int8/uint8 data" + ) + if build_algo == "nn_descent": + pytest.skip("inner_product metric is not supported for nn_descent") + + build_params = cagra.IndexParams( + metric=metric, + intermediate_graph_degree=intermediate_graph_degree, + graph_degree=graph_degree, + build_algo=build_algo, + ) + + index = cagra.build(build_params, dataset) + + assert index.trained + + hnsw_index = hnsw.from_cagra(index) + + queries = generate_data((n_queries, n_cols), dtype) + + search_params = hnsw.SearchParams(**search_params) + + out_dist, out_idx = hnsw.search(search_params, hnsw_index, queries, k) + + # Calculate reference values with sklearn + skl_metric = { + "sqeuclidean": "sqeuclidean", + "inner_product": "cosine", + "euclidean": "euclidean", + }[metric] + nn_skl = NearestNeighbors( + n_neighbors=k, algorithm="brute", metric=skl_metric + ) + nn_skl.fit(dataset) + skl_dist, skl_idx = nn_skl.kneighbors(queries, return_distance=True) + + recall = calc_recall(out_idx, skl_idx) + assert recall > 0.95 + + +@pytest.mark.parametrize("dtype", [np.float32, np.int8, np.uint8]) +@pytest.mark.parametrize("k", [10, 20]) +@pytest.mark.parametrize("ef", [30, 40]) +@pytest.mark.parametrize("num_threads", [2, 4]) +@pytest.mark.parametrize("metric", ["sqeuclidean"]) +@pytest.mark.parametrize("build_algo", ["ivf_pq", "nn_descent"]) +def test_hnsw(dtype, k, ef, num_threads, metric, build_algo): + # Note that inner_product tests use normalized input which we cannot + # represent in int8, therefore we test only sqeuclidean metric here. 
+ run_hnsw_build_search_test( + dtype=dtype, + k=k, + metric=metric, + build_algo=build_algo, + search_params={"ef": ef, "num_threads": num_threads}, + ) diff --git a/python/cuvs_bench/cuvs_bench/config/algorithms.yaml b/python/cuvs_bench/cuvs_bench/config/algorithms.yaml index dc1127fbc..357517933 100644 --- a/python/cuvs_bench/cuvs_bench/config/algorithms.yaml +++ b/python/cuvs_bench/cuvs_bench/config/algorithms.yaml @@ -31,6 +31,15 @@ cuvs_cagra: cuvs_brute_force: executable: CUVS_BRUTE_FORCE_ANN_BENCH requires_gpu: true +cuvs_mg_ivf_flat: + executable: CUVS_MG_ANN_BENCH + requires_gpu: true +cuvs_mg_ivf_pq: + executable: CUVS_MG_ANN_BENCH + requires_gpu: true +cuvs_mg_cagra: + executable: CUVS_MG_ANN_BENCH + requires_gpu: true ggnn: executable: GGNN_ANN_BENCH requires_gpu: true diff --git a/python/cuvs_bench/cuvs_bench/config/algos/cuvs_mg_cagra.yaml b/python/cuvs_bench/cuvs_bench/config/algos/cuvs_mg_cagra.yaml new file mode 100644 index 000000000..d730e33fb --- /dev/null +++ b/python/cuvs_bench/cuvs_bench/config/algos/cuvs_mg_cagra.yaml @@ -0,0 +1,13 @@ +name: cuvs_mg_cagra +constraints: + build: cuvs_bench.config.algos.constraints.cuvs_cagra_build + search: cuvs_bench.config.algos.constraints.cuvs_cagra_search +groups: + base: + build: + graph_degree: [32, 64, 96, 128] + intermediate_graph_degree: [32, 64, 96, 128] + graph_build_algo: ["NN_DESCENT"] + search: + itopk: [32, 64, 128, 256, 512] + search_width: [1, 2, 4, 8, 16, 32, 64] diff --git a/python/cuvs_bench/cuvs_bench/config/algos/cuvs_mg_ivf_flat.yaml b/python/cuvs_bench/cuvs_bench/config/algos/cuvs_mg_ivf_flat.yaml new file mode 100644 index 000000000..317e615e5 --- /dev/null +++ b/python/cuvs_bench/cuvs_bench/config/algos/cuvs_mg_ivf_flat.yaml @@ -0,0 +1,9 @@ +name: cuvs_mg_ivf_flat +groups: + base: + build: + nlist: [1024, 2048, 4096, 8192, 16384, 32000, 64000] + ratio: [1, 2, 4] + niter: [20, 25] + search: + nprobe: [1, 5, 10, 50, 100, 200, 500, 1000, 2000] diff --git a/python/cuvs_bench/cuvs_bench/config/algos/cuvs_mg_ivf_pq.yaml b/python/cuvs_bench/cuvs_bench/config/algos/cuvs_mg_ivf_pq.yaml new file mode 100644 index 000000000..7213968f4 --- /dev/null +++ b/python/cuvs_bench/cuvs_bench/config/algos/cuvs_mg_ivf_pq.yaml @@ -0,0 +1,41 @@ +name: cuvs_mg_ivf_pq +constraints: + build: cuvs_bench.config.algos.constraints.cuvs_ivf_pq_build + search: cuvs_bench.config.algos.constraints.cuvs_ivf_pq_search +groups: + base: + build: + nlist: [1024, 2048, 4096, 8192] + pq_dim: [64, 32, 16] + pq_bits: [8, 6, 5, 4] + ratio: [10] + niter: [25] + search: + nprobe: [1, 5, 10, 50, 100, 200] + internalDistanceDtype: ["float"] + smemLutDtype: ["float", "fp8", "half"] + refine_ratio: [1, 2, 4] + large: + build: + nlist: [8192, 16384, 32768, 65536] + pq_dim: [48, 32, 16] + pq_bits: [8, 6, 5, 4] + ratio: [4] + niter: [20] + search: + nprobe: [20, 30, 40, 50, 100, 200, 500, 1000] + internalDistanceDtype: ["float"] + smemLutDtype: ["float", "fp8", "half"] + refine_ratio: [1, 2, 4] + 100M: + build: + nlist: [50000] + pq_dim: [48] + pq_bits: [8, 6, 5, 4] + ratio: [10] + niter: [10] + search: + nprobe: [20, 30, 40, 50, 100, 200, 500, 1000] + internalDistanceDtype: ["float"] + smemLutDtype: ["float", "fp8", "half"] + refine_ratio: [1] diff --git a/python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py b/python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py new file mode 100644 index 000000000..2b4213016 --- /dev/null +++ b/python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py @@ -0,0 +1,241 @@ +#!/usr/bin/env python +# +# 
Copyright (c) 2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +import argparse +import os +import sys + +import cupy as cp +import numpy as np +import rmm +from pylibraft.common import DeviceResources +from rmm.allocators.cupy import rmm_cupy_allocator + +from cuvs.neighbors.brute_force import knn + +from .utils import memmap_bin_file, suffix_from_dtype, write_bin + + +def generate_random_queries(n_queries, n_features, dtype=np.float32): + print("Generating random queries") + if np.issubdtype(dtype, np.integer): + queries = cp.random.randint( + 0, 255, size=(n_queries, n_features), dtype=dtype + ) + else: + queries = cp.random.uniform(size=(n_queries, n_features)).astype(dtype) + return queries + + +def choose_random_queries(dataset, n_queries): + print("Choosing random vector from dataset as query vectors") + query_idx = np.random.choice( + dataset.shape[0], size=(n_queries,), replace=False + ) + return dataset[query_idx, :] + + +def calc_truth(dataset, queries, k, metric="sqeuclidean"): + handle = DeviceResources() + n_samples = dataset.shape[0] + n = 500000 # batch size for processing neighbors + i = 0 + indices = None + distances = None + queries = cp.asarray(queries, dtype=cp.float32) + + while i < n_samples: + print("Step {0}/{1}:".format(i // n, n_samples // n)) + n_batch = n if i + n <= n_samples else n_samples - i + + X = cp.asarray(dataset[i : i + n_batch, :], cp.float32) + + D, Ind = knn(X, queries, k, metric=metric, handle=handle) + handle.sync() + + D, Ind = cp.asarray(D), cp.asarray(Ind) + Ind += i # shift neighbor index by offset i + + if distances is None: + distances = D + indices = Ind + else: + distances = cp.concatenate([distances, D], axis=1) + indices = cp.concatenate([indices, Ind], axis=1) + idx = cp.argsort(distances, axis=1)[:, :k] + distances = cp.take_along_axis(distances, idx, axis=1) + indices = cp.take_along_axis(indices, idx, axis=1) + + i += n_batch + + return distances, indices + + +def main(): + pool = rmm.mr.PoolMemoryResource( + rmm.mr.CudaMemoryResource(), initial_pool_size=2**30 + ) + rmm.mr.set_current_device_resource(pool) + cp.cuda.set_allocator(rmm_cupy_allocator) + + parser = argparse.ArgumentParser( + prog="generate_groundtruth", + description="Generate true neighbors using exact NN search. " + "The input and output files are in big-ann-benchmark's binary format.", + epilog="""Example usage + # With existing query file + python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.\ +fbin --output=groundtruth_dir --queries=/dataset/query.public.10K.fbin + + # With randomly generated queries + python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.\ +fbin --output=groundtruth_dir --queries=random --n_queries=10000 + + # Using only a subset of the dataset. Define queries by randomly + # selecting vectors from the (subset of the) dataset. 
+ python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.\ +fbin --nrows=2000000 --cols=128 --output=groundtruth_dir \ +--queries=random-choice --n_queries=10000 + """, + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + + parser.add_argument("dataset", type=str, help="input dataset file name") + parser.add_argument( + "--queries", + type=str, + default="random", + help="Queries file name, or one of 'random-choice' or 'random' " + "(default). 'random-choice': select n_queries vectors from the input " + "dataset. 'random': generate n_queries as uniform random numbers.", + ) + parser.add_argument( + "--output", + type=str, + default="", + help="output directory name (default current dir)", + ) + + parser.add_argument( + "--n_queries", + type=int, + default=10000, + help="Number of quries to generate (if no query file is given). " + "Default: 10000.", + ) + + parser.add_argument( + "-N", + "--rows", + default=None, + type=int, + help="use only first N rows from dataset, by default the whole " + "dataset is used", + ) + parser.add_argument( + "-D", + "--cols", + default=None, + type=int, + help="number of features (dataset columns). " + "Default: read from dataset file.", + ) + parser.add_argument( + "--dtype", + type=str, + help="Dataset dtype. When not specified, then derived from extension." + " Supported types: 'float32', 'float16', 'uint8', 'int8'", + ) + + parser.add_argument( + "-k", + type=int, + default=100, + help="Number of neighbors (per query) to calculate", + ) + parser.add_argument( + "--metric", + type=str, + default="sqeuclidean", + help="Metric to use while calculating distances. Valid metrics are " + "those that are accepted by cuvs.neighbors.brute_force.knn. Most" + " commonly used with cuVS are 'sqeuclidean' and 'inner_product'", + ) + + if len(sys.argv) == 1: + parser.print_help() + sys.exit(1) + args = parser.parse_args() + + if args.rows is not None: + print("Reading subset of the data, nrows=", args.rows) + else: + print("Reading whole dataset") + + # Load input data + dataset = memmap_bin_file( + args.dataset, args.dtype, shape=(args.rows, args.cols) + ) + n_features = dataset.shape[1] + dtype = dataset.dtype + + print( + "Dataset size {:6.1f} GB, shape {}, dtype {}".format( + dataset.size * dataset.dtype.itemsize / 1e9, + dataset.shape, + np.dtype(dtype), + ) + ) + + if len(args.output) > 0: + os.makedirs(args.output, exist_ok=True) + + if args.queries == "random" or args.queries == "random-choice": + if args.n_queries is None: + raise RuntimeError( + "n_queries must be given to generate random queries" + ) + if args.queries == "random": + queries = generate_random_queries( + args.n_queries, n_features, dtype + ) + elif args.queries == "random-choice": + queries = choose_random_queries(dataset, args.n_queries) + + queries_filename = os.path.join( + args.output, "queries" + suffix_from_dtype(dtype) + ) + print("Writing queries file", queries_filename) + write_bin(queries_filename, queries) + else: + print("Reading queries from file", args.queries) + queries = memmap_bin_file(args.queries, dtype) + + print("Calculating true nearest neighbors") + distances, indices = calc_truth(dataset, queries, args.k, args.metric) + + write_bin( + os.path.join(args.output, "groundtruth.neighbors.ibin"), + indices.astype(np.uint32), + ) + write_bin( + os.path.join(args.output, "groundtruth.distances.fbin"), + distances.astype(np.float32), + ) + + +if __name__ == "__main__": + main() diff --git a/python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py 
b/python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py new file mode 100644 index 000000000..a969b3d89 --- /dev/null +++ b/python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py @@ -0,0 +1,101 @@ +# +# Copyright (c) 2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os + +import numpy as np + + +def dtype_from_filename(filename): + ext = os.path.splitext(filename)[1] + if ext == ".fbin": + return np.float32 + if ext == ".hbin": + return np.float16 + elif ext == ".ibin": + return np.int32 + elif ext == ".u8bin": + return np.ubyte + elif ext == ".i8bin": + return np.byte + else: + raise RuntimeError("Not supported file extension" + ext) + + +def suffix_from_dtype(dtype): + if dtype == np.float32: + return ".fbin" + if dtype == np.float16: + return ".hbin" + elif dtype == np.int32: + return ".ibin" + elif dtype == np.ubyte: + return ".u8bin" + elif dtype == np.byte: + return ".i8bin" + else: + raise RuntimeError("Not supported dtype extension" + dtype) + + +def memmap_bin_file( + bin_file, dtype, shape=None, mode="r", size_dtype=np.uint32 +): + extent_itemsize = np.dtype(size_dtype).itemsize + offset = int(extent_itemsize) * 2 + if bin_file is None: + return None + if dtype is None: + dtype = dtype_from_filename(bin_file) + + if mode[0] == "r": + a = np.memmap(bin_file, mode=mode, dtype=size_dtype, shape=(2,)) + if shape is None: + shape = (a[0], a[1]) + else: + shape = tuple( + [ + aval if sval is None else sval + for aval, sval in zip(a, shape) + ] + ) + + return np.memmap( + bin_file, mode=mode, dtype=dtype, offset=offset, shape=shape + ) + elif mode[0] == "w": + if shape is None: + raise ValueError("Need to specify shape to map file in write mode") + + print("creating file", bin_file) + dirname = os.path.dirname(bin_file) + if len(dirname) > 0: + os.makedirs(dirname, exist_ok=True) + a = np.memmap(bin_file, mode=mode, dtype=size_dtype, shape=(2,)) + a[0] = shape[0] + a[1] = shape[1] + a.flush() + del a + fp = np.memmap( + bin_file, mode="r+", dtype=dtype, offset=offset, shape=shape + ) + return fp + + +def write_bin(fname, data): + print("writing", fname, data.shape, data.dtype, "...") + with open(fname, "wb") as f: + np.asarray(data.shape, dtype=np.uint32).tofile(f) + data.tofile(f) diff --git a/python/cuvs_bench/cuvs_bench/get_dataset/__main__.py b/python/cuvs_bench/cuvs_bench/get_dataset/__main__.py new file mode 100644 index 000000000..a6b154ef2 --- /dev/null +++ b/python/cuvs_bench/cuvs_bench/get_dataset/__main__.py @@ -0,0 +1,115 @@ +# +# Copyright (c) 2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import subprocess +import sys +from urllib.request import urlretrieve + + +def get_dataset_path(name, ann_bench_data_path): + if not os.path.exists(ann_bench_data_path): + os.mkdir(ann_bench_data_path) + return os.path.join(ann_bench_data_path, f"{name}.hdf5") + + +def download_dataset(url, path): + if not os.path.exists(path): + print(f"downloading {url} -> {path}...") + urlretrieve(url, path) + + +def convert_hdf5_to_fbin(path, normalize): + scripts_path = os.path.dirname(os.path.realpath(__file__)) + ann_bench_scripts_path = os.path.join(scripts_path, "hdf5_to_fbin.py") + print(f"calling script {ann_bench_scripts_path}") + if normalize and "angular" in path: + subprocess.run( + ["python", ann_bench_scripts_path, "-n", "%s" % path], check=True + ) + else: + subprocess.run( + ["python", ann_bench_scripts_path, "%s" % path], check=True + ) + + +def move(name, ann_bench_data_path): + if "angular" in name: + new_name = name.replace("angular", "inner") + else: + new_name = name + new_path = os.path.join(ann_bench_data_path, new_name) + if not os.path.exists(new_path): + os.mkdir(new_path) + for bin_name in [ + "base.fbin", + "query.fbin", + "groundtruth.neighbors.ibin", + "groundtruth.distances.fbin", + ]: + os.rename( + f"{ann_bench_data_path}/{name}.{bin_name}", + f"{new_path}/{bin_name}", + ) + + +def download(name, normalize, ann_bench_data_path): + path = get_dataset_path(name, ann_bench_data_path) + try: + url = f"http://ann-benchmarks.com/{name}.hdf5" + download_dataset(url, path) + + convert_hdf5_to_fbin(path, normalize) + + move(name, ann_bench_data_path) + except Exception: + print(f"Cannot download {url}") + raise + + +def main(): + call_path = os.getcwd() + if "RAPIDS_DATASET_ROOT_DIR" in os.environ: + default_dataset_path = os.getenv("RAPIDS_DATASET_ROOT_DIR") + else: + default_dataset_path = os.path.join(call_path, "datasets/") + parser = argparse.ArgumentParser( + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + parser.add_argument( + "--dataset", help="dataset to download", default="glove-100-angular" + ) + parser.add_argument( + "--dataset-path", + help="path to download dataset", + default=default_dataset_path, + ) + parser.add_argument( + "--normalize", + help="normalize cosine distance to inner product", + action="store_true", + ) + + if len(sys.argv) == 1: + parser.print_help() + sys.exit(1) + args = parser.parse_args() + + download(args.dataset, args.normalize, args.dataset_path) + + +if __name__ == "__main__": + main() diff --git a/python/cuvs_bench/cuvs_bench/get_dataset/fbin_to_f16bin.py b/python/cuvs_bench/cuvs_bench/get_dataset/fbin_to_f16bin.py new file mode 100644 index 000000000..1255e42dc --- /dev/null +++ b/python/cuvs_bench/cuvs_bench/get_dataset/fbin_to_f16bin.py @@ -0,0 +1,49 @@ +# +# Copyright (c) 2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+
+from __future__ import absolute_import, division, print_function
+
+import sys
+
+import numpy as np
+
+
+def read_fbin(fname):
+    shape = np.fromfile(fname, dtype=np.uint32, count=2)
+    if float(shape[0]) * shape[1] * 4 > 2_000_000_000:
+        data = np.memmap(fname, dtype=np.float32, offset=8, mode="r").reshape(
+            shape
+        )
+    else:
+        data = np.fromfile(fname, dtype=np.float32, offset=8).reshape(shape)
+    return data
+
+
+def write_bin(fname, data):
+    with open(fname, "wb") as f:
+        np.asarray(data.shape, dtype=np.uint32).tofile(f)
+        data.tofile(f)
+
+
+if len(sys.argv) != 3:
+    print(
+        "usage: %s input.fbin output.f16bin" % (sys.argv[0]),
+        file=sys.stderr,
+    )
+    sys.exit(-1)
+
+data = read_fbin(sys.argv[1]).astype(np.float16)
+write_bin(sys.argv[2], data)
diff --git a/python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py b/python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py
new file mode 100644
index 000000000..317051aa2
--- /dev/null
+++ b/python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py
@@ -0,0 +1,90 @@
+#
+# Copyright (c) 2024, NVIDIA CORPORATION.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import sys
+
+import h5py
+import numpy as np
+
+
+def normalize(x):
+    norm = np.linalg.norm(x, axis=1)
+    return (x.T / norm).T
+
+
+def write_bin(fname, data):
+    with open(fname, "wb") as f:
+        np.asarray(data.shape, dtype=np.uint32).tofile(f)
+        data.tofile(f)
+
+
+if __name__ == "__main__":
+    if len(sys.argv) != 2 and len(sys.argv) != 3:
+        print(
+            "usage: %s [-n] <input>.hdf5\n" % (sys.argv[0]),
+            "  -n: normalize base/query set\n",
+            "outputs: <input>.base.fbin\n",
+            "         <input>.query.fbin\n",
+            "         <input>.groundtruth.neighbors.ibin\n",
+            "         <input>.groundtruth.distances.fbin",
+            file=sys.stderr,
+        )
+        sys.exit(-1)
+
+    need_normalize = False
+    if len(sys.argv) == 3:
+        assert sys.argv[1] == "-n"
+        need_normalize = True
+    fname_prefix = sys.argv[-1]
+    assert fname_prefix.endswith(".hdf5")
+    fname_prefix = fname_prefix[:-5]
+
+    hdf5 = h5py.File(sys.argv[-1], "r")
+    assert (
+        hdf5.attrs["distance"] == "angular"
+        or hdf5.attrs["distance"] == "euclidean"
+    )
+    assert hdf5["train"].dtype == np.float32
+    assert hdf5["test"].dtype == np.float32
+    assert hdf5["neighbors"].dtype == np.int32
+    assert hdf5["distances"].dtype == np.float32
+
+    base = hdf5["train"][:]
+    query = hdf5["test"][:]
+    if need_normalize:
+        base = normalize(base)
+        query = normalize(query)
+    elif hdf5.attrs["distance"] == "angular":
+        print(
+            "warning: input has angular distance, ",
+            "specify -n to normalize base/query set!\n",
+        )
+
+    output_fname = fname_prefix + ".base.fbin"
+    print("writing", output_fname, "...")
+    write_bin(output_fname, base)
+
+    output_fname = fname_prefix + ".query.fbin"
+    print("writing", output_fname, "...")
+    write_bin(output_fname, query)
+
+    output_fname = fname_prefix + ".groundtruth.neighbors.ibin"
+    print("writing", output_fname, "...")
+    write_bin(output_fname, hdf5["neighbors"][:])
+
+    output_fname = fname_prefix + ".groundtruth.distances.fbin"
+    print("writing", output_fname, "...")
+    write_bin(output_fname, hdf5["distances"][:])
hdf5["distances"][:]) diff --git a/python/cuvs_bench/cuvs_bench/run/__main__.py b/python/cuvs_bench/cuvs_bench/run/__main__.py index bf9f8586d..58fc5291b 100644 --- a/python/cuvs_bench/cuvs_bench/run/__main__.py +++ b/python/cuvs_bench/cuvs_bench/run/__main__.py @@ -19,8 +19,9 @@ from typing import Optional import click -from data_export import convert_json_to_csv_build, convert_json_to_csv_search -from run import run_benchmark + +from .data_export import convert_json_to_csv_build, convert_json_to_csv_search +from .run import run_benchmark @click.command() diff --git a/python/cuvs_bench/cuvs_bench/run/run.py b/python/cuvs_bench/cuvs_bench/run/run.py index a65d4b5fe..0159d2c19 100644 --- a/python/cuvs_bench/cuvs_bench/run/run.py +++ b/python/cuvs_bench/cuvs_bench/run/run.py @@ -21,7 +21,8 @@ from typing import Any, Dict, Optional, Tuple import yaml -from runners import cuvs_bench_cpp + +from .runners import cuvs_bench_cpp def rmm_present() -> bool: diff --git a/python/cuvs_bench/cuvs_bench/split_groundtruth/__main__.py b/python/cuvs_bench/cuvs_bench/split_groundtruth/__main__.py new file mode 100644 index 000000000..7fee30e42 --- /dev/null +++ b/python/cuvs_bench/cuvs_bench/split_groundtruth/__main__.py @@ -0,0 +1,57 @@ +# +# Copyright (c) 2024, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import subprocess +import sys + + +def split_groundtruth(groundtruth_filepath): + ann_bench_scripts_path = os.path.join( + os.path.dirname(os.path.realpath(__file__)), "split_groundtruth.pl" + ) + pwd = os.getcwd() + path_to_groundtruth = os.path.normpath(groundtruth_filepath).split(os.sep) + if len(path_to_groundtruth) > 1: + os.chdir(os.path.join(*path_to_groundtruth[:-1])) + groundtruth_filename = path_to_groundtruth[-1] + subprocess.run( + [ann_bench_scripts_path, groundtruth_filename, "groundtruth"], + check=True, + ) + os.chdir(pwd) + + +def main(): + parser = argparse.ArgumentParser( + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + parser.add_argument( + "--groundtruth", + help="Path to billion-scale dataset groundtruth file", + required=True, + ) + + if len(sys.argv) == 1: + parser.print_help() + sys.exit(1) + args = parser.parse_args() + + split_groundtruth(args.groundtruth) + + +if __name__ == "__main__": + main() diff --git a/python/cuvs_bench/cuvs_bench/split_groundtruth/split_groundtruth.pl b/python/cuvs_bench/cuvs_bench/split_groundtruth/split_groundtruth.pl new file mode 100644 index 000000000..b0a59f806 --- /dev/null +++ b/python/cuvs_bench/cuvs_bench/split_groundtruth/split_groundtruth.pl @@ -0,0 +1,45 @@ +#!/usr/bin/perl + +# ============================================================================= +# Copyright (c) 2020-2023, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except +# in compliance with the License. 
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software distributed under the License
+# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+# or implied. See the License for the specific language governing permissions and limitations under
+# the License.
+
+use warnings;
+use strict;
+use autodie qw(open close);
+
+
+@ARGV == 2
+  or die "usage: $0 input output_prefix\n";
+
+open my $fh, '<:raw', $ARGV[0];
+
+my $raw;
+read($fh, $raw, 8);
+my ($nrows, $dim) = unpack('LL', $raw);
+
+my $expected_size = 8 + $nrows * $dim * (4 + 4);
+my $size = (stat($fh))[7];
+$size == $expected_size
+  or die("error: expected size is $expected_size, but actual size is $size\n");
+
+
+open my $fh_out1, '>:raw', "$ARGV[1].neighbors.ibin";
+open my $fh_out2, '>:raw', "$ARGV[1].distances.fbin";
+
+print {$fh_out1} $raw;
+print {$fh_out2} $raw;
+
+read($fh, $raw, $nrows * $dim * 4);
+print {$fh_out1} $raw;
+read($fh, $raw, $nrows * $dim * 4);
+print {$fh_out2} $raw;