Multi-Vector Search (MVS) is designed to manage multi-modal or multi-view data, particularly in light of the rise of multi-modal large language models. In this context, an object or query comprises multiple vectors, and similarity between objects is determined by distances across multiple vector pairs. Existing Vector Search (VS) methods, such as HNSW, are limited to handling queries or objects with a single vector. Recent approaches to MVS, which rely on VS using multiple single-space indexes, often face challenges in efficiency and accuracy, largely due to the inherent constraints of single-space indexes.
In our paper, Empowering High-Dimensional Multi-Vector Search, we propose a novel Multi-Space Graph index, termed MSG, specifically designed to address the MVS problem.
This repository provides the source code, datasets, and additional implementation details used in the experiments.
- VBase (OSDI'23): VBase, developed by Microsoft, represents the state-of-the-art in Multi-Vector Search (MVS) methodologies. It employs index scanning optimization across multiple single-space indexes to enhance search efficiency and accuracy. This approach positions VBase as a leading solution in the field, particularly for handling complex multi-modal or multi-view data.
- Milvus (SIGMOD'21): Milvus, released by Zilliz, utilizes candidate merging optimization with multiple single-space indexes to address the challenges of MVS. This method focuses on improving the scalability and performance of the merging operation, making it a robust tool for applications requiring multi-vector processing.
- Merging (TPAMI'14, IR'05): Merging is an earlier MVS technique designed for hybrid queries, relying on multiple single-space indexes to compute similarities. While it laid the groundwork for modern MVS approaches, its reliance on simple merging operation introduces limitations in efficiency and accuracy compared to more recent advancements. Nonetheless, it remains a foundational method in the evolution of MVS technologies.
The Multi-Space Graph (MSG) index introduces a modular architecture that separates computation acceleration and index compression, enabling seamless integration and removal of components as needed. This design ensures adaptability, allowing MSG to efficiently handle multi-vector queries with diverse vector combinations and weighting schemes within a unified index framework. By decoupling the index layout and computation acceleration from the search procedure, MSG provides unparalleled flexibility, facilitating diverse configuration choices and enabling further optimizations tailored to specific use cases. This modular and adaptive approach positions MSG as a versatile and scalable solution for high-dimensional multi-vector search challenges.
The detailed data format instructions can be found here.
Dataset | # |
||||
---|---|---|---|---|---|
Recipe (link) | 1.3M | 2 | 2,048 | 1~2 | |
MIT-States (link) | 2.1M | 6 | 3,456 | 1~6 | |
CelebA (link) | 1M | 4 | 2,304 | 1~4 | |
FashionIQ (link) | 1M | 2 | 1,024 | 1~2 | |
MS-COCO (link) | 1M | 3 | 1,536 | 1~3 | |
Shopping* | 1M | 2 | 1,024 | 1~2 | |
MIT-States+ (link) | 16M | 6 | 3,456 | 1~2 |
*Please contact the author of the dataset to get access to the images.
$n$ is the number of objects.$m$ is the number of vectors in each object.$t$ is the number of vectors in a multi-vector query.$D_m$ is the total vector dimension in an object. #$q$ is the number of queries.$w_i$ is the weight of the$i$ -th vector distance. It is dynamic at the per-query level.
Since the original data scale is small (e.g., the scale of CelebA is merely 200K), we expand the datasets using generative models (Edwards, et al. 2022, Yang, et al. 2020), which allows us to create additional samples from the learned distribution of the real data.
GCC 4.8+ with OpenMP
CMake 2.8+
Eigen
(i) Generate transformation matrix
cd ./data
python eigen_vector_matrix.py
(ii) Index construction
cd ./script
./index_msg.sh
(iii) Index compression
cd ./script
./index_compress.sh
(iv) Search
cd ./script
./search_msg.sh
Search performance of multi-vector queries for different methods (Latency : ms). The bold values are the best.
We express our gratitude for the ADSampling repository, which greatly inspires our implementation.