Skip to content

Latest commit

 

History

History
112 lines (82 loc) · 5.58 KB

README.md

File metadata and controls

112 lines (82 loc) · 5.58 KB

PSP - Proximity Graph with Spherical Pathway for MIPS

This repository contains the source code for our paper: Maximum Inner Product is Query-Scaled Nearest Neighbor (VLDB25).

1 Abstract

This paper presents a novel theoretical framework that equates MIPS with NNS without requiring space transformation, thereby allowing us to leverage advanced graph-based indices for NNS and efficient edge pruning strategies, significantly reducing unnecessary computations.

2 Competitors

The initial code for ip-nsw and ip-nsw+ came from the original papers, and we reconstructed the relevant code using hnswlib 0.8.0. For Möbius-graph and NAPG, we did not find available open-source code, so we implemented our own version, which will also be open-sourced for relevant evaluations. ScaNN is a commercial product, so we used the open-source version of the code, it achieves excellent results through careful parameter tuning. Its official examples do not enable multithreading. If you are conducting evaluations using multithreading, please invoke the relevant functions.

  • ip-NSW (Paper): A graph based method using inner product navigable small world graph.

  • ip-NSW+ (Paper): An enhancement of ip-NSW that introduces an additional angular proximity graph.

  • Möbius-Graph (Paper): A graph based method that reduces the MIPS problem to an NNS problem using Möbius transformation. Since the original code is not available, we implemented a version based on the paper.

  • NAPG (Paper): A recent graph-based method claiming state-of-the-art performance by improving ip-NSW with a specialized metric, using an adaptive $\alpha$ for different norm distributions.

  • Fargo (Paper): The latest state-of-the-art LSH based method with theoretical guarantees.

  • ScaNN (Paper): The state-of-the-art quantization method.

3 Datasets

The data format is: Number of vector (n) * Dimension (d).

*: Data source will be released upon publication.

Dataset Base Size Dim Query Size Modality
MNIST (link) 60,000 784 10,000 Image
DBpedia100K (link) 100,000 3072 1,000 Text
DBpedia1M (link) 1,000,000 1536 1,000 Text
Music100 (link) 1,000,000 100 10,000 Audio
Text2Image1M (link) 1,000,000 200 100,000 Multi
Text2Image10M (link) 10,000,000 200 100,000 Multi
Laion10M (link) 12,244,692 512 1,000 Multi
Commerce100M* 100,279,529 48 64,111 E-commerce

4 Building Instruction

Prerequisites

  • GCC 4.9+ with OpenMP
  • CMake 2.8+
  • Boost 1.55+
  • Faiss (optional)

Compile On Linux

$ mkdir build/ && cd build/
$ cmake ..
$ make -j

5 Usage

Code Structure

  • datasets: datasets
  • include: C++ class interface
  • output: PSP index, k-MIP result file
  • script: some scripts to run the experiments
  • src: main function implementation
  • test: test codes

How to use

Step 1. Build kNN Graph

Firstly, we need to prepare a kNN graph. You can use Faiss and other libs.

Step 2. PSP indexing

./test/test_mips_index DATA_PATH KNNG_PATH L R Angle M PSP_PATH DIM
  • DATA_PATH is the path of the base data in bin format.
  • KNNG_PATH is the path of the pre-built kNN graph in Step 1..
  • L candidate pool size.
  • Rmaximum out-degree.
  • Angle minimal angle between edges.
  • M IP neighbor.
  • PSP_PATH is the path of the generated PSP index.
  • DIM dimension of dataset.

Step 3. PSP searching

./test/test_mips_search DATA_PATH QUERY_PATH PSP_PATH searh_L K RESULT_PATH DIM
  • DATA_PATH is the path of the base data in bin format.
  • QUERY_PATH is the path of the query data in bin format.
  • PSP_PATH is the path of the generated PSP index.
  • search_L search pool size, the larger the better but slower (must larger than K).
  • K the result size.
  • PSP_PATH is the path of the result neighbors.
  • DIM dimension of dataset.

6 To-Do Lists

  • ✅ Open-source code is available for the major components of the original paper.
  • ✅ Real-world evaluation scenarios of Commerce100M on the Shopee platform.
  • 🔄 More automated entry point selection and early stopping techniques.
  • 🔄 Improve compatibility of SIMD-related PQ codes.
  • 🔄 Python wrapper.
  • 🔄 A wider range of datasets with diverse norm distributions.

7 Performance

Evaluation Metric

  • QPS, Distance computation (for graph-based method)

evaluation