A production-ready semantic search system built with NVIDIA cuVS, featuring advanced vector quantization and optimized memory access patterns for sub-100ms query latency on 35M high-dimensional embeddings.
This project implements a complete end-to-end neural search pipeline capable of handling large-scale semantic search across 35 million 768-dimensional Wikipedia embeddings. The system achieves 13x storage compression and <100ms average query latency through advanced indexing techniques and memory optimization.
- Problem: 35M vectors × 768 dims × 2 bytes = 53.76 GB storage requirement
- Solution: Implemented IVF-PQ (Inverted File with Product Quantization) using NVIDIA cuVS
- Result: Compressed to 3.36 GB (96 bytes/vector) - 93.7% storage reduction
Technical Details:
- 32,768 IVF clusters for coarse quantization
- 96 PQ sub-quantizers (8 dimensions each, 8 bits per code)
- Trained on 2M representative vectors, populated with remaining 33M in 500K batches
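The core of the compression step is a cuVS IVF-PQ build. The sketch below assumes the `cuvs.neighbors.ivf_pq` Python API (parameter and function names follow that interface and may differ between cuVS releases); the embedding file path and the batching comment are illustrative, not the project's actual code.

```python
import numpy as np
import cupy as cp
from cuvs.neighbors import ivf_pq

N_TRAIN, DIM = 2_000_000, 768
# Hypothetical path: the Wikipedia embeddings stored as a float16 memmap.
train = np.memmap("wiki_embeddings.f16", dtype=np.float16, mode="r",
                  shape=(N_TRAIN, DIM))

params = ivf_pq.IndexParams(
    n_lists=32_768,          # coarse quantization: 32,768 IVF clusters
    pq_dim=96,               # 96 sub-quantizers, 8 dimensions each
    pq_bits=8,               # 8 bits per code -> 96 bytes per vector
    metric="inner_product",  # embeddings are normalized
)
index = ivf_pq.build(params, cp.asarray(train, dtype=cp.float32))

# The remaining 33M vectors are then appended in 500K batches, e.g.
#   index = ivf_pq.extend(index, cp.asarray(batch), cp.asarray(ids))
# which encodes new vectors against the trained codebook without retraining.
```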
- Problem: Direct PQ search yielded only 60% recall@10
- Solution: Implemented two-stage retrieval with oversampling + reranking
- Result: Achieved ~89% recall@10 on the first `N_TRAIN` vectors (2M subset) while maintaining performance
Implementation:
Stage 1: Retrieve 1000 candidates using PQ approximation
Stage 2: Exact cosine similarity reranking on original embeddings (subset loaded into RAM)
Note: Candidates beyond the loaded subset are skipped during reranking. Load the list-wise memmap to score the full 35M corpus.
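A condensed sketch of this two-stage flow is shown below, assuming the `cuvs.neighbors.ivf_pq` search API and a memory-mapped float16 subset of the original embeddings held in RAM; the function and variable names are illustrative rather than the project's actual code.

```python
import numpy as np
import cupy as cp
from cuvs.neighbors import ivf_pq

def two_stage_search(index, query, subset, k=10, oversample=1000, n_probes=40):
    """Stage 1: oversampled PQ search; Stage 2: exact rerank on original vectors."""
    # Stage 1: approximate search over the compressed index (1000 candidates).
    search_params = ivf_pq.SearchParams(n_probes=n_probes)
    _, neighbors = ivf_pq.search(search_params, index,
                                 cp.asarray(query[None, :], dtype=cp.float32),
                                 oversample)
    candidate_ids = cp.asarray(neighbors).get()[0]

    # Skip candidates outside the subset loaded into RAM (see the note above).
    candidate_ids = candidate_ids[candidate_ids < subset.shape[0]]

    # Stage 2: exact inner product (cosine on normalized vectors) reranking.
    candidates = subset[candidate_ids].astype(np.float32)
    scores = candidates @ query.astype(np.float32)
    top = np.argsort(-scores)[:k]
    return candidate_ids[top], scores[top]
```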
- Problem: Random memory access for 1000 vectors took 800ms (unacceptable)
- Solution: List-wise memory layout optimization leveraging IVF structure
- Result: 1-10ms average retrieval, 100ms worst-case (80x improvement)
Key Insight: With n_probes=40, results come from max 40 lists. Arranging embeddings by list enables sequential memory access patterns.
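One way to build such a layout is sketched below: sort the vectors by their IVF list assignment, write them to a new memmap in that order, and keep the permutation plus per-list offsets for translating candidate IDs. The array names and output path are illustrative, and per-vector list assignments are assumed to be available from the index build.

```python
import numpy as np

def build_listwise_layout(embeddings, list_assignments, out_path="embeddings_listwise.f16"):
    """Reorder embeddings so vectors in the same IVF list are contiguous on disk."""
    order = np.argsort(list_assignments, kind="stable")      # new position -> original ID
    n_lists = int(list_assignments.max()) + 1
    # list_offsets[l] .. list_offsets[l + 1] is the row range holding list l.
    list_offsets = np.searchsorted(list_assignments[order], np.arange(n_lists + 1))

    out = np.memmap(out_path, dtype=np.float16, mode="w+", shape=embeddings.shape)
    for start in range(0, len(order), 500_000):              # copy in batches
        sel = order[start:start + 500_000]
        out[start:start + len(sel)] = embeddings[sel]
    out.flush()
    return order, list_offsets
```

With this layout, a query that probes 40 lists touches at most 40 contiguous regions of the file instead of up to 1000 scattered pages.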
- FastAPI-based REST endpoint with comprehensive error handling
- Async request processing with connection pooling
- Integrated monitoring and logging
- Memory-mapped file handling for efficient resource utilization
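The endpoint shape is roughly as follows; this is a minimal FastAPI sketch, with the route name, request model, and `SearchEngine` placeholder standing in for the project's actual `server.py`.

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    query: str
    k: int = 10

class SearchEngine:
    """Placeholder for the real engine: Cohere embedding + IVF-PQ search + rerank."""
    async def embed(self, text: str): ...
    def search(self, vector, k: int): ...
    def metadata(self, ids): ...

engine = SearchEngine()

@app.post("/search")
async def search(req: SearchRequest):
    try:
        vector = await engine.embed(req.query)        # Cohere API call (~70 ms)
        ids, scores = engine.search(vector, k=req.k)  # GPU index search + rerank
        return {"results": engine.metadata(ids), "scores": scores}
    except Exception as exc:                          # surface failures as HTTP 500s
        raise HTTPException(status_code=500, detail=str(exc))
```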
| Metric | Value | Baseline Comparison |
|---|---|---|
| Storage Compression | 13.0x | 53.76 GB → 3.36 GB |
| Query Latency (avg) | <100ms | - |
| Recall@10 | ~89% (subset) | 60% (direct PQ) |
| Memory Access Speed | 1-10ms | 800ms (random access) |
Recall measured on the first 2M embeddings using `src/search_index.py` with oversample + rerank.
Query → Cohere API (~70ms) → GPU Index Search (~20ms) → Memory-Optimized Retrieval (1-10ms) → Metadata Lookup (1-5ms) → Results (<100ms total)
- Index Training Pipeline (`src/train_index.py`)
  - Efficient data loading with multiprocessing
  - GPU memory monitoring and optimization
  - Configurable training parameters
- Optimized Search Engine (`src/search_index.py`)
  - Two-stage retrieval implementation
  - List-wise memory access patterns
  - Batch processing for metadata retrieval
- Production Server (`server.py`)
  - FastAPI with async processing
  - LMDB for metadata storage
  - Comprehensive error handling and logging
- Memory Management (`src/utils/`)
  - Custom offset-based caching
  - GPU memory monitoring
  - Validation utilities
- Coarse Quantization: 32,768 clusters using k-means
- Fine Quantization: 96 sub-quantizers with 256 centroids each
- Distance Metric: Inner product (optimized for normalized vectors)
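The per-vector footprint follows directly from these parameters, as the short calculation below shows (float16 raw storage vs. PQ codes; index overhead not included).

```python
# Worked storage math for the quantization parameters above (35M x 768-dim float16).
n_vectors, dim = 35_000_000, 768

raw_bytes_per_vector = dim * 2      # 1536 bytes of float16 per vector
pq_bytes_per_vector = 96 * 8 // 8   # 96 sub-quantizers x 8 bits = 96 bytes of codes

print(f"raw:   {n_vectors * raw_bytes_per_vector / 1e9:.2f} GB")  # 53.76 GB
print(f"codes: {n_vectors * pq_bytes_per_vector / 1e9:.2f} GB")   # 3.36 GB
```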
Memory access, before and after the list-wise layout:

```python
# Original: Random access across 35M vectors
embeddings[random_ids]  # 800ms for 1000 vectors

# Optimized: List-wise sequential access (list_start/list_end come from the
# per-list offset table; local_offsets are positions within the active list)
for list_id in active_lists:
    embeddings[list_start:list_end][local_offsets]  # 1-10ms total
```

- LMDB for fast key-value retrieval
- Sorted access patterns for cache efficiency
- Binary encoding for space optimization
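A minimal sketch of the batched metadata lookup, assuming 8-byte big-endian vector IDs as keys; the database path and key scheme are illustrative, and the values hold whatever binary encoding the project stores.

```python
import lmdb

env = lmdb.open("metadata.lmdb", readonly=True, lock=False)  # hypothetical path

def lookup_metadata(ids):
    """Fetch metadata for a batch of vector IDs in one read transaction."""
    # Sorted keys keep B-tree page accesses mostly sequential (cache friendly).
    keys = sorted(int(i).to_bytes(8, "big") for i in ids)
    records = {}
    with env.begin(buffers=True) as txn:
        for key in keys:
            value = txn.get(key)        # None if the ID has no metadata entry
            if value is not None:
                records[key] = bytes(value)
    return records
```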
- Vector Processing: NVIDIA cuVS, CuPy, NumPy
- API Framework: FastAPI, Uvicorn
- Data Storage: LMDB, Memory-mapped files
- ML Pipeline: Hugging Face Datasets, Cohere API
- Monitoring: Custom GPU memory tracking, structured logging
- Horizontal Scaling: Stateless API design enables load balancing
- Memory Efficiency: Memory-mapped files reduce RAM requirements
- GPU Utilization: Batch processing maximizes throughput
- Cache Optimization: List-wise layout improves OS page cache hits
- NVIDIA GPU with CUDA 12.1+
- Python 3.8+
- 8GB+ GPU memory
```bash
# Clone repository
git clone <repository-url>
cd neural-search-engine

# Install dependencies
pip install -r requirements.txt
```
- `python src/train_index.py` – trains the IVF-PQ codebook on the first `N_TRAIN` embeddings and saves `trained_ivfpq_index.bin`.
- `python src/add_embeddings_to_index.py` – streams the remaining embeddings into the index and writes `trained_ivfpq_index_full.bin`.
- `python src/search_index.py` – measures recall/latency on the same `N_TRAIN` subset using oversample + rerank and logs results under `search_logs/`.
- To evaluate recall on the entire dataset, load the list-wise memmap so reranking can score every candidate ID returned by IVF-PQ.
- Multi-GPU Support: Distribute index across multiple GPUs
- Dynamic Updates: Implement incremental index updates
- Query Optimization: Adaptive n_probes based on query characteristics
- Caching Layer: Redis for frequently accessed results
- Memory Access Patterns Matter: Achieved 80x speedup through data layout optimization
- Quantization Trade-offs: Balanced compression vs. accuracy through two-stage retrieval
- System Integration: Built complete pipeline from training to production deployment
- Performance Monitoring: Implemented comprehensive logging for optimization insights