Conversation

@metonymic-smokey metonymic-smokey commented May 9, 2025

Include a binary vector index in the vector index section, alongside the float vector index.
Currently, BFlat and BIVF indexes are supported.

  1. Modify the vector index cache to cache binary indexes after reading.
  2. Indexing path: Modify the write and merge segment functionality to include binary vector indexes.
  3. Search path: Perform an oversampled search on the binary index, then use the candidate results to perform individual exact distance computations (rescoring and reranking) to pick the top K.
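The rescoring step in (3) can be sketched roughly as below. This is an illustration, not the PR's actual code: `rescoreTopK` and `exactDistance` are hypothetical helpers, and the `candidates` slice is assumed to be the oversampled ID set returned by the binary index search.

```go
package main

import (
	"fmt"
	"sort"
)

// exactDistance computes the squared L2 distance between two float vectors
// (stand-in for the per-document rescoring step; hypothetical helper).
func exactDistance(a, b []float32) float32 {
	var sum float64
	for i := range a {
		d := float64(a[i] - b[i])
		sum += d * d
	}
	return float32(sum)
}

// rescoreTopK takes the oversampled candidate IDs returned by the binary
// index, recomputes exact distances against the stored float vectors, and
// keeps the k closest. vectors maps candidate ID -> original float vector.
func rescoreTopK(query []float32, candidates []int64,
	vectors map[int64][]float32, k int) ([]int64, []float32) {
	type scored struct {
		id   int64
		dist float32
	}
	rescored := make([]scored, 0, len(candidates))
	for _, id := range candidates {
		rescored = append(rescored, scored{id, exactDistance(query, vectors[id])})
	}
	sort.Slice(rescored, func(i, j int) bool { return rescored[i].dist < rescored[j].dist })
	if k > len(rescored) {
		k = len(rescored)
	}
	ids := make([]int64, k)
	dists := make([]float32, k)
	for i := 0; i < k; i++ {
		ids[i], dists[i] = rescored[i].id, rescored[i].dist
	}
	return ids, dists
}

func main() {
	vectors := map[int64][]float32{
		1: {0, 0}, 2: {1, 1}, 3: {3, 3},
	}
	// Binary search returned candidates {3, 1, 2}; exact rescoring keeps the 2 closest.
	ids, _ := rescoreTopK([]float32{0, 0}, []int64{3, 1, 2}, vectors, 2)
	fmt.Println(ids) // [1 2]
}
```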

@metonymic-smokey metonymic-smokey marked this pull request as ready for review May 12, 2025 15:34
// required info to create a cache entry.
type cacheEntryReqs struct {
	alpha float64
	index *faiss.IndexImpl

maybe just have the indexes array itself here

indexes []*faiss.IndexImpl

func (vc *vectorIndexCache) insertLOCKED(fieldIDPlus1 uint16,
	index *faiss.IndexImpl, vecDocIDMap map[int64]uint32, loadDocVecIDMap bool,
	docVecIDMap map[uint32][]int64) {
func (vc *vectorIndexCache) insertLOCKED(fieldIDPlus1 uint16, ce cacheEntryReqs) {

Take a pointer to the cacheEntryReqs struct; don't copy the struct.

@metonymic-smokey metonymic-smokey force-pushed the bq branch 2 times, most recently from 097fc02 to e652071 Compare May 19, 2025 11:22
}

// return packed binary vectors.
func convertToBinary(vecs []float32) []uint8 {

Rename to vec, as it's only one vector being binarized here (nit).


// return packed binary vectors.
func convertToBinary(vecs []float32) []uint8 {
var packed []uint8

Preallocate packed:

packedLen := (len(vec) + 7) / 8
packed := make([]uint8, 0, packedLen)

bitCount = 0
}

// Optionally, you can handle cases where the number of floats isn't a multiple of 8

Remove this comment, as it's misleading: this case is already handled at L596.

}

// Shift the bit into the correct position in the byte
currentByte |= (bit << (7 - bitCount))

The bit variable seems redundant to me. Isn't the code below equivalent?

if value >= 0.0 {
	currentByte |= (1 << (7 - bitCount))
}
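Putting the review suggestions above together (a single `vec` parameter, a preallocated output slice, and no intermediate `bit` variable), a hedged sketch of the binarization helper might look like this. It is an illustration of the suggested shape, not the merged code:

```go
package main

import "fmt"

// convertToBinary packs one float vector into a sign-bit representation:
// each component >= 0 contributes a 1 bit, negatives a 0 bit, MSB first.
// The output is preallocated and padded to a whole number of bytes.
func convertToBinary(vec []float32) []uint8 {
	packedLen := (len(vec) + 7) / 8
	packed := make([]uint8, packedLen)
	for i, value := range vec {
		if value >= 0.0 {
			// Set the bit directly, no intermediate "bit" variable needed.
			packed[i/8] |= 1 << (7 - i%8)
		}
	}
	return packed
}

func main() {
	// Signs 1,0,1,1 followed by implicit zero padding -> 10110000.
	fmt.Printf("%08b\n", convertToBinary([]float32{0.5, -1, 2, 3})[0])
}
```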

}
defer binaryFaissIndex.Close()

bvecs := convertToBinary(indexData)

Same issue as before: if D is a multiple of 8 this works, but otherwise we will create a wrong index itself (a binary index where multiple X-dimensional vectors from different documents get compressed into one packed vector, thereby losing the posting data).
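One way to address this concern is to pack each vector separately and pad it to a byte boundary, so adjacent documents' vectors never share a byte even when the dimension is not a multiple of 8. The sketch below is a hypothetical batch helper illustrating that idea, not the PR's code; each vector then occupies (dims+7)/8 bytes, matching a binary index declared with the padded dimension.

```go
package main

import "fmt"

// convertVectorsToBinary packs each dims-dimensional vector in data
// separately, padding every vector to a whole number of bytes so that
// vectors from different documents never bleed into one another.
func convertVectorsToBinary(data []float32, dims int) []uint8 {
	bytesPerVec := (dims + 7) / 8
	numVecs := len(data) / dims
	packed := make([]uint8, numVecs*bytesPerVec)
	for v := 0; v < numVecs; v++ {
		base := v * bytesPerVec // byte offset where this vector starts
		for i, value := range data[v*dims : (v+1)*dims] {
			if value >= 0.0 {
				packed[base+i/8] |= 1 << (7 - i%8)
			}
		}
	}
	return packed
}

func main() {
	// Two 4-dimensional vectors: each gets its own byte despite 4 < 8.
	p := convertVectorsToBinary([]float32{1, 1, 1, 1, -1, -1, -1, 1}, 4)
	fmt.Printf("%08b %08b\n", p[0], p[1]) // 11110000 00010000
}
```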

Comment on lines +401 to +416
h := &maxHeap{}
heap.Init(h)
for i := 0; i < len(binIDs); i++ {
	heap.Push(h, &distanceID{distance: distances[i], id: binIDs[i]})
	if h.Len() > int(k) {
		heap.Pop(h)
	}
}

// Pop the top K in reverse order to get them in ascending order
ids := make([]int64, k)
scores := make([]float32, k)
for i := int(k) - 1; i >= 0; i-- {
	distanceID := heap.Pop(h).(*distanceID)
	scores[i] = distanceID.distance
	ids[i] = distanceID.id
}
@CascadingRadium commented May 29, 2025


At this stage, we already have the complete set of ids and their corresponding distances available for Top-K selection. This differs from the K-sized heap approach used in Bleve (see knn.go#L116), where the full result set (N*K) is not known ahead of time and must be streamed or iterated over, which makes the K-sized heap optimal there.

In our case, since the entire dataset is available upfront, a heap is unnecessarily expensive, particularly in space complexity, O(K). This overhead can compound significantly with the number of segments in a highly distributed index, as the heap becomes temporary query-specific state for each segment.

Given that, it would be more efficient to use a Randomized QuickSelect algorithm (https://courses.grainger.illinois.edu/cs473/sp2015/w/lec/15_notes.pdf) for in-place Top-K selection: O(1) instead of O(K) extra space, and expected O(N) time instead of O(N log K) for the heap. This approach reduces memory usage, avoids unnecessary allocations, and suits batch-style processing where the full dataset is already materialized. It also supports the point that we do not actually need a sorted list of distances and ids; we just need the top K in any order, since the upstream implementations already sort the final result set by score. Although this might be a bit of a stretch and potentially not impactful, would you mind coding this out and benchmarking the performance? I understand it may be a time sink, but it could help validate whether the Randomized QuickSelect approach is worthwhile in our context. Thanks a lot!
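The proposed alternative could look roughly like the sketch below: an in-place randomized QuickSelect that partitions the distances/ids slices so the k smallest distances land in the first k slots, in no particular order. This is a sketch of the reviewer's suggestion, not the merged implementation; the function and variable names are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// topKQuickSelect partitions distances and ids in place so that the k
// smallest distances (and their matching ids) occupy the first k slots,
// unordered. O(1) extra space, expected O(N) time.
func topKQuickSelect(distances []float32, ids []int64, k int) {
	lo, hi := 0, len(distances)-1
	for lo < hi {
		// Move a random pivot to the end, then partition around it.
		p := lo + rand.Intn(hi-lo+1)
		distances[p], distances[hi] = distances[hi], distances[p]
		ids[p], ids[hi] = ids[hi], ids[p]
		pivot, store := distances[hi], lo
		for i := lo; i < hi; i++ {
			if distances[i] < pivot {
				distances[i], distances[store] = distances[store], distances[i]
				ids[i], ids[store] = ids[store], ids[i]
				store++
			}
		}
		distances[store], distances[hi] = distances[hi], distances[store]
		ids[store], ids[hi] = ids[hi], ids[store]
		switch {
		case store == k-1 || store == k:
			return // first k slots now hold the k smallest
		case store < k:
			lo = store + 1 // top-k boundary lies to the right
		default:
			hi = store - 1 // recurse into the left partition
		}
	}
}

func main() {
	d := []float32{9, 1, 8, 2, 7, 3}
	ids := []int64{90, 10, 80, 20, 70, 30}
	topKQuickSelect(d, ids, 3)
	fmt.Println(ids[:3]) // ids of the 3 smallest distances, in any order
}
```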

@CascadingRadium

Please:

  • rebase on top of master
  • update zap.md to reflect the new index file format
  • update command line tooling to work with binary index
