MB-62985 - Add support for binary quantised vectors. #329
base: master
Conversation
```go
// required info to create a cache entry.
type cacheEntryReqs struct {
	alpha float64
	index *faiss.IndexImpl
```
maybe just have the indexes array itself here:

```go
indexes []*faiss.IndexImpl
```
```diff
-func (vc *vectorIndexCache) insertLOCKED(fieldIDPlus1 uint16,
-	index *faiss.IndexImpl, vecDocIDMap map[int64]uint32, loadDocVecIDMap bool,
-	docVecIDMap map[uint32][]int64) {
+func (vc *vectorIndexCache) insertLOCKED(fieldIDPlus1 uint16, ce cacheEntryReqs) {
```
take a pointer to the `cacheEntryReqs` struct, don't copy the struct
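The suggestion above can be sketched as follows. This is illustrative only: `faiss.IndexImpl` is replaced by a placeholder type, and the struct fields and cache shape are assumed from the diff excerpts in this thread, not taken from the PR's actual code.

```go
package main

import "fmt"

// indexImpl is a hypothetical stand-in for *faiss.IndexImpl so the
// sketch is self-contained.
type indexImpl struct{ name string }

// cacheEntryReqs mirrors the fields shown in the diff above (assumed).
type cacheEntryReqs struct {
	alpha   float64
	indexes []*indexImpl
}

type vectorIndexCache struct {
	entries map[uint16]*cacheEntryReqs
}

// insertLOCKED takes *cacheEntryReqs so the struct (its float, slice
// header, and any future fields) is not copied on every call.
func (vc *vectorIndexCache) insertLOCKED(fieldIDPlus1 uint16, ce *cacheEntryReqs) {
	vc.entries[fieldIDPlus1] = ce
}

func main() {
	vc := &vectorIndexCache{entries: map[uint16]*cacheEntryReqs{}}
	ce := &cacheEntryReqs{alpha: 0.5, indexes: []*indexImpl{{name: "bflat"}}}
	vc.insertLOCKED(1, ce)
	fmt.Println(vc.entries[1].alpha)
}
```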
force-pushed from 097fc02 to e652071
```go
}

// return packed binary vectors.
func convertToBinary(vecs []float32) []uint8 {
```
rename to `vec` as it's only 1 vector being binarized here (nit)
```go
// return packed binary vectors.
func convertToBinary(vecs []float32) []uint8 {
	var packed []uint8
```
prealloc `packed`:

```go
packedLen := (len(vec) + 7) / 8
packed := make([]uint8, 0, packedLen)
```
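Putting the preallocation together with the sign-bit packing discussed below, the whole function might look like this. This is a sketch of the reviewer's suggestions, not the PR's implementation; the `vec` parameter name follows the rename suggested above.

```go
package main

import "fmt"

// convertToBinary packs one float32 vector into bits, one bit per
// dimension (1 when the value is >= 0), most significant bit first.
// packed is preallocated to ceil(len(vec)/8) bytes; a partial final
// byte is zero-padded.
func convertToBinary(vec []float32) []uint8 {
	packedLen := (len(vec) + 7) / 8
	packed := make([]uint8, 0, packedLen)
	var currentByte uint8
	bitCount := 0
	for _, value := range vec {
		if value >= 0.0 {
			currentByte |= 1 << (7 - bitCount)
		}
		bitCount++
		if bitCount == 8 {
			packed = append(packed, currentByte)
			currentByte = 0
			bitCount = 0
		}
	}
	if bitCount > 0 { // flush the zero-padded partial final byte
		packed = append(packed, currentByte)
	}
	return packed
}

func main() {
	// 9 dims: first 8 alternate +/- (0b10101010 = 170), 9th is positive
	// and lands in the top bit of a padded second byte (0b10000000 = 128).
	fmt.Println(convertToBinary([]float32{1, -1, 2, -3, 0.5, -0.5, 4, -4, 9}))
}
```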
```go
		bitCount = 0
	}

	// Optionally, you can handle cases where the number of floats isn't a multiple of 8
```
remove this comment as it's misleading, since this is already done in L596
```go
}

// Shift the bit into the correct position in the byte
currentByte |= (bit << (7 - bitCount))
```
`bit` seems redundant to me; isn't the code below equivalent?

```go
if value >= 0.0 {
	currentByte |= (1 << (7 - bitCount))
}
```
```go
}
defer binaryFaissIndex.Close()

bvecs := convertToBinary(indexData)
```
Same issue as before: this works if D is a multiple of 8. Otherwise we create an incorrect index itself (a binary index where multiple X-dimensional vectors from different documents get compressed into one vector, thereby losing the posting data).
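One way to address this is to pack each vector separately, padding every vector to its own ceil(D/8) bytes so bits from different documents never share a byte. The function and parameter names below are illustrative, not the PR's:

```go
package main

import "fmt"

// packVectors packs a flat buffer of nvecs concatenated D-dimensional
// float32 vectors into binary form, one bit per dimension (1 when the
// value is >= 0). Each vector is padded to its own ceil(dims/8) bytes,
// so a non-multiple-of-8 D cannot merge vectors from different docs.
func packVectors(data []float32, dims int) []uint8 {
	bytesPerVec := (dims + 7) / 8
	nvecs := len(data) / dims
	packed := make([]uint8, 0, nvecs*bytesPerVec)
	for v := 0; v < nvecs; v++ {
		vec := data[v*dims : (v+1)*dims]
		var cur uint8
		bits := 0
		for _, f := range vec {
			if f >= 0 {
				cur |= 1 << (7 - bits)
			}
			bits++
			if bits == 8 {
				packed = append(packed, cur)
				cur, bits = 0, 0
			}
		}
		if bits > 0 { // per-vector zero padding
			packed = append(packed, cur)
		}
	}
	return packed
}

func main() {
	// two 5-dimensional vectors -> 2 padded bytes, not a shared 10-bit stream
	out := packVectors([]float32{1, 1, 1, 1, 1, -1, -1, -1, -1, -1}, 5)
	fmt.Println(len(out), out)
}
```

This matches the per-vector byte layout that binary indexes expect (each code occupying a fixed number of whole bytes).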
```go
h := &maxHeap{}
heap.Init(h)
for i := 0; i < len(binIDs); i++ {
	heap.Push(h, &distanceID{distance: distances[i], id: binIDs[i]})
	if h.Len() > int(k) {
		heap.Pop(h)
	}
}

// Pop the top K in reverse order to get them in ascending order
ids := make([]int64, k)
scores := make([]float32, k)
for i := int(k) - 1; i >= 0; i-- {
	distanceID := heap.Pop(h).(*distanceID)
	scores[i] = distanceID.distance
	ids[i] = distanceID.id
}
```
At this stage, we already have the complete set of ids and their corresponding distances available for Top-K selection. This differs from the K-sized heap approach used in Bleve (see knn.go#L116), where the full result set (N*K) is not known ahead of time and must be streamed or iterated over, which makes the K-sized heap optimal there.

In our case, since the entire dataset is available upfront, using a heap becomes unnecessarily expensive, particularly in terms of space complexity, O(K). This overhead can compound significantly with the number of segments in a highly distributed index, as the heap becomes temporary query-specific state for each segment.

Given that, it would be more efficient to use a randomized QuickSelect algorithm (https://courses.grainger.illinois.edu/cs473/sp2015/w/lec/15_notes.pdf) for in-place Top-K selection: O(1) auxiliary space vs O(K), and excellent average-case time, O(N), compared to O(N log K) for the heap. This also supports the argument that we do not actually need a sorted list of distances and ids; we just need the top K in any order, as the upstream implementations already sort the final result set by score. Although this might be a bit of a stretch and potentially not impactful, would you mind coding this out and benchmarking the performance? I understand it may be a time sink, but it could help validate whether the randomized QuickSelect approach is worthwhile in our context. Thanks a lot!
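The suggestion above could be sketched like this: an in-place randomized QuickSelect over parallel distance/id slices that leaves the K smallest distances (and their ids) in the first K positions, in arbitrary order. Names (`selectTopK`, `partition`) are illustrative, not from the PR:

```go
package main

import (
	"fmt"
	"math/rand"
)

// selectTopK rearranges distances and ids in place so that the k
// smallest distances occupy positions [0, k), in unspecified order.
// O(1) auxiliary space, O(N) expected time.
func selectTopK(distances []float32, ids []int64, k int) {
	lo, hi := 0, len(distances)-1
	for lo < hi {
		p := partition(distances, ids, lo, hi)
		switch {
		case p == k-1:
			return
		case p < k-1:
			lo = p + 1
		default:
			hi = p - 1
		}
	}
}

// partition performs a Lomuto partition of d[lo..hi] around a random
// pivot, moving ids in lockstep, and returns the pivot's final index.
func partition(d []float32, ids []int64, lo, hi int) int {
	pi := lo + rand.Intn(hi-lo+1) // random pivot guards the worst case
	d[pi], d[hi] = d[hi], d[pi]
	ids[pi], ids[hi] = ids[hi], ids[pi]
	pivot := d[hi]
	i := lo
	for j := lo; j < hi; j++ {
		if d[j] < pivot {
			d[i], d[j] = d[j], d[i]
			ids[i], ids[j] = ids[j], ids[i]
			i++
		}
	}
	d[i], d[hi] = d[hi], d[i]
	ids[i], ids[hi] = ids[hi], ids[i]
	return i
}

func main() {
	d := []float32{0.9, 0.1, 0.5, 0.3, 0.7}
	ids := []int64{9, 1, 5, 3, 7}
	selectTopK(d, ids, 2)
	fmt.Println(d[:2], ids[:2]) // the two smallest distances, order unspecified
}
```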
pls
Include a binary vector index in the vector index section, alongside the float vector index.
Currently, BFlat and BIVF indexes are supported.