-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
formally support int8 and uint8 within langchain and 2 distance metrics #561
Comments
I noticed that the underlying c++ library doesn't support float16 or bfloat16...where would I put in an issue for that support as well? |
Here is a prompt that you can feed to AI to get an initial draft. It's a starting point, but unfortunately I don't have the expertise to perfect it; otherwise, I would. In my experience, the AI is good at initial drafts and this will save time, but this level requires more expertise. DRAFT QUESTION TO AI TO GET START IMPLEMENTING int8/uint8 and add 2 new distance metrics
to something like: # Allow int8/uint8 types to pass through
if vector_type in [np.dtype('int8'), np.dtype('uint8')]:
input_vectors = np.array(embeddings).astype(vector_type)
else:
input_vectors = np.array(embeddings).astype(np.float32)
np.array([np.array(embedding).astype(np.float32)]).astype(np.float32) to: # Preserve original vector type
vector_type = self.vector_index.vector_type
np.array([np.array(embedding).astype(vector_type)]).astype(vector_type)
vectors[i] = np.array(embeddings[i], dtype=np.float32) would need to be modified to use the correct type. However, there are some important caveats:
You might want to add validation to ensure the int8/uint8 vectors stay within their valid ranges (-128 to 127 for int8, 0 to 255 for uint8). I want you to tell me if any of the source code or the picture of the source code files indicate where the euclidean distance is implemented. I want to check if, perhaps, the tiledb library supports it or whether langchain's tiledb.py simply only implements it as a design choice. Looking at the tiledb.py wrapper code, we can see INDEX_METRICS = frozenset(["euclidean"]) is defined at the top, but the actual distance calculation isn't implemented in this file. The closest reference to it is when the code calls the vector index's query method: d, i = self.vector_index.query( Here is ivf_flat_index.py: """ IVFFlatIndex is based on During ingestion, TileDB computes the To answer a query, the search focuses only on a small number of partitions, based on the query’s proximity IVFFlatIndex provides a vector search implementation that can trade-off accuracy for performance. Queries can be run in multiple modes:
import numpy as np from tiledb.cloud.dag import Mode TILE_SIZE_BYTES = 64000000 # 64MB def submit_local(d, func, *args, **kwargs): class IVFFlatIndex(index.Index):
def create(
Looking at ivf_flat_index.py, we can see that TileDB's vector search actually supports multiple distance metrics. This is evident in the create() function where it checks the distance metric: if ( TileDB supports: SUM_OF_SQUARES So the limitation to only Euclidean distance in the langchain wrapper (INDEX_METRICS = frozenset(["euclidean"])) appears to be a design choice by the wrapper authors, not a limitation of TileDB itself. if self.distance_metric == vspy.DistanceMetric.COSINE: So if you wanted to use other distance metrics, you would need to modify the langchain wrapper to expose TileDB's full range of supported distance metrics. Do you have enough information to confidently modify the langchain tiledb.py to support the other distance metrics as well as "int8" and "uint8"?
|
Draft revised tiledb.py for your consideration below. draft script
|
I moved this issue over to tiledb-vector-search to keep track. We won't be able to work on this immediately, but we do have some vector-search improvements scheduled in the coming months and may be able to include this. re: #561 (comment)
You can raise a feature request here: https://github.com/TileDB-Inc/TileDB |
did you give the correct link? |
Yes. (for clarity, I edited the comment to add a quote of the question I was responding to) |
Sorry for the overlapping posts, but I thought it prudent to initiate a separate issue if this is going to be worked on:
Here's a summary:
Tiledb.py
, langchain's integration with thetiledb
library, automatically converts embeddings tofloat32
here:np.array([np.array(embedding).astype(np.float32)]).astype(np.float32),
Relatively newer versions of
sentence-transformers
supportint8
,uint8
,binary
,ubinary
. Pull request hereTiledb
seems to supportint8
anduint8
but not the other two.Again, langchain's integration of
tiledb
withintiledb.py
doesn't distinguish and converts everything tofloat32
.Does that succinctly summarize the current state of affairs? Is it possible to at least modify tiledb.py to formally support
int8
anduint8
if not the other two? I noticed that @nikolaos did the initial integration in November, 2023. Is he still around at the company by chance? lol.Here is the related "issue" where I realized this for peoples' cross reference: TileDB-Inc/TileDB-Py#2130 (comment)
The text was updated successfully, but these errors were encountered: