TASTI Usage
TASTI is an index that generates high-quality proxy scores from semantic similarity across embeddings. These scores can be used in existing proxy-based query processing algorithms (e.g., for aggregation, selection, or inference service optimization). TASTI uses the FPF (furthest point first) algorithm to choose cluster representatives and a vector database to find the closest cluster representatives for each data record. From the labels of these representatives and each record's distances to them, TASTI efficiently generates a proxy score for each record, without query-specific training or extensive annotations.
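To make the scoring step concrete, here is a minimal sketch that computes a proxy score as an inverse-distance-weighted average of the labels of each record's nearest representatives. The weighting scheme, the function name `proxy_scores`, and the array layout are assumptions made for illustration; TASTI's internal formula may differ.

```python
import numpy as np

def proxy_scores(rep_labels, topk_reps, topk_dists, eps=1e-8):
    """Inverse-distance-weighted average of representative labels (illustrative).

    rep_labels: (n_reps,) label or score of each representative
    topk_reps:  (n_records, k) indices into rep_labels of the nearest representatives
    topk_dists: (n_records, k) distances from each record to those representatives
    """
    rep_labels = np.asarray(rep_labels, dtype=float)
    topk_reps = np.asarray(topk_reps)
    # Closer representatives get larger weights; eps avoids division by zero.
    weights = 1.0 / (np.asarray(topk_dists, dtype=float) + eps)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights * rep_labels[topk_reps]).sum(axis=1)

# Toy data: three labeled representatives, two records with k = 2.
rep_labels = [0.0, 1.0, 3.0]
topk_reps = [[0, 1], [2, 1]]
topk_dists = [[1.0, 1.0], [0.0, 5.0]]
scores = proxy_scores(rep_labels, topk_reps, topk_dists)
```

The first record sits equally far from representatives labeled 0 and 1, so its score lands halfway between them; the second record coincides with a representative and effectively inherits that representative's label.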
We've also designed a limit engine that ranks data records by their proxy scores and processes them sequentially in that order. This conserves resources by reducing the number of deep learning model invocations.
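The idea behind the limit engine can be sketched as follows, assuming a hypothetical `oracle` callable that stands in for the expensive deep learning model. The function name `limit_query` and its return values are illustrative only, not the engine's actual interface.

```python
def limit_query(record_ids, proxy_scores, oracle, limit):
    """Return up to `limit` ids accepted by `oracle`, checking high proxy scores first."""
    ranked = sorted(zip(record_ids, proxy_scores), key=lambda pair: pair[1], reverse=True)
    matches, calls = [], 0
    for record_id, _ in ranked:
        if len(matches) >= limit:
            break  # enough results, stop invoking the model
        calls += 1
        if oracle(record_id):
            matches.append(record_id)
    return matches, calls

# Toy example: records 2, 5 and 7 are the true positives.
positives = {2, 5, 7}
ids = list(range(10))
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.8, 0.2, 0.4, 0.05, 0.0]
matches, calls = limit_query(ids, scores, lambda i: i in positives, limit=2)
```

When the proxy scores are informative, the loop stops after only a few oracle calls instead of scanning the whole table.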
A vector database is a specialized database designed to store and efficiently query high-dimensional vectors, enabling rapid retrieval of similar embeddings through vector search. Currently, we support three vector databases.
Vector Database | Features |
---|---|
FAISS | Fastest. Local vector database. Returns exact results by default. Supports GPU |
Chroma | Local vector database. Uses HNSW, so results are approximate rather than exact |
Weaviate | Online vector database. Extremely slow when reading data from the database. Results are approximate rather than exact |
You can use our interface to initialize a vector database and insert data.
FAISS
import pandas as pd
from aidb.vector_database.faiss_vector_database import FaissVectorDatabase
data = pd.DataFrame({'id':[], 'values':[]}) # id: vector id, values: embedding value
user_database = FaissVectorDatabase(index_path='./vector_database')
user_database.create_index(index_name='tasti', embedding_dim=128, recreate_index=True)
user_database.insert_data('tasti', data)
user_database.save_index(index_name='tasti') # saves the index to index_path on local disk
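The empty `data` frame above is a placeholder. One way to fill it, assuming the `values` column holds one embedding vector per row (an assumption based on the column comments, not a confirmed requirement), is sketched below with random vectors standing in for real embeddings.

```python
import numpy as np
import pandas as pd

# Random vectors stand in for real embeddings; 128 matches embedding_dim above.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128)).astype(np.float32)

data = pd.DataFrame({
    'id': np.arange(len(embeddings)),  # unique vector id per record
    'values': list(embeddings),        # one 128-dimensional vector per row
})
```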
Chroma
import pandas as pd
from aidb.vector_database.chroma_vector_database import ChromaVectorDatabase
data = pd.DataFrame({'id':[], 'values':[]}) # id: vector id, values: embedding value
user_database = ChromaVectorDatabase(index_path='./vector_database')
user_database.create_index(index_name='tasti', recreate_index=True)
user_database.insert_data('tasti', data)
Weaviate
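import os
import pandas as pd
# module path assumed to follow the same pattern as the FAISS and Chroma backends
from aidb.vector_database.weaviate_vector_database import WeaviateAuth, WeaviateVectorDatabase
url = '' # URL of your Weaviate instance
api_key = os.environ.get('WEAVIATE_API_KEY')
weaviate_auth = WeaviateAuth(url, api_key=api_key)
data = pd.DataFrame({'id':[], 'values':[]}) # id: vector id, values: embedding value
user_database = WeaviateVectorDatabase(weaviate_auth)
user_database.create_index(index_name='tasti', recreate_index=True)
user_database.insert_data('tasti', data)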
TASTI uses the FPF algorithm to choose cluster representatives, then uses the vector database to find the k nearest cluster representatives for each data record and compute the distances to them.
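For intuition, furthest-point-first selection can be sketched in a few lines of NumPy. This is a plain greedy FPF over in-memory embeddings with a hypothetical function name `fpf`; it does not model the random bucket fraction controlled by `percent_fpf` and is not TASTI's actual implementation.

```python
import numpy as np

def fpf(embeddings, n_reps, seed=0):
    """Greedy furthest-point-first selection; returns indices of the chosen representatives."""
    embeddings = np.asarray(embeddings, dtype=float)
    rng = np.random.default_rng(seed)
    reps = [int(rng.integers(len(embeddings)))]  # random starting point
    # Distance from every point to its closest representative chosen so far.
    min_dists = np.linalg.norm(embeddings - embeddings[reps[0]], axis=1)
    while len(reps) < min(n_reps, len(embeddings)):
        nxt = int(np.argmax(min_dists))  # the point furthest from all current representatives
        reps.append(nxt)
        new_dists = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dists = np.minimum(min_dists, new_dists)
    return reps

# Three well-separated groups: {0, 1}, {2, 3} and {4}.
points = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.0, 0.1], [0.0, 10.0]])
reps = fpf(points, n_reps=3, seed=0)
```

Because each step picks the point furthest from every representative chosen so far, the three representatives end up spread across the three groups regardless of the random starting point.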
tasti_index = Tasti('tasti', vector_ids, user_database, nb_buckets, percent_fpf, seed, reps=None) # initialization
'''
* param index_name: name of the vector database index
* param vector_ids: blob ids from the blob table; each data record must have a unique id
* param vector_database: initialized vector database (user_database above); currently supports FAISS, Chroma, or Weaviate
* param nb_buckets: number of buckets for FPF
* param percent_fpf: percent of randomly selected buckets in FPF
* param seed: random seed
* param reps: preexisting representative ids
'''
The TASTI index has two main functions.
# use the FPF algorithm to get the vector ids of the cluster representatives
representative_ids = tasti_index.get_representative_vector_ids()
# get the k nearest cluster representatives, and the distances to them, for each data record
topk_for_all = tasti_index.get_topk_representatives_for_all(top_k=5)
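Conceptually, the top-k lookup is a nearest-neighbor search restricted to the representatives. The brute-force NumPy sketch below shows the computation on a toy dataset. The function name and the returned array layout are assumptions for illustration; the actual return format of `get_topk_representatives_for_all` may differ, and the real lookup goes through the vector database.

```python
import numpy as np

def topk_representatives(embeddings, rep_ids, top_k):
    """Exact top-k nearest representatives for every record (brute force)."""
    embeddings = np.asarray(embeddings, dtype=float)
    reps = embeddings[rep_ids]
    # Pairwise Euclidean distances, shape (n_records, n_reps).
    dists = np.linalg.norm(embeddings[:, None, :] - reps[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)[:, :top_k]
    topk_dists = np.take_along_axis(dists, order, axis=1)
    topk_ids = np.asarray(rep_ids)[order]  # map column positions back to record ids
    return topk_ids, topk_dists

# Four points on a line; records 0 and 3 act as representatives.
points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
topk_ids, topk_dists = topk_representatives(points, rep_ids=[0, 3], top_k=2)
```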