TASTI Usage
TASTI is an index that generates high-quality proxy scores from semantic similarity across embeddings. These scores can be used in existing proxy-based query processing algorithms (e.g., for aggregation, selection, or inference service optimization). TASTI uses the FPF (furthest point first) algorithm to choose cluster representatives and a vector database to find the closest cluster representatives for each data record. From the labels of these representatives and the distances of data records to them, TASTI efficiently generates a proxy score for each record, without query-specific training or extensive annotations.
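To make the idea concrete, here is a minimal sketch of how a proxy score could be derived from the labels of a record's nearest representatives and its distances to them. It assumes inverse-distance weighting, which is one common choice and not necessarily the exact formula TASTI uses. `proxy_score` and its arguments are hypothetical names, not part of the AIDB API.

```python
def proxy_score(rep_labels, rep_distances, eps=1e-8):
    """Inverse-distance-weighted average of representative labels (assumed scheme).

    rep_labels:    labels of the k nearest representatives, e.g. 1.0 if the
                   predicate holds on that representative and 0.0 otherwise
    rep_distances: distances from the record to those representatives
    eps:           guards against division by zero when a distance is 0
    """
    if len(rep_labels) != len(rep_distances):
        raise ValueError('labels and distances must have the same length')
    weights = [1.0 / (d + eps) for d in rep_distances]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, rep_labels)) / total


# A record close to one positive representative and far from two negative ones
score = proxy_score(rep_labels=[1.0, 0.0, 0.0], rep_distances=[0.1, 2.0, 4.0])
print(round(score, 3))
```

Closer representatives dominate the score, so a record that sits next to a positively labeled representative receives a score near 1 even when its other neighbors are negative.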
We have also designed a limit engine that ranks data records by their proxy scores and processes them sequentially in that order. Because the most promising records are checked first, fewer deep learning model invocations are needed, which conserves resources.
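The ranking idea can be sketched as follows: records are visited in descending proxy-score order, the expensive model is called one record at a time, and execution stops as soon as the limit is reached. `limit_query`, `oracle`, and the toy scores are hypothetical names chosen for illustration; this is not AIDB's actual limit engine.

```python
def limit_query(proxy_scores, oracle, limit):
    """Return up to `limit` matching record ids and the number of oracle calls.

    proxy_scores: dict mapping record id -> proxy score (higher = more likely)
    oracle:       expensive predicate, standing in for a deep learning model
    """
    matches = []
    calls = 0
    # Visit the most promising records first
    for record_id in sorted(proxy_scores, key=proxy_scores.get, reverse=True):
        if len(matches) >= limit:
            break
        calls += 1
        if oracle(record_id):
            matches.append(record_id)
    return matches, calls


# Toy data: even ids satisfy the predicate and also received high proxy scores
scores = {i: (0.9 if i % 2 == 0 else 0.1) for i in range(10)}
matches, calls = limit_query(scores, oracle=lambda i: i % 2 == 0, limit=3)
print(matches, calls)
```

With well-calibrated proxy scores, the number of oracle calls stays close to the limit itself instead of growing with the size of the table.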
from multiprocessing import Process
import numpy as np
import os
import pandas as pd
import time

from aidb.utils.asyncio import asyncio_run
from aidb.vector_database.faiss_vector_database import FaissVectorDatabase
from aidb.vector_database.tasti import Tasti
from tests.inference_service_utils.inference_service_setup import register_inference_services
from tests.inference_service_utils.http_inference_service_setup import run_server
from tests.utils import setup_gt_and_aidb_engine

DB_URL = 'sqlite+aiosqlite://'


async def test_jackson_number_objects(queries):
    dirname = os.path.dirname(__file__)
    data_dir = os.path.join(dirname, 'tests/data/jackson')
    p = Process(target=run_server, args=[str(data_dir)])
    p.start()
    time.sleep(3)

    # load embeddings, each data record corresponds to one embedding
    embeddings = np.load('./tests/embeddings.npy')
    embeddings = list(embeddings)
    embeddings_list = [embeddings[i] for i in range(0, 15000, 15)]
    vector_ids = pd.DataFrame({'id': range(1000)})
    data = pd.DataFrame({'id': range(1000), 'values': embeddings_list})

    # vector database configuration
    user_database = FaissVectorDatabase(path='./')
    user_database.create_index(index_name='tasti', embedding_dim=128, recreate_index=True)
    user_database.insert_data(index_name='tasti', data=data)
    user_database.save_index(index_name='tasti')

    tasti_index = Tasti(index_name='tasti', vector_ids=vector_ids, vector_database=user_database, nb_buckets=100)
    blob_mapping_table_name = 'blobs_mapping'
    _, aidb_engine = await setup_gt_and_aidb_engine(DB_URL, data_dir, tasti_index, blob_mapping_table_name)
    register_inference_services(aidb_engine, data_dir)

    for aidb_query in queries:
        aidb_res = aidb_engine.execute(aidb_query)
        print(aidb_res)
    p.terminate()


if __name__ == '__main__':
    queries = [
        '''SELECT * FROM colors02 WHERE frame >= 1000 and colors02.color = 'black' LIMIT 100;''',
        '''SELECT frame, light_1, light_2 FROM lights01 WHERE light_2 = 'green' LIMIT 100;''',
        '''SELECT * FROM objects00 WHERE object_name = 'car' OR frame < 1000 LIMIT 100;'''
    ]
    asyncio_run(test_jackson_number_objects(queries))
A vector database is a specialized database designed to store and efficiently query high-dimensional vectors, enabling rapid retrieval of similar embeddings through vector search. Currently, we support three vector databases.
Vector Database | Features |
---|---|
FAISS | Fastest. Local vector database. Returns exact results by default. Supports GPU |
Chroma | Local vector database. Uses HNSW, so it does not return exact results |
Weaviate | Online vector database. Extremely slow when reading data from the database. Does not return exact results |
You can use our interface to initialize a vector database and insert data.
FAISS
import pandas as pd
from aidb.vector_database.faiss_vector_database import FaissVectorDatabase

data = pd.DataFrame({'id':[], 'values':[]}) # id: vector id, values: embedding value
user_database = FaissVectorDatabase(path='./vector_database')
user_database.create_index(index_name='tasti', embedding_dim=128, recreate_index=True)
user_database.insert_data(index_name='tasti', data=data) # data must be passed by keyword after index_name
user_database.save_index(index_name='tasti') # this will save the index to your local path
Chroma
import pandas as pd
from aidb.vector_database.chroma_vector_database import ChromaVectorDatabase

data = pd.DataFrame({'id':[], 'values':[]}) # id: vector id, values: embedding value
user_database = ChromaVectorDatabase(path='./vector_database')
user_database.create_index(index_name='tasti', recreate_index=True)
user_database.insert_data(index_name='tasti', data=data) # data must be passed by keyword after index_name
Weaviate
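import os
import pandas as pd
# module path assumed by analogy with the FAISS and Chroma modules above
from aidb.vector_database.weaviate_vector_database import WeaviateAuth, WeaviateVectorDatabase

url = '' # URL of your Weaviate instance
api_key = os.environ.get('WEAVIATE_API_KEY')
weaviate_auth = WeaviateAuth(url, api_key=api_key)
data = pd.DataFrame({'id':[], 'values':[]}) # id: vector id, values: embedding value
user_database = WeaviateVectorDatabase(weaviate_auth)
user_database.create_index(index_name='tasti', recreate_index=True)
user_database.insert_data(index_name='tasti', data=data) # data must be passed by keyword after index_name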
TASTI uses the FPF algorithm to choose cluster representatives. For each data record, it then uses the vector database to find the k nearest cluster representatives and compute the distances to them.
# initialization; nb_buckets, percent_fpf and seed are values you choose
tasti_index = Tasti(index_name='tasti', vector_ids=vector_ids, vector_database=user_database,
                    nb_buckets=nb_buckets, percent_fpf=percent_fpf, seed=seed, reps=None)
'''
* param index_name: vector database index name
* param vector_ids: blob index in the blob table; it must be unique for each data record
* param vector_database: initialized vector database; FAISS, Chroma and Weaviate are currently supported
* param nb_buckets: number of buckets for FPF
* param percent_fpf: percent of randomly selected buckets in FPF
* param seed: random seed
* param reps: preexisting representative ids
'''
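For intuition, the representative selection step can be sketched as a plain furthest-point-first loop. This is a simplified illustration in pure Python, not the code inside `Tasti`: it ignores `percent_fpf` and preexisting `reps`, and `fpf_representatives` is a hypothetical helper name.

```python
import math
import random


def fpf_representatives(embeddings, nb_buckets, seed=0):
    """Pick `nb_buckets` representative indices with furthest-point-first."""
    if not 0 < nb_buckets <= len(embeddings):
        raise ValueError('nb_buckets must be between 1 and len(embeddings)')
    rng = random.Random(seed)
    reps = [rng.randrange(len(embeddings))]  # random starting point
    # min_dists[i] = distance from record i to its closest representative so far
    min_dists = [math.dist(e, embeddings[reps[0]]) for e in embeddings]
    while len(reps) < nb_buckets:
        # The next representative is the record furthest from all current ones
        next_rep = max(range(len(embeddings)), key=min_dists.__getitem__)
        reps.append(next_rep)
        for i, e in enumerate(embeddings):
            min_dists[i] = min(min_dists[i], math.dist(e, embeddings[next_rep]))
    return reps


# Three tight clusters on a line; FPF picks one point from each cluster
points = [(0.0,), (0.1,), (5.0,), (5.1,), (10.0,), (10.1,)]
reps = fpf_representatives(points, nb_buckets=3, seed=0)
print(sorted(reps))
```

Because each new representative is the record that is currently worst covered, the selected set spreads across the embedding space rather than concentrating in dense regions.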
The TASTI index provides two main functions.
# use the FPF algorithm to get the vector ids of the cluster representatives
representative_ids = tasti_index.get_representative_vector_ids()
# get the k nearest cluster representatives and the distances to them for each data record
topk_for_all = tasti_index.get_topk_representatives_for_all(top_k=5)
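Conceptually, the top-k lookup is a nearest-neighbor search restricted to the representatives. The sketch below uses a brute-force exact search in place of a vector database. `topk_representatives` and the toy data are hypothetical, and the actual return format of `get_topk_representatives_for_all` may differ.

```python
import heapq
import math


def topk_representatives(embedding, rep_embeddings, top_k):
    """Return [(distance, rep_id), ...] for the `top_k` closest representatives.

    rep_embeddings: dict mapping representative id -> embedding
    """
    distances = ((math.dist(embedding, rep), rep_id)
                 for rep_id, rep in rep_embeddings.items())
    return heapq.nsmallest(top_k, distances)


# Toy 2-D embeddings: three representatives and two data records
rep_embeddings = {10: (0.0, 0.0), 20: (1.0, 0.0), 30: (5.0, 5.0)}
records = {0: (0.1, 0.0), 1: (4.0, 4.0)}

# One entry per data record, each sorted from nearest to furthest
sketch_topk_for_all = {
    record_id: topk_representatives(embedding, rep_embeddings, top_k=2)
    for record_id, embedding in records.items()
}
print(sketch_topk_for_all)
```

A vector database replaces the linear scan with an index, which is what keeps this step affordable when there are many records and representatives.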
- Add HAVING clause predicate filtering.
- Implement the optimization where the inference services are ordered for a select query with a complex predicate.
- Implement control variates for approximate aggregation.
- Add Colab quick start