
TASTI Usage

ttt-77 edited this page Oct 17, 2023 · 15 revisions

Introduction

TASTI is an index that generates high-quality proxy scores via semantic similarity across embeddings. These scores can be used in existing proxy-based query processing algorithms (e.g., for aggregation, selection, and inference service optimization). TASTI uses the FPF (furthest-point-first) algorithm to choose cluster representatives and a vector database to identify the closest cluster representatives for each data record. By combining the labels of these representatives with the distances from each data record to them, TASTI efficiently generates a proxy score for each record, without query-specific training or extensive annotations.
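As a rough illustration of the last step, the sketch below derives a proxy score from the labels of a record's nearest representatives and its distances to them. It assumes inverse-distance weighting; the function name `proxy_score` and the weighting scheme are hypothetical and are not taken from the AIDB code.

```python
from typing import Sequence


def proxy_score(rep_labels: Sequence[float],
                rep_distances: Sequence[float],
                eps: float = 1e-8) -> float:
    """Inverse-distance-weighted mean of the nearest representatives' labels.

    Hypothetical helper: the weighting scheme is an assumption for illustration.
    """
    if len(rep_labels) != len(rep_distances) or not rep_labels:
        raise ValueError("need one distance per label, and at least one pair")
    # Closer representatives get larger weights; eps avoids division by zero
    weights = [1.0 / (d + eps) for d in rep_distances]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, rep_labels)) / total


# A record close to one positive representative and far from two negative ones
score = proxy_score(rep_labels=[1.0, 0.0, 0.0], rep_distances=[0.1, 0.9, 1.0])
print(round(score, 3))
```

Because the nearest representative dominates the weights, the score lands close to that representative's label.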

We have also designed a limit engine that ranks data records by proxy score and processes them sequentially in that order. This conserves resources by reducing the number of deep learning model invocations.
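The idea behind ranking by proxy score can be sketched as follows. This is a simplified illustration, not the actual limit engine: `limit_query`, `toy_oracle`, and the stopping rule (stop as soon as enough matching records are found) are assumptions made for the example.

```python
from typing import Callable, Dict, Hashable, List


def limit_query(proxy_scores: Dict[Hashable, float],
                oracle: Callable[[Hashable], bool],
                limit: int) -> List[Hashable]:
    """Return up to `limit` record ids that satisfy the oracle predicate."""
    matches: List[Hashable] = []
    # Visit records from highest to lowest proxy score
    for record_id in sorted(proxy_scores, key=proxy_scores.get, reverse=True):
        if len(matches) >= limit:
            break
        if oracle(record_id):  # stands in for the expensive model invocation
            matches.append(record_id)
    return matches


calls = []  # records which ids the oracle was actually invoked on


def toy_oracle(record_id):
    calls.append(record_id)
    return record_id in {"a", "c", "e"}


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.2, "e": 0.1}
result = limit_query(scores, toy_oracle, limit=2)
print(result, "oracle calls:", len(calls))
```

With well-calibrated proxy scores, the matching records sit near the top of the ranking, so the oracle is invoked on only a small prefix of the data.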

Quick Start

Detailed Documentation

Supported Vector Databases

A vector database is a specialized database designed to store and efficiently query high-dimensional vectors, enabling rapid retrieval of similar embeddings through vector search. Currently, we support three vector databases.

| Vector Database | Features |
| --- | --- |
| FAISS | Fastest. Local vector database. Returns exact results by default. Supports GPU. |
| Chroma | Local vector database. Uses HNSW, so it doesn't return exact results. |
| Weaviate | Online vector database. Extremely slow when reading data from the database. Doesn't return exact results. |

You can use our interface to initialize a vector database and insert data into it.

Example usage

FAISS

import pandas as pd
from aidb.vector_database.faiss_vector_database import FaissVectorDatabase

# id: vector id, values: embedding value
data = pd.DataFrame({'id': [], 'values': []})
user_database = FaissVectorDatabase(index_path='./vector_database')
user_database.create_index(index_name='tasti', embedding_dim=128, recreate_index=True)
user_database.insert_data('tasti', data)
user_database.save_index(index_name='tasti')  # saves the index to your local path

Chroma

import pandas as pd
from aidb.vector_database.chroma_vector_database import ChromaVectorDatabase

# id: vector id, values: embedding value
data = pd.DataFrame({'id': [], 'values': []})
user_database = ChromaVectorDatabase(index_path='./vector_database')
user_database.create_index(index_name='tasti', recreate_index=True)
user_database.insert_data('tasti', data)

Weaviate

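import os
import pandas as pd
# Module path assumed by analogy with the FAISS and Chroma modules
from aidb.vector_database.weaviate_vector_database import WeaviateAuth, WeaviateVectorDatabase

url = ''  # URL of your Weaviate instance
api_key = os.environ.get('WEAVIATE_API_KEY')
weaviate_auth = WeaviateAuth(url, api_key=api_key)
# id: vector id, values: embedding value
data = pd.DataFrame({'id': [], 'values': []})
user_database = WeaviateVectorDatabase(weaviate_auth)
user_database.create_index(index_name='tasti', recreate_index=True)
user_database.insert_data('tasti', data)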

TASTI Index

TASTI uses the FPF algorithm to choose cluster representatives. It then uses the vector database to find the k nearest cluster representatives for each data record and to compute the distances to them.
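For readers unfamiliar with FPF (furthest-point-first), here is a minimal pure-Python sketch of the greedy selection. It is an illustration under assumptions: the function name `fpf_representatives` is hypothetical, the first representative is chosen at random, and the mix of randomly selected buckets controlled by `percent_fpf` is left out.

```python
import math
import random
from typing import Dict, List, Sequence


def fpf_representatives(embeddings: Dict[int, Sequence[float]],
                        nb_reps: int,
                        seed: int = 0) -> List[int]:
    """Greedy furthest-point-first selection of representative ids."""
    if not 0 < nb_reps <= len(embeddings):
        raise ValueError("nb_reps must be between 1 and the number of records")
    rng = random.Random(seed)
    ids = sorted(embeddings)
    first = rng.choice(ids)  # assumption: random starting point
    reps = [first]
    # Distance from each record to its closest representative so far
    min_dist = {i: math.dist(embeddings[i], embeddings[first]) for i in ids}
    while len(reps) < nb_reps:
        # Pick the record that is furthest from all current representatives
        next_rep = max(ids, key=min_dist.get)
        reps.append(next_rep)
        for i in ids:
            d = math.dist(embeddings[i], embeddings[next_rep])
            if d < min_dist[i]:
                min_dist[i] = d
    return reps


# Three well-separated groups: {0, 1}, {2, 3}, and {4}
points = {0: [0.0, 0.0], 1: [0.1, 0.0], 2: [10.0, 0.0], 3: [10.0, 0.1], 4: [5.0, 8.0]}
reps = fpf_representatives(points, nb_reps=3, seed=0)
print(reps)
```

Because each step picks the point furthest from the representatives chosen so far, the selection spreads across the embedding space and tends to cover each group once before revisiting any of them.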

Example usage

# Initialization (Tasti must first be imported from the aidb package)
tasti_index = Tasti('tasti', vector_ids, user_database, nb_buckets, percent_fpf, seed, reps=None)
'''
*  param index_name: vector database index name
*  param vector_ids: blob index in the blob table; it should be unique for each data record
*  param vector_database: initialized vector database; currently supports FAISS, Chroma, or Weaviate
*  param nb_buckets: number of buckets for FPF
*  param percent_fpf: percent of randomly selected buckets in FPF
*  param seed: random seed
*  param reps: preexisting representative ids
'''

The TASTI index provides two main functions.

# Use the FPF algorithm to get the vector ids of the cluster representatives
representative_ids = tasti_index.get_representative_vector_ids()
# Get the k nearest cluster representatives, and the distances to them, for each data record
topk_for_all = tasti_index.get_topk_representatives_for_all(top_k=5)
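To show what the top-k lookup computes conceptually, the following sketch does the same thing by brute force, without a vector database. The function name `brute_force_topk` and the return shape (record id mapped to a list of `(representative id, distance)` pairs) are assumptions for illustration; the actual return format of the TASTI index may differ.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def brute_force_topk(embeddings: Dict[int, Sequence[float]],
                     representative_ids: Sequence[int],
                     top_k: int) -> Dict[int, List[Tuple[int, float]]]:
    """Map each record id to its top_k nearest (representative id, distance) pairs."""
    result = {}
    for record_id, vec in embeddings.items():
        # Euclidean distance from this record to every representative
        dists = [(rep_id, math.dist(vec, embeddings[rep_id]))
                 for rep_id in representative_ids]
        # Keep the top_k closest representatives, nearest first
        result[record_id] = heapq.nsmallest(top_k, dists, key=lambda p: p[1])
    return result


embeddings = {0: [0.0, 0.0], 1: [0.1, 0.0], 2: [10.0, 0.0], 3: [5.0, 8.0]}
topk = brute_force_topk(embeddings, representative_ids=[0, 2, 3], top_k=2)
for record_id, neighbors in topk.items():
    print(record_id, [rep_id for rep_id, _ in neighbors])
```

A vector database replaces the inner loop with an index lookup, which matters once the number of representatives grows beyond what a linear scan per record can handle.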

TASTI Engine

Limit Engine

TODO

  • Add HAVING clause predicate filtering.
  • Implement the optimization where the inference services are ordered for a select query with a complex predicate.
  • Implement control variates for approximate aggregation.