TASTI Usage

ttt-77 edited this page Oct 17, 2023 · 15 revisions

Introduction

TASTI is an index that generates high-quality proxy scores via semantic similarity across embeddings. These scores can be used in existing proxy-based query processing algorithms (e.g., for aggregation, selection, inference service optimization, etc.). We built TASTI by employing the FPF (furthest-point-first) algorithm to choose cluster representatives and using a vector database to identify the closest cluster representatives for each data record. By evaluating the labels of these representatives and the distances of data records to them, TASTI efficiently generates a proxy score for each record, without query-specific training or extensive annotations.
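As a rough illustration of the representative-selection step, a minimal FPF (furthest-point-first) sketch is shown below. The function name and signature are illustrative, not AIDB's actual API: the algorithm greedily picks the point farthest from all representatives chosen so far.

```python
import numpy as np

def fpf_representatives(embeddings, nb_buckets, seed=0):
    """Greedy furthest-point-first: repeatedly pick the point
    farthest from every representative chosen so far."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    reps = [int(rng.integers(n))]  # random starting representative
    # distance from every point to its nearest chosen representative
    dists = np.linalg.norm(embeddings - embeddings[reps[0]], axis=1)
    for _ in range(nb_buckets - 1):
        nxt = int(np.argmax(dists))  # farthest point becomes the next representative
        reps.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return reps
```

Because each chosen point's distance drops to zero, the greedy loop never re-selects a representative, and the chosen set spreads out to cover the embedding space.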

We've also designed a limit engine that ranks data records based on their proxy scores. This sequential execution approach conserves resources by reducing the need for frequent deep learning model invocations.

Quick Start

from multiprocessing import Process
import numpy as np
import os
import pandas as pd
import time

from aidb.utils.asyncio import asyncio_run
from aidb.vector_database.faiss_vector_database import FaissVectorDatabase
from aidb.vector_database.tasti import Tasti
from tests.inference_service_utils.inference_service_setup import register_inference_services
from tests.inference_service_utils.http_inference_service_setup import run_server
from tests.utils import setup_gt_and_aidb_engine

DB_URL = 'sqlite+aiosqlite://'
async def test_jackson_number_objects(queries):

  dirname = os.path.dirname(__file__)
  data_dir = os.path.join(dirname, 'tests/data/jackson')
  p = Process(target=run_server, args=[str(data_dir)])
  p.start()
  time.sleep(3)

  # load embeddings, each data record corresponds to one embedding
  embeddings = np.load('./tests/embeddings.npy')
  embeddings = list(embeddings)
  embeddings_list = [embeddings[i] for i in range(0, 15000, 15)]
  vector_ids = pd.DataFrame({'id': range(1000)})
  data = pd.DataFrame({'id': range(1000), 'values': embeddings_list})

  # vector database configuration
  user_database = FaissVectorDatabase(path='./')
  user_database.create_index(index_name='tasti', embedding_dim=128, recreate_index=True)
  user_database.insert_data(index_name='tasti', data=data)
  user_database.save_index(index_name='tasti')

  tasti_index = Tasti(index_name='tasti', vector_ids=vector_ids, vector_database=user_database, nb_buckets=100)
  blob_mapping_table_name = 'blobs_mapping'

  _, aidb_engine = await setup_gt_and_aidb_engine(DB_URL, data_dir, tasti_index, blob_mapping_table_name)

  register_inference_services(aidb_engine, data_dir)

  for aidb_query in queries:
    aidb_res = aidb_engine.execute(aidb_query)
    print(aidb_res)

  p.terminate()


if __name__ == '__main__':
  queries = [
    '''SELECT * FROM colors02 WHERE frame >= 1000 and colors02.color = 'black' LIMIT 100;''',
    '''SELECT frame, light_1, light_2 FROM lights01 WHERE light_2 = 'green' LIMIT 100;''',
    '''SELECT * FROM objects00 WHERE object_name = 'car' OR frame < 1000 LIMIT 100;'''
  ]
  asyncio_run(test_jackson_number_objects(queries))

Detailed Documentation

Supported Vector Database

A vector database is a specialized database designed to store and efficiently query high-dimensional vectors, enabling rapid retrieval of similar embeddings through vector search. Currently, we support three vector databases.

Vector Database | Features
--- | ---
FAISS | Fastest. Local vector database. Returns exact results by default. Supports GPU.
Chroma | Local vector database. Uses HNSW; does not return exact results.
Weaviate | Online vector database. Extremely slow when reading data; does not return exact results.

You can use our interface to initialize a vector database and insert data.

Example usage

FAISS

import pandas as pd
from aidb.vector_database.faiss_vector_database import FaissVectorDatabase

data = pd.DataFrame({'id': [], 'values': []})  # id: vector id, values: embedding value
user_database = FaissVectorDatabase(path='./vector_database')
user_database.create_index(index_name='tasti', embedding_dim=128, recreate_index=True)
user_database.insert_data(index_name='tasti', data=data)
user_database.save_index(index_name='tasti')  # saves the index to your local path

Chroma

import pandas as pd
from aidb.vector_database.chroma_vector_database import ChromaVectorDatabase

data = pd.DataFrame({'id': [], 'values': []})  # id: vector id, values: embedding value
user_database = ChromaVectorDatabase(path='./vector_database')
user_database.create_index(index_name='tasti', recreate_index=True)
user_database.insert_data(index_name='tasti', data=data)

Weaviate

import os
import pandas as pd
from aidb.vector_database.weaviate_vector_database import WeaviateAuth, WeaviateVectorDatabase

url = ''
api_key = os.environ.get('WEAVIATE_API_KEY')
weaviate_auth = WeaviateAuth(url, api_key=api_key)
data = pd.DataFrame({'id': [], 'values': []})  # id: vector id, values: embedding value
user_database = WeaviateVectorDatabase(weaviate_auth)
user_database.create_index(index_name='tasti', recreate_index=True)
user_database.insert_data(index_name='tasti', data=data)

TASTI Index

TASTI utilizes the FPF algorithm to choose cluster representatives and leverages the vector database to find the nearest k cluster representatives and compute each data record's distances to them.

Example usage

tasti_index = Tasti(index_name='tasti', vector_ids=vector_ids, vector_database=user_database,
                    nb_buckets=nb_buckets, percent_fpf=percent_fpf, seed=seed, reps=None)  # initialization
'''
*  param index_name: vector database index name
*  param vector_ids: blob index in the blob table; it should be unique for each data record
*  param vector_database: an initialized vector database; currently FAISS, Chroma, or Weaviate
*  param nb_buckets: number of buckets for FPF
*  param percent_fpf: percent of randomly selected buckets in FPF
*  param seed: random seed
*  param reps: preexisting representative ids
'''

There are two main functions of the TASTI index.

# use the FPF algorithm to get cluster representative vector ids
representative_ids = tasti_index.get_representative_vector_ids()
# get the nearest k cluster representatives and the distances to them for each data record
topk_for_all = tasti_index.get_topk_representatives_for_all(top_k=5)
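To illustrate how the labels of a record's top-k representatives and the distances to them can combine into a proxy score, here is a minimal distance-weighted-vote sketch. It assumes 0/1 labels and is not AIDB's actual implementation; the function name is hypothetical.

```python
import numpy as np

def proxy_score(rep_labels, rep_dists, eps=1e-6):
    """Distance-weighted vote over a record's top-k representatives.
    rep_labels: 0/1 labels of the k nearest representatives
    rep_dists:  distances from the record to those representatives"""
    weights = 1.0 / (np.asarray(rep_dists, dtype=float) + eps)  # closer reps count more
    return float(np.average(np.asarray(rep_labels, dtype=float), weights=weights))
```

A record close to representatives that satisfy the predicate receives a score near 1; a record close to representatives that do not satisfy it receives a score near 0.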

TASTI Engine

For a given query, we first identify its associated blob tables and map them to the representative and topk tables. If the representative and topk tables are absent from the normal database, the TASTI Engine requires a blob mapping table, which maps blob keys to vector ids, and a TASTI Index. The engine then queries the TASTI Index to retrieve the cluster representative vector ids and, for each data record, the topk representative ids and distances. Finally, these results are saved in the normal database.

If these two tables exist, the TASTI Engine uses them to calculate a proxy score per predicate per data record. The engine first runs all query-related inference services on the cluster representative blobs, regardless of whether they satisfy the specified predicates. After inference, each predicate can be converted to a score. There are two types of filtering conditions.

One is a WHERE expression, like color = 'blue' or frame > 1000, which can use a 0/1 label. We rewrite the query into the form SELECT IFF(color = 'blue', 1, 0) AS score1, IFF(frame > 1000, 1, 0) AS score2. We separate the WHERE condition into several per-predicate scores, rather than one combined score, because this yields a more decentralized distribution of proxy scores.
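The per-predicate 0/1 labeling above can be sketched in a few lines of pandas; the DataFrame contents are hypothetical inference results on cluster representatives, not AIDB's internal tables.

```python
import pandas as pd

# hypothetical inference results on cluster representative blobs
rep_results = pd.DataFrame({
    'frame': [500, 1200, 2000],
    'color': ['blue', 'red', 'blue'],
})

# one 0/1 score column per predicate, mirroring
# SELECT IFF(color = 'blue', 1, 0) AS score1, IFF(frame > 1000, 1, 0) AS score2
rep_results['score1'] = (rep_results['color'] == 'blue').astype(int)
rep_results['score2'] = (rep_results['frame'] > 1000).astype(int)
```

Keeping score1 and score2 as separate columns lets each predicate contribute its own score distribution rather than collapsing the whole WHERE clause into a single label.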

(This is not supported yet.) The other is the aggregation type, such as HAVING SUM(car = 'red') > 3. Here it is more suitable to SELECT SUM(car = 'red') AS score3 than SELECT IFF(SUM(car = 'red'), 1, 0) AS score3.

The TASTI Engine returns a per-predicate, per-data-record proxy score; these scores can be used to accelerate queries such as aggregation, selection, and inference services.

Limit Engine

The Limit Engine is one example of a downstream query that uses proxy scores for acceleration. We combine the per-predicate, per-data-record proxy scores into a single proxy score for each data record based on the relations between predicates.

If predicate_A AND predicate_B, then score = Min(score_A, score_B)

If predicate_A OR predicate_B, then score = Max(score_A, score_B)

After obtaining a single proxy score for each data record, we rank the data records by score in descending order and execute the inference services on each record sequentially until the limit cardinality is reached. This reduces invocations of resource-intensive deep learning models.
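The combine-rank-probe loop above can be sketched as follows. The function name and `oracle` callback are illustrative, not the engine's actual interface; `oracle` stands in for the expensive deep-learning inference on a single record.

```python
import numpy as np

def limit_query(scores_a, scores_b, oracle, limit, combine='and'):
    """Rank records by combined proxy score, then invoke the expensive
    oracle (deep model) in that order until `limit` matches are found."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    # AND -> Min(score_A, score_B); OR -> Max(score_A, score_B)
    combined = (np.minimum(scores_a, scores_b) if combine == 'and'
                else np.maximum(scores_a, scores_b))
    results = []
    for idx in np.argsort(-combined):  # descending proxy score
        if oracle(int(idx)):           # one model invocation per record
            results.append(int(idx))
            if len(results) >= limit:
                break                  # stop as soon as the limit is reached
    return results
```

Because records most likely to satisfy the predicates are probed first, the loop usually terminates after far fewer oracle calls than scanning the data in storage order would require.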

TODO

  • Add HAVING clause predicate filtering.
  • Implement the optimization where the inference services are ordered for a select query with a complex predicate.
  • Implement control variates for approximate aggregation.
  • Add a Colab quick start.