vector-search-integrate-with-llamaindex.md

title: Integrate Vector Search with LlamaIndex
summary: Learn how to integrate TiDB Vector Search with LlamaIndex.

Integrate Vector Search with LlamaIndex

This tutorial demonstrates how to integrate the vector search feature of TiDB with LlamaIndex.

Warning:

The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an issue on GitHub.

Note:

The vector search feature is only available for TiDB Self-Managed clusters and TiDB Cloud Serverless clusters.

Tip:

You can view the complete sample code on Jupyter Notebook, or run the sample code directly in the Colab online environment.

Prerequisites

To complete this tutorial, you need:

If you don't have a TiDB cluster, you can create one as follows:


Get started

This section provides step-by-step instructions for integrating TiDB Vector Search with LlamaIndex to perform semantic searches.

Step 1. Create a new Jupyter Notebook file

In the root directory, create a new Jupyter Notebook file named integrate_with_llamaindex.ipynb:

touch integrate_with_llamaindex.ipynb

Step 2. Install required dependencies

In your project directory, run the following command to install the required packages:

pip install llama-index-vector-stores-tidbvector
pip install llama-index

Open the integrate_with_llamaindex.ipynb file in Jupyter Notebook and add the following code to import the required packages:

import textwrap

from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.tidbvector import TiDBVectorStore

Step 3. Configure environment variables

Configure the environment variables depending on the TiDB deployment option you've selected.

For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables:

  1. Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.

  2. Click Connect in the upper-right corner. A connection dialog is displayed.

  3. Ensure the configurations in the connection dialog match your operating environment.

    • Connection Type is set to Public.
    • Branch is set to main.
    • Connect With is set to SQLAlchemy.
    • Operating System matches your environment.
  4. Click the PyMySQL tab and copy the connection string.

    Tip:

    If you have not set a password yet, click Generate Password to generate a random password.

  5. Configure environment variables.

    This document uses OpenAI as the embedding model provider. In this step, you need to provide the connection string obtained in the previous step and your OpenAI API key.

    To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key:

    # Use getpass to securely prompt for environment variables in your terminal.
    import getpass
    import os
    
    # Copy your connection string from the TiDB Cloud console.
    # Connection string format: "mysql+pymysql://<USER>:<PASSWORD>@<HOST>:4000/<DB>?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
    tidb_connection_string = getpass.getpass("TiDB Connection String:")
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

For a TiDB Self-Managed cluster, you need to provide the connection string of your TiDB cluster and your OpenAI API key. This document uses OpenAI as the embedding model provider.

To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key:

# Use getpass to securely prompt for environment variables in your terminal.
import getpass
import os

# Connection string format: "mysql+pymysql://<USER>:<PASSWORD>@<HOST>:4000/<DB>?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
tidb_connection_string = getpass.getpass("TiDB Connection String:")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

Taking macOS as an example, the cluster connection string is as follows:

TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE_NAME>"
# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test"

You need to modify the parameters in the connection string according to your TiDB cluster. If you are running TiDB on your local machine, <HOST> is 127.0.0.1 by default. The initial <PASSWORD> is empty, so if you are starting the cluster for the first time, you can omit this field.

The following are descriptions for each parameter:

  • <USERNAME>: The username to connect to the TiDB cluster.
  • <PASSWORD>: The password to connect to the TiDB cluster.
  • <HOST>: The host of the TiDB cluster.
  • <PORT>: The port of the TiDB cluster.
  • <DATABASE_NAME>: The name of the database you want to connect to.
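
The parameters above can be assembled into a connection string programmatically. The following is a minimal sketch, not part of the tutorial's sample code: build_tidb_url is a hypothetical helper name, and it assumes that percent-encoding the credentials is desirable because a password containing characters such as "@" or "/" would otherwise break the URL.

```python
from urllib.parse import quote_plus


def build_tidb_url(username, password, host, port, database):
    """Assemble a SQLAlchemy-style URL for the PyMySQL driver (hypothetical helper)."""
    # Percent-encode the credentials so that characters such as "@" or "/"
    # are not mistaken for URL delimiters.
    credentials = quote_plus(username)
    if password:
        # Omit the ":<PASSWORD>" part entirely when the password is empty.
        credentials += ":" + quote_plus(password)
    return f"mysql+pymysql://{credentials}@{host}:{port}/{database}"


# A local cluster started for the first time: empty password, default host.
local_url = build_tidb_url("root", "", "127.0.0.1", 4000, "test")
print(local_url)
```

You can paste the printed value at the "TiDB Connection String:" prompt, or assign it to tidb_connection_string directly when you are experimenting against a local cluster.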

Step 4. Load the sample document

Step 4.1 Download the sample document

In your project directory, create a directory named data/paul_graham/ and download the sample document paul_graham_essay.txt from the run-llama/llama_index GitHub repository.

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Step 4.2 Load the document

Load the sample document from data/paul_graham/paul_graham_essay.txt using the SimpleDirectoryReader class.

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)

# Tag every document with the metadata that the filters in Step 7 rely on.
for document in documents:
   document.metadata = {"book": "paul_graham"}

Step 5. Embed and store document vectors

Step 5.1 Initialize the TiDB vector store

The following code creates a table named paul_graham_test in TiDB, which is optimized for vector search.

tidbvec = TiDBVectorStore(
   connection_string=tidb_connection_string,
   table_name="paul_graham_test",
   distance_strategy="cosine",
   vector_dimension=1536,
   drop_existing_table=False,
)

Upon successful execution, you can directly view and access the paul_graham_test table in your TiDB database.
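
To build intuition for the distance_strategy="cosine" and vector_dimension settings, the following sketch computes a cosine distance in plain Python. It illustrates the metric only and is not how TiDB or LlamaIndex computes it internally. The function name cosine_distance and the tiny two-dimensional vectors are assumptions made for the example. Real embeddings in this tutorial have 1536 dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors of equal dimension."""
    # Vectors stored in the same table must share one dimension.
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing in the same direction are at distance 0, regardless of length.
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
# Orthogonal vectors are at distance 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

A smaller distance means the two texts are semantically closer under the embedding model.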

Step 5.2 Generate and store embeddings

The following code parses the documents, generates embeddings, and stores them in the TiDB vector store.

storage_context = StorageContext.from_defaults(vector_store=tidbvec)
index = VectorStoreIndex.from_documents(
   documents, storage_context=storage_context, show_progress=True
)

The expected output is as follows:

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00,  8.76it/s]
Generating embeddings: 100%|██████████| 21/21 [00:02<00:00,  8.22it/s]
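
The output above shows one document being parsed into 21 nodes, each of which gets its own embedding. The following sketch shows the general idea of splitting a long text into overlapping chunks. It is a simplified, word-based illustration under stated assumptions: split_into_chunks is a hypothetical function, the chunk sizes are arbitrary, and the splitter that LlamaIndex actually uses is more sophisticated.

```python
def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Ten placeholder words, split into chunks of four words with one word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
chunks = split_into_chunks(sample, chunk_size=4, overlap=1)
print(len(chunks), chunks)
```

The overlap keeps some shared context between neighboring chunks, so a sentence that straddles a boundary is still represented in at least one chunk.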

Step 6. Perform a vector search

The following code creates a query engine based on the TiDB vector store and performs a semantic similarity search.

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))

Note:

TiDBVectorStore only supports the default query mode.

The expected output is as follows:

The author worked on writing, programming, building microcomputers, giving talks at conferences,
publishing essays online, developing spam filters, painting, hosting dinner parties, and purchasing
a building for office use.
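
Before the language model writes an answer like the one above, the query engine first retrieves the stored chunks whose embeddings are closest to the query embedding. The following sketch illustrates that retrieval step in memory. It is an assumption-laden illustration, not the code path that TiDBVectorStore runs: top_k_by_cosine is a hypothetical function, and the three-dimensional vectors stand in for real embeddings.

```python
import heapq
import math


def top_k_by_cosine(query, rows, k=2):
    """Return the k (distance, text) pairs closest to `query` by cosine distance."""
    def distance(vec):
        dot = sum(q * v for q, v in zip(query, vec))
        norms = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(v * v for v in vec))
        return 1.0 - dot / norms

    # Score every row, then keep only the k smallest distances.
    scored = ((distance(vec), text) for text, vec in rows)
    return heapq.nsmallest(k, scored)


# Made-up (text, embedding) pairs standing in for rows of the vector table.
rows = [
    ("essay about painting", [0.9, 0.1, 0.0]),
    ("essay about programming", [0.1, 0.9, 0.1]),
    ("essay about startups", [0.0, 0.2, 0.9]),
]
results = top_k_by_cosine([0.2, 1.0, 0.0], rows, k=2)
for dist, text in results:
    print(round(dist, 3), text)
```

The k parameter plays the same role as the similarity_top_k argument of as_query_engine(): it limits how many nearest chunks are passed on for answer synthesis.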

Step 7. Search with metadata filters

To refine your searches, you can use metadata filters to retrieve specific nearest-neighbor results that match the applied filters.

Query with book != "paul_graham" filter

The following example excludes results where the book metadata field is "paul_graham":

from llama_index.core.vector_stores.types import (
   MetadataFilter,
   MetadataFilters,
)

query_engine = index.as_query_engine(
   filters=MetadataFilters(
      filters=[
         MetadataFilter(key="book", value="paul_graham", operator="!="),
      ]
   ),
   similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

The expected output is as follows:

Empty Response

Query with book == "paul_graham" filter

The following example filters results to include only documents where the book metadata field is "paul_graham":

from llama_index.core.vector_stores.types import (
   MetadataFilter,
   MetadataFilters,
)

query_engine = index.as_query_engine(
   filters=MetadataFilters(
      filters=[
         MetadataFilter(key="book", value="paul_graham", operator="=="),
      ]
   ),
   similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

The expected output is as follows:

The author learned programming on an IBM 1401 using an early version of Fortran in 9th grade, then
later transitioned to working with microcomputers like the TRS-80 and Apple II. Additionally, the
author studied philosophy in college but found it unfulfilling, leading to a switch to studying AI.
Later on, the author attended art school in both the US and Italy, where they observed a lack of
substantial teaching in the painting department.
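
The two results above follow from the metadata assigned in Step 4.2: every document carries book = "paul_graham", so the != filter matches nothing and the == filter matches everything. The following sketch mimics that behavior on an in-memory list. It is an illustration only: apply_filter and the node dictionaries are hypothetical and do not reflect how TiDBVectorStore translates filters into queries.

```python
import operator

# Map the two filter operators used above to Python comparison functions.
OPERATORS = {"==": operator.eq, "!=": operator.ne}


def apply_filter(nodes, key, value, op):
    """Keep nodes whose metadata[key] satisfies the comparison (illustrative only)."""
    compare = OPERATORS[op]
    return [
        node for node in nodes
        if key in node["metadata"] and compare(node["metadata"][key], value)
    ]


# Every chunk inherits the same metadata from its source document.
nodes = [
    {"text": "chunk 1", "metadata": {"book": "paul_graham"}},
    {"text": "chunk 2", "metadata": {"book": "paul_graham"}},
]
excluded = apply_filter(nodes, "book", "paul_graham", "!=")
included = apply_filter(nodes, "book", "paul_graham", "==")
print(len(excluded), len(included))
```

When no chunk passes the filter, there is nothing to retrieve, which is why the first query returns an empty response.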

Step 8. Delete documents

Delete the first document from the index:

tidbvec.delete(documents[0].doc_id)

Check whether the document has been deleted:

query_engine = index.as_query_engine()
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

The expected output is as follows:

Empty Response
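
The response is empty because all stored chunks were derived from the single document that was deleted. The following sketch shows the idea of deleting by source document ID on an in-memory list. It is a simplified illustration: delete_by_doc_id, the doc_id field, and the row dictionaries are hypothetical and are not the table layout used by TiDBVectorStore.

```python
def delete_by_doc_id(rows, doc_id):
    """Return only the rows that do not belong to the given source document."""
    return [row for row in rows if row["doc_id"] != doc_id]


# Two chunks that both originate from the same source document.
rows = [
    {"doc_id": "doc-1", "text": "chunk 1"},
    {"doc_id": "doc-1", "text": "chunk 2"},
]
remaining = delete_by_doc_id(rows, "doc-1")
print(remaining)
```

If the store held chunks from other documents, those chunks would remain and could still be retrieved.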

See also