---
title: Integrate Vector Search with LlamaIndex
summary: Learn how to integrate TiDB Vector Search with LlamaIndex.
---
This tutorial demonstrates how to integrate the vector search feature of TiDB with LlamaIndex.
Warning:

The vector search feature is experimental. It is not recommended that you use it in production environments. This feature might be changed without prior notice. If you find a bug, you can report an issue on GitHub.
Note:
The vector search feature is only available for TiDB Self-Managed clusters and TiDB Cloud Serverless clusters.
Tip:
You can view the complete sample code on Jupyter Notebook, or run the sample code directly in the Colab online environment.
To complete this tutorial, you need:
- Python 3.8 or higher installed.
- Jupyter Notebook installed.
- Git installed.
- A TiDB cluster.
If you don't have a TiDB cluster, you can create one as follows:
- (Recommended) Follow Creating a TiDB Cloud Serverless cluster to create your own TiDB Cloud cluster.
- Follow Deploy a local test TiDB cluster or Deploy a production TiDB cluster to create a local cluster of v8.4.0 or a later version.
This section provides step-by-step instructions for integrating TiDB Vector Search with LlamaIndex to perform semantic searches.
In the root directory, create a new Jupyter Notebook file named `integrate_with_llamaindex.ipynb`:

```shell
touch integrate_with_llamaindex.ipynb
```
In your project directory, run the following commands to install the required packages:

```shell
pip install llama-index-vector-stores-tidbvector
pip install llama-index
```
Open the `integrate_with_llamaindex.ipynb` file in Jupyter Notebook and add the following code to import the required packages:

```python
import textwrap

from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.tidbvector import TiDBVectorStore
```
Configure the environment variables depending on the TiDB deployment option you've selected.
For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables:
1. Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.

2. Click Connect in the upper-right corner. A connection dialog is displayed.

3. Ensure the configurations in the connection dialog match your operating environment.

    - Connection Type is set to `Public`.
    - Branch is set to `main`.
    - Connect With is set to `SQLAlchemy`.
    - Operating System matches your environment.

4. Click the PyMySQL tab and copy the connection string.

    Tip:

    If you have not set a password yet, click Generate Password to generate a random password.

5. Configure environment variables.
This document uses OpenAI as the embedding model provider. In this step, you need to provide the connection string obtained from the previous step and your OpenAI API key.

To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key:

```python
# Use getpass to securely prompt for environment variables in your terminal.
import getpass
import os

# Copy your connection string from the TiDB Cloud console.
# Connection string format: "mysql+pymysql://<USER>:<PASSWORD>@<HOST>:4000/<DB>?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
tidb_connection_string = getpass.getpass("TiDB Connection String:")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
```
For a TiDB Self-Managed cluster, this document also uses OpenAI as the embedding model provider. In this step, you need to provide the connection string of your TiDB cluster and your OpenAI API key.

To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key:

```python
# Use getpass to securely prompt for environment variables in your terminal.
import getpass
import os

# Connection string format: "mysql+pymysql://<USER>:<PASSWORD>@<HOST>:4000/<DB>?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
tidb_connection_string = getpass.getpass("TiDB Connection String:")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
```
Taking macOS as an example, the cluster connection string is as follows:

```shell
TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE_NAME>"
# For example: TIDB_DATABASE_URL="mysql+pymysql://[email protected]:4000/test"
```

You need to modify the parameters in the connection string according to your TiDB cluster. If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field.

The following are descriptions for each parameter:

- `<USERNAME>`: The username to connect to the TiDB cluster.
- `<PASSWORD>`: The password to connect to the TiDB cluster.
- `<HOST>`: The host of the TiDB cluster.
- `<PORT>`: The port of the TiDB cluster.
- `<DATABASE_NAME>`: The name of the database you want to connect to.
In your project directory, create a directory named `data/paul_graham/` and download the sample document `paul_graham_essay.txt` from the run-llama/llama_index GitHub repository.

```shell
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
```
Load the sample document from `data/paul_graham/paul_graham_essay.txt` using the `SimpleDirectoryReader` class, and tag each loaded document with a `book` metadata field:

```python
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)

for index, document in enumerate(documents):
    document.metadata = {"book": "paul_graham"}
```
The following code creates a table named `paul_graham_test` in TiDB, which is optimized for vector search.

```python
tidbvec = TiDBVectorStore(
    connection_string=tidb_connection_string,
    table_name="paul_graham_test",
    distance_strategy="cosine",
    vector_dimension=1536,
    drop_existing_table=False,
)
```
Upon successful execution, you can directly view and access the `paul_graham_test` table in your TiDB database.
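The `distance_strategy="cosine"` setting determines how TiDB ranks stored vectors at query time. As an illustrative sketch (pure Python, not the implementation TiDB uses), cosine distance is 1 minus the cosine of the angle between two embedding vectors:

```python
import math

# Illustrative sketch of cosine distance: 1 - (a . b) / (|a| * |b|).
# TiDB computes this server-side over 1536-dimensional embedding columns;
# the tiny vectors here only demonstrate the formula.
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 -> identical direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 -> orthogonal
```

A smaller distance means the two embeddings point in more similar directions, which is why nearest-neighbor results are the most semantically similar chunks.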
The following code parses the documents, generates embeddings, and stores them in the TiDB vector store.

```python
storage_context = StorageContext.from_defaults(vector_store=tidbvec)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)
```
The expected output is as follows:

```
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 8.76it/s]
Generating embeddings: 100%|██████████| 21/21 [00:02<00:00, 8.22it/s]
```
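Note that a single source document produces 21 embeddings: LlamaIndex splits the essay into nodes and embeds each node separately. A toy fixed-size chunker (LlamaIndex's real node parser is sentence-aware with a token budget, so this only approximates the idea) shows why one document yields many embeddings:

```python
# Toy fixed-size chunker illustrating why one document yields many embeddings.
# LlamaIndex's actual node parser splits on sentence boundaries within a token
# budget; this character-based version only demonstrates the chunk/overlap idea.
def chunk(text, size=1024, overlap=200):
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

essay = "x" * 20000  # stand-in for the text of paul_graham_essay.txt
print(len(chunk(essay)))  # one embedding is generated per chunk
```

Each chunk becomes one row in the `paul_graham_test` table, holding the chunk text, its metadata, and its embedding.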
The following code creates a query engine based on the TiDB vector store and performs a semantic similarity search.

```python
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
```

Note:

`TiDBVectorStore` only supports the `default` query mode.
The expected output is as follows:

```
The author worked on writing, programming, building microcomputers, giving talks at conferences,
publishing essays online, developing spam filters, painting, hosting dinner parties, and purchasing
a building for office use.
```
To refine your searches, you can use metadata filters to retrieve specific nearest-neighbor results that match the applied filters.

The following example excludes results where the `book` metadata field is `"paul_graham"`:
```python
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="book", value="paul_graham", operator="!="),
        ]
    ),
    similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
```
The expected output is as follows:

```
Empty Response
```
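The empty response is expected: every document was tagged with the `book` metadata field set to `"paul_graham"` earlier, so a `!=` filter excludes the entire corpus before any similarity ranking happens. A toy model of the filtering step (plain Python, not the actual LlamaIndex or TiDB implementation) makes this concrete:

```python
# Toy model of metadata filtering: every document carries {"book": "paul_graham"},
# so a "!=" filter matches nothing while "==" matches everything.
docs = [{"id": i, "metadata": {"book": "paul_graham"}} for i in range(3)]

def apply_filter(docs, key, value, op):
    # Keep only the documents whose metadata satisfies the operator.
    if op == "==":
        return [d for d in docs if d["metadata"].get(key) == value]
    if op == "!=":
        return [d for d in docs if d["metadata"].get(key) != value]
    raise ValueError(f"unsupported operator: {op}")

print(len(apply_filter(docs, "book", "paul_graham", "!=")))  # 0
print(len(apply_filter(docs, "book", "paul_graham", "==")))  # 3
```

With zero candidates surviving the filter, the query engine has nothing to synthesize an answer from, hence `Empty Response`.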
The following example filters results to include only documents where the `book` metadata field is `"paul_graham"`:
```python
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="book", value="paul_graham", operator="=="),
        ]
    ),
    similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
```
The expected output is as follows:

```
The author learned programming on an IBM 1401 using an early version of Fortran in 9th grade, then
later transitioned to working with microcomputers like the TRS-80 and Apple II. Additionally, the
author studied philosophy in college but found it unfulfilling, leading to a switch to studying AI.
Later on, the author attended art school in both the US and Italy, where they observed a lack of
substantial teaching in the painting department.
```
Delete the first document from the index:

```python
tidbvec.delete(documents[0].doc_id)
```

Check whether the document has been deleted:

```python
query_engine = index.as_query_engine()
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
```
The expected output is as follows:

```
Empty Response
```