Vector index for Pandas dataframe like shown in Advanced text to sql. #14855

mraguth · 2024-07-20T06:31:18Z

In the advanced text-to-SQL notebook, there was a code snippet that converts a CSV table and stores it in a vector index. Based on that code, can I save a pandas DataFrame to a vector index?

Code snippet for CSV to vector store (I want to do the same for a pandas DataFrame called 'df'):

Obtained from the LlamaIndex notebook: https://www.youtube.com/watch?v=L1o1VPVfbb0

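# Note: `sql_database` and `service_context` are assumed to be defined earlier in the notebook.
from llama_index.core import SQLDatabase, VectorStoreIndex, load_index_from_storage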
from sqlalchemy import text
from llama_index.core.schema import TextNode
from llama_index.core.storage import StorageContext
import os
from pathlib import Path
from typing import Dict


def index_all_tables(
    sql_database: SQLDatabase, table_index_dir: str = "table_index_dir"
) -> Dict[str, VectorStoreIndex]:
    """Index all tables."""
    if not Path(table_index_dir).exists():
        os.makedirs(table_index_dir)

    vector_index_dict = {}
    engine = sql_database.engine
    for table_name in sql_database.get_usable_table_names():
        print(f"Indexing rows in table: {table_name}")
        if not os.path.exists(f"{table_index_dir}/{table_name}"):
            # get all rows from table
            with engine.connect() as conn:
                cursor = conn.execute(text(f'SELECT * FROM "{table_name}"'))
                result = cursor.fetchall()
                row_tups = []
                for row in result:
                    row_tups.append(tuple(row))

            # index each row, put into vector store index
            nodes = [TextNode(text=str(t)) for t in row_tups]

            # put into vector store index (use OpenAIEmbeddings by default)
            index = VectorStoreIndex(nodes, service_context=service_context)

            # save index
            index.set_index_id("vector_index")
            index.storage_context.persist(f"{table_index_dir}/{table_name}")
        else:
            # rebuild storage context
            storage_context = StorageContext.from_defaults(
                persist_dir=f"{table_index_dir}/{table_name}"
            )
            # load index
            index = load_index_from_storage(
                storage_context, index_id="vector_index", service_context=service_context
            )
        vector_index_dict[table_name] = index

    return vector_index_dict


vector_index_dict = index_all_tables(sql_database)


@dosu · 2024-07-20T06:38:33Z

dosubot[bot]
bot Jul 20, 2024

To index a Pandas DataFrame in a similar manner to the provided code snippet for a SQL table, you can follow these steps. The key is to iterate over the DataFrame rows, create TextNode objects for each row, and then use VectorStoreIndex to store these nodes.

Here is an example of how you can achieve this:

import pandas as pd
from llama_index.core.schema import TextNode
from llama_index.core.storage import StorageContext
from llama_index.core import VectorStoreIndex
import os
from pathlib import Path

def index_dataframe(df: pd.DataFrame, index_dir: str = "dataframe_index_dir") -> VectorStoreIndex:
    """Index a Pandas DataFrame."""
    if not Path(index_dir).exists():
        os.makedirs(index_dir)

    # Convert DataFrame rows to TextNode objects
    nodes = [TextNode(text=str(row.to_dict())) for _, row in df.iterrows()]

    # Create a VectorStoreIndex with the nodes; `service_context` is not defined in this
    # snippet, so rely on the globally configured embedding model (OpenAI by default)
    index = VectorStoreIndex(nodes)

    # Save the index
    index.set_index_id("vector_index")
    index.storage_context.persist(index_dir)

    return index

# Example usage
data = {
    'city': ['Toronto', 'Tokyo', 'Berlin'],
    'population': [2930000, 13960000, 3645000]
}
df = pd.DataFrame(data)

vector_index = index_dataframe(df)

In this code:

The DataFrame rows are converted to TextNode objects.
A VectorStoreIndex is created with these nodes.
The index is saved to the specified directory.

This approach allows you to index a Pandas DataFrame similarly to how you would index rows from a SQL table ^[1].
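
One difference from the SQL version in the question: that function skips re-indexing when the table's directory already exists and loads the saved index instead. Below is a minimal, stdlib-only sketch of that build-or-load pattern. The names `build_or_load`, `expensive_build`, and `rows.json` are hypothetical, and a JSON file stands in for the persisted index; with LlamaIndex the load branch would use `StorageContext.from_defaults(persist_dir=...)` and `load_index_from_storage`, as in the question's snippet.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def build_or_load(index_dir: str, build: Callable[[], List[str]]) -> List[str]:
    """Return the cached row texts if index_dir already holds them, else build and persist."""
    cache_file = Path(index_dir) / "rows.json"  # hypothetical file name
    if cache_file.exists():
        # Already persisted: load instead of rebuilding
        return json.loads(cache_file.read_text())
    Path(index_dir).mkdir(parents=True, exist_ok=True)
    rows = build()
    cache_file.write_text(json.dumps(rows))
    return rows


build_calls = []  # records how often the expensive step actually runs


def expensive_build() -> List[str]:
    """Stand-in for embedding and indexing the DataFrame rows."""
    build_calls.append(1)
    return ["{'city': 'Toronto', 'population': 2930000}"]


with tempfile.TemporaryDirectory() as tmp:
    index_dir = str(Path(tmp) / "dataframe_index_dir")
    first = build_or_load(index_dir, expensive_build)   # builds and persists
    second = build_or_load(index_dir, expensive_build)  # loads from disk

print(first == second, len(build_calls))
```

The second call returns the same data without invoking the build step again, which is the point of persisting the index in the first place.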

To continue talking to Dosu, mention @dosu.

6 replies

mraguth Jul 20, 2024
Author

@dosu I have a DataFrame with 100 rows. However, after building vector_index, a vector query shows only two rows. How do I know that the whole DataFrame has been indexed by the code above?

dosubot[bot] bot Jul 20, 2024

To ensure that all 100 rows of your DataFrame have been indexed properly using the provided code, you can convert the VectorStoreIndex object to a pandas DataFrame and then check the number of rows. Here is how you can do it:

# Assuming `vector_index` is the instance of VectorStoreIndex created from the DataFrame

# Convert the indexed nodes back to a DataFrame
indexed_data = [node.text for node in vector_index.index_struct.nodes_dict.values()]
indexed_df = pd.DataFrame(indexed_data)

# Check the number of rows
if len(indexed_df) == 100:
    print("All 100 rows have been indexed properly.")
else:
    print(f"Expected 100 rows, but got {len(indexed_df)} rows.")

This code snippet will help you verify if all 100 rows have been indexed properly by comparing the number of rows in the original DataFrame with the number of indexed nodes.
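
A likely reason a query shows only two rows even when all 100 are indexed: a vector query returns only the top-k most similar nodes, not the whole index. In LlamaIndex that limit is the `similarity_top_k` argument (for example `vector_index.as_query_engine(similarity_top_k=10)`), and as far as I know its default is 2. The toy sketch below illustrates the idea with made-up 2-d vectors instead of real embeddings; `cosine` and `top_k` are hypothetical helpers, not LlamaIndex functions.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the texts of the k rows most similar to the query."""
    ranked = sorted(rows, key=lambda r: cosine(query, r[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# 100 toy "rows" with made-up vectors (assumption: not real embeddings)
rows = [(f"row {i}", (1.0, i / 100)) for i in range(100)]
query = (1.0, 0.0)

default_hits = top_k(query, rows)      # only 2 rows come back
wider_hits = top_k(query, rows, k=10)  # raise k to see more rows

print(len(rows), len(default_hits), len(wider_hits))
```

All 100 rows are stored in both cases; only the number of results returned per query changes.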


mraguth Jul 20, 2024
Author

@dosu AttributeError: 'str' object has no attribute 'text'

dosubot[bot] bot Jul 20, 2024

The AttributeError: 'str' object has no attribute 'text' error comes from the verification step, not from indexing. vector_index.index_struct.nodes_dict maps IDs to node ID strings, so each value is a plain str with no text attribute. The node objects themselves are kept in the docstore, so read them from vector_index.docstore.docs instead. Here is the full example with the corrected check:

import pandas as pd
from llama_index.core.schema import TextNode
from llama_index.core import VectorStoreIndex
import os
from pathlib import Path

def index_dataframe(df: pd.DataFrame, index_dir: str = "dataframe_index_dir") -> VectorStoreIndex:
    """Index a Pandas DataFrame."""
    if not Path(index_dir).exists():
        os.makedirs(index_dir)

    # Convert DataFrame rows to TextNode objects
    nodes = [TextNode(text=str(row.to_dict())) for _, row in df.iterrows()]

    # Create a VectorStoreIndex with the nodes; `service_context` is not defined in this
    # snippet, so rely on the globally configured embedding model (OpenAI by default)
    index = VectorStoreIndex(nodes)

    # Save the index
    index.set_index_id("vector_index")
    index.storage_context.persist(index_dir)

    return index

# Example usage
data = {
    'city': ['Toronto', 'Tokyo', 'Berlin'],
    'population': [2930000, 13960000, 3645000]
}
df = pd.DataFrame(data)

vector_index = index_dataframe(df)

# Verify that all rows have been indexed.
# docstore.docs maps node IDs to the node objects (index_struct.nodes_dict only holds ID strings).
indexed_data = [node.get_content() for node in vector_index.docstore.docs.values()]
indexed_df = pd.DataFrame(indexed_data)

# Check the number of rows
if len(indexed_df) == len(df):
    print("All rows have been indexed successfully.")
else:
    print(f"Expected {len(df)} rows, but got {len(indexed_df)} rows.")

The nodes variable should still be a list of TextNode objects, one per DataFrame row. With the default in-memory vector store, those nodes are kept in the docstore, so the check above counts them and compares the total with len(df) ^[1].


Answer selected by mraguth

mraguth Jul 20, 2024
Author

@dosu One last question. How do I create just the nodes from a pandas DataFrame, without storing them in the vector index? Can you change the function, please?

dosubot[bot] bot Jul 20, 2024

To create TextNode objects from a Pandas DataFrame without storing them in a VectorStoreIndex, you can use the following function:

import pandas as pd
from llama_index.core.schema import TextNode
from typing import List

def df_to_text_nodes(df: pd.DataFrame) -> List[TextNode]:
    """Convert DataFrame rows into TextNode objects."""
    text_nodes = []
    for index, row in df.iterrows():
        content = " ".join(row.astype(str).tolist())
        # metadata and relationships are dict fields that default to empty dicts,
        # so leave them out rather than passing None
        text_node = TextNode(text=content, id_=str(index))
        text_nodes.append(text_node)
    return text_nodes

# Example usage
data = {
    'city': ['Toronto', 'Tokyo', 'Berlin'],
    'population': [2930000, 13960000, 3645000]
}
df = pd.DataFrame(data)

text_nodes = df_to_text_nodes(df)
for node in text_nodes:
    print(node.text)

This function converts each row of the DataFrame into a TextNode object and returns a list of these nodes. You can then use these nodes as needed without storing them in a VectorStoreIndex ^[1]^[2]^[3].
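
Note that joining only the values drops the column names, whereas the earlier `str(row.to_dict())` approach keeps them in the node text. Which one retrieves better depends on your data, so treat the following as a small illustration rather than a recommendation. It uses plain dicts as a stand-in for `df.to_dict("records")` so it runs without pandas; `row_as_values` and `row_as_pairs` are hypothetical helper names.

```python
from typing import Any, Dict, List

# Stand-in for df.to_dict("records") on the example DataFrame
records: List[Dict[str, Any]] = [
    {"city": "Toronto", "population": 2930000},
    {"city": "Tokyo", "population": 13960000},
    {"city": "Berlin", "population": 3645000},
]


def row_as_values(record: Dict[str, Any]) -> str:
    """Values only, like " ".join(row.astype(str).tolist())."""
    return " ".join(str(value) for value in record.values())


def row_as_pairs(record: Dict[str, Any]) -> str:
    """Column names and values, so the text says what each number means."""
    return ", ".join(f"{key}: {value}" for key, value in record.items())


for record in records:
    print(row_as_values(record))
    print(row_as_pairs(record))
```

Either string can be passed as the `text` of a `TextNode`; the second form tells the embedding model that 2930000 is a population rather than an arbitrary number.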

