
[Bug]: TypeError: Query column vector must be a vector. Got list<item: double>. #1335

Open · 3 tasks done
zel2023 opened this issue Oct 30, 2024 · 10 comments
Labels
bug (Something isn't working) · triage (new issue, needs review by a maintainer)

Comments

@zel2023

zel2023 commented Oct 30, 2024

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

I attempted to write a Python file to run local search, using https://github.com/microsoft/graphrag/blob/main/docs/examples_notebooks/local_search.ipynb as a reference, but it failed.
However, using https://github.com/microsoft/graphrag/blob/main/docs/examples_notebooks/global_search.ipynb as a reference, I successfully wrote a Python file that runs global search.
Additionally, I successfully ran local search from the command line by following https://github.com/microsoft/graphrag/blob/94f1e62e5c06795fc8c361dba6580bb76d6e77ce/docs/get_started.md.
Below is the error message:

Entity count: 3
Relationship count: 2
Report records: 1
Text unit records: 1
Traceback (most recent call last):
  File "/data/zelongzheng/graphrag-main/local_search.py", line 172, in <module>
    asyncio.run(main())
  File "/home/zelongzheng/anaconda3/envs/graphrag/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/zelongzheng/anaconda3/envs/graphrag/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zelongzheng/anaconda3/envs/graphrag/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/data/zelongzheng/graphrag-main/local_search.py", line 167, in main
    result = await search_engine.asearch("what is the relationship between xiaozhang and xiaoming?")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/zelongzheng/graphrag-main/graphrag/query/structured_search/local_search/search.py", line 67, in asearch
    context_text, context_records = self.context_builder.build_context(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/zelongzheng/graphrag-main/graphrag/query/structured_search/local_search/mixed_context.py", line 140, in build_context
    selected_entities = map_query_to_entities(
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/data/zelongzheng/graphrag-main/graphrag/query/context_builder/entity_extraction.py", line 57, in map_query_to_entities
    search_results = text_embedding_vectorstore.similarity_search_by_text(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/zelongzheng/graphrag-main/graphrag/vector_stores/lancedb.py", line 136, in similarity_search_by_text
    return self.similarity_search_by_vector(query_embedding, k)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/zelongzheng/graphrag-main/graphrag/vector_stores/lancedb.py", line 115, in similarity_search_by_vector
    .to_list()
     ^^^^^^^^^
  File "/home/zelongzheng/anaconda3/envs/graphrag/lib/python3.11/site-packages/lancedb/query.py", line 320, in to_list
    return self.to_arrow().to_pylist()
           ^^^^^^^^^^^^^^^
  File "/home/zelongzheng/anaconda3/envs/graphrag/lib/python3.11/site-packages/lancedb/query.py", line 647, in to_arrow
    return self.to_batches().read_all()
           ^^^^^^^^^^^^^^^^^
  File "/home/zelongzheng/anaconda3/envs/graphrag/lib/python3.11/site-packages/lancedb/query.py", line 678, in to_batches
    result_set = self._table._execute_query(query, batch_size)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zelongzheng/anaconda3/envs/graphrag/lib/python3.11/site-packages/lancedb/table.py", line 1742, in _execute_query
    return ds.scanner(
           ^^^^^^^^^^^
  File "/home/zelongzheng/anaconda3/envs/graphrag/lib/python3.11/site-packages/lance/dataset.py", line 369, in scanner
    builder = builder.nearest(**nearest)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zelongzheng/anaconda3/envs/graphrag/lib/python3.11/site-packages/lance/dataset.py", line 2449, in nearest
    raise TypeError(
TypeError: Query column vector must be a vector. Got list<item: double>.

Steps to reproduce

1. pip install graphrag==0.3.6

2. Build the graph, then run python -m graphrag.query --root ./ragtest --method local "what is the relationship between xiaozhang and xiaoming?" to confirm that the CLI query succeeds.

3. Write a Python file using https://github.com/microsoft/graphrag/blob/main/docs/examples_notebooks/local_search.ipynb as a reference. Modify certain parts of the file: change INPUT_DIR, comment out all variables related to covariates (no related files were generated when building the graph), and set API_KEY, llm_model (deepseek-chat), embedding_model (text-embedding-3-small), and api_base="https://api.agicto.cn/v1".

4. Run the file with Python.

Expected Behavior

I expect it to respond to my query.

GraphRAG Config Used

graphrag=0.3.6
llm = "deepseek-chat"
embedding_model = "text-embedding-3-small"
api_base = "https://api.agicto.cn/v1"

Logs and screenshots

[screenshot attached]

Additional Information

  • GraphRAG Version : 0.3.6
  • Operating System : Ubuntu 20.04.4 LTS
  • Python Version : 3.11.10
  • Related Issues:
zel2023 added the bug and triage labels on Oct 30, 2024
@EniacTNB

EniacTNB commented Nov 4, 2024

Did you solve that?

@zel2023
Author

zel2023 commented Nov 4, 2024

I did not solve that.

@yuanzhoulvpi2017

I got the same issue. Here is a tiny reproduction; this bug is really weird.

import numpy as np
import json
import pyarrow as pa
import lancedb


class TextEmbeder:
    def __init__(self) -> None:
        pass

    def encode(self, x):
        # fake embedder: ignores the input and returns a random 3-dimensional vector
        return np.abs(np.around(np.random.randn(3), 3)).tolist()


textembeder = TextEmbeder()
textembeder.encode("a")[:3]


def build_fake_data():
    res = []
    for index in range(1000):
        id_ = str(index)
        text = f"hello test {index}"
        vector = textembeder.encode(text)
        extra_data = json.dumps(
            {"attr1": index, "attr2": index * 2, "attr3": "B站"}, ensure_ascii=True
        )

        res.append(
            {"id": id_, "text": text, "extra_data": extra_data, "vector": vector}
        )

    return res


fake_data_list = build_fake_data()
fake_data_list[0].keys()


db_url = "data/database"
db_table_name = "smalltest"
db_connection = lancedb.connect(db_url)


db_connection.table_names()


# NOTE: the vector column is declared as a variable-length list here,
# which is what triggers "Query column vector must be a vector" at query time
schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float64())),
        pa.field("id", pa.string()),
        pa.field("text", pa.string()),
        pa.field("extra_data", pa.string()),
    ]
)

db_connection.create_table(name=db_table_name, schema=schema, mode="overwrite")


table = db_connection.open_table(db_table_name)
table.add(fake_data_list)


query_vector = textembeder.encode("hh")
query_vector[:4]


docs = (
    table.search(query=query_vector, vector_column_name="vector", query_type="vector")
    .limit(4)
    .to_list()
)

print(docs)

env:

 pip show lancedb            
Name: lancedb
Version: 0.13.0
Summary: lancedb
Home-page: 
Author: 
Author-email: LanceDB Devs <[email protected]>
License: Apache-2.0
Location: /data2/miniconda3/envs/hz_grahrag/lib/python3.11/site-packages
Requires: attrs, cachetools, deprecation, overrides, packaging, pydantic, pylance, requests, retry, tqdm
Required-by: graphrag


Name: pyarrow
Version: 15.0.2
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author: 
Author-email: 
License: Apache License, Version 2.0
Location: /data2/miniconda3/envs/hz_grahrag/lib/python3.11/site-packages
Requires: numpy
Required-by: datasets, datashaper, graphrag, pylance, streamlit

@yuanzhoulvpi2017

I fixed this issue. In the schema you need to set the vector size: pa.field("vector", pa.list_(pa.float32(), list_size=DIM_VALUE))

schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32(), list_size=DIM_VALUE)),
        pa.field("id", pa.string()),
        pa.field("text", pa.string()),
        pa.field("extra_data", pa.string()),
    ]
)
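
Applied to the minimal reproduction above (reusing its names), a sketch of the corrected tail of the script; here DIM_VALUE is 3 to match the fake 3-dimensional embedder, and for a real model you would use its output dimension (e.g. 1536 for text-embedding-3-small):

DIM_VALUE = 3  # must equal the length of the vectors you store and query

schema = pa.schema(
    [
        # fixed-size list: Lance now treats this column as a proper vector type
        pa.field("vector", pa.list_(pa.float32(), list_size=DIM_VALUE)),
        pa.field("id", pa.string()),
        pa.field("text", pa.string()),
        pa.field("extra_data", pa.string()),
    ]
)

db_connection.create_table(name=db_table_name, schema=schema, mode="overwrite")
table = db_connection.open_table(db_table_name)
table.add(fake_data_list)

# this search previously raised "TypeError: Query column vector must be a vector."
docs = (
    table.search(query=query_vector, vector_column_name="vector", query_type="vector")
    .limit(4)
    .to_list()
)
print(docs)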

@zel2023
Author

zel2023 commented Nov 6, 2024

Congratulations! However, there is no “schema” in my code:

import os

import pandas as pd
import tiktoken
import asyncio
from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.input.loaders.dfs import (
    store_entity_semantic_embeddings,
)
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.question_gen.local_gen import LocalQuestionGen
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores.lancedb import LanceDBVectorStore

INPUT_DIR = "xx"
LANCEDB_URI = f"{INPUT_DIR}/lancedb"

COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"
RELATIONSHIP_TABLE = "create_final_relationships"
#COVARIATE_TABLE = "create_final_covariates"
TEXT_UNIT_TABLE = "create_final_text_units"
COMMUNITY_LEVEL = 2

# read nodes table to get community and degree data
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")

entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)

# load description embeddings to an in-memory lancedb vectorstore
# to connect to a remote db, specify url and port values.
description_embedding_store = LanceDBVectorStore(
    collection_name="entity_description_embeddings",
)
description_embedding_store.connect(db_uri=LANCEDB_URI)
entity_description_embeddings = store_entity_semantic_embeddings(
    entities=entities, vectorstore=description_embedding_store
)

print(f"Entity count: {len(entity_df)}")
entity_df.head()

relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)

print(f"Relationship count: {len(relationship_df)}")
relationship_df.head()

# NOTE: covariates are turned off by default, because they generally need prompt tuning to be valuable
# Please see the GRAPHRAG_CLAIM_* settings
#covariate_df = pd.read_parquet(f"{INPUT_DIR}/{COVARIATE_TABLE}.parquet")

#claims = read_indexer_covariates(covariate_df)

#print(f"Claim records: {len(claims)}")
#covariates = {"claims": claims}

report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)

print(f"Report records: {len(report_df)}")
report_df.head()

text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)

print(f"Text unit records: {len(text_unit_df)}")
text_unit_df.head()

api_key = "xx"
llm_model = "deepseek-chat"
embedding_model = "text-embedding-3-small"

#embedding_model = os.environ["text-embedding-3-small"]

llm = ChatOpenAI(
    api_key=api_key,
    model=llm_model,
    api_type=OpenaiApiType.OpenAI,  # OpenaiApiType.OpenAI or OpenaiApiType.AzureOpenAI
    max_retries=20,
    api_base="https://api.agicto.cn/v1"
)

token_encoder = tiktoken.get_encoding("cl100k_base")

text_embedder = OpenAIEmbedding(
    api_key=api_key,
    api_base="https://api.agicto.cn/v1",
    api_type=OpenaiApiType.OpenAI,
    model=embedding_model,
    deployment_name=embedding_model,
    max_retries=20,
)

context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    # if you did not run covariates during indexing, set this to None
    covariates=None,
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.TITLE,  # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)



local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.TITLE,  # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids
    "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
}

llm_params = {
    "max_tokens": 2_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000=1500)
    "temperature": 0.0,
}

search_engine = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

async def main():
    result = await search_engine.asearch("what is the relationship between xiaozhang and xiaoming?")
    print(result.response)


if __name__ == "__main__":
    asyncio.run(main())


@EachSheep

I met this problem too. Does anyone have any suggestions?

@Taylor180520

I had the same issue, and my problem was solved by adjusting the command.
Previously, I used "graphrag.index --root ./ragtest" for indexing and "graphrag.query --root ./ragtest --method local "explain the relationship between Jay and May."" for querying.

Then I used "python -m graphrag.index --root ./ragtest" and "python -m graphrag.query --root ./ragtest --method local "explain the relationship between Jay and May."", and that solved my problem.

To sum up, in my particular case I didn't include "python -m" in my execution commands, and that turned out to be the problem.

@MBRSL

MBRSL commented Nov 8, 2024

TL;DR: Don't use the scripts in the notebook examples, since they are outdated. You can just call graphrag.api.global_search() directly. See the examples in graphrag/cli/query.py.

This is due to backward-compatibility handling related to lancedb: there is some code that patches lancedb files. You'll need to handle this kind of thing yourself if you decide to call lower-level APIs like read_indexer_entities().
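
For reference, a rough sketch of that high-level route. Treat this as an outline only: the exact parameter names of graphrag.api.global_search and the location of load_config differ between graphrag versions, so check graphrag/cli/query.py in your installed version before copying it.

import asyncio
from pathlib import Path

import pandas as pd

import graphrag.api as api
from graphrag.config import load_config  # module path may differ in your version

root = Path("./ragtest")
config = load_config(root)
output = root / "output"

# keyword arguments below follow the 0.4.x CLI and may need adjusting
response, context = asyncio.run(
    api.global_search(
        config=config,
        nodes=pd.read_parquet(output / "create_final_nodes.parquet"),
        entities=pd.read_parquet(output / "create_final_entities.parquet"),
        community_reports=pd.read_parquet(output / "create_final_community_reports.parquet"),
        community_level=2,
        response_type="multiple paragraphs",
        query="What are the top themes in this story?",
    )
)
print(response)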

@EachSheep

EachSheep commented Nov 8, 2024

locol_query.log

When I used the local_search query example shown at "https://microsoft.github.io/graphrag/examples_notebooks/local_search/", I encountered the same error as @zel2023. Here is my code:

import os
import asyncio
import argparse
import tiktoken
from transformers import AutoTokenizer
import pandas as pd
from dotenv import load_dotenv
load_dotenv()

from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.input.loaders.dfs import store_entity_semantic_embeddings
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.vector_stores.lancedb import LanceDBVectorStore

# local
from graphrag.query.structured_search.local_search.mixed_context import LocalSearchMixedContext
from graphrag.query.question_gen.local_gen import LocalQuestionGen
from graphrag.query.structured_search.local_search.search import LocalSearch

# global
from graphrag.query.structured_search.global_search.community_context import GlobalCommunityContext
from graphrag.query.structured_search.global_search.search import GlobalSearch

INPUT_DIR = "./ragtest/output/"
LANCEDB_URI = f"{INPUT_DIR}/lancedb"

COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"
RELATIONSHIP_TABLE = "create_final_relationships"
COVARIATE_TABLE = "create_final_covariates"
TEXT_UNIT_TABLE = "create_final_text_units"
COMMUNITY_LEVEL = 2
HOME = os.getenv("HOME")
ENCODER_MODEL_PATH = f"{HOME}/Models/stella_en_1.5B_v5"

# read nodes table to get community and degree data
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
# print("entity_df.head():")
# print(entity_df.head())
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")
# print("entity_embedding_df.head():")
# print(entity_embedding_df.head())
entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)
# load description embeddings to an in-memory lancedb vectorstore to connect to a remote db, specify url and port values.
description_embedding_store = LanceDBVectorStore(collection_name="entity.description")
description_embedding_store.connect(db_uri=LANCEDB_URI)
entity_description_embeddings = store_entity_semantic_embeddings(entities=entities, vectorstore=description_embedding_store)
# print(f"Entity count: {len(entity_df)}")
# print("entity_description_embeddings.head():")
# print(entity_description_embeddings.head())

relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)
# print(f"Relationship count: {len(relationship_df)}")
# print("relationship_df.head():")
# print(relationship_df.head())

report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)
# pd.set_option('display.max_colwidth', None)
# pd.set_option('display.max_colwidth', 50)
print(f"Report records: {len(report_df)}")
# print("report_df.head(1):")
# print(report_df.head(1))

text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)
# print(f"Text unit records: {len(text_unit_df)}")
# print("text_unit_df.head():")
# print(text_unit_df.head())

api_key = os.environ["GRAPHRAG_API_KEY"]
api_base = os.environ["GRAPHRAG_API_BASE"]
llm_model = os.environ["GRAPHRAG_LLM_MODEL"]
llm = ChatOpenAI(
    api_key=api_key,
    api_base=api_base,
    model=llm_model,
    api_type=OpenaiApiType.OpenAI,  # OpenaiApiType.OpenAI or OpenaiApiType.AzureOpenAI
    max_retries=20,
)

embedding_model = os.environ["GRAPHRAG_EMBEDDING_MODEL"]
embedding_api_base = os.environ["GRAPHRAG_EMBEDDING_API_BASE"]
# token_encoder = tiktoken.get_encoding("cl100k_base")
token_encoder = AutoTokenizer.from_pretrained(ENCODER_MODEL_PATH)
# print("embedding_model:", embedding_model)
# print("embedding_api_base:", embedding_api_base)

def local_search():
    text_embedder = OpenAIEmbedding(
        api_key=api_key,
        api_base=embedding_api_base,
        api_type=OpenaiApiType.OpenAI,
        model=embedding_model,
        deployment_name=embedding_model,
        max_retries=20,
    )
    context_builder = LocalSearchMixedContext(
        community_reports=reports,
        text_units=text_units,
        entities=entities,
        relationships=relationships,
        # if you did not run covariates during indexing, set this to None
        # covariates=covariates,
        entity_text_embeddings=description_embedding_store,
        embedding_vectorstore_key=EntityVectorStoreKey.ID,  # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE
        text_embedder=text_embedder,
        token_encoder=token_encoder,
    )
    # text_unit_prop: proportion of context window dedicated to related text units
    # community_prop: proportion of context window dedicated to community reports.
    # The remaining proportion is dedicated to entities and relationships. Sum of text_unit_prop and community_prop should be <= 1
    # conversation_history_max_turns: maximum number of turns to include in the conversation history.
    # conversation_history_user_turns_only: if True, only include user queries in the conversation history.
    # top_k_mapped_entities: number of related entities to retrieve from the entity description embedding store.
    # top_k_relationships: control the number of out-of-network relationships to pull into the context window.
    # include_entity_rank: if True, include the entity rank in the entity table in the context window. Default entity rank = node degree.
    # include_relationship_weight: if True, include the relationship weight in the context window.
    # include_community_rank: if True, include the community rank in the context window.
    # return_candidate_context: if True, return a set of dataframes containing all candidate entity/relationship/covariate records that
    # could be relevant. Note that not all of these records will be included in the context window. The "in_context" column in these
    # dataframes indicates whether the record is included in the context window.
    # max_tokens: maximum number of tokens to use for the context window.
    local_context_params = {
        "text_unit_prop": 0.5,
        "community_prop": 0.1,
        "conversation_history_max_turns": 5,
        "conversation_history_user_turns_only": True,
        "top_k_mapped_entities": 10,
        "top_k_relationships": 10,
        "include_entity_rank": True,
        "include_relationship_weight": True,
        "include_community_rank": False,
        "return_candidate_context": False,
        "embedding_vectorstore_key": EntityVectorStoreKey.ID,  # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids
        "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
    }
    llm_params = {
        "max_tokens": 2_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000=1500)
        "temperature": 0.0,
    }
    search_engine = LocalSearch(
        llm=llm,
        context_builder=context_builder,
        token_encoder=token_encoder,
        llm_params=llm_params,
        context_builder_params=local_context_params,
        response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
    )
    async def local_async_search_engine():
        result = await search_engine.asearch("Tell me about Agent Mercer")
        # print(result)
        return result
    result = asyncio.run(local_async_search_engine())
    print(result.response)
    print(result.context_data["entities"].head())
    print(result.context_data["relationships"].head())
    print(result.context_data["reports"].head())
    print(result.context_data["sources"].head())
    if "claims" in result.context_data:
        print(result.context_data["claims"].head())
    
    # question generation
    question_generator = LocalQuestionGen(
        llm=llm,
        context_builder=context_builder,
        token_encoder=token_encoder,
        llm_params=llm_params,
        context_builder_params=local_context_params,
    )
    question_history = [
        "Tell me about Agent Mercer",
        "What happens in Dulce military base?",
    ]
    async def local_async_question_generation():
        result = await question_generator.agenerate(question_history=question_history, context_data=None, question_count=5)
        # print(result)
        return result
    candidate_questions = asyncio.run(local_async_question_generation())
    print(candidate_questions.response)

local_search()

# For global
def global_search():
    context_builder = GlobalCommunityContext(
        community_reports=reports,
        entities=entities,  # default to None if you don't want to use community weights for ranking
        token_encoder=token_encoder,
    )
    context_builder_params = {
        "use_community_summary": False,  # False means using full community reports. True means using community short summaries.
        "shuffle_data": True,
        "include_community_rank": True,
        "min_community_rank": 0,
        "community_rank_name": "rank",
        "include_community_weight": True,
        "community_weight_name": "occurrence weight",
        "normalize_community_weight": True,
        "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
        "context_name": "Reports",
    }
    map_llm_params = {
        "max_tokens": 1000,
        "temperature": 0.0,
        "response_format": {"type": "json_object"},
    }
    reduce_llm_params = {
        "max_tokens": 2000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)
        "temperature": 0.0,
    }
    search_engine = GlobalSearch(
        llm=llm,
        context_builder=context_builder,
        token_encoder=token_encoder,
        max_data_tokens=12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
        map_llm_params=map_llm_params,
        reduce_llm_params=reduce_llm_params,
        allow_general_knowledge=False,  # set this to True will add instruction to encourage the LLM to incorporate general knowledge in the response, which may increase hallucinations, but could be useful in some use cases.
        json_mode=True,  # set this to False if your LLM model does not support JSON mode.
        context_builder_params=context_builder_params,
        concurrent_coroutines=32,
        response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
    )
    async def global_async_search_engine():
        result = await search_engine.asearch(
            "What is the major conflict in this story and who are the protagonist and antagonist?"
        )
        # print(result)
        return result
    result = asyncio.run(global_async_search_engine())

    print(result.response)
    # inspect the data used to build the context for the LLM responses
    print(result.context_data["reports"])
    # inspect number of LLM calls and tokens
    print(f"LLM calls: {result.llm_calls}. LLM tokens: {result.prompt_tokens}"

Here is the error log:

Report records: 79
Traceback (most recent call last):
  File "/home/xxx/Source/cyowcopy/existing_graphrags/query_gen.py", line 193, in <module>
    local_search()
  File "/home/xxx/Source/cyowcopy/existing_graphrags/query_gen.py", line 165, in local_search
    result = asyncio.run(local_async_search_engine())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/xxx/Source/cyowcopy/existing_graphrags/query_gen.py", line 162, in local_async_search_engine
    result = await search_engine.asearch("Tell me about Agent Mercer")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/graphrag/query/structured_search/local_search/search.py", line 67, in asearch
    context_text, context_records = self.context_builder.build_context(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/graphrag/query/structured_search/local_search/mixed_context.py", line 140, in build_context
    selected_entities = map_query_to_entities(
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/graphrag/query/context_builder/entity_extraction.py", line 57, in map_query_to_entities
    search_results = text_embedding_vectorstore.similarity_search_by_text(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/graphrag/vector_stores/lancedb.py", line 136, in similarity_search_by_text
    return self.similarity_search_by_vector(query_embedding, k)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/graphrag/vector_stores/lancedb.py", line 115, in similarity_search_by_vector
    .to_list()
     ^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/lancedb/query.py", line 320, in to_list
    return self.to_arrow().to_pylist()
           ^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/lancedb/query.py", line 648, in to_arrow
    return self.to_batches().read_all()
           ^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/lancedb/query.py", line 680, in to_batches
    result_set = self._table._execute_query(query, batch_size)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/lancedb/table.py", line 1742, in _execute_query
    return ds.scanner(
           ^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/lance/dataset.py", line 369, in scanner
    builder = builder.nearest(**nearest)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/lance/dataset.py", line 2449, in nearest
    raise TypeError(
TypeError: Query column vector must be a vector. Got list<item: double>.

The command I used to generate the index is graphrag index --root ./ragtest, and the version of graphrag is 0.4.0.

When I use

graphrag query \
    --root ./ragtest \
    --method global \
    --query "What are the top themes in this story?"

I get the correct result:

graphrag query \
--root ./ragtest \
--method global \
--query "What are the top themes in this story?"
/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/home/xxx/anaconda3/envs/CEPE/lib/python3.12/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "class": algorithms.TripleDES,


creating llm client with {'api_key': 'REDACTED,len=13', 'type': "openai_chat", 'model': 'qwen2a5-72b-instruct', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': 'http://1.11.11.11:11/', 'api_version': None, 'organization': None, 'proxy': None, 'audience': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response:
# Top Themes in "A Christmas Carol"

## Transformation and Redemption
The central theme of "A Christmas Carol" is the transformation of Ebenezer Scrooge from a miserly, cold-hearted businessman to a generous and compassionate individual. This transformation is driven by the ghostly visitations of Jacob Marley and the spirits of Christmas Past, Present, and Yet to Come. These spirits show Scrooge the consequences of his actions and the importance of compassion and generosity, leading to a profound change in his character and behavior [Data: Reports (77, 28, 66, 22, 13, 34, 45, 75, 53, +more)].

## The Importance of Family and Community
The story emphasizes the value of family and community, particularly through the Cratchit family's strong bonds and their ability to find joy and hope despite their poverty. The community's support and the spirit of togetherness are highlighted in various scenes, such as the Cratchit family's Christmas celebration and the lighthouse keepers' camaraderie. Characters such as Scrooge's nephew Fred and the Spirit of Christmas Present also emphasize the importance of family, social connections, and the need to help those in need [Data: Reports (77, 75, 53, 61, 70, 26, 39, 42, 58, 80, 31, 49, 83, +more)].

## The Spirit of Christmas
The story celebrates the spirit of Christmas, characterized by generosity, kindness, and the joy of giving. The various Christmas celebrations, such as the Cratchit family's meal, the lighthouse keepers' toast, and Fezziwig's party, underscore the festive and communal aspects of the holiday. The theme of the Christmas spirit is a recurring element, highlighting the importance of these values during the holiday season [Data: Reports (77, 66, 75, 53, 45, 52, 2, 54, 50, +more)].

## The Power of Memory and Reflection
The spirits of Christmas Past, Present, and Yet to Come take Scrooge on a journey through his memories, the present, and the future. This journey allows Scrooge to reflect on his life and the consequences of his actions, leading to his transformation. The power of memory and reflection is a significant theme, emphasizing the influence of past experiences on one's present and future actions [Data: Reports (61, 70, 26, 31, 49, 83, 77, 21, 81, +more)].

## The Contrast Between Wealth and Poverty
The story highlights the stark contrast between Scrooge's wealth and the poverty of the Cratchit family. This contrast is used to illustrate the social and economic disparities of the time and the moral implications of Scrooge's miserly behavior. The theme of wealth and poverty serves to emphasize the importance of empathy and the need to help those less fortunate [Data: Reports (61, 70, 39, 58, 11, 18, 78, 51, 62, 30, 8, 57, 25, 47, 24, 40, 27, +more)].

## The Role of the Supernatural
The supernatural elements, such as the ghosts of Christmas Past, Present, and Yet to Come, play a crucial role in Scrooge's transformation. These ghostly visitations serve as catalysts for Scrooge's moral and spiritual awakening, highlighting the power of supernatural intervention in bringing about change. The blending of supernatural elements with the mundane aspects of daily life adds to the atmospheric richness and the transformative nature of Scrooge's journey [Data: Reports (18, 78, 51, 62, 30, 8, 57, 25, 47, 24, 40, 27, 52, 69, 65, 33, +more)].

## Generosity and Charity
The theme of generosity and charity is evident through the actions of characters like the Spirit of Christmas Present, Scrooge's nephew, and the portly gentlemen. These characters demonstrate the importance of giving and helping others, especially during the Christmas season. Scrooge's transformation is marked by his newfound generosity, such as his decision to give a large turkey to the Cratchit family and his willingness to help the poor [Data: Reports (32, 72, 63, 60, 61, 70, 39, 42, 58, 11, +more)].

## The Consequences of One's Actions
The story explores the consequences of Scrooge's past and present actions, as revealed through the spirits' visions. The potential future, including the tragic fate of Tiny Tim, serves as a powerful motivator for Scrooge to change his ways and become a more compassionate and generous person. This theme underscores the moral and ethical dimensions of Scrooge's character and the potential for redemption [Data: Reports (77, 28, 13, 34, 61, 70, 39, 58, 11, +more)].

These themes collectively contribute to the rich and multifaceted narrative of "A Christmas Carol," making it a timeless and deeply resonant story.

However, when I try to use:

graphrag query \               
    --root ./ragtest \
    --method local \
    --query "Who is Scrooge and what are his main relationships?" > ./locol_query.log 2>&1

Errors still appear extensively:

Please refer to the attached file [locol_query.log](https://github.com/user-attachments/files/17677864/locol_query.log) to see the errors.

I tried to fix each error individually, but this led to a series of new errors.

In summary, it seems that --method local is not well-supported. Both the example provided in the notebook and directly using the related command result in errors:

graphrag query \
--root ./ragtest \
--method local \
--query "Who is Scrooge and what are his main relationships?" > ./local_query.log 2>&1

@edgentai

edgentai commented Nov 9, 2024

I was able to find a workaround. It seems like a minor bug in the lancedb code.

I changed the schema to this:

  N = 3072
  schema = pa.schema([
      pa.field("id", pa.string()),
      pa.field("text", pa.string()),
      pa.field("vector", pa.list_(pa.float64(), N)),
      pa.field("attributes", pa.string()),
  ])

where N is the length of the embedding vector produced by whatever embedding model you are using in your settings.yml file. I am using "text-embedding-3-large" from OpenAI, hence the number is 3072.

Location of the file to edit:

  1. If you built from source: graphrag/vector_stores/lancedb.py
  2. If you did pip install: /usr/local/lib/python3.10/dist-packages/graphrag/vector_stores/lancedb.py
     Line number: 53
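
If you are unsure what N is for your model, you can probe it once. A minimal sketch, assuming the openai v1 Python client and an OPENAI_API_KEY in the environment:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# request one embedding and measure its length
emb = client.embeddings.create(
    model="text-embedding-3-large",
    input="dimension probe",
).data[0].embedding

print(len(emb))  # 3072 for text-embedding-3-large, 1536 for text-embedding-3-small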
