Showing duplicate points in qdrant db #14755

iam-amiya · 2024-07-15T05:57:00Z

I am using Qdrant as my vector DB and an ingestion pipeline to create nodes.
This is my ingestion pipeline code:

import time

import nest_asyncio
from llama_index.core.extractors import SummaryExtractor
from llama_index.core.ingestion import IngestionPipeline

nest_asyncio.apply()  # allow nested event loops (e.g. inside a notebook)

# TextCleaner, DEFAULT_SUMMARY_PROMPT_TMPL and VoyageEmbeddModel are defined elsewhere (not shown).
pipeline = IngestionPipeline(
    transformations=[
        TextCleaner(),
        SummaryExtractor(prompt_template=DEFAULT_SUMMARY_PROMPT_TMPL),
        VoyageEmbeddModel,
    ],
)

CvFirstChunk = pipeline.run(documents=first_node)
time.sleep(90)
CvSecondChunk = pipeline.run(documents=second_node)
time.sleep(90)
CvThirdChunk = pipeline.run(documents=third_node)
time.sleep(90)
CvFourthChunk = pipeline.run(documents=fourth_node)
time.sleep(90)
CvFifthChunk = pipeline.run(documents=fifth_node)
time.sleep(90)
CvSixthChunk = pipeline.run(documents=sixth_node)
nodes = (
    CvFirstChunk + CvSecondChunk + CvThirdChunk
    + CvFourthChunk + CvFifthChunk + CvSixthChunk
)

As you can see, I have a total of 113 nodes.

Now I am trying to create an index using this piece of code:

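from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    nodes,
    storage_context=storage_context,
    embed_model=EmbeddModel,
)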

However, my Qdrant collection shows 572 points.
Can anyone please help me understand why it shows 572 points? I have also tried Pinecone and Milvus, and all of them show 572 points.

@dosu · 2024-07-15T06:07:34Z

dosubot[bot]
bot Jul 15, 2024

The discrepancy in the number of points in your vector database (572 points instead of 113) could be due to the way the ingestion pipeline and vector store are handling the data. Specifically, it seems that the nodes_dict and doc_id_dict in your index structure are not aligned with the expected number of points.

From the provided document, the nodes_dict contains 6 nodes, and the doc_id_dict maps a single document ID to these 6 nodes. However, the embeddings data provided seems to be extensive and might be contributing to the higher count of points.

Here is a snippet of the relevant part of the index structure:

{
  "index_struct": {
    "__type__": "simple_dict",
    "__data__": {
      "index_id": "9343c0c6-8313-4549-a423-1348960b6958",
      "summary": null,
      "nodes_dict": {
        "ddb890e7-82d5-4dc5-a950-b498c3ef2494": "ddb890e7-82d5-4dc5-a950-b498c3ef2494",
        "5a03268d-b654-48a4-b3ea-b9620a5ec614": "5a03268d-b654-48a4-b3ea-b9620a5ec614",
        "8d29350a-a932-4665-b436-a810e66def48": "8d29350a-a932-4665-b436-a810e66def48",
        "6b151549-3bcc-4e11-a181-c1ac4dd34c8a": "6b151549-3bcc-4e11-a181-c1ac4dd34c8a",
        "e2ac2af1-b59f-4c83-b365-f0205ba1f732": "e2ac2af1-b59f-4c83-b365-f0205ba1f732",
        "307b2feb-5b0a-4b86-aec7-994074c94d29": "307b2feb-5b0a-4b86-aec7-994074c94d29"
      },
      "doc_id_dict": {
        "707af1fd-62af-4693-93e1-01433b900aca": [
          "ddb890e7-82d5-4dc5-a950-b498c3ef2494",
          "5a03268d-b654-48a4-b3ea-b9620a5ec614",
          "8d29350a-a932-4665-b436-a810e66def48",
          "6b151549-3bcc-4e11-a181-c1ac4dd34c8a",
          "e2ac2af1-b59f-4c83-b365-f0205ba1f732",
          "307b2feb-5b0a-4b86-aec7-994074c94d29"
        ]
      },
      "embeddings_dict": {}
    }
  }
}
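The counts described above can be read straight out of a persisted index structure. The following is a minimal sketch, assuming the JSON layout shown in the snippet (index_struct, then __data__, then nodes_dict and doc_id_dict) and assuming that the keys of nodes_dict are the IDs listed in doc_id_dict. The helper count_index_entries is hypothetical and not part of LlamaIndex.

```python
import json
from typing import Dict


def count_index_entries(index_json: str) -> Dict[str, int]:
    """Count nodes and document references in a persisted index structure.

    Hypothetical helper. Assumes the layout shown above:
    index_struct -> __data__ -> nodes_dict / doc_id_dict.
    """
    data = json.loads(index_json)["index_struct"]["__data__"]
    nodes_dict = data.get("nodes_dict", {})
    doc_id_dict = data.get("doc_id_dict", {})

    # Every node ID that at least one source document points to.
    referenced = {node_id for ids in doc_id_dict.values() for node_id in ids}

    return {
        "nodes": len(nodes_dict),
        "documents": len(doc_id_dict),
        "referenced_nodes": len(referenced),
        "unreferenced_nodes": len(set(nodes_dict) - referenced),
    }
```

Applied to the structure shown above, this would report 6 nodes under 1 document, which can then be compared with the point count reported by the vector database.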

To resolve this issue, you should verify the following:

Ensure that the ingestion pipeline correctly maps the documents to the nodes.
Check if the embeddings are being duplicated or incorrectly assigned during the ingestion process.
Validate the consistency between the number of nodes and the embeddings being stored.
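The checks listed above could be run on the node list before indexing. This is a rough sketch rather than an official utility: summarize_nodes is a hypothetical helper, and it only assumes that each node exposes node_id and embedding attributes, as LlamaIndex TextNode objects do.

```python
from collections import Counter
from typing import Any, Dict, Sequence


def summarize_nodes(nodes: Sequence[Any], expected_count: int) -> Dict[str, Any]:
    """Report basic consistency checks for a list of node-like objects.

    Hypothetical helper. Each object is assumed to expose `node_id`
    and `embedding` attributes.
    """
    id_counts = Counter(node.node_id for node in nodes)
    # IDs that occur more than once would indicate duplicated nodes.
    duplicate_ids = sorted(nid for nid, count in id_counts.items() if count > 1)
    # Nodes without an embedding would have to be embedded again at index time.
    missing_embeddings = [node.node_id for node in nodes if node.embedding is None]

    return {
        "total": len(nodes),
        "matches_expected": len(nodes) == expected_count,
        "duplicate_ids": duplicate_ids,
        "missing_embeddings": missing_embeddings,
    }
```

For the pipeline in the question, calling summarize_nodes(nodes, expected_count=113) before building the index would show whether the extra points already exist in the node list or only appear once the index is created.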

Additionally, based on the new context from the LlamaIndex integration tests, it is important to ensure that the ingestion pipeline is correctly transforming documents into nodes with embeddings. The test cases provided in the document show how to validate the ingestion pipeline and vector store operations:

from typing import List

from llama_index.core.schema import TextNode


def test_nodes(nodes: List[TextNode]) -> None:
    """Test Ingestion Pipeline transforming documents into nodes with embeddings."""
    assert isinstance(nodes, list)
    assert isinstance(nodes[0], TextNode)

Make sure that your ingestion pipeline is correctly transforming documents into nodes and that the nodes are being correctly added to the vector store. This will help ensure that the number of points in your vector database matches the expected count [1][2].
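One possible way to add the already-transformed nodes to the vector store unchanged is to pass them to the VectorStoreIndex constructor instead of VectorStoreIndex.from_documents. from_documents applies the index's transformations (by default a node parser) to its input, so nodes passed to it may be split into more chunks. Whether that accounts for the 572 points in this thread is an assumption and is not confirmed in the discussion. In the sketch below, build_index_from_nodes is a hypothetical wrapper, and vector_store and embed_model stand for objects created elsewhere.

```python
from typing import Any, Sequence


def build_index_from_nodes(
    nodes: Sequence[Any], vector_store: Any, embed_model: Any
) -> Any:
    """Index pre-built nodes as-is (hypothetical wrapper, not a LlamaIndex API)."""
    # Imported inside the function so the sketch can be loaded without
    # llama_index installed; in real code, import at module top level.
    from llama_index.core import StorageContext, VectorStoreIndex

    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    # Passing `nodes=` skips the document-to-node parsing step that
    # `from_documents` performs on its input.
    return VectorStoreIndex(
        nodes=nodes,
        storage_context=storage_context,
        embed_model=embed_model,
    )
```

With the objects from the question, this would be called as build_index_from_nodes(nodes, vector_store, EmbeddModel), after which the point count in the vector database can be compared with len(nodes).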
