[BUG] VectorRetrieval fails with KeyError #759

@HarshitMeda

Description

When querying documents, I get a KeyError. Error message:

File "/Users/harshitm/Developer/Kotaemon/libs/kotaemon/kotaemon/indices/vectorindex.py", line 289, in run
    text_doc = text_thumbnail_docs[thumbnail_doc.doc_id]
KeyError: 'fbe551c2-48ea-44d9-a60f-aab2fa10d89a'
User-id: e2d846534c594c4a95fd9705ca6f4c2b, can see public conversations: True

On trying to debug, this is what I found:

The IDs in the set thumbnail_doc_ids do not match the IDs of the retrieved docs in linked_thumbnail_docs.
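To make the failure mode concrete, here is a minimal sketch of the mismatch, assuming text_thumbnail_docs is a dict keyed by the IDs in thumbnail_doc_ids (which is what the debug output suggests). The `Doc` class and the `pair_thumbnails` helper are hypothetical stand-ins for illustration, not Kotaemon code; the IDs are copied from the logs in this issue. It shows that a plain `[...]` lookup would raise, and how a tolerant lookup could report the unmatched IDs instead.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document; only doc_id matters here."""
    doc_id: str
    text: str = ""


def pair_thumbnails(text_thumbnail_docs, linked_thumbnail_docs):
    """Pair each thumbnail with its text doc and collect IDs that have no match.

    Uses dict.get instead of [] so an unknown doc_id is reported, not raised.
    """
    pairs, missing = [], []
    for thumbnail_doc in linked_thumbnail_docs:
        text_doc = text_thumbnail_docs.get(thumbnail_doc.doc_id)
        if text_doc is None:
            missing.append(thumbnail_doc.doc_id)
            continue
        pairs.append((text_doc, thumbnail_doc))
    return pairs, missing


# IDs requested from the docstore (from the "thumbnail_doc_ids" log line)
thumbnail_doc_ids = {
    "a4ab8eb1-080b-4828-9924-ad0a55ecc0d4",
    "22c3486b-4a02-4eae-999c-cc1b2d67066f",
    "901db26a-69ee-4ae4-8416-b28d4c397725",
}
# Assumption: the lookup table is keyed by those requested IDs
text_thumbnail_docs = {doc_id: Doc(doc_id, "text chunk") for doc_id in thumbnail_doc_ids}

# IDs actually carried by the returned docs (from the "linked_thumbnail_doc" log lines)
linked_thumbnail_docs = [
    Doc("33332565-498c-4da5-94fa-cec84e38a411"),
    Doc("ae6a1a3e-fba3-404e-aa72-85e54d00f1c3"),
    Doc("4ea898a7-0f50-4601-a929-c67d6ef495bb"),
]

pairs, missing = pair_thumbnails(text_thumbnail_docs, linked_thumbnail_docs)
print(f"matched: {len(pairs)}, unmatched IDs: {missing}")
```

With the IDs from my logs, none of the returned docs match a requested ID, so every lookup misses. A guard like this would only hide the crash; the underlying question is still why the docstore returns documents whose doc_id differs from the IDs that were requested.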

This is the document I uploaded:

Eicher Q4FY25 Results.pdf

Reproduction steps

1. Set the default LLM and embeddings model to Cohere
2. Go to 'chat section'
3. Upload file
4. In file collection, select the uploaded file
5. Query the document
6. See error message

Screenshots

<img width="1442" alt="Image" src="https://github.com/user-attachments/assets/899f75f3-1a42-42d9-be08-523eafb8d595" />

Logs

Session reasoning type None use mindmap True use citation highlight language en
Session LLM 
Reasoning class <class 'ktem.reasoning.simple.FullQAPipeline'>
Reasoning state {'app': {'regen': False}, 'pipeline': {}}
Thinking ...
Retrievers [DocumentRetrievalPipeline(DS=<kotaemon.storages.docstores.lancedb.LanceDBDocumentStore object at 0x30a860100>, FSPath=PosixPath('/Users/harshitm/Developer/Kotaemon/ktem_app_data/user_data/files/index_1'), Index=<class 'ktem.index.file.index.IndexTable'>, Source=<class 'ktem.index.file.index.Source'>, VS=<kotaemon.storages.vectorstores.chroma.ChromaVectorStore object at 0x30a8608e0>, get_extra_table=True, llm_scorer=LLMTrulensScoring(concurrent=True, normalize=10, prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x31877c160>, system_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x31877cdc0>, top_k=3, user_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x31877e020>), mmr=False, rerankers=[], retrieval_mode='hybrid', top_k=10, user_id='e2d846534c594c4a95fd9705ca6f4c2b'), GraphRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x104bfa9e0>, FSPath=<theflow.base.unset_ object at 0x104bfa9e0>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x104bfa9e0>, VS=<theflow.base.unset_ object at 0x104bfa9e0>, file_ids=[], user_id=<theflow.base.unset_ object at 0x104bfa9e0>), LightRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x104bfa9e0>, FSPath=<theflow.base.unset_ object at 0x104bfa9e0>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x104bfa9e0>, VS=<theflow.base.unset_ object at 0x104bfa9e0>, file_ids=[], search_type='local', user_id=<theflow.base.unset_ object at 0x104bfa9e0>)]
searching in doc_ids ['e5948fff-470c-40c5-8910-92e9cbaaedae']
retrieval_kwargs: dict_keys(['do_extend', 'scope', 'filters'])
Harshit thumbnail_count: 3
Number of requested results 100 is greater than number of elements in index 44, updating n_results = 44
Got 44 from vectorstore
Got 27 from docstore
Got raw 10 retrieved documents
Harshit thumbnail_doc_ids: {'a4ab8eb1-080b-4828-9924-ad0a55ecc0d4', '22c3486b-4a02-4eae-999c-cc1b2d67066f', '901db26a-69ee-4ae4-8416-b28d4c397725'}
Harshit linked_thumbnail_doc: 33332565-498c-4da5-94fa-cec84e38a411
Harshit linked_thumbnail_doc: ae6a1a3e-fba3-404e-aa72-85e54d00f1c3
Harshit linked_thumbnail_doc: 4ea898a7-0f50-4601-a929-c67d6ef495bb
thumbnail docs 3 non-thumbnail docs 7 raw-thumbnail docs 0
Traceback (most recent call last):
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/gradio/queueing.py", line 575, in process_events
    response = await route_utils.call_process_api(
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/gradio/route_utils.py", line 276, in call_process_api
    output = await app.get_blocks().process_api(
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/gradio/blocks.py", line 1923, in process_api
    result = await self.call_function(
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/gradio/blocks.py", line 1520, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/gradio/utils.py", line 663, in async_iteration
    return await iterator.__anext__()
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/gradio/utils.py", line 656, in __anext__
    return await anyio.to_thread.run_sync(
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2470, in run_sync_in_worker_thread
    return await future
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 967, in run
    result = context.run(func, *args)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/gradio/utils.py", line 639, in run_sync_iterator_async
    return next(iterator)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/gradio/utils.py", line 801, in gen_wrapper
    response = next(iterator)
  File "/Users/harshitm/Developer/Kotaemon/libs/ktem/ktem/pages/chat/__init__.py", line 1321, in chat_fn
    for response in pipeline.stream(chat_input, conversation_id, chat_history):
  File "/Users/harshitm/Developer/Kotaemon/libs/ktem/ktem/reasoning/simple.py", line 291, in stream
    docs, infos = self.retrieve(message, history)
  File "/Users/harshitm/Developer/Kotaemon/libs/ktem/ktem/reasoning/simple.py", line 132, in retrieve
    retriever_docs = retriever_node(text=query)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/base.py", line 1097, in __call__
    raise e from None
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/base.py", line 1088, in __call__
    output = self.fl.exec(func, args, kwargs)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/backends/base.py", line 151, in exec
    return run(*args, **kwargs)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/middleware.py", line 144, in __call__
    raise e from None
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/middleware.py", line 141, in __call__
    _output = self.next_call(*args, **kwargs)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/middleware.py", line 117, in __call__
    return self.next_call(*args, **kwargs)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/base.py", line 1017, in _runx
    return self.run(*args, **kwargs)
  File "/Users/harshitm/Developer/Kotaemon/libs/ktem/ktem/index/file/pipelines.py", line 175, in run
    docs = self.vector_retrieval(text=text, top_k=self.top_k, **retrieval_kwargs)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/base.py", line 1261, in exec
    return child(*args, **kwargs, __fl_runstates__=__fl_runstates__)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/base.py", line 1097, in __call__
    raise e from None
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/base.py", line 1088, in __call__
    output = self.fl.exec(func, args, kwargs)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/backends/base.py", line 151, in exec
    return run(*args, **kwargs)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/middleware.py", line 144, in __call__
    raise e from None
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/middleware.py", line 141, in __call__
    _output = self.next_call(*args, **kwargs)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/middleware.py", line 117, in __call__
    return self.next_call(*args, **kwargs)
  File "/Users/harshitm/Developer/Kotaemon/install_dir/env/lib/python3.10/site-packages/theflow/base.py", line 1017, in _runx
    return self.run(*args, **kwargs)
  File "/Users/harshitm/Developer/Kotaemon/libs/kotaemon/kotaemon/indices/vectorindex.py", line 289, in run
    text_doc = text_thumbnail_docs[thumbnail_doc.doc_id]
KeyError: '33332565-498c-4da5-94fa-cec84e38a411'

Browsers

No response

OS

No response

Additional information

No response
