Unpredictable response time with query method in LlamaIndex #14680
-
Hello, I am experiencing an issue with response-time stability when using the query method. My workflow is as follows (sketched below). Sometimes I receive responses in 15-20 seconds, while other times it takes over 60 seconds. Has anyone else encountered this problem and found a way to resolve it? I am trying to understand why I get varying response times for the exact same input. Best regards,
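A minimal sketch of the kind of setup involved, with timing around repeated identical queries (the data path and question here are illustrative, assuming a plain `VectorStoreIndex`):

```python
import time
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Illustrative setup; the real index and engine configuration may differ.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# The same question, repeated, to expose the latency variance.
for i in range(5):
    start = time.perf_counter()
    response = query_engine.query("What does the document say about X?")
    print(f"run {i}: {time.perf_counter() - start:.1f}s")
```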
-
Hey @norbertkocon94, 🌟 I'm here to assist you with any bugs, questions, or contributions! Let's tackle this together. The varying response times you are experiencing with the query method in LlamaIndex could be due to several factors, including the configuration of your query engine and the performance of the underlying OpenAI API.
Here is an example configuration that you can use as a reference:

```python
from llama_index import ServiceContext, LLMPredictor, PromptHelper
from langchain.chat_models import ChatOpenAI

# Prompt sizing: total input token budget, reserved output tokens, chunk overlap.
max_input_size = 4000
num_output = 2000
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

# A fixed, low temperature keeps the model's behaviour more predictable.
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.3, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, prompt_helper=prompt_helper
)

query_configs = [
    {
        "index_struct_type": "default",
        "query_mode": "default",
        "query_kwargs": {
            "response_mode": "tree_summarize",
            "similarity_top_k": 1,  # fewer retrieved chunks -> fewer LLM calls
            "verbose": False,
            "text_qa_template": QA_PROMPT,            # user-defined QA prompt template
            "refine_template": CUSTOM_REFINE_PROMPT,  # user-defined refine template
            "service_context": service_context,
        }
    },
]
```
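Note that `LLMPredictor`, `PromptHelper`, and `ServiceContext` belong to older LlamaIndex releases; in 0.10+ the equivalent configuration goes through the global `Settings` object. A minimal sketch, carrying over the model name and temperature from the example above:

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Same model and temperature as the legacy example above.
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
Settings.num_output = 2000
Settings.chunk_overlap = 20
```

With these globals set, any index built afterwards (e.g. `VectorStoreIndex.from_documents(...)`) picks them up without an explicit service context.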
Given your project requirements for fast and accurate output, the best-suited response mode in LlamaIndex is `ResponseMode.COMPACT`. The `ResponseMode.COMPACT` mode uses the `CompactAndRefine` synthesizer, which is designed to balance speed and accuracy by refining the response over compacted text chunks, packing as much retrieved text as fits into each prompt and thereby reducing the number of LLM calls.
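For reference, a hedged sketch of selecting this mode on a query engine (assuming an `index` already built as above):

```python
from llama_index.core.response_synthesizers import ResponseMode

# "compact" stuffs as many retrieved chunks as fit into each LLM call,
# so it usually needs fewer round-trips than "tree_summarize" and has
# lower, more consistent latency.
query_engine = index.as_query_engine(
    response_mode=ResponseMode.COMPACT,  # or simply response_mode="compact"
    similarity_top_k=1,
)
response = query_engine.query("What does the document say about X?")
print(response)
```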