Unpredictable response time with query method in LlamaIndex #14680
-
Hello, I am experiencing an issue with response-time stability when using the query method. My workflow is as follows (sketched below). Sometimes I receive responses in 15-20 seconds, while other times it takes over 60 seconds. Has anyone else encountered this problem and found a way to resolve it? I am trying to understand why I get varying response times for the exact same input. Best regards,
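A minimal sketch of the kind of setup involved, with timing around repeated identical queries (the data path and question here are illustrative, assuming a plain `VectorStoreIndex`):

```python
import time
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Illustrative setup; the real index and engine configuration may differ.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# The same question, repeated, to expose the latency variance.
for i in range(5):
    start = time.perf_counter()
    response = query_engine.query("What does the document say about X?")
    print(f"run {i}: {time.perf_counter() - start:.1f}s")
```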
-
Hey @norbertkocon94, 🌟 I'm here to assist you with any bugs, questions, or contributions! Let's tackle this together. The varying response times you are experiencing with the query method in LlamaIndex could be due to several factors, including the configuration of your query engine and the performance of the underlying OpenAI API.
Here is an example configuration that you can use as a reference:

```python
from llama_index import ServiceContext, LLMPredictor, PromptHelper
from langchain.chat_models import ChatOpenAI

# Prompt sizing: total input token budget, reserved output tokens, chunk overlap.
max_input_size = 4000
num_output = 2000
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

# A fixed, low temperature keeps the model's behaviour more predictable.
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.3, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, prompt_helper=prompt_helper
)

query_configs = [
    {
        "index_struct_type": "default",
        "query_mode": "default",
        "query_kwargs": {
            "response_mode": "tree_summarize",
            "similarity_top_k": 1,  # fewer retrieved chunks -> fewer LLM calls
            "verbose": False,
            "text_qa_template": QA_PROMPT,            # user-defined QA prompt template
            "refine_template": CUSTOM_REFINE_PROMPT,  # user-defined refine template
            "service_context": service_context,
        }
    },
]
```
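Note that `LLMPredictor`, `PromptHelper`, and `ServiceContext` belong to older LlamaIndex releases; in 0.10+ the equivalent configuration goes through the global `Settings` object. A minimal sketch, carrying over the model name and temperature from the example above:

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Same model and temperature as the legacy example above.
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
Settings.num_output = 2000
Settings.chunk_overlap = 20
```

With these globals set, any index built afterwards (e.g. `VectorStoreIndex.from_documents(...)`) picks them up without an explicit service context.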
Given your project requirements for fast and accurate output, the best-suited response mode in LlamaIndex is `ResponseMode.COMPACT`. The `ResponseMode.COMPACT` mode uses the `CompactAndRefine` synthesizer, which is designed to balance speed and accuracy by refining the response over compacted text chunks, packing as much retrieved text as fits into each prompt and thereby reducing the number of LLM calls.
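For reference, a hedged sketch of selecting this mode on a query engine (assuming an `index` already built as above):

```python
from llama_index.core.response_synthesizers import ResponseMode

# "compact" stuffs as many retrieved chunks as fit into each LLM call,
# so it usually needs fewer round-trips than "tree_summarize" and has
# lower, more consistent latency.
query_engine = index.as_query_engine(
    response_mode=ResponseMode.COMPACT,  # or simply response_mode="compact"
    similarity_top_k=1,
)
response = query_engine.query("What does the document say about X?")
print(response)
```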