Integrating pandas pipeline qp.run Components with RAGStringQueryEngine #14604

mraguth · 2024-07-06T07:10:34Z


Instead of using the Pandas Query pipeline qp.run, I am using query_engine = RAGStringQueryEngine as shown below. I hope I am utilizing all the components of qp.run or Pandas query pipeline. However, I am wondering if it is possible to use the components from qp.run directly within my RAGStringQueryEngine. It seems like I might be missing components such as add_link or llm2. Do you have any suggestions on how to integrate these components in a meaningful way?

https://docs.llamaindex.ai/en/stable/examples/pipeline/query_pipeline_pandas/ (reference to Pandas query pipeline)

My custom query engine setup is below. I would appreciate suggestions for including all of the pandas pipeline components:

query_engine = RAGStringQueryEngine(
    retriever=recursive_retriever,
    response_synthesizer=response_synthesizer,
    pandas_prompt=pandas_prompt,
    pandas_output_parser=pandas_output_parser,
    llm=llm,
    qa_prompt=qa_prompt,
    synthesize_response=True,
    response_synthesis_prompt=response_synthesis_prompt,
    callback_manager=Settings.callback_manager,  # Important for Langfuse.
)

This is the Pandas query pipeline from the link above:

from llama_index.core.query_pipeline import (
    QueryPipeline as QP,
    Link,
    InputComponent,
)
from llama_index.experimental.query_engine.pandas import (
    PandasInstructionParser,
)
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate


instruction_str = (
    "1. Convert the query to executable Python code using Pandas.\n"
    "2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
    "3. The code should represent a solution to the query.\n"
    "4. PRINT ONLY THE EXPRESSION.\n"
    "5. Do not quote the expression.\n"
)

pandas_prompt_str = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)
response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)

pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(
    instruction_str=instruction_str, df_str=df.head(5)
)
pandas_output_parser = PandasInstructionParser(df)
response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)

qp = QP(
    modules={
        "input": InputComponent(),
        "pandas_prompt": pandas_prompt,
        "llm1": llm,
        "pandas_output_parser": pandas_output_parser,
        "response_synthesis_prompt": response_synthesis_prompt,
        "llm2": llm,
    },
    verbose=True,
)
qp.add_chain(["input", "pandas_prompt", "llm1", "pandas_output_parser"])
qp.add_links(
    [
        Link("input", "response_synthesis_prompt", dest_key="query_str"),
        Link(
            "llm1", "response_synthesis_prompt", dest_key="pandas_instructions"
        ),
        Link(
            "pandas_output_parser",
            "response_synthesis_prompt",
            dest_key="pandas_output",
        ),
    ]
)
# add link from response synthesis prompt to llm2
qp.add_link("response_synthesis_prompt", "llm2")

response = qp.run(
    query_str="What is the correlation between survival and age?",
)


@dosu · 2024-07-06T07:38:39Z


To integrate the missing components such as add_link or llm2 from the Pandas Query Pipeline into your RAGStringQueryEngine, you can follow the example provided for creating a full RAG pipeline as a Directed Acyclic Graph (DAG). Here’s how you can do it:

1. Define the Modules: First, define the necessary modules such as the LLM, prompt templates, retriever, reranker, and summarizer.
2. Add Modules to the Pipeline: Use the add_modules method to add these modules to your pipeline.
3. Link the Modules: Use the add_link method to define the connections between these modules.

Here is a code snippet that demonstrates this process:

from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.response_synthesizers import TreeSummarize
from llama_index.core import PromptTemplate
from llama_index.core.query_pipeline import QueryPipeline, InputComponent
from llama_index.llms.openai import OpenAI

# Define modules (assumes `index` is an existing vector index built elsewhere)
prompt_str = "Please generate a question about Paul Graham's life regarding the following topic {topic}"
prompt_tmpl = PromptTemplate(prompt_str)
llm = OpenAI(model="gpt-3.5-turbo")
retriever = index.as_retriever(similarity_top_k=3)
reranker = CohereRerank()
summarizer = TreeSummarize(llm=llm)

# Define query pipeline
p = QueryPipeline(verbose=True)
p.add_modules(
    {
        "input": InputComponent(),  # registered so the "input" link below resolves
        "llm": llm,
        "prompt_tmpl": prompt_tmpl,
        "retriever": retriever,
        "summarizer": summarizer,
        "reranker": reranker,
    }
)

# Add links between modules
p.add_link("input", "prompt_tmpl")
p.add_link("prompt_tmpl", "llm")
p.add_link("llm", "retriever")
p.add_link("retriever", "reranker", dest_key="nodes")
p.add_link("llm", "reranker", dest_key="query_str")
p.add_link("reranker", "summarizer", dest_key="nodes")
p.add_link("llm", "summarizer", dest_key="query_str")

# Run the pipeline
response = p.run(topic="YC")
print(str(response))

This code sets up a full RAG pipeline with query rewriting, retrieval, reranking, and response synthesis, and connects the modules with the add_link method. The input component is linked to the prompt_tmpl module, so data flows from the input through to the final summarization.
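To make the role of add_link and dest_key more concrete, here is a minimal toy sketch of a DAG runner in plain Python. It is not LlamaIndex's implementation: ToyPipeline, the default dest_key of "value", and the stub lambdas are all assumptions made for illustration. It only shows how a link routes one module's output into a named argument of another module.

```python
from collections import defaultdict
from graphlib import TopologicalSorter


class ToyPipeline:
    """Toy DAG runner for illustration only; not LlamaIndex's QueryPipeline."""

    def __init__(self):
        self.modules = {}               # name -> callable taking keyword args
        self.links = defaultdict(list)  # dest name -> [(src name, dest_key)]

    def add_modules(self, modules):
        self.modules.update(modules)

    def add_link(self, src, dest, dest_key="value"):
        # Route the output of `src` into the `dest_key` argument of `dest`.
        self.links[dest].append((src, dest_key))

    def run(self, **kwargs):
        # Each module depends on the modules that link into it.
        deps = {name: {src for src, _ in self.links[name]} for name in self.modules}
        sources = {src for pairs in self.links.values() for src, _ in pairs}
        outputs = {}
        for name in TopologicalSorter(deps).static_order():
            incoming = self.links[name]
            if incoming:
                inputs = {key: outputs[src] for src, key in incoming}
            else:
                inputs = kwargs  # root modules receive the run() arguments
            outputs[name] = self.modules[name](**inputs)
        # Return the output of the single module that feeds nothing else.
        sinks = [name for name in self.modules if name not in sources]
        return outputs[sinks[0]]


p = ToyPipeline()
p.add_modules(
    {
        "prompt_tmpl": lambda topic: f"Question about {topic}",
        "llm": lambda value: value.upper(),               # stub "LLM"
        "retriever": lambda value: [f"doc for {value}"],  # stub retriever
        "summarizer": lambda nodes, query_str: f"{query_str} -> {len(nodes)} node(s)",
    }
)
p.add_link("prompt_tmpl", "llm")
p.add_link("llm", "retriever")
p.add_link("retriever", "summarizer", dest_key="nodes")
p.add_link("llm", "summarizer", dest_key="query_str")

result = p.run(topic="YC")
print(result)
```

The summarizer receives two inputs from two different upstream modules, which is the kind of wiring that a purely sequential chain cannot express.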

Additionally, you can refer to the QueryPipelineQueryEngine class in the llama-index-cli/llama_index/cli/rag/base.py file for more advanced configurations and custom query handling:

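from llama_index.core.base.response.schema import RESPONSE_TYPE
from llama_index.core.bridge.pydantic import Field
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.query_pipeline import QueryPipeline


class QueryPipelineQueryEngine(CustomQueryEngine):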
    query_pipeline: QueryPipeline = Field(
        description="Query Pipeline to use for Q&A.",
    )

    def custom_query(self, query_str: str) -> RESPONSE_TYPE:
        return self.query_pipeline.run(query_str=query_str)

    async def acustom_query(self, query_str: str) -> RESPONSE_TYPE:
        return await self.query_pipeline.arun(query_str=query_str)

This class demonstrates how to integrate a QueryPipeline into a custom query engine, allowing for synchronous and asynchronous query handling.
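As a rough illustration of that delegation pattern, the sketch below uses plain Python stand-ins instead of the real LlamaIndex classes. StubPipeline and PipelineBackedEngine are hypothetical names; in the original question, the wrapped object would presumably be the pandas pipeline qp.

```python
import asyncio


class StubPipeline:
    """Hypothetical stand-in that exposes run/arun like the pipeline above."""

    def run(self, query_str: str) -> str:
        return f"answer to: {query_str}"

    async def arun(self, query_str: str) -> str:
        return self.run(query_str=query_str)


class PipelineBackedEngine:
    """Query-engine-style wrapper that delegates every query to a pipeline."""

    def __init__(self, query_pipeline):
        self.query_pipeline = query_pipeline

    def query(self, query_str: str) -> str:
        return self.query_pipeline.run(query_str=query_str)

    async def aquery(self, query_str: str) -> str:
        return await self.query_pipeline.arun(query_str=query_str)


question = "What is the correlation between survival and age?"
engine = PipelineBackedEngine(StubPipeline())
sync_answer = engine.query(question)
async_answer = asyncio.run(engine.aquery(question))
print(sync_answer)
```

With this shape, all of the pipeline's links and modules (including a second LLM step) stay inside the pipeline, and the engine only forwards the query string.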

For a more detailed example, you can refer to the RAG Fusion Query Pipeline setup in the LlamaIndex documentation. Here is a snippet from the example:

# construct query pipeline
p = QueryPipeline()
module_dict = {
    **self.retrievers,
    "input": InputComponent(),
    "summarizer": TreeSummarize(),
    "join": ArgPackComponent(),
    "reranker": rerank_component,
}
p.add_modules(module_dict)
# add links from input to retriever (id'ed by chunk_size)
for chunk_size in self.chunk_sizes:
    p.add_link("input", str(chunk_size))
    p.add_link(str(chunk_size), "join", dest_key=str(chunk_size))
p.add_link("join", "reranker")
p.add_link("input", "summarizer", dest_key="query_str")
p.add_link("reranker", "summarizer", dest_key="nodes")

This example shows how to set up a query pipeline with multiple retrievers, a join component, and a reranker, and how to link these components together. You can visualize the DAG using the pyvis library to ensure the pipeline is correctly configured ^[1]^[2]^[3]^[4].
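The fan-out and join structure in that snippet can be sketched with plain functions. Everything here is a simplified assumption: make_retriever, join, rerank, and summarize are hypothetical stand-ins, the scores are made up for the example, and the real ArgPackComponent and reranker may behave differently.

```python
def make_retriever(chunk_size):
    # Hypothetical retriever: returns (score, text) pairs tagged by chunk size.
    def retrieve(query):
        return [(1.0 / chunk_size, f"{query} @ chunk_size={chunk_size}")]

    return retrieve


chunk_sizes = [128, 256, 512]
retrievers = {str(size): make_retriever(size) for size in chunk_sizes}


def join(**results_by_key):
    # Simplified join: flatten the per-retriever result lists into one list.
    return [node for nodes in results_by_key.values() for node in nodes]


def rerank(nodes, top_n=2):
    # Keep the top_n highest-scoring nodes.
    return sorted(nodes, key=lambda node: node[0], reverse=True)[:top_n]


def summarize(query_str, nodes):
    return f"{query_str}: " + "; ".join(text for _, text in nodes)


query = "YC"
# Fan out: the same input goes to every retriever, keyed by chunk size.
joined = join(**{key: retrieve(query) for key, retrieve in retrievers.items()})
top_nodes = rerank(joined)
# The summarizer gets the raw query plus the reranked nodes.
summary = summarize(query_str=query, nodes=top_nodes)
print(summary)
```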

To continue talking to Dosu, mention @dosu.


mraguth · 2024-07-08T00:47:20Z


@dosu So, the pipeline and query engine are essentially based on the same concept and can use the same components, correct? In the pipeline, add_link connects the modules, whereas in the query engine, all modules perform their tasks without explicit connections. What is the real advantage of using a pipeline then? Am I thinking right? Please help with clarification.


dosubot[bot] bot Jul 8, 2024

Yes, you are correct that both the pipeline and the query engine can use the same components. The key difference lies in how these components are connected and managed.

Pipeline:

Explicit Connections: The add_link method is used to explicitly connect modules, allowing for complex workflows and dependencies to be managed within the pipeline.
Advantages:
- Fewer Lines of Code: Express common workflows with fewer lines of code and boilerplate.
- Readability: Greater readability due to the declarative nature of the pipeline.
- Integration: Better integration points with low-code/no-code solutions.
- Portability: Easy serializability of pipeline components, providing portability and easier deployment to different systems.
- Observability: Propagates callbacks throughout all sub-modules, integrating with observability partners for better monitoring and debugging.

Query Engine:

Implicit Connections: Operates without explicit connections between components. The flow of data is managed dynamically.
Advantages:
- Flexibility: Allows for more flexible and optimized query handling.
- Dynamic Execution: Handles the execution flow dynamically based on the configured QueryPipeline.

Here is a code snippet demonstrating the use of QueryPipeline:

from llama_index.core.query_pipeline import QueryPipeline

# sequential chain
p = QueryPipeline(chain=[prompt_tmpl, llm], verbose=True)

# DAG
p = QueryPipeline(verbose=True)
p.add_modules({"prompt_tmpl": prompt_tmpl, "llm": llm})
p.add_link("prompt_tmpl", "llm")

# run pipeline
p.run(prompt_key1="<input1>", ...)

In summary, the real advantage of using a pipeline over a query engine is the ability to express workflows more concisely and readably, with better integration and future-proofing through easy serializability and deployment ^[1]^[2]^[3]^[4].
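For comparison, here is a hedged sketch of how the pandas pipeline from the original question might look when written imperatively, the way a custom query engine method could do it. The function name run_pandas_flow, the stub LLM, the stub parser, the shortened prompt strings, and the column names in the stub expression are all hypothetical. The point is only that each add_link in the pipeline becomes an explicit variable passed between calls, including the second LLM call ("llm2").

```python
def run_pandas_flow(query_str, llm, run_instructions, pandas_prompt, synthesis_prompt):
    # "llm1": turn the question into pandas instructions.
    pandas_instructions = llm(pandas_prompt.format(query_str=query_str))
    # "pandas_output_parser": execute the instructions against the dataframe.
    pandas_output = run_instructions(pandas_instructions)
    # "llm2": synthesize the answer from the three values that the
    # pipeline version wires in with Link(..., dest_key=...).
    return llm(
        synthesis_prompt.format(
            query_str=query_str,
            pandas_instructions=pandas_instructions,
            pandas_output=pandas_output,
        )
    )


def fake_llm(prompt):
    # Hypothetical stub in place of a real model call.
    if prompt.startswith("Expression for:"):
        return "df['a'].corr(df['b'])"  # made-up expression and column names
    return "SYNTHESIZED<" + prompt + ">"


def fake_parser(instructions):
    # Hypothetical stub; a real parser would evaluate the expression.
    return f"result of {instructions}"


answer = run_pandas_flow(
    "What is the correlation between survival and age?",
    llm=fake_llm,
    run_instructions=fake_parser,
    pandas_prompt="Expression for: {query_str}",
    synthesis_prompt=(
        "Query: {query_str}\n"
        "Pandas Instructions: {pandas_instructions}\n"
        "Pandas Output: {pandas_output}\n"
        "Response: "
    ),
)
print(answer)
```

Both styles use the same components; the pipeline declares the wiring up front, while the imperative version hides it inside the method body.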


Answer selected by mraguth

mraguth Jul 8, 2024
Author

@dosu Thank you so much...

dosubot[bot] bot Jul 8, 2024

You're welcome! If you have any more questions or need further assistance, feel free to ask. Happy coding!

