How can I provide context about my pandas dataset to Llamaindex? #14754

mraguth · 2024-07-14T23:44:51Z

mraguth
Jul 14, 2024

I am using a Pandas pipeline to query my dataframe, and I would like to give LlamaIndex context about my data so that its responses are worded meaningfully. For example, one column is named 'EmpID', but I want responses to call it 'Employee Number'. How and where can I provide context about all of my data attributes, along with a general description of the data? Should it go in the response_prompt? Could you provide some sample code, please?

@dosu · 2024-07-15T00:15:35Z

dosubot[bot]
bot Jul 15, 2024

To give LlamaIndex context about your pandas dataset so that it refers to your data attributes meaningfully, you can write a custom prompt based on the DEFAULT_PANDAS_TMPL template. Here is an example that adds a line telling the model to refer to 'EmpID' as 'Employee Number':

from llama_index.core.prompts import PromptTemplate, PromptType

############################################
# Pandas
############################################

CUSTOM_PANDAS_TMPL = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Please note the following context about the dataframe columns:\n"
    "'EmpID' should be referred to as 'Employee Number'.\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)

CUSTOM_PANDAS_PROMPT = PromptTemplate(
    CUSTOM_PANDAS_TMPL, prompt_type=PromptType.PANDAS
)

The added lines "Please note the following context about the dataframe columns:\n'EmpID' should be referred to as 'Employee Number'.\n" give LlamaIndex the context it needs to refer to 'EmpID' as 'Employee Number' in its responses ^[1].
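When more than one column needs a friendlier name, hard-coding each line into the template gets unwieldy. A minimal sketch, assuming a hypothetical `COLUMN_LABELS` mapping and helper names that are not part of LlamaIndex, generates the context lines from a dict and splices them into the same template text. It uses only the standard library; the resulting string could then be wrapped in `PromptTemplate` in the same way as the hand-written template.

```python
# Hypothetical mapping from real column names to the labels you want
# the model to use. Only 'EmpID' comes from the question; the second
# entry is purely illustrative.
COLUMN_LABELS = {
    "EmpID": "Employee Number",
    "DeptCd": "Department Code",
}


def column_context(labels):
    """Render one instruction line per column.

    Braces are doubled so the text survives a later
    str.format-style substitution of the template.
    """
    lines = [
        f"'{col}' should be referred to as '{label}'."
        for col, label in labels.items()
    ]
    text = "\n".join(lines)
    return text.replace("{", "{{").replace("}", "}}")


def build_pandas_template(labels):
    """Return the pandas prompt text with the column context spliced in.

    The {df_str}, {instruction_str} and {query_str} placeholders are
    left untouched for the prompt template to fill later.
    """
    return (
        "You are working with a pandas dataframe in Python.\n"
        "The name of the dataframe is `df`.\n"
        "This is the result of `print(df.head())`:\n"
        "{df_str}\n\n"
        "Please note the following context about the dataframe columns:\n"
        + column_context(labels)
        + "\n"
        "Follow these instructions:\n"
        "{instruction_str}\n"
        "Query: {query_str}\n\n"
        "Expression:"
    )


CUSTOM_PANDAS_TMPL = build_pandas_template(COLUMN_LABELS)
```

Keeping the labels in a dict means a new column only needs one new entry, and the template text itself stays unchanged.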

To continue talking to Dosu, mention @dosu.

2 replies

mraguth Jul 15, 2024
Author

@dosu What is the purpose of the response_synthesis_prompt and summary_refine parts? Can I describe my dataset in those places? Would it be redundant to describe my dataset in multiple places? I ask because I am using all of these steps in my query engine. Thank you.

dosubot[bot] bot Jul 15, 2024

To use the response_synthesis_prompt or summary_refine parts in LlamaIndex, you can follow the example provided in the test file llama-index-legacy/tests/response_synthesizers/test_refine.py. Here is a simplified example:

Define a Mock Program:

from typing import Any, Dict, Optional, Type, cast
from llama_index.legacy.bridge.pydantic import BaseModel
from llama_index.legacy.response_synthesizers.refine import StructuredRefineResponse
from llama_index.legacy.types import BasePydanticProgram

class MockRefineProgram(BasePydanticProgram):
    def __init__(self, input_to_query_satisfied: Dict[str, bool]):
        self._input_to_query_satisfied = input_to_query_satisfied

    @property
    def output_cls(self) -> Type[BaseModel]:
        return StructuredRefineResponse

    def __call__(self, *args: Any, context_str: Optional[str] = None, context_msg: Optional[str] = None, **kwargs: Any) -> StructuredRefineResponse:
        input_str = context_str or context_msg
        input_str = cast(str, input_str)
        query_satisfied = self._input_to_query_satisfied[input_str]
        return StructuredRefineResponse(answer=input_str, query_satisfied=query_satisfied)

    async def acall(self, *args: Any, context_str: Optional[str] = None, context_msg: Optional[str] = None, **kwargs: Any) -> StructuredRefineResponse:
        input_str = context_str or context_msg
        input_str = cast(str, input_str)
        query_satisfied = self._input_to_query_satisfied[input_str]
        return StructuredRefineResponse(answer=input_str, query_satisfied=query_satisfied)

Create a Service Context:

from llama_index.legacy.service_context import ServiceContext
from llama_index.legacy.callbacks import CallbackManager

def mock_refine_service_context(patch_llm_predictor: Any) -> ServiceContext:
    cb_manager = CallbackManager([])
    return ServiceContext.from_defaults(llm_predictor=patch_llm_predictor, callback_manager=cb_manager)

Instantiate the Refine Class:

from llama_index.legacy.response_synthesizers import Refine

def refine_instance(mock_refine_service_context: ServiceContext) -> Refine:
    return Refine(service_context=mock_refine_service_context, streaming=False, verbose=True, structured_answer_filtering=True)

Use the Refine Instance:

import pytest
from collections import OrderedDict

@pytest.mark.asyncio()
async def test_answer_filtering_one_answer(mock_refine_service_context: ServiceContext) -> None:
    input_to_query_satisfied = OrderedDict([("input1", False), ("input2", True), ("input3", False)])

    def program_factory(*args: Any, **kwargs: Any) -> MockRefineProgram:
        return MockRefineProgram(input_to_query_satisfied)

    refine_instance = Refine(service_context=mock_refine_service_context, structured_answer_filtering=True, program_factory=program_factory)
    res = await refine_instance.aget_response("question", list(input_to_query_satisfied.keys()))
    assert res == "input2"

Additionally, LlamaIndex provides default prompt selectors for various tasks, including refining responses:

from llama_index.core.prompts import SelectorPromptTemplate
# CHAT_REFINE_PROMPT is defined in chat_prompts, not default_prompts.
from llama_index.core.prompts.chat_prompts import CHAT_REFINE_PROMPT
from llama_index.core.prompts.default_prompts import DEFAULT_REFINE_PROMPT
from llama_index.core.prompts.utils import is_chat_model

# Refine: use the chat variant when the LLM is a chat model,
# otherwise fall back to the default refine prompt.
default_refine_conditionals = [(is_chat_model, CHAT_REFINE_PROMPT)]
DEFAULT_REFINE_PROMPT_SEL = SelectorPromptTemplate(
    default_template=DEFAULT_REFINE_PROMPT,
    conditionals=default_refine_conditionals,
)

As for describing your dataset: repeating the description in several prompts can be redundant, and the copies can drift out of sync. It is generally good practice to keep the description in one place and reuse it where needed. Without seeing how you currently describe your dataset, though, it is difficult to give a definitive answer.

In summary, you can use the response_synthesis_prompt or summary_refine parts by following the structure in the test file and the default prompt selectors that LlamaIndex provides. Describing your dataset in multiple places may be redundant, so centralizing the description is advisable ^[1]^[2].
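One way to centralize the description is to keep it in a single constant and splice it into each prompt that needs it. The sketch below assumes a response synthesis template with `{query_str}`, `{pandas_instructions}` and `{pandas_output}` placeholders; check these against the prompt your query engine actually uses. `DATASET_CONTEXT`, `with_context` and the `{dataset_context}` slot are hypothetical names, not part of LlamaIndex. Because the pandas prompt produces code that must use the real column name, the display label probably matters most in the response synthesis step.

```python
# Hypothetical single source of truth for the dataset description.
DATASET_CONTEXT = (
    "Column 'EmpID' holds the employee number. Use the column name "
    "'EmpID' in code, but call it 'Employee Number' in written answers."
)

# Pandas prompt text with an extra, hypothetical {dataset_context} slot.
PANDAS_TMPL = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "{dataset_context}\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)

# Assumed placeholder names; adjust them to match your own prompt.
RESPONSE_SYNTHESIS_TMPL = (
    "Given an input question, synthesize a response from the query results.\n"
    "{dataset_context}\n"
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)


def with_context(template, context):
    """Fill only the {dataset_context} slot.

    All other placeholders are left in place for the prompt template
    to fill later. Braces in the context are doubled so they are not
    mistaken for placeholders.
    """
    escaped = context.replace("{", "{{").replace("}", "}}")
    return template.replace("{dataset_context}", escaped)


pandas_tmpl = with_context(PANDAS_TMPL, DATASET_CONTEXT)
response_tmpl = with_context(RESPONSE_SYNTHESIS_TMPL, DATASET_CONTEXT)
```

Both strings can then be wrapped in `PromptTemplate`. The description is written once, so a change to it reaches every prompt that uses it.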
