Conversation
@lrcouto lrcouto commented Sep 29, 2025

Description

Solves kedro-org/kedro#5089

What is this?

A dataset that can be used to load .txt, .json and .yaml files and convert them into LangChain prompt objects. Currently works with PromptTemplate and ChatPromptTemplate.

I recommend using it with Python >= 3.10 for safety/compatibility, and LangChain >= 0.3.0.

Expected data format

PromptTemplate expects a string by default. It accepts a set of parameters from the user that can be used to generate a prompt for a language model. For example:

Hello {name}, welcome to Kedro!

Or with a defined list of input variables:

{
  "template": "You are an expert in {field}. Answer the following question: {question}",
  "input_variables": ["field", "question"]
}

The same thing can be done in YAML format:

template: |
  Context: {context}
  
  Task: {task}
  
  Please provide a {detail_level} response.
input_variables:
  - context
  - task
  - detail_level

ChatPromptTemplate expects a dictionary or a list of tuples as input, with pairs of role and content parameters. For example:

[
    ("system", "You are a helpful AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "I'm doing well, thanks!"),
    ("human", "{user_input}"),
]

or

{
  "messages": {
    "system": "You are a coding assistant specialized in {language}.",
    "human": "Help me with: {problem}"
  } 
}

In YAML:

messages:
  - role: system
    content: "You are a {role}."
  - role: human
    content: "Please help with: {request}"
  - role: ai
    content: "I'll help you with {request}. Let me break it down."
  - role: human
    content: "Thanks! Can you also explain {additional_topic}?"

For further detail, see the LangChain documentation:
PromptTemplate: https://python.langchain.com/v0.2/api_reference/core/prompts/langchain_core.prompts.prompt.PromptTemplate.html
ChatPromptTemplate: https://python.langchain.com/v0.2/api_reference/core/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html

Data Catalog configuration

Example:

json_chat_prompt:
  type: langchain_prompt_dataset.datasets.langchain_prompt_dataset.LangChainPromptDataset
  filepath: data/prompts/chat_simple.json
  template: ChatPromptTemplate
  dataset:
    type: json.JSONDataset
    fs_args:
      load_args:
        encoding: utf-8
      save_args:
        ensure_ascii: false
  credentials: dev_creds
  metadata:
    kedro-viz:
      layer: raw

  • filepath: Path to the dataset file.
  • template: Which LangChain template should be used for the dataset, either PromptTemplate or ChatPromptTemplate. If none is chosen, PromptTemplate is the default.
  • dataset: Arguments for the chosen underlying dataset.
    • type: Which underlying dataset should be used to load the file. Can be text.TextDataset, json.JSONDataset or yaml.YAMLDataset. If none is chosen, it is inferred from the file extension.
    • fs_args: If the chosen underlying dataset accepts extra arguments, they can be passed here.
  • credentials: Passed to the underlying dataset. Works the same as for whichever dataset is chosen.

The kedro-viz: layer: raw parameter in the metadata allows the data preview to be displayed in the Kedro Viz metadata panel (kedro-org/kedro-viz#2490).

Looks like this:

[screenshot]

Node example:

You can take a YAML file like:

messages:
  - role: system
    content: "You are a {role}."
  - role: human
    content: "Please help with: {request}"
  - role: ai
    content: "I'll help you with {request}. Let me break it down."
  - role: human
    content: "Thanks! Can you also explain {additional_topic}?"

And pass it to a Kedro node, where the values for those variables can be supplied through the format_messages method, which returns a list of LangChain BaseMessages:

from langchain_core.messages import BaseMessage
from langchain_core.prompts import ChatPromptTemplate


def test_yaml_chat_prompt(yaml_chat_prompt: ChatPromptTemplate) -> list[BaseMessage]:
    """Test YAML chat prompt."""
    messages = yaml_chat_prompt.format_messages(
        role="technical writer",
        request="documentation for API endpoints",
        additional_topic="best practices for API versioning",
    )
    return messages

The returned messages will look like this:

[
    SystemMessage(
        content='You are a technical writer.',
        additional_kwargs={},
        response_metadata={}
    ),
    HumanMessage(
        content='Please help with: documentation for API endpoints',
        additional_kwargs={},
        response_metadata={}
    ),
    AIMessage(
        content="I'll help you with documentation for API endpoints. Let me break it down.",
        additional_kwargs={},
        response_metadata={}
    ),
    HumanMessage(
        content='Thanks! Can you also explain best practices for API versioning?',
        additional_kwargs={},
        response_metadata={}
    )
]

This is the right format to feed into one of LangChain's chat models.

Development notes

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)


@ElenaKhaustova ElenaKhaustova left a comment


I like the idea of everything in one dataset and the draft implementation! Left a few comments to discuss.


@ElenaKhaustova ElenaKhaustova left a comment


Thank you, @lrcouto!

I made another pass and left several things to address.

Apart from them can we please also:

  1. Add an example of how to test it, as we do with normal datasets
  2. Specify the minimum LangChain and Python versions this dataset will support
  3. Test it with kedro-org/kedro-viz#2490. It will require the .preview() method to be implemented.

@rashidakanchwala, can you please help us to build a branch with the preview and node colouring features, so we can use it when testing?

@lrcouto lrcouto marked this pull request as ready for review October 4, 2025 00:51

lrcouto commented Oct 4, 2025

@ElenaKhaustova everything should be corrected now.

For the format validation, I've left it only checking that the message isn't empty. If the data is not in the correct format, the underlying dataset throws an exception (for example, for bad JSON formatting), so the data never loads and never reaches the validation function.


lrcouto commented Oct 5, 2025

One more practical test that can be done as a usage example for this dataset: we can use it in Elena's RAG chatbot to eliminate the need for the create_chat_prompt node.

Replace the system_prompt.txt file with a full query prompt file (for this test I used YAML):

messages:
  - role: system
    content: >
      You are a powerful assistant who will answer user questions about the Kedro framework.
      As a source, you have a vector store of previous questions answered by Kedro team members.
      You can search through this vector store and retrieve information for context.
      You can use retrieved data for context only if it relates to the original question, otherwise, you can use your internal knowledge.
      Never mention users from the retrieved context. However, you may refer to the GitHub issues and other links mentioned.
      Provide code snippets if it may be helpful for the original question.
  - role: human
    content: >
      {input}
  - role: placeholder
    content: >
      {agent_scratchpad}

Replace the system_prompt entry in the DataCatalog with one using the LangChainPromptDataset:

chat_prompt:
  type: kedro_rag_chatbot.datasets.langchain_prompt_dataset.LangChainPromptDataset
  filepath: data/01_raw/query_prompt.yml
  template: ChatPromptTemplate
  dataset:
    type: yaml.YAMLDataset

Then the create_chat_prompt node can be completely removed, and this chat_prompt dataset can be passed instead to the create_agent node. It should work just the same as it did before.

[screenshot]


@ElenaKhaustova ElenaKhaustova left a comment


Thank you @lrcouto! The implementation looks much cleaner now and thanks for extending the PR description with examples ✨

I've added a few comments to make it ready for merging!

After you add the requirements to the pyproject.toml, could you please also share installation commands in the PR description, so the reviewers could easily test it?



class LangChainPromptDataset(AbstractDataset[PromptTemplate | ChatPromptTemplate, Any]):
"""Kedro dataset for loading LangChain prompts using existing Kedro datasets."""
Contributor

Can we please extend class docstrings like we do for the rest of the datasets? It is used in the docs, so it should be quite informative.


@ElenaKhaustova ElenaKhaustova left a comment


I've tested recent changes - all works as expected.

There are still issues with the version and class docstrings. Happy to approve when they're resolved.

except Exception as e:
raise DatasetError(f"Failed to create underlying dataset: {e}")

def _build_dataset_config(self, dataset: dict[str, Any] | str | None) -> dict[str, Any]:
Contributor

I wonder if it's better to specify strictly what the underlying datasets can be - just TextDataset, YAMLDataset and JSONDataset and error out if it isn't instead of inferring it. Unlike PartitionedDataset it's not like this can be any underlying dataset type

Contributor

Well, yeah, it makes sense cause now one can set a random dataset that will load data incompatible with the langchain template.

Contributor Author

Added some validation for that.

Contributor

What I meant was: instead of allowing a case where the user hasn't specified an underlying dataset config and inferring it from the file extension, we can just error out if the dataset config is not provided. And if the config is provided, maybe we can check that the type is only TextDataset, YAMLDataset or JSONDataset. (On further discussion with @ElenaKhaustova: this might limit users who want a custom underlying dataset, so I am not too fussed about whether we include this validation. If the data is not in the correct format, langchain should complain anyway. But it might be fine to limit it to these types for now.)

Contributor Author

I like the idea of allowing users to use a custom underlying dataset but I think this can be a future addition. I'd like to see if this dataset is something that people are actually interested in using first.


if dataset is not None:
dataset_type = dataset["type"] if isinstance(dataset, dict) else str(dataset)
if dataset_type not in valid_datasets:
Contributor

This will not work if the user sets the dataset using the full name, like kedro_datasets.text.TextDataset.

### Example usage for the [Python API](https://docs.kedro.org/en/stable/catalog-data/advanced_data_catalog_usage/):
```python
>>> from kedro_datasets_experimental.langchain import LangChainPromptDataset
Contributor

We probably don't need the >>> if we're wrapping it in a python code block.


@ElenaKhaustova ElenaKhaustova left a comment


Approving with a few comments - the rest looks good.

Thank you @lrcouto!

dict: A normalized dataset configuration dictionary.
"""

valid_datasets = {"text.TextDataset", "json.JSONDataset", "yaml.YAMLDataset"}
Contributor

Nit: it would be a bit cleaner if this became a constant and the validation logic were moved to a separate method.


valid_datasets = {"text.TextDataset", "json.JSONDataset", "yaml.YAMLDataset"}

if dataset is None:
Contributor

Can we please also add at least two unit tests to check that proper errors are raised?

lrcouto and others added 3 commits October 9, 2025 14:32

@ankatiyar ankatiyar left a comment


LGTM!

@lrcouto lrcouto merged commit 2a1a134 into main Oct 9, 2025
17 checks passed
@lrcouto lrcouto deleted the add-langchain-prompt-dataset branch October 9, 2025 15:53