Conversation
@lrcouto lrcouto commented Sep 29, 2025

Description

Solves kedro-org/kedro#5089

What is this?

A dataset that can be used to load .txt, .json and .yaml files and convert them into LangChain prompt objects. Currently works with PromptTemplate and ChatPromptTemplate.

I recommend using it with Python >= 3.10 for safety/compatibility, and LangChain >= 0.3.0.

Expected data format

PromptTemplate expects a string by default. It accepts a set of parameters from the user that can be used to generate a prompt for a language model. For example:

Hello {name}, welcome to Kedro!

Or with a defined list of input variables:

{
  "template": "You are an expert in {field}. Answer the following question: {question}",
  "input_variables": ["field", "question"]
}

The same thing can be done in YAML format:

template: |
  Context: {context}
  
  Task: {task}
  
  Please provide a {detail_level} response.
input_variables:
  - context
  - task
  - detail_level

ChatPromptTemplate expects a dictionary or a list of tuples as input, with pairs of role and content parameters. For example:

[
    ("system", "You are a helpful AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "I'm doing well, thanks!"),
    ("human", "{user_input}"),
]

or

{
  "messages": {
    "system": "You are a coding assistant specialized in {language}.",
    "human": "Help me with: {problem}"
  } 
}

In YAML:

messages:
  - role: system
    content: "You are a {role}."
  - role: human
    content: "Please help with: {request}"
  - role: ai
    content: "I'll help you with {request}. Let me break it down."
  - role: human
    content: "Thanks! Can you also explain {additional_topic}?"

For further detail, see the LangChain documentation:
PromptTemplate: https://python.langchain.com/v0.2/api_reference/core/prompts/langchain_core.prompts.prompt.PromptTemplate.html
ChatPromptTemplate: https://python.langchain.com/v0.2/api_reference/core/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html

Data Catalog configuration

Example:

json_chat_prompt:
  type: langchain_prompt_dataset.datasets.langchain_prompt_dataset.LangChainPromptDataset
  filepath: data/prompts/chat_simple.json
  template: ChatPromptTemplate
  dataset:
    type: json.JSONDataset
    fs_args:
      load_args:
        encoding: utf-8
      save_args:
        ensure_ascii: false
  credentials: dev_creds
  metadata:
    kedro-viz:
      layer: raw

  • filepath: Path to the dataset file.
  • template: Which LangChain template should be used for the dataset, either PromptTemplate or ChatPromptTemplate. If none is chosen, PromptTemplate is the default.
  • dataset: Arguments for the chosen underlying dataset.
    • type: Which underlying dataset should be used to load the file. Can be text.TextDataset, json.JSONDataset or yaml.YAMLDataset. If none is chosen, it is inferred from the file extension.
    • fs_args: If the chosen underlying dataset accepts extra arguments, they can be passed here.
  • credentials: Passed to the underlying dataset. Works the same as for whichever dataset is chosen.

The kedro-viz: layer: raw parameter in the metadata allows the data preview to be displayed in the Kedro Viz metadata panel (kedro-org/kedro-viz#2490).

Looks like this:

[screenshot]

Node example:

You can take a YAML file like:

messages:
  - role: system
    content: "You are a {role}."
  - role: human
    content: "Please help with: {request}"
  - role: ai
    content: "I'll help you with {request}. Let me break it down."
  - role: human
    content: "Thanks! Can you also explain {additional_topic}?"

And pass it to a Kedro node, where the values for those variables can be supplied through the format_messages method, which returns a list of LangChain BaseMessages:

from langchain_core.messages import BaseMessage
from langchain_core.prompts import ChatPromptTemplate


def test_yaml_chat_prompt(yaml_chat_prompt: ChatPromptTemplate) -> list[BaseMessage]:
    """Test YAML chat prompt."""
    messages = yaml_chat_prompt.format_messages(
        role="technical writer",
        request="documentation for API endpoints",
        additional_topic="best practices for API versioning",
    )
    return messages

The returned messages will look like this:

[
    SystemMessage(
        content='You are a technical writer.',
        additional_kwargs={},
        response_metadata={}
    ),
    HumanMessage(
        content='Please help with: documentation for API endpoints',
        additional_kwargs={},
        response_metadata={}
    ),
    AIMessage(
        content="I'll help you with documentation for API endpoints. Let me break it down.",
        additional_kwargs={},
        response_metadata={}
    ),
    HumanMessage(
        content='Thanks! Can you also explain best practices for API versioning?',
        additional_kwargs={},
        response_metadata={}
    )
]

This is the right format to feed into one of LangChain's chat models.

Development notes

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)


@ElenaKhaustova ElenaKhaustova left a comment


I like the idea of everything in one dataset and the draft implementation! Left a few comments to discuss.


@ElenaKhaustova ElenaKhaustova left a comment


Thank you, @lrcouto!

I made another pass and left several things to address.

Apart from them can we please also:

  1. Add an example of how to test it, as we do with normal datasets
  2. Specify the minimum LangChain and Python versions this dataset will support
  3. Test it with kedro-org/kedro-viz#2490. It will require the .preview() method to be implemented.

@rashidakanchwala, can you please help us to build a branch with the preview and node colouring features, so we can use it when testing?

@lrcouto lrcouto marked this pull request as ready for review October 4, 2025 00:51

lrcouto commented Oct 4, 2025

@ElenaKhaustova everything should be corrected now.

For the format validation, I've left it only checking that the message isn't empty. If the data is not in the correct format, the underlying dataset throws an exception (for example, for bad JSON formatting), so the data never loads and never reaches the validation function.


lrcouto commented Oct 5, 2025

One more practical test that can be done as a usage example for this dataset: we can use it in Elena's RAG chatbot to eliminate the need for the create_chat_prompt node.

Replace the system_prompt.txt file with a full query prompt file (for this test I used YAML):

messages:
  - role: system
    content: >
      You are a powerful assistant who will answer user questions about the Kedro framework.
      As a source, you have a vector store of previous questions answered by Kedro team members.
      You can search through this vector store and retrieve information for context.
      You can use retrieved data for context only if it relates to the original question, otherwise, you can use your internal knowledge.
      Never mention users from the retrieved context. However, you may refer to the GitHub issues and other links mentioned.
      Provide code snippets if it may be helpful for the original question.
  - role: human
    content: >
      {input}
  - role: placeholder
    content: >
      {agent_scratchpad}

Replace the system_prompt entry in the DataCatalog with one using the LangChainPromptDataset:

chat_prompt:
  type: kedro_rag_chatbot.datasets.langchain_prompt_dataset.LangChainPromptDataset
  filepath: data/01_raw/query_prompt.yml
  template: ChatPromptTemplate
  dataset:
    type: yaml.YAMLDataset

Then the create_chat_prompt node can be completely removed, and this chat_prompt dataset can be passed instead to the create_agent node. It should work just the same as it did before.

[screenshot]


@ElenaKhaustova ElenaKhaustova left a comment


Thank you @lrcouto! The implementation looks much cleaner now and thanks for extending the PR description with examples ✨

I've added a few comments to make it ready for merging!

After you add the requirements to the pyproject.toml, could you please also share installation commands in the PR description, so the reviewers could easily test it?



class LangChainPromptDataset(AbstractDataset[PromptTemplate | ChatPromptTemplate, Any]):
"""Kedro dataset for loading LangChain prompts using existing Kedro datasets."""
Contributor

Can we please extend class docstrings like we do for the rest of the datasets? It is used in the docs, so it should be quite informative.


@ElenaKhaustova ElenaKhaustova left a comment


I've tested recent changes - all works as expected.

There are still issues with the version and class docstrings. Happy to approve when they're resolved.

except Exception as e:
raise DatasetError(f"Failed to create underlying dataset: {e}")

def _build_dataset_config(self, dataset: dict[str, Any] | str | None) -> dict[str, Any]:
Contributor

I wonder if it's better to specify strictly what the underlying datasets can be - just TextDataset, YAMLDataset and JSONDataset and error out if it isn't instead of inferring it. Unlike PartitionedDataset it's not like this can be any underlying dataset type

Contributor

Well, yeah, it makes sense cause now one can set a random dataset that will load data incompatible with the langchain template.

Contributor Author

Added some validation for that.

Contributor

What I meant was: instead of allowing a case where the user hasn't specified an underlying dataset config and inferring it from the file extension, we can just error out if the dataset config is not provided. And if the config is provided, maybe we can check that the type is only TextDataset, YAMLDataset or JSONDataset. (On further discussion with @ElenaKhaustova: this might limit users who want a custom underlying dataset, so I am not too fussed about whether we include this validation. If the data is not in the correct format, langchain should complain anyway. But it might be fine to limit it to these types for now.)

Contributor Author

I like the idea of allowing users to use a custom underlying dataset but I think this can be a future addition. I'd like to see if this dataset is something that people are actually interested in using first.


if dataset is not None:
dataset_type = dataset["type"] if isinstance(dataset, dict) else str(dataset)
if dataset_type not in valid_datasets:
Contributor

This will not work if the user sets the dataset using the full name, like kedro_datasets.text.TextDataset.

### Example usage for the [Python API](https://docs.kedro.org/en/stable/catalog-data/advanced_data_catalog_usage/):
```python
>>> from kedro_datasets_experimental.langchain import LangChainPromptDataset
Contributor

We probably don't need the >>> if we're wrapping it in a python code block.


@ElenaKhaustova ElenaKhaustova left a comment


Approving with a few comments - the rest looks good.

Thank you @lrcouto!

dict: A normalized dataset configuration dictionary.
"""

valid_datasets = {"text.TextDataset", "json.JSONDataset", "yaml.YAMLDataset"}
Contributor

Nit: it would be a bit cleaner if this became a constant and the validation logic were moved to a separate method.


valid_datasets = {"text.TextDataset", "json.JSONDataset", "yaml.YAMLDataset"}

if dataset is None:
Contributor

Can we please also add at least two unit tests to check that proper errors are raised?

lrcouto and others added 3 commits October 9, 2025 14:32

@ankatiyar ankatiyar left a comment


LGTM!

@lrcouto lrcouto merged commit 2a1a134 into main Oct 9, 2025
17 checks passed
@lrcouto lrcouto deleted the add-langchain-prompt-dataset branch October 9, 2025 15:53