feat(datasets): Add LangchainPromptDataset to experimental datasets #1200
Conversation
Signed-off-by: Laura Couto <[email protected]>
I like the idea of everything in one dataset and the draft implementation! Left a few comments to discuss.
kedro-datasets/kedro_datasets_experimental/langchain/langchain_prompt_dataset.py
Thank you, @lrcouto!
I made another pass and left several things to address.
Apart from them, can we please also:
- Add an example of how to test it, as we do with normal datasets
- Specify minimum `LangChain` and `Python` versions this dataset will support
- Test it with kedro-org/kedro-viz#2490. It will require the `.preview()` method to be implemented.
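For the `.preview()` requirement, a minimal dependency-free sketch of what such a method could return. The class below is a stand-in, not the real dataset; the method name `preview` matches the Kedro-Viz convention, everything else is an assumption:

```python
# Hypothetical sketch of a .preview() method for Kedro-Viz integration.
# The class is a minimal stand-in for the real LangChainPromptDataset.
class LangChainPromptDataset:
    def __init__(self, raw_text: str):
        # raw_text stands in for whatever the underlying dataset loads.
        self._raw = raw_text

    def preview(self) -> str:
        # Return the raw prompt text so Kedro-Viz can render it in the
        # metadata panel.
        return self._raw


ds = LangChainPromptDataset("Hello {name}, welcome to Kedro!")
print(ds.preview())
```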
@rashidakanchwala, can you please help us to build a branch with the preview and node colouring features, so we can use it when testing?
@ElenaKhaustova everything should be corrected now. For the format validation, I've left it only checking that the message isn't empty. If the data is not in the correct format, the underlying dataset throws an exception (for example, for bad JSON formatting), so it doesn't even load and never reaches the validation function.
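A minimal sketch of the "message isn't empty" check described above; the function name and the use of `ValueError` (instead of Kedro's `DatasetError`) are assumptions:

```python
# Sketch of a minimal non-empty validation; deeper format errors are
# expected to surface from the underlying dataset's load() instead.
def validate_prompt_data(data) -> None:
    if data is None:
        raise ValueError("Prompt file is empty.")
    if isinstance(data, str) and not data.strip():
        raise ValueError("Prompt file is empty.")
    if isinstance(data, (dict, list)) and len(data) == 0:
        raise ValueError("Prompt file is empty.")


validate_prompt_data("Hello {name}!")  # passes silently
```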
One more practical test that can be done as an example of usage for this dataset - we can use it on Elena's RAG chatbot to eliminate the need for the … Replace the `messages`:

```yaml
- role: system
  content: >
    You are a powerful assistant who will answer user questions about the Kedro framework.
    As a source, you have a vector store of previous questions answered by Kedro team members.
    You can search through this vector store and retrieve information for context.
    You can use retrieved data for context only if it relates to the original question, otherwise, you can use your internal knowledge.
    Never mention users from the retrieved context. However, you may refer to the GitHub issues and other links mentioned.
    Provide code snippets if it may be helpful for the original question.
- role: human
  content: >
    {input}
- role: placeholder
  content: >
    {agent_scratchpad}
```

Replace the `chat_prompt`:

```yaml
chat_prompt:
  type: kedro_rag_chatbot.datasets.langchain_prompt_dataset.LangChainPromptDataset
  filepath: data/01_raw/query_prompt.yml
  template: ChatPromptTemplate
  dataset:
    type: yaml.YAMLDataset
```

Then the …
Thank you @lrcouto! The implementation looks much cleaner now and thanks for extending the PR description with examples ✨
I've added a few comments to make it ready for merging!
After you add the requirements to the pyproject.toml, could you please also share installation commands in the PR description, so the reviewers could easily test it?
```python
class LangChainPromptDataset(AbstractDataset[PromptTemplate | ChatPromptTemplate, Any]):
    """Kedro dataset for loading LangChain prompts using existing Kedro datasets."""
```
Can we please extend class docstrings like we do for the rest of the datasets? It is used in the docs, so it should be quite informative.
I've tested recent changes - all works as expected.
There are still issues with the version and class docstrings. Happy to approve when they're resolved.
```python
        except Exception as e:
            raise DatasetError(f"Failed to create underlying dataset: {e}")

    def _build_dataset_config(self, dataset: dict[str, Any] | str | None) -> dict[str, Any]:
```
I wonder if it's better to specify strictly what the underlying datasets can be - just TextDataset, YAMLDataset and JSONDataset and error out if it isn't instead of inferring it. Unlike PartitionedDataset it's not like this can be any underlying dataset type
Well, yeah, it makes sense cause now one can set a random dataset that will load data incompatible with the langchain template.
Added some validation for that.
What I meant was: instead of allowing for a case where the user hasn't specified an underlying dataset config and inferring it from the file extension, we can just error out if the dataset config is not provided. And if the config is provided, maybe we can check that the type is only TextDataset, YAMLDataset or JSONDataset. (On further discussion with @ElenaKhaustova: this might limit the user in case they wanted to use a custom underlying dataset, so I am not too fussed about whether we include this validation or not. If the data is not in the correct format, LangChain should complain anyway. But it might be fine to limit this to these types for now.)
I like the idea of allowing users to use a custom underlying dataset but I think this can be a future addition. I'd like to see if this dataset is something that people are actually interested in using first.
```python
        if dataset is not None:
            dataset_type = dataset["type"] if isinstance(dataset, dict) else str(dataset)
            if dataset_type not in valid_datasets:
```
This will not work if the user sets the dataset using the full name, like kedro_datasets.text.TextDataset
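One way to handle both the short and the fully qualified form is to compare only the last two dotted components. This is a sketch, not the PR's implementation; the function name and the use of `ValueError` are assumptions:

```python
VALID_DATASETS = {"text.TextDataset", "json.JSONDataset", "yaml.YAMLDataset"}


def normalize_dataset_type(dataset_type: str) -> str:
    # Accept both "text.TextDataset" and "kedro_datasets.text.TextDataset"
    # by keeping only the last two dotted components before validating.
    short = ".".join(dataset_type.split(".")[-2:])
    if short not in VALID_DATASETS:
        raise ValueError(f"Unsupported dataset type: {dataset_type!r}")
    return short


print(normalize_dataset_type("kedro_datasets.text.TextDataset"))  # text.TextDataset
```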
> ### Example usage for the [Python API](https://docs.kedro.org/en/stable/catalog-data/advanced_data_catalog_usage/):

```python
>>> from kedro_datasets_experimental.langchain import LangChainPromptDataset
```
We probably don't need >>> if wrapping into the python code block
Approving with a few comments - the rest looks good.
Thank you @lrcouto!
```python
            dict: A normalized dataset configuration dictionary.
        """
        valid_datasets = {"text.TextDataset", "json.JSONDataset", "yaml.YAMLDataset"}
```
Nit: would be a bit cleaner if this became a constant and the validation logic is moved to a separate method
```python
        valid_datasets = {"text.TextDataset", "json.JSONDataset", "yaml.YAMLDataset"}

        if dataset is None:
```
Can we please also add at least two unit tests to check that proper errors are raised?
LGTM!
Description
Solves kedro-org/kedro#5089
What is this?
A dataset that can be used to load `.txt`, `.json` and `.yaml` files and convert them into LangChain prompt objects. It currently works with `PromptTemplate` and `ChatPromptTemplate`. I recommend using it with Python >= 3.10 for safety/compatibility, and LangChain >= 0.3.0.
Expected data format
`PromptTemplate` expects a string by default. It accepts a set of parameters from the user that can be used to generate a prompt for a language model. For example:

```
Hello {name}, welcome to Kedro!
```

Or with a defined list of input variables:
```json
{
  "template": "You are an expert in {field}. Answer the following question: {question}",
  "input_variables": ["field", "question"]
}
```

Or the same thing can be done in YAML format:
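As a sketch, the YAML equivalent of the JSON configuration above might look like:

```yaml
# Sketch: YAML form of the same PromptTemplate configuration.
template: "You are an expert in {field}. Answer the following question: {question}"
input_variables:
  - field
  - question
```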
`ChatPromptTemplate` expects as input a dictionary or a list of tuples, with pairs of role and content parameters. For example:

```python
[
    ("system", "You are a helpful AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "I'm doing well, thanks!"),
    ("human", "{user_input}"),
]
```

or
```json
{
  "messages": {
    "system": "You are a coding assistant specialized in {language}.",
    "human": "Help me with: {problem}"
  }
}
```

In YAML:
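As a sketch, the equivalent YAML form of the messages dictionary above might be:

```yaml
# Sketch: YAML form of the same ChatPromptTemplate messages.
messages:
  system: "You are a coding assistant specialized in {language}."
  human: "Help me with: {problem}"
```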
For further detail see the LangChain documentation -
PromptTemplate: https://python.langchain.com/v0.2/api_reference/core/prompts/langchain_core.prompts.prompt.PromptTemplate.html
ChatPromptTemplate: https://python.langchain.com/v0.2/api_reference/core/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html
Data Catalog configuration.
Example:
- `filepath`: Path to the dataset file.
- `template`: Which LangChain template should be used for that dataset, either `PromptTemplate` or `ChatPromptTemplate`. If none is chosen, `PromptTemplate` is the default.
- `dataset`: Arguments related to the chosen underlying dataset:
  - `type`: Which underlying dataset should be used to load the file. Can be `text.TextDataset`, `json.JSONDataset` or `yaml.YAMLDataset`. If none is chosen, it'll be inferred from the file extension.
  - `fs_args`: If the chosen underlying dataset allows for extra arguments, they can be passed here.
  - `credentials`: Will be passed to the underlying dataset. Works the same as for whichever dataset is chosen.

The `kedro-viz: layer: raw` parameter in the metadata allows the data preview to be displayed in the Kedro-Viz metadata panel (kedro-org/kedro-viz#2490).
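A catalog entry combining the options above might look like this; the entry name, filepath, and the `type` path are assumptions for illustration:

```yaml
# Hypothetical catalog entry; entry name, filepath and type path assumed.
chat_prompt:
  type: langchain.LangChainPromptDataset
  filepath: data/01_raw/chat_prompt.yml
  template: ChatPromptTemplate
  dataset:
    type: yaml.YAMLDataset
  metadata:
    kedro-viz:
      layer: raw
```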
Node example:
You can take a YAML file like:
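For instance, a hypothetical prompt file consistent with the example output shown below; the roles follow the dataset's chat format, but the exact variable names are assumptions:

```yaml
# Hypothetical chat prompt file; variable names are illustrative only.
- role: system
  content: "You are a {profession}."
- role: human
  content: "Please help with: {task}"
- role: ai
  content: "I'll help you with {task}. Let me break it down."
- role: human
  content: "Thanks! Can you also explain best practices for {followup_topic}?"
```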
And pass it to a Kedro node, where the inputs for those variables can be passed using the `format_messages` method, which returns a list of LangChain `BaseMessage` objects. The output of `messages` will be like this:

```python
[
    SystemMessage(
        content='You are a technical writer.',
        additional_kwargs={},
        response_metadata={}
    ),
    HumanMessage(
        content='Please help with: documentation for API endpoints',
        additional_kwargs={},
        response_metadata={}
    ),
    AIMessage(
        content="I'll help you with documentation for API endpoints. Let me break it down.",
        additional_kwargs={},
        response_metadata={}
    ),
    HumanMessage(
        content='Thanks! Can you also explain best practices for API versioning?',
        additional_kwargs={},
        response_metadata={}
    )
]
```

Which is the right format to be fed into one of LangChain's chat models.
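Putting it together, a minimal node-body sketch. `format_messages` is the real LangChain method name; `StubPrompt` stands in for the loaded `ChatPromptTemplate` to keep the sketch dependency-free, and all other names are hypothetical:

```python
# StubPrompt stands in for the ChatPromptTemplate loaded by the dataset;
# only format_messages mirrors the real LangChain API.
class StubPrompt:
    def __init__(self, template: str):
        self.template = template

    def format_messages(self, **kwargs):
        # The real method returns BaseMessage objects; a (role, text) tuple
        # is used here to keep the sketch dependency-free.
        return [("human", self.template.format(**kwargs))]


def build_messages(chat_prompt, task: str):
    # Kedro node body: fill the prompt's variables and return the messages
    # ready to be fed into a LangChain chat model.
    return chat_prompt.format_messages(task=task)


messages = build_messages(StubPrompt("Please help with: {task}"), "API docs")
print(messages)  # [('human', 'Please help with: API docs')]
```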
Development notes
Checklist
- Updated `jsonschema/kedro-catalog-X.XX.json` if necessary
- Updated the `RELEASE.md` file