Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 7 additions & 12 deletions kedro-datasets/RELEASE.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,28 +3,22 @@
## Major features and improvements

- Group datasets documentation according to the dependencies to clean up the nav bar.
- Added the following new experimental datasets:
- Added the following new **experimental** datasets:

| Type | Description | Location |
|--------------------------------|-------------------------------------------------------------|--------------------------------------|
| `langfuse.LangfuseTraceDataset` | A dataset to provide Langfuse tracing clients and callbacks | `kedro_datasets_experimental.langfuse` |

- Added the following new **experimental** datasets:

| Type | Description | Location |
| ----------------------------------- | -------------------------------------------------------- | --------------------------------------- |
| `langchain.LangChainPromptDataset` | Kedro dataset for loading LangChain prompts | `kedro_datasets_experimental.langchain` |
| `pypdf.PDFDataset` | A dataset to read PDF files and extract text using pypdf | `kedro_datasets_experimental.pypdf` |

## Bug fixes and other changes
- Add HTMLPreview type.
- Fixed `StudyDataset` to properly propagate a RDB password through the dataset's `credentials`.

## Major features and improvements

- Added the following new experimental datasets:
## Community contributions

| Type | Description | Location |
|--------------------------------|---------------------------------------------------------------|--------------------------------------|
| `pypdf.PDFDataset` | A dataset to read PDF files and extract text using pypdf | `kedro_datasets_experimental.pypdf` |
Many thanks to the following Kedroids for contributing PRs to this release:
- [Guillaume Tauzin](https://github.com/gtauzin)

# Release 8.1.0
## Major features and improvements
Expand Down Expand Up @@ -78,6 +72,7 @@ Many thanks to the following Kedroids for contributing PRs to this release:
- [Seohyun Park](https://github.com/soyamimi)
- [Daniel Russell-Brain](https://github.com/killerfridge)


# Release 7.0.0

## Major features and improvements
Expand Down
26 changes: 18 additions & 8 deletions kedro-datasets/kedro_datasets_experimental/optuna/study_dataset.py
Original file line number Diff line number Diff line change
Expand Up @@ -116,13 +116,13 @@ def __init__( # noqa: PLR0913
self._study_name = self._validate_study_name(study_name=study_name)

credentials = self._validate_credentials(backend=backend, credentials=credentials)
storage = URL.create(
storage_url = URL.create(
drivername=backend,
database=database,
**credentials,
)

self._storage = str(storage)
self._storage_url = storage_url
self.metadata = metadata

filepath = None
Expand Down Expand Up @@ -286,8 +286,10 @@ def load(self) -> optuna.Study:
pruner_config = load_args.pop("pruner")
pruner = self._get_pruner(pruner_config)

storage_url_str = self._storage_url.render_as_string(hide_password=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this expose your database password? I have never used Optuna, so don't really know how this works, but ideally this should still be a secure way of handling credentials.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does indeed expose your database password. I am not sure what are the security consequences of this but I am not seeing a workaround.

Optuna's RDBStorage, requires the url of the RDB in the form of a SQLAlchemy URL string and stores it directly as a class attribute. This string includes the target db, its host, port, password and name. Previously with self._storage = str(storage), the password was hidden and translated to "***" within the string which meant authentication to a password-protected RDB was impossible. Now even if I instantiate the RDBStorage class in the dataset class constructor to reuse it in save/load methods, its url class attribute would still contain all the credentials.

Would you have any pointers regarding to the security scenario where that would be problematic or to a discussion regarding secure handling of credentials? I am happy to learn more about this and try and come back with a better solution!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @merelcht
It's been a few months already and I would really like to get this merge as the current StudyDataset does not really allow for distributed hyperparameter tuning. I am happy to work on a solution but I need some pointers as mentioned above. Please let me know :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gtauzin , sincere apologies for not getting back to you earlier! This slipped my mind during and after the 1.0 release of Kedro.

My main concern is whether this could accidentally leak credentials and make it possible for bad actors to retrieve your database keys. Would it work if you just directly call:

storage = optuna.storages.RDBStorage(url=self._storage_url)

And skip the rendering as string?

Copy link
Member

@deepyaman deepyaman Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the password handling and don't necessarily see something better. The rendered password is only in a variable, not in anything saved to the class, right? Unless I'm missing something in my pass through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepyaman Indeed, all credentials are stored in a class variable _storage_url which means they can be directly accessed as my_dataset._storage_url.render_as_string(hide_password=False). This is the only place where they appear.

@merelcht No worries! Initially I figured you were busy with 1.0 and then I forgot about it until I started needing it again :)

For your suggestion, the url parameter required by RDBStorage is a string, which is why I use the render_as_string method.

I have had a look at redis.PickleDataset and it seems to me you can get back credentials from a dataset instance as well by doing my_pickle_dataset._redis_db.connection_pool.connection_kwargs.

Is the problem related to credentials are stored in the class and can be accessed clearly or the presence of a line of code which exposes them clearly?

Thanks to both of you for the help.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have had a look at redis.PickleDataset and it seems to me you can get back credentials from a dataset instance as well by doing my_pickle_dataset._redis_db.connection_pool.connection_kwargs.

This is a fair point! I don't have a good alternative at this point either and don't want to be a blocker to get this in. My suggestion is to merge this now as is and come back to it in case it becomes an issue. Thanks for your patience @gtauzin ! 🙏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merelcht Thank you :)

storage = optuna.storages.RDBStorage(url=storage_url_str)
study = optuna.load_study(
storage=self._storage,
storage=storage,
study_name=self._get_load_study_name(),
sampler=sampler,
pruner=pruner,
Expand All @@ -297,37 +299,45 @@ def load(self) -> optuna.Study:

def save(self, study: optuna.Study) -> None:
save_study_name = self._get_save_study_name()

storage_url_str = self._storage_url.render_as_string(hide_password=False)
if self._backend == "sqlite":
os.makedirs(os.path.dirname(self._filepath), exist_ok=True)

if not os.path.isfile(self._filepath):
optuna.create_study(
storage=self._storage,
storage=storage_url_str,
)

storage = optuna.storages.RDBStorage(url=storage_url_str)

# To overwrite an existing study, we need to first delete it if it exists
if self._study_name_exists(save_study_name):
optuna.delete_study(
storage=self._storage,
storage=storage,
study_name=save_study_name,
)

optuna.copy_study(
from_study_name=study.study_name,
from_storage=study._storage,
to_storage=self._storage,
to_storage=storage,
to_study_name=save_study_name,
)

def _study_name_exists(self, study_name) -> bool:
if self._backend == "sqlite" and not os.path.isfile(self._database):
return False

study_names = optuna.study.get_all_study_names(storage=self._storage)
storage_url_str = self._storage_url.render_as_string(hide_password=False)
storage = optuna.storages.RDBStorage(url=storage_url_str)
study_names = optuna.study.get_all_study_names(storage=storage)
return study_name in study_names

def _study_name_glob(self, pattern):
study_names = optuna.study.get_all_study_names(storage=self._storage)
storage_url_str = self._storage_url.render_as_string(hide_password=False)
storage = optuna.storages.RDBStorage(url=storage_url_str)
study_names = optuna.study.get_all_study_names(storage=storage)
for study_name in study_names:
if fnmatch.fnmatch(study_name, pattern):
yield study_name
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,26 @@ def test_invalid_credentials(self):
credentials={"username": "user", "pwd": "pass"}, # pragma: allowlist secret
)

def test_study_existence(self):
"""Test invalid credentials raise ValueError."""
study_dataset = StudyDataset(
study_name="test",
backend="postgresql",
database="optuna_db",
credentials={"username": "user", "password": "pass"}, # pragma: allowlist secret
)

# Test that RDB storage can be created but DB access module cannot be imported
with pytest.raises(ImportError, match="Failed to import DB access module"):
study_dataset._study_name_exists(study_name="blah")

study_dataset = StudyDataset(
study_name="test",
backend="sqlite",
database="optuna.db",
)
assert study_dataset._study_name_exists(study_name="blah") is False

def test_study_name_exists(self, study_dataset, dummy_study):
"""Test `_study_name_exists` method."""
assert not study_dataset._study_name_exists("test")
Expand Down