stable-datasets

Datasets implemented as HuggingFace datasets builders, with custom download & caching.

This is an under-development research project; expect bugs and sharp edges.

What is it?

  • Datasets live in stable_datasets/images/ and stable_datasets/timeseries/.
  • Each dataset is a HuggingFace datasets.GeneratorBasedBuilder (via BaseDatasetBuilder).
  • Downloads use local custom logic (stable_datasets/utils.py) rather than HuggingFace’s download manager.
  • Returned objects are datasets.Dataset instances (Arrow-backed), which can be formatted for NumPy / PyTorch as needed.

Minimal Example

from stable_datasets.images.arabic_characters import ArabicCharacters

# First run will download + prepare cache, then return the split as a HF Dataset
ds = ArabicCharacters(split="train")

# If you omit the split (split=None), you get a DatasetDict with all available splits
ds_all = ArabicCharacters(split=None)

sample = ds[0]
print(sample.keys())  # dict_keys(['image', 'label'])

# Optional: make it PyTorch-friendly
ds_torch = ds.with_format("torch")
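
With the "torch" format, the dataset can be passed straight to a standard PyTorch DataLoader. A minimal sketch (the batch size is illustrative; default collation assumes all samples share a shape, otherwise pass a custom collate_fn):

from torch.utils.data import DataLoader

loader = DataLoader(ds_torch, batch_size=32, shuffle=True)
batch = next(iter(loader))
print(batch["image"].shape, batch["label"].shape)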

Building a dataset with BaseDatasetBuilder

Each dataset is a HuggingFace datasets.GeneratorBasedBuilder subclass that follows a simple convention:

  • Define VERSION: bump when your builder output changes.
  • Define SOURCE (or override _source()): provides at least {"homepage": "...", "citation": "...", "assets": {"train": "...", "test": "...", ...}}.
  • Implement _info(): defines features/metadata.
  • Implement _generate_examples(self, data_path, split): yields (key, example_dict); data_path is the downloaded artifact for that split.

Minimal skeleton:

import datasets

from stable_datasets.utils import BaseDatasetBuilder


class MyDataset(BaseDatasetBuilder):
    VERSION = datasets.Version("1.0.0")
    SOURCE = {
        "homepage": "https://example.com",
        "citation": "TBD",
        "assets": {
            "train": "https://example.com/train.zip",
            "test": "https://example.com/test.zip",
        },
    }

    def _info(self):
        # Declare the features (and optional metadata) this builder produces.
        return datasets.DatasetInfo(
            features=datasets.Features({"x": datasets.Value("int32")}),
            homepage=self.SOURCE.get("homepage"),
            citation=self.SOURCE.get("citation"),
        )

    def _generate_examples(self, data_path, split):
        # Read the downloaded artifact at data_path (zip/npz/etc.)
        # and yield (key, example) pairs for the requested split.
        yield "0", {"x": 0}
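
Once defined, the builder is used the same way as the Minimal Example above (hypothetical usage; with the placeholder URLs this would fail at download time):

ds = MyDataset(split="train")
print(ds[0])  # {"x": 0}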

Custom cache locations

By default:

  • Downloads: ~/.stable_datasets/downloads/
  • Processed Arrow cache: ~/.stable_datasets/processed/

You can override both when constructing a dataset:

ds = ArabicCharacters(
    split="train",
    download_dir="/tmp/stable_datasets_downloads",
    processed_cache_dir="/tmp/stable_datasets_processed",
)
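
There is no dedicated cache-clearing command documented here; deleting the directories forces a re-download and rebuild on the next run. A sketch using the default locations:

import shutil
from pathlib import Path

cache_root = Path.home() / ".stable_datasets"
shutil.rmtree(cache_root / "downloads", ignore_errors=True)  # raw downloads
shutil.rmtree(cache_root / "processed", ignore_errors=True)  # processed Arrow cache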

Installation

pip install -e .
# Optional (dev tools + tests + docs):
pip install -e ".[dev,docs]"

Running tests

pytest -q

Some tests download data and may be slow. You can filter by markers:

  • Skip slow tests: pytest -m "not slow"
  • Run only download tests: pytest -m download
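
These filters rely on the usual pytest marker mechanism; a test that hits the network would be tagged roughly like this (an illustrative sketch, not the actual test suite):

import pytest

from stable_datasets.images.arabic_characters import ArabicCharacters


@pytest.mark.slow
@pytest.mark.download
def test_train_split_downloads_and_builds():
    ds = ArabicCharacters(split="train")  # first run downloads + prepares the cache
    assert len(ds) > 0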

Generating teaser figures

Use the generate_teaser.py script to create visual previews of datasets for documentation:

# Generate a teaser with 5 samples
python generate_teaser.py --name CIFAR10 --num-samples 5 --output docs/source/datasets/teasers/cifar10_teaser.png

# Generate and display (without saving)
python generate_teaser.py --name MNIST --num-samples 8

# Customize figure size
python generate_teaser.py --name CIFAR100 --num-samples 10 --figsize 2.0 --output cifar100.png

Datasets

See the module lists under stable_datasets/images/ and stable_datasets/timeseries/.
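
To enumerate the implemented datasets from Python, you can walk the submodules directly (a sketch, assuming the datasets are regular submodules of the two packages):

import pkgutil

import stable_datasets.images
import stable_datasets.timeseries

for pkg in (stable_datasets.images, stable_datasets.timeseries):
    for mod in pkgutil.iter_modules(pkg.__path__):
        print(f"{pkg.__name__}.{mod.name}")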
