Datasets implemented as HuggingFace datasets builders, with custom download & caching.
This is an under-development research project; expect bugs and sharp edges.
- Datasets live in `stable_datasets/images/` and `stable_datasets/timeseries/`.
- Each dataset is a HuggingFace `datasets.GeneratorBasedBuilder` (via `BaseDatasetBuilder`).
- Downloads use local custom logic (`stable_datasets/utils.py`) rather than HuggingFace's download manager.
- Returned objects are `datasets.Dataset` instances (Arrow-backed), which can be formatted for NumPy / PyTorch as needed.
```python
from stable_datasets.images.arabic_characters import ArabicCharacters

# First run will download + prepare the cache, then return the split as a HF Dataset
ds = ArabicCharacters(split="train")

# If you omit the split (split=None), you get a DatasetDict with all available splits
ds_all = ArabicCharacters(split=None)

sample = ds[0]
print(sample.keys())  # {"image", "label"}

# Optional: make it PyTorch-friendly
ds_torch = ds.with_format("torch")
```
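Because the returned object is a plain `datasets.Dataset`, the standard formatting helpers apply. A minimal sketch of NumPy / PyTorch usage (assuming the `image` and `label` features above convert cleanly to arrays/tensors of uniform shape):

```python
from torch.utils.data import DataLoader

from stable_datasets.images.arabic_characters import ArabicCharacters

# Arrow-backed HF Dataset, formatted to return PyTorch tensors
ds_torch = ArabicCharacters(split="train").with_format("torch")

# A formatted HF Dataset works directly as a map-style PyTorch dataset
loader = DataLoader(ds_torch, batch_size=32, shuffle=True)
batch = next(iter(loader))
print(batch["image"].shape, batch["label"].shape)

# NumPy formatting works the same way
ds_np = ArabicCharacters(split="train").with_format("numpy")
print(type(ds_np[0]["image"]))  # numpy.ndarray
```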
Each dataset is a Hugging Face `datasets.GeneratorBasedBuilder` subclass that follows a simple convention:

- Define `VERSION`: bump it when your builder's output changes.
- Define `SOURCE` (or override `_source()`): provides at least `{"homepage": "...", "citation": "...", "assets": {"train": "...", "test": "...", ...}}`.
- Implement `_info()`: defines features/metadata.
- Implement `_generate_examples(self, data_path, split)`: yields `(key, example_dict)`; `data_path` is the downloaded artifact for that split.
Minimal skeleton:
```python
import datasets

from stable_datasets.utils import BaseDatasetBuilder


class MyDataset(BaseDatasetBuilder):
    VERSION = datasets.Version("1.0.0")
    SOURCE = {
        "homepage": "https://example.com",
        "citation": "TBD",
        "assets": {
            "train": "https://example.com/train.zip",
            "test": "https://example.com/test.zip",
        },
    }

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"x": datasets.Value("int32")}),
            supervised_keys=("x",),
            homepage=self.SOURCE.get("homepage"),
        )

    def _generate_examples(self, data_path, split):
        # read from data_path (zip/npz/etc), then yield (key, example) pairs
        yield "0", {"x": 0}
```
Downloads and processed data are cached locally. By default:

- Downloads: `~/.stable_datasets/downloads/`
- Processed Arrow cache: `~/.stable_datasets/processed/`
You can override both when constructing a dataset:
```python
ds = ArabicCharacters(
    split="train",
    download_dir="/tmp/stable_datasets_downloads",
    processed_cache_dir="/tmp/stable_datasets_processed",
)
```

Install the package in editable mode:

```bash
pip install -e .

# Optional (dev tools + tests + docs):
pip install -e ".[dev,docs]"
```

Run the test suite with:

```bash
pytest -q
```

Some tests download data and may be slow. You can filter by markers:
- Skip slow tests: `pytest -m "not slow"`
- Run only download tests: `pytest -m download`
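These filters rely on pytest markers attached to individual tests. A hypothetical example of a marked test (the test body and dataset choice are illustrative):

```python
import pytest

from stable_datasets.images.arabic_characters import ArabicCharacters


@pytest.mark.slow
@pytest.mark.download
def test_train_split_downloads_and_loads():
    # Exercises the real download path, hence both markers
    ds = ArabicCharacters(split="train")
    assert len(ds) > 0
    assert {"image", "label"} <= set(ds[0].keys())
```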
Use the `generate_teaser.py` script to create visual previews of datasets for documentation:
```bash
# Generate a teaser with 5 samples
python generate_teaser.py --name CIFAR10 --num-samples 5 --output docs/source/datasets/teasers/cifar10_teaser.png

# Generate and display (without saving)
python generate_teaser.py --name MNIST --num-samples 8

# Customize figure size
python generate_teaser.py --name CIFAR100 --num-samples 10 --figsize 2.0 --output cifar100.png
```

See the module lists under `stable_datasets/images/` and `stable_datasets/timeseries/`.
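If you'd rather enumerate the datasets programmatically than browse the source tree, a small standard-library sketch (assuming one module per dataset under each package, as in the layout above):

```python
import pkgutil

import stable_datasets.images

# Enumerate the image dataset modules shipped with the package
for module in pkgutil.iter_modules(stable_datasets.images.__path__):
    print(module.name)
```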