Releases: huggingface/datasets

2.20.0

13 Jun 14:57
98fdc9e

Important

  • Remove default trust_remote_code=True by @lhoestq in #6954
    • Datasets with a Python loading script now require passing trust_remote_code=True to be loaded

Datasets features

  • [Resumable IterableDataset] Add IterableDataset state_dict by @lhoestq in #6658
    • Checkpoint and resume an iterable dataset (e.g. when streaming):

      >>> from datasets import Dataset
      >>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
      >>> for idx, example in enumerate(iterable_dataset):
      ...     print(example)
      ...     if idx == 2:
      ...         state_dict = iterable_dataset.state_dict()
      ...         print("checkpoint")
      ...         break
      >>> iterable_dataset.load_state_dict(state_dict)
      >>> print("restart from checkpoint")
      >>> for example in iterable_dataset:
      ...     print(example)

      Output:

      {'a': 0}
      {'a': 1}
      {'a': 2}
      checkpoint
      restart from checkpoint
      {'a': 3}
      {'a': 4}
      {'a': 5}
      

General improvements and bug fixes

New Contributors

Full Changelog: 2.19.0...2.20.0

2.19.2

03 Jun 05:26

Bug fixes

  • Make CLI convert_to_parquet not raise error if no rights to create script branch by @albertvillanova in #6902
  • Require Pillow >= 9.4.0 to avoid AttributeError when loading image dataset by @albertvillanova in #6883
  • Update requests >=2.32.1 to fix vulnerability by @albertvillanova in #6909
  • Fix NonMatchingSplitsSizesError/ExpectedMoreSplits when passing data_dir/data_files in no-code Hub datasets by @albertvillanova in #6925

Full Changelog: 2.19.1...2.19.2

2.19.1

06 May 09:40
bb2664c

Bug fixes

Full Changelog: 2.19.0...2.19.1

2.19.0

19 Apr 08:46
0d3c746

Dataset Features

  • Add Polars compatibility by @psmyth94 in #6531
    • Convert to a Polars dataframe using .to_polars():
      import polars as pl
      from datasets import load_dataset
      ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
      ds.to_polars() \
          .group_by("topic") \
          .agg(pl.len(), pl.first()) \
          .sort("len", descending=True)
    • Use Polars formatting to return Polars objects when accessing a dataset:
      ds = ds.with_format("polars")
      ds[:10].group_by("kind").len()
  • Add fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in #6096
    • Save to the Hugging Face Hub as JSON, CSV, or Parquet:
      ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
      ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
      ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
  • Add mode parameter to Image feature by @mariosasko in #6735
    • Set images to be read in a certain mode like "RGB":
      from datasets import Image
      dataset = dataset.cast_column("image", Image(mode="RGB"))
  • Add CLI function to convert script-dataset to Parquet by @albertvillanova in #6795
    • Run this command to open a PR on a script-based dataset repository that converts it to Parquet:
      datasets-cli convert_to_parquet <dataset_id>
      
  • Add Dataset.take and Dataset.skip by @lhoestq in #6813
    • Same as IterableDataset.take and IterableDataset.skip:
      ds = ds.take(10)  # take only the first 10 examples

General improvements and bug fixes

New Contributors

Full Changelog: 2.18.0...2.19.0

2.18.0

01 Mar 21:00
ca8409a

Dataset features

  • Make JSON builder support an array of strings by @albertvillanova in #6696
  • Base parquet batch_size on parquet row group size by @lhoestq in #6701
    • Faster cold start for streaming
  • Change default compression argument for JsonDatasetWriter by @Rexhaif in #6659
  • Automatic Conversion for uint16/uint32 to Compatible PyTorch Dtypes by @mohalisad in #6660
  • fsspec: support fsspec>=2023.12.0 glob changes by @pmrowla in #6687
    • Support latest fsspec up to 2024.2.0

General improvements and bug fixes

New Contributors

Full Changelog: 2.17.1...2.18.0

2.17.1

19 Feb 09:58
5d22682

Bug Fixes

Full Changelog: 2.17.0...2.17.1

2.17.0

09 Feb 10:09
7063357

Dataset Features

General improvements and bug fixes

New Contributors

Full Changelog: 2.16.1...2.17.0

2.16.1

30 Dec 16:46
7b2bcd7
Compare
Choose a tag to compare

Bug fixes

  • Fix dl_manager.extract returning FileNotFoundError by @lhoestq in #6543
    • Fix bug causing FileNotFoundError when passing a relative directory as cache_dir to load_dataset
  • Fix custom configs from script by @lhoestq in #6544
    • Fix bug where loading a dataset with a loading script using custom arguments would fail
    • e.g. load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")

Full Changelog: 2.16.0...2.16.1

2.16.0

22 Dec 14:21
a85fb52

Security features

  • Add trust_remote_code argument by @lhoestq in #6429
    • Some Hugging Face datasets contain custom code which must be executed to correctly load the dataset. The code can be inspected in the repository content at https://hf.co/datasets/<repo_id>. A warning is shown to let the user know about the custom code, and they can avoid this message in future by passing the argument trust_remote_code=True.
    • Passing trust_remote_code=True will be mandatory to load these datasets from the next major release of datasets.
    • Setting the environment variable HF_DATASETS_TRUST_REMOTE_CODE=0 already disables custom code by default, without waiting for the next release of datasets.
  • Use parquet export if possible by @lhoestq in #6448
    • This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
    • You can see a dataset's Parquet export at https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
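
The two settings above can be sketched without touching the network. The snippet sets the environment variable named in the release note and builds the Parquet export URL for a repository. The helper name parquet_export_url and the repository id are hypothetical, used only for illustration, and setting the variable before importing datasets is an assumption about when the library reads its configuration.

```python
import os
from urllib.parse import quote

# Disable custom loading code by default, as described in the release note.
# Assumption: set this before importing datasets so the library sees it.
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "0"


def parquet_export_url(repo_id: str) -> str:
    """Build the URL of a dataset's Parquet export branch (hypothetical helper)."""
    # The export lives on the refs/convert/parquet revision; the slashes in
    # the revision name are percent-encoded in the URL.
    revision = quote("refs/convert/parquet", safe="")
    return f"https://hf.co/datasets/{repo_id}/tree/{revision}"


url = parquet_export_url("username/my_dataset")
print(url)
```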

Features

  • Webdataset dataset builder by @lhoestq in #6391
  • Implement get dataset default config name by @albertvillanova in #6511
  • Lazy data files resolution and offline cache reload by @lhoestq in #6493
    • This speeds up the load_dataset step that lists the data files of big repositories (up to 100x), but requires huggingface_hub 0.20 or newer
    • Fix load_dataset reloading stale data from the cache even after the dataset was updated on Hugging Face
    • Reload a dataset from your cache even if you don't have an internet connection
    • New cache directory scheme for no-script datasets: ~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha
    • Backward compatibility: cached datasets from datasets 2.15 (using the old scheme) are still reloaded from cache
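
To make the new cache directory scheme concrete, here is a minimal sketch that composes a path from its parts. The function expected_cache_dir is a hypothetical helper, not part of the datasets API, and the config name, version, and commit SHA passed to it are placeholders.

```python
from pathlib import Path


def expected_cache_dir(repo_id, config_name, version, commit_sha, root=None):
    """Compose a cache path following the scheme described above (illustrative only)."""
    if root is None:
        root = Path.home() / ".cache" / "huggingface" / "datasets"
    # "username/dataset_name" becomes "username___dataset_name".
    repo_dir = repo_id.replace("/", "___")
    return Path(root) / repo_dir / config_name / version / commit_sha


path = expected_cache_dir(
    "username/dataset_name", "default", "0.0.0", "abc123", root="/tmp/hf_cache"
)
print(path.as_posix())
```

Because the commit SHA is the last path component, each revision of a dataset gets its own directory under this scheme.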

General improvements and bug fixes

New Contributors

Full Changelog: 2.15.0...2.16.0

2.15.0

16 Nov 08:06
0caf912

What's Changed

New Contributors

Full Changelog: 2.14.7...2.15.0