Releases · huggingface/datasets

15 Nov 08:19

albertvillanova

2.14.7

bf02cff

2.14.7

Bug Fixes

Fix UnboundLocalError if preprocessing returns an empty list by @cwallenwein in #6346
Fix python formatting for complex types in format_table by @mariosasko in #6368
Support pyarrow 14.0.0 by @albertvillanova in #6378
Do not try to download from HF GCS for generator by @yundai424 in #6372
Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 by @albertvillanova in #6404
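
To check whether an environment already has the PyArrow release named in #6404, a version comparison along the following lines may help. This is a minimal sketch using only the standard library. The helpers `parse_version`, `is_patched` and `installed_pyarrow_is_patched` are hypothetical names, not part of datasets or pyarrow, and the parser assumes plain numeric versions such as 14.0.1.

```python
import re
from importlib import metadata

# First PyArrow release that ships the fix for CVE-2023-47248.
PATCHED = (14, 0, 1)


def parse_version(version: str) -> tuple:
    """Turn '14.0.1' into (14, 0, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def is_patched(version: str) -> bool:
    """Return True if the given PyArrow version is 14.0.1 or newer."""
    return parse_version(version) >= PATCHED


def installed_pyarrow_is_patched():
    """Return True/False for the installed PyArrow, or None if it is not installed."""
    try:
        return is_patched(metadata.version("pyarrow"))
    except metadata.PackageNotFoundError:
        return None


print(installed_pyarrow_is_patched())
```

A result of False suggests upgrading PyArrow, which datasets 2.14.7 supports up to 14.0.1.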

New Contributors

@cwallenwein made their first contribution in #6346
@yundai424 made their first contribution in #6372

Full Changelog: 2.14.6...2.14.7

Contributors

albertvillanova, cwallenwein, and 2 other contributors

24 Oct 08:15

lhoestq

2.14.6

06c3ffb

2.14.6

What's Changed

Ignore dataset_info.json in data files resolution by @mariosasko in #6224
Check builder cls default config name in inspect by @lhoestq in #6253
Add support for fsspec>=2023.9.0 by @mariosasko in #6244
Create DefunctDatasetError by @albertvillanova in #6286
Fix get_data_patterns for directories with the word data twice by @albertvillanova in #6309
Fix loading Hub datasets with CSV metadata file by @albertvillanova in #6316
datasets.filesystems: fix is_remote_filesystems by @ap-- in #6334
Pin upper version of fsspec by @albertvillanova in #6337
Fix regex get_data_files formatting for base paths by @ZachNagengast in #6322

New Contributors

@ap-- made their first contribution in #6334
@ZachNagengast made their first contribution in #6322

Full Changelog: 2.14.5...2.14.6

Contributors

ap--, ZachNagengast, and 3 other contributors

24 Oct 08:15

albertvillanova

2.14.5

1a598a0

2.14.5

Bug fixes

Bump fsspec from 2021.11.1 to 2022.3.0 by @mariosasko in #6091
Minor fix in iter_files for hidden files by @mariosasko in #6092
Use yaml instead of get data patterns when possible by @lhoestq in #6154
Fix Parquet loading with columns by @mariosasko in #6160
Fix: Missing a MetadataConfigs init when the repo has a datasets_info.json but no README by @clefourrier in #6164
PyArrow 13 CI fixes by @mariosasko in #6175
Don't alter input in Features.from_dict by @lhoestq in #6189
Fix multiprocessing with spawn in iterable datasets by @Hubert-Bonisseur in #6165
Set minimal fsspec version requirement to 2023.1.0 by @mariosasko in #6192
Temporarily pin pandas < 2.1.0 by @albertvillanova in #6200
Preserve split order in DataFilesDict by @albertvillanova in #6198
Add missing revision argument by @qgallouedec in #6191
Temporarily pin fsspec < 2023.9.0 by @albertvillanova in #6210
Do not filter out .zip extensions from no-script datasets by @albertvillanova in #6208
Fix empty splitinfo json by @lhoestq in #6211
Fix to_json ValueError and remove pandas pin by @albertvillanova in #6201
Fix checking patterns to infer packaged builder by @polinaeterna in #6215
Rename old push_to_hub configs to "default" in dataset_infos by @lhoestq in #6218

Other improvements

Deprecate Dataset.export by @mariosasko in #6081
Deprecate download_custom by @mariosasko in #6093
Ignore CI lint rule violation in Pickler.memoize by @albertvillanova in #6138
Remove unused allowed_extensions param by @albertvillanova in #6135
Export to_iterable_dataset to document by @npuichigo in #6145
[Docs] Add description of select_columns to guide by @unifyh in #6119
Ignore parallel warning in map_nested by @lhoestq in #6148
[docs] Complete to_iterable_dataset by @stevhliu in #6158
Raise FileNotFoundError when passing data_files that don't exist by @lhoestq in #6155
Fix typo in about_mapstyle_vs_iterable.mdx by @lhoestq in #6171
Document BUILDER_CONFIG_CLASS by @lhoestq in #6166
Fix import in image_load doc by @mariosasko in #6181
Use object detection images from huggingface/documentation-images by @mariosasko in #6177
Use hf-internal-testing repos for hosting test dataset repos by @mariosasko in #6180

New Contributors

@npuichigo made their first contribution in #6145
@unifyh made their first contribution in #6119

Full Changelog: 2.14.4...2.14.5

Contributors

albertvillanova, npuichigo, and 8 other contributors

06 Sep 08:29

albertvillanova

2.13.2

98b1bdd

2.13.2

Bug fixes

Do not filter out .zip extensions from no-script datasets by @albertvillanova in #6208

Full Changelog: 2.13.1...2.13.2

Contributors

albertvillanova

08 Aug 15:52

albertvillanova

2.14.4

53d55f3

2.14.4

Bug fixes

Fix authentication issues by @albertvillanova in #6127

Full Changelog: 2.14.3...2.14.4

Contributors

albertvillanova

03 Aug 10:31

albertvillanova

2.14.3

33f736e

2.14.3

Bug fixes

Fix error when loading from GCP bucket by @albertvillanova in #6105
Fix deprecation of use_auth_token in file_utils by @albertvillanova in #6107

Full Changelog: 2.14.2...2.14.3

Contributors

albertvillanova

31 Jul 06:39

albertvillanova

2.14.2

09492ba

2.14.2

Bug fixes

Fix deprecation of use_auth_token in DownloadConfig by @albertvillanova in #6094
Fix deprecation of errors in TextConfig by @albertvillanova in #6095

Full Changelog: 2.14.1...2.14.2

Contributors

albertvillanova

27 Jul 17:09

lhoestq

2.14.1

029956a

2.14.1

Bug fixes

fix tqdm lock by @lhoestq in #6067
fix tqdm lock deletion by @lhoestq in #6068
Fix fsspec storage_options from load_dataset by @lhoestq in #6072
No gzip encoding from github by @lhoestq in #6076

Other improvements

Fix Overview.ipynb & detach Jupyter Notebooks from datasets repository by @alvarobartt in #5902
Fix Quickstart notebook link by @mariosasko in #6070
Remove README link to deprecated Colab notebook by @mariosasko in #6080
Misc doc improvements by @mariosasko in #6074

Full Changelog: 2.14.0...2.14.1

Contributors

alvarobartt, lhoestq, and mariosasko

24 Jul 15:54

lhoestq

2.14.0

88896a7

2.14.0

Important: caching

Datasets downloaded and cached with datasets>=2.14.0 may not be reloadable from the cache by older versions of datasets, which will re-download them instead.
Datasets that were already cached are still supported.
This affects datasets on the Hugging Face Hub that have no dataset script, e.g. those made only of Parquet, CSV, JSONL, or similar files.
The cause is that the default configuration name for those datasets was changed from "username--dataset_name" to "default" in #5331.
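
The rename can be illustrated with a small sketch. `default_config_name` is a hypothetical helper, and the rule that the legacy name is the repository id with "/" replaced by "--" is an assumption inferred from the "username--dataset_name" example in the note above, not a description of the library's internals.

```python
def default_config_name(repo_id: str, datasets_version: tuple) -> str:
    """Default config name for a script-less Hub dataset, per the caching note."""
    if datasets_version >= (2, 14, 0):
        return "default"
    # Assumed legacy rule: "username/dataset_name" -> "username--dataset_name"
    return repo_id.replace("/", "--")


new_name = default_config_name("username/dataset_name", (2, 14, 0))
old_name = default_config_name("username/dataset_name", (2, 13, 1))
print(new_name, old_name)
```

Because the two names differ, an older release looking for the legacy name would not find an entry cached under "default".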

Dataset Configuration

Support for multiple configs via metadata yaml info by @polinaeterna in #5331

Configure your dataset using YAML at the top of your dataset card (see the documentation)
Choose which file goes into which split

  ---
  configs:
  - config_name: default
    data_files:
    - split: train
      path: data.csv
    - split: test
      path: holdout.csv
  ---

Define multiple dataset configurations

  ---
  configs:
  - config_name: main_data
    data_files: main_data.csv
  - config_name: additional_data
    data_files: additional_data.csv
  ---
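
When a dataset has many configurations, the YAML block can be generated rather than written by hand. The following is a minimal sketch using only string formatting. `render_configs_yaml` is a hypothetical helper, not part of datasets, and it handles only the one-file-per-configuration form shown above.

```python
def render_configs_yaml(configs: dict) -> str:
    """Render {config_name: data_file} as the YAML block for a dataset card."""
    lines = ["---", "configs:"]
    for name, data_file in configs.items():
        lines.append(f"- config_name: {name}")
        lines.append(f"  data_files: {data_file}")
    lines.append("---")
    return "\n".join(lines) + "\n"


card_header = render_configs_yaml(
    {
        "main_data": "main_data.csv",
        "additional_data": "additional_data.csv",
    }
)
print(card_header)
```

The resulting text belongs at the very top of the dataset card, followed by the rest of the card's content.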

Dataset Features

Support for multiple configs via metadata yaml info by @polinaeterna in #5331

Push additional dataset configurations with push_to_hub()

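from datasets import load_dataset

# ds is an existing Dataset; push it as an additional configuration
ds.push_to_hub("username/dataset_name", config_name="additional_data")
# reload later
ds = load_dataset("username/dataset_name", "additional_data")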

Support returning dataframe in map transform by @mariosasko in #5995

What's Changed

Deprecate errors param in favor of encoding_errors in text builder by @mariosasko in #5974
Fix select_columns columns order by @lhoestq in #5994
Replace metadata utils with huggingface_hub's RepoCard API by @mariosasko in #5949
Pin joblib to avoid joblibspark test failures by @mariosasko in #6000
Align column_names type check with type hint in sort by @mariosasko in #6001
Deprecate use_auth_token in favor of token by @mariosasko in #5996
Drop Python 3.7 support by @mariosasko in #6005
Misc improvements by @mariosasko in #6004
Make IterableDataset.from_spark more efficient by @mathewjacob1002 in #5986
Fix cast for dictionaries with no keys by @mariosasko in #6009
Avoid stuck map operation when subprocesses crashes by @pappacena in #5976
Deprecate task api by @mariosasko in #5865
Add metadata ui screenshot in docs by @lhoestq in #6015
Fix ClassLabel min max check for None values by @mariosasko in #6023
[docs] Update return statement of index search by @stevhliu in #6021
Improve logging by @mariosasko in #6019
Fix style with ruff 0.0.278 by @lhoestq in #6026
Don't reference self in Spark._validate_cache_dir by @maddiedawson in #6024
Delete task_templates in IterableDataset when they are no longer valid by @mariosasko in #6027
[docs] Fix link by @stevhliu in #6029
fixed typo in comment by @NightMachinery in #6030
Fix legacy_dataset_infos by @lhoestq in #6040
Flatten repository_structure docs on yaml by @lhoestq in #6041
Use new hffs by @lhoestq in #6028
Bump dev version by @lhoestq in #6047
Fix unused DatasetInfosDict code in push_to_hub by @lhoestq in #6042
Rename "pattern" to "path" in YAML data_files configs by @lhoestq in #6044
Remove HfFileSystem and deprecate S3FileSystem by @mariosasko in #6052
Dill 3.7 support by @mariosasko in #6061
Improve Dataset.from_list docstring by @mariosasko in #6062
Check if column names match in Parquet loader only when config features are specified by @mariosasko in #6045
Release: 2.14.0 by @lhoestq in #6063
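
Several entries above deprecate one parameter in favor of another (use_auth_token in favor of token in #5996, errors in favor of encoding_errors in #5974). A common way to stage such a rename is to accept both names for a while and warn when the old one is used. The sketch below shows that general pattern only. `resolve_token` is a hypothetical function, and this is not the code from those pull requests.

```python
import warnings

_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit None


def resolve_token(token=None, use_auth_token=_UNSET):
    """Return the token to use, accepting the deprecated name with a warning."""
    if use_auth_token is not _UNSET:
        warnings.warn(
            "'use_auth_token' is deprecated; use 'token' instead.",
            FutureWarning,
            stacklevel=2,
        )
        # The new parameter wins if both were given.
        if token is None:
            token = use_auth_token
    return token


# Record warnings so the deprecated call can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = resolve_token(use_auth_token="hf_example")

print(value, [str(w.message) for w in caught])
```

Callers that already pass token see no warning, while callers still using the old name keep working and are told what to change.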

New Contributors

@mathewjacob1002 made their first contribution in #5986
@pappacena made their first contribution in #5976

Full Changelog: 2.13.1...2.14.0

Contributors

pappacena, polinaeterna, and 6 other contributors

22 Jun 18:31

lhoestq

2.13.1

682d21e

2.13.1

General improvements and bug fixes

Fix JSON generation in benchmarks CI by @mariosasko in #5966
Always return list in list_datasets by @mariosasko in #5964
Add encoding and errors params to JSON loader by @mariosasko in #5969
Filter unsupported extensions by @lhoestq in #5972

Full Changelog: 2.13.0...2.13.1

Contributors

lhoestq and mariosasko
