Releases: huggingface/datasets
2.14.7
Bug Fixes
- Fix UnboundLocalError if preprocessing returns an empty list by @cwallenwein in #6346
- Fix python formatting for complex types in format_table by @mariosasko in #6368
- Support pyarrow 14.0.0 by @albertvillanova in #6378
- Do not try to download from HF GCS for generator by @yundai424 in #6372
- Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 by @albertvillanova in #6404
New Contributors
- @cwallenwein made their first contribution in #6346
- @yundai424 made their first contribution in #6372
Full Changelog: 2.14.6...2.14.7
2.14.6
What's Changed
- Ignore dataset_info.json in data files resolution by @mariosasko in #6224
- Check builder cls default config name in inspect by @lhoestq in #6253
- Add support for fsspec>=2023.9.0 by @mariosasko in #6244
- Create DefunctDatasetError by @albertvillanova in #6286
- Fix get_data_patterns for directories with the word data twice by @albertvillanova in #6309
- Fix loading Hub datasets with CSV metadata file by @albertvillanova in #6316
- datasets.filesystems: fix is_remote_filesystems by @ap-- in #6334
- Pin upper version of fsspec by @albertvillanova in #6337
- Fix regex get_data_files formatting for base paths by @ZachNagengast in #6322
New Contributors
- @ap-- made their first contribution in #6334
- @ZachNagengast made their first contribution in #6322
Full Changelog: 2.14.5...2.14.6
2.14.5
Bug fixes
- Bump fsspec from 2021.11.1 to 2022.3.0 by @mariosasko in #6091
- Minor fix in iter_files for hidden files by @mariosasko in #6092
- Use yaml instead of get data patterns when possible by @lhoestq in #6154
- Fix Parquet loading with columns by @mariosasko in #6160
- Fix: Missing a MetadataConfigs init when the repo has a datasets_info.json but no README by @clefourrier in #6164
- PyArrow 13 CI fixes by @mariosasko in #6175
- Don't alter input in Features.from_dict by @lhoestq in #6189
- Fix multiprocessing with spawn in iterable datasets by @Hubert-Bonisseur in #6165
- Set minimal fsspec version requirement to 2023.1.0 by @mariosasko in #6192
- Temporarily pin pandas < 2.1.0 by @albertvillanova in #6200
- Preserve split order in DataFilesDict by @albertvillanova in #6198
- Add missing revision argument by @qgallouedec in #6191
- Temporarily pin fsspec < 2023.9.0 by @albertvillanova in #6210
- Do not filter out .zip extensions from no-script datasets by @albertvillanova in #6208
- Fix empty splitinfo json by @lhoestq in #6211
- Fix to_json ValueError and remove pandas pin by @albertvillanova in #6201
- Fix checking patterns to infer packaged builder by @polinaeterna in #6215
- Rename old push_to_hub configs to "default" in dataset_infos by @lhoestq in #6218
Other improvements
- Deprecate Dataset.export by @mariosasko in #6081
- Deprecate download_custom by @mariosasko in #6093
- Ignore CI lint rule violation in Pickler.memoize by @albertvillanova in #6138
- Remove unused allowed_extensions param by @albertvillanova in #6135
- Export to_iterable_dataset to document by @npuichigo in #6145
- [Docs] Add description of select_columns to guide by @unifyh in #6119
- Ignore parallel warning in map_nested by @lhoestq in #6148
- [docs] Complete to_iterable_dataset by @stevhliu in #6158
- Raise FileNotFoundError when passing data_files that don't exist by @lhoestq in #6155
- Fix typo in about_mapstyle_vs_iterable.mdx by @lhoestq in #6171
- Document BUILDER_CONFIG_CLASS by @lhoestq in #6166
- Fix import in image_load doc by @mariosasko in #6181
- Use object detection images from huggingface/documentation-images by @mariosasko in #6177
- Use hf-internal-testing repos for hosting test dataset repos by @mariosasko in #6180
New Contributors
- @npuichigo made their first contribution in #6145
- @unifyh made their first contribution in #6119
Full Changelog: 2.14.4...2.14.5
2.13.2
Bug fixes
- Do not filter out .zip extensions from no-script datasets by @albertvillanova in #6208
Full Changelog: 2.13.1...2.13.2
2.14.4
2.14.3
Bug fixes
- Fix error when loading from GCP bucket by @albertvillanova in #6105
- Fix deprecation of use_auth_token in file_utils by @albertvillanova in #6107
Full Changelog: 2.14.2...2.14.3
2.14.2
Bug fixes
- Fix deprecation of use_auth_token in DownloadConfig by @albertvillanova in #6094
- Fix deprecation of errors in TextConfig by @albertvillanova in #6095
Full Changelog: 2.14.1...2.14.2
2.14.1
Bug fixes
- fix tqdm lock by @lhoestq in #6067
- fix tqdm lock deletion by @lhoestq in #6068
- Fix fsspec storage_options from load_dataset by @lhoestq in #6072
- No gzip encoding from github by @lhoestq in #6076
Other improvements
- Fix Overview.ipynb & detach Jupyter Notebooks from datasets repository by @alvarobartt in #5902
- Fix Quickstart notebook link by @mariosasko in #6070
- Remove README link to deprecated Colab notebook by @mariosasko in #6080
- Misc doc improvements by @mariosasko in #6074
Full Changelog: 2.14.0...2.14.1
2.14.0
Important: caching
- Datasets downloaded and cached using datasets>=2.14.0 may not be reloaded from cache using older versions of datasets (and are therefore re-downloaded).
- Datasets that were already cached are still supported.
- This affects datasets on Hugging Face without dataset scripts, e.g. made of pure parquet, csv, jsonl, etc. files.
- This is because the default configuration name for those datasets has been fixed (from "username--dataset_name" to "default") in #5331.
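To make the renaming concrete, here is a minimal sketch of the two naming schemes mentioned in the note above. The helper names are hypothetical and are not part of the datasets API; the only facts taken from the notes are the old "username--dataset_name" form and the new "default" name. It shows why a cache entry written under one scheme is not found by a version that looks for the other.

```python
def legacy_default_config_name(repo_id: str) -> str:
    """Old scheme described in the note: 'username/dataset_name' -> 'username--dataset_name'.

    Hypothetical helper for illustration only, not the library's implementation.
    """
    return repo_id.replace("/", "--")


def new_default_config_name(repo_id: str) -> str:
    """New scheme from 2.14.0 on: the default configuration is always named 'default'."""
    return "default"


def same_config_name(repo_id: str) -> bool:
    """True if both schemes would look up the same configuration name."""
    return legacy_default_config_name(repo_id) == new_default_config_name(repo_id)
```

For a repository such as "username/dataset_name" the two names differ, which is consistent with older versions re-downloading data cached by datasets>=2.14.0.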
Dataset Configuration
- Support for multiple configs via metadata yaml info by @polinaeterna in #5331
  - Configure your dataset using YAML at the top of your dataset card
  - Choose which file goes into which split

        ---
        configs:
        - config_name: default
          data_files:
          - split: train
            path: data.csv
          - split: test
            path: holdout.csv
        ---

  - Define multiple dataset configurations

        ---
        configs:
        - config_name: main_data
          data_files: main_data.csv
        - config_name: additional_data
          data_files: additional_data.csv
        ---
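As a rough illustration of how the YAML above maps configurations to splits and files, the sketch below writes the two examples as plain Python data and resolves one configuration at a time. The function resolve_data_files is hypothetical and is not the library's loader. Placing a bare data_files path in a "train" split is an assumption made for this sketch.

```python
from typing import Dict, List

# The two YAML examples above, written out as already-parsed Python data.
SPLIT_CONFIGS = [
    {
        "config_name": "default",
        "data_files": [
            {"split": "train", "path": "data.csv"},
            {"split": "test", "path": "holdout.csv"},
        ],
    }
]
MULTI_CONFIGS = [
    {"config_name": "main_data", "data_files": "main_data.csv"},
    {"config_name": "additional_data", "data_files": "additional_data.csv"},
]


def resolve_data_files(configs: list, config_name: str) -> Dict[str, List[str]]:
    """Return a {split: [paths]} mapping for one configuration (hypothetical helper)."""
    for config in configs:
        if config["config_name"] != config_name:
            continue
        data_files = config["data_files"]
        if isinstance(data_files, str):
            # Assumption for this sketch: a bare path goes into a single "train" split.
            return {"train": [data_files]}
        resolved: Dict[str, List[str]] = {}
        for entry in data_files:
            resolved.setdefault(entry["split"], []).append(entry["path"])
        return resolved
    raise KeyError(f"Unknown config_name: {config_name!r}")
```

Calling resolve_data_files(SPLIT_CONFIGS, "default") yields one entry per split, while each configuration in MULTI_CONFIGS resolves to its own single file.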
Dataset Features
- Support for multiple configs via metadata yaml info by @polinaeterna in #5331
  - push_to_hub() additional dataset configurations

        ds.push_to_hub("username/dataset_name", config_name="additional_data")
        # reload later
        ds = load_dataset("username/dataset_name", "additional_data")

- Support returning dataframe in map transform by @mariosasko in #5995
What's Changed
- Deprecate errors param in favor of encoding_errors in text builder by @mariosasko in #5974
- Fix select_columns columns order by @lhoestq in #5994
- Replace metadata utils with huggingface_hub's RepoCard API by @mariosasko in #5949
- Pin joblib to avoid joblibspark test failures by @mariosasko in #6000
- Align column_names type check with type hint in sort by @mariosasko in #6001
- Deprecate use_auth_token in favor of token by @mariosasko in #5996
- Drop Python 3.7 support by @mariosasko in #6005
- Misc improvements by @mariosasko in #6004
- Make IterableDataset.from_spark more efficient by @mathewjacob1002 in #5986
- Fix cast for dictionaries with no keys by @mariosasko in #6009
- Avoid stuck map operation when subprocesses crashes by @pappacena in #5976
- Deprecate task api by @mariosasko in #5865
- Add metadata ui screenshot in docs by @lhoestq in #6015
- Fix ClassLabel min max check for None values by @mariosasko in #6023
- [docs] Update return statement of index search by @stevhliu in #6021
- Improve logging by @mariosasko in #6019
- Fix style with ruff 0.0.278 by @lhoestq in #6026
- Don't reference self in Spark._validate_cache_dir by @maddiedawson in #6024
- Delete task_templates in IterableDataset when they are no longer valid by @mariosasko in #6027
- [docs] Fix link by @stevhliu in #6029
- fixed typo in comment by @NightMachinery in #6030
- Fix legacy_dataset_infos by @lhoestq in #6040
- Flatten repository_structure docs on yaml by @lhoestq in #6041
- Use new hffs by @lhoestq in #6028
- Bump dev version by @lhoestq in #6047
- Fix unused DatasetInfosDict code in push_to_hub by @lhoestq in #6042
- Rename "pattern" to "path" in YAML data_files configs by @lhoestq in #6044
- Remove HfFileSystem and deprecate S3FileSystem by @mariosasko in #6052
- Dill 3.7 support by @mariosasko in #6061
- Improve Dataset.from_list docstring by @mariosasko in #6062
- Check if column names match in Parquet loader only when config features are specified by @mariosasko in #6045
- Release: 2.14.0 by @lhoestq in #6063
New Contributors
- @mathewjacob1002 made their first contribution in #5986
- @pappacena made their first contribution in #5976
Full Changelog: 2.13.1...2.14.0
2.13.1
General improvements and bug fixes
- Fix JSON generation in benchmarks CI by @mariosasko in #5966
- Always return list in list_datasets by @mariosasko in #5964
- Add encoding and errors params to JSON loader by @mariosasko in #5969
- Filter unsupported extensions by @lhoestq in #5972
Full Changelog: 2.13.0...2.13.1