Releases · huggingface/datasets

26 Nov 16:58

lhoestq

1.16.1

acca8f4

1.16.1

Bug fixes

Fix import datasets on python 3.10 by @lhoestq in #3326
Fix wrongly converted assert by @eliasws in #3323

Contributors

eliasws and lhoestq

26 Nov 14:22

lhoestq

1.16.0

d50f5f9

1.16.0

Datasets Changes

New: riddle_sense by @ziyiwu9494 in #3161
New: Multi-Lingual LibriSpeech by @patrickvonplaten in #3198
New: XCSR by @yangxqiao in #3074
New: CMU Hinglish DoG by @Ishan-Kumar2 in #3149
New: Multidoc2dial by @sivasankalpp in #3205
New: IndoNLI by @afaji in #3307
Update: DaNE - updated URL for download by @MalteHB in #3203
Update: xcopa - (fix checksum issues + add translated data) by @mariosasko in #3254
Update: tatoeba - update to v2021-07-22 by @KoichiYasuoka in #3225
Update: KILT - update metadata JSON by @albertvillanova in #3276
Update: Covost 2 - update download instructions by @patrickvonplaten in #3281
Update: Common Voice, OpenSLR, LibriSpeech ASR, Vivos - make several audio datasets streamable by @lhoestq in #3290
Fix: tuple_ie - fix download url by @mariosasko in #3213
Fix: id_newspapers_2018 - fix streaming by @lhoestq in #3249
Fix: bookcorpusopen - fix RAM usage by @lhoestq in #3280
Fix: Scielo - fix ConnectionError by @mariosasko in #3260
Fix: tatoeba - fix URLs for a subset of xtreme by @mariosasko in #3321

Datasets Features

Push to hub capabilities for Dataset and DatasetDict by @LysandreJik in #3098:
- Upload your dataset to the Hugging Face Hub with the push_to_hub() method!
- See documentation here
200+ datasets now support streaming:
- Stream TAR-based dataset using iter_archive by @lhoestq in #3110
- Stream from Google Drive and other hosts by @lhoestq in #3248
- Support Audio feature in streaming mode by @albertvillanova in #3133
- Support Audio feature for TAR archives in sequential access by @albertvillanova in #3129
Resolve data_files by split name automatically by @lhoestq in #3221
- It takes into account the file names to know which file goes into which split
- See documentation here
Filter method for batched=True by @thomasw21 in #3244
Adding with_rank arg to pass process rank to map by @TevenLeScao in #3314
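
The automatic split resolution listed above ("it takes into account the file names to know which file goes into which split") can be pictured with a small standard-library sketch. It is an illustration only, not the library's implementation: the keyword table, the fallback to "train" for unmatched files, and the name infer_splits are all assumptions made for this example.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical keyword table; the library's real matching rules may differ.
SPLIT_KEYWORDS = {
    "train": ("train", "training"),
    "validation": ("validation", "valid", "val", "dev"),
    "test": ("test", "testing"),
}


def infer_splits(file_names):
    """Group file names by the split keyword found in their base name."""
    splits = defaultdict(list)
    for name in file_names:
        # Tokenize the lower-cased stem on any non-alphanumeric character.
        stem = PurePosixPath(name).stem.lower()
        tokens = re.split(r"[^a-z0-9]+", stem)
        for split, keywords in SPLIT_KEYWORDS.items():
            if any(token in keywords for token in tokens):
                splits[split].append(name)
                break
        else:
            # Assumption for this sketch: unmatched files go to "train".
            splits["train"].append(name)
    return dict(splits)


if __name__ == "__main__":
    print(infer_splits(["data/train.csv", "data/dev.csv", "data/test-00.csv"]))
```

Matching on whole tokens rather than substrings keeps a file such as "contest.csv" from being mistaken for a test split.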

Dataset Cards

Add full tagset to conll2003 README by @BramVanroy in #3230
Fix some contact information formats by @lhoestq in #3274
Add wikipedia tags by @lhoestq in #3301
Updating details of IRC disentanglement data by @jkkummerfeld in #3259

Metrics Changes

New: OpenAI's pass@k code evaluation metric by @lvwerra in #2916
Update: BLEURT - options to use updated bleurt checkpoints by @jaehlee in #3235
Update: CER - update to support latest release by @mariosasko in #3252
Update: WER - update to the documentation by @wooters in #3278
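
For readers unfamiliar with pass@k, the quantity can be sketched as below. This assumes the metric follows the unbiased estimator from the paper that introduced HumanEval, pass@k = 1 - C(n - c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. The function name pass_at_k is chosen for this example and is not taken from the metric's source.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for a problem
    c: number of those samples that pass the tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
```

With k = 1 the estimator reduces to the plain pass rate c / n.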

Documentation

Add docs for to_tf_dataset by @stevhliu in #3175
Small updates to to_tf_dataset documentation by @Rocketknight1 in #3215
Update link to Datasets Tagging app in Spaces by @albertvillanova in #3194
Improve repository structure docs by @lhoestq in #3233
Swap descriptions of v1 and raw-v1 configs of WikiText dataset and fix metadata by @albertvillanova in #3241
Add docs for audio processing by @stevhliu in #3222
Add push_to_hub docs by @lhoestq in #3319

Additional improvements and bug fixes

Catch token invalid error in CI by @lhoestq in #3200
Pin keras version until TF fixes its release by @albertvillanova in #3208
Fix disable_nullable default value to False by @lhoestq in #3211
Fix code quality in riddle_sense dataset by @albertvillanova in #3218
Better error msg if len(predictions) doesn't match len(references) in metrics by @mariosasko in #3160
Use huggingface_hub.HfApi to list datasets/metrics by @mariosasko in #3121
Pin version exclusion for tensorflow incompatible with keras by @albertvillanova in #3216
Group tests in multiprocessing workers by test file by @albertvillanova in #3231
Fix load_from_disk temporary directory by @lhoestq in #3245
[tiny] fix typo in stream docs by @nollied in #3246
Avoid PyArrow type optimization if it fails by @mariosasko in #3234
Remove redundant isort module placement by @mariosasko in #3243
asserts replaced by exception for text classification task with test. by @manisnesan in #3256
Add os.listdir for streaming by @lhoestq in #3270
asserts replaced with exception for image classification task, csv, json by @manisnesan in #3262
Force data files extraction if download_mode='force_redownload' by @mariosasko in #3275
Minor Typo Fix - Precision to Recall by @SebastinSanty in #3279
Decode audio from remote by @lhoestq in #3271
Fix build_docs CI by @lhoestq in #3286
Allow datasets with indices table when concatenating along axis=1 by @mariosasko in #3288
f-string formatting by @Mehdi2402 in #3277
Unpin markdown for build_docs now that it's fixed by @lhoestq in #3289
Pin version exclusion for Markdown by @albertvillanova in #3293
Use f-strings in the dataset scripts by @Carlosbogo in #3291
fix old_val typo in f-string by @Mehdi2402 in #3302
asserts replaced with exception for fingerprint.py, search.py, arrow_writer.py and metric.py by @Ishan-Kumar2 in #3305
fix: files counted twice in inferred structure by @borisdayma in #3309
Finish transition to PyArrow 3.0.0 by @mariosasko in #3318
Removing query params for dynamic URL caching by @anton-l in #3315

Citation

Update BibTeX entry by @albertvillanova in #3223
Fix paper BibTeX citation with proceedings reference by @albertvillanova in #3226
Add CITATION file by @albertvillanova in #3228
Fix URL in CITATION file by @albertvillanova in #3229

Deprecations

Deprecate prepare_module by @albertvillanova in #3166

Full Changelog: 1.15.1...1.16.0

Contributors

manisnesan, borisdayma, and 26 other contributors

02 Nov 21:47

lhoestq

1.15.1

0181006

1.15.1

Dependencies

Bump huggingface_hub to 0.1.0 by @lhoestq in #3199

Contributors

lhoestq

02 Nov 21:22

lhoestq

1.15.0

dcaa3c0

1.15.0

Dataset Changes

Update: JNLPBA - add tag names by @bhavitvyamalik in #3092
Update: OpenSLR - add SLR83 to OpenSLR by @tyrius02 in #3125 and #3176
Update: RONEC - update to v2 by @dumitrescustefan in #3184
Fix: Arabic Billion Words - Fix script to return all data by @albertvillanova in #3136
Fix: HLGD - fix label mapping by @VictorSanh in #3180

Dataset Features

Allow dynamic first dimension for ArrayXD by @rpowalski in #2891
add multi-proc in to_csv by @bhavitvyamalik in #2896
QOL improvements: auto-flatten_indices and desc in map calls by @mariosasko in #3196

Dataset Cards

Fill in dataset card for NCBI disease dataset by @edugp in #3115

Metrics Changes

New: metric for the MATH dataset (competition_math). by @hacobe in #3020
New: Google BLEU (aka GLEU) metric by @slowwavesleep in #3108
New: TER by @BramVanroy in #3153
New: ChrF(++) by @BramVanroy in #3187

General improvements and bug fixes

Correctly update metadata to preserve features when concatenating datasets with axis=1 by @mariosasko in #3120
Fixes to to_tf_dataset by @Rocketknight1 in #3085
Add security policy to the project by @albertvillanova in #2958
Update doc links to point to new docs by @mariosasko in #3116
Fix caching bugs by @mariosasko in #3141
Fix numpy deprecation warning for ragged tensors by @lhoestq in #3137
Fixed: duplicate parameter and missing parameter in docstring by @PanQiWei in #3157
Fix some typos in the documentation by @h4iku in #3152
Fix string encoding for Value type by @lhoestq in #3158
Fix CLI test to ignore verifications when saving infos by @albertvillanova in #3147
Make inspect.get_dataset_config_names always return a non-empty list by @albertvillanova in #3159
Fix issue with filelock filename being too long on encrypted filesystems by @mariosasko in #3173
Asserts replaced by exceptions (#3171) by @joseporiolayats in #3174
Preserve ordering in zip_dict by @mariosasko in #3170
Don't memoize strings when hashing since two identical strings may have different python ids by @lhoestq in #3182
Re-add faiss to windows testing suite by @BramVanroy in #3151
Add missing docstring to DownloadConfig by @mariosasko in #3183
More efficient nested features encoding by @eladsegal in #3124
Fix optimized encoding for arrays by @lhoestq in #3197

Contributors

BramVanroy, h4iku, and 15 other contributors

19 Oct 16:46

albertvillanova

1.14.0

ec82422

1.14.0

Dataset changes

Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)
Update: SUPERB - use Audio features #3101 (@anton-l)
Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)

Dataset features

Add iter_archive #3066 (@lhoestq)
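
The idea behind iterating over an archive sequentially, which is what makes TAR-based datasets streamable, can be sketched with the standard library alone. This is not the library's iter_archive code: the names iter_archive_sketch and make_demo_tar are hypothetical, and the sketch yields raw bytes for simplicity.

```python
import io
import tarfile


def iter_archive_sketch(fileobj):
    """Yield (path, content) pairs from a TAR stream without seeking."""
    # Mode "r|*" reads the archive as a forward-only stream, so each member
    # has to be consumed before moving on to the next one.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            extracted = tar.extractfile(member)
            if extracted is None:
                continue
            yield member.name, extracted.read()


def make_demo_tar():
    """Build a tiny in-memory TAR archive for the demonstration."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [("a.txt", b"alpha"), ("b.txt", b"beta")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for path, content in iter_archive_sketch(make_demo_tar()):
        print(path, content)
```

Because the stream is never rewound, the same loop works on a file object backed by a network response, which is why sequential access matters for streaming.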

General improvements and bug fixes

Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)
Fix project description in PyPI #3103 (@albertvillanova)
Align tqdm control with cache control #3031 (@mariosasko)
Add paper BibTeX citation #3107 (@albertvillanova)

Contributors

iliaschalkidis, albertvillanova, and 3 other contributors

15 Oct 15:50

albertvillanova

1.13.3

10dc68c

1.13.3

Dataset changes

Update: Adapt all audio datasets #3081 (@patrickvonplaten)

Bug fixes

Update BibTeX entry #3090 (@albertvillanova)
Use template column_mapping to transmit_format instead of template features #3088 (@mariosasko)
Fix Audio feature mp3 resampling #3096 (@albertvillanova)

Contributors

albertvillanova, patrickvonplaten, and mariosasko

14 Oct 16:02

albertvillanova

1.13.2

e82164f

1.13.2

Bug fixes

Fix error related to huggingface_hub timeout parameter #3082 (@albertvillanova)
Remove _resampler from Audio fields #3086 (@albertvillanova)

Contributors

albertvillanova

14 Oct 12:50

albertvillanova

1.13.1

2ed762b

1.13.1

Bug fixes

Fix loading a metric with internal import #3077 (@albertvillanova)

Contributors

albertvillanova

13 Oct 15:15

lhoestq

1.13.0

38ec259

1.13.0

Dataset changes

New: CaSiNo #2867 (@kushalchawla)
New: Mostly Basic Python Problems #2893 (@lvwerra)
New: OpenAI's HumanEval #2897 (@lvwerra)
New: SemEval-2018 Task 1: Affect in Tweets #2745 (@maxpel)
New: SEDE #2942 (@Hazoom)
New: Jigsaw unintended Bias #2935 (@Iwontbecreative)
New: AMI #2853 (@cahya-wirawan)
New: Math Aptitude Test of Heuristics #2982 #3014 (@hacobe, @albertvillanova)
New: SwissJudgmentPrediction #2983 (@JoelNiklaus)
New: KanHope #2985 (@adeepH)
New: CommonLanguage #2989 #3006 #3003 (@anton-l, @albertvillanova, @jimregan)
New: SwedMedNER #2940 (@bwang482)
New: SberQuAD #3039 (@Alenush)
New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English #3004 (@iliaschalkidis)
New: Greek Legal Code #2966 (@christospi)
New: Story Cloze Test #3067 (@zaidalyafeai)
Update: SUPERB - add IC, SI, ER tasks #2884 #3009 (@anton-l, @albertvillanova)
Update: MENYO-20k - repo has moved, updating URL #2939 (@cdleong)
Update: TriviaQA - add web and wiki config #2949 (@shirte)
Update: nq_open - Use standard open-domain validation split #3029 (@craffel)
Update: MeDAL - Add further description and update download URL #3022 (@xhlulu)
Update: Biosses - fix column names #3054 (@bwang482)
Fix: scitldr - fix minor URL format #2948 (@albertvillanova)
Fix: masakhaner - update JSON metadata #2973 (@albertvillanova)
Fix: TriviaQA - fix unfiltered subset #2995 (@lhoestq)
Fix: TriviaQA - set writer batch size #2999 (@lhoestq)
Fix: LJ Speech - fix Windows paths #3016 (@albertvillanova)
Fix: MedDialog - update metadata JSON #3046 (@albertvillanova)

Metric changes

Update: meteor - update from nltk update #2946 (@lhoestq)
Update: accuracy, f1, glue, indic-glue, pearsonr, precision, recall, super_glue - Replace item with float in metrics #3012 #3001 (@albertvillanova, @mariosasko)
Fix: f1/precision/recall metrics with None average #3008 #2992 (@albertvillanova)
Fix: meteor - fix metric for nltk version >= 3.6.4 #3056 (@albertvillanova)

Dataset features

Use with TensorFlow:
- Adding to_tf_dataset method #2731 #2931 #2951 #2974 (@Rocketknight1)
Better support for ZIP files:
- Support loading dataset from multiple zipped CSV data files #3021 (@albertvillanova)
- Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041 (@lhoestq)
Streaming improvements:
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Add remove_columns to IterableDataset #3030 (@cccntu)
- All the above ZIP features also work in streaming mode
New utilities:
- Add get_dataset_split_names() to get a dataset config's split names #2906 (@severo)
Replace script_version with revision #2933 (@albertvillanova)
- The script_version parameter in load_dataset is now deprecated, in favor of revision
Experimental - Create Audio feature type #2324 (@albertvillanova):
- It automatically decodes audio data (mp3, wav, flac, etc.) when examples are accessed
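
The script_version deprecation in favor of revision, listed above, follows a common pattern: keep accepting the old keyword for a while, warn, and forward its value to the new one. The sketch below shows that general pattern only. The function resolve_revision, the sentinel, the warning category, and the precedence rule when both arguments are given are assumptions for illustration, not the library's code.

```python
import warnings

# Sentinel that distinguishes "argument not passed" from an explicit None.
_UNSET = object()


def resolve_revision(revision=None, script_version=_UNSET):
    """Return the revision to use, accepting the deprecated keyword too."""
    if script_version is not _UNSET:
        warnings.warn(
            "'script_version' is deprecated; use 'revision' instead.",
            FutureWarning,
            stacklevel=2,
        )
        # Assumption for this sketch: an explicit 'revision' takes precedence.
        if revision is None:
            revision = script_version
    return revision


if __name__ == "__main__":
    print(resolve_revision(revision="main"))
    print(resolve_revision(script_version="1.0.0"))
```

Callers that already pass revision see no warning, while older call sites keep working and are told what to change.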

Dataset cards

Add arXiv paper in swiss_judgment_prediction dataset card #3026 (@JoelNiklaus)

Documentation

Add tutorial for no-code dataset upload #2925 (@stevhliu)

General improvements and bug fixes

Fix filter leaking #3019 (@lhoestq)
- Calling filter several times in a row did not return the right results in 1.12.0 and 1.12.1
Update BibTeX entry #2928 (@albertvillanova)
Fix exception chaining #2911 (@albertvillanova)
Add regression test for null Sequence #2929 (@albertvillanova)
Don't use old, incompatible cache for the new filter #2947 (@lhoestq)
Fix fn kwargs in filter #2950 (@lhoestq)
Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast #2895 (@arsarabi)
Check that array is not Float as nan != nan #2936 (@Iwontbecreative)
Fix missing conda deps #2952 (@lhoestq)
Update legacy Python image for CI tests in Linux #2955 (@albertvillanova)
Support pandas 1.3 new read_csv parameters #2960 (@SBrandeis)
Fix CI doc build #2961 (@albertvillanova)
Run tests in parallel #2954 (@albertvillanova)
Ignore dummy folder and dataset_infos.json #2975 (@Ishan-Kumar2)
Take namespace into account in caching #2938 (@lhoestq)
Make Dataset.map accept list of np.array #2990 (@albertvillanova)
Fix loading compressed CSV without streaming #2994 (@albertvillanova)
Fix json loader when conversion not implemented #3000 (@lhoestq)
Remove all query parameters when extracting protocol #2996 (@albertvillanova)
Correct a typo #3007 (@Yann21)
Fix Windows test suite #3025 (@albertvillanova)
Remove unused parameter in xdirname #3017 (@albertvillanova)
Properly install ruamel-yaml for windows CI #3028 (@lhoestq)
Fix typo #3023 (@qqaatw)
Actual "proper" install of ruamel.yaml in the windows CI #3033 (@lhoestq)
Use cache folder for lockfile #2887 (@Dref360)
Fix streaming: catch Timeout error #3050 (@borisdayma)
Refac module factory + avoid etag requests for hub datasets #2986 (@lhoestq)
Fix task reloading from cache #3059 (@lhoestq)
Fix test command after refac #3065 (@lhoestq)
Fix Windows CI with FileNotFoundError when setting up s3_base fixture #3070 (@albertvillanova)
Update summary on PyPI beyond NLP #3062 (@thomwolf)
Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset #3002 (@mariosasko)
feat: increase streaming retry config #3068 (@borisdayma)
Fix pathlib patches for streaming #3072 (@lhoestq)

Breaking changes:

Due to the large refactoring in #2986, the prepare_module function no longer supports the return_resolved_file_path and return_associated_base_path parameters. As an alternative, use dataset_module_factory.

Contributors

jimregan, craffel, and 33 other contributors

15 Sep 17:45

lhoestq

1.12.1

2c1fc9c

1.12.1

Bug fixes

Fix fsspec AbstractFileSystem access #2915 (@pierre-godard)
Fix unwanted tqdm bar when accessing examples #2920 (@lhoestq)
Fix conversion of multidim arrays in list to arrow #2922 (@lhoestq):
- This fixes the "ArrowInvalid: Can only convert 1-dimensional array values" errors

Contributors

pierre-godard and lhoestq
