Releases: huggingface/datasets
1.16.1
1.16.0
Datasets Changes
- New: riddle_sense by @ziyiwu9494 in #3161
- New: Multi-Lingual LibriSpeech by @patrickvonplaten in #3198
- New: XCSR by @yangxqiao in #3074
- New: CMU Hinglish DoG by @Ishan-Kumar2 in #3149
- New: Multidoc2dial by @sivasankalpp in #3205
- New: IndoNLI by @afaji in #3307
- Update: DaNE - updated URL for download by @MalteHB in #3203
- Update: xcopa - (fix checksum issues + add translated data) by @mariosasko in #3254
- Update: tatoeba - update to v2021-07-22 by @KoichiYasuoka in #3225
- Update: KILT - update metadata JSON by @albertvillanova in #3276
- Update: Covost 2 - update download instructions by @patrickvonplaten in #3281
- Update: Common Voice, OpenSLR, LibriSpeech ASR, Vivos - make several audio datasets streamable by @lhoestq in #3290
- Fix: tuple_ie - fix download url by @mariosasko in #3213
- Fix: id_newspapers_2018 - fix streaming by @lhoestq in #3249
- Fix: bookcorpusopen - fix RAM usage by @lhoestq in #3280
- Fix: Scielo - fix ConnectionError by @mariosasko in #3260
- Fix: tatoeba - fix URLs for a subset of xtreme by @mariosasko in #3321
Datasets Features
- Push to hub capabilities for Dataset and DatasetDict by @LysandreJik in #3098:
- Upload your dataset to the Hugging Face Hub with the push_to_hub() method!
- See documentation here
- 200+ datasets now support streaming:
- Stream TAR-based dataset using iter_archive by @lhoestq in #3110
- Stream from Google Drive and other hosts by @lhoestq in #3248
- Support Audio feature in streaming mode by @albertvillanova in #3133
- Support Audio feature for TAR archives in sequential access by @albertvillanova in #3129
- Resolve data_files by split name automatically by @lhoestq in #3221
- It takes into account the file names to know which file goes into which split
- See documentation here
- Filter method for batched=True by @thomasw21 in #3244
- Adding with_rank arg to pass process rank to map by @TevenLeScao in #3314
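The idea behind resolving data_files by split name (#3221) can be pictured with a small sketch. This is not the library's implementation: infer_split, group_by_split and the keyword table are hypothetical names made up for illustration, and the real matching rules may differ. It only shows how a file name could decide which split a file goes into.

```python
import re
from collections import defaultdict

# Hypothetical keyword table; the library's actual patterns may differ.
SPLIT_KEYWORDS = {
    "train": ("train",),
    "validation": ("validation", "valid", "dev"),
    "test": ("test",),
}


def infer_split(file_name, default="train"):
    """Return the split whose keyword appears as a whole token in file_name."""
    # Break the name on anything that is not a lowercase letter or digit,
    # e.g. "data/dev-00.csv" -> ["data", "dev", "00", "csv"].
    tokens = [t for t in re.split(r"[^a-z0-9]+", file_name.lower()) if t]
    for split, keywords in SPLIT_KEYWORDS.items():
        if any(token in keywords for token in tokens):
            return split
    # No keyword found: fall back to the default split (an assumption of this sketch).
    return default


def group_by_split(file_names):
    """Group file names into a {split: [files]} mapping."""
    groups = defaultdict(list)
    for name in file_names:
        groups[infer_split(name)].append(name)
    return dict(groups)


files = ["train.csv", "data/dev-00.csv", "test_0.jsonl", "shard-01.csv"]
resolved = group_by_split(files)
```

Matching on whole tokens rather than substrings avoids putting a file such as "latest.csv" into the test split.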
Dataset Cards
- Add full tagset to conll2003 README by @BramVanroy in #3230
- Fix some contact information formats by @lhoestq in #3274
- Add wikipedia tags by @lhoestq in #3301
- Updating details of IRC disentanglement data by @jkkummerfeld in #3259
Metrics Changes
- New: OpenAI's pass@k code evaluation metric by @lvwerra in #2916
- Update: BLEURT - options to use updated bleurt checkpoints by @jaehlee in #3235
- Update: CER - update to support latest release by @mariosasko in #3252
- Update: WER - update to the documentation by @wooters in #3278
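For readers unfamiliar with pass@k, the following is a minimal sketch of the unbiased estimator described in the paper that introduced HumanEval: with n generated samples per problem, of which c pass the unit tests, pass@k is estimated as 1 - C(n - c, k) / C(n, k). The function name and signature are assumptions for illustration. The sketch covers only the arithmetic, not the execution of generated programs or the interface of the metric added in #2916.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("expected 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the probability that a random subset of
    # k samples contains no passing sample; math.comb returns 0 when
    # k > n - c, so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for one problem, 3 of them correct.
scores = {k: pass_at_k(10, 3, k) for k in (1, 5, 10)}
```

The per-problem estimates are usually averaged over all problems in the benchmark.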
Documentation
- Add docs for to_tf_dataset by @stevhliu in #3175
- Small updates to to_tf_dataset documentation by @Rocketknight1 in #3215
- Update link to Datasets Tagging app in Spaces by @albertvillanova in #3194
- Improve repository structure docs by @lhoestq in #3233
- Swap descriptions of v1 and raw-v1 configs of WikiText dataset and fix metadata by @albertvillanova in #3241
- Add docs for audio processing by @stevhliu in #3222
- Add push_to_hub docs by @lhoestq in #3319
Additional improvements and bug fixes
- Catch token invalid error in CI by @lhoestq in #3200
- Pin keras version until TF fixes its release by @albertvillanova in #3208
- Fix disable_nullable default value to False by @lhoestq in #3211
- Fix code quality in riddle_sense dataset by @albertvillanova in #3218
- Better error msg if len(predictions) doesn't match len(references) in metrics by @mariosasko in #3160
- Use huggingface_hub.HfApi to list datasets/metrics by @mariosasko in #3121
- Pin version exclusion for tensorflow incompatible with keras by @albertvillanova in #3216
- Group tests in multiprocessing workers by test file by @albertvillanova in #3231
- Fix load_from_disk temporary directory by @lhoestq in #3245
- [tiny] fix typo in stream docs by @nollied in #3246
- Avoid PyArrow type optimization if it fails by @mariosasko in #3234
- Remove redundant isort module placement by @mariosasko in #3243
- asserts replaced by exception for text classification task with test. by @manisnesan in #3256
- Add os.listdir for streaming by @lhoestq in #3270
- asserts replaced with exception for image classification task, csv, json by @manisnesan in #3262
- Force data files extraction if download_mode='force_redownload' by @mariosasko in #3275
- Minor Typo Fix - Precision to Recall by @SebastinSanty in #3279
- Decode audio from remote by @lhoestq in #3271
- Fix build_docs CI by @lhoestq in #3286
- Allow datasets with indices table when concatenating along axis=1 by @mariosasko in #3288
- f-string formatting by @Mehdi2402 in #3277
- Unpin markdown for build_docs now that it's fixed by @lhoestq in #3289
- Pin version exclusion for Markdown by @albertvillanova in #3293
- Use f-strings in the dataset scripts by @Carlosbogo in #3291
- fix old_val typo in f-string by @Mehdi2402 in #3302
- asserts replaced with exception for fingerprint.py, search.py, arrow_writer.py and metric.py by @Ishan-Kumar2 in #3305
- fix: files counted twice in inferred structure by @borisdayma in #3309
- Finish transition to PyArrow 3.0.0 by @mariosasko in #3318
- Removing query params for dynamic URL caching by @anton-l in #3315
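The last item (#3315) concerns URLs whose query string changes between requests, for example signed download links. One plausible way to get a stable cache key for such URLs is sketched below with the standard library. The helper name strip_query and the example URLs are hypothetical, and this is not necessarily how the library does it.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop the query string and fragment so the rest can serve as a cache key."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Two requests for the same file with different (made-up) signed parameters.
first = strip_query("https://example.com/data/clips.tar.gz?signature=abc&expires=1")
second = strip_query("https://example.com/data/clips.tar.gz?signature=xyz&expires=2")
same_key = first == second
```

Both URLs reduce to the same key, so a file downloaded once would not be fetched again only because its signature changed.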
Citation
- Update BibTeX entry by @albertvillanova in #3223
- Fix paper BibTeX citation with proceedings reference by @albertvillanova in #3226
- Add CITATION file by @albertvillanova in #3228
- Fix URL in CITATION file by @albertvillanova in #3229
Deprecations
- Deprecate prepare_module by @albertvillanova in #3166
Full Changelog: 1.15.1...1.16.0
1.15.1
1.15.0
Dataset Changes
- Update: JNLBA - add tags names by @bhavitvyamalik in #3092
- Update: OpenSLR - add SLR83 to OpenSLR by @tyrius02 in #3125 and #3176
- Update: RONEC - update to v2 by @dumitrescustefan in #3184
- Fix: Arabic Billion Words - Fix script to return all data by @albertvillanova in #3136
- Fix: HLGD - fix label mapping by @VictorSanh in #3180
Dataset Features
- Allow dynamic first dimension for ArrayXD by @rpowalski in #2891
- add multi-proc in to_csv by @bhavitvyamalik in #2896
- QOL improvements: auto-flatten_indices and desc in map calls by @mariosasko in #3196
Dataset Cards
Metrics Changes
- New: metric for the MATH dataset (competition_math). by @hacobe in #3020
- New: Google BLEU (aka GLEU) metric by @slowwavesleep in #3108
- New: TER by @BramVanroy in #3153
- New: ChrF(++) by @BramVanroy in #3187
General improvements and bug fixes
- Correctly update metadata to preserve features when concatenating datasets with axis=1 by @mariosasko in #3120
- Fixes to to_tf_dataset by @Rocketknight1 in #3085
- Add security policy to the project by @albertvillanova in #2958
- Update doc links to point to new docs by @mariosasko in #3116
- Fix caching bugs by @mariosasko in #3141
- Fix numpy deprecation warning for ragged tensors by @lhoestq in #3137
- Fixed: duplicate parameter and missing parameter in docstring by @PanQiWei in #3157
- Fix some typos in the documentation by @h4iku in #3152
- Fix string encoding for Value type by @lhoestq in #3158
- Fix CLI test to ignore verifications when saving infos by @albertvillanova in #3147
- Make inspect.get_dataset_config_names always return a non-empty list by @albertvillanova in #3159
- Fix issue with filelock filename being too long on encrypted filesystems by @mariosasko in #3173
- Asserts replaced by exceptions (#3171) by @joseporiolayats in #3174
- Preserve ordering in zip_dict by @mariosasko in #3170
- Don't memoize strings when hashing since two identical strings may have different python ids by @lhoestq in #3182
- Re-add faiss to windows testing suite by @BramVanroy in #3151
- Add missing docstring to DownloadConfig by @mariosasko in #3183
- More efficient nested features encoding by @eladsegal in #3124
- Fix optimized encoding for arrays by @lhoestq in #3197
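The string memoization fix (#3182) is easier to follow with a small demonstration. The sketch below uses the standard pickle module as a stand-in and is not the library's hashing code. It relies on CPython behaviour: strings built at runtime with str.join are separate objects even when equal. The pickler memoizes objects by identity, so the serialized bytes (and any hash computed from them) depend on whether two equal strings happen to be the same object.

```python
import pickle

# Build two equal strings at runtime so that they are distinct objects.
a = "".join(["data", "sets"])
b = "".join(["data", "sets"])

# Same object twice: the second item is written as a reference to the memo.
same_object_twice = pickle.dumps([a, a])
# Equal but distinct objects: the second item is written out in full.
equal_objects = pickle.dumps([a, b])

values_equal = [a, a] == [a, b]
bytes_equal = same_object_twice == equal_objects
```

The two lists compare equal, yet their serialized forms differ, which is why a content hash built on such bytes is unstable unless strings are excluded from memoization.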
1.14.0
Dataset changes
- Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)
- Update: SUPERB - use Audio features #3101 (@anton-l)
- Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)
Dataset features
General improvements and bug fixes
- Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)
- Fix project description in PyPI #3103 (@albertvillanova)
- Align tqdm control with cache control #3031 (@mariosasko)
- Add paper BibTeX citation #3107 (@albertvillanova)
1.13.3
Dataset changes
- Update: Adapt all audio datasets #3081 (@patrickvonplaten)
Bug fixes
- Update BibTeX entry #3090 (@albertvillanova)
- Use template column_mapping to transmit_format instead of template features #3088 (@mariosasko)
- Fix Audio feature mp3 resampling #3096 (@albertvillanova)
1.13.2
Bug fixes
- Fix error related to huggingface_hub timeout parameter #3082 (@albertvillanova)
- Remove _resampler from Audio fields #3086 (@albertvillanova)
1.13.1
Bug fixes
- Fix loading a metric with internal import #3077 (@albertvillanova)
1.13.0
Dataset changes
- New: CaSiNo #2867 (@kushalchawla)
- New: Mostly Basic Python Problems #2893 (@lvwerra)
- New: OpenAI's HumanEval #2897 (@lvwerra)
- New: SemEval-2018 Task 1: Affect in Tweets #2745 (@maxpel)
- New: SEDE #2942 (@Hazoom)
- New: Jigsaw unintended Bias #2935 (@Iwontbecreative)
- New: AMI #2853 (@cahya-wirawan)
- New: Math Aptitude Test of Heuristics #2982 #3014 (@hacobe, @albertvillanova)
- New: SwissJudgmentPrediction #2983 (@JoelNiklaus)
- New: KanHope #2985 (@adeepH)
- New: CommonLanguage #2989 #3006 #3003 (@anton-l, @albertvillanova, @jimregan)
- New: SwedMedNER #2940 (@bwang482)
- New: SberQuAD #3039 (@Alenush)
- New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English #3004 (@iliaschalkidis)
- New: Greek Legal Code #2966 (@christospi)
- New: Story Cloze Test #3067 (@zaidalyafeai)
- Update: SUPERB - add IC, SI, ER tasks #2884 #3009 (@anton-l, @albertvillanova)
- Update: MENYO-20k - repo has moved, updating URL #2939 (@cdleong)
- Update: TriviaQA - add web and wiki config #2949 (@shirte)
- Update: nq_open - Use standard open-domain validation split #3029 (@craffel)
- Update: MeDAL - Add further description and update download URL #3022 (@xhlulu)
- Update: Biosses - fix column names #3054 (@bwang482)
- Fix: scitldr - fix minor URL format #2948 (@albertvillanova)
- Fix: masakhaner - update JSON metadata #2973 (@albertvillanova)
- Fix: TriviaQA - fix unfiltered subset #2995 (@lhoestq)
- Fix: TriviaQA - set writer batch size #2999 (@lhoestq)
- Fix: LJ Speech - fix Windows paths #3016 (@albertvillanova)
- Fix: MedDialog - update metadata JSON #3046 (@albertvillanova)
Metric changes
- Update: meteor - update from nltk update #2946 (@lhoestq)
- Update: accuracy,f1,glue,indic-glue,pearsonr,precision,recall-super_glue - Replace item with float in metrics #3012 #3001 (@albertvillanova, @mariosasko)
- Fix: f1/precision/recall metrics with None average #3008 #2992 (@albertvillanova)
- Fix meteor metric for version >= 3.6.4 #3056 (@albertvillanova)
Dataset features
- Use with TensorFlow:
- Adding to_tf_dataset method #2731 #2931 #2951 #2974 (@Rocketknight1)
- Better support for ZIP files:
- Support loading dataset from multiple zipped CSV data files #3021 (@albertvillanova)
- Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041 (@lhoestq)
- Streaming improvements:
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Add remove_columns to IterableDataset #3030 (@cccntu)
- All the above ZIP features also work in streaming mode
- New utilities:
- Replace script_version with revision #2933 (@albertvillanova)
- The script_version parameter in load_dataset is now deprecated, in favor of revision
- Experimental - Create Audio feature type #2324 (@albertvillanova):
- It automatically decodes audio data (mp3, wav, flac, etc.) when examples are accessed
Dataset cards
- Add arxiv paper in swiss_judgment_prediction dataset card #3026 (@JoelNiklaus)
Documentation
General improvements and bug fixes
- Fix filter leaking #3019 (@lhoestq)
- Calling filter several times in a row was not returning the right results in 1.12.0 and 1.12.1
- Update BibTeX entry #2928 (@albertvillanova)
- Fix exception chaining #2911 (@albertvillanova)
- Add regression test for null Sequence #2929 (@albertvillanova)
- Don't use old, incompatible cache for the new filter #2947 (@lhoestq)
- Fix fn kwargs in filter #2950 (@lhoestq)
- Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast #2895 (@arsarabi)
- Check that array is not Float as nan != nan #2936 (@Iwontbecreative)
- Fix missing conda deps #2952 (@lhoestq)
- Update legacy Python image for CI tests in Linux #2955 (@albertvillanova)
- Support pandas 1.3 new read_csv parameters #2960 (@SBrandeis)
- Fix CI doc build #2961 (@albertvillanova)
- Run tests in parallel #2954 (@albertvillanova)
- Ignore dummy folder and dataset_infos.json #2975 (@Ishan-Kumar2)
- Take namespace into account in caching #2938 (@lhoestq)
- Make Dataset.map accept list of np.array #2990 (@albertvillanova)
- Fix loading compressed CSV without streaming #2994 (@albertvillanova)
- Fix json loader when conversion not implemented #3000 (@lhoestq)
- Remove all query parameters when extracting protocol #2996 (@albertvillanova)
- Correct a typo #3007 (@Yann21)
- Fix Windows test suite #3025 (@albertvillanova)
- Remove unused parameter in xdirname #3017 (@albertvillanova)
- Properly install ruamel-yaml for windows CI #3028 (@lhoestq)
- Fix typo #3023 (@qqaatw)
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Actual "proper" install of ruamel.yaml in the windows CI #3033 (@lhoestq)
- Use cache folder for lockfile #2887 (@Dref360)
- Fix streaming: catch Timeout error #3050 (@borisdayma)
- Refac module factory + avoid etag requests for hub datasets #2986 (@lhoestq)
- Fix task reloading from cache #3059 (@lhoestq)
- Fix test command after refac #3065 (@lhoestq)
- Fix Windows CI with FileNotFoundError when setting up s3_base fixture #3070 (@albertvillanova)
- Update summary on PyPi beyond NLP #3062 (@thomwolf)
- Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset #3002 (@mariosasko)
- feat: increase streaming retry config #3068 (@borisdayma)
- Fix pathlib patches for streaming #3072 (@lhoestq)
Breaking changes:
- Due to the big refactoring in #2986, the prepare_module function no longer supports the return_resolved_file_path and return_associated_base_path parameters. As an alternative, you may use dataset_module_factory instead.