20 Apr 17:05

lhoestq

40bb9e6

1.6.0

Dataset changes

New: MOROCO #2002 (@MihaelaGaman)
New: CBT dataset #2044 (@gchhablani)
New: MDD Dataset #2051 (@gchhablani)
New: Multilingual dIalogAct benchMark (miam) #2047 (@eusip)
New: bAbI QA tasks #2053 (@gchhablani)
New: machine translated multilingual STS benchmark dataset #2090 (@PhilipMay)
New: EURLEX legal NLP dataset #2114 (@iliaschalkidis)
New: ECtHR legal NLP dataset #2114 (@iliaschalkidis)
New: EU-REG-IR legal NLP dataset #2114 (@iliaschalkidis)
New: NorNE dataset for Norwegian POS and NER #2154 (@versae)
New: banking77 #2140 (@dkajtoch)
New: OpenSLR #2173 #2215 #2221 (@cahya-wirawan)
New: CUAD dataset #2219 (@bhavitvyamalik)
Update: Gem V1.1 + new challenge sets#2142 #2186 (@yjernite)
Update: Wikiann - added spans field #2141 (@rabeehk)
Update: XTREME - Add tel to xtreme tatoeba #2180 (@lhoestq)
Update: GLUE MRPC - added real label to test set #2216 (@philschmid)
Fix: MultiWoz22 - fix dialogue action slot name and value #2136 (@adamlin120)
Fix: wikiauto - fix link #2171 (@mounicam)
Fix: wino_bias - use right splits #1930 (@JieyuZhao)
Fix: lc_quad - update download checksum #2213 (@mariosasko)
Fix newsgroup -fix one instance of 'train' to 'test' #2225 (@alexwdong)
Fix: xnli - fix tuple key #2233 (@NikhilBartwal)

Dataset features

Allow stateful function in dataset.map #1960 (@mariosasko)
MIAM dataset - new citation details #2101 (@eusip)
[Refactor] Use in-memory/memory-mapped/concatenation tables in Dataset #2025 (@lhoestq)
Allow pickling of big in-memory tables #2150 (@lhoestq)
updated user permissions based on umask #2086 #2157 (@bhavitvyamalik)
Fast table queries with interpolation search #2122 (@lhoestq)
Concat only unique fields in DatasetInfo.from_merge #2163 (@mariosasko)
Implementation of class_encode_column #2184 #2227 (@SBrandeis)
Add support for axis in concatenate datasets #2151 (@albertvillanova)
Set default in-memory value depending on the dataset size #2182 (@albertvillanova)

Metrics changes

New: CER metric #2138 (@chutaklee)
Update: WER - Compute metric iteratively #2111 (@albertvillanova)
Update: seqeval - configurable options to seqeval metric #2204 (@marrodion)

Dataset cards

REFreSD: Updated card using information from data statement and datasheet #2082 (@mcmillanmajora)
Winobiais: fix split infos #2152 (@JieyuZhao)
all: Fix size categories in YAML Tags #2074 (@gchhablani)
LinCE: Updating citation information on LinCE readme #2205 (@gaguilar)
Swda: Update README.md #2235 (@PierreColombo)

General improvements and bug fixes

Refactorize Metric.compute signature to force keyword arguments only #2079 (@albertvillanova)
Fix max_wait_time in requests #2085 (@lhoestq)
Fix copy snippet in docs #2091 (@mariosasko)
Fix deprecated warning message and docstring #2100 (@albertvillanova)
Move Dataset.to_csv to csv module #2102 (@albertvillanova)
Fix: Allows a feature to be named "_type" #2093 (@dcfidalgo)
copy.deepcopy os.environ instead of copy #2119 (@NihalHarish)
Replace legacy torch.Tensor constructor with torch.tensor #2126 (@mariosasko)
Implement Dataset as context manager #2113 (@albertvillanova)
Fix missing infos from concurrent dataset loading #2137 (@lhoestq)
Pin fsspec lower than 0.9.0 #2172 (@lhoestq)
Replace assertTrue(isinstance with assertIsInstance in tests #2164 (@mariosasko)
add social thumbnial #2177 (@philschmid)
Fix s3fs tests for py36 and py37+ #2183 (@lhoestq)
Fix typo in huggingface hub #2192 (@LysandreJik)
Update metadata if dataset features are modified #2087 (@mariosasko)
fix missing indices_files in load_form_disk #2197 (@lhoestq)
Fix backward compatibility in Dataset.load_from_disk #2199 (@albertvillanova)
Fix ArrowWriter overwriting features in ArrowBasedBuilder #2201 (@lhoestq)
Fix incorrect assertion in builder.py #2110 (@dreamgonfly)
Remove Python2 leftovers #2208 (@mariosasko)
Revert breaking change in cache_files property #2217 (@lhoestq)
Set test cache config #2223 (@albertvillanova)
Fix map when removing columns on a formatted dataset #2231 (@lhoestq)
Refactorize tests to use Dataset as context manager #2191 (@albertvillanova)
Preserve split type when reloading dataset #2168 (@mariosasko)

Docs

make documentation more clear to use different cloud storage #2127 (@philschmid)
Render docstring return type as inline #2147 (@albertvillanova)
Add table classes to the documentation #2155 (@lhoestq)
Pin docutils for better doc #2174 (@sgugger)
Fix docstrings issues #2081 (@albertvillanova)
Add code of conduct to the project #2209 (@albertvillanova)
Add classes GenerateMode, DownloadConfig and Version to the documentation #2202 (@albertvillanova)
Fix bash snippet formatting in ADD_NEW_DATASET.md #2234 (@mariosasko)

Assets 2

18 Mar 14:21

lhoestq

1.5.0

f256b77

1.5.0

Datasets changes

New: Europarl Bilingual #1874 (@lucadiliello)
New: Stanford Sentiment Treebank #1961 (@patpizio)
New: RO-STS #1978 (@lorinczb)
New: newspop #1871 (@frankier)
New: FashionMNIST #1999 (@gchhablani)
New: Common voice #1886 (@BirgerMoell), #2063 (@patrickvonplaten)
New: Cryptonite #2013 (@theo-m)
New: RoSent #2011 (@gchhablani)
New: PersiNLU reading-comprehension #2028 (@danyaljj)
New: conllpp #1991 (@ZihanWangKi)
New: LaRoSeDa #2004 (@MihaelaGaman)
Update: unnecessary docstart check in conll-like datasets #2020 (@mariosasko)
Update: semeval 2020 task 11 - add article_id and process test set template #1979 (@hemildesai)
Update: Md gender - card update #2018 (@mcmillanmajora)
Update: XQuAD - add Romanian #2023 (@M-Salti)
Update: DROP - all answers #1980 (@KaijuML)
Fix: TIMIT ASR - Make sure not only the first sample is used #1995 (@patrickvonplaten)
Fix: Wikipedia - save memory by replacing root.clear with elem.clear #2037 (@miyamonz)
Fix: Doc2dial update data_infos and data_loaders #2041 (@songfeng)
Fix: ZEST - update download link #2057 (@matt-peters)
Fix: ted_talks_iwslt - fix version error #2064 (@mariosasko)

Datasets Features

Implement Dataset from CSV #1946 (@albertvillanova)
Implement Dataset from JSON and JSON Lines #1943 (@albertvillanova)
Implement Dataset from text #2030 (@albertvillanova)
Optimize int precision for tokenization #1985 (@albertvillanova)
- This allows to save 75%+ of space when tokenizing a dataset

General Bug fixes and improvements

Fix ArrowWriter closes stream at exit #1971 (@albertvillanova)
feat(docs): navigate with left/right arrow keys #1974 (@ydcjeff)
Fix various typos/grammer in the docs #2008 (@mariosasko)
Update format columns in Dataset.rename_columns #2027 (@mariosasko)
Replace print with logging in dataset scripts #2019 (@mariosasko)
Raise an error for outdated sacrebleu versions #2033 (@lhoestq)
Not all languages have 2 digit codes. #2016 (@asiddhant)
Fix arrow memory checks issue in tests #2042 (@lhoestq)
Support pickle protocol for dataset splits defined as ReadInstruction #2043 (@mariosasko)
Preserve column ordering in Dataset.rename_column #2045 (@mariosasko)
Fix text-classification tags #2049 (@gchhablani)
Fix docstring rendering of Dataset/DatasetDict.from_csv args #2066 (@albertvillanova)
Fixes check of TF_AVAILABLE and TORCH_AVAILABLE #2073 (@philschmid)
Add and fix docstring for NamedSplit #2069 (@albertvillanova)
Bump huggingface_hub version #2077 (@SBrandeis)
Fix docstring issues #2072 (@albertvillanova)

Assets 2

04 Mar 09:16

lhoestq

1.4.1

ca41320

1.4.1

Fix an issue #1981 with WMT downloads #1982 (@albertvillanova)

Assets 2

03 Mar 17:13

lhoestq

1.4.0

f42658e

1.4.0

Datasets Changes

New: iapp_wiki_qa_squad #1873 (@cstorm125)
New: Financial PhraseBank #1866 (@frankier)
New: CoVoST2 #1935 (@patil-suraj)
New: TIMIT #1903 (@vrindaprabhu)
New: Mlama (multilingual lama) #1931 (@pdufter)
New: FewRel #1823 (@gchhablani)
New: CCAligned Multilingual Dataset #1815 (@gchhablani)
New: Turkish News Category Lite #1967 (@yavuzKomecoglu)
Update: WMT - use mirror links #1912 for better download speed (@lhoestq)
Update: multi_nli - add missing fields #1950 (@bhavitvyamalik)
Fix: ALT - fix duplicated examples in alt-parallel #1899 (@lhoestq)
Fix: WMT datasets - fix download errors #1901 (@YangWang92), #1902 (@lhoestq)
Fix: QA4MRE - fix download URLs #1918 (@M-Salti)
Fix: Wiki_dpr - fix when with_embeddings is False or index_name is "no_index" #1925 (@lhoestq)
Fix: Wiki_dpr - add missing scalar quantizer #1926 (@lhoestq)
Fix: GEM - fix the URL filtering for bad MLSUM examples in GEM #1970 (@yjernite)

Datasets Features

Add to_dict and to_pandas for Dataset #1889 (@SBrandeis)
Add to_csv for Dataset #1887 (@SBrandeis)
Add keep_linebreaks parameter to text loader #1913 (@lhoestq)
Add not-in-place implementations for several dataset transforms #1883 (@SBrandeis):
- This introduces new methods for Dataset objects: rename_column, remove_columns, flatten and cast.
- The old in-place methods rename_column_, remove_columns_, flatten_ and cast_ are now deprecated.
Make DownloadManager downloaded/extracted paths accessible #1846 (@albertvillanova)
Add cross-platform support for datasets-cli #1951 (@mariosasko)

Metrics Changes

New: sari metric #1875 (@ddhruvkr)

Offline loading

Handle timeouts #1952 (@lhoestq)
Add datasets full offline mode with HF_DATASETS_OFFLINE #1976 (@lhoestq)

General improvements and bugfixes

Replace flatten_nested #1879 (@albertvillanova)
add missing info on how to add large files #1885 (@stas00)
Docs for adding new column on formatted dataset #1888 (@lhoestq)
Fix PandasArrayExtensionArray conversion to native type #1897 (@lhoestq)
Bugfix for string_to_arrow timestamp[ns] support #1900 (@justin-yan)
Fix to_pandas for boolean ArrayXD #1904 (@lhoestq)
Fix logging imports and make all datasets use library logger #1914 (@albertvillanova)
Standardizing datasets dtypes #1921 (@justin-yan)
Remove unused py_utils objects #1916 (@albertvillanova)
Fix save_to_disk with relative path #1923 (@lhoestq)
Updating old cards #1928 (@mcmillanmajora)
Improve typing and style and fix some inconsistencies #1929 (@mariosasko)
Fix builder config creation with data_dir #1932 (@lhoestq)
Disallow ClassLabel with no names #1938 (@lhoestq)
Update documentation with not in place transforms and update DatasetDict #1947 (@lhoestq)
Documentation for to_csv, to_pandas and to_dict #1953 (@lhoestq)
typos + grammar #1955 (@stas00)
Fix unused arguments #1962 (@mariosasko)
Fix metrics collision in separate multiprocessed experiments #1966 (@lhoestq)

Assets 2

15 Feb 16:54

lhoestq

1.3.0

ef633da

1.3.0

Dataset Features

On-the-fly data transforms (#1795)
ADD S3 support for downloading and uploading processed datasets (#1723)
Allow loading dataset in-memory (#1792)
Support future datasets (#1813)
Enable/disable caching (#1703)
Offline dataset loading (#1726)

Datasets Hub Features

Loading from the Datasets Hub (#1860)
This allows users to create their own dataset repositories in the Datasets Hub and then load them using the library.
Repositories can be created on the website: https://huggingface.co/new-dataset or using the huggingface-cli. More information in the dataset sharing section of the documentation

Dataset Changes

New: LJ Speech (#1878)
New: Add Hindi Discourse Analysis Natural Language Inference Dataset (#1822)
New: cord 19 (#1850)
New: Tweet Eval Dataset (#1829)
New: CIFAR-100 Dataset (#1812)
New: SICK (#1804)
New: BBC Hindi NLI Dataset (#1158)
New: Freebase QA Dataset (#1814)
New: Arabic sarcasm (#1798)
New: Semantic Scholar Open Research Corpus (#1606)
New: DuoRC Dataset (#1800)
New: Aggregated dataset for the GEM benchmark (#1807)
New: CC-News dataset of English language articles (#1323)
New: irc disentangle (#1586)
New: Narrative QA Manual (#1778)
New: Universal Morphologies (#1174)
New: SILICONE (#1761)
New: Librispeech ASR (#1767)
New: OSCAR (#1694, #1868, #1833)
New: CANER Corpus (#1684)
New: Arabic Speech Corpus (#1852)
New: id_liputan6 (#1740)
New: Stuctured Argument Extraction for Korean dataset (#1748)
New: TurkCorpus (#1732)
New: Hatexplain Dataset (#1716)
New: adversarialQA (#1714)
Update: Doc2dial - reading comprehension update to latest version (#1816)
Update: OPUS Open Subtitles - add with metadata information (#1865)
Update: SWDA - use all metadata features(#1799)
Update: SWDA - add metadata and correct splits (#1749)
Update: CommonGen - update citation information (#1787)
Update: SciFact - update URL (#1780)
Update: BrWaC - update features name (#1736)
Update: TLC - update urls to be github links (#1737)
Update: Ted Talks IWSLT - add new version: WIT3 (#1676)
Fix: multi_woz_v22 - fix checksums (#1880)
Fix: limit - fix url (#1861)
Fix: WebNLG - fix test test + more field (#1739)
Fix: PAWS-X - fix csv Dictreader splitting data on quotes (#1763)
Fix: reuters - add missing "brief" entries (#1744)
Fix: thainer: empty token bug (#1734)
Fix: lst20: empty token bug (#1734)

Metrics Changes

New: Word Error Metric (#1847)
New: COMET (#1577, #1753)
Fix: bert_score - set version dependency (#1851)

Metric Docs

Add metrics usage examples and tests (#1820)

CLI Changes

[BREAKING] remove outdated commands (#1869):
- remove outdated "datasets-cli upload_dataset" and "datasets-cli upload_metric"
- instead, use the huggingface-hub CLI

Bug fixes

fix writing GPU Faiss index (#1862)
update pyarrow import warning (#1782)
Ignore definition line number of functions for caching (#1779)
update saving and loading methods for faiss index so to accept path like objects (#1663)
Print error message with filename when malformed CSV (#1826)
Fix default tensors precision when format is set to PyTorch and TensorFlow (#1795)

Refactoring

Refactoring: Create config module (#1848)
Use a config id in the cache directory names for custom configs (#1754)

Logging

Enable logging propagation and remove logging handler (#1845)

Assets 2

13 Jan 15:29

lhoestq

1.2.1

a59580b

1.2.1

New Features

Fast start up (#1690): Importing datasets is now significantly faster.

Datasets Changes

New: MNIST (#1730)
New: Korean intonation-aided intention identification dataset (#1715)
New: Switchboard Dialog Act Corpus (#1678)
Update: Wiki-Auto - Added unfiltered versions of the training data for the GEM simplification task. (#1722)
Update: Scientific papers - Mirror datasets zip (#1721)
Update: Update DBRD dataset card and download URL (#1699)
Fix: Thainer - fix ner_tag bugs (#1695)
Fix: reuters21578 - metadata parsing errors (#1693)
Fix: ade_corpus_v2 - fix config names (#1689)
Fix: DaNE - fix last example (#1688)

Datasets tagging

rename "part-of-speech-tagging" tag in some dataset cards (#1645)

Bug Fixes

Fix column list comparison in transmit format (#1719)
Fix windows path scheme in cached path (#1711)

Docs

Add information about caching and verifications in "Load a Dataset" docs (#1705)

Moreover many dataset cards of datasets added during the sprint were updated ! Thanks to all the contributors :)

Assets 2

04 Jan 18:38

lhoestq

1.2.0

dae6880

1.2.0

Intermediate release before v2.0.0
Includes all the datasets added during the datasets sprint of December 2020 (currently over 610 datasets).

Assets 2

19 Nov 18:33

lhoestq

1.1.3

000b584

1.1.3

Datasets changes

New: NLI-Tr (#787)
New: Amazon Reviews (#791)(#844)(#845)(#799)
New: ASNQ - answer sentence selection (#780)
New: OpenBookCorpus (#856)
New: ASLG-PC12 - sign language translation (#731)
New: Quail - question answering dataset (#747)
Update: SNLI: Created dataset card snli.md (#663)
Update: csv - Use pandas reader in csv (#857)
- Better memory management
- Breaking: the previous read_options, parse_options and convert_options are replaced with plain parameters like pandas.read_csv
Update: conll2000, conll2003, germeval_14, wnut_17, XTREME PAN-X - Create ClassLabel for labelling tasks datasets (#850)
- Breaking: use of ClassLabel features instead of string features + naming of columns updated for consistency
Update: XNLI - Add XNLI train set (#781)
Update: XSUM - Use full released xsum dataset (#754)
Update: CompGuessWhat - New version of CompGuessWhat?! with refined annotations (#748)
Update: CLUE - add OCNLI, a new CLUE dataset (#742)
Fix: KOR-NLI - Fix csv reader (#855)
Fix: Discofuse - fix discofuse urls (#793)
Fix: Emotion - fix description (#745)
Fix: TREC - update urls (#740)

Metrics changes

New: accuracy, precision, recall and F1 metrics (#825)
Fix: squad_v2 (#840)
Fix: seqeval (#810)(#738)
Fix: Rouge - fix description (#774)
Fix: GLUE - fix description (#734)
Fix: BertScore - fix custom baseline (#763)

Command line tools

add clear_cache parameter in the test command (#863)

Dependencies

Integrate file_lock inside the lib for better logging control (#859)

Dataset features

Add writer_batch_size attribute to GeneratorBasedBuilder (#828)
pretty print dataset objects (#725)
allow custom split names in text dataset (#776)

Tests

All configs is a slow test now

Bug fixes

Make save function use deterministic global vars order (#819)
fix type hints pickling in python 3.6 (#818)
fix metric deletion when attributes are missing (#782)
Fix custom builder caching (#770)
Fix metric with cache dir (#772)
Fix train_test_split output format (#719)

Assets 2

06 Oct 14:22

lhoestq

1.1.2

2256521

1.1.2

Dataset changes

Fix: text - use python read instead of pandas reader (#715):
- fix delimiter/overflow issues
- better memory handling

Bug fixes

Fix dataset configuration creation using data_files per splits using NamedSplit (#706)
Fix permission issue on windows - don't use tqdm 4.50.0 (#718)

Assets 2

02 Oct 13:12

lhoestq

1.1.0

fe52b67

1.1.0: Windows support, Better Multiprocessing, New Datasets

Windows support

Add Windows support (#644):
- add tests and CI for Windows
- fix numerous windows specific issues
- The library now fully supports Windows

Dataset changes

New: HotpotQA (#703)
New: OpenWebText (#660)
New: Winogrande - add debiased subset (#655)
Update: XNLI - update download link (#695)
Update: text - switch to pandas reader, better memory usage, fix delimiter issues (#689)
Update: csv - add features parameter to CSV (#685)
Fix: GAP - fix wrong computation of boolean features (#680)
Fix: C4 - fix manual instruction function (#681)

Metric changes

Update: ROUGE - Add rouge 2 and rouge Lsum to rouge metric outputs by default (#701, #702)
Fix: SQuAD - fix kwargs description (#670)

Dataset Features

Use multiprocess from pathos for multiprocessing (#656):
- allow lambda functions in multiprocessed map
- allow local functions in multiprocessed map
- and more ! As long as functions are compatible with dill

Bug fixes

Datasets: fix possible program hanging with tokenizers - Disable tokenizers parallelism in multiprocessed map (#688)
Datasets: fix cast with unordered features - fix column order issue in cast (#684)
Datasets: fix first time creation of cache directory - move cache dir root creation in builder's init (#677)
Datasets: fix OverflowError when using negative ids - fix negative ids in slicing with an array (#679)
Datasets: fix empty dictionaries afetr multiprocessing - keep new columns in transmit format (#659)
Datasets: fix type inference for nested types - handle data alteration when trying type (#653)
Metrics: fix compute metric with empty input - pass metric features to the reader (#654)

Documentation

Elasticsearch integration documentation (#696)

Tests

Use GitHub instead of AWS in remote dataset tests (#694)

Assets 2

Releases: huggingface/datasets

1.6.0

Dataset changes

Dataset features

Metrics changes

Dataset cards

General improvements and bug fixes

Docs

1.5.0

Datasets changes

Datasets Features

General Bug fixes and improvements

1.4.1

1.4.0

Datasets Changes

Datasets Features

Metrics Changes

Offline loading

General improvements and bugfixes

1.3.0

Dataset Features

Datasets Hub Features

Dataset Changes

Metrics Changes

Metric Docs

CLI Changes

Bug fixes

Refactoring

Logging

1.2.1

New Features

Datasets Changes

Datasets tagging

Bug Fixes

Docs

1.2.0

1.1.3

Datasets changes

Metrics changes

Command line tools

Dependencies

Dataset features

Tests

Bug fixes

1.1.2

Dataset changes

Bug fixes

1.1.0: Windows support, Better Multiprocessing, New Datasets

Windows support

Dataset changes

Metric changes

Dataset Features

Bug fixes

Documentation

Tests