Datasets

Area of Collected Datasets

Named Entity Recognition

| Language | Dataset | Size | #Types | Description | Paper | Download |
|---|---|---|---|---|---|---|
| Chinese | msra | 46364/-/4365 | 3 | | Levow | damo/msra_ner |
| Chinese | resume | 3821/463/477 | 9 | | Zhang & Yang | damo/resume_ner |
| Chinese | weibo | 1350/269/270 | 4 | | Peng & Dredze | damo/weibo_ner |
| Chinese | ontonotes-v4-zh | 15724/4301/4346 | | | - | ldc/ontonotes-v4 |
| Chinese | cluener2020 | 10748/1343/1345 | 10 | | Xu et al., 2020 | github/cluener2020 |
| Chinese | people_dairy1998 | | 3 | | | github/ChineseNLPCorpus |
| Chinese | people_dairy2014 | | 3 | | | baidu-pan password:1fa3 |
| Chinese | cmeee | 15000/5000/3000 | | CMeEE dataset in CBLUE benchmark | Zhang et al., 2022 | github/cblue |
| Chinese | yidu-s4k | | | | - | openkg/yidu-s4k |
| Chinese | ecommerce | | | | Jie et al., 2019 | github/ner_incomplete_annotation/ecommerce |
| Chinese | dlner | | | | Xu et al., 2017 | github/dlner |
| Dutch | conll2002-nl | 15796/2895/5196 | 4 | | Tjong Kim Sang, 2002 | |
| English | wnut2016 | 2394/1000/3850 | | Noisy User-generated Text | Strauss et al., 2016 | damo/wnut16 |
| English | wnut2017 | 3394/1009/1287 | | | Derczynski et al., 2017 | damo/wnut17 |
| English | conll2003-en | 14041/3250/3453 | 4 | | Tjong Kim Sang & De Meulder, 2003 | |
| English | conllpp | 14041/3250/3453 | 4 | corrected version of the conll03-en NER dataset | Wang et al., 2019 | damo/conllpp_ner |
| English | ontonotes-v5-en | 59924/8528/8262 (TBD) | | | Pradhan et al., 2013 | ldc/ontonotes-v5 |
| English | ai | 100/350/431 | | | Liu et al., 2020 | damo/cross_ner |
| English | literature | 100/400/416 | | | Liu et al., 2020 | damo/cross_ner |
| English | music | 100/541/465 | | | Liu et al., 2020 | damo/cross_ner |
| English | politics | 200/541/651 | | | Liu et al., 2020 | damo/cross_ner |
| English | science | 200/450/543 | | | Liu et al., 2020 | damo/cross_ner |
| English | bc5cdr | 4560/4581/4797 | | | Li et al., 2016 | |
| English | ncbi | 5424/923/940 | | | Doğan et al., 2014 | |
| English | mit-movie | 6816/1000/1953 (TBD) | | | Liu et al., 2013 | mit/movie |
| English | mit-restaurant | 6900/760/1521 | | | Liu et al., 2013 | mit/restaurant |
| English | ace2004-en | | 7 | nested NER | Doddington et al., 2005 | ldc/ace04 |
| English | ace2005-en | | 7 | nested NER | - | ldc/ace05 |
| English | kbp2017 | | | nested NER | - | - |
| English | genia | | | nested NER | Ohta et al., 2002 | |
| English | few-nerd | 131767/18824/37548 | 8 / 66 | a few-shot NER dataset | Ding et al., 2021 | |
| English | wikigold | | | | Balasuriya et al., 2009 | |
| English | bionlp2014 | | | | Collier & Kim, 2004 | |
| English | fin | | | | Alvarado et al., 2015 | |
| English | btc | 6338/1001/2000 | 3 | | Derczynski et al., 2016 | |
| English | ttc | | | | Rijhwani & Preoţiuc-Pietro | github/ttc |
| English | tweebank | | | | Jiang et al., 2022 | github/tweebank |
| English | tweetner7 | | | | Ushio et al., 2022 | huggingface/tweetner7 |
| German | conll2003-de | 12152/2866/3005 | 4 | | Tjong Kim Sang & De Meulder, 2003 | |
| Spanish | conll2002-es | 8302/1919/1517 | 4 | | Tjong Kim Sang, 2002 | |
| English | twitter2015 | | | multi-modal | Zhang et al., 2018 | |
| English | snap | | | multi-modal | Lu et al., 2018 | github/UMT |
| English | twitter2017 | | | multi-modal | Yu et al., 2020 | github/UMT |
| English | wiki-diverse | | | constructed from wiki-diverse (a multi-modal entity typing dataset) | Wang et al., 2022 | github/wikidiverse |
| 11 langs | multiconer2022 | - | 6 | dataset of SemEval 2022 Task 11 (English, Spanish, Dutch, Russian, Turkish, Korean, Farsi, German, Chinese, Hindi, and Bangla) | Malmasi et al., 2022 | aws/multiconer |
| 282 langs | wikiann | - | | silver-standard data | Pan et al., 2017 | github/wikiann |
| 9 langs | wikiner | - | | silver-standard data | Nothman et al., 2013 | |
| 9 langs | wikineural | - | | silver-standard data | Tedeschi et al., 2021 | |
| 10 langs | multinerd | - | | silver-standard data | Tedeschi & Navigli, 2022 | |
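
The Size column packs several counts into one cell. The following is a minimal sketch of how such a cell might be read programmatically, assuming the three numbers are train/dev/test counts in that order, that `-` marks a split that is not provided, and that a trailing note such as `(TBD)` can be ignored. The names `SplitSizes` and `parse_size` are hypothetical helpers for illustration, not part of any library referenced in this document.

```python
from typing import NamedTuple, Optional


class SplitSizes(NamedTuple):
    """Per-split counts; None means the split is not provided."""
    train: Optional[int]
    dev: Optional[int]
    test: Optional[int]


def parse_size(cell: str) -> SplitSizes:
    """Parse a Size cell such as '46364/-/4365' (assumed train/dev/test order)."""
    # Drop a trailing annotation such as '(TBD)' and surrounding whitespace.
    cleaned = cell.split("(")[0].strip()
    parts = [p.strip() for p in cleaned.split("/")]
    if len(parts) != 3:
        raise ValueError(f"expected train/dev/test, got {cell!r}")
    # '-' marks a missing split; map it to None instead of a number.
    counts = [None if p == "-" else int(p) for p in parts]
    return SplitSizes(*counts)


if __name__ == "__main__":
    print(parse_size("46364/-/4365"))
    print(parse_size("59924/8528/8262(TBD)"))
```

Cells that are empty or hold a single `-` raise `ValueError` in this sketch, so a caller would need to filter those rows first.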

Chinese Word Segmentation

| Language | Dataset | Size | #Types | Description | Paper | Download |
|---|---|---|---|---|---|---|
| Chinese | PKU | 19056/-/1944 | - | - | sighan05 | train / test |
| Chinese | MSRA | 86924/-/3985 | - | - | sighan05 | train / test |
| Chinese | CTB6 | 23401/2078/2795 | - | - | Chinese Tree Bank v6 | train / dev / test |

Part-of-Speech Tagging

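| Language | Dataset | Size | #Types | Description | Paper | Download |
|---|---|---|---|---|---|---|
| Chinese | CTB5 | - | - | - | | train / dev / test |
| Chinese | CTB8 | 23401/2078/2795 | - | - | Chinese Tree Bank v6 | train / dev / test |
| Chinese | CTB9 | - | - | - | | train / dev / test |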

Ultra-fine Entity-Typing

| Language | Dataset | Size | #Types | Description | Paper | Download |
|---|---|---|---|---|---|---|
| English | UFET | 1998/1998/1998 | 10331 | Ultra-fine Entity Typing | Choi et al., 2018 | izhx404/ufet |
| Chinese | CFET | 2880/960/958 | 1299 | Unofficial split; no official split is provided. | Lee et al., 2020 | izhx404/cfet |

Event Extraction

| Language | Dataset | Size | Description | Paper | Download |
|---|---|---|---|---|---|
| Chinese | FewFC | 7185/899/898 | Passage level | Zhou et al., 2021 | here |
| Chinese | Duee | 11908/1492/34904 | Passage level | Li et al., 2020 | here |
| Chinese | Duee-fin | 7015/1171/59394 | Document level | Li et al., 2020 | here |
| Chinese | ChFinAnn | 25632/3204/3204 | Document level | Zheng et al., 2019 | here |
| English | WIKIEVENTS | 206/20/20 | Document level | Li et al., 2021 | train / dev / test |
| English | RAMS | 7329/924/871 | Document level | Ebner et al., 2020 | here |

Entity Relation joint Extraction

| Language | Dataset | Size | Description | Paper | Download |
|---|---|---|---|---|---|
| English | NYT | - | - | Ren et al., 2017 | here |
| English | NYT10-HRL/11-HRL | 70339/-/4006; 62648/-/369 | obtained with the preprocessing from the HRL paper | Takanobu et al., 2019 | here |
| English | WebNLG | 5019/-/703 | - | Gardent et al., 2017 | here |
| English | ADE | - | - | Gurulingappa et al., 2012 | - |
| English | SciERC | 1816/275/551 | - | Luan et al., 2018 | here |
| English | CoNLL04 | - | - | Roth et al., 2004 | - |
| English | ACE04 | - | - | - | here |
| English | ACE05 | 10051/2424/2050 | - | - | here |
| Chinese | DuIE2.0 | 171135/-/21055 | - | Li et al., 2019 | here |

End-to-End Entity Linking

| Language | Domain | Dataset | Train/Dev/Test/KB Size | Paper/Link | Download |
|---|---|---|---|---|---|
| English | News | AIDA-CoNLL | 12820/4242/3953/5903530 | Hoffart et al., 2011 | here |
| English | Medical | BC5CDR | 9535/9481/10032/2291 | Li et al., 2016 | here |
| English | Speech | NLPCC2022 | 28400/7640/2905/118795 | NLPCC2022 | here |
| Chinese | ShortText | CCKS2020 | 69691/9148/-/3234418 | CCKS2020 | - |