Dataset: CoNLL 2003 (English)

The English dataset was obtained from PapersWithCode. It was introduced by Tjong Kim Sang and De Meulder in their 2003 paper, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. At the time of writing, the dataset consists of the following:

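| English Data | Articles | Sentences | Tokens | LOC | MISC | ORG | PER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Training set | 946 | 14,987 | 203,621 | 7,140 | 3,438 | 6,321 | 6,600 |
| Development set | 216 | 3,466 | 51,362 | 1,837 | 922 | 1,341 | 1,842 |
| Test set | 231 | 3,684 | 46,435 | 1,668 | 702 | 1,661 | 1,617 |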
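
Counts like those in the table can be recomputed from a file in the CoNLL-2003 column format. The following is a minimal sketch, not the script used to produce the table. It assumes one token per line with the NER tag in the last column, blank lines between sentences, and `-DOCSTART-` lines marking article boundaries. The function name `conll_stats` and the `sample` data are hypothetical. The sketch excludes `-DOCSTART-` lines from the sentence and token counts, so its totals may differ from published figures that count those markers.

```python
from collections import Counter


def conll_stats(lines):
    """Count articles, sentences, tokens and entities in CoNLL-style lines."""
    stats = Counter()
    prev_tag = "O"
    in_sentence = False
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            in_sentence = False
            prev_tag = "O"
            continue
        fields = line.split()
        token, tag = fields[0], fields[-1]
        if token == "-DOCSTART-":
            # Document marker: counted as an article, not as a sentence or token
            # (an assumption of this sketch).
            stats["articles"] += 1
            prev_tag = "O"
            continue
        if not in_sentence:
            stats["sentences"] += 1
            in_sentence = True
        stats["tokens"] += 1
        if tag != "O":
            prefix, _, etype = tag.partition("-")
            prev_type = prev_tag.partition("-")[2]
            # A new entity starts on a B- tag, after an O tag, or when the type changes.
            # This covers both the IOB1 and IOB2 tagging schemes.
            if prefix == "B" or prev_tag == "O" or prev_type != etype:
                stats[etype] += 1
        prev_tag = tag
    return stats


# Tiny hand-made sample in the assumed column format (token, POS, chunk, NER).
sample = """-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

stats = conll_stats(sample)
print(dict(stats))
```

To run it on a real split, pass an open file handle instead of `sample`; the file path depends on where the dataset was downloaded.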

The leaderboard for Named Entity Recognition (NER) on this dataset can be found here. At the time of writing, the state of the art is an F1 score of 94.6, achieved by the ACE + document-context model, which is described in more detail in its authors' paper.
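
For context, F1 on CoNLL-2003 is conventionally computed at the entity level: a predicted entity counts as correct only if both its span and its type match the gold annotation exactly. The sketch below illustrates that idea on a single tag sequence. It is not the official evaluation script, and the helper names `extract_spans` and `entity_f1` are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a tag sequence."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, cur = tag.partition("-")
        # Close the open span when it ends: on O, on a B- tag, or on a type change.
        if start is not None and (tag == "O" or prefix == "B" or cur != etype):
            spans.add((start, i, etype))
            start, etype = None, None
        # Open a new span on any non-O tag when none is open.
        if tag != "O" and start is None:
            start, etype = i, cur
    return spans


def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 between gold and predicted tag sequences."""
    gold = extract_spans(gold_tags)
    pred = extract_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]  # right span, wrong type for the last entity
print(entity_f1(gold, pred))
```

In this example one of two predicted entities matches one of two gold entities, so precision and recall are both 0.5. Leaderboard scores are computed over the whole test set rather than one sentence.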