The English dataset was obtained from PapersWithCode. It was introduced by Tjong Kim Sang and De Meulder in their 2003 paper, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. At the time of writing, the dataset consists of the following:
English Data | Articles | Sentences | Tokens | LOC | MISC | ORG | PER |
---|---|---|---|---|---|---|---|
Training set | 946 | 14,987 | 203,621 | 7,140 | 3,438 | 6,321 | 6,600 |
Development set | 216 | 3,466 | 51,362 | 1,837 | 922 | 1,341 | 1,842 |
Test set | 231 | 3,684 | 46,435 | 1,668 | 702 | 1,661 | 1,617 |
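
Counts like those in the table can be recomputed from the raw files. The following is a minimal sketch, assuming the usual CoNLL-2003 layout: one token per line with the NER tag in the last column, blank lines between sentences, and a `-DOCSTART-` line at the start of each article. The function name `conll_stats` and the sample text are illustrative and not part of the dataset tooling. An entity is counted whenever a tag starts with `B-` or its type differs from the previous token's type, so the sketch should work for both IOB1 and IOB2 tagging.

```python
from collections import Counter


def conll_stats(lines):
    """Count articles, sentences, tokens and entities in CoNLL-style lines."""
    stats = Counter()
    prev_tag = "O"
    in_sentence = False
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line ends the current sentence.
            in_sentence = False
            prev_tag = "O"
            continue
        if line.startswith("-DOCSTART-"):
            # Document marker: counted as an article, not as a sentence (assumption).
            stats["articles"] += 1
            in_sentence = False
            prev_tag = "O"
            continue
        if not in_sentence:
            stats["sentences"] += 1
            in_sentence = True
        stats["tokens"] += 1
        tag = line.split()[-1]  # NER tag is assumed to be the last column
        if tag != "O":
            prefix, etype = tag.split("-", 1)
            prev_type = prev_tag.split("-", 1)[1] if prev_tag != "O" else None
            # A new entity starts on B-, or when the entity type changes.
            if prefix == "B" or prev_type != etype:
                stats[etype] += 1
        prev_tag = tag
    return stats


# Hypothetical two-sentence sample in the assumed format.
sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

stats = conll_stats(sample)
print(dict(stats))
```

Whether `-DOCSTART-` lines count as sentences is a convention choice, so sentence totals computed this way may differ from the table by the number of articles.
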
The leaderboard for Named Entity Recognition (NER) on this dataset can be found here. The current state of the art is an F1 score of 94.6, achieved by the ACE + document-context model, which is described in more detail in the corresponding paper.
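
For context on the metric, NER on CoNLL-2003 is conventionally scored with entity-level F1, where a predicted entity counts as correct only if both its span and its type match the gold annotation exactly. The sketch below illustrates that computation under the assumption that the leaderboard follows this convention. The function `entity_f1` and the span tuples `(sentence, start, end, type)` are hypothetical, and the returned values are fractions rather than percentages.

```python
def entity_f1(gold, pred):
    """Micro-averaged precision, recall and F1 over exact-match entity spans."""
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)  # same sentence, boundaries and type
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical spans: (sentence index, start token, end token, entity type).
gold_spans = {(0, 0, 1, "ORG"), (0, 2, 3, "MISC"), (0, 6, 7, "MISC"), (1, 0, 2, "PER")}
pred_spans = {(0, 0, 1, "ORG"), (0, 2, 3, "MISC"), (1, 0, 1, "PER")}

precision, recall, f1 = entity_f1(gold_spans, pred_spans)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this example the predicted `PER` span has the wrong end boundary, so it earns no credit even though it overlaps the gold entity.
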