
Datasets


NLP and text corpora

OpenWebTextCorpus
skylion007.github.io/OpenWebTextCorpus
An open-source effort to reproduce OpenAI's WebText dataset, as described in the paper:
Language Models are Unsupervised Multitask Learners
Alec Radford, Jeffrey Wu, et al., 2019 (preprint)

Corpus of German-Language Fiction (txt)
https://figshare.com/articles/dataset/Corpus_of_German-Language_Fiction_txt_/4524680/1
Description from the dataset page:
Contains 2,735 German-language prose works (mainly novels and short stories) by 549 authors, spanning from ca. 1510 to the 1940s (the bulk of the texts dates from 1840–1930). This amounts to 937.8 MB of uncompressed literary data. The texts were extracted from the Gutenberg-DE Edition 13 DVD-ROM (released in November 2013) and converted from HTML to TXT. Each prose work is saved as a separate text file.
Frank Fischer, Jannik Strötgen
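
Since each work ships as a separate TXT file, a minimal sketch like the following can iterate over the unpacked archive; the directory name used here is an assumption, not part of the dataset's documentation.

```python
from pathlib import Path

# Minimal sketch: walk the per-work .txt files after unpacking the corpus
# archive. "corpus-of-german-fiction-txt" is an assumed directory name --
# point it at wherever the archive was actually extracted.
corpus_dir = Path("corpus-of-german-fiction-txt")

for txt_file in sorted(corpus_dir.glob("*.txt")):
    text = txt_file.read_text(encoding="utf-8", errors="replace")
    print(f"{txt_file.name}: {len(text.split())} whitespace-separated tokens")
```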

One Million Posts Corpus
https://ofai.github.io/million-post-corpus
Contains around one million user comments posted on the website of the Austrian newspaper DER STANDARD. The data is distributed as a single SQLite database file of roughly 350 MB.
The dataset was presented as a short paper at the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017) and in a follow-up paper at the 11th International Conference on Language Resources and Evaluation (LREC 2018).
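
Because the corpus is shipped as an SQLite file, Python's built-in sqlite3 module is enough to open and inspect it. The sketch below only lists the tables the database contains (via sqlite_master, which exists in every SQLite file); the filename is an assumption, and the actual table and column names should be taken from the listing or the project's documentation.

```python
import sqlite3

# Minimal sketch: open the downloaded corpus database and list its tables.
# "million_post_corpus.sqlite3" is an assumed filename -- substitute the
# real path of the file you downloaded.
conn = sqlite3.connect("million_post_corpus.sqlite3")

# sqlite_master is present in every SQLite database, so this query works
# regardless of the corpus schema.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print("tables:", [name for (name,) in tables])

conn.close()
```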

Benchmarks
