Benchmarks

Huggingface Leaderboard and Papers with Code, covering all NLP tasks

Tools and other things

MLPerf™ Inference Benchmark Suite Github Paper

ExplainaBoard (beta version)

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Homepage, Paper, Papers with Code
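
A minimal sketch of the CheckList templating idea, using the checklist package's Editor. The template string and the profession list here are illustrative, not taken from the paper:

```python
from checklist.editor import Editor

editor = Editor()
# Templates expand over word lists; {first_name} is a built-in lexicon,
# and {a:profession} handles a/an article agreement.
ret = editor.template(
    "{first_name} is {a:profession}.",
    profession=["lawyer", "doctor", "accountant"],  # illustrative list
    nsamples=5,
)
for sentence in ret.data:
    print(sentence)
```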

Basic Benchmarks

General Language Understanding Evaluation (GLUE)
Homepage Paper Github
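
GLUE tasks are commonly pulled through the Hugging Face datasets library; a quick sketch, with MRPC chosen arbitrarily as the example task:

```python
from datasets import load_dataset

# Each GLUE task is a separate configuration; MRPC is paraphrase detection.
mrpc = load_dataset("glue", "mrpc")
print(mrpc["train"][0])  # {'sentence1': ..., 'sentence2': ..., 'label': ..., 'idx': ...}
print(mrpc["train"].features["label"].names)  # ['not_equivalent', 'equivalent']
```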

General language capabilities and multipurpose benchmarks

RACE: Large-scale ReAding Comprehension Dataset From Examinations
NLU + QA, understanding and reasoning
2017 Paper

BIG-bench
Collection of many NLP tasks
Github

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
Github

Question answering

Stanford Question Answering Dataset (SQuAD)
Homepage
arxiv.org: SQuAD 1, SQuAD 2
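
For a quick look at SQuAD-style extractive QA, the transformers question-answering pipeline works out of the box (its default model is SQuAD-finetuned); the question and context strings below are made up:

```python
from transformers import pipeline

# The question-answering pipeline returns a span extracted from the context.
qa = pipeline("question-answering")
result = qa(
    question="Where was the dataset collected?",
    context="SQuAD was built from paragraphs of Wikipedia articles, "
            "with questions written by crowdworkers.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```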

Microsoft NewsQA
Homepage Paper Github

PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them
Dataset and benchmark in one; could be useful for fine-tuning QA/CBQA models
Paper

Natural Language Inference

The Stanford Natural Language Inference (SNLI) Corpus
Homepage Paper

The Multi-Genre NLI Corpus
Homepage

These corpora might have annotation artifacts and bias in their labeling:
Annotation Artifacts in Natural Language Inference Data
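
The artifact finding can be reproduced cheaply: a classifier that only sees the hypothesis (never the premise) should score well above the ~33% chance level if the labels leak into the hypothesis wording. A sketch of such a hypothesis-only baseline on SNLI; the subsample size and featurization are arbitrary choices:

```python
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

snli = load_dataset("snli")
train = snli["train"].filter(lambda ex: ex["label"] != -1).select(range(20000))
test = snli["test"].filter(lambda ex: ex["label"] != -1)

# Featurize the hypothesis only -- the premise is deliberately ignored.
vec = TfidfVectorizer(max_features=20000)
X_train = vec.fit_transform(train["hypothesis"])
X_test = vec.transform(test["hypothesis"])

clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])
acc = accuracy_score(test["label"], clf.predict(X_test))
print(f"hypothesis-only accuracy: {acc:.3f}")  # chance would be ~0.33
```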

Natural Language Generation

GEM is a benchmark environment for natural language generation with a focus on evaluation
Homepage

GLGE: A New General Language Generation Evaluation Benchmark
Natural language generation; 24 tasks across 3 difficulty levels; contains MASS, BART, and ProphetNet baselines
arxiv.org 2021 Link
Dayiheng Liu et al., Microsoft Research and College of Computer Science, Sichuan University
Public repository and guide: https://microsoft.github.io/glge/

BERTSCORE: Evaluating text generation with BERT
An automatic evaluation metric for text generation
Cite Link, Paper, Github
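
Usage is nearly a one-liner via the bert-score package; a small sketch with made-up candidate/reference pairs:

```python
from bert_score import score

cands = ["the cat sat on the mat"]       # system outputs (illustrative)
refs = ["a cat was sitting on the mat"]  # references (illustrative)

# Returns precision/recall/F1 tensors, one entry per candidate.
P, R, F1 = score(cands, refs, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```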

Commonsense reasoning

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
Situations With Adversarial Generations
Homepage Paper

HellaSwag: Can a Machine Really Finish Your Sentence?
Like SWAG but harder
Homepage 2019 Paper
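
SWAG and HellaSwag are multiple-choice completion tasks; a common zero-shot scoring recipe is to pick the ending to which a language model assigns the highest average log-likelihood. A minimal sketch with GPT-2; the context and endings below are invented, not drawn from the dataset:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def ending_logprob(context: str, ending: str) -> float:
    """Average log-likelihood of the ending tokens given the context."""
    ids = tok(context + " " + ending, return_tensors="pt").input_ids
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    # logits at position i predict token i+1, so shift by one.
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    token_logp = logp[torch.arange(len(targets)), targets]
    return token_logp[ctx_len - 1:].mean().item()  # ending tokens only

context = "On stage, a woman takes a seat at the piano. She"
endings = [
    "sets her fingers on the keys.",
    "nervously sets her fingers on the keyboard of a typewriter.",
]
best = max(range(len(endings)), key=lambda i: ending_logprob(context, endings[i]))
print("predicted ending:", endings[best])
```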

POS-Tagging

Stanford typed dependencies manual

Penn-Treebank
Papers with Code, Kaggle, Leaderboard
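
The Penn Treebank tagset is what most off-the-shelf English POS taggers emit; a quick sketch with NLTK (the tagger and tokenizer models must be downloaded on first run):

```python
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
# Tags (DT, JJ, NN, ...) follow the Penn Treebank tagset.
```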

Papers with Code Benchmarks and References

Low Resource Named Entity Recognition
Dependency Parsing
Semantic Similarity
Semantic Parsing
Semantic Textual Similarity
question-answering
natural-language-understanding
reading-comprehension
natural-language-inference
sentiment-analysis
language-modelling
text-classification