This repository contains the dataset and statistical analysis code accompanying the EMNLP 2017 paper "Why We Need New Evaluation Metrics for NLG".
- emnlp_data_individual_hum_scores.csv - the dataset with system outputs and evaluation ratings from 3 crowd-workers for each output
- emnlp_data.csv - the dataset with system outputs, original human references, scores of automatic metrics, and medians of human ratings (see the loading sketch after this list)
- analysis_emnlp.R - R code implementing the statistical analysis discussed in the paper
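The data files can be explored directly in R. Below is a minimal loading sketch; the column names in the commented-out correlation example ("bleu", "informativeness") are illustrative assumptions, so check the CSV headers for the actual names:

```r
# Load the two datasets (run from the repository root).
individual <- read.csv("emnlp_data_individual_hum_scores.csv", stringsAsFactors = FALSE)
medians    <- read.csv("emnlp_data.csv", stringsAsFactors = FALSE)

# Inspect the available columns.
str(individual)
str(medians)

# Example: rank correlation between an automatic metric and the median
# human rating. The column names here are assumptions, not guaranteed
# to match the files; adjust them after inspecting the headers above.
# cor(medians$bleu, medians$informativeness,
#     method = "spearman", use = "complete.obs")

# To reproduce the paper's full analysis:
# source("analysis_emnlp.R")
```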
If you use the dataset or the code, please cite the paper:

Jekaterina Novikova, Ondrej Dusek, Amanda Cercas-Curry and Verena Rieser (2017): Why We Need New Evaluation Metrics for NLG. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Copenhagen, Denmark.