Summarization Repository

Authors: Alex Fabbri*, Wojciech Kryściński*, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev

This project is a collaboration between Yale LILY Lab and Salesforce Research.

* Equal contribution

Updates

04/19/2020 - Updated the human annotation file to include all models from paper and metric scores.
04/19/2020 - SummEval is now pip-installable! Check out the pypi page.
04/09/2020 - Please see this comment with code for computing system-level metric correlations!
11/12/2020 - Added the reference-less BLANC and SUPERT metrics!
7/16/2020 - Initial commit! :)

Data

As part of this release, we share summaries generated by recent summarization models trained on the CNN/DailyMail dataset here.
We also share human annotations, collected from both crowdsourced workers and experts, here.

Both datasets are shared WITHOUT the source articles that were used to generate the summaries.
To recreate the full dataset please follow the instructions listed here.

Model Outputs

Model	Paper	Outputs	Type
M0	Lead-3 Baseline	Link	Extractive
M1	Neural Document Summarization by Jointly Learning to Score and Select Sentences	Link	Extractive
M2	BANDITSUM: Extractive Summarization as a Contextual Bandit	Link	Extractive
M3	Neural Latent Extractive Document Summarization	Link	Extractive
M4	Ranking Sentences for Extractive Summarization with Reinforcement Learning	Link	Extractive
M5	Learning to Extract Coherent Summary via Deep Reinforcement Learning	Link	Extractive
M6	Neural Extractive Text Summarization with Syntactic Compression	Link	Extractive
M7	STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings	Link	Extractive
M8	Get To The Point: Summarization with Pointer-Generator Networks	Link	Abstractive
M9	Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting	Link	Abstractive
M10	Bottom-Up Abstractive Summarization	Link	Abstractive
M11	Improving Abstraction in Text Summarization	Link	Abstractive
M12	A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss	Link	Abstractive
M13	Multi-Reward Reinforced Summarization with Saliency and Entailment	Link	Abstractive
M14	Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation	Link	Abstractive
M15	Closed-Book Training to Improve Summarization Encoder Memory	Link	Abstractive
M16	An Entity-Driven Framework for Abstractive Summarization	Link	Abstractive
M17	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer	Link	Abstractive
M18	Better Rewards Yield Better Summaries: Learning to Summarise Without References	Link	Abstractive
M19	Text Summarization with Pretrained Encoders	Link	Abstractive
M20	Fine-Tuning GPT-2 from Human Preferences	Link	Abstractive
M21	Unified Language Model Pre-training for Natural Language Understanding and Generation	Link	Abstractive
M22	BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension	Link	Abstractive
M23	PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization	Link	Abstractive

IMPORTANT:

All model outputs were obtained from the original authors of the models and shared with their consent.
When using any of the model outputs, please also cite the original paper.

Human annotations

Human annotations of model generated summaries can be found here.

The annotations include summaries generated by 16 models from 100 source news articles (1600 examples in total).
Each of the summaries was annotated by 5 independent crowdsourced workers and 3 independent experts (8 annotations in total).
Summaries were evaluated across 4 dimensions: coherence, consistency, fluency, relevance.
Each source news article comes with the original reference from the CNN/DailyMail dataset and 10 additional crowdsourced reference summaries.
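
As a rough illustration of how the ratings described above might be aggregated, the sketch below averages the expert and crowdsourced scores for one summary across the four dimensions. The record layout (the keys `expert_annotations` and `turker_annotations`, each holding a list of per-annotator dictionaries) and all values are assumptions made for this example; check the downloaded file for the actual field names.

```python
from statistics import mean

DIMENSIONS = ("coherence", "consistency", "fluency", "relevance")


def average_ratings(annotations):
    """Average each dimension over a list of per-annotator rating dicts."""
    return {dim: mean(a[dim] for a in annotations) for dim in DIMENSIONS}


# Hypothetical record: field names and scores are made up for illustration.
record = {
    "expert_annotations": [
        {"coherence": 4, "consistency": 5, "fluency": 5, "relevance": 3},
        {"coherence": 3, "consistency": 5, "fluency": 5, "relevance": 4},
        {"coherence": 5, "consistency": 5, "fluency": 5, "relevance": 5},
    ],
    "turker_annotations": [
        {"coherence": 4, "consistency": 4, "fluency": 4, "relevance": 4},
    ] * 5,
}

expert_avg = average_ratings(record["expert_annotations"])
crowd_avg = average_ratings(record["turker_annotations"])
print(expert_avg)
print(crowd_avg)
```

Keeping expert and crowdsourced averages separate, as here, makes it easy to compare the two annotator groups before deciding which one to correlate metrics against.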

Data preparation

Both model generated outputs and human annotated data require pairing with the original CNN/DailyMail articles.

To recreate the datasets, follow these steps:

1. Download CNN Stories and Daily Mail Stories from https://cs.nyu.edu/~kcho/DMQA/
2. Create a cnndm directory and unpack the downloaded files into it.
3. Download and unpack the model outputs or human annotations.
4. Run the pair_data.py script to pair the data with the original articles.

Example call for model outputs:

python3 data_processing/pair_data.py --model_outputs <file-with-data-annotations> --story_files <dir-with-stories>

Example call for human annotations:

python3 data_processing/pair_data.py --data_annotations <file-with-data-annotations> --story_files <dir-with-stories>

Evaluation Toolkit

We provide a toolkit for summarization evaluation to unify metrics and promote robust comparison of summarization systems. The toolkit contains popular and recent metrics for summarization as well as several machine translation metrics.

Metrics

Below are the metrics included in the toolkit, followed by the associated paper and the code used within the toolkit:

Metric	Paper	Code
ROUGE	ROUGE: A Package for Automatic Evaluation of Summaries	Link
ROUGE-we	Better Summarization Evaluation with Word Embeddings for ROUGE	Link
MoverScore	MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance	Link
BertScore	BertScore: Evaluating Text Generation with BERT	Link
Sentence Mover's Similarity	Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts	Link
SummaQA	Answers Unite! Unsupervised Metrics for Reinforced Summarization Models	Link
BLANC	Fill in the BLANC: Human-free quality estimation of document summaries	Link
SUPERT	SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization	Link
METEOR	METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments	Link
S³	Learning to Score System Summaries for Better Content Selection Evaluation	Link
Misc. statistics (extractiveness, novel n-grams, repetition, length)	Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies	Link
Syntactic Evaluation	Automatic Analysis of Syntactic Complexity in Second Language Writing	Link
CIDEr	CIDEr: Consensus-based Image Description Evaluation	Link
CHRF	CHRF++: words helping character n-grams	Link
BLEU	BLEU: a Method for Automatic Evaluation of Machine Translation	Link

Setup

You can install summ_eval via pip:

pip install summ-eval

You can also install summ_eval from source:

git clone https://github.com/Yale-LILY/SummEval.git
cd SummEval/evaluation
pip install -e .

You can test your installation (assuming you're in the ./summ_eval folder) and get familiar with the library through the tests in tests/:

python -m unittest discover

Command-line interface

We provide a command-line interface calc-scores which makes use of gin config files to set metric parameters.

Examples

Run ROUGE on the given summary and reference files and write the results to rouge.jsonl, analogous to files2rouge.

calc-scores --config-file=examples/basic.config --metrics "rouge" --summ-file summ_eval/1.summ --ref-file summ_eval/1.ref --output-file rouge.jsonl --eos " . " --aggregate True

NOTE: if you're seeing slow-ish startup time, try commenting out the metrics you're not using in the config; otherwise this will load all modules.

Run ROUGE and BertScore on a .jsonl file that contains reference and decoded (i.e., system output) keys and write the results to rouge_bertscore.jsonl.

calc-scores --config-file=examples/basic.config --metrics "rouge, bert_score" --jsonl-file data.jsonl --output-file rouge_bertscore.jsonl
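
If your summaries and references live in Python lists, an input file of the shape described above could be written as in the following sketch. It assumes one JSON object per line with only the `decoded` and `reference` keys; any further fields the tool may accept are not covered here, and the example sentences are placeholders.

```python
import json

summaries = ["This is one summary", "This is another summary"]
references = ["This is one reference", "This is another"]

# One JSON object per line, using the "decoded" (system output) and
# "reference" keys; the file name matches the --jsonl-file argument above.
with open("data.jsonl", "w", encoding="utf-8") as f:
    for summ, ref in zip(summaries, references):
        f.write(json.dumps({"decoded": summ, "reference": ref}) + "\n")
```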

For a full list of options, please run:

calc-scores --help

For use in scripts

If you want to use the evaluation metrics as part of other scripts, we have you covered!

from summ_eval.rouge_metric import RougeMetric
rouge = RougeMetric()

Evaluate on a batch

summaries = ["This is one summary", "This is another summary"]
references = ["This is one reference", "This is another"]

rouge_dict = rouge.evaluate_batch(summaries, references)

Evaluate on a single example

rouge_dict = rouge.evaluate_example(summaries[0], references[0])

Evaluate with multiple references

For simplicity, the command-line tool currently does not use multiple references. Each metric has a supports_multi_ref property that tells you whether it supports multiple references.

print(rouge.supports_multi_ref) # True
multi_references = [["This is ref 1 for summ 1", "This is ref 2 for summ 1"], ["This is ref 1 for summ 2", "This is ref 2 for summ 2"]]
rouge_dict = rouge.evaluate_batch(summaries, multi_references)
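
When mixing metrics that do and do not support multiple references, a small wrapper can check supports_multi_ref before calling evaluate_batch. The sketch below is one possible approach, not part of the toolkit: the helper `evaluate_with_refs` and the `DummyMetric` class are hypothetical, and falling back to the first reference is an assumption you may want to replace (for example, by scoring against each reference and taking the maximum).

```python
def evaluate_with_refs(metric, summaries, multi_references):
    """Score summaries against multiple references when the metric allows it.

    Otherwise, fall back to the first reference of each example.
    """
    if getattr(metric, "supports_multi_ref", False):
        return metric.evaluate_batch(summaries, multi_references)
    first_refs = [refs[0] for refs in multi_references]
    return metric.evaluate_batch(summaries, first_refs)


class DummyMetric:
    """Hypothetical stand-in for a single-reference metric (not in summ_eval)."""

    supports_multi_ref = False

    def evaluate_batch(self, summaries, references):
        # Toy score: fraction of summaries identical to their reference.
        matches = sum(s == r for s, r in zip(summaries, references))
        return {"exact_match": matches / len(summaries)}


summaries = ["a b c", "d e f"]
multi_references = [["a b c", "x y z"], ["q r s", "d e f"]]
result = evaluate_with_refs(DummyMetric(), summaries, multi_references)
print(result)
```

Any object exposing the same two members, such as the RougeMetric instance created earlier, could be passed in place of the dummy class.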

Citation

@article{fabbri2020summeval,
  title={SummEval: Re-evaluating Summarization Evaluation},
  author={Fabbri, Alexander R and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard and Radev, Dragomir},
  journal={arXiv preprint arXiv:2007.12626},
  year={2020}
}

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!
