Sending out an SOS (Summary of Summaries):
A Brief Survey of Recent Work on Abstractive Summarization Evaluation
Griffin Adams
Columbia University, New York, NY, USA
Abstract
Research on the evaluation of abstractive summarization
models has evolved considerably in the last few years. To take
stock of these changes, namely, the shift from n-gram overlap
to fact-based assessments, I have written a brief summary of
papers on evaluation metrics, most of which focus on the
period from 2018 to mid-2020. The paper is written in a terse
literature review format, so as to aid researchers when craft-
ing related works sections of papers on summary evaluation.
Introduction
While recent progress on abstractive summarization has led
to remarkably fluent summaries, factual errors in generated
summaries still severely limit their use in practice. Recent
work has moved away from n-gram overlap based metrics,
such as ROUGE and BLEU, to semantic-inspired metrics
which assess a model’s faithfulness to the source text. In the
sections below, I trace this arc by discussing relevant papers.
Please cite this paper if used as a guide in your research.
Related Work
Manual Evaluation.
For decades, summarization systems were purely extractive
and, as an area of research, summarization was cast as a
sentence selection problem. Metrics capturing exact match
overlap, such as precision and recall, were the de facto stan-
dard for evaluation. Then, the field started to shift atten-
tion to abstractive summarization, a more natural, human-
like method for transducing document content. Given the
large variance in language generation, from an evaluation
standpoint, the focus shifted from surface form sentential
overlap to semantic equivalence at varying granularity. The
most prominent example, the Pyramid Method, proposes a
method for manual identification of semantic equivalence
based on Summary Content Units (SCUs) (Passonneau et al.
2005; Nenkova, Passonneau, and McKeown 2007). SCUs
are semantic units formed by clustering semantically equiv-
alent text spans from multiple reference summaries. SCUs
form a pyramid (by exhibiting a Zipfian distribution) based
on the number of supporting references, i.e., facts abstracted
by many references are presumed to hold more weight and
are placed at the top of the pyramid. Evaluation with the
Pyramid Scheme, then, targets semantic equivalence at the
subsentential level, weighted by popularity. While also se-
mantically motivated, the Responsiveness score targets a
more holistic measure of relevance (Dang 2005). Respon-
siveness resembles information retrieval objectives in that
it ranks summaries by how much they satisfy information
needs. Information needs are expressed by discrete topic
statements that cover different aspects of the source.
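To make the Pyramid Method's weighting concrete, the toy sketch below scores a candidate summary against hypothetical SCU annotations. It is a simplified illustration rather than the original protocol, and all SCU labels and weights are invented.

```python
# Simplified sketch of Pyramid-style content scoring (illustrative only).
# Each SCU's weight is the number of reference summaries supporting it; a
# summary's score is the weight it recovers divided by the best weight
# achievable with the same number of SCUs.

def pyramid_score(recovered_scus, scu_weights, num_scus_in_summary):
    achieved = sum(scu_weights[scu] for scu in recovered_scus)
    # An ideal summary of the same size would take the heaviest SCUs first.
    best = sum(sorted(scu_weights.values(), reverse=True)[:num_scus_in_summary])
    return achieved / best if best else 0.0

# Hypothetical SCUs with weights = number of supporting references.
scu_weights = {"team_won_title": 4, "coach_resigned": 2, "stadium_location": 1}
print(pyramid_score({"team_won_title", "stadium_location"}, scu_weights, 2))  # ~0.83
```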
Automatic Span Overlap.
Manual assessment of summaries is effective yet time con-
suming. (Hovy et al. 2006) provide an automated corol-
lary to the Pyramid Scheme, replacing manually identi-
fied SCUs with syntactic constituents called Basic Elements
(BE), and matching BEs through hard and soft alignment.
The most commonly used metrics for automatically evalu-
ating generated language are BLEU (Papineni et al. 2002),
ROUGE (Lin and Hovy 2003), and METEOR (Banerjee and
Lavie 2005). Simpler than Basic Elements, they measure lo-
cally constrained n-gram overlap. BLEU captures precision
while ROUGE introduces recall, in addition to a more flex-
ible n-gram metric: ROUGE-L defined over longest com-
mon subsequences (LCS). Designed specifically for MT sys-
tems, METEOR adds a measure of fragmentation because
word alignment is often integral to an accurate translation.
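For reference, the toy function below sketches ROUGE-N recall as clipped n-gram overlap; production implementations (the official ROUGE toolkit and its reimplementations) additionally handle stemming, ROUGE-L, and multiple references.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    # ROUGE-N recall: clipped n-gram matches over total reference n-grams.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat"))  # 0.6
```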
BERTScore leverages the rise to prominence of deep contex-
tualized embeddings to extend ROUGE and BLEU scores.
More specifically, BERTScore replaces exact match n-gram
overlap with soft alignment of contextualized BERT embed-
dings (Zhang et al. 2019). It is shown to correlate more with
human judgment by being semantically driven and more for-
giving to surface lexical divergences between reference and
model summary. ROUGESal offers a simple enhancement to
the ROUGE score: upweight salient phrases as determined
by a keyphrase classifier (Pasunuru and Bansal 2018).
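The mechanism behind BERTScore can be sketched as greedy soft alignment over contextual token embeddings. The snippet below uses random placeholder vectors where BERT outputs would go, so it only illustrates the matching step, not the full metric (which also supports IDF weighting and baseline rescaling).

```python
import numpy as np

def greedy_soft_overlap(cand_emb, ref_emb):
    """BERTScore-style greedy matching over contextual token embeddings.

    cand_emb, ref_emb: (num_tokens, dim) arrays, which in the real metric come
    from a contextual encoder such as BERT (placeholder random vectors below).
    """
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                    # pairwise cosine similarities
    precision = sim.max(axis=1).mean()    # each candidate token -> best reference match
    recall = sim.max(axis=0).mean()       # each reference token -> best candidate match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

rng = np.random.default_rng(0)
print(greedy_soft_overlap(rng.normal(size=(6, 768)), rng.normal(size=(8, 768))))
```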
Limitations of BLEU & ROUGE.
While BLEU and ROUGE have been shown to correlate
with human judgment, recent work has uncovered several
serious limitations (Novikova et al. 2017). (Schluter 2017)
questions whether a perfect ROUGE score indicates a perfect
summary, and provides the first NP-hardness proof for global
optimisation with ROUGE. Notably, it has been demon-
strated that the correlation of BLEU/ROUGE to human
judgment degrades as the number of summary references
declines (Louis and Nenkova 2013). A recent strand of liter-
ature has identified that summaries with a high BLEU or
ROUGE score can mask serious deficiencies, particularly
with respect to misrepresentation of facts (Maynez et al.
2020). (Maynez et al. 2020) introduce the concept of a hal-
lucination as a span of generated text not supported by the
input document. No hallucination is considered faithful to
the original text, but a hallucination may still be factual. A
factual hallucination is a span of text not directly supported
by the source text that can nonetheless be supported as a fact
by other knowledge sources. Depending on the intended audi-
ence, it may be desirable to integrate background knowledge
into summaries. For instance, it may be apropos to define
an interception if included as part of a summary of a foot-
ball game. It is possible, then, to be factual without being
faithful, but it is not usually possible to be faithful without
being factual. If a summary is not factual, it is only faithful
if there is an error in the source.
By and large, two strands of extrinsic evaluations have
been proposed to directly assess the faithfulness of a sum-
mary to its source: question answering and directed logical
entailment.
Question Answering as Evaluation Framework.
Some recent literature posits that, given salient questions,
a relevant summary should produce answers similar to those
of the source. These evaluations rely on the availability, or
ability to generate, relevant questions and a reliable QA sys-
tem from which to extract answers. For question genera-
tion, APES identifies and masks named entities in a refer-
ence summary to create cloze-style fill-in-the-blank ques-
tions which must be predicted correctly when conditioning
on the model summary (Eyal, Baumel, and Elhadad 2019).
(Chen et al. 2018) automatically generate “WH” factoid
questions. “Answers Unite!” offers a reference-free alterna-
tive to APES by extending the QA task to the source doc-
ument rather than a reference (Scialom et al. 2019). Obvi-
ating the need for human-generated references enables self-
supervision, and they demonstrate that augmented training
data leads to better overall performance. These metrics re-
quire generating questions from either the source or refer-
ence summary. (Durmus, He, and Diab 2020) reverse the
order and condition on model summaries when generating
questions. They train a question generator by fine-tuning a
pretrained BART language model (Lewis et al. 2019) on a
modified version of the QA2D dataset (Demszky, Guu, and
Liang 2018). Given a cloze-style masked sequence, the gen-
erator learns a question whose answer is the masked span.
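As a rough sketch of the cloze-construction step used by APES-style metrics, the snippet below masks named entities in a reference summary with spaCy; the masking token, example text, and spaCy model choice are illustrative assumptions. A separate QA model conditioned on the model summary would then be asked to recover each answer.

```python
import spacy  # assumes the en_core_web_sm model has been downloaded

nlp = spacy.load("en_core_web_sm")

def cloze_questions(reference_summary):
    """Mask each named entity in the reference to form a cloze question.

    Returns (question, answer) pairs; a QA system reading the model summary
    is then scored on how many blanks it fills correctly.
    """
    doc = nlp(reference_summary)
    pairs = []
    for ent in doc.ents:
        question = (reference_summary[:ent.start_char] + "[MASK]"
                    + reference_summary[ent.end_char:])
        pairs.append((question, ent.text))
    return pairs

for question, answer in cloze_questions("Serena Williams beat Maria Sharapova in Paris."):
    print(question, "->", answer)
```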
For the discussed QA-style evaluation metrics, question
generation is a necessary component to assessing factual-
ness. More broadly, however, when questions are generated
by an external model, this introduces an extra source of er-
rors, which propagate through to the answering system. The
questions merely demarcate specific information-seeking
needs, i.e., question generation may be a non-essential, replaceable step.
(Goodrich et al. 2019) remove this key intermediary task
by defining factual accuracy as the precision of relation tuples
extracted from the model summary against those extracted
from the reference summary. In part to provide training data
for the fact extractor, they create a
new dataset for fact extraction using distant supervision on
Wikipedia text by cross referencing facts from the Wikidata
knowledge base. Likewise, (Kryściński et al. 2019) propose
a weakly supervised, model-based approach to assess factual
consistency that does not involve generating questions. They
pre-train a fact checker on a synthetic fact dataset by first
selecting sentences from the source document, then para-
phrasing them to form positive examples and perturbing them
(negation, entity replacement, etc.) to form negative examples.
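A minimal sketch of the tuple-precision idea from (Goodrich et al. 2019) is shown below; the (subject, relation, object) tuples are hypothetical and would in practice come from their trained fact extractor.

```python
def fact_precision(model_tuples, reference_tuples):
    """Factual accuracy as precision over (subject, relation, object) tuples,
    in the spirit of Goodrich et al. (2019). The tuples themselves would come
    from a separately trained fact-extraction model."""
    model_tuples, reference_tuples = set(model_tuples), set(reference_tuples)
    if not model_tuples:
        return 0.0
    return len(model_tuples & reference_tuples) / len(model_tuples)

# Hypothetical extractions: the model summary hallucinates the object slot.
reference = {("Ardern", "prime_minister_of", "New Zealand")}
model = {("Ardern", "prime_minister_of", "Australia")}
print(fact_precision(model, reference))  # 0.0
```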
Natural Language Inference as Evaluation
Framework.
Some researchers have studied the use of natural language
inference (NLI) (Bowman et al. 2015), or textual entail-
ment (Dagan, Glickman, and Magnini 2005), as a means of
assessing factualness. The underlying premise is that sum-
maries which are logically entailed by the source, or a ref-
erence summary, are less likely to contain factual errors.
(Pasunuru and Bansal 2018) find that entailment is com-
plementary to ROUGE as part of the reward in a reinforce-
ment learning objective. Others have explored entailment in
the context of re-ranking (Falke et al. 2019; Welleck et al.
2018).
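A minimal sketch of entailment-based scoring with an off-the-shelf NLI checkpoint from the transformers library is given below; the particular checkpoint, sentence-level usage, and lack of thresholding are assumptions rather than the setup of any one cited paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any NLI checkpoint works; roberta-large-mnli is one commonly used example.
MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def entailment_probability(premise, hypothesis):
    """Probability that the premise (source or reference text) entails the
    hypothesis (e.g., a generated summary sentence)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Read the entailment index from the config rather than hard-coding it.
    entail_idx = next(i for i, lbl in model.config.id2label.items()
                      if "entail" in lbl.lower())
    return probs[entail_idx].item()

print(entailment_probability("The company reported record profits in 2019.",
                             "Profits hit a record in 2019."))
```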
Reference Free Evaluations
A major bottleneck to some of the metrics discussed above
is their reliance upon high-quality references, which can be
difficult to obtain, and often encode idiosyncratic dataset
curation biases. There are generally many ways to craft a
strong summary, of which a small set of references repre-
sents a noisy sample. Most of the summarization corpora
involve news articles, for which multiple references are rel-
atively easy to obtain. As the field diversifies into lower-
resource domains, this dependence grows more prob-
lematic. Rather than depend on a noisy reference, reference-
less evaluations directly compare the source document to
a model generated summary. Highlight-based ROUGE (H-
ROUGE) obviates the need for reference summaries by ask-
ing humans to evaluate summaries against manually high-
lighted salient parts of the source document (Narayan, Vla-
chos, and others 2019). This requires manual highlights of
the source as well as manual judgment of summaries against
the highlights. SUPERT rates the quality of a summary by
its semantic similarity with a pseudo-reference summary,
i.e., salient sentences selected from the source documents,
using contextualized embeddings and soft token alignment
techniques (Gao, Zhao, and Eger 2020). (ShafieiBavani et
al. 2018) identify desired summary attributes: distributional
semantic similarity, topical relevance, coherence, and nov-
elty, and derive automatic measures to calculate scores based
on contextualized embeddings. As part of the topical rele-
vance score, they produce a topic-aware embedding for both
source and summary based on topic distributions from La-
tent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003).
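The topical-relevance component can be sketched roughly as below with scikit-learn's LDA implementation, comparing inferred topic distributions for source and summary. This is an assumption-laden approximation of the idea, not the exact procedure of (ShafieiBavani et al. 2018), which also incorporates contextualized embeddings.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_similarity(source, summary, background_docs, n_topics=10):
    """Fit LDA on a background corpus, then compare the inferred topic
    distributions of the source and the summary via cosine similarity.
    A rough sketch of the topical-relevance idea only."""
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(background_docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    theta = lda.transform(vectorizer.transform([source, summary]))
    src, summ = theta[0], theta[1]
    return float(src @ summ / (np.linalg.norm(src) * np.linalg.norm(summ)))
```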
References
Banerjee, S., and Lavie, A. 2005. Meteor: An automatic
metric for mt evaluation with improved correlation with hu-
man judgments. In Proceedings of the acl workshop on in-
trinsic and extrinsic evaluation measures for machine trans-
lation and/or summarization, 65–72.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent
dirichlet allocation. Journal of Machine Learning Research
3(Jan):993–1022.
Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D.
2015. A large annotated corpus for learning natural language
inference. arXiv preprint arXiv:1508.05326.
Chen, P.; Wu, F.; Wang, T.; and Ding, W. 2018. A seman-
tic qa-based approach for text summarization evaluation. In
Thirty-Second AAAI Conference on Artificial Intelligence.
Dagan, I.; Glickman, O.; and Magnini, B. 2005. The pas-
cal recognising textual entailment challenge. In Machine
Learning Challenges Workshop, 177–190. Springer.
Dang, H. T. 2005. Overview of duc 2005. In Proceedings
of the document understanding conference, volume 2005, 1–
12.
Demszky, D.; Guu, K.; and Liang, P. 2018. Transforming
question answering datasets into natural language inference
datasets. arXiv preprint arXiv:1809.02922.
Durmus, E.; He, H.; and Diab, M. 2020. Feqa: A ques-
tion answering evaluation framework for faithfulness as-
sessment in abstractive summarization. arXiv preprint
arXiv:2005.03754.
Eyal, M.; Baumel, T.; and Elhadad, M. 2019. Question
answering as an automatic evaluation metric for news article
summarization. arXiv preprint arXiv:1906.00318.
Falke, T.; Ribeiro, L. F.; Utama, P. A.; Dagan, I.; and
Gurevych, I. 2019. Ranking generated summaries by cor-
rectness: An interesting but challenging application for nat-
ural language inference. In Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics,
2214–2220.
Gao, Y.; Zhao, W.; and Eger, S. 2020. Supert:
Towards new frontiers in unsupervised evaluation met-
rics for multi-document summarization. arXiv preprint
arXiv:2005.03724.
Goodrich, B.; Rao, V.; Liu, P. J.; and Saleh, M. 2019. As-
sessing the factual accuracy of generated text. In Proceed-
ings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, 166–175.
Hovy, E. H.; Lin, C.-Y.; Zhou, L.; and Fukumoto, J. 2006.
Automated summarization evaluation with basic elements.
In LREC, volume 6, 899–902. Citeseer.
Kryściński, W.; McCann, B.; Xiong, C.; and Socher, R.
2019. Evaluating the factual consistency of abstractive text
summarization. arXiv preprint arXiv:1910.12840.
Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mo-
hamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L.
2019. Bart: Denoising sequence-to-sequence pre-training
for natural language generation, translation, and comprehen-
sion. arXiv preprint arXiv:1910.13461.
Lin, C.-Y., and Hovy, E. 2003. Automatic evaluation of
summaries using n-gram co-occurrence statistics. In Pro-
ceedings of the 2003 Human Language Technology Confer-
ence of the North American Chapter of the Association for
Computational Linguistics, 150–157.
Louis, A., and Nenkova, A. 2013. Automatically assessing
machine summary content without a gold standard. Compu-
tational Linguistics 39(2):267–300.
Maynez, J.; Narayan, S.; Bohnet, B.; and McDonald, R.
2020. On faithfulness and factuality in abstractive summa-
rization. arXiv preprint arXiv:2005.00661.
Narayan, S.; Vlachos, A.; et al. 2019. Highres: Highlight-
based reference-less evaluation of summarization. arXiv
preprint arXiv:1906.01361.
Nenkova, A.; Passonneau, R.; and McKeown, K. 2007.
The pyramid method: Incorporating human content selec-
tion variation in summarization evaluation. ACM Transac-
tions on Speech and Language Processing (TSLP) 4(2):4–es.
Novikova, J.; Dušek, O.; Curry, A. C.; and Rieser, V. 2017.
Why we need new evaluation metrics for nlg. arXiv preprint
arXiv:1707.06875.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002.
Bleu: a method for automatic evaluation of machine transla-
tion. In Proceedings of the 40th annual meeting on associa-
tion for computational linguistics, 311–318. Association for
Computational Linguistics.
Passonneau, R. J.; Nenkova, A.; McKeown, K.; and Sigel-
man, S. 2005. Applying the pyramid method in duc 2005.
In Proceedings of the document understanding conference
(DUC 05), Vancouver, BC, Canada.
Pasunuru, R., and Bansal, M. 2018. Multi-reward reinforced
summarization with saliency and entailment. arXiv preprint
arXiv:1804.06451.
Schluter, N. 2017. The limits of automatic summarisation
according to rouge. In Proceedings of the 15th Conference of
the European Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, 41–45.
Scialom, T.; Lamprier, S.; Piwowarski, B.; and Staiano, J.
2019. Answers unite! unsupervised metrics for reinforced
summarization models. arXiv preprint arXiv:1909.01610.
ShafieiBavani, E.; Ebrahimi, M.; Wong, R.; and Chen, F.
2018. Summarization evaluation in the absence of human
model summaries using the compositionality of word em-
beddings. In Proceedings of the 27th International Confer-
ence on Computational Linguistics, 905–914.
Welleck, S.; Weston, J.; Szlam, A.; and Cho, K. 2018.
Dialogue natural language inference. arXiv preprint
arXiv:1811.00671.
Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi,
Y. 2019. Bertscore: Evaluating text generation with bert.
arXiv preprint arXiv:1904.09675.
