Sending out an SOS (Summary of Summaries):
A Brief Survey of Recent Work on Abstractive Summarization Evaluation
Griffin Adams
Columbia University, New York, NY, USA
Abstract
Research on the evaluation of abstractive summarization
models has evolved considerably in the last few years. To take
stock of these changes, namely, the shift from n-gram overlap
to fact-based assessments, I have written a brief summary of
papers on evaluation metrics, most of which focus on the
period from 2018 to mid-2020. The paper is written in a terse
literature review format, so as to aid researchers when craft-
ing related works sections of papers on summary evaluation.
Introduction
While recent progress on abstractive summarization has led
to remarkably fluent summaries, factual errors in generated
summaries still severely limit their use in practice. Recent
work has moved away from n-gram overlap based metrics,
such as ROUGE and BLEU, to semantic-inspired metrics
which assess a model’s faithfulness to the source text. In the
sections below, I trace this arc by discussing relevant papers.
Please cite this paper if used as a guide in your research.
Related Work
Manual Evaluation.
For decades, summarization systems were purely extractive
and, as an area of research, summarization was cast as a
sentence selection problem. Metrics capturing exact match
overlap, such as precision and recall, were the de facto stan-
dard for evaluation. Then, the field started to shift atten-
tion to abstractive summarization, a more natural, human-
like method for transducing document content. Given the
large variance in language generation, from an evaluation
standpoint, the focus shifted from surface form sentential
overlap to semantic equivalence at varying granularity. The
most prominent example, the Pyramid Method, proposes a
method for manual identification of semantic equivalence
based on Summary Content Units (SCUs) (Passonneau et al.
2005; Nenkova, Passonneau, and McKeown 2007). SCUs
are semantic units formed by clustering semantically equiv-
alent text spans from multiple reference summaries. SCUs
form a pyramid (by exhibiting a Zipfian distribution) based
on the number of supporting references, i.e., facts abstracted
by many references are presumed to hold more weight and
are placed at the top of the pyramid. Evaluation with the
Pyramid Scheme, then, targets semantic equivalence at the
subsentential level, weighted by popularity. While also se-
mantically motivated, the Responsiveness score targets a
more holistic measure of relevance (Dang 2005). Respon-
siveness resembles information retrieval objectives in that
it ranks summaries by how much they satisfy information
needs. Information needs are expressed by discrete topic
statements that cover different aspects of the source.
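To make the Pyramid Method's weighting concrete, the toy sketch below scores a candidate summary against hypothetical SCU annotations. It is a simplified illustration rather than the original protocol, and all SCU labels and weights are invented.

```python
# Simplified sketch of Pyramid-style content scoring (illustrative only).
# Each SCU's weight is the number of reference summaries supporting it; a
# summary's score is the weight it recovers divided by the best weight
# achievable with the same number of SCUs.

def pyramid_score(recovered_scus, scu_weights, num_scus_in_summary):
    achieved = sum(scu_weights[scu] for scu in recovered_scus)
    # An ideal summary of the same size would take the heaviest SCUs first.
    best = sum(sorted(scu_weights.values(), reverse=True)[:num_scus_in_summary])
    return achieved / best if best else 0.0

# Hypothetical SCUs with weights = number of supporting references.
scu_weights = {"team_won_title": 4, "coach_resigned": 2, "stadium_location": 1}
print(pyramid_score({"team_won_title", "stadium_location"}, scu_weights, 2))  # ~0.83
```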
Automatic Span Overlap.
Manual assessment of summaries is effective yet time con-
suming. (Hovy et al. 2006) provide an automated corol-
lary to the Pyramid Scheme, replacing manually identi-
fied SCUs with syntactic constituents called Basic Elements
(BE), and matching BEs through hard and soft alignment.
The most commonly used metrics for automatically evalu-
ating generated language are BLEU (Papineni et al. 2002),
ROUGE (Lin and Hovy 2003), and METEOR (Banerjee and
Lavie 2005). Simpler than Basic Elements, they measure lo-
cally constrained n-gram overlap. BLEU captures precision
while ROUGE introduces recall, in addition to a more flex-
ible n-gram metric: ROUGE-L defined over longest com-
mon subsequences (LCS). Designed specifically for MT sys-
tems, METEOR adds a measure of fragmentation because
word alignment is often integral to an accurate translation.
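For reference, the toy function below sketches ROUGE-N recall as clipped n-gram overlap; production implementations (the official ROUGE toolkit and its reimplementations) additionally handle stemming, ROUGE-L, and multiple references.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    # ROUGE-N recall: clipped n-gram matches over total reference n-grams.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat"))  # 0.6
```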
BERTScore leverages the rise to prominence of deep contex-
tualized embeddings to extend ROUGE and BLEU scores.
More specifically, BERTScore replaces exact match n-gram
overlap with soft alignment of contextualized BERT embed-
dings (Zhang et al. 2019). It is shown to correlate more with
human judgment by being semantically driven and more for-
giving to surface lexical divergences between reference and
model summary. ROUGESal offers a simple enhancement to
the ROUGE score: upweight salient phrases as determined
by a keyphrase classifier (Pasunuru and Bansal 2018).
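The mechanism behind BERTScore can be sketched as greedy soft alignment over contextual token embeddings. The snippet below uses random placeholder vectors where BERT outputs would go, so it only illustrates the matching step, not the full metric (which also supports IDF weighting and baseline rescaling).

```python
import numpy as np

def greedy_soft_overlap(cand_emb, ref_emb):
    """BERTScore-style greedy matching over contextual token embeddings.

    cand_emb, ref_emb: (num_tokens, dim) arrays, which in the real metric come
    from a contextual encoder such as BERT (placeholder random vectors below).
    """
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                    # pairwise cosine similarities
    precision = sim.max(axis=1).mean()    # each candidate token -> best reference match
    recall = sim.max(axis=0).mean()       # each reference token -> best candidate match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

rng = np.random.default_rng(0)
print(greedy_soft_overlap(rng.normal(size=(6, 768)), rng.normal(size=(8, 768))))
```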
Limitations of BLEU & ROUGE.
While BLEU and ROUGE have been shown to correlate
with human judgment, recent work has uncovered several
serious limitations (Novikova et al. 2017). (Schluter 2017)
questions whether a perfect ROUGE score indicates a perfect
summary, and provides the first NP-hardness proof for global
optimisation with ROUGE. Notably, it has been demon-
strated that the correlation of BLEU/ROUGE to human
judgment degrades as the number of summary references
declines (Louis and Nenkova 2013). A recent strand of liter-
ature has identified that summaries with a high BLEU or
ROUGE score can mask serious deficiencies, particularly
with respect to misrepresentation of facts (Maynez et al.
2020). (Maynez et al. 2020) introduce the concept of a hal-
lucination as a span of generated text not supported by the
input document. No hallucination is considered faithful to
the original text, but a hallucination may still be factual. A
factual hallucination is a span of text not directly supported
by the source text that can nonetheless be supported as a fact
by other knowledge sources. Depending on the intended audi-
ence, it may be desirable to integrate background knowledge
into summaries. For instance, it may be apropos to define
an interception if included as part of a summary of a foot-
ball game. It is possible, then, to be factual without being
faithful, but it is not usually possible to be faithful without
being factual. If a summary is not factual, it is only faithful
if there is an error in the source.
By and large, two strands of extrinsic evaluations have
been proposed to directly assess the faithfulness of a sum-
mary to its source: question answering and directed logical
entailment.
Question Answering as Evaluation Framework.
Some recent literature posits that, given salient questions,
a relevant summary should produce answers similar to those
of the source. These evaluations rely on the availability, or
ability to generate, relevant questions and a reliable QA sys-
tem from which to extract answers. For question genera-
tion, APES identifies and masks named entities in a refer-
ence summary to create cloze-style fill-in-the-blank ques-
tions which must be predicted correctly when conditioning
on the model summary (Eyal, Baumel, and Elhadad 2019).
(Chen et al. 2018) automatically generate “WH” factoid
questions. “Answers Unite!” offers a reference-free alterna-
tive to APES by extending the QA task to the source doc-
ument rather than a reference (Scialom et al. 2019). Obvi-
ating the need for human-generated references enables self-
supervision, and they demonstrate that augmented training
data leads to better overall performance. These metrics re-
quire generating questions from either the source or refer-
ence summary. (Durmus, He, and Diab 2020) reverse the
order and condition on model summaries when generating
questions. They train a question generator by fine-tuning a
pretrained BART language model (Lewis et al. 2019) on a
modified version of the QA2D dataset (Demszky, Guu, and
Liang 2018). Given a cloze-style masked sequence, the gen-
erator learns a question whose answer is the masked span.
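As a rough sketch of the cloze-construction step used by APES-style metrics, the snippet below masks named entities in a reference summary with spaCy; the masking token, example text, and spaCy model choice are illustrative assumptions. A separate QA model conditioned on the model summary would then be asked to recover each answer.

```python
import spacy  # assumes the en_core_web_sm model has been downloaded

nlp = spacy.load("en_core_web_sm")

def cloze_questions(reference_summary):
    """Mask each named entity in the reference to form a cloze question.

    Returns (question, answer) pairs; a QA system reading the model summary
    is then scored on how many blanks it fills correctly.
    """
    doc = nlp(reference_summary)
    pairs = []
    for ent in doc.ents:
        question = (reference_summary[:ent.start_char] + "[MASK]"
                    + reference_summary[ent.end_char:])
        pairs.append((question, ent.text))
    return pairs

for question, answer in cloze_questions("Serena Williams beat Maria Sharapova in Paris."):
    print(question, "->", answer)
```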
For the discussed QA-style evaluation metrics, question
generation is a necessary component to assessing factual-
ness. More broadly, however, when questions are generated
by an external model, this introduces an extra source of er-
rors, which propagate through to the answering system. The
questions merely demarcate specific information-seeking
needs, i.e., question generation may be a non-essential, replaceable step.
(Goodrich et al. 2019) remove this key intermediary task
by defining factual accuracy as the precision of relation tuples
extracted from the model summary against those extracted
from the reference summary. In part to provide training data
for the fact extractor, they create a
new dataset for fact extraction using distant supervision on
Wikipedia text by cross referencing facts from the Wikidata
knowledge base. Likewise, (Kryściński et al. 2019) propose
a weakly supervised, model-based approach to assess factual
consistency that does not involve generating questions. They
pre-train a fact checker on a synthetic fact dataset by first
selecting sentences from the source document, then para-
phrasing them to form positive examples and perturbing them
(negation, entity replacement, etc.) to form negative examples.
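A minimal sketch of the tuple-precision idea from (Goodrich et al. 2019) is shown below; the (subject, relation, object) tuples are hypothetical and would in practice come from their trained fact extractor.

```python
def fact_precision(model_tuples, reference_tuples):
    """Factual accuracy as precision over (subject, relation, object) tuples,
    in the spirit of Goodrich et al. (2019). The tuples themselves would come
    from a separately trained fact-extraction model."""
    model_tuples, reference_tuples = set(model_tuples), set(reference_tuples)
    if not model_tuples:
        return 0.0
    return len(model_tuples & reference_tuples) / len(model_tuples)

# Hypothetical extractions: the model summary hallucinates the object slot.
reference = {("Ardern", "prime_minister_of", "New Zealand")}
model = {("Ardern", "prime_minister_of", "Australia")}
print(fact_precision(model, reference))  # 0.0
```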
Natural Language Inference as Evaluation
Framework.
Some researchers have studied the use of natural language
inference (NLI) (Bowman et al. 2015), or textual entail-
ment (Dagan, Glickman, and Magnini 2005), as a means of
assessing factualness. The underlying premise is that sum-
maries which are logically entailed by the source, or a ref-
erence summary, are less likely to contain factual errors.
(Pasunuru and Bansal 2018) find that entailment is com-
plementary to ROUGE as part of the reward in a reinforce-
ment learning objective. Others have explored entailment in
the context of re-ranking (Falke et al. 2019; Welleck et al.
2018).
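A minimal sketch of entailment-based scoring with an off-the-shelf NLI checkpoint from the transformers library is given below; the particular checkpoint, sentence-level usage, and lack of thresholding are assumptions rather than the setup of any one cited paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any NLI checkpoint works; roberta-large-mnli is one commonly used example.
MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def entailment_probability(premise, hypothesis):
    """Probability that the premise (source or reference text) entails the
    hypothesis (e.g., a generated summary sentence)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Read the entailment index from the config rather than hard-coding it.
    entail_idx = next(i for i, lbl in model.config.id2label.items()
                      if "entail" in lbl.lower())
    return probs[entail_idx].item()

print(entailment_probability("The company reported record profits in 2019.",
                             "Profits hit a record in 2019."))
```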
Reference Free Evaluations
A major bottleneck to some of the metrics discussed above
is their reliance upon high-quality references, which can be
difficult to obtain, and often encode idiosyncratic dataset
curation biases. There are generally many ways to craft a
strong summary, of which a small set of references repre-
sents a noisy sample. Most of the summarization corpora
involve news articles, for which multiple references are rel-
atively easy to obtain. As the field diversifies into lower-
resource domains, this dependence grows more prob-
lematic. Rather than depend on a noisy reference, reference-
less evaluations directly compare the source document to
a model generated summary. Highlight-based ROUGE (H-
ROUGE) obviates the need for reference summaries by ask-
ing humans to evaluate summaries against manually high-
lighted salient parts of the source document (Narayan, Vla-
chos, and others 2019). This requires manual highlights of
the source as well as manual judgment of summaries against
the highlights. SUPERT rates the quality of a summary by
its semantic similarity with a pseudo-reference summary,
i.e., salient sentences selected from the source documents,
using contextualized embeddings and soft token alignment
techniques (Gao, Zhao, and Eger 2020). (ShafieiBavani et
al. 2018) identify desired summary attributes: distributional
semantic similarity, topical relevance, coherence, and nov-
elty, and derive automatic measures to calculate scores based
on contextualized embeddings. As part of the topical rele-
vance score, they produce a topic-aware embedding for both
source and summary based on topic distributions from La-
tent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003).
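The topical-relevance component can be sketched roughly as below with scikit-learn's LDA implementation, comparing inferred topic distributions for source and summary. This is an assumption-laden approximation of the idea, not the exact procedure of (ShafieiBavani et al. 2018), which also incorporates contextualized embeddings.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_similarity(source, summary, background_docs, n_topics=10):
    """Fit LDA on a background corpus, then compare the inferred topic
    distributions of the source and the summary via cosine similarity.
    A rough sketch of the topical-relevance idea only."""
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(background_docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    theta = lda.transform(vectorizer.transform([source, summary]))
    src, summ = theta[0], theta[1]
    return float(src @ summ / (np.linalg.norm(src) * np.linalg.norm(summ)))
```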
References
Banerjee, S., and Lavie, A. 2005. Meteor: An automatic
metric for mt evaluation with improved correlation with hu-
man judgments. In Proceedings of the acl workshop on in-
trinsic and extrinsic evaluation measures for machine trans-
lation and/or summarization, 65–72.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent
dirichlet allocation. Journal of Machine Learning Research
3(Jan):993–1022.
Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D.
2015. A large annotated corpus for learning natural language
inference. arXiv preprint arXiv:1508.05326.
Chen, P.; Wu, F.; Wang, T.; and Ding, W. 2018. A seman-
tic qa-based approach for text summarization evaluation. In
Thirty-Second AAAI Conference on Artificial Intelligence.
Dagan, I.; Glickman, O.; and Magnini, B. 2005. The pas-
cal recognising textual entailment challenge. In Machine
Learning Challenges Workshop, 177–190. Springer.
Dang, H. T. 2005. Overview of duc 2005. In Proceedings
of the document understanding conference, volume 2005, 1–
12.
Demszky, D.; Guu, K.; and Liang, P. 2018. Transforming
question answering datasets into natural language inference
datasets. arXiv preprint arXiv:1809.02922.
Durmus, E.; He, H.; and Diab, M. 2020. Feqa: A ques-
tion answering evaluation framework for faithfulness as-
sessment in abstractive summarization. arXiv preprint
arXiv:2005.03754.
Eyal, M.; Baumel, T.; and Elhadad, M. 2019. Question
answering as an automatic evaluation metric for news article
summarization. arXiv preprint arXiv:1906.00318.
Falke, T.; Ribeiro, L. F.; Utama, P. A.; Dagan, I.; and
Gurevych, I. 2019. Ranking generated summaries by cor-
rectness: An interesting but challenging application for nat-
ural language inference. In Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics,
2214–2220.
Gao, Y.; Zhao, W.; and Eger, S. 2020. Supert:
Towards new frontiers in unsupervised evaluation met-
rics for multi-document summarization. arXiv preprint
arXiv:2005.03724.
Goodrich, B.; Rao, V.; Liu, P. J.; and Saleh, M. 2019. As-
sessing the factual accuracy of generated text. In Proceed-
ings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, 166–175.
Hovy, E. H.; Lin, C.-Y.; Zhou, L.; and Fukumoto, J. 2006.
Automated summarization evaluation with basic elements.
In LREC, volume 6, 899–902. Citeseer.
Kryściński, W.; McCann, B.; Xiong, C.; and Socher, R.
2019. Evaluating the factual consistency of abstractive text
summarization. arXiv preprint arXiv:1910.12840.
Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mo-
hamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L.
2019. Bart: Denoising sequence-to-sequence pre-training
for natural language generation, translation, and comprehen-
sion. arXiv preprint arXiv:1910.13461.
Lin, C.-Y., and Hovy, E. 2003. Automatic evaluation of
summaries using n-gram co-occurrence statistics. In Pro-
ceedings of the 2003 Human Language Technology Confer-
ence of the North American Chapter of the Association for
Computational Linguistics, 150–157.
Louis, A., and Nenkova, A. 2013. Automatically assessing
machine summary content without a gold standard. Compu-
tational Linguistics 39(2):267–300.
Maynez, J.; Narayan, S.; Bohnet, B.; and McDonald, R.
2020. On faithfulness and factuality in abstractive summa-
rization. arXiv preprint arXiv:2005.00661.
Narayan, S.; Vlachos, A.; et al. 2019. Highres: Highlight-
based reference-less evaluation of summarization. arXiv
preprint arXiv:1906.01361.
Nenkova, A.; Passonneau, R.; and McKeown, K. 2007.
The pyramid method: Incorporating human content selec-
tion variation in summarization evaluation. ACM Transac-
tions on Speech and Language Processing (TSLP) 4(2):4–es.
Novikova, J.; Dušek, O.; Curry, A. C.; and Rieser, V. 2017.
Why we need new evaluation metrics for nlg. arXiv preprint
arXiv:1707.06875.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002.
Bleu: a method for automatic evaluation of machine transla-
tion. In Proceedings of the 40th annual meeting on associa-
tion for computational linguistics, 311–318. Association for
Computational Linguistics.
Passonneau, R. J.; Nenkova, A.; McKeown, K.; and Sigel-
man, S. 2005. Applying the pyramid method in duc 2005.
In Proceedings of the document understanding conference
(DUC 05), Vancouver, BC, Canada.
Pasunuru, R., and Bansal, M. 2018. Multi-reward reinforced
summarization with saliency and entailment. arXiv preprint
arXiv:1804.06451.
Schluter, N. 2017. The limits of automatic summarisation
according to rouge. In Proceedings of the 15th Conference of
the European Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, 41–45.
Scialom, T.; Lamprier, S.; Piwowarski, B.; and Staiano, J.
2019. Answers unite! unsupervised metrics for reinforced
summarization models. arXiv preprint arXiv:1909.01610.
ShafieiBavani, E.; Ebrahimi, M.; Wong, R.; and Chen, F.
2018. Summarization evaluation in the absence of human
model summaries using the compositionality of word em-
beddings. In Proceedings of the 27th International Confer-
ence on Computational Linguistics, 905–914.
Welleck, S.; Weston, J.; Szlam, A.; and Cho, K. 2018.
Dialogue natural language inference. arXiv preprint
arXiv:1811.00671.
Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi,
Y. 2019. Bertscore: Evaluating text generation with bert.
arXiv preprint arXiv:1904.09675.
