Automatically Detecting Scientific Misinformation
Isabelle Augenstein*
augenstein@di.ku.dk
@IAugenstein
http://isabelleaugenstein.github.io/
*partial slide credit: Dustin Wright
CONSTRAINT Workshop @ ACL, 27 May 2022
Introduction
Supporting the Life Cycle of Research
26/05/2022 3
[Diagram: the life cycle of research. Stages: Information Discovery, Conducting Experiments, Paper Writing, Peer Review, Research Impact Tracking. Supporting tasks: Information Extraction, Summarisation, Citation Prediction, Writing Assistance, Reviewing Support, Reviewer Matching, Review Score Prediction, Citation Analysis, Citation Trend Analysis]
Fact Checking
Focus on veracity
What about more subtle
forms of misinformation?
[Diagram: a press release propagating to news outlets (BBC, Daily Mail, The Express, etc.)]
Credibility and Veracity of Science Communication
Knowledge-Intensive Tasks
[Diagram: CLAIMS linked to EVIDENCE via factuality, faithfulness, credit assignment, etc.]
Goal: automatically ensure the credibility and veracity of scientific information
Fundamental Unit of Information: Claims
• Focus: tasks involving scientific claims, i.e. assertions
about the world which are factual in nature
Current treatment options for ALS are based on symptom
management and respiratory support with the only approved
medications in widespread use, Riluzole and Edaravone,
providing only modest benefits and only in some patients.
• Challenges
• Supervised learning is hard: annotation is expensive, requiring
domain experts
• Language used is diverse across fields
• Different modalities
• Scientific claims are complex
• Meta-data also important
Overview of Today’s Talk
• Introduction
• The Life Cycle of Scientific Research
• Part 1: Claim detection and generation
• Cite-worthiness detection
• Scientific claim generation for zero-shot scientific fact checking
• Part 2: Exaggeration Detection
• Exaggeration detection of health science press releases
• Conclusion
• Future research challenges
CiteWorth: Cite-Worthiness Detection
for Improved Scientific Document
Understanding
Dustin Wright, Isabelle Augenstein
ACL 2021 (Findings)
Claims Should be Properly Credited
Current treatment options for ALS are based on symptom
management and respiratory support with the only approved
medications in widespread use, Riluzole and Edaravone,
providing only modest benefits and only in some patients.
Cite-worthiness detection
Citances in Machine Learning
We use the model from the original BERT paper (Devlin et al. 2019).
Cite-worthiness: Is this a citance? Yes
Recommendation: What paper should be cited? Devlin et al. (2019)
Influence: Was this an influential paper? Yes
Intent: What is the purpose of the citation? Method
Cite-Worthiness Datasets
• Tend to be small and limited to only a few domains
(e.g. Computer Science)
• No attention paid to how clean the data is
We use the model from Devlin et al. (2019) as a baseline.
e.g. ungrammatical phrases
Our Research Questions
• How can a dataset for cite-worthiness detection be
automatically curated with low noise?
• What methods are most effective for automatically
detecting cite-worthy sentences?
• How does domain affect learning cite-worthiness
detection?
• Can large scale cite-worthiness data be used to
perform transfer learning to downstream scientific text
tasks?
CiteWorth: Dataset Curation
RQ1: How can a dataset for cite-worthiness detection be automatically curated with low noise?
• Source data: S2ORC1 – millions of extracted scientific documents from Semantic Scholar
• We limit citances as follows:
• Parenthetical author/year and bracketed numerical citations only
• Citations must be at the end of a sentence
Example: “We use the model from the original BERT paper (Devlin et al. 2019).” / “We use the model from the original BERT paper [1].”
1. https://github.com/allenai/s2orc
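The two filtering rules above can be sketched as a small amount of regex code. This is a hypothetical illustration, not the actual CiteWorth pipeline; the function name and patterns are ours:

```python
import re

# Accept only parenthetical author/year citations, e.g. "(Devlin et al. 2019)",
# or bracketed numerical citations, e.g. "[1]", at the end of a sentence;
# strip the marker to obtain a clean cite-worthy example.
AUTHOR_YEAR = r"\((?:[A-Z][\w'-]+(?:\s+et\s+al\.)?,?\s+\d{4}[a-z]?(?:;\s*)?)+\)"
BRACKETED = r"\[\d+(?:\s*,\s*\d+)*\]"
TRAILING_CITATION = re.compile(
    r"\s*(?:" + AUTHOR_YEAR + "|" + BRACKETED + r")\s*([.?!])?\s*$"
)

def extract_citance(sentence: str):
    """Return (is_citeworthy, sentence with the trailing citation removed)."""
    match = TRAILING_CITATION.search(sentence)
    if match is None:
        return False, sentence
    cleaned = sentence[: match.start()].rstrip() + (match.group(1) or ".")
    return True, cleaned
```

A sentence with a mid-sentence citation would simply be discarded by this sketch, which is one way the end-of-sentence restriction keeps the cleaned text grammatical.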
CiteWorth Final Dataset
• 1,181,793 sentences
• 10 different fields, 20,000+ paragraphs per field
• Much cleaner than a naive baseline which only
removes citation text based on gold spans
RQ1: How can a dataset for cite-worthiness detection be automatically curated with low noise?
Method | Sentences Clean (%) | Citation Markers Removed (%)
Naive Baseline | 92.07 | 92.78
CiteWorth (Ours) | 98.90 | 98.10
Predicting on Individual Sentences
Can context improve performance?
Method | P | R | F1
Logistic Regression | 46.65 | 64.88 | 54.28
CRNN | 50.87 | 62.21 | 55.97
Transformer | 47.92 | 71.59 | 57.39
BERT | 55.04 | 69.02 | 61.23
SciBERT | 57.03 | 68.08 | 62.06
[Figure: inputXGrad* explanations for hard (low-confidence) and easy (high-confidence) examples]
RQ2: What methods are most effective for automatically detecting cite-worthy sentences?
* Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, and Sven Dähne. 2016. Investigating the Influence of Noise and
Distractors on the Interpretation of Neural Networks. arXiv preprint arXiv:1611.07270.
Predicting Multiple Sentences at Once
Are there variations across fields?
RQ2: What methods are most effective for automatically detecting cite-worthy sentences?
Longformer*
[Diagram: the whole paragraph is encoded as one sequence, [CLS] s1 [SEP] s2 [SEP] …; each sentence's token representations are pooled and classified separately]
Method | P | R | F1
SciBERT | 57.03 | 68.08 | 62.06
Longformer-Solo | 57.21 | 68.00 | 62.14
Longformer-Ctx | 59.92 | 77.15 | 67.45 (Δ 5 pts)
* Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. CoRR, abs/2004.05150.
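The input packing used by Longformer-Ctx can be sketched as a plain string operation (real models operate on token ids and pool hidden states; this shows only the input layout, and the helper name is ours):

```python
def pack_paragraph(sentences):
    """Serialise a paragraph as '[CLS] s1 [SEP] s2 [SEP] ... [SEP]' so that
    each sentence can be classified with access to its full paragraph
    context, rather than in isolation."""
    return "[CLS] " + " [SEP] ".join(sentences) + " [SEP]"
```

Each [SEP]-delimited span is then pooled and classified independently, which is what gives Longformer-Ctx its context advantage over sentence-at-a-time models.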
Domain Effects
Train \ Test | Ch | E | CS | P | B
Ch | 67.58 | 58.41 | 56.86 | 62.35 | 68.23
E | 66.62 | 60.25 | 60.11 | 64.02 | 68.07
CS | 65.05 | 59.36 | 61.99 | 63.85 | 66.72
P | 65.49 | 58.03 | 56.69 | 65.10 | 68.27
B | 66.59 | 58.80 | 58.22 | 64.54 | 69.12
RQ3: How does domain affect learning cite-worthiness detection?
Average-pooled BERT representations of sentences, clustered using GMM (Roee Aharoni and Yoav Goldberg. 2020. Unsupervised Domain Clusters in Pretrained Language Models. In ACL.)
Longformer-Ctx performance in cross-domain evaluation
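The clustering analysis above can be illustrated in a few lines. This is a toy sketch, not the paper's code: random vectors stand in for real BERT token embeddings, and scikit-learn's GaussianMixture plays the role of the GMM:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    # (num_tokens, hidden_dim) -> (hidden_dim,): average over tokens,
    # as in the average-pooled sentence representations above.
    return token_embeddings.mean(axis=0)

# Two synthetic "domains": pooled sentence vectors drawn around
# different centres stand in for two scientific fields.
domain_a = [mean_pool(rng.normal(0.0, 0.1, size=(12, 8))) for _ in range(20)]
domain_b = [mean_pool(rng.normal(5.0, 0.1, size=(12, 8))) for _ in range(20)]
X = np.stack(domain_a + domain_b)

# Fit a two-component GMM; on well-separated domains the components
# recover the domain split without any labels.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
```

On real data the clusters are less clean, which is exactly what the cross-domain table above quantifies.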
Generating Scientific Claims for Zero-
Shot Scientific Fact Checking
Dustin Wright, Dave Wadden, Kyle Lo, Bailey Kuehl, Isabelle Augenstein, Lucy Lu Wang
ACL 2022
Claims Should be Factually Correct
Current treatment options for ALS are based on symptom
management and respiratory support with the only approved
medications in widespread use, Riluzole and Edaravone,
providing only modest benefits and only in some patients.
ALS cannot be treated by Riluzole.
Scientific claim generation for zero-shot
scientific fact checking
Citances in Scientific Fact Checking
● Scientific fact checking data is difficult to collect --
where to collect claims?
● Previous work: SciFact
○ Crowdsource claims from citances
○ Requires manually rewriting complex claims into atomic
claims
Australian statistics show that 1 in 7 young people have an
anxiety disorder and 1 in 16 have depression
Australian statistics show that 1 in 7 young people
have an anxiety disorder
Australian statistics show that 1 in 16 young
people have depression
Individual claims are “atomic verifiable statements expressing a finding about one aspect of a scientific entity or process, which can be verified from a single source”
ZSSFC (ACL 2022)
Research Questions
1. How can we generate claims from citances that are useful for
scientific fact checking?
➢ Citances -- sentences which have references to external sources
➢ Useful for fact checking because they contain a link to evidence
which can be used to verify their content
2. What methods can generate claims which are fluent, faithful,
atomic, and de-contextualized?
3. In what situations do generated claims help improve
performance on zero-shot scientific fact checking?
Methods: ClaimGen-Entity
[Pipeline: the citance passes through Named Entity Recognition (scispacy) to extract entities a1..a3; a Question Generation model Mq (BART) produces a question per entity; a Claim Generation model Mc (BART) turns each question/answer pair into a declarative claim C1..C3]
Methods: ClaimGen-BART
Input format: “{CONTEXT} || {CITANCE} || {CLAIM}”
Sample multiple claims at test time
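The serialisation format above can be sketched as a small helper (the function name is ours, not from the paper's code):

```python
from typing import Optional

def build_claimgen_example(context: str, citance: str,
                           claim: Optional[str] = None) -> str:
    """Serialise one ClaimGen-BART example as '{CONTEXT} || {CITANCE} || {CLAIM}'.

    At training time the target claim follows the second '||'; at test
    time the claim is omitted and several candidate claims are sampled
    from the fine-tuned BART model as continuations."""
    parts = [context, citance] + ([claim] if claim is not None else [])
    return " || ".join(parts)
```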
Generating Negations: KBIN
Exergames improve function and reduce the risk of falls.
[Diagram: the claim's UMLS concept (C0184511) is mapped via cui2vec1 to a set of related UMLS concepts, from which a replacement concept (C1457868) is chosen]
1. Andrew L. Beam, Benjamin Kompa, Allen Schmaltz, Inbar Fried, Griffin M. Weber, Nathan P. Palmer, Xu Shi, Tianxi Cai, Isaac S. Kohane: Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. PSB 2020: 295-306
Generating Negations: KBIN
[Diagram: the replacement concept (UMLS C1457868) yields candidate negations such as “Exergames worse function and reduce the risk of falls.”, “Exergames deteriorating function and reduce the risk of falls.”, “Exergames deteriorate function and reduce the risk of falls.”, and “Exergames worsened function and reduce the risk of falls.”; GPT-2 ranks them by perplexity and the most fluent candidate, “Exergames deteriorate function and reduce the risk of falls.”, is selected]
Additionally: sample the N top concepts, run the best candidate claims through an NLI model together with the original claim, and select the claim with the highest contradiction score.
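The perplexity-ranking step can be sketched as follows. In KBIN the scorer is GPT-2; here a table of made-up perplexities stands in for the language model so the sketch stays self-contained:

```python
def rank_by_perplexity(candidates, perplexity):
    """Return candidates sorted from most to least fluent (low perplexity first)."""
    return sorted(candidates, key=perplexity)

toy_scores = {  # hypothetical perplexities, for illustration only
    "Exergames worse function and reduce the risk of falls.": 41.2,
    "Exergames deteriorating function and reduce the risk of falls.": 35.7,
    "Exergames deteriorate function and reduce the risk of falls.": 18.3,
    "Exergames worsened function and reduce the risk of falls.": 22.9,
}
best = rank_by_perplexity(list(toy_scores), toy_scores.__getitem__)[0]
```

With a real scorer, `perplexity` would compute the language model's perplexity on each candidate string; everything else stays the same.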
Evaluation: Zero-Shot Fact Checking
• Task: generate claims using CG-Entity/CG-BART +
KBIN, use those claims to train LongChecker
• LongChecker: Longformer based scientific FC model which
uses an entire abstract as evidence
• Comparison:
• Baseline: training on general domain claims (FEVER dataset)
• Upper bound: training on original SciFact claims
Evaluation: Zero-Shot Fact Checking
Method | P | R | F1
FEVER | 69.51 | 66.51 | 67.80
SciFact (Upper Bound) | 77.88 | 77.51 | 77.70
CG-Entity | 72.86 | 69.38 | 71.08
CG-BART | 64.09 | 79.43 | 70.94
• Both methods achieve within 90% of the performance of the upper bound
• Significant improvement over out of domain pre-training
• Not much difference in performance between the two methods
Conclusions
• We introduce CiteWorth – a large, rigorously cleaned
dataset for citation-related tasks
• We show that paragraph level context is crucial to
perform cite-worthiness detection
• We show that the data is diverse with a significant
domain effect
• We show that citances can be used to generate high
quality, atomic scientific claims usable to train models
for scientific fact checking
Overview of Today’s Talk
• Introduction
• The Life Cycle of Scientific Research
• Part 1: Claim detection and generation
• Cite-worthiness detection
• Scientific claim generation for zero-shot scientific fact checking
• Part 2: Exaggeration Detection
• Exaggeration detection of health science press releases
• Conclusion
• Future research challenges
Exaggeration Detection of Science
Press Releases
Dustin Wright, Isabelle Augenstein
EMNLP 2021
Claims Should be Reported on Faithfully
Current treatment options for ALS are based on symptom
management and respiratory support with the only approved
medications in widespread use, Riluzole and Edaravone,
providing only modest benefits and only in some patients.
Riluzole and Edaravone can cure ALS in most patients.
Exaggerated
Exaggeration detection of health science press releases
Exaggeration in Science Journalism
Sumner et al. 2014¹ and Bratton et al. 2019²: InSciOut
1. Sumner, P., Vivian-Griffiths, S., Boivin, J., Williams, A., Venetis, C. A., Davies, A., ... & Chambers, C. D. (2014). The association between exaggeration in health related science news and academic press releases: retrospective observational study. BMJ, 349.
2. Bratton, L., Adams, R. C., Challenger, A., Boivin, J., Bott, L., Chambers, C. D., & Sumner, P. (2019). The association
between exaggeration in health-related science news and academic press releases: a replication study. Wellcome
open research, 4.
Objective: To identify the source (press releases or news) of
distortions, exaggerations, or changes to the main conclusions
drawn from research that could potentially influence a reader’s
health related behaviour.
Conclusions:
• 33% of press releases contain exaggerations of conclusions of
scientific papers
• Exaggeration in news is strongly associated with exaggeration in
press releases
Our Work on Exaggeration Detection in Science
• Task: predicting when a press release exaggerates a scientific
paper
• Input: primary finding of the paper as written in the abstract
and the press release
• Focus of previous work: causal claim strength prediction of
these primary findings
Task Formulations
• T1
• Entailment-like task to predict
exaggeration label
• Paired (press release, abstract) data
ℒ_T1: 0 = Downplays, 1 = Same, 2 = Exaggerates
• T2
• Text classification task to predict causal
claim strength
• Unpaired press releases and abstracts
• Final prediction compares strength of
paired press release and abstract
ℒ_T2: 0 = No Relation, 1 = Correlational, 2 = Conditional Causal, 3 = Direct Causal
Label | Type | Language Cue
0 | No Relation | –
1 | Correlational | association, associated with, predictor, at high risk of
2 | Conditional causal | increase, decrease, lead to, effect on, contribute to, result in (cues indicating doubt: may, might, appear to, probably)
3 | Direct causal | increase, decrease, lead to, effective on, contribute to, reduce, can
(Li et al. 2017)
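A naive rule-of-thumb labeller built from these cues might look like the following. This is an illustration only: the systems in this talk are learned models, and plain substring matching over-matches (e.g. "may" inside "dismay"):

```python
# Cue lists abridged from the table above (Li et al. 2017).
CORRELATIONAL = ("association", "associated with", "predictor", "at high risk of")
CAUSAL = ("increase", "decrease", "lead to", "effect on", "contribute to",
          "result in", "reduce")
DOUBT = ("may", "might", "appear to", "probably")

def claim_strength(sentence: str) -> int:
    """Map a sentence to the 0-3 claim strength scale via cue matching."""
    s = sentence.lower()
    if any(cue in s for cue in CAUSAL):
        # A doubt marker downgrades direct causal (3) to conditional causal (2).
        return 2 if any(h in s for h in DOUBT) else 3
    if any(cue in s for cue in CORRELATIONAL):
        return 1  # correlational
    return 0  # no relation
```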
Evaluation Dataset Creation
Start with the 823 labeled pairs from
Sumner et al. 2014 and Bratton et al. 2019
(InSciOut)
Collect original abstract text from Semantic
Scholar
Match original conclusion sentences to
paraphrased annotations via ROUGE
score
Manually inspect and discard missing or
incorrect abstracts
Final label: compare the annotated claim strength of the press release (s_p) with that of the abstract (s_a):
Downplays if s_p < s_a
Same if s_p = s_a
Exaggerates if s_p > s_a
Total data: 663 pairs (100 training, 553 test)
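The comparison rule above, written out as a minimal sketch (the function name and the s_p/s_a variable names are ours, not from the paper's code):

```python
def exaggeration_label(s_p: int, s_a: int) -> str:
    """s_p: press-release claim strength; s_a: abstract claim strength (0-3 scale)."""
    if s_p < s_a:
        return "Downplays"
    if s_p > s_a:
        return "Exaggerates"
    return "Same"
```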
MT-PET
[Diagram: MT-PET. One cloze pattern for claim strength: “Eating chocolate causes happiness. The claim strength is [MASK]” (verbaliser candidates shown: medium, estimated, cautious, distorted); one pattern for exaggeration: “Scientists claim eating chocolate sometimes causes happiness. Reporters claim eating chocolate causes happiness. The reporters claims are [MASK]” (verbaliser candidates shown: preliminary, identical, naive). The pattern-trained models produce soft labels on unlabelled data, which train a final classifier with a KL-divergence loss]
T1 (Exaggeration Detection) with MT-PET
Method | P | R | F1
Supervised | 28.06 | 33.10 | 29.05
PET | 41.90 | 39.87 | 39.12
MT-PET | 47.80 | 47.99 | 47.35
Substantial improvements when using PET (10 points)
Further improvements with MT-PET (8 points)
Demonstrates transfer of knowledge from claim strength prediction to
exaggeration prediction
Error Analysis
• All models:
• disproportionately get pairs involving direct causal claims
incorrect
• do best for correlational claims from abstracts and claims
from press releases which are correlational or stronger
• MT-PET:
• helps the most for the most difficult category -- causal claims
Summary
• We formalize the problem of scientific exaggeration
detection, providing two task formulations for the
problem
• We curate a set of benchmark data to evaluate
automatic methods for performing the task
• We propose MT-PET, a few-shot learning method
based on PET, which we demonstrate outperforms
strong baselines
Overview of Today’s Talk
• Introduction
• The Life Cycle of Scientific Research
• Part 1: Claim detection and generation
• Cite-worthiness detection
• Scientific claim generation for zero-shot scientific fact checking
• Part 2: Exaggeration Detection
• Exaggeration detection of health science press releases
• Conclusion
• Future research challenges
Wrap-Up
Supporting the Life Cycle of Research
[Diagram: the research life cycle and its supporting tasks, as shown in the introduction]
Supporting the Life Cycle of Research
[Diagram: the same life cycle with one new supporting task added: Credibility Detection (NEW)]
Overall Take-Aways
• Why scholarly document processing?
• Supporting the life cycle of research, from information discovery to
research impact tracking
• Why credibility detection for scholarly communication?
• Detect claims which should be backed up by evidence
(cite-worthiness detection)
• Detect inconsistencies between primary and secondary sources of
information (exaggeration detection, fact checking)
Overall Take-Aways
• Overarching challenges
• Difficult NLP tasks (require understanding of pragmatics)
• Domain effects, importance of context pose further challenges
• Not well-studied yet
• Scarcity of available benchmarks
• Many opportunities for future work
• Explore more different settings
• Gather more datasets
• Methods for domain adaptation & few-shot learning
• Tools for journalists & authors
Thank you!
isabelleaugenstein.github.io
augenstein@di.ku.dk
@IAugenstein
github.com/isabelleaugenstein
Acknowledgements
CopeNLU
https://copenlu.github.io/
This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under the Marie Skłodowska-Curie grant agreement No 801199.
Dustin Wright, Dave Wadden, Kyle Lo, Bailey Kuehl, Lucy Lu Wang
Presented Papers
Isabelle Augenstein. Determining the Credibility of Science
Communication. SDP workshop, 2021.
Dustin Wright, Isabelle Augenstein. CiteWorth: Cite-Worthiness
Detection for Improved Scientific Document Understanding. ACL 2021.
Dustin Wright, Dave Wadden, Kyle Lo, Bailey Kuehl, Isabelle
Augenstein, Lucy Lu Wang. Generating Scientific Claims for Zero-Shot
Scientific Fact Checking. ACL 2022.
Dustin Wright, Isabelle Augenstein. Exaggeration Detection of Science
Press Releases. EMNLP 2021.
Other Recent Related Papers
Andreas Nugaard Holm, Barbara Plank, Dustin Wright, Isabelle
Augenstein. Longitudinal Citation Prediction using Temporal Graph
Neural Networks. AAAI 2022 Workshop on Scientific Document
Understanding (SDU 2022), February 2022.
Malte Ostendorff, Nils Rethmeier, Isabelle Augenstein, Bela Gipp,
Georg Rehm. Neighborhood Contrastive Learning for Scientific
Document Representations with Citation Embeddings. CoRR,
abs/2202.06671, February 2022.
Shailza Jolly, Pepa Atanasova, Isabelle Augenstein. Generating Fluent
Fact Checking Explanations with Unsupervised Post-Editing. CoRR,
abs/2112.06924, December 2021.

More Related Content

PDF
Accountable and Robust Automatic Fact Checking
PDF
Tests de Diagnostic Rapide (TDR) du paludisme sous les tropiques
PDF
Towards Explainable Fact Checking (DIKU Business Club presentation)
PPTX
conduite-à-tenir-devant-une-hyperéosinophilie.pptx
PDF
Intelligent data analysis for medicinal diagnosis
PDF
Paris Data Ladies #14
PDF
How to establish and evaluate clinical prediction models - Statswork
PDF
Architectural approaches for implementing Clinical Decision Support Systems i...
Accountable and Robust Automatic Fact Checking
Tests de Diagnostic Rapide (TDR) du paludisme sous les tropiques
Towards Explainable Fact Checking (DIKU Business Club presentation)
conduite-à-tenir-devant-une-hyperéosinophilie.pptx
Intelligent data analysis for medicinal diagnosis
Paris Data Ladies #14
How to establish and evaluate clinical prediction models - Statswork
Architectural approaches for implementing Clinical Decision Support Systems i...

Similar to Automatically Detecting Scientific Misinformation (20)

PDF
Architectural approaches for implementing Clinical Decision Support Systems i...
PDF
VitreoRetinal Surgery Progress III .....
PDF
DEEP FACIAL DIAGNOSIS: DEEP TRANSFER LEARNING FROM FACE RECOGNITION TO FACIAL...
PDF
Energy-based Model for Out-of-Distribution Detection in Deep Medical Image Se...
PDF
AI at AZ Festival of Genomics 2025 final.pdf
PDF
AI at AZ Festival of Genomics 2025 final.pdf
PPTX
How to establish and evaluate clinical prediction models - Statswork
PDF
Deep Learning-based Diagnosis of Pneumonia using X-Ray Scans
PDF
Heart Disease Prediction Using Data Mining
PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
A SURVEY ON BLOOD DISEASE DETECTION USING MACHINE LEARNING
PPTX
KDD22_tutorial_slides_final_sharing.pptx
PDF
A Review on Covid Detection using Cross Dataset Analysis
PPTX
Enabling Patient-Driven Medicine Using Graph Database
PPT
Cloud Computing and Innovations for Optimizing Life Sciences Research
PDF
Detecting outliers and anomalies in data streams
PPTX
Predictive Analytics and AI: Unlocking Clinical Trial Insights
PDF
PERSPECTIVES GENERATION VIA MULTI-HEAD ATTENTION MECHANISM AND COMMON-SENSE K...
PDF
IRJET- Prediction of Heart Disease using RNN Algorithm
DOCX
Citation Kristoffersson, A.; Lindén,
Architectural approaches for implementing Clinical Decision Support Systems i...
VitreoRetinal Surgery Progress III .....
DEEP FACIAL DIAGNOSIS: DEEP TRANSFER LEARNING FROM FACE RECOGNITION TO FACIAL...
Energy-based Model for Out-of-Distribution Detection in Deep Medical Image Se...
AI at AZ Festival of Genomics 2025 final.pdf
AI at AZ Festival of Genomics 2025 final.pdf
How to establish and evaluate clinical prediction models - Statswork
Deep Learning-based Diagnosis of Pneumonia using X-Ray Scans
Heart Disease Prediction Using Data Mining
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
A SURVEY ON BLOOD DISEASE DETECTION USING MACHINE LEARNING
KDD22_tutorial_slides_final_sharing.pptx
A Review on Covid Detection using Cross Dataset Analysis
Enabling Patient-Driven Medicine Using Graph Database
Cloud Computing and Innovations for Optimizing Life Sciences Research
Detecting outliers and anomalies in data streams
Predictive Analytics and AI: Unlocking Clinical Trial Insights
PERSPECTIVES GENERATION VIA MULTI-HEAD ATTENTION MECHANISM AND COMMON-SENSE K...
IRJET- Prediction of Heart Disease using RNN Algorithm
Citation Kristoffersson, A.; Lindén,
Ad

More from Isabelle Augenstein (20)

PPTX
Beyond Fact Checking — Modelling Information Change in Scientific Communication
PDF
Determining the Credibility of Science Communication
PDF
Explainability for NLP
PDF
Towards Explainable Fact Checking
PDF
Tracking False Information Online
PDF
What can typological knowledge bases and language representations tell us abo...
PDF
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
PDF
Learning with limited labelled data in NLP: multi-task learning and beyond
PDF
Learning to read for automated fact checking
PDF
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
PDF
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
PDF
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
PDF
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
PPTX
Weakly Supervised Machine Reading
PDF
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
PDF
Distant Supervision with Imitation Learning
PDF
Extracting Relations between Non-Standard Entities using Distant Supervision ...
PDF
Information Extraction with Linked Data
PDF
Lodifier: Generating Linked Data from Unstructured Text
PDF
Relation Extraction from the Web using Distant Supervision
Beyond Fact Checking — Modelling Information Change in Scientific Communication
Determining the Credibility of Science Communication
Explainability for NLP
Towards Explainable Fact Checking
Tracking False Information Online
What can typological knowledge bases and language representations tell us abo...
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
Learning with limited labelled data in NLP: multi-task learning and beyond
Learning to read for automated fact checking
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Weakly Supervised Machine Reading
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
Distant Supervision with Imitation Learning
Extracting Relations between Non-Standard Entities using Distant Supervision ...
Information Extraction with Linked Data
Lodifier: Generating Linked Data from Unstructured Text
Relation Extraction from the Web using Distant Supervision
Ad

Recently uploaded (20)

PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PDF
composite construction of structures.pdf
PPT
introduction to datamining and warehousing
DOCX
573137875-Attendance-Management-System-original
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PPTX
Construction Project Organization Group 2.pptx
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPT
Mechanical Engineering MATERIALS Selection
PPTX
CH1 Production IntroductoryConcepts.pptx
PPT
Project quality management in manufacturing
PPTX
UNIT 4 Total Quality Management .pptx
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PDF
Digital Logic Computer Design lecture notes
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPTX
OOP with Java - Java Introduction (Basics)
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
additive manufacturing of ss316l using mig welding
Automation-in-Manufacturing-Chapter-Introduction.pdf
composite construction of structures.pdf
introduction to datamining and warehousing
573137875-Attendance-Management-System-original
Model Code of Practice - Construction Work - 21102022 .pdf
Construction Project Organization Group 2.pptx
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Mechanical Engineering MATERIALS Selection
CH1 Production IntroductoryConcepts.pptx
Project quality management in manufacturing
UNIT 4 Total Quality Management .pptx
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Digital Logic Computer Design lecture notes
UNIT-1 - COAL BASED THERMAL POWER PLANTS
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
OOP with Java - Java Introduction (Basics)
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
additive manufacturing of ss316l using mig welding

Automatically Detecting Scientific Misinformation

  • 3. Supporting the Life Cycle of Research 26/05/2022 3 Reviewing Support Citation Analysis Writing Assistance Information Discovery Conducting Experiments Paper Writing Peer Review Research Impact Tracking Information Extraction Summarisa tion Citation Prediction Reviewer Matching Review Score Prediction Citation Prediction Citation Trend Analysis
  • 4. Fact Checking 26/05/2022 4 Focus on veracity What about more subtle forms of misinformation?
  • 5. 26/05/2022 5 Press Release BBC DailyMail The Express Etc... Credibility and Veracity of Science Communication
  • 6. Knowledge-Intensive Tasks 26/05/2022 6 CLAIMS EVIDENCE Factuality, faithfulness, credit assignment, etc. Goal: automatically ensure the credibility and veracity of scientific information
  • 7. Fundamental Unit of Information: Claims • Focus: tasks involving scientific claims, i.e. assertions about the world which are factual in nature 26/05/2022 7 Current treatment options for ALS are based on symptom management and respiratory support with the only approved medications in widespread use, Riluzole and Edaravone, providing only modest benefits and only in some patients. • Challenges • Supervised learning is hard: annotation is expensive, requiring domain experts • Language used is diverse across fields • Different modalities • Scientific claims are complex • Meta-data also important
  • 8. Overview of Today’s Talk • Introduction • The Life Cycle of Scientific Research • Part 1: Claim detection and generation • Cite-worthiness detection • Scientific claim generation for zero-shot scientific fact checking • Part 2: Exaggeration Detection • Exaggeration detection of health science press releases • Conclusion • Future research challenges
  • 9. CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding Dustin Wright, Isabelle Augenstein ACL 2021 (Findings) 9
  • 10. Claims Should be Properly Credited 26/05/2022 10 Current treatment options for ALS are based on symptom management and respiratory support with the only approved medications in widespread use, Riluzole and Edaravone, providing only modest benefits and only in some patients. Cite-worthiness detection
  • 11. Citances in Machine Learning 26/05/2022 11 We use the model from the original BERT paper (Devlin et al. 2019). Cite-worthiness: Is this a citance? Yes Recommendation: What paper should be cited? Devlin et al. (2019) Influence: Was this an influential paper? Yes Intent: What is the purpose of the citation? Method
  • 12. Cite-Worthiness Datasets • Tend to be small and limited to only a few domains (e.g. Computer Science) • No attention paid to how clean the data is 26/05/2022 12 We use the model from Devlin et al. (2019) as a baseline. e.g. ungrammatical phrases
  • 13. Our Research Questions • How can a dataset for cite-worthiness detection be automatically curated with low noise? • What methods are most effective for automatically detecting cite-worthy sentences? • How does domain affect learning cite-worthiness detection? • Can large scale cite-worthiness data be used to perform transfer learning to downstream scientific text tasks? 26/05/2022 13
  • 14. CiteWorth: Dataset Curation 26/05/2022 14 1. https://github.com/allenai/s2orc We use the model from the original BERT paper (Devlin et al. 2019). We use the model from the original BERT paper [1]. Parenthetical author/year and bracketed numerical citations only Citations must be at the end of a sentence • We limit citances as follows • Source data: S2ORC1 – millions of extracted scientific documents from Semantic Scholar RQ1: How can a dataset for cite-worthiness detection be automatically curated with low noise?
  • 15. CiteWorth Final Dataset • 1,181,793 sentences • 10 different fields, 20,000+ paragraphs per field • Much cleaner than a naive baseline which only removes citation text based on gold spans 26/05/2022 15 RQ1: How can a dataset for cite-worthiness detection be automatically curated with low noise? Method Sentences Clean (%) Citation Markers Removed (%) Naive Baseline 92.07 92.78 CiteWorth (Ours) 98.90 98.10
  • 16. Predicting on Individual Sentences 26/05/2022 16 Can context improve performance? Method P R F1 Logistic Regression 46.65 64.88 54.28 CRNN 50.87 62.21 55.97 Transformer 47.92 71.59 57.39 BERT 55.04 69.02 61.23 SciBERT 57.03 68.08 62.06 Hard Examples – Low Confidence Easy Examples – High Confidence Explanations using inputXGrad*: RQ2: What methods are most effective for automatically detecting cite-worthy sentences? * Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, and Sven Dähne. 2016. Investigating the Influence of Noise and Distractors on the Interpretation of Neural Networks. arXiv preprint arXiv:1611.07270.
  • 17. Predicting Multiple Sentences at Once 26/05/2022 17 Are there variations across field? RQ2: What methods are most effective for automatically detecting cite-worthy sentences? Longformer* [CLS] !! ! !! " [SEP] !" ! !" " !" # [SEP] Pooling Classify Pooling Classify … … Method P R F1 SciBERT 57.03 68.08 62.06 Longformer-Solo 57.21 68.00 62.14 Longformer-Ctx 59.92 77.15 67.45 Δ 5 pts * Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. CoRR, abs/2004.05150.
• 18. Domain Effects 26/05/2022 18
RQ3: How does domain affect learning cite-worthiness detection?
• Average-pooled BERT representations of sentences, clustered using a GMM (Roee Aharoni and Y. Goldberg. 2020. Unsupervised Domain Clusters in Pretrained Language Models. In ACL.)
• Longformer-Ctx performance (F1) in cross-domain evaluation:

Train \ Test | Ch    | E     | CS    | P     | B
Ch           | 67.58 | 58.41 | 56.86 | 62.35 | 68.23
E            | 66.62 | 60.25 | 60.11 | 64.02 | 68.07
CS           | 65.05 | 59.36 | 61.99 | 63.85 | 66.72
P            | 65.49 | 58.03 | 56.69 | 65.10 | 68.27
B            | 66.59 | 58.80 | 58.22 | 64.54 | 69.12
• 19. Generating Scientific Claims for Zero- Shot Scientific Fact Checking Dustin Wright, Dave Wadden, Kyle Lo, Bailey Kuehl, Isabelle Augenstein, Lucy Lu Wang ACL 2022 19
  • 20. Claims Should be Factually Correct 26/05/2022 20 Current treatment options for ALS are based on symptom management and respiratory support with the only approved medications in widespread use, Riluzole and Edaravone, providing only modest benefits and only in some patients. ALS cannot be treated by Riluzole. Scientific claim generation for zero-shot scientific fact checking
• 21. Citances in Scientific Fact Checking 26/05/2022 21
• Scientific fact checking data is difficult to collect – where to collect claims from?
• Previous work: SciFact
  • Crowdsource claims from citances
  • Requires manually rewriting complex claims into atomic claims
• Example: “Australian statistics show that 1 in 7 young people have an anxiety disorder and 1 in 16 have depression” → “Australian statistics show that 1 in 7 young people have an anxiety disorder” + “Australian statistics show that 1 in 16 young people have depression”
• Individual claims are “atomic verifiable statements expressing a finding about one aspect of a scientific entity or process, which can be verified from a single source”
ZSSFC (ACL 2022)
  • 22. Research Questions 1. How can we generate claims from citances that are useful for scientific fact checking? ➢ Citances -- sentences which have references to external source ➢ Useful for fact checking because they contain a link to evidence which can be used to verify their content 2. What methods can generate claims which are fluent, faithful, atomic, and de-contextualized? 3. In what situations do generated claims help improve performance on zero-shot scientific fact checking? 26/05/2022 22 ZSSFC (ACL 2022)
• 23. Methods: ClaimGen-Entity 26/05/2022 23
Pipeline: Citance → Named Entity Recognition (scispacy) → Question Generation (BART) → Claim Generation (BART)
For each extracted entity aᵢ, a question Qᵢ is generated from the citance, then converted into a declarative claim Cᵢ.
ZSSFC (ACL 2022)
• 24. Methods: ClaimGen-BART 26/05/2022 24
BART fine-tuned to generate claims directly from citances.
Input format: “{CONTEXT} || {CITANCE} || {CLAIM}”
Sample multiple claims at test time.
ZSSFC (ACL 2022)
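The formatting step above can be sketched as plain string templating — at training time the claim is appended as the target continuation, at test time the sequence is truncated after the citance and the model samples several claim completions. The helper names are illustrative.

```python
# Hypothetical helpers implementing the "{CONTEXT} || {CITANCE} || {CLAIM}"
# format from the slide. At test time the claim slot is left empty and
# multiple claims are sampled from the fine-tuned BART model.

def format_train(context: str, citance: str, claim: str) -> str:
    """Full training sequence: context, citance, and gold claim."""
    return f"{context} || {citance} || {claim}"

def format_test(context: str, citance: str) -> str:
    """Generation prompt: the model fills in the claim after the last '||'."""
    return f"{context} || {citance} ||"

prompt = format_test(
    "ALS is a progressive neurodegenerative disease.",
    "Riluzole provides only modest benefits in some patients [1].",
)
```

The `||` separators let a single sequence-to-sequence model distinguish the surrounding context from the citance it should turn into a claim.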
• 25. Generating Negations: KBIN 26/05/2022 25
Example claim: “Exergames improve function and reduce the risk of falls.”
• Link claim concepts to UMLS (e.g. C0184511) and retrieve related candidate concepts (e.g. C1457868) using cui2vec¹ embeddings
1. Andrew L. Beam, Benjamin Kompa, Allen Schmaltz, Inbar Fried, Griffin M. Weber, Nathan P. Palmer, Xu Shi, Tianxi Cai, Isaac S. Kohane: Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. PSB 2020: 295-306
ZSSFC (ACL 2022)
• 26. Generating Negations: KBIN 26/05/2022 26
• Substitute the retrieved UMLS concept (C1457868) into the claim to produce candidate negations:
  • Exergames worse function and reduce the risk of falls.
  • Exergames deteriorating function and reduce the risk of falls.
  • Exergames deteriorate function and reduce the risk of falls.
  • Exergames worsened function and reduce the risk of falls.
• Rank candidates by GPT-2 perplexity; best: “Exergames deteriorate function and reduce the risk of falls.”
• Additionally: sample N top concepts, run the best claims through an NLI model with the original claim, and select the claim with the highest contradiction score
ZSSFC (ACL 2022)
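The ranking step can be sketched as "score each candidate with a language model, keep the most fluent one." The unigram "LM" below is a stand-in for GPT-2, with made-up counts chosen for illustration; only the selection logic reflects the slide.

```python
import math

# Toy word frequencies standing in for a real language model (GPT-2 in the
# paper). Counts are invented; "deteriorate" is given the highest count so
# the toy model prefers the same surface form a fluency model would.
UNIGRAM_COUNTS = {"exergames": 2, "deteriorate": 5, "worse": 1,
                  "deteriorating": 1, "worsened": 1, "function": 8,
                  "and": 20, "reduce": 6, "the": 30, "risk": 7,
                  "of": 25, "falls": 4}
TOTAL = sum(UNIGRAM_COUNTS.values())

def pseudo_perplexity(sentence: str) -> float:
    """Unigram perplexity; lower means the toy LM finds it more fluent."""
    words = sentence.lower().rstrip(".").split()
    log_prob = sum(math.log(UNIGRAM_COUNTS.get(w, 0.5) / TOTAL) for w in words)
    return math.exp(-log_prob / len(words))

def best_negation(candidates):
    return min(candidates, key=pseudo_perplexity)

candidates = [
    "Exergames worse function and reduce the risk of falls.",
    "Exergames deteriorating function and reduce the risk of falls.",
    "Exergames deteriorate function and reduce the risk of falls.",
    "Exergames worsened function and reduce the risk of falls.",
]
chosen = best_negation(candidates)
```

In KBIN this fluency ranking is followed by the NLI contradiction check, which guards against substitutions that are fluent but do not actually negate the original claim.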
• 27. Evaluation: Zero-Shot Fact Checking 26/05/2022 27
• Task: generate claims using CG-Entity/CG-BART + KBIN, use those claims to train LongChecker
• LongChecker: Longformer-based scientific fact checking model which uses an entire abstract as evidence
• Comparison:
  • Baseline: training on general-domain claims (FEVER dataset)
  • Upper bound: training on original SciFact claims
ZSSFC (ACL 2022)
• 28. Evaluation: Zero-Shot Fact Checking 26/05/2022 28

Method                | P     | R     | F1
FEVER                 | 69.51 | 66.51 | 67.80
SciFact (Upper Bound) | 77.88 | 77.51 | 77.70
CG-Entity             | 72.86 | 69.38 | 71.08
CG-BART               | 64.09 | 79.43 | 70.94

• Both methods achieve within 90% of the performance of the upper bound
• Significant improvement over out-of-domain pre-training
• Not much difference in performance between the two methods
ZSSFC (ACL 2022)
  • 29. Conclusions • We introduce CiteWorth – a large, rigorously cleaned dataset for citation-related tasks • We show that paragraph level context is crucial to perform cite-worthiness detection • We show that the data is diverse with a significant domain effect • We show that citances can be used to generate high quality, atomic scientific claims usable to train models for scientific fact checking 26/05/2022 29
  • 30. Overview of Today’s Talk • Introduction • The Life Cycle of Scientific Research • Part 1: Claim detection and generation • Cite-worthiness detection • Scientific claim generation for zero-shot scientific fact checking • Part 2: Exaggeration Detection • Exaggeration detection of health science press releases • Conclusion • Future research challenges
  • 31. Exaggeration Detection of Science Press Releases Dustin Wright, Isabelle Augenstein EMNLP 2021 31
  • 32. Claims Should be Reported on Faithfully 26/05/2022 32 Current treatment options for ALS are based on symptom management and respiratory support with the only approved medications in widespread use, Riluzole and Edaravone, providing only modest benefits and only in some patients. Riluzole and Edaravone can cure ALS in most patients. Exaggerated Exaggeration detection of health science press releases
• 33. Exaggeration in Science Journalism Sumner et al. 2014¹ and Bratton et al. 2019²: InSciOut 26/05/2022 33
Objective: To identify the source (press releases or news) of distortions, exaggerations, or changes to the main conclusions drawn from research that could potentially influence a reader’s health-related behaviour.
Conclusions:
• 33% of press releases contain exaggerations of conclusions of scientific papers
• Exaggeration in news is strongly associated with exaggeration in press releases
1. Sumner, P., Vivian-Griffiths, S., Boivin, J., Williams, A., Venetis, C. A., Davies, A., ... & Chambers, C. D. (2014). The association between exaggeration in health related science news and academic press releases: retrospective observational study. BMJ, 349.
2. Bratton, L., Adams, R. C., Challenger, A., Boivin, J., Bott, L., Chambers, C. D., & Sumner, P. (2019). The association between exaggeration in health-related science news and academic press releases: a replication study. Wellcome Open Research, 4.
  • 34. Our Work on Exaggeration Detection in Science • Task: predicting when a press release exaggerates a scientific paper • Input: primary finding of the paper as written in the abstract and the press release • Focus of previous work: causal claim strength prediction of these primary findings 26/05/2022 34
• 35. Task Formulations 26/05/2022 35
• T1: Entailment-like task to predict the exaggeration label from paired (press release, abstract) data
  ℒ_T1 = {0: Downplays, 1: Same, 2: Exaggerates}
• T2: Text classification task to predict causal claim strength from unpaired press releases and abstracts; the final prediction compares the strength of a paired press release and abstract
  ℒ_T2 = {0: No Relation, 1: Correlational, 2: Conditional Causal, 3: Direct Causal}

Label | Type               | Language Cue (Li et al. 2017)
0     | No Relation        | —
1     | Correlational      | association, associated with, predictor, at high risk of
2     | Conditional causal | increase, decrease, lead to, effect on, contribute to, result in (cues indicating doubt: may, might, appear to, probably)
3     | Direct causal      | increase, decrease, lead to, effective on, contribute to, reduce, can
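The T2 formulation reduces exaggeration detection to comparing two claim-strength labels, which can be sketched in a few lines (label encodings follow the slide; the function name is illustrative):

```python
# T2: given predicted claim strengths (0-3) for the press release and the
# paper abstract, derive the T1 exaggeration label by comparison.
DOWNPLAYS, SAME, EXAGGERATES = 0, 1, 2

def exaggeration_label(strength_press: int, strength_abstract: int) -> int:
    """Compare claim strength of the press release vs. the paper abstract."""
    if strength_press < strength_abstract:
        return DOWNPLAYS
    if strength_press > strength_abstract:
        return EXAGGERATES
    return SAME

# e.g. the press release makes a direct causal claim (3) where the abstract
# only reports a correlation (1) -> exaggerates.
label = exaggeration_label(3, 1)
```

This indirection is what lets unpaired press releases and abstracts contribute training signal: only the strength classifier needs labels, not the pairs.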
• 36. Evaluation Dataset Creation 26/05/2022 36
1. Start with the 823 labeled pairs from Sumner et al. 2014 and Bratton et al. 2019 (InSciOut)
2. Collect original abstract text from Semantic Scholar
3. Match original conclusion sentences to paraphrased annotations via ROUGE score
4. Manually inspect and discard missing or incorrect abstracts
Final label: compare annotated claim strength (s_p for press release, s_a for abstract): Downplays if s_p < s_a, Same if s_p = s_a, Exaggerates if s_p > s_a
Total data: 663 pairs (100 training, 553 test)
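Step 3 above — aligning each annotated (paraphrased) conclusion with its closest sentence in the original abstract — can be sketched with a minimal unigram-overlap ROUGE-1 F1; a real pipeline would likely use a ROUGE library, and the example sentences are invented.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def best_match(annotated_sentence, abstract_sentences):
    """Abstract sentence with the highest ROUGE-1 F1 to the annotation."""
    return max(abstract_sentences, key=lambda s: rouge1_f1(annotated_sentence, s))

abstract = ["We recruited 120 participants.",
            "Chocolate intake was associated with higher reported happiness."]
match = best_match("Chocolate intake is associated with happiness.", abstract)
```

Low-scoring matches are exactly the cases flagged for the manual inspection in step 4.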
• 37. MT-PET 26/05/2022 37
• Cloze-style (PET) patterns with verbalizers for both tasks:
  • Claim strength: “Eating chocolate causes happiness. The claim strength is [MASK]” → verbalizer scores over e.g. medium, estimated, cautious, distorted
  • Exaggeration: “Scientists claim eating chocolate sometimes causes happiness. Reporters claim eating chocolate causes happiness. The reporters’ claims are [MASK]” → verbalizer scores over e.g. preliminary, identical, naive
• The model for one task produces soft labels on unlabelled data, used to train the model for the other task with a KL-divergence loss
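The cross-task supervision step — verbalizer probabilities from one task's model acting as soft labels for the other on unlabelled data — boils down to a KL-divergence loss between two distributions. A self-contained sketch; the probabilities are illustrative, not from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions (eps avoids log(0))."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Teacher: verbalizer distribution from the auxiliary-task PET model on an
# unlabelled example. Student: the main-task model's prediction on the same
# example. Training minimises the KL term (plus the usual PET losses).
soft_labels = [0.05, 0.15, 0.80]
student_probs = [0.10, 0.30, 0.60]
loss = kl_divergence(soft_labels, student_probs)
```

Minimising this term pulls the student's distribution toward the teacher's, which is how knowledge transfers from claim-strength prediction to exaggeration detection without paired labels.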
• 38. T1 (Exaggeration Detection) with MT-PET 26/05/2022 38

Method     | P     | R     | F1
Supervised | 28.06 | 33.10 | 29.05
PET        | 41.90 | 39.87 | 39.12
MT-PET     | 47.80 | 47.99 | 47.35

• Substantial improvements when using PET (10 points F1)
• Further improvements with MT-PET (8 points F1)
• Demonstrates transfer of knowledge from claim strength prediction to exaggeration prediction
  • 39. Error Analysis • All models: • disproportionately get pairs involving direct causal claims incorrect • do best for correlational claims from abstracts and claims from press releases which are correlational or stronger • MT-PET: • helps the most for the most difficult category -- causal claims 26/05/2022 39
  • 40. Summary • We formalize the problem of scientific exaggeration detection, providing two task formulations for the problem • We curate a set of benchmark data to evaluate automatic methods for performing the task • We propose MT-PET, a few-shot learning method based on PET, which we demonstrate outperforms strong baselines 26/05/2022 40
  • 41. Overview of Today’s Talk • Introduction • The Life Cycle of Scientific Research • Part 1: Claim detection and generation • Cite-worthiness detection • Scientific claim generation for zero-shot scientific fact checking • Part 2: Exaggeration Detection • Exaggeration detection of health science press releases • Conclusion • Future research challenges
• 43. Supporting the Life Cycle of Research 26/05/2022 43 Reviewing Support Citation Analysis Writing Assistance Information Discovery Conducting Experiments Paper Writing Peer Review Research Impact Tracking Information Extraction Summarisation Citation Prediction Reviewer Matching Review Score Prediction Citation Prediction Citation Trend Analysis
• 44. Supporting the Life Cycle of Research 26/05/2022 44 Reviewing Support Citation Analysis Writing Assistance Information Discovery Conducting Experiments Paper Writing Peer Review Research Impact Tracking Information Extraction Summarisation Citation Prediction Credibility Detection Reviewer Matching Review Score Prediction Citation Prediction Citation Trend Analysis NEW
  • 45. Overall Take-Aways • Why scholarly document processing? • Supporting the life cycle of research, from information discovery to research impact tracking • Why credibility detection for scholarly communication? • Detect claims which should be backed up by evidence (cite-worthiness detection) • Detect inconsistencies between primary and secondary sources of information (exaggeration detection, fact checking)
• 46. Overall Take-Aways • Overarching challenges • Difficult NLP tasks (require understanding of pragmatics) • Domain effects and the importance of context pose further challenges • Not well-studied yet • Scarcity of available benchmarks • Many opportunities for future work • Explore more diverse settings • Gather more datasets • Methods for domain adaptation & few-shot learning • Tools for journalists & authors
• 48. Acknowledgements 48 CopeNLU https://copenlu.github.io/ This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 801199. Dustin Wright Dave Wadden Kyle Lo Bailey Kuehl Lucy Lu Wang
  • 49. Presented Papers Isabelle Augenstein. Determining the Credibility of Science Communication. SDP workshop, 2021. Dustin Wright, Isabelle Augenstein. CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding. ACL 2021. Dustin Wright, Dave Wadden, Kyle Lo, Bailey Kuehl, Isabelle Augenstein, Lucy Lu Wang. Generating Scientific Claims for Zero-Shot Scientific Fact Checking. ACL 2022. Dustin Wright, Isabelle Augenstein. Exaggeration Detection of Science Press Releases. EMNLP 2021.
  • 50. Other Recent Related Papers Andreas Nugaard Holm, Barbara Plank, Dustin Wright, Isabelle Augenstein. Longitudinal Citation Prediction using Temporal Graph Neural Networks. AAAI 2022 Workshop on Scientific Document Understanding (SDU 2022), February 2022. Malte Ostendorff, Nils Rethmeier, Isabelle Augenstein, Bela Gipp, Georg Rehm. Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. CoRR, abs/2202.06671, February 2022. Shailza Jolly, Pepa Atanasova, Isabelle Augenstein. Generating Fluent Fact Checking Explanations with Unsupervised Post-Editing. CoRR, abs/2112.06924, December 2021.