Restoring Trust by Computing: Data-driven
Fact-checking and Exceptional Fact Finding
Chengkai Li
Professor and Associate Chair
Director, Innovative Database and Information Systems (IDIR) Lab
CSE Department, UT-Arlington
Texas Health Resources
August 29, 2019
Outline
o A Brief History of Our Computational Journalism Research
o Computational Journalism
– Data-driven fact-checking (ClaimBuster)
– Other ongoing fact-checking projects
– Exceptional fact finding (FactWatcher, Maverick)
o Graph Data Usability (Orion, GQBE, TableView, Maverick)
How it Started
2009
How did they come up with that?
Chris Paul had 16 points, 10 rebounds, 13 assists and five steals… The only other active player to have such a game is Jason Kidd…
How it Started
This is computational journalism.
“Developing the Field of Computational Journalism” Workshop, July 27-31, 2009
Let’s work on this fact-finding problem.
#@%&!#!%
Summer 2010
fact-finding fact-checking
CIDR 2011
The “Fake News” Problem
It has always been there, ever since “Yellow Journalism” started in the 1890s.
− Exaggerated headlines, clickbait, computational propaganda, misinformation, disinformation
− Snopes.com (1994), Glenn Kessler (1996), FactCheck.org (2003), PolitiFact (2007)
But it has been a daily talking point since 2016.
− Pizzagate
− “filter bubble” and “echo chamber” exacerbated by social media
− The many false claims in political discourse
− Russian meddling in the U.S. election
− Twitter bots, Facebook ads, trending-topic algorithms
Toward the Holy Grail of Automated Fact-checking
idir.uta.edu/claimbuster
The Holy Grail: Automated, Live Fact-Checking
Source: Bill Adair
Fact-checking is Hard
“… our Navy is smaller than it's been since 1917,” said Republican candidate Mitt Romney in the third presidential debate in 2012.
http://en.wikipedia.org/wiki/Mitt_Romney
http://s3.amazonaws.com/thf_media/2010/pdf/Military_chartbook.pdf
Fact-checking is Hard
“… our Navy is smaller than it's been since 1917,” said Republican candidate Mitt Romney in a Republican presidential debate in 2012.
http://en.wikipedia.org/wiki/Mitt_Romney
http://s3.amazonaws.com/thf_media/2010/pdf/Military_chartbook.pdf
http://en.wikipedia.org/wiki/United_States_Navy
PolitiFact “Buffet” of Factual Claims
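Finding Important Factual Claims: A Supervised Learning Task
o Training data: presidential debate transcripts (1960-2012), labeled through human annotation to form the ground truth
o Feature extraction turns the sentences into feature vectors for the learning algorithm
o The learned model is applied to the 2016 presidential debates to find important factual claims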
Classification and Ranking by Check-worthiness
CFS (check-worthy factual sentences): important factual claims
“We spend less on the military today than at any time in our history.” “The President’s position on gay marriage has changed.” “More people are unemployed today than four years ago.”
UFS (unimportant factual sentences): unimportant factual claims
“I was in Iowa yesterday.” “My mother enjoys cooking.” “I ran for President once before.”
NFS (non-factual sentences): no factual claims (opinions, questions & declarations)
“Iran must not get nuclear weapons.” “7% unemployment is too high.” “My opponent is wishy-washy.” “I will be tough on crime.” “Why should we do that?” “Hello, New Hampshire!” “Our plan is to reduce tax rate by 10%.”
Our Own CrowdFlower
Ground Truth Collection
o 20 months, 374 coders, ~$4,000 paid
o 30 training sentences
o 1032 screening sentences (731 NFS, 63 UFS, 238 CFS) to detect spammers & low-quality coders
o 374 coders, 76552 labels
o 86 top-quality coders, 52333 labels
o 20788 sentences
o Majority voting: 20617 admitted sentences

Class   Count   Percentage
CFS      4849   23.52%
UFS      2097   10.17%
NFS     13671   66.31%

http://www.engineersdaily.com/2014/03/basics-of-soil-compaction.html
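The majority-voting step can be sketched as follows. This is a minimal illustration and not the ClaimBuster pipeline: the function name `majority_label`, the strict-majority admission rule, and the sample votes are all assumptions, since the slide does not state how a sentence is admitted. The last lines only recompute the class percentages from the counts reported on the slide.

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by a strict majority of coders, or None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Assumption: a sentence is admitted only if one label gets more than half the votes.
    return label if count * 2 > len(labels) else None

# Hypothetical coder labels for three sentences.
votes = {
    "s1": ["CFS", "CFS", "NFS"],
    "s2": ["NFS", "NFS", "NFS"],
    "s3": ["CFS", "UFS", "NFS"],  # no majority, so not admitted
}
admitted = {sid: majority_label(v) for sid, v in votes.items()}
admitted = {sid: lab for sid, lab in admitted.items() if lab is not None}

# Class counts of the admitted sentences, as reported on the slide.
counts = {"CFS": 4849, "UFS": 2097, "NFS": 13671}
total = sum(counts.values())
shares = {c: round(100 * n / total, 2) for c, n in counts.items()}
```

The three class counts add up to the 20617 admitted sentences, and the recomputed shares match the percentages in the table.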
LSTM-based Model for Claim Spotting
o Model based on an LSTM (long short-term memory) network
o Tested several different word embeddings but found no significant difference in overall performance
− Different embeddings did have different affinities (e.g., some produce models that score text containing numbers higher on average).
o We used nonsensical sentences to make the model more resistant to attacks.
o The neural models outperformed the SVM model by 29% in precision and 12% in recall.
Adversarially Trained LSTM
o The LSTM model is trained on adversarial (modified) examples.
o Adversarial noise (perturbations), designed to maximize the chance of misclassification by the network, was calculated using back-propagation.
o Perturbation was applied to the word embeddings rather than directly to the input.
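The idea of perturbing embeddings along the loss gradient can be sketched on a toy model. This is not the LSTM from the talk: it swaps in a linear logistic classifier over a single embedding vector so that the gradient can be written by hand instead of obtained through back-propagation. The step rule (a fixed L2 norm `epsilon` along the gradient), the weights, and the embedding values are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, e, y):
    """Logistic loss for a label y in {-1, +1} on embedding e."""
    z = sum(wi * ei for wi, ei in zip(w, e))
    return -math.log(sigmoid(y * z))

def loss_grad_wrt_embedding(w, e, y):
    """Gradient of the logistic loss with respect to the embedding e."""
    z = sum(wi * ei for wi, ei in zip(w, e))
    scale = -y * (1.0 - sigmoid(y * z))
    return [scale * wi for wi in w]

def adversarial_embedding(w, e, y, epsilon=0.5):
    """Shift e by a step of L2 norm epsilon in the direction that increases the loss."""
    g = loss_grad_wrt_embedding(w, e, y)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
    return [ei + epsilon * gi / norm for ei, gi in zip(e, g)]

# Hypothetical classifier weights and word embedding.
w = [0.8, -0.4, 0.3]
e = [1.0, 0.5, -0.2]
y = 1
e_adv = adversarial_embedding(w, e, y)
```

In adversarial training, the model would then be fit on both the original and the perturbed embeddings, so that small shifts of this kind no longer flip its prediction.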
Adversarially Trained LSTM
Precision  Recall  F1-score
ClaimBuster API
PolitiFact “Buffet” of Factual Claims
Duke Reporters’ Lab uses the ClaimBuster API in creating daily newsletters that recommend to The Washington Post, The New York Times, PolitiFact, and other fact-checkers the most check-worthy claims in CNN programs, tweets, Facebook posts, and the Congressional Record.
The First-ever End-to-end Fact-checking System
Claim Spotter
[CIKM15a, KDD17, VLDB17demo, IJCNN18]
* First Runner-Up, SIGMOD17 Student Research Competition
ClaimPortal: Integrated Monitoring, Searching, Checking, and Analytics of Factual Claims on Twitter
https://idir.uta.edu/claimportal/
How it Started
2009
How did they come up with that?
Chris Paul had 16 points, 10 rebounds, 13 assists and five steals… The only other active player to have such a game is Jason Kidd…
Exceptional Fact Finding
Real-world event (sports, transportation, crime, weather, finance, social media)
New tuple appended to database
Fact-finding algorithms
Wesley had 12 points, 13 assists and 5 rebounds on February 25, 1996 to become the first player with a 12/13/5 (points/assists/rebounds) in February.
Number-based facts make news stories more engaging
http://en.wikipedia.org/wiki/Basketball
idir.uta.edu/factwatcher
Excellent Demo Award
FactWatcher [VLDB14 demo]
o [ICDE14] Situational Facts: “No other player scored more pts and reb against
DAL than Jordan.”
o [KDD12] One-of-the-Few: “Jordan scored 10 pts & 10 reb. Only 3 others had
similar performance.”
o [KDD11] Prominent Streaks: “The Nikkei 225 closed below 10000 for the 12th
consecutive week, the longest such streak since June 2009.”
o [TKDD14] General Prominent Streaks: “James has scored at least 20 points and
handed out 10 or more assists in each of his last five games against the Hawks, the
longest streak he has ever had against one team.”
Frequent Episode Mining [ICDE15]
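To make the streak-style facts above concrete, here is a minimal sketch that scans a series for its longest run of consecutive values meeting a condition. It is a plain linear scan written for illustration, not the algorithm from the cited papers, and the function name `longest_streak` and the closing prices are hypothetical.

```python
def longest_streak(values, predicate):
    """Return (start_index, length) of the longest run of consecutive items satisfying predicate."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, v in enumerate(values):
        if predicate(v):
            if run_len == 0:
                run_start = i  # a new run begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # the run is broken
    return best_start, best_len

# Hypothetical weekly closing values of an index.
closes = [10120, 9950, 9870, 10050, 9990, 9900, 9800, 9700, 10200]
start, length = longest_streak(closes, lambda c: c < 10000)
```

Here the longest run below 10000 starts at index 4 and lasts four weeks. A fact-finding system would additionally have to compare such runs across thresholds and time windows to decide which streaks are worth reporting.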
Fact-finding Algorithms
o Many interesting facts in the real world can be modeled as various types of skyline points.
o The gist of fact-finding is to efficiently and incrementally maintain skyline points over ever-changing data, while considering constraints such as selection conditions.
Stephan Börzsönyi, Donald Kossmann, Konrad Stocker: The Skyline Operator. ICDE 2001: 421-430
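A minimal sketch of incremental skyline maintenance, assuming larger values are better on every measure and that tuples are only ever appended. The function names are hypothetical, the update is the naive compare-against-current-skyline approach rather than the optimized algorithms from the talk, and the (pts, ast, reb) values are the ones from the example game table in this deck.

```python
def dominates(a, b):
    """True if a is at least as good as b on every measure and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def insert_into_skyline(skyline, new_point):
    """Update a skyline (dict: id -> measures) with one newly appended point."""
    pid, measures = new_point
    if any(dominates(m, measures) for m in skyline.values()):
        return skyline  # the new point is dominated, so the skyline is unchanged
    # The new point enters the skyline and evicts every point it dominates.
    kept = {i: m for i, m in skyline.items() if not dominates(measures, m)}
    kept[pid] = measures
    return kept

# (pts, ast, reb) for each game, in arrival order.
games = [("t1", (4, 12, 5)), ("t2", (24, 5, 15)), ("t3", (13, 13, 5)),
         ("t4", (2, 5, 2)), ("t5", (3, 5, 3)), ("t6", (27, 18, 8))]
skyline = {}
for game in games:
    skyline = insert_into_skyline(skyline, game)
```

After all six tuples arrive, only t2 and t6 remain: t6 leads on points and assists, t2 on rebounds, and neither dominates the other. Handling deletions or selection conditions would need more machinery than this sketch shows.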
Modeling Situational Facts
“Paul George had 21 points, 11 rebounds and 5 assists to become the first Pacers player with a 20/10/5 (points/rebounds/assists) game against the Bulls since Detlef Schrempf in December 1992.” (http://espn.go.com/espn/elias?date=20130205)
Modeling Situational Facts

id  player      day  month  season   team     opp_team      pts  ast  reb
t1  Bogues      11   Feb.   1991-92  Hornets  Hawks           4   12    5
t2  Seikaly     13   Feb.   1991-92  Heat     Hawks          24    5   15
t3  Sherman      7   Dec.   1993-94  Celtics  Nets           13   13    5
t4  Wesley       4   Feb.   1994-95  Celtics  Nets            2    5    2
t5  Wesley       5   Feb.   1994-95  Celtics  Timberwolves    3    5    3
t6  Strickland   3   Jan.   1995-96  Blazers  Celtics        27   18    8

Dimension space: D = {d1, …, dn}    Measure space: M = {m1, …, ms}

Lattice of constraints:
⊤
a1    b1    c1
a1,b1    a1,c1    b1,c1
a1,b1,c1

Skyline maintenance + Data Cube/OLAP
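The combination of skyline checks with a lattice of constraints can be sketched naively as follows. For a newly appended tuple, the code enumerates every constraint formed by binding a subset of dimension attributes to the tuple's own values, and keeps the constraints under which no earlier matching tuple dominates it. This is a brute-force illustration, not the algorithm from the cited work: the function names are hypothetical, only three of the table's dimension attributes are used, and measure subspaces are ignored.

```python
from itertools import combinations

DIMS = ("player", "month", "team")   # assumption: a subset of the table's dimensions
MEASURES = ("pts", "ast", "reb")

def dominates(a, b):
    """True if tuple a is at least as good as b on all measures and better on one."""
    return (all(a[m] >= b[m] for m in MEASURES)
            and any(a[m] > b[m] for m in MEASURES))

def situational_constraints(history, new):
    """Constraints (as dicts) under which `new` is not dominated by any earlier tuple."""
    found = []
    for k in range(len(DIMS) + 1):
        for dims in combinations(DIMS, k):
            # Earlier tuples that satisfy the same constraint as the new tuple.
            context = [t for t in history if all(t[d] == new[d] for d in dims)]
            if not any(dominates(t, new) for t in context):
                found.append({d: new[d] for d in dims})
    return found

def row(player, month, team, pts, ast, reb):
    return dict(player=player, month=month, team=team, pts=pts, ast=ast, reb=reb)

# Tuples t1 to t4 of the table, then t5 arrives.
history = [
    row("Bogues", "Feb.", "Hornets", 4, 12, 5),
    row("Seikaly", "Feb.", "Heat", 24, 5, 15),
    row("Sherman", "Dec.", "Celtics", 13, 13, 5),
    row("Wesley", "Feb.", "Celtics", 2, 5, 2),
]
new = row("Wesley", "Feb.", "Celtics", 3, 5, 3)
facts = situational_constraints(history, new)
```

With no constraint the new tuple is dominated (for example by t1), but among Wesley's own games, or among Celtics games in February, it is undominated. Enumerating all constraints is exponential in the number of dimensions, which is why pruning over the lattice matters in practice.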
Tackling Graph Data Usability Challenges
Challenges
o Massive, complex graphs; millions of entities; billions of edges.
o Requires substantial understanding of schema and data, and complex pre-processing, before one can fetch information or gain insights from data.
Objectives
o Make it easy to understand, query, explore, and clean graph data.
Systems
o GQBE [TKDE15, ICDE14 demo]: graph query by example
o Orion [VLDB15 demo, SIGMOD17 tutorial]: auto-suggestion for interactive query formulation
o TableView [SIGMOD16, ICDE18 demo]: generating preview tables for knowledge graphs (* SIGMOD Most Reproducible Paper Award)
o Maverick [SIGMOD18, VLDB18 demo]: finding outliers and errors in graphs
IDIR Projects and Demos
Demonstration Videos https://vimeo.com/channels/1024406
o ClaimBuster (idir.uta.edu/claimbuster) Automated, live fact-checking
o ClaimPortal (idir.uta.edu/claimportal) Monitoring factual claims on social media
o CrewScout (idir.uta.edu/crewscout) Expert team finding by skyline groups
o ERQ: Entity-Relationship Query (idir.uta.edu/erq) Structured query on Wikipedia
o Facetedpedia (idir.uta.edu/facetedpedia) Faceted search interface for Wikipedia
o FactWatcher (idir.uta.edu/factwatcher) Fact-finding from real-world events
o FrameAnnotator (idir.uta.edu/frameannotator) Frame annotation tool
o GQBE (idir.uta.edu/gqbe) Graph query by example
o Orion (idir.uta.edu/orion) Auto-suggestion for interactive graph query formulation
o TableView Generating preview tables for knowledge graphs
o Maverick Exceptional fact finding from knowledge graphs
Current IDIR Students
Farahnaz Akrami
Fatma Arslan
Shadekur Rahman
Samiul Saeef
Theodora Toutountzi
Ph.D. students
Israa Jaradat
Xiao Shi
Zeyu Zhang
M.S. students
Damian Jimenez
Priyank Arora
Sumeet Lubal
Sarthak Majithia
Daniel Obembe
Sarbajit Roy
B.S. students
Kyrell Dixon
Jacob Devasier
Graduated Students
Ning Yan (2013, Research Scientist, Huawei, Santa Clara)
Ph.D. students
Gensheng Zhang (2017, Google)
Nandish Jayaram (2016, Pivotal)
Naeemul Hassan (2016, Assistant Professor, University of Mississippi)
Afroza Sultana (2018, Teradata Lab)
Aditya Telang (2011, co-advised, IBM Research India)
Tulsi Chandwani (2017, Red Hat)
Abu Ayub Ansari Syed (2017)
Ishwor Timilsina (2017, Fidelity)
Nigesh Shakya (2017)
Rohit Bhoopalam (2016, Akamai)
Fatma Dogan (2015, UTA Ph.D.)
Minumol Joseph (2015, Capital One)
Ramesh Venkataraman (2014, Amazon)
Mahesh Gupta (2012, Electronic Arts)
Jijo Philip (2012, Cerner Corporation)
Avinash Bharadwaj (2011, Copper Labs)
Quazi (Sunny) Hasan (2010, Dematic)
Jared Ashman (2010, Ambit Energy)
Muhammad Safiullah (2008, Microsoft)
M.S. Thesis students
Josue Caraballo (2017)
Damian Jimenez (2017, UTA Ph.D.)
Long Ly (2015)
Sidharth Goyal (2015)
Huadong Feng (2014)
Raju Karki (2012)
Angus Helm (2010)
Aakash Tuli (2010)
B.S. students
Collaborators on Current and Past IDIR Projects
o Bill Adair (Duke, Public Policy)
o Pankaj Agarwal (Duke)
o Xiang Ao (Chinese Academy of Sciences)
o Vassilis Athitsos (UTA)
o Sourav Bhowmick (Nanyang Technological University)
o Sharma Chakravarthy (UTA)
o Gong Cheng (Nanjing University)
o Byron Choi (Hong Kong Baptist University)
o Christoph Csallner (UTA)
o Zhiqiang Lin (Ohio State)
o Ping Luo (Chinese Academy of Sciences)
o Mark Stencel (Duke, Public Policy)
o Mark Tremayne (UTA, Communication)
o Min Wang (Google)
o Xifeng Yan (UCSB)
o Jun Yang (Duke)
o Cong Yu (Google Research)
o Nan Zhang (Penn State)
o Gautam Das (UTA)
o Chris Ding (UTA)
o Ramez Elmasri (UTA)
o Leonidas Fegaras (UTA)
o Peter Fray (University of Technology Sydney, Journalism)
o James Hamilton (Stanford, Communication)
o Naeemul Hassan (Mississippi)
o Wei Hu (Nanjing University)
o Arijit Khan (Nanyang Technological University)
o Angela Lee (UTD, Communication)
Funding Sponsors
Entrepreneurship
Thank you!
