How to Measure Quality
with Disagreement?
or the Three Sides of
CrowdTruth
Lora Aroyo & Chris Welty
CrowdTruth
Annotator disagreement is signal, not noise.
It is indicative of the variation in human
semantic interpretation of signs.
It can indicate ambiguity, vagueness,
similarity, over-generality, etc., as well as
quality.
CrowdTruth Dependencies
worker metrics for detecting spam
à quality of sentences
à quality of the target semantics
worker quality metrics can improve significantly when the
quality of these other aspects of semantic interpretation
are considered
The Three Sides of
CrowdTruth
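Representation
Worker Vector
[Figure: a worker's vector for one sentence, with a 1 for each relation the worker selected]
Representation
Sentence Vector
[Figure: the workers' vectors for one sentence stacked above the resulting sentence vector 0 1 1 0 0 4 3 0 0 5 1 0]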
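The representation can be sketched in a few lines of Python. This is a minimal illustration, assuming the sentence vector is the component-wise sum of the workers' binary vectors, which is what the counts on the slide suggest. The function name and the three example workers are hypothetical.

```python
from typing import List


def sentence_vector(worker_vectors: List[List[int]]) -> List[int]:
    """Sum the workers' binary vectors component-wise (assumed aggregation)."""
    if not worker_vectors:
        return []
    return [sum(column) for column in zip(*worker_vectors)]


# Three hypothetical workers annotating one sentence over four relations.
workers = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
]
print(sentence_vector(workers))  # [2, 1, 0, 2]
```

Each component then counts how many workers chose that relation for the sentence.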
Feeling the way the CHEST expands (PALPATION), can identify areas of
the lung that are full of fluid.
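Is CHEST related to PALPATION?
[Figure: number of workers choosing each relation; labelled relations include diagnose, location, associated with, is_a, part_of and other. Counts as extracted: 0 0 02 3 0 0 0 1 0 0 44 1]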
Disagreement for Sentence
Clarity
Unclear relationship between the two arguments
reflected in the disagreement
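Is HYPERAEMIA related to CONJUNCTIVITIS?
[Figure: number of workers choosing each relation; labelled relations include symptom and cause. Counts as extracted: 0 0 0 1 0 0 0 013 0 0 0 0 0]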
Redness (HYPERAEMIA), irritation (chemosis) and watering (epiphora)
of the eyes are symptoms common to all forms of CONJUNCTIVITIS.
Disagreement for Sentence
Clarity
Clearly expressed relation between the two
arguments reflected in the agreement
Sentence-Relation Score
Measures how clearly a sentence expresses a relation
[Figure: the sentence vector 0 1 1 0 0 4 3 0 0 5 1 0 compared with the unit vector for relation R6; Cosine = .55]
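The figure on this slide can be reproduced numerically. The sketch below assumes that R6 is the sixth component of the vector (index 5 in Python), which is consistent with the reported cosine of .55. The function name is hypothetical.

```python
import math
from typing import Sequence


def sentence_relation_score(sentence_vec: Sequence[float], relation_index: int) -> float:
    """Cosine between the sentence vector and the unit vector of one relation.

    Because the unit vector has a single 1, the dot product is just the
    component for that relation, divided by the norm of the sentence vector.
    """
    norm = math.sqrt(sum(x * x for x in sentence_vec))
    if norm == 0:
        return 0.0  # assumption: a sentence with no annotations scores 0
    return sentence_vec[relation_index] / norm


slide_vector = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
r6 = sentence_relation_score(slide_vector, 5)  # R6 assumed to be index 5
print(round(r6, 2))  # 0.55
```

Here the norm is sqrt(53) and the R6 component is 4, so the score is 4 / sqrt(53), roughly 0.55.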
Worker Disagreement
Measured per worker
Worker-sentence disagreement
[Figure: the worker's sentence vector compared with the sentence vector 0 1 1 0 0 4 3 0 0 5 1 0; AVG (Cosine)]
Worker Metrics
how much A WORKER disagrees with THE CROWD per sentence → the avg
of all cosine distances between each worker's sentence vector & the full sentence
vector (minus that worker)
are there consistently like-minded workers → pairwise metric, avg for a
particular worker → there may be communities of thought that consistently
disagree with others, but agree within themselves
Low quality workers generally have high scores in both
avg relations per sentence → per worker, the number of relations he/she
chooses per sentence, averaged over all sentences he/she annotates.
A high score here can help indicate low quality workers.
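The three worker metrics above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: cosine distance is taken as 1 minus cosine similarity, annotations are stored as a hypothetical mapping from sentence id to worker id to binary vector, and all function names and the toy data are invented for the example.

```python
import math
from typing import Dict, List

# Hypothetical layout: sentence id -> worker id -> binary relation vector.
Annotations = Dict[str, Dict[str, List[int]]]


def cosine(a: List[int], b: List[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def worker_sentence_disagreement(annotations: Annotations, worker: str) -> float:
    """Avg cosine distance between the worker's vector and the rest of the crowd."""
    distances = []
    for workers in annotations.values():
        if worker not in workers:
            continue
        others = [v for w, v in workers.items() if w != worker]
        if not others:
            continue  # the worker was the only annotator of this sentence
        rest = [sum(col) for col in zip(*others)]  # sentence vector minus that worker
        distances.append(1.0 - cosine(workers[worker], rest))
    return _mean(distances)


def worker_worker_disagreement(annotations: Annotations, worker: str) -> float:
    """Avg pairwise cosine distance between the worker and each co-annotator."""
    distances = []
    for workers in annotations.values():
        if worker not in workers:
            continue
        for other, vec in workers.items():
            if other != worker:
                distances.append(1.0 - cosine(workers[worker], vec))
    return _mean(distances)


def avg_relations_per_sentence(annotations: Annotations, worker: str) -> float:
    """Avg number of relations the worker selects per annotated sentence."""
    return _mean([float(sum(ws[worker])) for ws in annotations.values() if worker in ws])


# Toy data: "spam" selects every relation on every sentence.
annotations = {
    "s1": {"w1": [1, 0, 0], "w2": [1, 0, 0], "spam": [1, 1, 1]},
    "s2": {"w1": [0, 1, 0], "w2": [0, 1, 0], "spam": [1, 1, 1]},
}
print(round(worker_sentence_disagreement(annotations, "spam"), 3))  # 0.423
print(round(worker_sentence_disagreement(annotations, "w1"), 3))    # 0.184
print(avg_relations_per_sentence(annotations, "spam"))              # 3.0
```

In this toy data the worker who selects everything scores higher on both disagreement metrics and on relations per sentence, which matches the pattern described for low quality workers.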
Sentence Metrics
Sentence-relation score → core CrowdTruth metric for
relation extraction → measured for each relation on each
sentence as the cosine of the unit vector for the relation
with the sentence vector,
indicating that a relation is clearly or vaguely expressed.
Sentence clarity → defined for each sentence as the max
relation score for that sentence,
indicating a clear or ambiguous or confusing sentence
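Sentence clarity follows directly from the per-relation scores. The sketch below is a minimal illustration using the 12-component vector shown on the Sentence-Relation Score slide. The function names and the second, fully agreed vector are hypothetical.

```python
import math
from typing import List, Sequence


def relation_scores(sentence_vec: Sequence[float]) -> List[float]:
    """Sentence-relation score for every relation: component / vector norm."""
    norm = math.sqrt(sum(x * x for x in sentence_vec))
    if norm == 0:
        return [0.0] * len(sentence_vec)  # assumption: no annotations scores 0
    return [x / norm for x in sentence_vec]


def sentence_clarity(sentence_vec: Sequence[float]) -> float:
    """Max sentence-relation score over all relations."""
    return max(relation_scores(sentence_vec), default=0.0)


slide_vector = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
print(round(sentence_clarity(slide_vector), 2))  # 0.69

# Hypothetical sentence on which all 14 workers pick the same relation.
agreed_vector = [0, 0, 0, 0, 14, 0]
print(sentence_clarity(agreed_vector))  # 1.0
```

A vector spread over several relations stays well below 1, while full agreement on one relation reaches exactly 1.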
Relation Metrics
Relation similarity → the causal power (pairwise conditional
probability). A high similarity score indicates the relations are
confusable to workers
Relation ambiguity → defined for each relation as the max relation
similarity for the relation. If a relation is clear, then it will have a low
score.
Relation clarity → defined for each relation as the max
sentence-relation score for the relation over all sentences.
If a relation has a high clarity score, it means that
it is at least possible to express the relation clearly
Relation frequency → the number of times the relation is
annotated at least once in a sentence
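The four relation metrics can be sketched over a list of sentence vectors. This is one possible reading, labelled as an assumption: relation similarity is computed as the conditional probability that relation Rj is annotated in a sentence given that Ri is annotated in it, and "annotated" means a count above zero. The function names and the three example sentence vectors are hypothetical.

```python
import math
from typing import List

SentenceVectors = List[List[int]]


def relation_frequency(vectors: SentenceVectors, r: int) -> int:
    """Number of sentences in which relation r is annotated at least once."""
    return sum(1 for v in vectors if v[r] > 0)


def relation_similarity(vectors: SentenceVectors, ri: int, rj: int) -> float:
    """Assumed form: P(rj annotated in a sentence | ri annotated in it)."""
    with_ri = [v for v in vectors if v[ri] > 0]
    if not with_ri:
        return 0.0
    return sum(1 for v in with_ri if v[rj] > 0) / len(with_ri)


def relation_ambiguity(vectors: SentenceVectors, r: int) -> float:
    """Max similarity between relation r and any other relation."""
    n = len(vectors[0]) if vectors else 0
    return max(
        (relation_similarity(vectors, r, other) for other in range(n) if other != r),
        default=0.0,
    )


def relation_clarity(vectors: SentenceVectors, r: int) -> float:
    """Max sentence-relation score for relation r over all sentences."""
    def score(v: List[int]) -> float:
        norm = math.sqrt(sum(x * x for x in v))
        return v[r] / norm if norm else 0.0
    return max((score(v) for v in vectors), default=0.0)


# Hypothetical sentence vectors over three relations.
vectors = [
    [5, 0, 1],
    [4, 3, 0],
    [0, 6, 0],
]
print(relation_frequency(vectors, 0))      # 2
print(relation_similarity(vectors, 0, 1))  # 0.5
print(relation_ambiguity(vectors, 2))      # 1.0
print(relation_clarity(vectors, 1))        # 1.0
```

In this toy data relation 2 only ever appears together with relation 0, so its ambiguity is maximal, while relation 1 has one sentence where every selection agrees, so its clarity is 1.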
Impact of Dependencies
Impact of Sentence Quality on
Worker Quality
(a) the space with no filtering of sentences or relations: a single line cannot separate the
spammers from non-spammers
(b) the space after sentence filtering, (c) after relation filtering, and (d)
after both sentence and relation filtering. Sentence filtering makes the classes
linearly separable, and the separation between the classes increases in the
subsequent figures.
Impact of Relation
Quality on Worker
Quality
(a) the space with no filtering of sentences or relations:
a single line cannot separate the spammers from non-spammers
(c) after relation filtering
Relation filtering defines the space much more clearly,
with a large
separation between positive and
negative instances.
The pairwise improvements to the
worker scores are significant
with p < .001, which is better than the
sentence clarity improvements
Combining Sentence &
Relation Filtering
•  first filtering out low clarity
sentences
•  then filtering vague and
ambiguous relations
•  worker metrics were
computed on these new
sentences and vectors
•  this separates the space
even further, and the
pairwise improvement in
worker scores from the
baseline (unfiltered) is
significant with p < .0005.
•  The improvement over
sentence filtering alone is
also significant (p < .01)
•  The improvement over
relation filtering alone is only
significant with p < .05.
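The filtering order in the list above can be sketched as a small preprocessing step. This is an illustration only: the clarity threshold, the set of dropped relation indices, the data layout (sentence id to worker id to binary vector) and the function names are all hypothetical, and the slides do not state which cut-offs were used.

```python
import math
from typing import Dict, List, Set

# Hypothetical layout: sentence id -> worker id -> binary relation vector.
Annotations = Dict[str, Dict[str, List[int]]]


def sentence_clarity(worker_vectors: List[List[int]]) -> float:
    """Max sentence-relation score of the summed sentence vector."""
    vec = [sum(col) for col in zip(*worker_vectors)]
    norm = math.sqrt(sum(x * x for x in vec))
    return max(vec) / norm if norm else 0.0


def filter_annotations(
    annotations: Annotations,
    min_clarity: float,           # hypothetical threshold
    dropped_relations: Set[int],  # hypothetical indices of vague or ambiguous relations
) -> Annotations:
    """Drop low clarity sentences first, then remove the given relation columns."""
    filtered: Annotations = {}
    for sentence_id, workers in annotations.items():
        if sentence_clarity(list(workers.values())) < min_clarity:
            continue  # step 1: sentence filtering
        filtered[sentence_id] = {
            worker: [x for i, x in enumerate(vec) if i not in dropped_relations]
            for worker, vec in workers.items()
        }  # step 2: relation filtering
    return filtered


annotations = {
    "clear": {"w1": [1, 0, 0], "w2": [1, 0, 0], "w3": [1, 0, 1]},
    "unclear": {"w1": [1, 0, 0], "w2": [0, 1, 0], "w3": [0, 0, 1]},
}
result = filter_annotations(annotations, min_clarity=0.7, dropped_relations={2})
print(result)  # {'clear': {'w1': [1, 0], 'w2': [1, 0], 'w3': [1, 0]}}
```

Worker metrics would then be recomputed on the returned sentences and shortened vectors, as the third bullet describes.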
quality measures in
semantic interpretation tasks
are inter-dependent
higher accuracy can be achieved by considering the impact of
sentence quality & relation quality on worker quality measurements
significant improvement in worker quality metrics with respect
to known spammers by incorporating the quality of the individual
sentences & target relations
relationships between the different corners of the triangle of
reference, e.g.
à the impact of relation & worker quality on sentence measures,
à the impact of worker & sentence quality on relation measures
crowdtruth.org

(Presentation Chris) Crowdsourcing & Semantic Web: Dagstuhl 2014
