Toward Better Crowdsourcing Science
(& Predicting Annotator Performance)
Matt Lease
School of Information
University of Texas at Austin
ir.ischool.utexas.edu
@mattlease
ml@utexas.edu
Slides: www.slideshare.net/mattlease
“The place where people & technology meet”
~ Wobbrock et al., 2009
www.ischools.org
The Future of Crowd Work, CSCW’13
by Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
3
Matt Lease <ml@utexas.edu>
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
4
Matt Lease <ml@utexas.edu>
Roadmap
Hyun Joon Jung
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
5
Matt Lease <ml@utexas.edu>
Roadmap
A Popular Tale of Crowdsourcing Woe
• Heroic ML researcher asks the
crowd to perform a simple task
• Crowd (invariably) screws it up…
• “Aha!” cries the ML researcher, “Fortunately,
I know exactly how to solve this problem!”
Matt Lease <ml@utexas.edu>
6
Matt Lease <ml@utexas.edu>
7
But why can’t the workers just get it
right to begin with?
Matt Lease <ml@utexas.edu>
8
Is everyone just lazy, stupid, or deceitful?!?
Much of our literature
seems to suggest this:
• Cheaters
• Fraudsters
• “Lazy Turkers”
• Scammers
• Spammers
Another story (a parable)
“We had a great software interface, but we went
out of business because our customers were too
stupid to figure out how to use it.”
Moral
• Even if a user were stupid or lazy, we still lose
• By accepting our own responsibility, we create
another opportunity to fix the problem…
– Cynical view: idiot-proofing
Matt Lease <ml@utexas.edu>
9
What is our responsibility?
• Ill-defined/incomplete/ambiguous/subjective task?
• Confusing, difficult, or unusable interface?
• Incomplete or unclear instructions?
• Insufficient or unhelpful examples given?
• Gold standard with low or unknown inter-assessor
agreement (i.e. measurement error in assessing
response quality)?
• Task design matters! (garbage in = garbage out)
– Report it for review, completeness, & reproducibility
Matt Lease <ml@utexas.edu>
10
A Few Simple Suggestions (1 of 2)
1. Make task self-contained: everything the worker
needs to know should be visible in-task
2. Short, simple, & clear instructions with examples
3. Avoid domain-specific & advanced terminology;
write for typical people (e.g., your mom)
4. Engage worker / avoid boring stuff. If possible,
select interesting content for people to work on
5. Always ask for open-ended feedback
Matt Lease <ml@utexas.edu>
11
Omar Alonso. Guidelines for Designing Crowdsourcing-based Relevance Experiments. 2009.
Suggested Sequencing (2 of 2)
1. Simulate first draft of task with your in-house personnel.
Assess, revise, & iterate (ARI)
2. Run task using relatively few workers & examples (ARI)
1. Do workers understand the instructions?
2. How long does it take? Is pay effective & ethical?
3. Replicate results on another dataset (generalization). (ARI)
4. [Optional] qualification test. (ARI)
5. Increase items. Look for boundary items & noisy gold (ARI)
6. Increase # of workers (ARI)
Matt Lease <ml@utexas.edu>
12
Omar Alonso. Guidelines for Designing Crowdsourcing-based Relevance Experiments. 2009.
Toward Better Crowdsourcing Science
Goal: Strengthen individual studies and minimize
unwarranted spread of bias in our scientific literature
• Occam’s Razor: avoid making assumptions beyond
what the data actually tells us (avoid prejudice!)
• Enumerate hypotheses for possible causes of low data
quality, assess supporting evidence for each hypothesis,
and for any claims made, cite supporting evidence
• Recognize uncertainty of analyses and convey this via
hedge statements such as, “the data suggests that…”
• Avoid derogatory language use without very strong
supporting evidence. The crowd enables our work!!
– Acknowledge your workers!
Matt Lease <ml@utexas.edu>
13
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
14
Matt Lease <ml@utexas.edu>
Roadmap
Who are
the workers?
• A. Baio, November 2008. The Faces of Mechanical Turk.
• P. Ipeirotis. March 2010. The New Demographics of
Mechanical Turk
• J. Ross, et al. Who are the Crowdworkers? CHI 2010.
15
Matt Lease <ml@utexas.edu>
CACM August, 2013
16
Paul Hyman. Communications of the ACM, Vol. 56 No. 8, Pages 19-21, August 2013.
Matt Lease <ml@utexas.edu>
• “Contribute to society and human well-being”
• “Avoid harm to others”
“As an ACM member I will
– Uphold and promote the principles of this Code
– Treat violations of this code as inconsistent with membership in the ACM”
17
Matt Lease <ml@utexas.edu>
“Which approaches are less expensive and is this sensible? With the advent of
outsourcing and off-shoring these matters become more complex and take on new
dimensions …there are often related ethical issues concerning exploitation…
“…legal, social, professional and ethical [topics] should feature in all computing degrees.”
2008 ACM/IEEE Curriculum Update
• Mistakes are made in HIT rejection & worker blocking
– e.g., student error, bug, poor task design, noisy gold, etc.
• Workers have limited recourse for appeal
• Our errors impact real people’s lives
• What is the loss function to optimize?
• Should anyone hold researchers accountable? IRB?
• How do we balance the risk of human harm vs.
the potential benefit if our research succeeds?
Power Asymmetry on MTurk
18
Matt Lease <ml@utexas.edu>
ACM: “Contribute to society and human
well-being; avoid harm to others”
• How do we know who is doing the work, or if a
decision to work (for a given price) is freely made?
• Does it matter if work is performed by
– Political refugees? Children? Prisoners? Disabled?
• What (if any) moral obligation do crowdsourcing
researchers have to consider broader impacts of
our research (either good or bad) on the lives of
those we depend on to power our systems?
Matt Lease <ml@utexas.edu>
19
Who Are We Building a Better Future For?
• Irani and Silberman (2013)
– “…AMT helps employers see themselves as builders
of innovative technologies, rather than employers
unconcerned with working conditions.”
• Silberman, Irani, and Ross (2010)
– “How should we… conceptualize the role of the
people we ask to power our computing?”
20
Could Effective Human Computation
Sometimes Be a Bad Idea?
• The Googler who Looked at the Worst of the Internet
• Policing the Web’s Lurid Precincts
• Facebook content moderation
• The dirty job of keeping Facebook clean
• Even linguistic annotators report stress &
nightmares from reading news articles!
21
Matt Lease <ml@utexas.edu>
Join the conversation!
Crowdwork-ethics, by Six Silberman
http://crowdwork-ethics.wtf.tw
an informal, occasional blog for researchers
interested in ethical issues in crowd work
22
Matt Lease <ml@utexas.edu>
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
23
Matt Lease <ml@utexas.edu>
Roadmap
Hyun Joon Jung
Quality Control in Crowdsourcing
7/10/2015 24
Setting: a requester posts tasks to an online marketplace, where crowd workers complete them.
Existing quality control methods:
• Label Aggregation
• Workflow Design
• Worker Management
• Task Design
This talk: Who is more accurate? (worker performance estimation and prediction)
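Label aggregation is the most familiar of these existing methods. As a point of reference (a generic baseline, not a method proposed in this deck), a minimal majority-vote aggregator in Python might look like this:

```python
from collections import Counter

def majority_vote(labels_by_item):
    """Aggregate redundant worker labels per item by simple majority vote.

    labels_by_item: dict mapping item id -> list of worker labels.
    Returns dict mapping item id -> aggregated label (ties broken arbitrarily).
    """
    aggregated = {}
    for item, labels in labels_by_item.items():
        aggregated[item] = Counter(labels).most_common(1)[0][0]
    return aggregated

# Example: three workers label two items
labels = {"doc1": [1, 1, 0], "doc2": [0, 0, 1]}
print(majority_vote(labels))  # {'doc1': 1, 'doc2': 0}
```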
Motivation
Matt Lease <ml@utexas.edu>
25
Equally Accurate Workers?
7/10/2015 26
Correctness of the ith task instance over time t (1 = correct, 0 = wrong):
Alice: 1 0 1 0 1 0 1 0 1 0
Bob:   0 0 0 0 1 0 1 1 1 1
Accuracy(Alice) = Accuracy(Bob) = 0.5
But should we expect equal work quality in the future?
What if examples are not i.i.d.? Bob seems to be improving over time.
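To make the point concrete, here is a small illustrative check (my own heuristic for the slide's example, not the model used in the papers that follow): both workers score 0.5 overall, yet a simple split-half comparison already hints that Bob is improving while Alice is not.

```python
def accuracy(history):
    """Fraction of correct labels (1 = correct, 0 = wrong)."""
    return sum(history) / len(history)

def trend(history):
    """Second-half accuracy minus first-half accuracy (crude improvement signal)."""
    mid = len(history) // 2
    return accuracy(history[mid:]) - accuracy(history[:mid])

alice = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
bob   = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

print(accuracy(alice), accuracy(bob))  # 0.5 0.5 -- identical overall accuracy
print(trend(alice), trend(bob))        # -0.2 0.6 -- but Bob is trending upward
```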
1: Time-series model
27
Latent Autoregressive Noise Model
• Latent variable x_t; real observation y_t = f(x_t)
• Temporal correlation φ: how frequently y has changed over time
• Offset c: its sign indicates the direction between correct vs. not
Example sequence: y_t = 1, 0, 1, 0 with latent x_t = 0.8, -0.3, 0.4, -0.1 (parameters c, φ applied at each step)
Estimated with an EM variant (LAMORE, Park et al. 2014)
Jung et al. Predicting Next Label Quality: A Time-Series Model of Crowdwork. AAAI HCOMP 2014.
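A minimal sketch of the latent autoregressive idea, under the assumption of a standard AR(1) latent state with a sign-threshold observation function; the actual model is fit with the LAMORE EM variant cited above, which this toy simulation does not implement.

```python
import random

def simulate_lar(c, phi, sigma, T, seed=0):
    """Simulate a latent AR(1) quality state x_t and binary label correctness y_t.

    x_t = c + phi * x_{t-1} + noise   (latent quality)
    y_t = 1 if x_t > 0 else 0         (observed label correctness)
    """
    random.seed(seed)
    x, xs, ys = 0.0, [], []
    for _ in range(T):
        x = c + phi * x + random.gauss(0.0, sigma)
        xs.append(x)
        ys.append(1 if x > 0 else 0)
    return xs, ys

# A worker with positive offset c and strong temporal correlation phi tends to
# produce runs of correct labels rather than i.i.d. coin flips.
xs, ys = simulate_lar(c=0.1, phi=0.8, sigma=0.3, T=10)
print([round(x, 2) for x in xs])
print(ys)
```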
2: Modeling More Features
7/10/2015 28
Predict an assessor's next label quality based on a single feature (e.g., Alice's temporal effect): 0.6 → 0, 0.5 → 1, 0.4 → 0, 0.3 → ?
vs. integrate multi-dimensional features of a crowd assessor:
accuracy  time  temporal effect  topic familiarity  # of labels  next label
0.70      10.3  0.6              0.8                20           0
0.60       8.5  0.5              0.2                21           1
0.65       7.5  0.4              0.4                22           0
0.63      11.5  0.3              0.5                23           ?
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
Features
7/10/2015 29
How do we flexibly capture a wider range of assessor behaviors by incorporating multi-dimensional features?
Feature groups: various accuracy measures, task features, temporal features (drawing on [1], [2], [3])
[1] Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. SIGIR ’10
[2] Ipeirotis, P.G., Gabrilovich, E.: Quizz: targeted crowdsourcing with a billion (potential) users. WWW ’14
[3] Jung, H., et al.: Predicting Next Label Quality: A Time-Series Model of Crowdwork. HCOMP ’14
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
Model
7/10/2015 30
Generalizable feature-based Assessor Model (GAM)
• Input: X (features for the crowd assessor model)
• Learning framework (figure in original deck)
• Output: Y (likelihood of getting a correct label at time t)
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
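A hedged sketch of the discriminative setup: build a per-timestep feature row for an assessor and fit an off-the-shelf classifier to predict whether the next label will be correct. Logistic regression via scikit-learn is used here as a stand-in; the feature values are illustrative (taken from the earlier table), and the paper's GAM may use a different learner and feature set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-timestep feature rows for one assessor: accuracy so far, time on task,
# temporal effect, topic familiarity, number of labels produced.
X = np.array([
    [0.70, 10.3, 0.6, 0.8, 20],
    [0.60,  8.5, 0.5, 0.2, 21],
    [0.65,  7.5, 0.4, 0.4, 22],
])
y = np.array([0, 1, 0])  # was the assessor's next label correct?

model = LogisticRegression().fit(X, y)

# Predict the probability that the assessor's next label will be correct.
x_next = np.array([[0.63, 11.5, 0.3, 0.5, 23]])
print(model.predict_proba(x_next)[0, 1])
```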
Which Features Matter?
7/10/2015 31
Relative feature importance across 54 individual prediction models (importance counts per feature): AA (49), BA_opt (43), BA_PES (39), C (28), NumLabels (27), CurrentLabelQuality (23), AccChangeDirection (22), SA (20), Phi (19), BA_uni (16), TaskTime (10), TopicChange (7), TopicEverSeen (5).
Prediction performance (MAE) of assessors' next judgments and corresponding coverage across varying decision-rejection options (δ = 0 to 0.25 in steps of 0.05): while the other methods show a significant decrease in coverage, GAM shows better coverage as well as prediction performance under all the given reject options.
A GAM with only the top 5 features shows good performance (7-10% below the full-featured GAM).
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
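The rejection-option comparison above can be read as selective prediction: only emit a prediction when the model is confident enough, trading coverage for accuracy. A minimal sketch of that idea (my construction, not the paper's evaluation code):

```python
def selective_predict(probs, delta):
    """Predict only when |p - 0.5| >= delta; otherwise abstain.

    probs: predicted probabilities that the next label is correct.
    Returns (predictions, coverage), with None marking an abstention.
    """
    preds = [(1 if p >= 0.5 else 0) if abs(p - 0.5) >= delta else None
             for p in probs]
    coverage = sum(p is not None for p in preds) / len(preds)
    return preds, coverage

probs = [0.9, 0.55, 0.48, 0.2, 0.7]
for delta in (0.0, 0.1, 0.25):
    preds, cov = selective_predict(probs, delta)
    print(delta, preds, round(cov, 2))  # coverage shrinks as delta grows
```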
3: Reducing Supervision
Matt Lease <ml@utexas.edu>
32
Jung & Lease. Modeling Temporal Crowd Work Quality with Limited Supervision. HCOMP 2015.
Soft Label Updating & Discounting
Matt Lease <ml@utexas.edu>
33
Soft Label Updating
Matt Lease <ml@utexas.edu>
34
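These two slides show figures in the original deck. One plausible reading of soft label updating with discounting (an assumption on my part, not necessarily the HCOMP 2015 algorithm): keep a running estimate of a worker's label quality, update it with soft (probabilistic) correctness signals when gold labels are unavailable, and discount older evidence so the estimate can track temporal change.

```python
def update_quality(quality, soft_correct, discount=0.9):
    """Exponentially discounted running estimate of a worker's label quality.

    quality:      current estimate in [0, 1]
    soft_correct: probability that the latest label is correct
                  (1.0/0.0 when gold is available, a model prediction otherwise)
    discount:     weight on past evidence; lower = faster adaptation
    """
    return discount * quality + (1 - discount) * soft_correct

q = 0.5  # uninformative prior
for signal in [1.0, 1.0, 0.2, 0.8, 0.9]:  # mix of gold and soft signals
    q = update_quality(q, signal)
    print(round(q, 3))
```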
The Future of Crowd Work, CSCW’13
by Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
35
Matt Lease <ml@utexas.edu>
Thank You!
ir.ischool.utexas.edu          Slides: www.slideshare.net/mattlease