Quality Crowdsourcing for Human Computer Interaction Research


Ed H. Chi	

	

Research Scientist	

Google	

(work done while at [Xerox] PARC)	


 Aniket Kittur, Ed H. Chi, Bongwon Suh. 	

 Crowdsourcing User Studies With Mechanical Turk. In CHI2008.	



                                                                   1
Example Task from Amazon MTurk	





                                    2
Historical Footnote


•  De Prony, 1794, hired hairdressers

•  (unemployed after the French Revolution; knew only
    addition and subtraction)

•  to create logarithmic and trigonometric tables.

•  He managed the process by splitting the
    work into very detailed workflows.

	

   [Figure: scanned excerpt on human computers, from Grier's history]

   –  Grier, When Computers Were Human, 2005




                                                              3
Using Mechanical Turk for user studies


                      Traditional user studies      Mechanical Turk
Task complexity       Complex                       Simple
                      Long                          Short
Task subjectivity     Subjective                    Objective
                      Opinions                      Verifiable
User information      Targeted demographics         Unknown demographics
                      High interactivity            Limited interactivity


         Can Mechanical Turk be used effectively for user studies?



                                                                              4
Task	


•  Assess quality of Wikipedia articles	

•  Started with ratings from expert Wikipedians	

    –  14 articles (e.g., “Germany”, “Noam Chomsky”)

    –  7-point scale	

•  Can we get matching ratings with Mechanical Turk?





                                                          5
Experiment 1	


•  Rate articles on 7-point scales:	

    –  Well written	

    –  Factually accurate	

    –  Overall quality	

•  Free-text input:	

    –  What improvements does the article need?	

•  Paid $0.05 each	





                                                     6
Experiment 1: Good news	


•  58 users made 210 ratings (15 per article)	

   –  $10.50 total	

•  Fast results	

   –  44% within a day, 100% within two days	

   –  Many completed within minutes	





                                                   7
Experiment 1: Bad news	


•  Correlation between turkers and Wikipedians
   only marginally significant (r=.50, p=.07)	

•  Worse, 59% potentially invalid responses	

                           Experiment 1
    Invalid comments           49%
    < 1 min responses          31%

•  Nearly 75% of these done by only 8 users	





                                                  8
Not a good start	

•  Summary of Experiment 1:	

   –  Only marginal correlation with experts.	

   –  Heavy gaming of the system by a minority	

•  Possible Response:	

   –  Can make sure these gamers are not rewarded	

   –  Ban them from doing your HITs in the future

   –  Create a reputation system [Dolores Labs]

•  Can we change how we collect user input?





                                                       9
Design changes	


•  Use verifiable questions to signal monitoring	

   –  “How many sections does the article have?”

   –  “How many images does the article have?”

   –  “How many references does the article have?”





                                                       10
Design changes	


•  Use verifiable questions to signal monitoring	

•  Make malicious answers as high cost as good-faith
   answers	

   –  Provide 4-6 keywords that would give someone a
     good summary of the contents of the article 	





                                                       11
Design changes	


•  Use verifiable questions to signal monitoring	

•  Make malicious answers as high cost as good-faith
   answers	

•  Make verifiable answers useful for completing
   task	

   –  Used tasks similar to how Wikipedians evaluate quality
      (organization, presentation, references)	





                                                               12
Design changes	


•  Use verifiable questions to signal monitoring	

•  Make malicious answers as high cost as good-faith
   answers	

•  Make verifiable answers useful for completing
   task	

•  Put verifiable tasks before subjective responses	

   –  First do objective tasks and summarization	

   –  Only then evaluate subjective quality	

   –  Ecological validity?	





                                                        13
Experiment 2: Results	


    •  124 users provided 277 ratings (~20 per article)	

    •  Significant positive correlation with Wikipedians 	

        –  r=.66, p=.01	

    •  Smaller proportion malicious responses	

    •  Increased time on task	


                           Experiment 1      Experiment 2
    Invalid comments           49%                3%
    < 1 min responses          31%                7%
    Median time               1:30               4:06

                                                               14
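As a worked illustration of the agreement check above (my sketch, with made-up ratings rather than the study's data), per-article mean Turker ratings can be correlated against the expert Wikipedian ratings:

```python
# Sketch with hypothetical ratings (not the study's data): correlate
# per-article mean Turker ratings with expert Wikipedian ratings.
from collections import defaultdict
from scipy.stats import pearsonr

turker_ratings = [("Germany", 6), ("Germany", 5), ("Germany", 7),
                  ("Noam Chomsky", 4), ("Noam Chomsky", 5),
                  ("Article C", 3), ("Article C", 2)]   # 7-point scale
expert_rating = {"Germany": 6, "Noam Chomsky": 5, "Article C": 2}

by_article = defaultdict(list)
for article, rating in turker_ratings:
    by_article[article].append(rating)

articles = sorted(expert_rating)
turker_means = [sum(by_article[a]) / len(by_article[a]) for a in articles]
expert_scores = [expert_rating[a] for a in articles]

r, p = pearsonr(turker_means, expert_scores)  # Exp. 2 reported r=.66, p=.01
print(f"r = {r:.2f}, p = {p:.3f}")
```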
Generalizing to other MTurk studies	


•  Combine objective and subjective questions	

    –  Rapid prototyping: ask verifiable questions about content/
      design of prototype before subjective evaluation	

    –  User surveys: ask common-knowledge questions before
      asking for opinions	

•  Filtering for Quality (a minimal sketch follows below)

    –  Put in a field for free-form responses and filter out
       data without answers

    –  Filter out results that came in too quickly

    –  Sort by WorkerID and look for cut-and-paste answers

    –  Look for suspicious outliers in the data





                                                                   15
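The filters above are mechanical enough to script. A minimal sketch, assuming each assignment is a dict with worker_id, duration_sec, and a free-text comment field (these names are my assumptions for illustration, not MTurk API fields):

```python
# Minimal filtering sketch; field names are assumptions, not MTurk API fields.
from collections import Counter

MIN_SECONDS = 60          # flag responses that came in too quickly
MIN_COMMENT_CHARS = 10    # flag empty or throwaway free-form answers

def suspicious(assignment, duplicated_comments):
    """True if an assignment trips any of the quality filters above."""
    if len(assignment["comment"].strip()) < MIN_COMMENT_CHARS:
        return True                      # no real free-form answer
    if assignment["duration_sec"] < MIN_SECONDS:
        return True                      # finished implausibly fast
    if assignment["comment"] in duplicated_comments:
        return True                      # cut-and-paste across HITs
    return False

def filter_assignments(assignments):
    # Comments appearing more than once are often pasted by one WorkerID;
    # grouping by worker_id would localize the culprits.
    counts = Counter(a["comment"] for a in assignments)
    duplicated = {c for c, n in counts.items() if n > 1}
    # (Outlier screening of numeric ratings is left as a per-study choice.)
    return [a for a in assignments if not suspicious(a, duplicated)]
```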
Quick Summary of Tips	


•  Mechanical Turk offers the practitioner a way to access a
   large user pool and quickly collect data at low cost	

•  Good results require careful task design	


  1.    Use verifiable questions to signal monitoring	

  2.    Make malicious answers as high cost as good-faith answers	

  3.    Make verifiable answers useful for completing task	

  4.    Put verifiable tasks before subjective responses	





                                                                       16
Managing Quality


•  Quality through redundancy: Combining votes (sketch below)

      –  Majority vote [works best when worker quality is similar]

      –  Worker-quality-adjusted vote

      –  Managing dependencies

•  Quality through gold data

      –  Advantageous with imbalanced datasets & bad workers

•  Estimating worker quality (Redundancy + Gold)

      –  Calculate the confusion matrix and see if you actually
         get some information from the worker



•  Toolkit: http://code.google.com/p/get-another-label/




                                  Source: Ipeirotis, WWW2011        17
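To make the vote-combining schemes concrete, here is a minimal sketch of majority vote versus a worker-quality-weighted vote (my own illustration; the get-another-label toolkit implements the real thing, including confusion-matrix estimation):

```python
# Illustrative sketch of the two vote-combining schemes; not code from
# the get-another-label toolkit.
from collections import defaultdict

def majority_vote(votes):
    """Pick the label given by the most workers (ties broken arbitrarily)."""
    counts = defaultdict(int)
    for _worker, label in votes:
        counts[label] += 1
    return max(counts, key=counts.get)

def quality_weighted_vote(votes, worker_accuracy):
    """Weight each vote by the worker's estimated accuracy in [0, 1],
    e.g. measured on gold questions with known answers."""
    scores = defaultdict(float)
    for worker, label in votes:
        scores[label] += worker_accuracy.get(worker, 0.5)  # 0.5 = no info
    return max(scores, key=scores.get)

# Hypothetical item where three workers disagree:
votes = [("w1", "spam"), ("w2", "ham"), ("w3", "ham")]
acc = {"w1": 0.99, "w2": 0.51, "w3": 0.45}
print(majority_vote(votes))               # "ham"  (2 of 3 workers)
print(quality_weighted_vote(votes, acc))  # "spam" (one trusted worker
                                          #  outweighs two weak ones)
```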
Coding and Machine Learning

•  Integration with Machine Learning

    –  Build automatic classification models using
       crowdsourced data (sketch below)

                  Data from existing
                crowdsourced answers
                         |
                         v
    New Case  -->  Automatic Model  -->  Automatic Answer
                   (through machine learning)


                                    Source: Ipeirotis, WWW2011
                                                                     18
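A hedged sketch of this pipeline (toy data; scikit-learn is my choice of library here, not something the slide prescribes): train a text classifier on aggregated crowd labels, then let the model answer new cases, routing low-confidence ones back to the crowd.

```python
# Toy sketch: aggregated crowd labels become training data for a model
# that then answers new cases automatically. scikit-learn is an assumed
# dependency; the texts and labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great article, well sourced", "spam spam buy now",
         "needs citations but readable", "click here for free stuff"]
labels = ["ok", "spam", "ok", "spam"]   # e.g., majority-voted crowd answers

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# New case -> automatic answer; the probability doubles as a confidence
# signal, so low-confidence cases can be routed back to the crowd.
proba = model.predict_proba(["limited time offer, act now"])[0]
print(dict(zip(model.classes_, proba.round(2))))
```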
Limitations of Mechanical Turk	


•  No control of users' environment

    –  Potential for different browsers, physical distractions	

    –  General problem with online experimentation 	

•  Not designed for user studies	

    –  Difficult to do between-subjects design	

    –  May need some programming	

•  Users	

    –  Somewhat hard to control demographics, expertise	





                                                                    19
Crowdsourcing for HCI Research


•  Does my interface/visualization work?	

   –  WikiDashboard: transparency vis for Wikipedia [Suh et al.]	

   –  Replicating Perceptual Experiments [Heer et al., CHI2010]	

•  Coding of large amount of user data	

   –  What is a Question in Twitter? [Sharoda Paul, Lichan Hong, Ed Chi]	

•  Incentive mechanisms	

   –  Intrinsic vs. Extrinsic rewards: Games vs. Pay	

   –  [Horton & Chilton, 2010 for MTurk] and [Ariely, 2009] in general





                                                                              20
Crowdsourcing for HCI Research


•  Does my interface/visualization work?	

   –  WikiDashboard: transparency vis for Wikipedia [Suh et al.]	

   –  Replicating Perceptual Experiments [Heer et al., CHI2010]	

•  Coding of large amount of user data	

   –  What is a Question in Twitter? [Sharoda Paul, Lichan Hong, Ed Chi]	

•  Incentive mechanisms	

   –  Intrinsic vs. Extrinsic rewards: Games vs. Pay	

   –  [Horton & Chilton, 2010 for MTurk] and Satisficing

   –  [Ariely, 2009] in general: Higher pay != Better work	





                                                                              21
Crowd Programming for Complex Tasks


  •  Decompose tasks into smaller tasks	

     –  Digital Taylorism	

     –  Frederick Winslow Taylor (1856-1915) 	

     –  1911: 'The Principles of Scientific Management'

  •  Crowd Programming Explorations	

     –  MapReduce Models (see the sketch after this slide)

         •  Kittur, A.; Smus, B.; and Kraut, R. CHI2011EA on CrowdForge.	

         •  Kulkarni, Can, Hartmann, CHI2011 workshop & WIP

     –  Little, G.; Chilton, L.; Goldman, M.; and Miller, R. C. In
        KDD 2010 Workshop on Human Computation.	





                                                                              23
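A minimal sketch of the partition/map/reduce pattern these systems explore, with post_hit() as a hypothetical stand-in for posting a micro-task and waiting for a worker's answer (not a real API):

```python
# Sketch of a CrowdForge/Turkomatic-style flow. post_hit() is a
# hypothetical stand-in for posting a micro-task to a crowd platform
# and blocking until a worker answers.
def post_hit(instructions: str) -> str:
    raise NotImplementedError("wire this up to your crowd platform")

def crowd_write_article(topic: str) -> str:
    # Partition: one worker turns the big task into section headings.
    outline = post_hit(f"List section headings for an article on {topic}, "
                       "one per line.")
    headings = [h.strip() for h in outline.splitlines() if h.strip()]

    # Map: many workers each do one small, verifiable task in parallel.
    facts = [post_hit(f"State one fact about '{h}' (topic: {topic}).")
             for h in headings]

    # Reduce: another worker consolidates the pieces into a single result.
    return post_hit("Combine these facts into a coherent article:\n"
                    + "\n".join(facts))
```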
Crowd Programming for Complex Tasks

•  Crowd Programming Explorations

   –  Kittur, A.; Smus, B.; and Kraut, R. CHI2011EA on CrowdForge.

   –  Kulkarni, Can, Hartmann, CHI2011 workshop & WIP

   [Background: page from the Turkomatic work-in-progress paper
   (CHI 2011, May 7–12, 2011, Vancouver, BC, Canada). Readable fragments
   describe a partition/map/reduce flow for crowd-written articles and an
   SAT case study: workers were posed the task “Please solve the
   16-question SAT located at http://bit.ly/SATexam” and paid between
   $0.10 and $0.40 per HIT; each “subdivide” and “merge” HIT received
   answers within 4 hours, and solutions to the initial task were
   complete within 72 hours. Figure 4 caption: “For the SAT task, we
   uploaded sixteen questions from a high school Scholastic Aptitude Test
   to the web and posed the following task to Turkomatic.”]

                                                                        24
Future Directions in Crowdsourcing

•  Real-time Crowdsourcing

   –  Bigham, et al. VizWiz, UIST 2010

   [Figure 2 from VizWiz: six questions asked by participants, the
   photographs they took, and the answers received, with latencies.
   Examples: “What color is this pillow?” → (105s) “multiple shades of
   soft green, blue and gold”; “What denomination is this bill?” →
   (24s) “20”; “What temperature is my oven set to?” → (69s) “it looks
   like 425 degrees but the image is difficult to see”, (84s) “400”,
   (122s) “450”; “Can you please tell me what this can is?” → (183s)
   “chickpeas”, (514s) “beans”, (552s) “Goya Beans”.]

                                                                        25
Future Directions in Crowdsourcing
                                 	


•  Real-time Crowdsourcing	

   –  Bigham, et al. VizWiz, UIST 2010	

•  Embedding of Crowdwork inside Tools	

   –  Bernstein, et al. Soylent, UIST 2010





                                                26
Future Directions in Crowdsourcing

•  Real-time Crowdsourcing

   –  Bigham, et al. VizWiz, UIST 2010

•  Embedding of Crowdwork inside Tools

   –  Bernstein, et al. Soylent, UIST 2010

•  Shepherding Crowdwork

   –  Dow et al. CHI2011 WIP

   [Background: excerpt from the Dow et al. work-in-progress paper
   analyzing the design space for crowd feedback: timeliness
   (synchronous feedback while workers are still engaged in a set of
   tasks vs. asynchronous feedback after workers have completed them),
   its impact on learning, engagement, and task performance, and the
   scheduling burden that near real-time feedback places on the
   feedback provider.]

                                                                        27
Tutorials

•    Thanks to Matt Lease http://ir.ischool.utexas.edu/crowd/

•    AAAI 2011 (w/ HCOMP 2011): Human Computation: Core Research
     Questions and State of the Art (E. Law & Luis von Ahn)

•    WSDM 2011: Crowdsourcing 101: Putting the WSDM of Crowds to
     Work for You (Omar Alonso and Matthew Lease)

      –    http://ir.ischool.utexas.edu/wsdm2011_tutorial.pdf

•    LREC 2010 Tutorial: Statistical Models of the Annotation Process (Bob
     Carpenter and Massimo Poesio)

      –    http://lingpipe-blog.com/2010/05/17/

•    ECIR 2010: Crowdsourcing for Relevance Evaluation (Omar Alonso)

      –    http://wwwcsif.cs.ucdavis.edu/~alonsoom/crowdsourcing.html

•    CVPR 2010: Mechanical Turk for Computer Vision (Alex Sorokin and
     Fei-Fei Li)

      –    http://sites.google.com/site/turkforvision/

•    CIKM 2008: Crowdsourcing for Relevance Evaluation (D. Rose)

      –    http://videolectures.net/cikm08_rose_cfre/

•    WWW2011: Managing Crowdsourced Human Computation (Panos
     Ipeirotis)

      –    http://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation



                                                                                                      28
Thanks!

•  chi@acm.org	

•  http://edchi.net

•  @edchi	



•    Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies
     With Mechanical Turk. In Proceedings of the ACM Conference on Human-
     factors in Computing Systems (CHI2008), pp.453-456. ACM Press, 2008.
     Florence, Italy.	

•    Aniket Kittur, Bongwon Suh, Ed H. Chi. Can You Ever Trust a Wiki?
     Impacting Perceived Trustworthiness in Wikipedia. In Proc. of Computer-
     Supported Cooperative Work (CSCW2008), pp. 477-480. ACM Press, 2008.
     San Diego, CA. [Best Note Award]	




                                                                               29
