Quality Crowdsourcing for Human Computer Interaction Research


Ed H. Chi	

	

Research Scientist	

Google	

(work done while at [Xerox] PARC)	


 Aniket Kittur, Ed H. Chi, Bongwon Suh. 	

 Crowdsourcing User Studies With Mechanical Turk. In CHI2008.	



                                                                   1
Example Task from Amazon MTurk	





                                    2
Historical Footnote


•  De Prony, 1794, hired hairdressers

•  (unemployed after the French Revolution; knew only
    addition and subtraction)

•  to create logarithmic and trigonometric tables.

•  He managed the process by splitting the
    work into very detailed workflows.

	

   [Figure: scanned excerpt on human computers, from Grier's history]

   –  Grier, When Computers Were Human, 2005




                                                              3
Using Mechanical Turk for user studies


                      Traditional user studies      Mechanical Turk
Task complexity       Complex                       Simple
                      Long                          Short
Task subjectivity     Subjective                    Objective
                      Opinions                      Verifiable
User information      Targeted demographics         Unknown demographics
                      High interactivity            Limited interactivity


         Can Mechanical Turk be used effectively for user studies?



                                                                              4
Task	


•  Assess quality of Wikipedia articles	

•  Started with ratings from expert Wikipedians	

    –  14 articles (e.g., “Germany”, “Noam Chomsky”)

    –  7-point scale	

•  Can we get matching ratings with Mechanical Turk?





                                                          5
Experiment 1	


•  Rate articles on 7-point scales:	

    –  Well written	

    –  Factually accurate	

    –  Overall quality	

•  Free-text input:	

    –  What improvements does the article need?	

•  Paid $0.05 each	





                                                     6
Experiment 1: Good news	


•  58 users made 210 ratings (15 per article)	

   –  $10.50 total	

•  Fast results	

   –  44% within a day, 100% within two days	

   –  Many completed within minutes	





                                                   7
Experiment 1: Bad news	


•  Correlation between turkers and Wikipedians
   only marginally significant (r=.50, p=.07)	

•  Worse, 59% potentially invalid responses	

                           Experiment 1
    Invalid comments           49%
    < 1 min responses          31%

•  Nearly 75% of these done by only 8 users	





                                                  8
Not a good start	

•  Summary of Experiment 1:	

   –  Only marginal correlation with experts.	

   –  Heavy gaming of the system by a minority	

•  Possible Response:	

   –  Can make sure these gamers are not rewarded	

   –  Ban them from doing your HITs in the future

   –  Create a reputation system [Dolores Labs]

•  Can we change how we collect user input?





                                                       9
Design changes	


•  Use verifiable questions to signal monitoring	

   –  “How many sections does the article have?”

   –  “How many images does the article have?”

   –  “How many references does the article have?”





                                                       10
Design changes	


•  Use verifiable questions to signal monitoring	

•  Make malicious answers as high cost as good-faith
   answers	

   –  Provide 4-6 keywords that would give someone a
     good summary of the contents of the article 	





                                                       11
Design changes	


•  Use verifiable questions to signal monitoring	

•  Make malicious answers as high cost as good-faith
   answers	

•  Make verifiable answers useful for completing
   task	

   –  Used tasks similar to how Wikipedians evaluate quality
      (organization, presentation, references)	





                                                               12
Design changes	


•  Use verifiable questions to signal monitoring	

•  Make malicious answers as high cost as good-faith
   answers	

•  Make verifiable answers useful for completing
   task	

•  Put verifiable tasks before subjective responses	

   –  First do objective tasks and summarization	

   –  Only then evaluate subjective quality	

   –  Ecological validity?	





                                                        13
Experiment 2: Results	


    •  124 users provided 277 ratings (~20 per article)	

    •  Significant positive correlation with Wikipedians 	

        –  r=.66, p=.01	

    •  Smaller proportion malicious responses	

    •  Increased time on task	


                           Experiment 1      Experiment 2
    Invalid comments           49%                3%
    < 1 min responses          31%                7%
    Median time               1:30               4:06

                                                               14
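As a worked illustration of the agreement check above (my sketch, with made-up ratings rather than the study's data), per-article mean Turker ratings can be correlated against the expert Wikipedian ratings:

```python
# Sketch with hypothetical ratings (not the study's data): correlate
# per-article mean Turker ratings with expert Wikipedian ratings.
from collections import defaultdict
from scipy.stats import pearsonr

turker_ratings = [("Germany", 6), ("Germany", 5), ("Germany", 7),
                  ("Noam Chomsky", 4), ("Noam Chomsky", 5),
                  ("Article C", 3), ("Article C", 2)]   # 7-point scale
expert_rating = {"Germany": 6, "Noam Chomsky": 5, "Article C": 2}

by_article = defaultdict(list)
for article, rating in turker_ratings:
    by_article[article].append(rating)

articles = sorted(expert_rating)
turker_means = [sum(by_article[a]) / len(by_article[a]) for a in articles]
expert_scores = [expert_rating[a] for a in articles]

r, p = pearsonr(turker_means, expert_scores)  # Exp. 2 reported r=.66, p=.01
print(f"r = {r:.2f}, p = {p:.3f}")
```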
Generalizing to other MTurk studies	


•  Combine objective and subjective questions	

    –  Rapid prototyping: ask verifiable questions about content/
      design of prototype before subjective evaluation	

    –  User surveys: ask common-knowledge questions before
      asking for opinions	

•  Filtering for Quality (a minimal sketch follows below)

    –  Put in a field for free-form responses and filter out
       data without answers

    –  Filter out results that came in too quickly

    –  Sort by WorkerID and look for cut-and-paste answers

    –  Look for suspicious outliers in the data





                                                                   15
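The filters above are mechanical enough to script. A minimal sketch, assuming each assignment is a dict with worker_id, duration_sec, and a free-text comment field (these names are my assumptions for illustration, not MTurk API fields):

```python
# Minimal filtering sketch; field names are assumptions, not MTurk API fields.
from collections import Counter

MIN_SECONDS = 60          # flag responses that came in too quickly
MIN_COMMENT_CHARS = 10    # flag empty or throwaway free-form answers

def suspicious(assignment, duplicated_comments):
    """True if an assignment trips any of the quality filters above."""
    if len(assignment["comment"].strip()) < MIN_COMMENT_CHARS:
        return True                      # no real free-form answer
    if assignment["duration_sec"] < MIN_SECONDS:
        return True                      # finished implausibly fast
    if assignment["comment"] in duplicated_comments:
        return True                      # cut-and-paste across HITs
    return False

def filter_assignments(assignments):
    # Comments appearing more than once are often pasted by one WorkerID;
    # grouping by worker_id would localize the culprits.
    counts = Counter(a["comment"] for a in assignments)
    duplicated = {c for c, n in counts.items() if n > 1}
    # (Outlier screening of numeric ratings is left as a per-study choice.)
    return [a for a in assignments if not suspicious(a, duplicated)]
```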
Quick Summary of Tips	


•  Mechanical Turk offers the practitioner a way to access a
   large user pool and quickly collect data at low cost	

•  Good results require careful task design	


  1.    Use verifiable questions to signal monitoring	

  2.    Make malicious answers as high cost as good-faith answers	

  3.    Make verifiable answers useful for completing task	

  4.    Put verifiable tasks before subjective responses	





                                                                       16
Managing Quality


•  Quality through redundancy: Combining votes (sketch below)

      –  Majority vote [works best when worker quality is similar]

      –  Worker-quality-adjusted vote

      –  Managing dependencies

•  Quality through gold data

      –  Advantageous with imbalanced datasets & bad workers

•  Estimating worker quality (Redundancy + Gold)

      –  Calculate the confusion matrix and see if you actually
         get some information from the worker



•  Toolkit: http://code.google.com/p/get-another-label/




                                  Source: Ipeirotis, WWW2011        17
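To make the vote-combining schemes concrete, here is a minimal sketch of majority vote versus a worker-quality-weighted vote (my own illustration; the get-another-label toolkit implements the real thing, including confusion-matrix estimation):

```python
# Illustrative sketch of the two vote-combining schemes; not code from
# the get-another-label toolkit.
from collections import defaultdict

def majority_vote(votes):
    """Pick the label given by the most workers (ties broken arbitrarily)."""
    counts = defaultdict(int)
    for _worker, label in votes:
        counts[label] += 1
    return max(counts, key=counts.get)

def quality_weighted_vote(votes, worker_accuracy):
    """Weight each vote by the worker's estimated accuracy in [0, 1],
    e.g. measured on gold questions with known answers."""
    scores = defaultdict(float)
    for worker, label in votes:
        scores[label] += worker_accuracy.get(worker, 0.5)  # 0.5 = no info
    return max(scores, key=scores.get)

# Hypothetical item where three workers disagree:
votes = [("w1", "spam"), ("w2", "ham"), ("w3", "ham")]
acc = {"w1": 0.99, "w2": 0.51, "w3": 0.45}
print(majority_vote(votes))               # "ham"  (2 of 3 workers)
print(quality_weighted_vote(votes, acc))  # "spam" (one trusted worker
                                          #  outweighs two weak ones)
```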
Coding and Machine Learning

•  Integration with Machine Learning

    –  Build automatic classification models using
       crowdsourced data (sketch below)

                  Data from existing
                crowdsourced answers
                         |
                         v
    New Case  -->  Automatic Model  -->  Automatic Answer
                   (through machine learning)


                                    Source: Ipeirotis, WWW2011
                                                                     18
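A hedged sketch of this pipeline (toy data; scikit-learn is my choice of library here, not something the slide prescribes): train a text classifier on aggregated crowd labels, then let the model answer new cases, routing low-confidence ones back to the crowd.

```python
# Toy sketch: aggregated crowd labels become training data for a model
# that then answers new cases automatically. scikit-learn is an assumed
# dependency; the texts and labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great article, well sourced", "spam spam buy now",
         "needs citations but readable", "click here for free stuff"]
labels = ["ok", "spam", "ok", "spam"]   # e.g., majority-voted crowd answers

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# New case -> automatic answer; the probability doubles as a confidence
# signal, so low-confidence cases can be routed back to the crowd.
proba = model.predict_proba(["limited time offer, act now"])[0]
print(dict(zip(model.classes_, proba.round(2))))
```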
Limitations of Mechanical Turk	


•  No control of users' environment

    –  Potential for different browsers, physical distractions	

    –  General problem with online experimentation 	

•  Not designed for user studies	

    –  Difficult to do between-subjects design	

    –  May need some programming	

•  Users	

    –  Somewhat hard to control demographics, expertise	





                                                                    19
Crowdsourcing for HCI Research


•  Does my interface/visualization work?	

   –  WikiDashboard: transparency vis for Wikipedia [Suh et al.]	

   –  Replicating Perceptual Experiments [Heer et al., CHI2010]	

•  Coding of large amount of user data	

   –  What is a Question in Twitter? [Sharoda Paul, Lichan Hong, Ed Chi]	

•  Incentive mechanisms	

   –  Intrinsic vs. Extrinsic rewards: Games vs. Pay	

   –  [Horton & Chilton, 2010 for MTurk] and [Ariely, 2009] in general





                                                                              20
Crowdsourcing for HCI Research


•  Does my interface/visualization work?	

   –  WikiDashboard: transparency vis for Wikipedia [Suh et al.]	

   –  Replicating Perceptual Experiments [Heer et al., CHI2010]	

•  Coding of large amount of user data	

   –  What is a Question in Twitter? [Sharoda Paul, Lichan Hong, Ed Chi]	

•  Incentive mechanisms	

   –  Intrinsic vs. Extrinsic rewards: Games vs. Pay	

   –  [Horton & Chilton, 2010 for MTurk] and Satisficing

   –  [Ariely, 2009] in general: Higher pay != Better work	





                                                                              21
Crowd Programming for Complex Tasks


  •  Decompose tasks into smaller tasks	

     –  Digital Taylorism	

     –  Frederick Winslow Taylor (1856-1915) 	

     –  1911: 'The Principles of Scientific Management'

  •  Crowd Programming Explorations	

     –  MapReduce Models (see the sketch after this slide)

         •  Kittur, A.; Smus, B.; and Kraut, R. CHI2011EA on CrowdForge.	

         •  Kulkarni, Can, Hartmann, CHI2011 workshop & WIP

     –  Little, G.; Chilton, L.; Goldman, M.; and Miller, R. C. In
        KDD 2010 Workshop on Human Computation.	





                                                                              23
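A minimal sketch of the partition/map/reduce pattern these systems explore, with post_hit() as a hypothetical stand-in for posting a micro-task and waiting for a worker's answer (not a real API):

```python
# Sketch of a CrowdForge/Turkomatic-style flow. post_hit() is a
# hypothetical stand-in for posting a micro-task to a crowd platform
# and blocking until a worker answers.
def post_hit(instructions: str) -> str:
    raise NotImplementedError("wire this up to your crowd platform")

def crowd_write_article(topic: str) -> str:
    # Partition: one worker turns the big task into section headings.
    outline = post_hit(f"List section headings for an article on {topic}, "
                       "one per line.")
    headings = [h.strip() for h in outline.splitlines() if h.strip()]

    # Map: many workers each do one small, verifiable task in parallel.
    facts = [post_hit(f"State one fact about '{h}' (topic: {topic}).")
             for h in headings]

    # Reduce: another worker consolidates the pieces into a single result.
    return post_hit("Combine these facts into a coherent article:\n"
                    + "\n".join(facts))
```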
Crowd Programming for Complex Tasks

•  Crowd Programming Explorations

   –  Kittur, A.; Smus, B.; and Kraut, R. CHI2011EA on CrowdForge.

   –  Kulkarni, Can, Hartmann, CHI2011 workshop & WIP

   [Background: page from the Turkomatic work-in-progress paper
   (CHI 2011, May 7–12, 2011, Vancouver, BC, Canada). Readable fragments
   describe a partition/map/reduce flow for crowd-written articles and an
   SAT case study: workers were posed the task “Please solve the
   16-question SAT located at http://bit.ly/SATexam” and paid between
   $0.10 and $0.40 per HIT; each “subdivide” and “merge” HIT received
   answers within 4 hours, and solutions to the initial task were
   complete within 72 hours. Figure 4 caption: “For the SAT task, we
   uploaded sixteen questions from a high school Scholastic Aptitude Test
   to the web and posed the following task to Turkomatic.”]

                                                                        24
Future Directions in Crowdsourcing

•  Real-time Crowdsourcing

   –  Bigham, et al. VizWiz, UIST 2010

   [Figure 2 from VizWiz: six questions asked by participants, the
   photographs they took, and the answers received, with latencies.
   Examples: “What color is this pillow?” → (105s) “multiple shades of
   soft green, blue and gold”; “What denomination is this bill?” →
   (24s) “20”; “What temperature is my oven set to?” → (69s) “it looks
   like 425 degrees but the image is difficult to see”, (84s) “400”,
   (122s) “450”; “Can you please tell me what this can is?” → (183s)
   “chickpeas”, (514s) “beans”, (552s) “Goya Beans”.]

                                                                        25
Future Directions in Crowdsourcing
                                 	


•  Real-time Crowdsourcing	

   –  Bigham, et al. VizWiz, UIST 2010	

•  Embedding of Crowdwork inside Tools	

   –  Bernstein, et al. Soylent, UIST 2010





                                                26
Future Directions in Crowdsourcing

•  Real-time Crowdsourcing

   –  Bigham, et al. VizWiz, UIST 2010

•  Embedding of Crowdwork inside Tools

   –  Bernstein, et al. Soylent, UIST 2010

•  Shepherding Crowdwork

   –  Dow et al. CHI2011 WIP

   [Background: excerpt from the Dow et al. work-in-progress paper
   analyzing the design space for crowd feedback: timeliness
   (synchronous feedback while workers are still engaged in a set of
   tasks vs. asynchronous feedback after workers have completed them),
   its impact on learning, engagement, and task performance, and the
   scheduling burden that near real-time feedback places on the
   feedback provider.]

                                                                        27
Tutorials

•    Thanks to Matt Lease http://ir.ischool.utexas.edu/crowd/

•    AAAI 2011 (w/ HCOMP 2011): Human Computation: Core Research
     Questions and State of the Art (E. Law & Luis von Ahn)

•    WSDM 2011: Crowdsourcing 101: Putting the WSDM of Crowds to
     Work for You (Omar Alonso and Matthew Lease)

      –    http://ir.ischool.utexas.edu/wsdm2011_tutorial.pdf

•    LREC 2010 Tutorial: Statistical Models of the Annotation Process (Bob
     Carpenter and Massimo Poesio)

      –    http://lingpipe-blog.com/2010/05/17/

•    ECIR 2010: Crowdsourcing for Relevance Evaluation (Omar Alonso)

      –    http://wwwcsif.cs.ucdavis.edu/~alonsoom/crowdsourcing.html

•    CVPR 2010: Mechanical Turk for Computer Vision (Alex Sorokin and
     Fei-Fei Li)

      –    http://sites.google.com/site/turkforvision/

•    CIKM 2008: Crowdsourcing for Relevance Evaluation (D. Rose)

      –    http://videolectures.net/cikm08_rose_cfre/

•    WWW2011: Managing Crowdsourced Human Computation (Panos
     Ipeirotis)

      –    http://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation



                                                                                                      28
Thanks!

•  chi@acm.org	

•  http://edchi.net

•  @edchi	



•    Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies
     With Mechanical Turk. In Proceedings of the ACM Conference on Human-
     factors in Computing Systems (CHI2008), pp.453-456. ACM Press, 2008.
     Florence, Italy.	

•    Aniket Kittur, Bongwon Suh, Ed H. Chi. Can You Ever Trust a Wiki?
     Impacting Perceived Trustworthiness in Wikipedia. In Proc. of Computer-
     Supported Cooperative Work (CSCW2008), pp. 477-480. ACM Press, 2008.
     San Diego, CA. [Best Note Award]	




                                                                               29
