Crowdsourcing using Mechanical Turk:
Quality Management and Scalability

Panos Ipeirotis
New York University & oDesk

Twitter: @ipeirotis

Joint work with: Jing Wang, Foster Provost, Josh Attenberg, and
Victor Sheng; special thanks to AdSafe Media

“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.com
Brand advertising has not fully embraced
Internet advertising yet…

Afraid of improper brand placement
Gabrielle Giffords Shooting, Tucson, AZ, Jan 2011
New Classification Models Needed
within days

•  Pharmaceutical firm does not want ads to appear:
   –  in pages that discuss swine flu (the FDA prohibited the
      pharmaceutical company from displaying drug ads on pages
      about swine flu)

•  Big fast-food chain does not want ads to appear:
   –  in pages that discuss the brand (99% negative sentiment)
   –  in pages discussing obesity, diabetes, cholesterol, etc.

•  Airline company does not want ads to appear:
   –  in pages with crashes, accidents, …
   –  in pages with discussions of terrorist plots against airlines
Need to build models fast

•  Traditionally, modeling teams have invested substantial internal
   resources in data collection, extraction, cleaning, and other
   preprocessing. No time for such things…
•  However, now we can outsource preprocessing tasks, such as
   labeling, feature extraction, verifying information extraction, etc.
   –  using Mechanical Turk, oDesk, etc.
   –  quality may be lower than expert labeling (but how much lower?)
   –  but low costs allow massive scale
Amazon Mechanical Turk
Example: Build an “Adult Web Site” Classifier

•  Need a large number of hand-labeled sites
•  Get people to look at sites and classify them as:
   G (general audience), PG (parental guidance), R (restricted), X (porn)

Cost/speed statistics:
•  Undergrad intern: 200 websites/hr, cost: $15/hr
•  Mechanical Turk: 2,500 websites/hr, cost: $12/hr
Bad news: Spammers!

Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience).
Redundant votes, infer quality

•  Look at our lazy friend ATAMRO447HWJQ together with 9 other workers

•  Using redundancy, we can compute error rates for each worker
Algorithm of (Dawid & Skene, 1979)
[and many recent variations on the same theme]

Iterative process to estimate worker error rates (sketched in code below):
1. Initialize the “correct” label for each object (e.g., use majority vote)
2. Estimate error rates for workers (using the “correct” labels)
3. Estimate the “correct” labels (using the error rates; weight worker
   votes according to quality)
4. Go to Step 2 and iterate until convergence

Error rates for ATAMRO447HWJQ:
   P[G → G]=99.947%   P[G → X]=0.053%
   P[X → G]=99.153%   P[X → X]=0.847%

Our friend ATAMRO447HWJQ marked almost all sites as G.
Clickety clickey click…
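To make the iteration concrete, here is a minimal Python sketch of the Dawid & Skene loop above. This is not the speaker's implementation (the open-source get-another-label package linked later is); the data layout, the smoothing, the uniform class prior, and the fixed iteration count are all illustrative assumptions.

```python
def dawid_skene(votes, classes, iterations=20):
    """votes[worker][item] = observed label; returns soft labels + confusions."""
    items = {i for labels in votes.values() for i in labels}
    # Step 1: initialize "correct" soft labels by majority vote
    soft = {}
    for i in items:
        counts = {c: sum(1 for w in votes if votes[w].get(i) == c) for c in classes}
        total = sum(counts.values())
        soft[i] = {c: counts[c] / total for c in classes}

    for _ in range(iterations):  # Step 4: fixed count stands in for convergence
        # Step 2: estimate each worker's confusion matrix from current labels
        conf = {}
        for w, labels in votes.items():
            m = {t: {o: 1e-6 for o in classes} for t in classes}  # smoothed counts
            for i, observed in labels.items():
                for t in classes:
                    m[t][observed] += soft[i][t]
            conf[w] = {t: {o: m[t][o] / sum(m[t].values()) for o in classes}
                       for t in classes}
        # Step 3: re-estimate "correct" labels, weighting votes by worker quality
        for i in items:
            post = {c: 1.0 for c in classes}  # uniform prior over classes
            for w, labels in votes.items():
                if i in labels:
                    for t in classes:
                        post[t] *= conf[w][t][labels[i]]
            z = sum(post.values())
            soft[i] = {c: post[c] / z for c in classes}
    return soft, conf
```

Running this on the votes of ATAMRO447HWJQ and the 9 other workers yields error-rate tables like the one above.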
Challenge: From Confusion
Matrices to Quality Scores

Confusion matrix for ATAMRO447HWJQ:
   P[X → X]=0.847%    P[X → G]=99.153%
   P[G → X]=0.053%    P[G → G]=99.947%

How can we check whether a worker is a spammer
using the confusion matrix?
(hint: the error rate is not enough)
Challenge 1:
Spammers are lazy and smart!

Confusion matrix for spammer:          Confusion matrix for good worker:
   P[X → X]=0%   P[X → G]=100%            P[X → X]=80%   P[X → G]=20%
   P[G → X]=0%   P[G → G]=100%            P[G → X]=20%   P[G → G]=80%

•  Spammers figure out how to fly under the radar…

•  In reality, we have 85% G sites and 15% X sites

•  Error rate of spammer = 0% * 85% + 100% * 15% = 15%
•  Error rate of good worker = 20% * 85% + 20% * 15% = 20%

False negatives: spam workers pass as legitimate.
Challenge 2:
Humans are biased!

Error rates for the CEO of AdSafe:
   P[G → G]=20.0%   P[G → P]=80.0%   P[G → R]=0.0%     P[G → X]=0.0%
   P[P → G]=0.0%    P[P → P]=0.0%    P[P → R]=100.0%   P[P → X]=0.0%
   P[R → G]=0.0%    P[R → P]=0.0%    P[R → R]=100.0%   P[R → X]=0.0%
   P[X → G]=0.0%    P[X → P]=0.0%    P[X → R]=0.0%     P[X → X]=100.0%

•  We have 85% G sites, 5% P sites, 5% R sites, 5% X sites

•  Error rate of spammer (all G) = 0% * 85% + 100% * 15% = 15%
•  Error rate of biased worker = 80% * 85% + 100% * 5% = 73%

False positives: legitimate workers appear to be spammers.
(important note: bias is not just a matter of “ordered” classes)
Solution: Reverse errors first,
compute error rate afterwards

Error rates for the CEO of AdSafe:
   P[G → G]=20.0%   P[G → P]=80.0%   P[G → R]=0.0%     P[G → X]=0.0%
   P[P → G]=0.0%    P[P → P]=0.0%    P[P → R]=100.0%   P[P → X]=0.0%
   P[R → G]=0.0%    P[R → P]=0.0%    P[R → R]=100.0%   P[R → X]=0.0%
   P[X → G]=0.0%    P[X → P]=0.0%    P[X → R]=0.0%     P[X → X]=100.0%

•  When the biased worker says G, it is 100% G
•  When the biased worker says P, it is 100% G
•  When the biased worker says R, it is 50% P, 50% R
•  When the biased worker says X, it is 100% X

Small ambiguity for “R-rated” votes, but other than that, fine!
Solution: Reverse errors first,
compute error rate afterwards

Error rates for spammer ATAMRO447HWJQ:
   P[G → G]=100.0%   P[G → P]=0.0%   P[G → R]=0.0%   P[G → X]=0.0%
   P[P → G]=100.0%   P[P → P]=0.0%   P[P → R]=0.0%   P[P → X]=0.0%
   P[R → G]=100.0%   P[R → P]=0.0%   P[R → R]=0.0%   P[R → X]=0.0%
   P[X → G]=100.0%   P[X → P]=0.0%   P[X → R]=0.0%   P[X → X]=0.0%

•  When the spammer says G, it is 25% G, 25% P, 25% R, 25% X
•  When the spammer says P, it is 25% G, 25% P, 25% R, 25% X
•  When the spammer says R, it is 25% G, 25% P, 25% R, 25% X
•  When the spammer says X, it is 25% G, 25% P, 25% R, 25% X
[note: assumes equal priors; see the sketch below]

The results are highly ambiguous. No information provided!
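The reversal itself is just Bayes' rule applied to the worker's confusion matrix. A minimal sketch, assuming the equal priors and the matrices from the slides (function and variable names are mine):

```python
def soft_label(confusion, observed, priors):
    """P(true class | worker voted `observed`), by Bayes' rule."""
    joint = {t: priors[t] * confusion[t][observed] for t in priors}
    z = sum(joint.values())
    if z == 0:                     # worker never emits this vote: no information
        return dict(priors)
    return {t: joint[t] / z for t in joint}

# The biased CEO from the slides, with equal priors over G, P, R, X:
ceo = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}
priors = {"G": 0.25, "P": 0.25, "R": 0.25, "X": 0.25}
print(soft_label(ceo, "R", priors))   # -> 50% P, 50% R, as on the slide
```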
Expected Misclassification Cost

•  High cost: probability spread across classes
•  Low cost: probability mass concentrated in one class

Assigned label    Corresponding “soft” label           Expected label cost
Spammer: G        <G: 25%, P: 25%, R: 25%, X: 25%>     0.75
Good worker: P    <G: 100%, P: 0%, R: 0%, X: 0%>       0.0

[Assume each misclassification costs 1; the solution generalizes]
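The table's numbers follow from a simple reading of expected cost under unit misclassification costs: draw the true class and the reported class independently from the soft label and count mismatches, i.e., ExpCost(p) = Σ_{i≠j} p_i p_j. A hedged sketch:

```python
def expected_cost(soft):
    """Expected cost of a soft label under unit (0/1) misclassification costs."""
    return sum(pi * pj for i, pi in soft.items()
                       for j, pj in soft.items() if i != j)

print(expected_cost({"G": .25, "P": .25, "R": .25, "X": .25}))  # 0.75
print(expected_cost({"G": 1.0, "P": 0.0, "R": 0.0, "X": 0.0}))  # 0.0
```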
Quality Score

Quality score: a scalar measure of quality.

•  A spammer is a worker who always assigns labels randomly,
   regardless of what the true class is:

      QualityScore(Worker) = 1 − ExpCost(Worker) / ExpCost(Spammer)

•  Scalar score, useful for the purpose of ranking workers

                                                     HCOMP 2010
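Putting the pieces together (this builds on the soft_label and expected_cost sketches above): average the expected cost over the votes the worker emits and normalize by the spammer's cost. How ExpCost(Worker) aggregates over votes is my assumption, not a detail given on the slide:

```python
def worker_quality(confusion, priors):
    classes = list(priors)
    spammer = {c: 1.0 / len(classes) for c in classes}   # uniform soft label
    # Probability that the worker emits each vote, given the class priors
    p_vote = {v: sum(priors[t] * confusion[t][v] for t in classes)
              for v in classes}
    cost = sum(p_vote[v] * expected_cost(soft_label(confusion, v, priors))
               for v in classes if p_vote[v] > 0)
    return 1.0 - cost / expected_cost(spammer)

print(worker_quality(ceo, priors))   # biased CEO: ~0.67, far from a spammer
```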
Instead of blocking: Quality-sensitive Payment

•  Thresholding rewards gives the wrong incentives:
   •  Decent (but still useful) workers get fired
   •  Uncertainty near the decision threshold
•  Instead: estimate the payment level based on quality
   •  Set an acceptable quality (e.g., 99% accuracy)
   •  For workers above the quality spec: pay full price
   •  For others: estimate the level of redundancy needed to reach the
      acceptable quality (e.g., need 5 workers with 90% accuracy, or 13
      workers with 80% accuracy, to reach 99% accuracy; see the sketch
      after this list)
   •  Pay full price divided by the level of redundancy
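A sketch of the redundancy arithmetic in the last two bullets, assuming independent workers of accuracy p combined by majority vote (the binomial model is my assumption; it matches the “Redundancy and Quality” slide that follows):

```python
from math import comb

def majority_quality(p, k):
    """Probability that the majority of k independent labels is correct."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

def payment(full_price, p, target=0.99, max_k=99):
    for k in range(1, max_k + 1, 2):        # odd k only, to avoid ties
        if majority_quality(p, k) >= target:
            return full_price / k           # pay full price / redundancy
    return 0.0                              # target quality unreachable

print(payment(1.0, 0.90))   # k=5  -> pays 1/5   (as on the slide)
print(payment(1.0, 0.80))   # k=13 -> pays 1/13  (as on the slide)
```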
Simple example:
Redundancy and Quality

•  Ask multiple labelers, keep the majority label as the “true” label
•  Quality is the probability of being correct

[Figure: “integrated quality” of the majority label vs. number of
labelers (1–13), one curve for each individual-labeler accuracy
P = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

P is the probability of an individual labeler being correct:
   P=1.0: perfect
   P=0.5: random
   P=0.4: adversarial
Implementation

Open-source implementation available at:
http://code.google.com/p/get-another-label/
and demo at http://qmturk.appspot.com/

•  Input:
   –  Labels from Mechanical Turk
   –  [Optional] Some “gold” labels from trusted labelers
   –  Cost of incorrect classifications (e.g., X→G costlier than G→X)
•  Output:
   –  Corrected labels
   –  Worker error rates
   –  Ranking of workers according to their quality
   –  [Coming soon] Quality-sensitive payment
   –  [Coming soon] Risk-adjusted quality-sensitive payment
Example: Build an “Adult Web Site” Classifier

•  Get people to look at sites and classify them as:
   G (general audience), PG (parental guidance), R (restricted), X (porn)

But we are not going to label the whole Internet…
•  Expensive
•  Slow
Quality and Classification Performance

•  Noisy labels lead to degraded task performance
•  Labeling quality increases → classification quality increases

[Figure: AUC vs. number of training examples (1–300) on the “Mushroom”
data set, one curve per single-labeler quality (the probability of
assigning a binary label correctly): Quality = 100%, 80%, 60%, 50%]
Tradeoffs: More data or better data?

•  Get more examples → improve classification
•  Get more labels → improve label quality → improve classification

[Figure: accuracy vs. number of examples on the “Mushroom” data set,
one curve per labeling quality: Quality = 100%, 80%, 60%, 50%]

KDD 2008, Best Paper runner-up
Summary of Basic Results

We want to follow the direction that has the highest “learning
gradient” (a sketch of this estimate follows below):
–  Estimate the improvement from more data (cross-validation)
–  Estimate the sensitivity to data quality (introduce noise and
   measure the degradation in quality)

Rule-of-thumb results:
•  With high-quality labelers (85% and above):
   get more data (one worker per example)
•  With low-quality labelers (~60–70%):
   improve quality (multiple workers per example)
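A hedged sketch of estimating the two “gradients” above: the gain from more data via cross-validation on nested subsets, and the sensitivity to quality via injected label noise. The classifier and the noise model are illustrative assumptions, not the KDD 2008 experimental setup:

```python
import random
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def gain_from_more_data(X, y, frac=0.5):
    """Accuracy gained by going from a fraction of the data to all of it."""
    n = int(len(X) * frac)
    small = cross_val_score(DecisionTreeClassifier(), X[:n], y[:n], cv=5).mean()
    full = cross_val_score(DecisionTreeClassifier(), X, y, cv=5).mean()
    return full - small

def loss_from_noise(X, y, flip=0.1, classes=(0, 1)):
    """Accuracy lost when a fraction `flip` of the labels is corrupted."""
    noisy = [random.choice([c for c in classes if c != label])
             if random.random() < flip else label for label in y]
    clean = cross_val_score(DecisionTreeClassifier(), X, y, cv=5).mean()
    dirty = cross_val_score(DecisionTreeClassifier(), X, noisy, cv=5).mean()
    return clean - dirty

# If loss_from_noise dominates, buy better labels; otherwise buy more data.
```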
Selective Repeated-Labeling

•  We do not need to label everything the same way

•  Key observation: we have additional information to guide the
   selection of data for repeated labeling:
   –  the current multiset of labels
   –  the current model built from the data

•  Example: {+,−,+,−,−,+} vs. {+,+,+,+,+,+} (see the sketch below)
   –  Details skipped in the talk; see the “Repeated Labeling” paper
      for targeting using item difficulty, and other techniques
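For the example above, the simplest signal from the label multiset is its entropy; the paper uses a more careful Bayesian estimate of label uncertainty, so treat this as an illustrative stand-in:

```python
from collections import Counter
from math import log2

def label_uncertainty(labels):
    """Entropy of the observed vote multiset for one item."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

print(label_uncertainty("+-+--+"))   # 1.0 -> worth buying more labels
print(label_uncertainty("++++++"))   # 0.0 -> leave it alone
```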
Selective labeling strategy:
Model Uncertainty (MU)

•  Learning models of the data provides an additional source of
   information about label certainty
•  Model uncertainty: get more labels for instances that cause
   model uncertainty in the training data (i.e., irregularities!)

[Figure: examples feed models; a self-healing process examines
irregularities in the training data]

This is NOT active learning.
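A hedged sketch of the MU idea: score the training instances by the model's own uncertainty about them and route extra labels there. Unlike active learning, the scored instances already have labels; we are double-checking the irregular ones. The classifier choice is an illustrative assumption:

```python
from sklearn.linear_model import LogisticRegression

def mu_scores(X_train, y_train):
    model = LogisticRegression().fit(X_train, y_train)
    proba = model.predict_proba(X_train)   # probabilities on the TRAINING data
    return 1.0 - proba.max(axis=1)         # high score = irregular instance

# order = mu_scores(X, y).argsort()[::-1]  # most uncertain first
# ...ask workers for extra labels on order[:k]
```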
Why does
Model Uncertainty (MU) work?

[Figure: the self-healing process over examples and models —
“self-healing MU” compared with an “active learning” variant of MU]
Adult content classification

[Figure: learning curves comparing a round-robin labeling strategy
with the selective strategy]
Improving worker participation

•  With just labeling, workers are passively labeling the data
   that we give them

•  But this can be wasteful when positive cases are sparse

•  Why not ask the workers to search for and find the training
   data themselves?
Guided Learning

•  Ask workers to find example web pages
   (great for “sparse” content)

•  After collecting enough examples, it is easy to build and
   test a web-page classifier

http://url-collector.appspot.com/allTopics.jsp
                                                      KDD 2009
Limits of Guided Learning

•  No incentive for workers to find “new” content

•  After a while, submitted web pages are similar to the ones
   already submitted

•  No improvement for the classifier
The result? Blissful ignorance…

•  The classifier seems great: cross-validation tests show
   excellent performance

•  Alas, the classifier fails: the “unknown unknowns”™
   –  No similar data in the training set
   –  “Unknown unknowns” = the classifier fails with high confidence
Beat the Machine!

Ask humans to find URLs that
•  the classifier will classify incorrectly, and
•  another human will classify correctly

http://adsafe-beatthemachine.appspot.com/

Example: find hate-speech pages that the machine will classify
as benign
Probes → Successes

The error rate for probes is significantly higher than the error
rate on (stratified) random data (10x to 100x higher than the base
error rate).
Structure of Successful Probes

•  Now we identify errors much faster (and proactively)

•  Errors are not random outliers: we can “learn” the errors

•  We could not, however, incorporate the errors into the existing
   classifier without degrading its performance
Unknown unknowns → Known unknowns

•  Once humans find the holes, they keep probing
   (e.g., multilingual porn)

•  However, we can learn what we do not know
   (“unknown unknowns” → “known unknowns”)

•  We now know the areas where we are likely to be wrong
Reward Structure for Humans

•  High reward when:
   –  the classifier is confident (but wrong), and
   –  we do not yet know it will be an error
•  Medium reward when:
   –  the classifier is confident (but wrong), and
   –  we already know it will be an error
•  Low reward when:
   –  the classifier is already uncertain about the outcome
(a code sketch of this schedule follows below)
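A sketch of the reward schedule above; the confidence threshold and the payout amounts are invented for illustration:

```python
def probe_reward(confidence, known_error, high=1.00, medium=0.25, low=0.05):
    """Reward for a successful probe (a URL the classifier gets wrong)."""
    if confidence < 0.8:               # classifier already uncertain
        return low
    return medium if known_error else high

print(probe_reward(0.99, known_error=False))  # new confident error   -> 1.00
print(probe_reward(0.99, known_error=True))   # known confident error -> 0.25
print(probe_reward(0.55, known_error=False))  # uncertain region      -> 0.05
```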
Current Directions

•  Learn how best to incorporate the new knowledge to improve
   the classifier

•  Measure the prevalence of newly identified errors on the web
   (“query by document”)
   –  Increase rewards for errors prevalent in the “generalized” case
Workers reacting to bad rewards/scores

Score-based feedback leads to strange interactions:

The “angry, has-been-burnt-too-many-times” worker:
   “F*** YOU! I am doing everything correctly and you know it!
   Stop trying to reject me with your stupid ‘scores’!”

The overachiever worker:
   “What am I doing wrong?? My score is 92% and I want to
   have 100%!”
An unexpected connection at the
NAS “Frontiers of Science” conf.

   “Your bad workers behave like my mice!”


An unexpected connection at the
NAS “Frontiers of Science” conf.

   “Your bad workers behave like my mice!”

   “Eh?”


An unexpected connection at the
NAS “Frontiers of Science” conf.

   “Your bad workers want to engage their brain
   only for motor skills, not for cognitive skills.”

   “Yeah, makes sense…”


An unexpected connection at the
NAS “Frontiers of Science” conf.

   “And here is how I train my mice to behave…”


An unexpected connection at the
NAS “Frontiers of Science” conf.

   “Confuse motor skills! Reward cognition!”

   “I should try this the moment I get back to my room.”
Implicit Feedback using Frustration

•  Punish bad answers with frustration of motor skills
   (e.g., add delays between tasks):
   –  “Loading image, please wait…”
   –  “Image did not load, press here to reload”
   –  “404 error. Return the HIT and accept again”
•  Reward good answers by rewarding the cognitive part of the
   brain (e.g., introduce variety/novelty, return results fast)

→ Make this probabilistic to keep the feedback implicit
(a sketch follows below)
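A minimal sketch of the scheme, assuming we already have a running estimate of each worker's error rate; the probabilities and delay lengths are illustrative, not the experiment's actual parameters:

```python
import random
import time

def maybe_frustrate(worker_error_rate, base_prob=0.1):
    """Call after a worker submits an answer."""
    # Keep the feedback implicit: even good workers occasionally see a delay
    if random.random() < base_prob + 0.8 * worker_error_rate:
        delay = random.uniform(1, 5) * (1 + 5 * worker_error_rate)
        print("Loading image, please wait…")   # frustrate the motor skills
        time.sleep(delay)
    # Otherwise reward cognition: serve the next (novel) task immediately
```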
First result

•  Spammer workers quickly abandon
•  Good workers keep labeling

•  Bad news: spammer bots are unaffected
•  How do you frustrate a bot?
   –  Give it a CAPTCHA
Second result (more impressive)

•  Remember, the scheme was designed for training the mice…

•  15% of the spammers start submitting good work!
•  Putting in cognitive effort is more beneficial (?)

•  Key trick: learn to test workers on the fly and estimate their
   quality over streaming data (code and paper coming soon…)
Thanks!

Q & A?
