SMU Classification: Restricted
1. Yan, Y., Rosales, R., Fung, G., & Dy, J. G. (2011). Active learning from crowds. In ICML (Vol. 11, pp. 1161–1168).
2. Bi, W., Wang, L., Kwok, J. T., & Tu, Z. (2014). Learning to Predict from Crowdsourced Data. In UAI (pp. 82–91).
3. Rodrigues, F., Lourenco, M., Ribeiro, B., & Pereira, F. C. (2017). Learning Supervised Topic Models for
Classification and Regression from Crowds. IEEE Transactions on Pattern Analysis and Machine Intelligence,
39(12), 2409–2422.
• Most research on supervised learning relies on an often
overlooked assumption that a single domain expert can
provide the required supervision
• Crowdsourcing
- Quality: Mixture of experts and non-experts, annotators having
different expertise
- Inference: Truth inference from noisy labels
- Budget: How to collect enough useful labels before running
out of budget?
• Motivation behind Crowdsourcing
- It is difficult to collect a single golden ground-truth in some
problem domains
- It is often the case that an annotator does not have the
appropriate knowledge for annotating all the data, even for a
particular domain
- In many instances, collecting annotations from multiple non-
expert annotators can be less costly than collecting
annotations just from one expert
- Collaboration and knowledge sharing are becoming more
common, so technology for combining multiple opinions
will become necessary
• In many learning tasks, labeled data is limited in quantity
or expensive to obtain, while unlabeled data is plentiful
or easy to obtain
• Aim to learn the most at a given cost
- Identify the most useful data point to label given the
information obtained
- Identify the most useful annotator
• Active Learning from Crowds and Extensions
- Simple Ground Truth Inference
- Learn the prediction model at the same time
- Extend existing models to the active-learning-from-crowds
scenario
• Sometimes the annotator may not have the knowledge to
label the data accurately
- The annotation may come from the worker's observation of the
input data, not the underlying ground truth
• Goal
- Actively collect ground truth from the worker, and learn a
prediction model
• Probabilistic Multi-Labeler Model
- N data points {x_1, x_2, …, x_N} from input space X
- The label for the i-th data point by annotator t is y_i^(t)
from label space Y
- The unknown ground truth for the i-th data point is z_i from
output space Z
- All z and some of y are unobservable
• Model Definition
- The classifier is trained by assuming a probabilistic model
over the random variables x, y, and z
where T_i is the set of annotators for the i-th data point
• Model Definition
- We could use a Gaussian model:
where the variance depends on the input x and is specific to
each annotator t
For binary classification, the variance is a logistic function of
the input and the annotator
• Model Definition
- We could use a Bernoulli model:
where η_t(x) is also a logistic function of the input x and the
labeler identity t
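The two annotator noise models above (a Gaussian variance and a Bernoulli correctness probability, each a logistic function of the input) can be sketched as follows. This is a minimal illustration; the per-annotator parameters `w_t`, `gamma_t` are assumed names, not necessarily the paper's exact parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gaussian_annotator_variance(x, w_t, gamma_t):
    """Input- and annotator-specific noise level sigma_t(x),
    modeled as a logistic function of the input (hypothetical
    per-annotator parameters w_t, gamma_t)."""
    return sigmoid(np.dot(w_t, x) + gamma_t)

def bernoulli_correct_prob(x, w_t, gamma_t):
    """Probability eta_t(x) that annotator t's label agrees with
    the ground truth, again a logistic function of the input."""
    return sigmoid(np.dot(w_t, x) + gamma_t)

# Under the Bernoulli model, annotator t's observed label matches the
# true label z with probability eta_t(x) and is flipped otherwise.
rng = np.random.default_rng(0)
x = np.array([1.0, -0.5])
w_t, gamma_t = np.array([0.8, 0.3]), 0.1
eta = bernoulli_correct_prob(x, w_t, gamma_t)
z = 1                                       # true label
y_t = z if rng.random() < eta else 1 - z    # possibly flipped annotation
```

Both functions return values in (0, 1): a variance level in the Gaussian case, an agreement probability in the Bernoulli case.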
• Model Definition
- The Gaussian model allows assigning a lower variance to input
regions where the labeler is more consistently correct, relative
to areas where there are inconsistencies
- The Bernoulli model assigns a higher probability of the labeler
being correct in certain input areas relative to others
• Model Definition
- The following logistic regression function is used because the
task is binary classification
• Optimally Selecting New Training Points and Annotators
- Pick a new training point to be labeled
- Pick an appropriate labeler among all available labelers
• To find the least confident data point
- Choose the candidate samples for which the probability
P(z = 1 | x) is closest to 1/2
• To find the most confident annotator given a data point
- Recall the variance formula above:
find the annotator with minimal variance
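The two selection steps above can be sketched as follows. This is a minimal sketch, assuming a linear-logistic classifier with weights `alpha` and the logistic per-annotator variance model from earlier; the parameter names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def select_point_and_annotator(X_pool, alpha, annotator_params):
    """Uncertainty sampling plus annotator selection.
    alpha: current classifier weights.
    annotator_params: annotator id -> (w_t, gamma_t) for the
    logistic variance model (hypothetical names)."""
    # 1. Least confident point: P(z = 1 | x) closest to 1/2.
    probs = sigmoid(X_pool @ alpha)
    i_star = int(np.argmin(np.abs(probs - 0.5)))
    x_star = X_pool[i_star]
    # 2. Most confident annotator: minimal variance sigma_t(x*) there.
    variances = {t: sigmoid(np.dot(w, x_star) + g)
                 for t, (w, g) in annotator_params.items()}
    t_star = min(variances, key=variances.get)
    return i_star, t_star

X_pool = np.array([[0.0, 0.0], [3.0, 3.0]])
alpha = np.array([1.0, 1.0])
params = {1: (np.array([1.0, 1.0]), 0.0), 2: (np.array([1.0, 1.0]), -2.0)}
i, t = select_point_and_annotator(X_pool, alpha, params)
# The point at the origin sits on the decision boundary (P = 1/2),
# and annotator 2 has the lower variance there.
```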
• Document Classification Task
- Binary Classification
• Workers’ qualities can vary drastically and lead to different
noise levels in their annotations
- The worker might not be an expert
- The worker’s default label judgement may be incorrect
- Different labeling tasks can have different difficulties
- The worker may not be dedicated to the task
• Worker’s decision process:
- If the worker is dedicated to the labeling task or considers
the sample easy, the label is generated according to the
worker’s underlying decision function
- Otherwise, the label is generated from the worker’s default
labeling judgement
• The task is a binary classification problem with:
- M workers
- N query samples
- The i-th sample x^(i) is annotated by the set of workers
S_i ⊆ {1, 2, …, M}
- The annotation by the j-th worker is y_j^(i) ∈ {0, 1}
- The ground truth y*^(i) ∈ {0, 1} is generated by a logistic
regression model with parameters w*
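The ground-truth generating process above can be sketched directly. A minimal sketch: dimensions, seed, and variable names are illustrative, not from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
d, n = 3, 5
w_star = rng.normal(size=d)               # true logistic-regression weights w*
X = rng.normal(size=(n, d))               # query samples x^(1..n)
p = sigmoid(X @ w_star)                   # P(y*^(i) = 1 | x^(i))
y_star = (rng.random(n) < p).astype(int)  # ground-truth labels in {0, 1}
```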
• Reasons that an annotator gives an incorrect label:
1. The annotator is dedicated to the task, but the expertise is
not strong enough
The worker j’s annotation follows a Bernoulli distribution,
where w_j is j’s estimate of w*
A small σ_j suggests w_j is very similar to w*
-> worker j has high accuracy
• Reasons that an annotator gives an incorrect label:
2. The annotator is not dedicated to the task and annotates
randomly according to some default judgement
The worker j’s annotation follows a Bernoulli distribution,
where η_j ∈ [0, 1]
• Combining the two reasons:
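The combined model is a mixture of the two cases: with some probability the worker labels via their own decision model w_j; otherwise they fall back on the default judgement Bernoulli(η_j). A minimal sketch, where the mixing weight `pi_j` is a stand-in for the paper's dedication/difficulty term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def annotate(x, w_j, eta_j, pi_j, rng):
    """Worker j's label for sample x.
    With prob. pi_j: label from j's own decision model w_j (case 1).
    Otherwise:      label from the default judgement Bernoulli(eta_j)
                    (case 2), independent of x."""
    if rng.random() < pi_j:
        p = sigmoid(np.dot(w_j, x))   # worker's (imperfect) decision function
    else:
        p = eta_j                     # default labeling judgement
    return int(rng.random() < p)

rng = np.random.default_rng(1)
label = annotate(np.array([2.0, -1.0]), np.array([1.5, 0.2]), 0.7, 0.9, rng)
```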
• Difficulty to an annotator affects annotation quality significantly:
- The difficulty of the i-th sample x^(i) to annotator j is d_j^(i)
If x^(i) is difficult for j, d_j^(i) will be close to 0
- A sample is difficult if it is close to the worker’s decision
boundary
A small λ_j makes even an easy sample (with a large distance to
the boundary) seem difficult to the worker
(In the formula: one term is the distance to the boundary; λ_j is
the worker’s sensitivity to sample difficulty)
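The behavior described above can be sketched with one plausible parameterization: d_j grows with the sample's distance from the worker's boundary, scaled by the sensitivity λ_j. This is an assumed functional form for illustration, not necessarily the paper's exact formula:

```python
import numpy as np

def easiness(x, w_j, lam_j):
    """Easiness d_j of sample x for worker j: close to 0 when the
    sample is hard (near the worker's decision boundary w_j . x = 0),
    close to 1 when easy. A small lam_j (low sensitivity) makes even
    distant samples look hard. Hypothetical parameterization."""
    dist = abs(np.dot(w_j, x))            # distance proxy to the boundary
    return 1.0 - np.exp(-lam_j * dist)

w_j = np.array([1.0, 1.0])
easy = easiness(np.array([3.0, 3.0]), w_j, lam_j=1.0)     # far from boundary
hard = easiness(np.array([0.1, -0.1]), w_j, lam_j=1.0)    # on the boundary
dulled = easiness(np.array([3.0, 3.0]), w_j, lam_j=0.01)  # small lam_j
```

Even the far-from-boundary sample scores low once λ_j is small, matching the bullet above.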
Graphical model (plate diagram) annotations:
- Accuracy of the worker
- Whether the worker is dedicated to the task
- Sensitivity of the worker to the difficulty of the task
- Difficulty of the sample to the worker
- The ground truth is generated by a logistic regression with
these parameters
- w* is drawn from its prior
- The worker’s estimate of w*
• Baselines
- MTL: the prediction model is the average of all workers’ models
- RY: coin flipping decides whether an annotation comes from the
worker’s bias or the ground truth
- YAN: active learning from crowds
- GLAD: considers sample difficulty and workers’ expertise
- CUBAM: considers workers’ expertise and bias
- MV: majority vote
(The comparison also notes which algorithms learn a prediction model)
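Of the baselines, MV is simple enough to sketch directly: the estimated label for each sample is the most common annotation it received. The dictionary layout and tie-breaking rule here are illustrative choices:

```python
import numpy as np

def majority_vote(annotations):
    """MV baseline: estimate each sample's label as the majority of its
    binary annotations (ties broken toward label 0 here).
    annotations: sample id -> list of labels in {0, 1} from workers."""
    return {i: int(np.mean(labels) > 0.5) for i, labels in annotations.items()}

estimates = majority_vote({0: [1, 1, 0], 1: [0, 0, 1, 0]})
# sample 0 has two of three votes for 1; sample 1 has one of four
```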
LDA graphical model (plate diagram) annotations: topic and word
variables; the topic–document distribution; the topic–word
distribution; plates over the number of documents and words; the
prior of the topic distribution
Annotation graphical model (plate diagram) annotations: the latent
class (truth); the label (annotation); the reliability of each
worker; plates over the number of workers and classes
• Beyond binary classification
- Multi-Label Learning from Crowds
• Level of confidence
- Active Learning from Crowds with Unsure Option
- Active Learning with Confidence-based Answers for
Crowdsourcing Labelling Tasks
• More complicated models:
- Gaussian Process Classification and Active Learning with
Multiple Annotators
- Deep Learning from Crowds
• Crowdsourcing can be very helpful when performing out-of-
sample prediction
• Existing models can be extended to the crowdsourcing scenario