On the Unreliability of Bug Severity Data
Yuan TIAN
Data Scientist at Living Analytics Research Centre,
Singapore Management University
ytian@smu.edu.sg
April 18th, 2018 @ Queen’s University, Canada
2
Supervised Machine Learning models rely heavily on labels
[Diagram: Training Data with Labels → Feature Extraction → Learning → Predictive Model; the model assigns an Expected Label (Cat / Not Cat) to New Data]
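To make the flow in the diagram concrete, here is a minimal sketch of a supervised pipeline in scikit-learn; the feature vectors and labels are invented stand-ins for the cat images, not data from the talk.

```python
# Minimal supervised-learning sketch: labelled training data -> feature
# extraction (here the features are already given) -> learning -> prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training data: one hand-crafted feature vector per image, label 1 = cat, 0 = not cat.
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.85, 0.15],
                    [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]])
y_train = np.array([1, 1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(X_train, y_train)  # learning step

X_new = np.array([[0.75, 0.2]])                     # new, unlabelled data point
print(model.predict(X_new))                         # expected label, likely 1 ("cat")
```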
3
Traditional machine learning models suffer from noisy labels in the training set
Noisy labels due to:
• Human mistakes
• Non-expert-generated labels
• Machine-generated labels
• Communication or encoding problems
4
Traditional machine learning models suffer from noisy labels in the training set
Noisy labels due to:
• Human mistakes
• Non-expert-generated labels
• Machine-generated labels
• Communication or encoding problems
Consequence: decreased performance
[Plot: F1 of traditional classifiers versus label noise level; e.g., F1 = 0.83 vs. F1 = 0.4 around a noise level of 3%]
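The drop shown in the plot can be reproduced in spirit with a small experiment: flip a fraction of the training labels and watch the F1 score degrade. This is only a sketch on synthetic scikit-learn data, not the study behind the figure, so the exact numbers will differ.

```python
# Sketch: inject label noise into the training set and observe the F1 drop.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for noise_level in [0.0, 0.03, 0.1, 0.3]:
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise_level      # flip this fraction of labels
    y_noisy[flip] = 1 - y_noisy[flip]
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy)
    print(f"noise={noise_level:.2f}  F1={f1_score(y_te, clf.predict(X_te)):.2f}")
```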
5
Traditional machine learning models suffer from noisy labels in the training set
Noisy labels due to:
• Human mistakes
• Non-expert-generated labels
• Machine-generated labels
• Communication or encoding problems
Consequences:
• Decreased performance
• More complex model
• Unreliable performance measures
• Incorrect influential features
6
Traditional machine learning models suffer from noisy labels in the training set
Consequences: decreased performance, more complex model, unreliable performance measures, incorrect influential features.
These studies assume that absolute ground truth for the labels exists, although the labels may be noisy for various reasons.
7
Classification is in some cases subjective, which results in inter-labeler variability ("inconsistent labels" noise).
Examples: medical diagnosis, malware detection, image tagging.
8
“Inconsistent Labels” Noise
Classification is in some cases subjective, which results in inter-labeler variability (cont.)
Numeric labels:
• How interesting is this book?
• How informative is this tweet?
• Is it a high-quality product?
• How severe is the problem?
Categorical labels:
• User-created tags for images, content, etc.
• Job titles created by different companies
9
How people cope with label noise (“inconsistent labels”) caused by subjective labelling, etc.
Before collecting labels:
• Shared labelling criteria
• Multiple labelers
• Repeated labelling
• Keep labels agreed on by all / the majority
• Pairwise comparison
Readily available data, multiple labelers:
• Averaging
• Majority voting (see the sketch below)
• Consensus voting
• Remove outliers
• Learning with uncertain labels
Readily available data, single labeler: ?
• How to measure inconsistency?
• How to cope with inconsistency?
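As a small illustration of the "readily available data, multiple labelers" strategies above, here is a sketch of majority voting that also flags instances with no clear majority; the labels and data are invented for illustration.

```python
# Sketch: merge multiple labelers' votes per instance by majority voting,
# flagging instances with no strict majority (candidates for review or removal).
from collections import Counter

def majority_vote(labels):
    """Return (most common label, True if it is a strict majority)."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes > len(labels) / 2

# Each inner list holds the labels different labelers gave to one instance.
ratings = [["cat", "cat", "dog"], ["dog", "cat"], ["cat", "cat", "cat"]]
for i, labels in enumerate(ratings):
    label, ok = majority_vote(labels)
    print(i, label, "majority" if ok else "no clear majority -> review or drop")
```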
10
Community Intelligence
How people cope with label noise (“inconsistent labels”) caused by subjective labelling:
• How to measure inconsistency?
• How to cope with inconsistency?
11
Talk Outline
Part 1:
• Noisy labels negatively impact learning
• Overview of approaches for coping with inconsistent labels
• Example: bug severity levels
Part 2:
• Future research direction: Big Data to Thick (high-quality) Data
12
A tester/user files a bug report with: Summary; Description; Product, component, version; Severity.
Severity levels: 1: Blocker, 2: Critical, 3: Major, 4: Minor, 5: Trivial.
Severity is assigned to reflect the level of impact that a bug has on the system.
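For reference, a bug report with its ordinal severity field could be modelled roughly as below; the field names follow the slide rather than any particular bug tracker's schema, and the example values are invented.

```python
# Sketch of a bug report record with an ordinal severity level (1 = most severe).
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    BLOCKER = 1
    CRITICAL = 2
    MAJOR = 3
    MINOR = 4
    TRIVIAL = 5

@dataclass
class BugReport:
    summary: str
    description: str
    product: str
    component: str
    version: str
    severity: Severity

report = BugReport(summary="Crash on save", description="The editor crashes when ...",
                   product="Writer", component="UI", version="4.1",
                   severity=Severity.CRITICAL)
print(report.severity.name, int(report.severity))  # CRITICAL 2
```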
13
Prior work in automated bug severity labelling
14
Prior work in automated bug severity labelling
Existing approaches assume that all assigned severity levels are consistent.
Classifiers are evaluated using accuracy, F-measure, and AUC.
15
Challenges and our solutions
• How to measure label inconsistency?
• How to evaluate machine learning models with inconsistent labels?
Proposed measure: human-machine agreement / human-human agreement
16
Bug lifecycle: Bug Detection & Reporting → Bug Triaging (#Validity Check, #Duplicate Bug Detection, #Bug Prioritization, #Bug Assignment) → Debugging & Bug Fixing
Duplicate bugs should have the same severity levels, if “severity level” labels are consistent.
How to measure the inconsistency? (See the sketch below.)
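A sketch of how duplicate buckets can surface inconsistent severity labels: group reports by bucket and flag buckets whose members carry different severities. The data layout (bucket id, severity) is assumed for illustration, not taken from the study's tooling.

```python
# Sketch: flag duplicate-bug buckets whose reports carry different severity labels.
from collections import defaultdict

# (bucket_id, severity) pairs; in practice these come from the bug tracker's
# duplicate links and severity fields.
reports = [(1, "Major"), (1, "Major"),
           (2, "Critical"), (2, "Minor"),
           (3, "Blocker"), (3, "Critical"), (3, "Blocker")]

buckets = defaultdict(list)
for bucket_id, severity in reports:
    buckets[bucket_id].append(severity)

inconsistent = {b: s for b, s in buckets.items() if len(set(s)) > 1}
print(f"{len(inconsistent)}/{len(buckets)} duplicate buckets have inconsistent "
      f"severity labels: {inconsistent}")
```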
17
[Diagram: duplicate buckets 1-3 with severity labels Blocker, Critical, Major, Minor, partitioned into inconsistent and clean duplicate buckets]
We manually verified 1,394 bug reports (a statistically representative sample): 95% of the inconsistent buckets are reporting the same bug.
Up to 51% of human-assigned bug severity labels are inconsistent!
18
Krippendorff's alpha: a new evaluation measure for machine learning tasks with inconsistent labels, used to compare human-machine agreement against human-human agreement.
α = 1 − D_o / D_e
where D_o is the observed disagreement and D_e is the expected disagreement when the bug severity levels are randomly assigned; α = 1 is regarded as perfect agreement (a computational sketch follows after the list below).
Benefits of alpha:
• Allows multiple labelers
• Well suited to ordinal labels
• Factors in class distributions
• Less biased by the number of labels and the number of coders
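Below is a from-scratch sketch of this measure, following the textbook coincidence-matrix formulation of Krippendorff's alpha rather than the authors' own tooling; results may therefore differ slightly from the values reported in the talk, depending on the exact distance convention. For real use, a vetted library such as the `krippendorff` package is preferable.

```python
# Sketch: Krippendorff's alpha = 1 - D_o / D_e from a raters-by-units matrix
# (None marks a missing rating), via the standard coincidence-matrix recipe.
from collections import defaultdict
from itertools import permutations

def krippendorff_alpha(data, metric="interval"):
    # Coincidence matrix over all units that received at least two ratings.
    coincidence = defaultdict(float)              # (c, k) -> weighted pair count
    for unit in zip(*data):                       # one column = one unit's ratings
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue
        for c, k in permutations(values, 2):      # ordered pairs within the unit
            coincidence[(c, k)] += 1.0 / (m - 1)

    categories = sorted({c for c, _ in coincidence})
    n_c = {c: sum(coincidence[(c, k)] for k in categories) for c in categories}
    n = sum(n_c.values())

    def delta2(c, k):                             # squared distance between categories
        if metric == "interval":
            return (c - k) ** 2
        if metric == "ordinal":                   # categories lying in between matter
            lo, hi = min(c, k), max(c, k)
            between = sum(n_c[g] for g in categories if lo <= g <= hi)
            return (between - (n_c[c] + n_c[k]) / 2.0) ** 2
        raise ValueError(f"unknown metric: {metric}")

    # Observed vs. expected disagreement.
    d_obs = sum(coincidence[(c, k)] * delta2(c, k)
                for c in categories for k in categories) / n
    d_exp = sum(n_c[c] * n_c[k] * delta2(c, k)
                for c in categories for k in categories) / (n * (n - 1))
    return 1.0 - d_obs / d_exp

# Two "raters" (e.g., human vs. machine) labelling ten bug reports with severities 1-5.
human   = [1, 2, 3, 3, 4, 3, 3, 4, 3, 5]
machine = [2, 3, 3, 3, 3, 3, 3, 5, 3, 2]
print(round(krippendorff_alpha([human, machine], metric="ordinal"), 2))
```

Applied to the pair above, this sketch lands in the same ballpark as the α = 0.27 reported on the backup slide; small differences come down to the exact ordinal-distance convention.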
19
Krippendorff's Alpha vs. Accuracy
Bug Report:  1  2  3  4  5  6  7  8  9  10
Human:       1  2  3  3  4  3  3  4  3  5
Machine A:   2  3  3  3  3  3  3  5  3  2
Machine B:   3  4  3  3  3  3  3  5  3  2
Machine C:   3  3  3  3  3  3  3  3  3  3
Accuracy: Machine A = Machine B = Machine C
Alpha: Machine A > Machine B > Machine C
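This comparison can be checked with a few lines of code. The sketch below assumes the third-party `krippendorff` package (pip install krippendorff) for the alpha computation and treats accuracy as exact-match agreement with the human label; all three machines tie on accuracy (5/10 correct, easy to verify by hand), while alpha takes the ordinal distances and class distribution into account and so can separate them, as the slide reports.

```python
# Sketch: accuracy vs. Krippendorff's alpha on the toy ratings from the slide.
# Assumes the third-party `krippendorff` package (pip install krippendorff).
import numpy as np
import krippendorff

human    = [1, 2, 3, 3, 4, 3, 3, 4, 3, 5]
machines = {"A": [2, 3, 3, 3, 3, 3, 3, 5, 3, 2],
            "B": [3, 4, 3, 3, 3, 3, 3, 5, 3, 2],
            "C": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]}

for name, preds in machines.items():
    accuracy = np.mean(np.array(preds) == np.array(human))
    alpha = krippendorff.alpha(reliability_data=[human, preds],
                               level_of_measurement="ordinal")
    print(f"Machine {name}: accuracy={accuracy:.2f}  alpha={alpha:.2f}")
```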
20
Krippendorff's alpha: human-machine agreement vs. human-human agreement.
Low agreement between machine learning systems and humans might be due to data inconsistency!

Dataset      Human-Human   Human-Machine (REP+kNN)   Agreement Ratio
OpenOffice   0.538         0.415                     0.77
Mozilla      0.675         0.556                     0.82
Eclipse      0.595         0.510                     0.86
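The "Agreement Ratio" column is simply the human-machine alpha divided by the human-human alpha; a quick arithmetic check of the table:

```python
# Agreement ratio = (human-machine alpha) / (human-human alpha) per dataset.
table = {"OpenOffice": (0.538, 0.415),
         "Mozilla":    (0.675, 0.556),
         "Eclipse":    (0.595, 0.510)}
for dataset, (human_human, human_machine) in table.items():
    print(f"{dataset}: {human_machine / human_human:.2f}")
# Prints 0.77, 0.82, 0.86, matching the table's Agreement Ratio column.
```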
21
Take-away messages
• Community intelligence can be used to identify and quantify the inconsistency of subjective labels.
• Performance of machine learning models should be measured within context, e.g., relative to human inter-rater agreement (human-machine agreement / human-human agreement).
22
During 2017, every minute of the day: 4M tweets, 36M Google searches, 600 page edits, 0.4M trips, 120 new professionals, $258,751.90 in sales.
Big Data is everywhere, however…
Bad data is costing organizations some $3.1 trillion a year in the US alone.
83% of companies said their revenue is affected by inaccurate and incomplete customer or prospect data.
23
Big Data vs. Thick Data (high-quality data that is expensive to collect and hard to scale)
[Chart: Big Data and Thick Data positioned along two axes, size of data and quality]
24
Make big data “thick”
[Charts: Big Data and Thick Data on axes of data size and quality; making big data “thick” means pushing it toward higher quality]
25
Make big data “thick”
[Charts: Big Data and Thick Data on axes of data size and quality, as on the previous slide]
Challenges:
1. Impossible to specify all the data semantics beforehand.
2. Manual labelling of noise is expensive and time-consuming, and impossible to scale.
3. Lack of quality metrics for big data.
4. Lack of performance measures for noisy data.
26
How to make big data thick in a lightweight, cost-effective way?
• Integrate existing internal/external resources collectively created by relevant communities.
• Utilize knowledge in unstructured data.
27
Future Roadmap: Big Data to Thick Data
• Assessing the quality of big data
• Lightweight, scalable noise reduction/correction techniques
• New noise-tolerant learning algorithms for big data
• New performance measures for noisy data
28
Conclusion (ytian@smu.edu.sg)
• Classification is in some cases subjective, which results in inter-labeler variability.
• A new measure for machine learning performance with inconsistent labels: human-machine agreement / human-human agreement.
• Challenges in going from Big Data to Thick Data: we need metrics for quality assessment and model evaluation, and lightweight noise reduction methods.
ytian@smu.edu.sg
Backup Slides
29
30
Computation of Krippendorff's alpha
Bucket/Bug Report:  1  2  3  4  5  6  7  8  9  10
Rating 1:           1  2  3  3  4  3  3  4  3  5
Rating 2:           2  3  3  3  3  3  3  5  3  2
#Raters:            2  2  2  2  2  2  2  2  2  2
[Figures: a count matrix derived from the ratings and a predefined distance matrix, where (c,k) represents a pair of ratings; the ordinal distance matrix is why alpha is well suited to ordinal labels]
α = 0.27
31
How do inconsistent labels affect various machine learning models?
[Diagram of the experimental setup: duplicate bug reports split into clean and inconsistent bug reports; test bug reports; training sets with an inconsistent data ratio ranging from 0% to 20%]
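A sketch of how a training set with a target ratio of inconsistent reports might be assembled from the clean and inconsistent pools in the setup above; the report identifiers and pool sizes are invented for illustration.

```python
# Sketch: build a training set containing a target fraction of reports drawn
# from inconsistent duplicate buckets, with the remainder drawn from clean buckets.
import random

def build_training_set(clean_reports, inconsistent_reports, ratio, size, seed=0):
    """ratio = desired fraction of the training set coming from inconsistent buckets."""
    rng = random.Random(seed)
    n_inconsistent = round(size * ratio)
    return (rng.sample(inconsistent_reports, n_inconsistent)
            + rng.sample(clean_reports, size - n_inconsistent))

clean = [f"clean_{i}" for i in range(1000)]              # hypothetical report ids
inconsistent = [f"inconsistent_{i}" for i in range(400)]
for ratio in (0.0, 0.05, 0.10, 0.20):
    train = build_training_set(clean, inconsistent, ratio, size=500)
    n_inc = sum(r.startswith("inconsistent_") for r in train)
    print(f"ratio={ratio:.2f}: {n_inc}/{len(train)} reports from inconsistent buckets")
```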
32
Noise injection leads to a drop in alpha for all datasets
[Plots for Eclipse, Mozilla, and OpenOffice: alpha versus the ratio of inconsistent bug reports in the training set, for Decision Tree, Naïve Bayes Multinomial, Support Vector Machine, and REP + k-Nearest Neighbor]
Editor's Notes
  • #2: Thanks, Prof. Z, for the introduction, and thanks to all of you for taking the time to attend this talk; I am honored to be invited here. As you can see from the title, I'd like to share some of my experience in dealing with inconsistent labels, which is important for making later analysis of the data reliable and effective. Please feel free to interrupt if anything on a slide is unclear.
  • #3: To begin with, I would like to introduce supervised machine learning models, which are probably the most common machine learning models used nowadays. They are also the ones most affected by the quality of labels. The slide shows the general flow of supervised machine learning: it takes labelled training data as input, gleans information from it, and eventually learns a model that can label new data. Let me give you a simple example. At the top of the slide you can see 7 images, each associated with a 1/0 label indicating whether there is a cat in the image. Supervised machine learning takes these data-label pairs as training input and then extracts features from the images. In the traditional machine learning flow, features are defined manually, while the latest deep learning techniques learn features from the data. After feature extraction, a model is learned as a mapping between feature values and labels, so that, given the new image shown at the bottom right of the slide, we hope the learned model is able to identify that there is a cat in the image.
  • #4: Since supervised machine learning models are popular and rely heavily on labels, we machine learning practitioners intuitively want clean labels in our training data. However, things don't always go as we wish. Have you noticed that, among the 7 images we saw on the previous slide, the image circled in red contains a dog instead of a cat? In this case we say we have encountered an incorrect label. In fact, real-world data often contain noisy labels for various reasons. For example, we humans make mistakes, including experts. Secondly, since collecting reliable labels is an expensive and time-consuming task, many studies leverage crowdsourcing platforms to collect non-expert labels cheaply and quickly, and some regard machine-generated labels as the ground truth. Lastly, noisy labels might simply be due to communication or encoding problems.
  • #5: People have studied the impact of noisy labels in the machine learning area for a long time, theoretically or empirically demonstrating that noisy labels can have negative consequences for learning. The two most important are decreased performance and unreliable performance measures. The image on the right shows a study of the performance of traditional classifiers on a particular task; we can see that the performance of all the classifiers drops dramatically once the noise level exceeds 3%. Noisy labels also often lead to more complex models and incorrect influential features. http://www.stat.purdue.edu/~jianzhan/papers/sigir03zhang.pdf
  • #6: (This slide builds on the previous one; the same points apply.) The message I want to deliver here is that we should take care of noisy labels when we design machine learning models.
  • #7: In most research on analysing noisy labels, there is an important assumption that absolute ground truth for the labels exists; for example, we could easily tell that the circled image was wrongly labelled. However, classification can be subjective, and then the ground truth is hard to tell.
  • #8: For instance, two doctors may give different diagnoses for the same patient based on their experience, especially when the information is incomplete. In the field of computer security, different companies have their own standards for determining whether a piece of software is malware or not; thus, when people combine different malware benchmarks, there may be conflicting labels for the same software. In image tagging, users are allowed to create tags by themselves, so we may find different tags for the same object. Other data, such as movie ratings and application ratings, also suffer from inconsistency caused by the subjective classification process.
  • #9: To summarize the inconsistent-label noise we have seen so far, we can divide it into two groups depending on whether the label can be represented by a numeric or a categorical variable. For questions about people's opinions, like how interesting a book is, a score between 0-10 or 1-5 is usually assigned to measure the level of agreement with a statement, so we can use a numeric value to represent each category. For scenarios like image tagging, each tag is a categorical label. Similarly, job titles created by different companies may differ for the same person, or the same job; there is no standard terminology for the same data. Subjective classification tasks happen often, especially when we want to model user behaviour and preferences, and if we simply ignore the inconsistent-label noise introduced by the process, learning will suffer a performance drop and the many other negative consequences we discussed at the beginning of this talk.
  • #10: So here comes the question: how do people cope with inconsistent-label noise? To answer this question, we first need to figure out when the inconsistent labels are encountered. Sometimes we are the ones who control the label collection process; sometimes we start with already-labelled data. In the label collection process, some people are not aware of the potential inconsistency introduced by the subjective nature of the classification task, while others do care about the labelling process, especially when they are creating benchmark datasets. Several strategies are adopted, which I believe all researchers should consider before collecting human-annotated labels: for example, involving multiple labelers and making sure that each instance is labelled by more than one labeler. If disagreement appears on an instance, we should consider either discarding the data or resolving the disagreement among labelers. Recently there has also been a trend of adopting pairwise comparison rather than assigning an absolute score, but this method requires many pairs of comparisons. If we have no control over the process but we do have labels from multiple labelers, many studies use different voting methods to merge the multiple labelers' labels into one label, but this method … Other studies assess the reliability of each labeler and filter outliers, or treat the label as a distribution over all possible labels, which is called an uncertain label. But what if we only have one label per instance, especially when we know that nothing was controlled in the label collection process? In the literature, little work considers this case, but we keep seeing the danger of ignoring such inconsistency noise, which motivates my work in this area. The key challenges for the single-labeler case are: how to measure the inconsistency of labelers, and how to cope with such inconsistency?
  • #11: And the key point of our solution is to leverage collectively provided labels on other tasks, which I call community intelligence, to turn the single-labeler setting into a multi-labeler setting.
  • #12: This work is part of my long-term research program which focuses on coping with data quality issues in big data settings.
  • #13: The bug severity level reflects the impact of a bug on the system; it is assigned during the bug reporting process, which is shown on the slide.