On the Unreliability of Bug Severity Data
Yuan TIAN
Data Scientist at Living Analytics Research Centre,
Singapore Management University
ytian@smu.edu.sg
April 18th, 2018 @ Queen’s University, Canada
2
Supervised Machine Learning models rely heavily on labels
[Diagram: Training Data with Labels → Feature Extraction → Learning → Predictive Model; the model assigns an Expected Label (Cat / Not Cat) to New Data]
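To make the flow in the diagram concrete, here is a minimal sketch of a supervised pipeline in scikit-learn; the feature vectors and labels are invented stand-ins for the cat images, not data from the talk.

```python
# Minimal supervised-learning sketch: labelled training data -> feature
# extraction (here the features are already given) -> learning -> prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training data: one hand-crafted feature vector per image, label 1 = cat, 0 = not cat.
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.85, 0.15],
                    [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]])
y_train = np.array([1, 1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(X_train, y_train)  # learning step

X_new = np.array([[0.75, 0.2]])                     # new, unlabelled data point
print(model.predict(X_new))                         # expected label, likely 1 ("cat")
```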
3
Traditional machine learning models suffer from noisy labels in the training set
Noisy labels due to:
• Human mistakes
• Non-expert-generated labels
• Machine-generated labels
• Communication or encoding problems
4
Traditional machine learning models suffer from noisy labels in the training set
Noisy labels due to:
• Human mistakes
• Non-expert-generated labels
• Machine-generated labels
• Communication or encoding problems
Consequence: decreased performance
[Plot: F1 of traditional classifiers versus label noise level; e.g., F1 = 0.83 vs. F1 = 0.4 around a noise level of 3%]
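The drop shown in the plot can be reproduced in spirit with a small experiment: flip a fraction of the training labels and watch the F1 score degrade. This is only a sketch on synthetic scikit-learn data, not the study behind the figure, so the exact numbers will differ.

```python
# Sketch: inject label noise into the training set and observe the F1 drop.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for noise_level in [0.0, 0.03, 0.1, 0.3]:
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise_level      # flip this fraction of labels
    y_noisy[flip] = 1 - y_noisy[flip]
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy)
    print(f"noise={noise_level:.2f}  F1={f1_score(y_te, clf.predict(X_te)):.2f}")
```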
5
Traditional machine learning models suffer from noisy labels in the training set
Noisy labels due to:
• Human mistakes
• Non-expert-generated labels
• Machine-generated labels
• Communication or encoding problems
Consequences:
• Decreased performance
• More complex model
• Unreliable performance measures
• Incorrect influential features
6
Traditional machine learning models suffer from noisy labels in the training set
Consequences: decreased performance, more complex model, unreliable performance measures, incorrect influential features.
These studies assume that absolute ground truth for the labels exists, although the labels may be noisy for various reasons.
7
Classification is in some cases subjective, which results in inter-labeler variability ("inconsistent labels" noise).
Examples: medical diagnosis, malware detection, image tagging.
8
“Inconsistent Labels” Noise
Classification is in some cases subjective, which results in inter-labeler variability (cont.)
Numeric labels:
• How interesting is this book?
• How informative is this tweet?
• Is it a high-quality product?
• How severe is the problem?
Categorical labels:
• User-created tags for images, content, etc.
• Job titles created by different companies
9
How people cope with label noise (“inconsistent labels”) caused by subjective labelling, etc.
Before collecting labels:
• Shared labelling criteria
• Multiple labelers
• Repeated labelling
• Keep labels agreed on by all / the majority
• Pairwise comparison
Readily available data, multiple labelers:
• Averaging
• Majority voting (see the sketch below)
• Consensus voting
• Remove outliers
• Learning with uncertain labels
Readily available data, single labeler: ?
• How to measure inconsistency?
• How to cope with inconsistency?
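As a small illustration of the "readily available data, multiple labelers" strategies above, here is a sketch of majority voting that also flags instances with no clear majority; the labels and data are invented for illustration.

```python
# Sketch: merge multiple labelers' votes per instance by majority voting,
# flagging instances with no strict majority (candidates for review or removal).
from collections import Counter

def majority_vote(labels):
    """Return (most common label, True if it is a strict majority)."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes > len(labels) / 2

# Each inner list holds the labels different labelers gave to one instance.
ratings = [["cat", "cat", "dog"], ["dog", "cat"], ["cat", "cat", "cat"]]
for i, labels in enumerate(ratings):
    label, ok = majority_vote(labels)
    print(i, label, "majority" if ok else "no clear majority -> review or drop")
```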
10
Community Intelligence
How people cope with label noise (“inconsistent labels”) caused by subjective labelling:
• How to measure inconsistency?
• How to cope with inconsistency?
11
Talk Outline
Part 1:
• Noisy labels negatively impact learning
• Overview of approaches for coping with inconsistent labels
• Example: bug severity levels
Part 2:
• Future research direction: Big Data to Thick (high-quality) Data
12
A tester/user files a bug report with: Summary; Description; Product, component, version; Severity.
Severity levels: 1: Blocker, 2: Critical, 3: Major, 4: Minor, 5: Trivial.
Severity is assigned to reflect the level of impact that a bug has on the system.
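For reference, a bug report with its ordinal severity field could be modelled roughly as below; the field names follow the slide rather than any particular bug tracker's schema, and the example values are invented.

```python
# Sketch of a bug report record with an ordinal severity level (1 = most severe).
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    BLOCKER = 1
    CRITICAL = 2
    MAJOR = 3
    MINOR = 4
    TRIVIAL = 5

@dataclass
class BugReport:
    summary: str
    description: str
    product: str
    component: str
    version: str
    severity: Severity

report = BugReport(summary="Crash on save", description="The editor crashes when ...",
                   product="Writer", component="UI", version="4.1",
                   severity=Severity.CRITICAL)
print(report.severity.name, int(report.severity))  # CRITICAL 2
```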
13
Prior work in automated bug severity labelling
14
Prior work in automated bug severity labelling
Existing approaches assume that all assigned severity levels are consistent.
Classifiers are evaluated using accuracy, F-measure, and AUC.
15
Challenges and our solutions
• How to measure label inconsistency?
• How to evaluate machine learning models with inconsistent labels?
Proposed measure: human-machine agreement / human-human agreement
16
Bug lifecycle: Bug Detection & Reporting → Bug Triaging (#Validity Check, #Duplicate Bug Detection, #Bug Prioritization, #Bug Assignment) → Debugging & Bug Fixing
Duplicate bugs should have the same severity levels, if “severity level” labels are consistent.
How to measure the inconsistency? (See the sketch below.)
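A sketch of how duplicate buckets can surface inconsistent severity labels: group reports by bucket and flag buckets whose members carry different severities. The data layout (bucket id, severity) is assumed for illustration, not taken from the study's tooling.

```python
# Sketch: flag duplicate-bug buckets whose reports carry different severity labels.
from collections import defaultdict

# (bucket_id, severity) pairs; in practice these come from the bug tracker's
# duplicate links and severity fields.
reports = [(1, "Major"), (1, "Major"),
           (2, "Critical"), (2, "Minor"),
           (3, "Blocker"), (3, "Critical"), (3, "Blocker")]

buckets = defaultdict(list)
for bucket_id, severity in reports:
    buckets[bucket_id].append(severity)

inconsistent = {b: s for b, s in buckets.items() if len(set(s)) > 1}
print(f"{len(inconsistent)}/{len(buckets)} duplicate buckets have inconsistent "
      f"severity labels: {inconsistent}")
```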
17
[Diagram: duplicate buckets 1-3 with severity labels Blocker, Critical, Major, Minor, partitioned into inconsistent and clean duplicate buckets]
We manually verified 1,394 bug reports (a statistically representative sample): 95% of the inconsistent buckets are reporting the same bug.
Up to 51% of human-assigned bug severity labels are inconsistent!
18
Krippendorff's alpha: a new evaluation measure for machine learning tasks with inconsistent labels, used to compare human-machine agreement against human-human agreement.
α = 1 − D_o / D_e
where D_o is the observed disagreement and D_e is the expected disagreement when the bug severity levels are randomly assigned; α = 1 is regarded as perfect agreement (a computational sketch follows after the list below).
Benefits of alpha:
• Allows multiple labelers
• Well suited to ordinal labels
• Factors in class distributions
• Less biased by the number of labels and the number of coders
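Below is a from-scratch sketch of this measure, following the textbook coincidence-matrix formulation of Krippendorff's alpha rather than the authors' own tooling; results may therefore differ slightly from the values reported in the talk, depending on the exact distance convention. For real use, a vetted library such as the `krippendorff` package is preferable.

```python
# Sketch: Krippendorff's alpha = 1 - D_o / D_e from a raters-by-units matrix
# (None marks a missing rating), via the standard coincidence-matrix recipe.
from collections import defaultdict
from itertools import permutations

def krippendorff_alpha(data, metric="interval"):
    # Coincidence matrix over all units that received at least two ratings.
    coincidence = defaultdict(float)              # (c, k) -> weighted pair count
    for unit in zip(*data):                       # one column = one unit's ratings
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue
        for c, k in permutations(values, 2):      # ordered pairs within the unit
            coincidence[(c, k)] += 1.0 / (m - 1)

    categories = sorted({c for c, _ in coincidence})
    n_c = {c: sum(coincidence[(c, k)] for k in categories) for c in categories}
    n = sum(n_c.values())

    def delta2(c, k):                             # squared distance between categories
        if metric == "interval":
            return (c - k) ** 2
        if metric == "ordinal":                   # categories lying in between matter
            lo, hi = min(c, k), max(c, k)
            between = sum(n_c[g] for g in categories if lo <= g <= hi)
            return (between - (n_c[c] + n_c[k]) / 2.0) ** 2
        raise ValueError(f"unknown metric: {metric}")

    # Observed vs. expected disagreement.
    d_obs = sum(coincidence[(c, k)] * delta2(c, k)
                for c in categories for k in categories) / n
    d_exp = sum(n_c[c] * n_c[k] * delta2(c, k)
                for c in categories for k in categories) / (n * (n - 1))
    return 1.0 - d_obs / d_exp

# Two "raters" (e.g., human vs. machine) labelling ten bug reports with severities 1-5.
human   = [1, 2, 3, 3, 4, 3, 3, 4, 3, 5]
machine = [2, 3, 3, 3, 3, 3, 3, 5, 3, 2]
print(round(krippendorff_alpha([human, machine], metric="ordinal"), 2))
```

Applied to the pair above, this sketch lands in the same ballpark as the α = 0.27 reported on the backup slide; small differences come down to the exact ordinal-distance convention.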
19
Krippendorff's Alpha vs. Accuracy
Bug Report:  1  2  3  4  5  6  7  8  9  10
Human:       1  2  3  3  4  3  3  4  3  5
Machine A:   2  3  3  3  3  3  3  5  3  2
Machine B:   3  4  3  3  3  3  3  5  3  2
Machine C:   3  3  3  3  3  3  3  3  3  3
Accuracy: Machine A = Machine B = Machine C
Alpha: Machine A > Machine B > Machine C
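This comparison can be checked with a few lines of code. The sketch below assumes the third-party `krippendorff` package (pip install krippendorff) for the alpha computation and treats accuracy as exact-match agreement with the human label; all three machines tie on accuracy (5/10 correct, easy to verify by hand), while alpha takes the ordinal distances and class distribution into account and so can separate them, as the slide reports.

```python
# Sketch: accuracy vs. Krippendorff's alpha on the toy ratings from the slide.
# Assumes the third-party `krippendorff` package (pip install krippendorff).
import numpy as np
import krippendorff

human    = [1, 2, 3, 3, 4, 3, 3, 4, 3, 5]
machines = {"A": [2, 3, 3, 3, 3, 3, 3, 5, 3, 2],
            "B": [3, 4, 3, 3, 3, 3, 3, 5, 3, 2],
            "C": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]}

for name, preds in machines.items():
    accuracy = np.mean(np.array(preds) == np.array(human))
    alpha = krippendorff.alpha(reliability_data=[human, preds],
                               level_of_measurement="ordinal")
    print(f"Machine {name}: accuracy={accuracy:.2f}  alpha={alpha:.2f}")
```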
20
Krippendorff's alpha: human-machine agreement vs. human-human agreement.
Low agreement between machine learning systems and humans might be due to data inconsistency!

Dataset      Human-Human   Human-Machine (REP+kNN)   Agreement Ratio
OpenOffice   0.538         0.415                     0.77
Mozilla      0.675         0.556                     0.82
Eclipse      0.595         0.510                     0.86
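The "Agreement Ratio" column is simply the human-machine alpha divided by the human-human alpha; a quick arithmetic check of the table:

```python
# Agreement ratio = (human-machine alpha) / (human-human alpha) per dataset.
table = {"OpenOffice": (0.538, 0.415),
         "Mozilla":    (0.675, 0.556),
         "Eclipse":    (0.595, 0.510)}
for dataset, (human_human, human_machine) in table.items():
    print(f"{dataset}: {human_machine / human_human:.2f}")
# Prints 0.77, 0.82, 0.86, matching the table's Agreement Ratio column.
```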
21
Take-away messages
• Community intelligence can be used to identify and quantify the inconsistency of subjective labels.
• Performance of machine learning models should be measured within context, e.g., relative to human inter-rater agreement (human-machine agreement / human-human agreement).
22
During 2017, every minute of the day: 4M tweets, 36M Google searches, 600 page edits, 0.4M trips, 120 new professionals, $258,751.90 in sales.
Big Data is everywhere, however…
Bad data is costing organizations some $3.1 trillion a year in the US alone.
83% of companies said their revenue is affected by inaccurate and incomplete customer or prospect data.
23
Big Data vs. Thick Data (high-quality data that is expensive to collect and hard to scale)
[Chart: Big Data and Thick Data positioned along two axes, size of data and quality]
24
Make big data “thick”
[Charts: Big Data and Thick Data on axes of data size and quality; making big data “thick” means pushing it toward higher quality]
25
Make big data “thick”
[Charts: Big Data and Thick Data on axes of data size and quality, as on the previous slide]
Challenges:
1. Impossible to specify all the data semantics beforehand.
2. Manual labelling of noise is expensive and time-consuming, and impossible to scale.
3. Lack of quality metrics for big data.
4. Lack of performance measures for noisy data.
26
How to make big data thick in a lightweight, cost-effective way?
• Integrate existing internal/external resources collectively created by relevant communities.
• Utilize knowledge in unstructured data.
27
Future Roadmap: Big Data to Thick Data
• Assessing the quality of big data
• Lightweight, scalable noise reduction/correction techniques
• New noise-tolerant learning algorithms for big data
• New performance measures for noisy data
28
Conclusion (ytian@smu.edu.sg)
• Classification is in some cases subjective, which results in inter-labeler variability.
• A new measure for machine learning performance with inconsistent labels: human-machine agreement / human-human agreement.
• Challenges in going from Big Data to Thick Data: we need metrics for quality assessment and model evaluation, and lightweight noise reduction methods.
ytian@smu.edu.sg
Backup Slides
29
30
Computation of Krippendorff's alpha
Bucket/Bug Report:  1  2  3  4  5  6  7  8  9  10
Rating 1:           1  2  3  3  4  3  3  4  3  5
Rating 2:           2  3  3  3  3  3  3  5  3  2
#Raters:            2  2  2  2  2  2  2  2  2  2
[Figures: a count matrix derived from the ratings and a predefined distance matrix, where (c,k) represents a pair of ratings; the ordinal distance matrix is why alpha is well suited to ordinal labels]
α = 0.27
31
How do inconsistent labels affect various machine learning models?
[Diagram of the experimental setup: duplicate bug reports split into clean and inconsistent bug reports; test bug reports; training sets with an inconsistent data ratio ranging from 0% to 20%]
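A sketch of how a training set with a target ratio of inconsistent reports might be assembled from the clean and inconsistent pools in the setup above; the report identifiers and pool sizes are invented for illustration.

```python
# Sketch: build a training set containing a target fraction of reports drawn
# from inconsistent duplicate buckets, with the remainder drawn from clean buckets.
import random

def build_training_set(clean_reports, inconsistent_reports, ratio, size, seed=0):
    """ratio = desired fraction of the training set coming from inconsistent buckets."""
    rng = random.Random(seed)
    n_inconsistent = round(size * ratio)
    return (rng.sample(inconsistent_reports, n_inconsistent)
            + rng.sample(clean_reports, size - n_inconsistent))

clean = [f"clean_{i}" for i in range(1000)]              # hypothetical report ids
inconsistent = [f"inconsistent_{i}" for i in range(400)]
for ratio in (0.0, 0.05, 0.10, 0.20):
    train = build_training_set(clean, inconsistent, ratio, size=500)
    n_inc = sum(r.startswith("inconsistent_") for r in train)
    print(f"ratio={ratio:.2f}: {n_inc}/{len(train)} reports from inconsistent buckets")
```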
32
Noise injection leads to a drop in alpha for all datasets
[Plots for Eclipse, Mozilla, and OpenOffice: alpha versus the ratio of inconsistent bug reports in the training set, for Decision Tree, Naïve Bayes Multinomial, Support Vector Machine, and REP + k-Nearest Neighbor]
Editor's Notes
  • #2: Thanks, Prof. Z, for the introduction, and thanks to all of you for taking the time to attend this talk; I am honored to be invited here. As you can see from the title, I'd like to share some of my experience in dealing with inconsistent labels, which is important for making later analysis of the data reliable and effective. Please feel free to interrupt if anything on a slide is unclear.
  • #3: To begin with, I would like to introduce supervised machine learning models, which are probably the most common machine learning models used nowadays. They are also the ones most affected by the quality of labels. The slide shows the general flow of supervised machine learning: it takes labelled training data as input, gleans information from it, and eventually learns a model that can label new data. Let me give you a simple example. At the top of the slide you can see 7 images, each associated with a 1/0 label indicating whether there is a cat in the image. Supervised machine learning takes these data-label pairs as training input and then extracts features from the images. In the traditional machine learning flow, features are defined manually, while the latest deep learning techniques learn features from the data. After feature extraction, a model is learned as a mapping between feature values and labels, so that, given the new image shown at the bottom right of the slide, we hope the learned model is able to identify that there is a cat in the image.
  • #4: Since supervised machine learning models are popular and rely heavily on labels, we machine learning practitioners intuitively want clean labels in our training data. However, things don't always go as we wish. Have you noticed that, among the 7 images we saw on the previous slide, the image circled in red contains a dog instead of a cat? In this case we say we have encountered an incorrect label. In fact, real-world data often contain noisy labels for various reasons. For example, we humans make mistakes, including experts. Secondly, since collecting reliable labels is an expensive and time-consuming task, many studies leverage crowdsourcing platforms to collect non-expert labels cheaply and quickly, and some regard machine-generated labels as the ground truth. Lastly, noisy labels might simply be due to communication or encoding problems.
  • #5: People have studied the impact of noisy labels in the machine learning area for a long time, theoretically or empirically demonstrating that noisy labels can have negative consequences for learning. The two most important are decreased performance and unreliable performance measures. The image on the right shows a study of the performance of traditional classifiers on a particular task; we can see that the performance of all the classifiers drops dramatically once the noise level exceeds 3%. Noisy labels also often lead to more complex models and incorrect influential features. http://www.stat.purdue.edu/~jianzhan/papers/sigir03zhang.pdf
  • #6: (This slide builds on the previous one; the same points apply.) The message I want to deliver here is that we should take care of noisy labels when we design machine learning models.
  • #7: In most research on analysing noisy labels, there is an important assumption that absolute ground truth for the labels exists; for example, we could easily tell that the circled image was wrongly labelled. However, classification can be subjective, and then the ground truth is hard to tell.
  • #8: For instance, two doctors may give different diagnoses for the same patient based on their experience, especially when the information is incomplete. In the field of computer security, different companies have their own standards for determining whether a piece of software is malware or not; thus, when people combine different malware benchmarks, there may be conflicting labels for the same software. In image tagging, users are allowed to create tags by themselves, so we may find different tags for the same object. Other data, such as movie ratings and application ratings, also suffer from inconsistency caused by the subjective classification process.
  • #9: To summarize the inconsistent-label noise we have seen so far, we can divide it into two groups depending on whether the label can be represented by a numeric or a categorical variable. For questions about people's opinions, like how interesting a book is, a score between 0-10 or 1-5 is usually assigned to measure the level of agreement with a statement, so we can use a numeric value to represent each category. For scenarios like image tagging, each tag is a categorical label. Similarly, job titles created by different companies may differ for the same person, or the same job; there is no standard terminology for the same data. Subjective classification tasks happen often, especially when we want to model user behaviour and preferences, and if we simply ignore the inconsistent-label noise introduced by the process, learning will suffer a performance drop and the many other negative consequences we discussed at the beginning of this talk.
  • #10: So here comes the question: how do people cope with inconsistent-label noise? To answer this question, we first need to figure out when the inconsistent labels are encountered. Sometimes we are the ones who control the label collection process; sometimes we start with already-labelled data. In the label collection process, some people are not aware of the potential inconsistency introduced by the subjective nature of the classification task, while others do care about the labelling process, especially when they are creating benchmark datasets. Several strategies are adopted, which I believe all researchers should consider before collecting human-annotated labels: for example, involving multiple labelers and making sure that each instance is labelled by more than one labeler. If disagreement appears on an instance, we should consider either discarding the data or resolving the disagreement among labelers. Recently there has also been a trend of adopting pairwise comparison rather than assigning an absolute score, but this method requires many pairs of comparisons. If we have no control over the process but we do have labels from multiple labelers, many studies use different voting methods to merge the multiple labelers' labels into one label, but this method … Other studies assess the reliability of each labeler and filter outliers, or treat the label as a distribution over all possible labels, which is called an uncertain label. But what if we only have one label per instance, especially when we know that nothing was controlled in the label collection process? In the literature, little work considers this case, but we keep seeing the danger of ignoring such inconsistency noise, which motivates my work in this area. The key challenges for the single-labeler case are: how to measure the inconsistency of labelers, and how to cope with such inconsistency?
  • #11: And the key point of our solution is to leverage collectively provided labels on other tasks, which I call community intelligence, to turn the single-labeler setting into a multi-labeler setting.
  • #12: This work is part of my long-term research program which focuses on coping with data quality issues in big data settings.
  • #13: The bug severity level reflects the impact of a bug on the system; it is assigned during the bug reporting process, which is shown on the slide.