The University of Adelaide
Data Quality for Software Vulnerability Datasets
Centre of Research on Engineering Software Technologies (CREST - @crest_uofa)
School of Computer Science, The University of Adelaide, Australia
Cyber Security Cooperative Research Centre, Australia
The 45th International Conference on Software Engineering (ICSE ‘23)
May 17, 2023
Roland Croft
roland.croft@adelaide.edu.au
M. Ali Babar
ali.babar@adelaide.edu.au
Mehdi Kholoosi
mehdi.kholoosi@adelaide.edu.au
Growth of AI
AI is beginning to shape software development and software quality assurance.
Software Vulnerability Prediction
• Utilise AI to improve automation and effectiveness of vulnerability detection.
• Use knowledge from previous examples to automatically learn vulnerable patterns.
Previous known Vulnerabilities
Machine Learning
Prediction
Software Vulnerability Prediction
• Utilise AI to improve automation and effectiveness of vulnerability detection.
• Use knowledge from previous examples to automatically learn vulnerable patterns.
Previous known Vulnerabilities
Machine Learning
Prediction
Data is the core component of any data-driven pipeline: “Garbage In, Garbage Out”
Software Vulnerability Datasets
Weak Supervision
1. Vulnerability Reports
2. Development Commit Logs
3. Static Analysis Tools
4. Synthetic Data
Research Objective
Aim
To gain a deep understanding of the nature of data quality in software vulnerability datasets.
Outcomes
1. Inform the state of software vulnerability data quality and the reliability of downstream tasks.
2. Enable automated data cleaning frameworks to improve data quality and downstream tasks.
Research Design
Data Quality Attributes
1. Accuracy
2. Uniqueness
3. Consistency
4. Completeness
5. Currentness
Research Design
Labelling Heuristic and Selected Dataset:
Security: Big-Vul
Developer: Devign
Tool: D2A
Synthetic: Juliet Test Suite
Research Design
Inspect the change in model performance caused by attempting to reduce data quality issues.
Findings - Accuracy
“The degree to which the data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use.”
Manually inspected label correctness: Big-Vul 54.3%, Devign 80.0%, D2A 28.6%, Juliet 100%.
Lower performance when models are evaluated on true (manually verified) labels: drops of 29%, 50%, and 80%.
Findings - Uniqueness
“The degree to which there is no duplication in records.”
Proportion of unique records: Big-Vul 83.0%, Devign 89.9%, D2A 2.1%, Juliet 16.3%.
(Chart: model performance with and without duplicates for the Security, Developer, Tool, and Synthetic heuristics.) Removing duplicates lowers the measured performance by 13.9%, 81.7%, and 10.4%.
Key Takeaways
State-of-the-art software vulnerability datasets are imperfect.
Data quality significantly affects the performance of downstream software security models.
We need better cleaning methods or more robust models to enable reliable and effective data-driven software security.
Dataset data quality values:

Dataset   Accuracy  Uniqueness  Consistency  Completeness  Currentness
Big-Vul   0.543     0.830       0.999        0.824         0.761
Devign    0.800     0.899       0.991        0.944         0.811
D2A       0.286     0.021       0.531        0.981         0.844
Juliet    1.000     0.163       0.750        1.000         N/A


Editor's Notes

  • #2: Self-Introduction. I will be presenting our paper “Data Quality for Software Vulnerability Datasets.”
  • #3: Many of us have been witnessing the huge growth in AI over the last few years, and the software engineering community is no exception. Many organizations are beginning to harness the power of AI to provide intelligent tools that assist with software development and quality assurance. For instance, ChatGPT has blown away the world with its remarkable capabilities for programming and code comprehension. A properly trained model is powerful, and it allows us to effectively automate tasks that we’d otherwise find challenging or time-consuming.  
  • #4: Now in the software security domain, there are actually a lot of really hard, time-consuming tasks we’d love to automate. We’ll focus on software vulnerability detection. Vulnerabilities are security weaknesses in the code that can cause catastrophic consequences when exploited by attackers. The issue, however, is that they are hard to spot, and it can take developers years and years to review and test every single piece of code. This is where AI comes in. AI has shown much promise towards improving the automation and effectiveness of software vulnerability detection. The basic idea of these solutions is that we use historical records of vulnerability examples to train learning-based models that can automatically detect vulnerable patterns. This example here depicts a simple but dangerous buffer overflow, which we can show to our model, and after it works its magic it can theoretically spot the vulnerability in future.
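A minimal sketch of that training loop, not the models evaluated in the paper: real vulnerability prediction research uses deep code models, but a bag-of-tokens classifier over a few hypothetical C snippets shows the shape of the pipeline (labelled functions in, a learned detector out).

```python
# Hedged sketch: toy data and a linear model stand in for the deep models
# actually used in software vulnerability prediction research.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical (function body, label) pairs; 1 = vulnerable, 0 = not vulnerable.
train_code = [
    "char buf[8]; strcpy(buf, user_input);",                     # unbounded copy
    "char buf[8]; strncpy(buf, user_input, sizeof(buf) - 1);",   # bounded copy
    "int i = atoi(arg); return table[i];",                       # unchecked index
    "int i = atoi(arg); if (i >= 0 && i < N) return table[i];",  # checked index
]
train_labels = [1, 0, 1, 0]

# Bag-of-tokens features + logistic regression: learn token patterns that
# co-occur with the vulnerable label.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"\w+|\S"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_code, train_labels)

# Score an unseen function; a real system would rank every function in a project.
print(model.predict(["char dst[4]; strcpy(dst, argv[1]);"]))
```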
  • #5: Now as you may have guessed from the title, this talk isn’t actually going to be about this amazing little machine learning model here. No, it’s going to be about the data. Why? Because the data is actually rather important. A fundamental concept in computer science states that the quality of a system’s outputs is dictated by the quality of its inputs. This concept is beautifully summarized by the saying “garbage in, garbage out.” The data is important.
  • #6: So how do we get a nice, cleanly labelled vulnerability dataset? Well, this is actually extremely difficult. For traditional supervised learning problems, we might get some subject matter expert to hand-label the data. But we can’t really do this for vulnerability data as it’s extremely scarce and complex. We instead use weak supervision to obtain some higher-level indicators to produce our labels. I’ll go through each of the four main ways we can do this. Firstly, over the lifetime of a project, we naturally detect and report vulnerabilities through testing and use. For open source software, these reports are often documented in security advisories. We can attempt to trace the information contained in these reports back to the original code, and this gives us an idea of which code snippets were vulnerable. The second approach is very similar to the last one, but rather than going through a third-party vulnerability database, we can just look at the development history directly for commits describing vulnerability fixes. However, these two sources only provide label indicators for known vulnerabilities. This means we get very small datasets in practice. This is where our third approach comes in. What if we didn’t have to wait for a developer to spot a vulnerability in order to know where it is? Well, we can use some automatic tools to scan the code and tell us where the vulnerabilities are. Of course, this heavily relies on how reliable our tool is. Finally, to overcome these uncertainties, we can kind of cheat and simply make the data up. This is called synthetic data, where we automatically create examples of code that we know to be vulnerable or not vulnerable, using known patterns. Now, none of these data collection approaches are perfect, unfortunately. As each of these data sources uses relatively weak label indicators, they exhibit weaknesses and produce lower-quality datasets than traditional supervision. But despite the importance of the data, and the difficulties we have in repairing it, we’ve found data quality to actually be a rather ill-considered concept in software security, until now.
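As a concrete illustration of the second heuristic only: a minimal, hypothetical commit-message matcher in the spirit of "look at the development history for commits describing vulnerability fixes". The keyword patterns and the `Commit` structure are assumptions for the example, not the construction used by Devign or the other datasets.

```python
# Hypothetical weak-supervision heuristic: treat a commit as vulnerability-fixing
# if its message matches security-fix patterns, then label the pre-fix versions
# of the functions it changed as "vulnerable". Patterns are illustrative only.
import re
from dataclasses import dataclass

VULN_FIX_PATTERNS = [
    r"\bCVE-\d{4}-\d+\b",           # explicit CVE reference
    r"\bbuffer overflow\b",
    r"\buse[- ]after[- ]free\b",
    r"\bout[- ]of[- ]bounds\b",
    r"\bsecurity (fix|bug|issue)\b",
]

@dataclass
class Commit:
    message: str
    changed_functions: list

def label_commit(commit: Commit) -> int:
    """Return 1 if the commit message looks like a vulnerability fix, else 0."""
    return int(any(re.search(p, commit.message, re.IGNORECASE)
                   for p in VULN_FIX_PATTERNS))

commit = Commit("Fix heap buffer overflow in parser (CVE-2021-12345)",
                ["parse_header", "read_chunk"])
print(label_commit(commit))  # 1: pre-fix parse_header/read_chunk become vulnerable samples
```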
  • #7: Hence, our goal is to gather a deep understanding of the data quality of existing software vulnerability datasets. We aim to do this for two major reasons. Firstly, our findings will help inform and raise awareness of the importance of data quality for data-driven software security research, and the impacts that data quality issues can have. Secondly, by gathering deep knowledge of the nature of data quality issues, we can learn how to prevent and overcome them. Ensuring data quality is key to enabling reliable and effective solutions for AI-based software security.
  • #8: To achieve our aims, we conduct an empirical study using a simple three-step process.
  • #9: Firstly, we identify the data characteristics that we will examine. We use the ISO/IEC 25012 data quality standard to obtain 5 inherent data quality attributes: accuracy, uniqueness, consistency, completeness, and currentness. I’ll go over the definitions of these during the findings.  
  • #10: Secondly, we measure each of these attributes on the existing state-of-the-art datasets. We applied quality selection criteria to collect one dataset for each of the four labelling heuristics that we previously outlined. The four datasets are called Big-Vul, Devign, D2A, and the Juliet Test Suite.
  • #11: Thirdly, we validated the actual importance and relevance of each attribute for our use case of software vulnerability prediction. We took state-of-the-art prediction models and trained them on each of our datasets. Then we saw how the performance changed when we attempted to mitigate or remove the data quality issues observed. Due to the time constraints of this presentation, I’m only going to go over our findings for the first two data attributes, but our full findings are in the paper. Let’s get into it.
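A minimal sketch of that validation step, assuming scikit-learn-style models: train the same kind of model on the original data and on a copy with the data quality issue mitigated, then compare test-set F1. `model_factory` and `mitigate` are placeholders, not the paper's implementation.

```python
# Hedged sketch of the before/after comparison; `mitigate` stands in for whichever
# cleaning step targets the attribute under study (relabelling, deduplication, ...).
from sklearn.metrics import f1_score

def compare_with_mitigation(model_factory, mitigate, train, test):
    """Return (F1 trained on original data, F1 trained on mitigated data)."""
    scores = []
    for data in (train, mitigate(train)):
        model = model_factory()                       # fresh model for each run
        model.fit([code for code, _ in data], [label for _, label in data])
        preds = model.predict([code for code, _ in test])
        scores.append(f1_score([label for _, label in test], preds))
    return tuple(scores)
```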
  • #12: It’s an expectation that when we’re working with a dataset, the data labels are actually correct, and this is what the accuracy attribute measures. For vulnerability data we are essentially checking whether our collected vulnerabilities are actually vulnerabilities. To measure this, through some quite painstaking effort, we manually examined the labelling mechanisms that assigned the data points and verified each data point as correct or not. We found that some vulnerability datasets don’t actually do a very good job of containing vulnerabilities. The worst case is the tool-based dataset, in which only 28.6% of the data was accurate, as static analysis tools have very high false positive rates. More importantly though, these label inaccuracies have catastrophic consequences when we train models with this data. When we evaluated our models using our manually verified data points, the performance dropped significantly, by up to 80%. This is because the models are learning the wrong patterns in the training data. On the other hand, synthetic data is largely correct, as the vulnerabilities are specifically crafted for these purposes rather than collected post hoc.
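A minimal sketch of the two measurements in that note, assuming a manually verified sample is available as a list of records; the field names (`heuristic_label`, `verified_label`, `code`) and the `model` object are hypothetical, not the paper's artefacts.

```python
# Hedged sketch: measure label accuracy against manual verification, and re-score
# a trained model only on the verified ground truth.
from sklearn.metrics import f1_score

def label_accuracy(records):
    """Fraction of records whose heuristic label matches the manually verified one."""
    agree = sum(r["heuristic_label"] == r["verified_label"] for r in records)
    return agree / len(records)

def f1_on_verified(model, records):
    """F1 of an already-trained model, scored against verified labels only."""
    code = [r["code"] for r in records]
    y_true = [r["verified_label"] for r in records]
    return f1_score(y_true, model.predict(code))

# Usage (hypothetical): a large gap between the F1 measured on heuristic labels
# and f1_on_verified(model, sample) mirrors the drops reported on the slide.
```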
  • #13: Uniqueness is defined as the degree to which there is no duplication in records. Duplication in code datasets can actually be quite common: the same piece of code can get flagged multiple times or at different stages of development. The tool-based and synthetic datasets take this to the extreme, however. In the worst case, only 2.1% of the dataset contained unique values. Duplication can be a significant problem in machine learning due to data leakage. If the validation or test set that is used to guide the learning process contains samples that the model has already seen, it’s like we’re letting our model cheat on the test, and this wildly inflates the performance. We can see this in our experiments, where the model performance decreases after we remove duplicates. This is important, as we’re now getting a truer indication of our model performance.
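One way to act on that finding, as a hedged sketch rather than the paper's method: hash a lightly normalised version of each function and drop repeats before splitting the data, so identical samples cannot leak from the training set into the test set. The normalisation rules here are assumptions about what should count as a duplicate.

```python
# Hedged sketch: whitespace/comment normalisation plus hashing to deduplicate
# function-level samples before the train/validation/test split.
import hashlib
import re

def normalise(code: str) -> str:
    """Strip C-style comments and collapse whitespace so trivial variants hash alike."""
    code = re.sub(r"//.*?$|/\*.*?\*/", "", code, flags=re.DOTALL | re.MULTILINE)
    return re.sub(r"\s+", " ", code).strip()

def deduplicate(samples):
    """Keep the first occurrence of each normalised (code, label) sample."""
    seen, unique = set(), []
    for code, label in samples:
        digest = hashlib.sha256(normalise(code).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((code, label))
    return unique

data = [
    ("int f() { return 1; }", 0),
    ("int f() {  return 1;  } // duplicate with extra whitespace", 0),
    ("void g(char *s) { strcpy(buf, s); }", 1),
]
print(len(deduplicate(data)))  # 2: the near-duplicate is dropped
```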
  • #14: Looking at our findings as a whole, all the examined datasets exhibited issues across various data quality aspects. Other than the synthetic dataset, none of the labelling heuristics are able to produce particularly accurate labels, which means our models are just learning the wrong things. Furthermore, the larger datasets, the ones that don’t rely on reported vulnerabilities, have huge problems with duplication and consistency. Current state-of-the-art datasets are imperfect. What’s more, these issues can’t be ignored, as they have significant impacts on the tasks that rely on this data. To move towards the future, to enable data-driven intelligent methods for software security, we need to make these datasets better and overcome these challenges.