The University of Adelaide
Data Quality for Software Vulnerability Datasets
Centre of Research on Engineering Software Technologies (CREST - @crest_uofa)
School of Computer Science, The University of Adelaide, Australia
Cyber Security Cooperative Research Centre, Australia
The 45th International Conference on Software Engineering (ICSE ‘23)
May 17, 2023
Roland Croft
roland.croft@adelaide.edu.au
M. Ali Babar
ali.babar@adelaide.edu.au
Mehdi Kholoosi
mehdi.kholoosi@adelaide.edu.au
Growth of AI
AI is beginning to shape software development and software quality assurance.
Software Vulnerability Prediction
• Utilise AI to improve automation and effectiveness of vulnerability detection.
• Use knowledge from previous examples to automatically learn vulnerable patterns.
Previous known Vulnerabilities
Machine Learning
Prediction
Software Vulnerability Prediction
• Utilise AI to improve automation and effectiveness of vulnerability detection.
• Use knowledge from previous examples to automatically learn vulnerable patterns.
Previous known Vulnerabilities
Machine Learning
Prediction
Data is the core component of any data-driven pipeline: “Garbage In, Garbage Out”
Software Vulnerability Datasets
Weak Supervision
1. Vulnerability Reports
2. Development Commit Logs
3. Static Analysis Tools
4. Synthetic Data
Research Objective
Aim
To gain a deep understanding of the nature of data quality in software vulnerability datasets.
Outcomes
1. Inform the state of software vulnerability data quality and the reliability of downstream tasks.
2. Enable automated data cleaning frameworks to improve data quality and downstream tasks.
Research Design
Data Quality Attributes
1. Accuracy
2. Uniqueness
3. Consistency
4. Completeness
5. Currentness
Research Design
Labelling Heuristic and Selected Dataset:
Security: Big-Vul
Developer: Devign
Tool: D2A
Synthetic: Juliet Test Suite
Research Design
Inspect the change in model performance caused by attempting to reduce data quality issues.
Findings - Accuracy
“The degree to which the data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use.”
Manually inspected label correctness: Big-Vul 54.3%, Devign 80.0%, D2A 28.6%, Juliet 100%.
Lower performance when models are evaluated on true (manually verified) labels: drops of 29%, 50%, and 80%.
Findings - Uniqueness
“The degree to which there is no duplication in records.”
Proportion of unique records: Big-Vul 83.0%, Devign 89.9%, D2A 2.1%, Juliet 16.3%.
(Chart: model performance with and without duplicates for the Security, Developer, Tool, and Synthetic heuristics.) Removing duplicates lowers the measured performance by 13.9%, 81.7%, and 10.4%.
Key Takeaways
State-of-the-art software vulnerability datasets are imperfect.
Data quality significantly affects the performance of downstream software security models.
We need better cleaning methods or more robust models to enable reliable and effective data-driven software security.
Dataset data quality values:

Dataset   Accuracy  Uniqueness  Consistency  Completeness  Currentness
Big-Vul   0.543     0.830       0.999        0.824         0.761
Devign    0.800     0.899       0.991        0.944         0.811
D2A       0.286     0.021       0.531        0.981         0.844
Juliet    1.000     0.163       0.750        1.000         N/A


Editor's Notes

  • #2: Self-Introduction. I will be presenting our paper “Data Quality for Software Vulnerability Datasets.”
  • #3: Many of us have been witnessing the huge growth in AI over the last few years, and the software engineering community is no exception. Many organizations are beginning to harness the power of AI to provide intelligent tools that assist with software development and quality assurance. For instance, ChatGPT has blown away the world with its remarkable capabilities for programming and code comprehension. A properly trained model is powerful, and it allows us to effectively automate tasks that we’d otherwise find challenging or time-consuming.  
  • #4: Now in the software security domain, there are actually a lot of really hard, time-consuming tasks we’d love to automate. We’ll focus on software vulnerability detection. Vulnerabilities are security weaknesses in the code that can cause catastrophic consequences when exploited by attackers. The issue, however, is that they are hard to spot, and it can take developers years and years to review and test every single piece of code. This is where AI comes in. AI has shown much promise towards improving the automation and effectiveness of software vulnerability detection. The basic idea of these solutions is that we use historical records of vulnerability examples to train learning-based models that can automatically detect vulnerable patterns. This example here depicts a simple but dangerous buffer overflow, which we can show to our model, and after it works its magic it can theoretically spot the vulnerability in future.
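A minimal sketch of that training loop, not the models evaluated in the paper: real vulnerability prediction research uses deep code models, but a bag-of-tokens classifier over a few hypothetical C snippets shows the shape of the pipeline (labelled functions in, a learned detector out).

```python
# Hedged sketch: toy data and a linear model stand in for the deep models
# actually used in software vulnerability prediction research.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical (function body, label) pairs; 1 = vulnerable, 0 = not vulnerable.
train_code = [
    "char buf[8]; strcpy(buf, user_input);",                     # unbounded copy
    "char buf[8]; strncpy(buf, user_input, sizeof(buf) - 1);",   # bounded copy
    "int i = atoi(arg); return table[i];",                       # unchecked index
    "int i = atoi(arg); if (i >= 0 && i < N) return table[i];",  # checked index
]
train_labels = [1, 0, 1, 0]

# Bag-of-tokens features + logistic regression: learn token patterns that
# co-occur with the vulnerable label.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"\w+|\S"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_code, train_labels)

# Score an unseen function; a real system would rank every function in a project.
print(model.predict(["char dst[4]; strcpy(dst, argv[1]);"]))
```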
  • #5: Now as you may have guessed from the title, this talk isn’t actually going to be about this amazing little machine learning model here. No, it’s going to be about the data. Why? Because the data is actually rather important. A fundamental concept in computer science states that the quality of a system’s outputs is dictated by the quality of its inputs. This concept is beautifully summarized by the saying “garbage in, garbage out.” The data is important.
  • #6: So how do we get a nice, cleanly labelled vulnerability dataset? Well, this is actually extremely difficult. For traditional supervised learning problems, we might get some subject matter expert to hand-label the data. But we can’t really do this for vulnerability data as it’s extremely scarce and complex. We instead use weak supervision to obtain some higher-level indicators to produce our labels. I’ll go through each of the four main ways we can do this. Firstly, over the lifetime of a project, we naturally detect and report vulnerabilities through testing and use. For open source software, these reports are often documented in security advisories. We can attempt to trace the information contained in these reports back to the original code, and this gives us an idea of which code snippets were vulnerable. The second approach is very similar to the last one, but rather than going through a third-party vulnerability database, we can just look at the development history directly for commits describing vulnerability fixes. However, these two sources only provide label indicators for known vulnerabilities. This means we get very small datasets in practice. This is where our third approach comes in. What if we didn’t have to wait for a developer to spot a vulnerability in order to know where it is? Well, we can use some automatic tools to scan the code and tell us where the vulnerabilities are. Of course, this heavily relies on how reliable our tool is. Finally, to overcome these uncertainties, we can kind of cheat and simply make the data up. This is called synthetic data, where we automatically create examples of code that we know to be vulnerable or not vulnerable, using known patterns. Now, none of these data collection approaches are perfect, unfortunately. As each of these data sources uses relatively weak label indicators, they exhibit weaknesses and produce lower-quality datasets than traditional supervision. But despite the importance of the data, and the difficulties we have in repairing it, we’ve found data quality to actually be a rather ill-considered concept in software security, until now.
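As a concrete illustration of the second heuristic only: a minimal, hypothetical commit-message matcher in the spirit of "look at the development history for commits describing vulnerability fixes". The keyword patterns and the `Commit` structure are assumptions for the example, not the construction used by Devign or the other datasets.

```python
# Hypothetical weak-supervision heuristic: treat a commit as vulnerability-fixing
# if its message matches security-fix patterns, then label the pre-fix versions
# of the functions it changed as "vulnerable". Patterns are illustrative only.
import re
from dataclasses import dataclass

VULN_FIX_PATTERNS = [
    r"\bCVE-\d{4}-\d+\b",           # explicit CVE reference
    r"\bbuffer overflow\b",
    r"\buse[- ]after[- ]free\b",
    r"\bout[- ]of[- ]bounds\b",
    r"\bsecurity (fix|bug|issue)\b",
]

@dataclass
class Commit:
    message: str
    changed_functions: list

def label_commit(commit: Commit) -> int:
    """Return 1 if the commit message looks like a vulnerability fix, else 0."""
    return int(any(re.search(p, commit.message, re.IGNORECASE)
                   for p in VULN_FIX_PATTERNS))

commit = Commit("Fix heap buffer overflow in parser (CVE-2021-12345)",
                ["parse_header", "read_chunk"])
print(label_commit(commit))  # 1: pre-fix parse_header/read_chunk become vulnerable samples
```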
  • #7: Hence, our goal is to gather a deep understanding of the data quality of existing software vulnerability datasets. We aim to do this for two major reasons. Firstly, our findings will help inform and raise awareness of the importance of data quality for data-driven software security research, and the impacts that data quality issues can have. Secondly, by gathering deep knowledge of the nature of data quality issues, we can learn how to prevent and overcome them. Ensuring data quality is key to enabling reliable and effective solutions for AI-based software security.
  • #8: To achieve our aims, we conduct an empirical study using a simple three-step process.
  • #9: Firstly, we identify the data characteristics that we will examine. We use the ISO/IEC 25012 data quality standard to obtain 5 inherent data quality attributes: accuracy, uniqueness, consistency, completeness, and currentness. I’ll go over the definitions of these during the findings.  
  • #10: Secondly, we measure each of these attributes on the existing state-of-the-art datasets. We applied quality selection criteria to collect one dataset for each of the four labelling heuristics that we previously outlined. The four datasets are called Big-Vul, Devign, D2A, and the Juliet Test Suite.
  • #11: Thirdly, we validated the actual importance and relevance of each attribute for our use case of software vulnerability prediction. We took state-of-the-art prediction models and trained them on each of our datasets. Then we saw how the performance changed when we attempted to mitigate or remove the data quality issues observed. Due to the time constraints of this presentation, I’m only going to go over our findings for the first two data attributes, but our full findings are in the paper. Let’s get into it.
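A minimal sketch of that validation step, assuming scikit-learn-style models: train the same kind of model on the original data and on a copy with the data quality issue mitigated, then compare test-set F1. `model_factory` and `mitigate` are placeholders, not the paper's implementation.

```python
# Hedged sketch of the before/after comparison; `mitigate` stands in for whichever
# cleaning step targets the attribute under study (relabelling, deduplication, ...).
from sklearn.metrics import f1_score

def compare_with_mitigation(model_factory, mitigate, train, test):
    """Return (F1 trained on original data, F1 trained on mitigated data)."""
    scores = []
    for data in (train, mitigate(train)):
        model = model_factory()                       # fresh model for each run
        model.fit([code for code, _ in data], [label for _, label in data])
        preds = model.predict([code for code, _ in test])
        scores.append(f1_score([label for _, label in test], preds))
    return tuple(scores)
```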
  • #12: It’s an expectation that when we’re working with a dataset, the data labels are actually correct, and this is what the accuracy attribute measures. For vulnerability data we are essentially checking whether our collected vulnerabilities are actually vulnerabilities. To measure this, through some quite painstaking effort, we manually examined the labelling mechanisms that assigned the data points and verified each data point as correct or not. We found that some vulnerability datasets don’t actually do a very good job of containing vulnerabilities. The worst case is the tool-based dataset, in which only 28.6% of the data was accurate, as static analysis tools have very high false positive rates. More importantly though, these label inaccuracies have catastrophic consequences when we train models with this data. When we evaluated our models using our manually verified data points, the performance dropped significantly, by up to 80%. This is because the models are learning the wrong patterns in the training data. On the other hand, synthetic data is largely correct, as the vulnerabilities are specifically crafted for these purposes rather than collected post hoc.
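A minimal sketch of the two measurements in that note, assuming a manually verified sample is available as a list of records; the field names (`heuristic_label`, `verified_label`, `code`) and the `model` object are hypothetical, not the paper's artefacts.

```python
# Hedged sketch: measure label accuracy against manual verification, and re-score
# a trained model only on the verified ground truth.
from sklearn.metrics import f1_score

def label_accuracy(records):
    """Fraction of records whose heuristic label matches the manually verified one."""
    agree = sum(r["heuristic_label"] == r["verified_label"] for r in records)
    return agree / len(records)

def f1_on_verified(model, records):
    """F1 of an already-trained model, scored against verified labels only."""
    code = [r["code"] for r in records]
    y_true = [r["verified_label"] for r in records]
    return f1_score(y_true, model.predict(code))

# Usage (hypothetical): a large gap between the F1 measured on heuristic labels
# and f1_on_verified(model, sample) mirrors the drops reported on the slide.
```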
  • #13: Uniqueness is defined as the degree to which there is no duplication in records. Duplication in code datasets can actually be quite common: the same piece of code can get flagged multiple times or at different stages of development. The tool-based and synthetic datasets take this to the extreme, however. In the worst case, only 2.1% of the dataset contained unique values. Duplication can be a significant problem in machine learning due to data leakage. If the validation or test set that is used to guide the learning process contains samples that the model has already seen, it’s like we’re letting our model cheat on the test, and this wildly inflates the performance. We can see this in our experiments, where the model performance decreases after we remove duplicates. This is important, as we’re now getting a truer indication of our model performance.
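One way to act on that finding, as a hedged sketch rather than the paper's method: hash a lightly normalised version of each function and drop repeats before splitting the data, so identical samples cannot leak from the training set into the test set. The normalisation rules here are assumptions about what should count as a duplicate.

```python
# Hedged sketch: whitespace/comment normalisation plus hashing to deduplicate
# function-level samples before the train/validation/test split.
import hashlib
import re

def normalise(code: str) -> str:
    """Strip C-style comments and collapse whitespace so trivial variants hash alike."""
    code = re.sub(r"//.*?$|/\*.*?\*/", "", code, flags=re.DOTALL | re.MULTILINE)
    return re.sub(r"\s+", " ", code).strip()

def deduplicate(samples):
    """Keep the first occurrence of each normalised (code, label) sample."""
    seen, unique = set(), []
    for code, label in samples:
        digest = hashlib.sha256(normalise(code).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((code, label))
    return unique

data = [
    ("int f() { return 1; }", 0),
    ("int f() {  return 1;  } // duplicate with extra whitespace", 0),
    ("void g(char *s) { strcpy(buf, s); }", 1),
]
print(len(deduplicate(data)))  # 2: the near-duplicate is dropped
```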
  • #14: Looking at our findings as a whole, all the examined datasets exhibited issues across various data quality aspects. Other than the synthetic dataset, none of the labelling heuristics are able to produce particularly accurate labels, which means our models are just learning the wrong things. Furthermore, the larger datasets, the ones that don’t rely on reported vulnerabilities, have huge problems with duplication and consistency. Current state-of-the-art datasets are imperfect. What’s more, these issues can’t be ignored, as they have significant impacts on the tasks that rely on this data. To move towards the future, to enable data-driven intelligent methods for software security, we need to make these datasets better and overcome these challenges.