KENTA YAMAMOTO | TECHNICAL SUPPORT ENGINEER | 2018-03-27
Vulnerability Detection
Based on Git History
Agenda
Introduction
Introduction
Research
Background
The Trend of Security Incidents
Key facts. Why this research is important:
In Quantity
# of CVE reports: 1,020 (2000) → 14,643 (2017) [NVD]
In Quality
• A web application vulnerability at Equifax exposed the data of 143M
consumers (2017)
• A breach at Yahoo exposed the account information of 3B users (2013)
The Century of Vulnerability
# OF VULNERABILITIES
As information technology is broadly
adopted, the impact of security
incidents is becoming more extensive
and critical.
Introduction To Help Code Reviewers
We know how to deliver software of adequate quality: code review!
Best Practice is Well-Known
Review patches before release and fix bugs before deployment. Even so,
famous OSS projects still struggle with a shortage of code reviewers.
A Trade-Off of Automation Techniques
Software projects widely adopt a variety of automation approaches. Vulnerability
detection techniques face two competing requirements:
• (a) High precision. A tool is useless if it outputs billions of false positives.
• (b) Adaptability. No one wants to do extra work solely for security, such
as annotating unsafe user inputs.
Research
Background
# Example of taint annotation
int printf(/*@untainted@*/ char *fmt, ...);
Git is somewhat difficult.
No worries, it’s not only you!
WHAT’S GIT?
“(Git is) expressly designed
to make you feel
less intelligent than you
thought you were”
– Andrew Morton
The Greatness of Git -
www.linuxfoundation.org
Introduction
What’s Git?
But Git Always Stays With You
Trust me, or try this command in your terminal:
# Show which commands you run most often (git is likely near the top)
history | awk '{ print $2 }' | sort |
uniq -c | sort -rn | head
Introduction
What’s Git?
Git for Machine Learning
Git provides what machine learning requires; good data:
• Adopted by 69.2% of 30K developers [StackOverflow]
• Trusted by most prominent OSS projects such as Linux
Kernel, OpenSSL, FFmpeg, PostgreSQL, Chrome V8, and
Apache HTTPD.
Introduction
What’s Git?
CVE-ID and Security Fix on Git
Git history holds a sufficient number of reliable security fixes:
• Fix commits reference CVE-IDs in their commit messages
• Or the CVE database references the fix commits
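As a minimal sketch of the first case, the snippet below scans commit messages for CVE identifiers. The regular expression, the function name `extract_cve_ids`, and the sample messages (including the placeholder ID) are illustrative assumptions, not part of the original study.

```python
import re

# CVE identifiers have the form CVE-YYYY-NNNN, with four or more digits
# in the sequence part.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cve_ids(commit_message):
    """Return the sorted, de-duplicated CVE-IDs mentioned in a commit message."""
    return sorted({match.upper() for match in CVE_PATTERN.findall(commit_message)})


# Hypothetical commit messages with a placeholder CVE-ID, for illustration only.
messages = [
    "Fix buffer overflow in parser (CVE-2099-0001)",
    "Refactor logging; no functional change",
]
for message in messages:
    print(extract_cve_ids(message))
```

Inside a repository, `git log -i --grep='CVE-'` lists candidate commits whose messages could be fed to such a function.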
Introduction
What’s Git?
A Brief Introduction of Git Features
Agenda
Methodology
A static analysis to detect
suspicious vulnerabilities based
on Git history.
METHODOLOGY - HVD
Methodology Proposed Approach
Concept
• This research proposes an approach that aims to
reduce the false positive rate compared to VCCFinder
[Perl et al.] without sacrificing adaptability.
• The data source is the same as VCCFinder's, but this
approach distinguishes added lines from removed
lines in the patch features, while VCCFinder does not.
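The difference between the two feature sets can be sketched as follows. This is an illustration under assumptions, not the study's implementation: the token pattern, the `+`/`-` prefix scheme, and the function name `patch_tokens` are all hypothetical, and the sample patch is invented.

```python
import re
from collections import Counter

# Identifier-like tokens only; the real tokenizer is not described on the slides.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*")


def patch_tokens(patch_text, line_type_sensitive):
    """Count tokens on the changed lines of a unified diff.

    With line_type_sensitive=True, each token is prefixed with "+" or "-"
    according to the line it came from (the line type sensitive idea).
    Otherwise added and removed lines share one count (line type insensitive).
    """
    counts = Counter()
    for line in patch_text.splitlines():
        # Skip file headers ("--- a/f", "+++ b/f"), hunk headers, and context.
        # Simplification: a changed line whose content starts with "--" or
        # "++" would also be skipped.
        if line.startswith(("+++", "---")) or not line.startswith(("+", "-")):
            continue
        prefix = line[0] if line_type_sensitive else ""
        for token in TOKEN_PATTERN.findall(line[1:]):
            counts[prefix + token] += 1
    return counts


# Hypothetical patch, for illustration only.
PATCH = """\
--- a/net.c
+++ b/net.c
@@ -1,2 +1,2 @@
-buf = malloc(len);
+buf = vmalloc(len);
"""

print(patch_tokens(PATCH, line_type_sensitive=False))
print(patch_tokens(PATCH, line_type_sensitive=True))
```

In the insensitive variant, `buf` and `len` each count twice and the direction of the change is lost; in the sensitive variant, `-malloc` and `+vmalloc` become separate features.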
Methodology VCCFinder: a Novel Approach
Concept
Generally, it’s hard to apply machine learning to source code
because most high-level programming languages such as
C/C++ are less redundant than natural languages
and assembly languages. To address this difficulty, Perl et
al.:
• Narrowed the problem down to a quantifiable one.
The quality of source code can hardly be quantified, but
vulnerability can be expressed as 0 or 1.
• Leveraged existing assets: the CVE database and prominent
OSS projects.
“I really never wanted to do
source control
management at all and felt
that it was just about the
least interesting thing in
the computing world”
– Linus Torvalds
10 Years of Git -
www.linuxfoundation.org
Methodology Overall Architecture
Concept
Methodology Abbreviations
Terms
• HVD: History-based Vulnerability Detector
• VCC: Vulnerability-Contributing Commit(s). Changes
that contain a vulnerability
• UC: Unclassified Changes
• LT-S: Line type sensitive. The HVD approach
• LT-I: Line type insensitive. The replication of VCCFinder
Methodology Exploit vs Vulnerability
Terms
Potential
vulnerability
Vulnerability
Exploit
(malicious input)
Agenda
Evaluation
351,452
commits in total
Evaluation Dataset Provided by Perl et al.
Experiment
• This dataset contains commits labelled as VCC or UC and associated with
their CVE-IDs.
• It comprises 714 VCCs out of about 350k commits from 66 OSS repositories
implemented in C/C++.
• It contains 170k unique tokens.
• The compressed size is 525 MB (npz).
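The class imbalance implied by these figures can be worked out directly. The short calculation below uses only the numbers quoted on the slides; the variable names are arbitrary. It shows that VCCs make up roughly 0.2% of all commits, which is also the precision one would expect from flagging commits at random.

```python
# Figures quoted on the slides.
total_commits = 351_452
vcc_count = 714

# Share of commits that are VCCs, and unclassified commits per VCC.
vcc_rate = vcc_count / total_commits
uc_per_vcc = (total_commits - vcc_count) / vcc_count

print(f"VCC share of all commits: {vcc_rate:.4%}")
print(f"Unclassified commits per VCC: {uc_per_vcc:.0f}")
```

With positives this rare, a precision-recall curve is usually more informative than plain accuracy, since labelling every commit as UC would already be correct about 99.8% of the time.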
Evaluation Implementation in Python
Experiment
To make the experiment reliable, I used established libraries, including:
• NumPy
• SciPy
• scikit-learn
• unidiff
LT-I: note that reproducibility is limited because the source code of VCCFinder is not
publicly available.
Evaluation Environment Specs
Experiment
The computation was performed on one node of the CX250 Cluster (MPC):
• CPU: Intel Xeon E5-2680v2 2.80GHz (10-core) x2
• Memory: 64GB (4GB DDR3-1866 ECC x16)
Evaluation Precision Improvement
• LT-S improved the area under the precision-recall curve (AUC) by
18.8% over LT-I.
Precision
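For readers unfamiliar with the metric, the sketch below computes an area under the precision-recall curve in plain Python. It uses the step-wise "average precision" summary, which is one common way to measure this area; the slides do not say which variant the study used, and the labels and scores here are invented.

```python
def precision_recall_points(labels, scores):
    """Return (recall, precision) pairs, one per prediction in ranked order.

    labels are 1 (VCC) or 0 (UC); higher scores mean "more likely a VCC".
    Tied scores are not treated specially in this sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    points = []
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        points.append((true_positives / total_positives, true_positives / rank))
    return points


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    area = 0.0
    previous_recall = 0.0
    for recall, precision in precision_recall_points(labels, scores):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Invented labels and classifier scores, for illustration only.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
print(average_precision(labels, scores))
```

A classifier that ranks every VCC above every UC scores 1.0; ranking a UC above a VCC lowers the value.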
Evaluation Trade-off
• Execution time increased about 2.7x: (LT-I, LT-S) = (17m06s, 45m36s)
• Note: the vast majority of the processing time is spent in the learning phase.
In practical use, the trained model is dumped once and then reused for
predictions for some time. Parsing a given unknown commit and predicting with
the shared model then takes only a few seconds.
Hence, the execution time of the learning phase should not affect the
development process.
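The dump-and-reuse workflow described in the note can be sketched with Python's standard `pickle` module. The model here is a hypothetical stand-in (a plain dictionary of token weights with invented values); the slides do not show how the actual model was stored.

```python
import pickle


def score(model, token_counts):
    """Linear score for one commit; higher means more VCC-like."""
    return model["bias"] + sum(
        model["weights"].get(token, 0.0) * count
        for token, count in token_counts.items()
    )


# Learning phase (slow, run occasionally): train, then serialize the model.
# The weights below are invented for illustration.
trained = {"weights": {"+vmalloc": 1.5, "-malloc": -0.5}, "bias": -1.0}
blob = pickle.dumps(trained)  # in practice, written to a shared file

# Prediction phase (fast, run per commit): load the model and score a commit.
model = pickle.loads(blob)
print(score(model, {"+vmalloc": 1, "+len": 1}))
```

Because unpickling can execute arbitrary code, a shared model file should only be loaded from a trusted source.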
Precision
Evaluation The most contributing features
Effective Features
To gain deeper insight from the
experiment, this study also shows that
variables consisting of words related to
computer resources contributed most
significantly to the classification model.
For instance:
• (RAM) structors: memory allocation with
complex structures
• (RAM) vmalloc: virtual memory allocation
• (CPU) skbuf_head: a spin-lock of threads
• (network) tso: TCP Segmentation Offload
• (network) if_ether: a flag of Ethernet
availability
Evaluation Findings & insights
Effective Features
Findings:
• The most contributing tokens are relevant to computer resources such as CPU,
memory, and network.
• The figure also shows that most of the top contributing variables are added tokens.
Insights:
• These findings are not surprising: vulnerabilities are closely
correlated with the side effects of computer resource management and with
added code.
• However, it is worth verifying that an automatic detection approach agrees
with the experiential intuition of humans.
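One way to obtain such a ranking, assuming a linear model that assigns one weight per token, is to sort tokens by the absolute value of their weight. The sketch below illustrates that idea only; the function name `top_features`, the token names, and the weights are invented and are not results from the study.

```python
def top_features(weights, k=3):
    """Return the k (token, weight) pairs with the largest absolute weight."""
    return sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)[:k]


# Invented token names and weights, for illustration only.
# A "+" prefix marks a token from an added line, "-" from a removed line.
weights = {"+tok_alloc": 2.1, "-tok_free": -0.4, "+tok_lock": 1.3, "+tok_misc": 0.1}

for token, weight in top_features(weights, k=2):
    print(token, weight)
```

Reading the ranked tokens back against the source code is what lets a reviewer check the model's behaviour against their own intuition.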
Agenda
Conclusion
Despite the limited set of features obtainable from Git, this study shows that LT-S
improved the AUC of the precision-recall curve by 18.8% compared to LT-I without losing
the original advantages:
• (a) Scalability
• (b) Generality
• (c) Explainability
CONCLUSION
KENTA YAMAMOTO | TECHNICAL SUPPORT ENGINEER | @I05
Thank you!
Questions & discussion
