UNSUPERVISED MACHINE LEARNING FOR CLONE DETECTION
Valerio Maggio, Ph.D.
June 25, 2013
valerio.maggio@unina.it
General Disclaimer:
All the maths appearing in the next slides is only intended to better introduce the considered case studies. Speakers are not responsible for any possible disease or “brain consumption” caused by too many formulas.
So BEWARE; use this information at your own risk!
Its intention is solely educational. We would strongly encourage you to use this information in cooperation with a medical or health professional.
AwfulMaths
Number one in the stink parade is duplicated code.
If you see the same code structure in more than one
place, you can be sure that your program will be better
if you find a way to unify them.
[Figures: clone examples from JHotDraw (ImageMapOutputFormat.java and SVGOutputFormat.java), CPython 2.5.1, and Python (NLTK)]
PROBLEM STATEMENT: CLONE DETECTION
Software clones are fragments of code that are similar according
to some predefined measure of similarity
I.D. Baxter, 1998
Clones Textual Similarity
Clones Functional Similarity
Clones affect the reliability of the system!
Sneaky Bug!
DIFFERENT TYPES OF CLONES
THE ORIGINAL ONE
# Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)  # Stores only lines that end with "marker"
    return lines  # Return the list of collected lines
TYPE 1: Exact Copy
• Identical code segments except for differences in
layout, whitespace, and comments
def do_something_cool_in_Python (filepath, marker='---end---'):
    lines = list()  # This list is initially empty
    with open(filepath) as report:
        for l in report:  # It goes through the lines of the file
            if l.endswith(marker):
                lines.append(l)
    return lines
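A minimal sketch of how a Type 1 check could be automated, assuming both fragments are valid Python. Comments and layout never reach the tree built by the standard `ast` module, so fragments that differ only in those respects dump to the same string. The helper name `is_type1_clone` and the shortened function name `collect` are illustrative and do not come from the slides.

```python
import ast


def is_type1_clone(src_a: str, src_b: str) -> bool:
    """True when both fragments parse to the same syntax tree.

    ast.dump() omits line/column attributes by default, so only the
    code itself is compared, not comments, spacing or line breaks.
    """
    return ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b))


original = '''
def collect(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)  # Stores only lines that end with "marker"
    return lines
'''

# Same code: extra space before the parenthesis, different comments.
copy = '''
def collect (filepath, marker='---end---'):
    lines = list()  # This list is initially empty
    with open(filepath) as report:
        for l in report:  # It goes through the file
            if l.endswith(marker):
                lines.append(l)
    return lines
'''

# Renaming a variable is no longer a Type 1 difference.
renamed = copy.replace("lines", "targets")

print(is_type1_clone(original, copy))
print(is_type1_clone(original, renamed))
```

The first call reports a clone and the second does not, since a renamed identifier changes the tree.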
TYPE 2: Parameter Substituted
• Structurally identical segments except for differences in identifiers, literals,
layout, whitespace, and comments
# Type 2 Clone
def do_something_cool_in_Python(path, end='---end---'):
    targets = list()
    with open(path) as data_file:
        for t in data_file:
            if t.endswith(end):
                targets.append(t)  # Stores only lines that end with "end"
    # Return the list of collected lines
    return targets
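A hedged sketch of one way to test for Type 2 clones: replace every identifier and literal with a fixed placeholder, then compare the resulting trees. The class `_Normalizer` and the helper names are assumptions made for illustration. Mapping all identifiers to the same placeholder is a deliberate simplification, so this check does not verify that the renaming is consistent.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Overwrite identifiers and literals with fixed placeholders."""

    def visit_FunctionDef(self, node):
        node.name = "FUNC"
        self.generic_visit(node)  # also normalise arguments and body
        return node

    def visit_arg(self, node):
        node.arg = "ID"
        return node

    def visit_Name(self, node):
        node.id = "ID"
        return node

    def visit_Constant(self, node):
        node.value = "LIT"
        return node


def normalized_dump(source: str) -> str:
    return ast.dump(_Normalizer().visit(ast.parse(source)))


def is_type2_clone(src_a: str, src_b: str) -> bool:
    return normalized_dump(src_a) == normalized_dump(src_b)


original = '''
def do_something_cool_in_Python(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)
    return lines
'''

type2 = '''
def do_something_cool_in_Python(path, end='---end---'):
    targets = list()
    with open(path) as data_file:
        for t in data_file:
            if t.endswith(end):
                targets.append(t)
    return targets
'''

different = '''
def count_lines(path):
    with open(path) as data_file:
        return len(data_file.readlines())
'''

print(is_type2_clone(original, type2))
print(is_type2_clone(original, different))
```

Method names such as `endswith` and `append` are attributes, not `Name` nodes, so they are left untouched and still have to match.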
TYPE 3: Structure Substituted
• Similar segments with further modifications such as changed, added (or deleted) statements, in addition to variations in identifiers, literals, layout and comments
import os
def do_something_with(path, marker='---end---'):
    # Check if the input path corresponds to a file
    if not os.path.isfile(path):
        return None
    bad_ones = list()
    good_ones = list()
    with open(path) as report:
        for line in report:
            line = line.strip()
            if line.endswith(marker):
                good_ones.append(line)
            else:
                bad_ones.append(line)
    # Return the lists of different lines
    return good_ones, bad_ones
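Type 3 clones no longer match exactly, so a detector needs a graded similarity instead of an equality test. The sketch below is one simple possibility and not the method presented in these slides: it lists the AST node types of each fragment in depth-first order and compares the two sequences with `difflib.SequenceMatcher`. Any cut-off applied to the resulting ratio would be a tunable assumption.

```python
import ast
from difflib import SequenceMatcher


def node_types(node):
    """Yield AST node type names in depth-first pre-order."""
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from node_types(child)


def structural_similarity(src_a: str, src_b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two type sequences are identical."""
    seq_a = list(node_types(ast.parse(src_a)))
    seq_b = list(node_types(ast.parse(src_b)))
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


original = '''
def do_something_cool_in_Python(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)
    return lines
'''

type3 = '''
import os
def do_something_with(path, marker='---end---'):
    if not os.path.isfile(path):
        return None
    bad_ones = list()
    good_ones = list()
    with open(path) as report:
        for line in report:
            line = line.strip()
            if line.endswith(marker):
                good_ones.append(line)
            else:
                bad_ones.append(line)
    return good_ones, bad_ones
'''

print(structural_similarity(original, original))
print(structural_similarity(original, type3))
```

Identifiers and literals play no role here because only node types are compared, which is what lets added or removed statements lower the score gradually.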
TYPE 4: “Functional” Copies
• Semantically equivalent segments that perform the same
computation but are implemented by different syntactic variants
def do_always_the_same_stuff(filepath, marker='---end---'):
    report = open(filepath)
    file_lines = report.readlines()
    report.close()
    # Filters only the lines ending with marker
    # (list() is needed on Python 3, where filter() returns a lazy iterator)
    return list(filter(lambda l: len(l) and l.endswith(marker), file_lines))
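Because Type 4 clones share behaviour, not syntax, a syntactic comparison cannot find them. A rough way to illustrate the idea is to run both variants on the same inputs and compare the results, as in the sketch below. The names `loop_version`, `filter_version` and `agree_on` are illustrative, and agreement on a few samples is only evidence of equivalence, not a proof.

```python
import os
import tempfile


def loop_version(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)
    return lines


def filter_version(filepath, marker='---end---'):
    report = open(filepath)
    file_lines = report.readlines()
    report.close()
    return list(filter(lambda l: len(l) and l.endswith(marker), file_lines))


def agree_on(contents, marker):
    """Run both variants on a temporary file holding `contents`."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write(contents)
    try:
        return loop_version(handle.name, marker) == filter_version(handle.name, marker)
    finally:
        os.remove(handle.name)


samples = ["", "no marker here\n", "a---end---\nb\nc---end---\n", "x---end---"]
# Lines read from a file keep their newline, so it is part of the marker here.
print(all(agree_on(text, "---end---\n") for text in samples))
```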
[Figures: Type 1, Type 2 and Type 3 clone examples from HTTPD 2.2.14]
SOURCE CODE INFORMATION
[Figure: syntax tree of the function parser_compare(node *left, node *right), with PARAMS, BODY, IF-STMT, COND, CALL-STMT (parser_compare_node), STRUCT-OP and RETURN-STMT nodes]
[Figure: dependence graph with ENTRY, EXIT, FORMAL-IN, FORMAL-OUT, ACTUAL-IN, ACTUAL-OUT, BODY, CONTROL-POINT, EXPR, CALL-SITE and RETURN nodes]
STATE OF THE ART TOOLS
Duplix, Scorpio, PMD, CCFinder, Dup, CPD, Shinobi, Clone Detective, Gemini, iClones, KClone, ConQAT, Deckard, Clone Digger, JCCD, CloneDr, SimScan, CLICS, NiCAD, Simian, Duploc, Dude, SDD
Text Based Tools:
Text is compared line by line
Token Based Tools:
Token sequences are compared to each other
Syntax Based Tools:
Syntax subtrees are compared
to each other
Graph Based Tools:
(Sub)graphs are compared to each other
STATE OF THE ART TECHNIQUES
• String/Token based Techniques:
  • Pros: Run very fast
  • Cons: Too many false clones
• Syntax based (AST) Techniques:
  • Pros: Well suited to detect structural similarities
  • Cons: Not properly suited to detect Type 3 Clones
• Graph based Techniques:
  • Pros: The only ones able to deal with Type 4 Clones
  • Cons: Performance Issues
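The speed versus false-clone trade-off of token based techniques can be seen in a small sketch. It uses the standard `tokenize` module to turn a fragment into a normalised token sequence, with identifiers mapped to `ID` and literals to `LIT` (both placeholders are assumptions of this example). Comparing such sequences is cheap, but unrelated statements with the same shape become indistinguishable.

```python
import io
import keyword
import tokenize

LAYOUT = (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE, tokenize.ENDMARKER)


def normalized_tokens(source: str) -> list:
    """Token sequence with comments dropped and names/literals abstracted."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL):
            continue  # comments and blank lines carry no code
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        elif tok.type in LAYOUT:
            out.append(tokenize.tok_name[tok.type])  # ignore indentation width
        else:
            out.append(tok.string)  # keywords and operators are kept as-is
    return out


a = normalized_tokens("total = price + 1\n")
b = normalized_tokens("name = path + 'x'  # unrelated statement\n")
c = normalized_tokens("total = price - 1\n")

print(a == b)  # same token shape, although the statements are unrelated
print(a == c)  # a different operator breaks the match
```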
USE MACHINE LEARNING, LUKE
• Provides computationally effective solutions to analyze large data sets
• Provides solutions that can be tailored to different tasks/domains
• Requires considerable effort in:
  • the definition of the relevant information best suited for the specific task/domain
  • the application of the learning algorithms to the considered data
UNSUPERVISED LEARNING
• Supervised Learning:
  • Learn from labelled samples
• Unsupervised Learning:
  • Learn (directly) from the data
Learn by examples
(+) No cost of labeling samples
(-) Trade-off imposed on the quality of the data
CODE STRUCTURES: KERNELS FOR STRUCTURES
Computation of the dot product between (Graph) Structures: K(·, ·)
Abstract Syntax Tree (AST)
Tree structure representing the syntactic structure of the different instructions of a program (function)
Program Dependencies Graph (PDG)
(Directed) Graph structure representing the relationship among the different statements of a program
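The "dot product between structures" idea can be made concrete with a deliberately simple kernel. The sketch below is an assumption-laden toy and not the kernel used in this work: each AST is mapped to a vector of node-type counts, and K is the dot product of two such vectors. Dividing by the self-similarities gives a cosine-style score that never exceeds 1.

```python
import ast
import math
from collections import Counter


def node_type_counts(source: str) -> Counter:
    """Feature vector: how often each AST node type occurs."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))


def kernel(src_a: str, src_b: str) -> int:
    """Dot product of the two node-type count vectors."""
    a, b = node_type_counts(src_a), node_type_counts(src_b)
    return sum(a[key] * b[key] for key in a.keys() & b.keys())


def normalized_kernel(src_a: str, src_b: str) -> float:
    """Cosine-style normalisation of the kernel value."""
    return kernel(src_a, src_b) / math.sqrt(kernel(src_a, src_a) * kernel(src_b, src_b))


loop_a = "while x < y:\n    x = x + 1\n    y = y - 1\n"
loop_b = "while b > a:\n    a = a + 1\n    b = b - 1\n"
other = "if b > 0:\n    c = 3\n"

print(normalized_kernel(loop_a, loop_b))
print(normalized_kernel(loop_a, other))
```

A bag of node types ignores how nodes are arranged, which is why kernels that compare subtrees or subgraphs are more informative for clone detection.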
KERNEL FOR CLONES
[Figure, in three panels: CODE, AST, AST KERNEL. The panels show the syntax trees of the compared code fragments, made of while, if and block nodes with comparison, assignment and arithmetic expressions over x, y, a, b and c, and the subtrees that the AST kernel compares.]
KERNELS FOR CODE STRUCTURES: AST
KERNEL FEATURES
• Instruction Class (IC): e.g., LOOP, CALL, CONDITIONAL_STATEMENT
• Instruction (I): e.g., FOR, IF, WHILE, RETURN
• Context (C): i.e., Instruction Class of the closest statement node
• Lexemes (Ls): Lexical information gathered (recursively) from leaves
Example, for the subtree of a while node with children < (over x and y) and block:
• while node: IC = Loop, I = while-loop, C = Function-Body, Ls = [x, y]
• < node: IC = Conditional-Expr, I = Less-operator, C = Loop, Ls = [x, y]
• block node: IC = Block, I = while-body, C = Loop, Ls = [x]
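The four features can be approximated on Python's own AST, as a rough sketch only. The mapping `INSTRUCTION_CLASS`, the label `FUNCTION_BODY` for the outermost context, and the choice to collect only identifiers as lexemes are all assumptions of this example and do not describe the original tool.

```python
import ast

# Hypothetical mapping from Python node types to instruction classes.
INSTRUCTION_CLASS = {
    ast.For: "LOOP",
    ast.While: "LOOP",
    ast.If: "CONDITIONAL_STATEMENT",
    ast.Call: "CALL",
    ast.Compare: "CONDITIONAL_EXPR",
}


def lexemes(node):
    """Identifiers gathered recursively from the leaves below `node`."""
    return sorted({n.id for n in ast.walk(node) if isinstance(n, ast.Name)})


def annotate(node, context="FUNCTION_BODY", out=None):
    """Collect (IC, I, C, Ls) for every node that has an instruction class."""
    out = [] if out is None else out
    ic = INSTRUCTION_CLASS.get(type(node))
    if ic is not None:
        out.append({
            "IC": ic,                         # instruction class
            "I": type(node).__name__.upper(), # concrete instruction
            "C": context,                     # class of the enclosing node
            "Ls": lexemes(node),              # lexical information
        })
        context = ic  # children see this node as their context
    for child in ast.iter_child_nodes(node):
        annotate(child, context, out)
    return out


features = annotate(ast.parse("while x < y:\n    x = x + 1\n    y = y - 1\n"))
for feature in features:
    print(feature)
```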
CLONE DETECTION RESULTS
• Comparison with another (pure) AST-based clone detector
• Comparison on a system with randomly seeded clones
[Figure: Precision, Recall and F-measure (scale 0 to 1) for CloneDigger and the Tree Kernel Tool]
Results refer to clones where code fragments have been modified by adding/removing or changing code statements
[Figure: "Precision, Recall and F-Measure": Precision, Recall and F1 (scale 0 to 1) plotted against values from 0.6 to 0.98 in steps of 0.02]
Precision: How accurate are the obtained results?
(Altern.) How many errors do they contain?
Recall: How complete are the obtained results?
(Altern.) How many clones have been retrieved w.r.t. Total Clones?
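The two questions above correspond to the usual definitions, sketched here for sets of clone pairs. The function name and the sample pairs are made up for illustration and are not results from the evaluation. Pairs are stored as `frozenset` so that (A, B) and (B, A) count as the same clone.

```python
def precision_recall_f1(detected, actual):
    """Precision, recall and F1 for sets of reported and true clone pairs."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def pair(a, b):
    return frozenset((a, b))


# Hypothetical example: 3 reported pairs, 4 true pairs, 2 in common.
detected = {pair("A", "B"), pair("A", "C"), pair("D", "E")}
actual = {pair("B", "A"), pair("D", "E"), pair("F", "G"), pair("H", "I")}

precision, recall, f1 = precision_recall_f1(detected, actual)
print(precision, recall, f1)
```

Here two of the three reported pairs are correct (precision 2/3) and two of the four true pairs are found (recall 1/2), giving an F1 of 4/7.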
CODE STRUCTURES: PDG
NODES AND EDGES
• Two Types of Nodes
  • Control Nodes (Dashed ones)
    • e.g., if - for - while - function calls...
  • Data Nodes
    • e.g., expressions - parameters...
• Two Types of Edges (i.e., dependencies)
  • Control edges (Dashed ones)
  • Data edges
[Figure: example PDG with while, call-site, arg and expr nodes]
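A small data-structure sketch of a graph with two node types and two edge types, as described above. All class and method names are hypothetical, and the edges in the example are invented for illustration since the wiring of the figure is not recoverable from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    CONTROL = "control"
    DATA = "data"


class EdgeType(Enum):
    CONTROL = "control"
    DATA = "data"


@dataclass(frozen=True)
class Node:
    ident: int
    label: str
    node_type: NodeType


@dataclass
class PDG:
    nodes: dict = field(default_factory=dict)  # ident -> Node
    edges: list = field(default_factory=list)  # (source, target, EdgeType)

    def add_node(self, ident, label, node_type):
        self.nodes[ident] = Node(ident, label, node_type)

    def add_edge(self, source, target, edge_type):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source, target, edge_type))

    def successors(self, ident, edge_type=None):
        """Nodes reachable in one step, optionally along one edge type only."""
        return [self.nodes[target] for source, target, kind in self.edges
                if source == ident and (edge_type is None or kind is edge_type)]


pdg = PDG()
pdg.add_node(0, "WHILE", NodeType.CONTROL)
pdg.add_node(1, "CALL-SITE", NodeType.CONTROL)
pdg.add_node(2, "ARG", NodeType.DATA)
pdg.add_node(3, "EXPR", NodeType.DATA)
pdg.add_edge(0, 1, EdgeType.CONTROL)  # the loop controls the call
pdg.add_edge(3, 2, EdgeType.DATA)     # the expression feeds the argument
pdg.add_edge(2, 1, EdgeType.DATA)     # the argument flows into the call

print([node.label for node in pdg.successors(0)])
```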
KERNELS FOR CODE STRUCTURES: PDG
GRAPH KERNELS FOR PDG
• Features of nodes:
  • Node Label
    • e.g., WHILE, CALL-SITE, EXPR, ...
  • Node Type
    • i.e., Data Node or Control Node
• Features of edges:
  • Edge Type
    • i.e., Data Edge or Control Edge
Example: Node Label = WHILE, Node Type = Control Node
[Figure: example PDG with while, call-site, arg and expr nodes, connected by Control Edges and Data Edges]
GRAPH KERNELS FOR PDG
• Goal: Identify common subgraphs
• Selectors: Compare nodes to each other and explore the subgraphs of only “compatible” nodes (i.e., Nodes of the same type)
• Context: The subgraph of a node (with paths whose lengths are at most L to avoid loops)
[Figure: two PDGs under comparison, built from while, call-site, arg and expr nodes]
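The three ingredients above can be combined into a small recursive similarity, shown here as a sketch under stated assumptions and not as the kernel actually used. Nodes are compatible when they have the same type, a pair scores 1 when the labels also match, and the exploration follows edges of the same type for at most `depth` steps, which is what keeps it from looping on cyclic graphs. The dictionary layout of the graphs is hypothetical.

```python
def node_kernel(g1, n1, g2, n2, depth):
    """Similarity of the contexts of n1 and n2, explored up to `depth` hops."""
    label1, type1 = g1["nodes"][n1]
    label2, type2 = g2["nodes"][n2]
    if type1 != type2:  # selector: only compatible nodes are explored
        return 0
    score = 1 if label1 == label2 else 0
    if depth == 0:      # bounded context: stop after `depth` edges
        return score
    for edge1, child1 in g1["edges"].get(n1, []):
        for edge2, child2 in g2["edges"].get(n2, []):
            if edge1 == edge2:  # follow only edges of the same type
                score += node_kernel(g1, child1, g2, child2, depth - 1)
    return score


def graph_kernel(g1, g2, depth):
    """Sum of the node similarities over all pairs of nodes."""
    return sum(node_kernel(g1, a, g2, b, depth)
               for a in g1["nodes"] for b in g2["nodes"])


# nodes: ident -> (label, node type); edges: ident -> [(edge type, target)]
graph = {
    "nodes": {0: ("WHILE", "control"), 1: ("CALL-SITE", "control"), 2: ("EXPR", "data")},
    "edges": {0: [("control", 1), ("data", 2)]},
}
cyclic = {
    "nodes": {0: ("WHILE", "control"), 1: ("CALL-SITE", "control")},
    "edges": {0: [("control", 1)], 1: [("control", 0)]},
}

print(graph_kernel(graph, graph, 0))
print(graph_kernel(graph, graph, 1))
print(graph_kernel(cyclic, cyclic, 4))  # terminates despite the cycle
```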
SCENARIO-BASED EVALUATION
FUTURE RESEARCH DIRECTIONS
PROBLEM STATEMENT: (MODEL) CLONE DETECTION
Models: models are typically represented visually, as box-and-arrow diagrams,
and the clones we are searching for are similar subgraphs of these diagrams.
Model Granularity: models could be represented at different levels of granularity (as is the case for source code), corresponding to different syntactic (and semantic) units.
Model clones are categorized into (three) different types
REFERENCE EXAMPLE
TYPE 1 CLONES: (MODEL) CLONE DETECTION
• Type 1 (exact) model clones: Identical model fragments except for
variations in visual presentation, layout and formatting.
TYPE 2 CLONES: (MODEL) CLONE DETECTION
Type 2 (renamed) model clones: Structurally identical model fragments except
for variations in labels, values, types, visual presentation, layout and formatting.
model@Friction Mode Logic/Break Apart Detection
model@Friction Mode Logic/Lockup Detection/Required Friction for Lockup
TYPE 3 CLONES: (MODEL) CLONE DETECTION
Type 3 (near-miss) model clones: Model fragments with further modifications,
such as changes in position or connection with respect to other model fragments
and small additions or removals of blocks or lines in addition to variations in labels,
values, types, visual presentation, layout and formatting.
model@Speed.speed_estimation
model@Throttle.throttle_estimation
MODELS AS SOURCE CODE
THANK YOU
Valerio Maggio
Ph.D., University of Naples “Federico II”
valerio.maggio@unina.it

More Related Content

PDF
Clone detection in Python
PPTX
Clonedigger-Python
PDF
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
PDF
File Handling in C Programming
PPTX
C language updated
PPTX
Programming in C
PDF
C programming language
PPTX
C Language (All Concept)
Clone detection in Python
Clonedigger-Python
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
File Handling in C Programming
C language updated
Programming in C
C programming language
C Language (All Concept)

What's hot (19)

PDF
Something About Dynamic Linking
PPT
7.0 files and c input
PPT
Phyton Learning extracts
PPTX
Yacc (yet another compiler compiler)
DOCX
PDF
Advanced C Language for Engineering
ODP
OpenGurukul : Language : C Programming
PDF
C Programming Project
PDF
Let’s Learn Python An introduction to Python
DOCX
Report on c and c++
PDF
Notes part 8
PPTX
Managing input and output operation in c
PPTX
C LANGUAGE - BESTECH SOLUTIONS
PDF
Hands-on Introduction to the C Programming Language
PPTX
Python Programming Basics for begginners
PDF
Embedded C - Lecture 2
DOC
'C' language notes (a.p)
PDF
Programming languages
PDF
Introduction to C Language - Version 1.0 by Mark John Lado
Something About Dynamic Linking
7.0 files and c input
Phyton Learning extracts
Yacc (yet another compiler compiler)
Advanced C Language for Engineering
OpenGurukul : Language : C Programming
C Programming Project
Let’s Learn Python An introduction to Python
Report on c and c++
Notes part 8
Managing input and output operation in c
C LANGUAGE - BESTECH SOLUTIONS
Hands-on Introduction to the C Programming Language
Python Programming Basics for begginners
Embedded C - Lecture 2
'C' language notes (a.p)
Programming languages
Introduction to C Language - Version 1.0 by Mark John Lado
Ad

Viewers also liked (11)

PDF
네이티브 웹앱 기술 동향 및 전망
PDF
Improving Software Maintenance using Unsupervised Machine Learning techniques
PPT
PPT
Principles in Refactoring
PDF
Refactoring: Improve the design of existing code
PDF
Unit testing with Junit
PPSX
PPT
Refactoring Tips by Martin Fowler
PDF
Refactoring 101
PPTX
영화 예매 프로그램 (DB 설계, 프로그램 연동)
 
PPTX
딥러닝을 이용한 자연어처리의 연구동향
네이티브 웹앱 기술 동향 및 전망
Improving Software Maintenance using Unsupervised Machine Learning techniques
Principles in Refactoring
Refactoring: Improve the design of existing code
Unit testing with Junit
Refactoring Tips by Martin Fowler
Refactoring 101
영화 예매 프로그램 (DB 설계, 프로그램 연동)
 
딥러닝을 이용한 자연어처리의 연구동향
Ad

Similar to Unsupervised Machine Learning for clone detection (20)

PDF
RubyConf Portugal 2014 - Why ruby must go!
PDF
Theperlreview
PDF
Presentation pythonpppppppppppppppppppppppppppppppppyyyyyyyyyyyyyyyyyyytttttt...
PPTX
compiler design syntax analysis top down parsing
PPTX
Python language data types
PPTX
Python language data types
PPTX
Python language data types
PPTX
Python language data types
PPTX
Python language data types
PPTX
Python language data types
PPTX
Python language data types
PDF
CS6660-COMPILER DESIGN-1368874055-CD NOTES-55-150.pdf
PDF
Syntax analysis
PPTX
Licão 07 operating the shell
PDF
Clojure: Simple By Design
PPT
Shell Scripts
PPTX
Ruby -the wheel Technology
PDF
"Objects validation and comparison using runtime types (io-ts)", Oleksandr Suhak
PDF
Documenting with xcode
PPTX
Introduction of bison
RubyConf Portugal 2014 - Why ruby must go!
Theperlreview
Presentation pythonpppppppppppppppppppppppppppppppppyyyyyyyyyyyyyyyyyyytttttt...
compiler design syntax analysis top down parsing
Python language data types
Python language data types
Python language data types
Python language data types
Python language data types
Python language data types
Python language data types
CS6660-COMPILER DESIGN-1368874055-CD NOTES-55-150.pdf
Syntax analysis
Licão 07 operating the shell
Clojure: Simple By Design
Shell Scripts
Ruby -the wheel Technology
"Objects validation and comparison using runtime types (io-ts)", Oleksandr Suhak
Documenting with xcode
Introduction of bison

More from Valerio Maggio (10)

PDF
Number Crunching in Python
PDF
Machine Learning for Software Maintainability
PDF
LINSEN an efficient approach to split identifiers and expand abbreviations
PDF
A Tree Kernel based approach for clone detection
PDF
Scaffolding with JMock
PDF
Junit in action
PDF
Design patterns and Refactoring
PDF
Test Driven Development
PDF
Unit testing and scaffolding
PDF
Web frameworks
Number Crunching in Python
Machine Learning for Software Maintainability
LINSEN an efficient approach to split identifiers and expand abbreviations
A Tree Kernel based approach for clone detection
Scaffolding with JMock
Junit in action
Design patterns and Refactoring
Test Driven Development
Unit testing and scaffolding
Web frameworks

Recently uploaded (20)

PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Approach and Philosophy of On baking technology
PDF
KodekX | Application Modernization Development
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
MYSQL Presentation for SQL database connectivity
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Digital-Transformation-Roadmap-for-Companies.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
MIND Revenue Release Quarter 2 2025 Press Release
Programs and apps: productivity, graphics, security and other tools
NewMind AI Weekly Chronicles - August'25 Week I
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Unlocking AI with Model Context Protocol (MCP)
20250228 LYD VKU AI Blended-Learning.pptx
Approach and Philosophy of On baking technology
KodekX | Application Modernization Development
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Mobile App Security Testing_ A Comprehensive Guide.pdf
Understanding_Digital_Forensics_Presentation.pptx
Dropbox Q2 2025 Financial Results & Investor Presentation
MYSQL Presentation for SQL database connectivity
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Advanced methodologies resolving dimensionality complications for autism neur...
sap open course for s4hana steps from ECC to s4
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy

Unsupervised Machine Learning for clone detection

  • 1. UNSUPERVISED MACHINE LEARNING FOR CLONE DETECTION Valerio Maggio, Ph.D. June 25, 2013 valerio.maggio@unina.it
  • 2. General Disclaimer: All the Maths appearing in the next slides is only intended to better introduce the considered case studies. Speakers are not responsible for any possible disease or “brain consumption” caused by too much formulas. So BEWARE; use this information at your own risk! It's intention is solely educational. We would strongly encourage you to use this information in cooperation with a medical or health professional. AwfulMaths
  • 3. Number one in the stink parade is duplicated code. If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them.
  • 7. PROBL EM S T A T E M E N T CLONE DETECTION Software clones are fragments of code that are similar according to some predefined measure of similarity I.D. Baxter, 1998
  • 8. PROBL EM S T A T E M E N T CLONE DETECTION
  • 9. PROBL EM S T A T E M E N T CLONE DETECTION Clones Textual Similarity
  • 10. PROBL EM S T A T E M E N T CLONE DETECTION Clones Functional Similarity
  • 11. PROBL EM S T A T E M E N T CLONE DETECTION Clones affect the reliability of the system! Sneaky Bug!
  • 13. THE ORIGINAL ONE # Original Fragment def do_something_cool_in_Python(filepath, marker='---end---'): ! lines = list() ! with open(filepath) as report: ! ! for l in report: ! ! ! if l.endswith(marker): ! ! ! ! lines.append(l) # Stores only lines that ends with "marker" ! return lines #Return the list of different lines
  • 14. TYPE 1: Exact Copy • Identical code segments except for differences in layout, whitespace, and comments
  • 15. def do_something_cool_in_Python (filepath, marker='---end---'): ! lines = list() # This list is initially empty ! with open(filepath) as report: ! ! for l in report: # It goes through the lines of the file ! ! ! if l.endswith(marker): ! ! ! ! lines.append(l) ! return lines TYPE 1: Exact Copy • Identical code segments except for differences in layout, whitespace, and comments # Original Fragment def do_something_cool_in_Python(filepath, marker='---end---'): ! lines = list() ! with open(filepath) as report: ! ! for l in report: ! ! ! if l.endswith(marker): ! ! ! ! lines.append(l) # Stores only lines that ends with "marker" ! return lines #Return the list of different lines
  • 16. TYPE 2: Parameter Substituted • Structurally identical segments except for differences in identifiers, literals, layout, whitespace, and comments
  • 17. # Type 2 Clone def do_something_cool_in_Python(path, end='---end---'): ! targets = list() ! with open(path) as data_file: ! ! for t in datae: ! ! ! if l.endswith(end): ! ! ! ! targets.append(t) # Stores only lines that ends with "marker" ! #Return the list of different lines ! return targets # Original Fragment def do_something_cool_in_Python(filepath, marker='---end---'): ! lines = list() ! with open(filepath) as report: ! ! for l in report: ! ! ! if l.endswith(marker): ! ! ! ! lines.append(l) # Stores only lines that ends with "marker" ! return lines #Return the list of different lines TYPE 2: Parameter Substituted • Structurally identical segments except for differences in identifiers, literals, layout, whitespace, and comments
  • 18. TYPE 3: Structure Substituted • Similar segments with further modifications such as changed, added (or deleted) statements, in additions to variations in identifiers, literals, layout and comments
  • 19. import os def do_something_with(path, marker='---end---'): ! # Check if the input path corresponds to a file ! if not os.path.isfile(path): ! ! return None ! bad_ones = list() ! good_ones = list() ! with open(path) as report: ! ! for line in report: ! ! ! line = line.strip() ! ! ! if line.endswith(marker): ! ! ! ! good_ones.append(line) ! ! ! else: ! ! ! ! bad_ones.append(line) ! #Return the lists of different lines ! return good_ones, bad_ones TYPE 3: Structure Substituted • Similar segments with further modifications such as changed, added (or deleted) statements, in additions to variations in identifiers, literals, layout and comments
  • 20. import os def do_something_with(path, marker='---end---'): ! # Check if the input path corresponds to a file ! if not os.path.isfile(path): ! ! return None ! bad_ones = list() ! good_ones = list() ! with open(path) as report: ! ! for line in report: ! ! ! line = line.strip() ! ! ! if line.endswith(marker): ! ! ! ! good_ones.append(line) ! ! ! else: ! ! ! ! bad_ones.append(line) ! #Return the lists of different lines ! return good_ones, bad_ones TYPE 3: Structure Substituted • Similar segments with further modifications such as changed, added (or deleted) statements, in additions to variations in identifiers, literals, layout and comments
  • 21. import os def do_something_with(path, marker='---end---'): ! # Check if the input path corresponds to a file ! if not os.path.isfile(path): ! ! return None ! bad_ones = list() ! good_ones = list() ! with open(path) as report: ! ! for line in report: ! ! ! line = line.strip() ! ! ! if line.endswith(marker): ! ! ! ! good_ones.append(line) ! ! ! else: ! ! ! ! bad_ones.append(line) ! #Return the lists of different lines ! return good_ones, bad_ones TYPE 3: Structure Substituted • Similar segments with further modifications such as changed, added (or deleted) statements, in additions to variations in identifiers, literals, layout and comments
  • 22. import os def do_something_with(path, marker='---end---'): ! # Check if the input path corresponds to a file ! if not os.path.isfile(path): ! ! return None ! bad_ones = list() ! good_ones = list() ! with open(path) as report: ! ! for line in report: ! ! ! line = line.strip() ! ! ! if line.endswith(marker): ! ! ! ! good_ones.append(line) ! ! ! else: ! ! ! ! bad_ones.append(line) ! #Return the lists of different lines ! return good_ones, bad_ones TYPE 3: Structure Substituted • Similar segments with further modifications such as changed, added (or deleted) statements, in additions to variations in identifiers, literals, layout and comments
  • 23. TYPE 4: “Functional” Copies • Semantically equivalent segments that perform the same computation but are implemented by different syntactic variants
  • 24. # Original Fragment def do_something_cool_in_Python(filepath, marker='---end---'): ! lines = list() ! with open(filepath) as report: ! ! for l in report: ! ! ! if l.endswith(marker): ! ! ! ! lines.append(l) # Stores only lines that ends with "marker" ! return lines #Return the list of different lines def do_always_the_same_stuff(filepath, marker='---end---'): ! report = open(filepath) ! file_lines = report.readlines() ! report.close() ! #Filters only the lines ending with marker ! return filter(lambda l: len(l) and l.endswith(marker), file_lines) TYPE 4: “Functional” Copies • Semantically equivalent segments that perform the same computation but are implemented by different syntactic variants
  • 31. SOURCE CODE INFORMATION [Figure: abstract syntax tree of the function parser_compare, with PARAMS (node *left, node *right), a BODY containing two IF-STMT nodes with their COND and BODY subtrees, a CALL-STMT to parser_compare_node, and RETURN-STMT nodes]
  • 32. SOURCE CODE INFORMATION [Figure: graph whose nodes are labelled ENTRY, EXIT, FORMAL-IN, FORMAL-OUT, ACTUAL-IN, ACTUAL-OUT, BODY, CONTROL-POINT, CALL-SITE, EXPR, and RETURN]
  • 40. STATE OF THE ART TECHNIQUES
      • String/Token based Techniques:
          • Pros: Run very fast
          • Cons: Too many false clones
      • Syntax based (AST) Techniques:
          • Pros: Well suited to detecting structural similarities
          • Cons: Not well suited to detecting Type 3 clones
      • Graph based Techniques:
          • Pros: The only ones able to deal with Type 4 clones
          • Cons: Performance issues
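As a rough illustration of the token-based family, the sketch below normalizes a Python fragment with the standard `tokenize` module (identifiers become `ID`, literals become `LIT`, comments and blank lines are dropped) and reports two fragments as clones when the normalized streams are equal. The function names and the normalization rules are assumptions made for this example; they are not taken from any particular tool mentioned in the talk.

```python
import io
import keyword
import tokenize


def normalized_tokens(source):
    """Return the token stream of source with names and literals abstracted."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines are ignored
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append('ID')  # identifiers are abstracted away
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append('LIT')  # so are literals
        elif tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            result.append(tokenize.tok_name[tok.type])
        else:
            result.append(tok.string)  # keywords and operators are kept
    return result


def is_token_clone(source_a, source_b):
    """True when both fragments have the same normalized token stream."""
    return normalized_tokens(source_a) == normalized_tokens(source_b)
```

A single pass over the tokens is cheap, which matches the "run very fast" point. The same normalization also makes unrelated one-liners such as `total = price * qty` and `area = width * height` compare equal, which hints at where false clones come from, while one inserted statement is enough to break the match.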
  • 46. USE MACHINE LEARNING, LUKE
      • Provides computationally effective solutions for analyzing large data sets
      • Provides solutions that can be tailored to different tasks/domains
      • Requires considerable effort in:
          • the definition of the relevant information best suited to the specific task/domain
          • the application of the learning algorithms to the considered data
  • 48. UNSUPERVISED LEARNING
      • Supervised Learning:
          • Learn from labelled samples
      • Unsupervised Learning:
          • Learn (directly) from the data
      • Learn by examples
          (+) No cost of labeling samples
          (-) Trade-off imposed on the quality of the data
  • 50. KERNELS FOR STRUCTURES: CODE STRUCTURES
      • Kernel K( , ): computation of the dot product between (Graph) Structures
      • Abstract Syntax Tree (AST): tree structure representing the syntactic structure of the different instructions of a program (function)
      • Program Dependence Graph (PDG): (directed) graph structure representing the relationships among the different statements of a program
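The "dot product between structures" can be sketched with a deliberately simple stand-in: each fragment is mapped to a vector that counts its AST node types, and the kernel is the dot product of two such vectors. This bag-of-node-types kernel is an assumption chosen for brevity and is much cruder than the tree and graph kernels the talk refers to; the function names are hypothetical. It uses Python's standard `ast` module.

```python
import ast
import math
from collections import Counter


def node_type_counts(source):
    """Feature vector: how many times each AST node type occurs."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def kernel(source_a, source_b):
    """Dot product of the two node-type count vectors."""
    counts_a = node_type_counts(source_a)
    counts_b = node_type_counts(source_b)
    # Counter returns 0 for node types missing from counts_b
    return sum(counts_a[name] * counts_b[name] for name in counts_a)


def normalized_kernel(source_a, source_b):
    """Cosine-style normalization so that identical structures score 1."""
    return kernel(source_a, source_b) / math.sqrt(
        kernel(source_a, source_a) * kernel(source_b, source_b))
```

Because only node types are counted, a fragment and its renamed copy get a normalized score of 1, while structurally different fragments score lower. A real structure kernel would compare subtrees or subgraphs instead of isolated node types.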
  • 52. KERNEL FOR CLONES [Figure: two code fragments (CODE) and their abstract syntax trees (AST), built from while, block, and if nodes with comparison and assignment subtrees over the variables x, y, a, b, and c]
  • 53. KERNEL FOR CLONES [Figure: the same code fragments and ASTs, with the subtrees compared by the AST KERNEL shown alongside]
  • 59. KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES
      • Instruction Class (IC): e.g., LOOP, CALL, CONDITIONAL_STATEMENT
      • Instruction (I): e.g., FOR, IF, WHILE, RETURN
      • Context (C): the Instruction Class of the closest statement node
      • Lexemes (Ls): lexical information gathered (recursively) from the leaves
      • Example on the subtree "while (x < y) block":
          • while node: IC = Loop, I = while-loop, C = Function-Body, Ls = [x, y]
          • < node: IC = Conditional-Expr, I = Less-operator, C = Loop, Ls = [x, y]
          • block node: IC = Block, I = while-body, C = Loop, Ls = [x]
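The four features can be approximated on Python code with the standard `ast` module, as in the sketch below. The mapping `INSTRUCTION_CLASSES`, the class names in it, and the choice to collect lexemes as a sorted set of names and constants are all assumptions made for this illustration; the values it produces will not match the slide's example exactly.

```python
import ast

# Hypothetical mapping from Python AST node types to instruction classes.
INSTRUCTION_CLASSES = {
    'While': 'LOOP', 'For': 'LOOP',
    'If': 'CONDITIONAL_STATEMENT',
    'Compare': 'CONDITIONAL_EXPR',
    'Call': 'CALL',
    'FunctionDef': 'FUNCTION_BODY',
}


def lexemes(node):
    """Names and constants gathered recursively from the leaves under node."""
    found = set()
    for child in ast.walk(node):
        if isinstance(child, ast.Name):
            found.add(child.id)
        elif isinstance(child, ast.Constant):
            found.add(repr(child.value))
    return sorted(found)


def extract_features(tree):
    """Return IC, I, C and Ls for every statement and expression node."""
    results = []

    def visit(node, context):
        instruction = type(node).__name__
        if isinstance(node, (ast.stmt, ast.expr)):
            results.append({
                'IC': INSTRUCTION_CLASSES.get(instruction, 'OTHER'),
                'I': instruction,
                'C': context,
                'Ls': lexemes(node),
            })
        # A statement node becomes the context of everything below it.
        if isinstance(node, ast.stmt):
            context = INSTRUCTION_CLASSES.get(instruction, 'OTHER')
        for child in ast.iter_child_nodes(node):
            visit(child, context)

    visit(tree, 'MODULE')
    return results
```

For a function containing `while x < y: x = x + 1`, the `While` node gets `IC = LOOP` with context `FUNCTION_BODY`, and the `Compare` node for `x < y` gets context `LOOP`, mirroring the shape of the example on the slide.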
  • 60. CLONE DETECTION: RESULTS
      • Comparison with another (pure) AST-based clone detector
      • Comparison on a system with randomly seeded clones
      • [Figure: bar chart of Precision, Recall, and F-measure, on a scale from 0 to 1, for CloneDigger and the Tree Kernel Tool]
      • Results refer to clones where code fragments have been modified by adding, removing, or changing code statements
  • 61. Precision, Recall and F-Measure
      • [Figure: Precision, Recall, and F1 curves, on a scale from 0 to 1, plotted over an x axis ranging from 0.6 to 0.98]
      • Precision: How accurate are the obtained results? (Alternatively: how many errors do they contain?)
      • Recall: How complete are the obtained results? (Alternatively: how many clones have been retrieved with respect to the total number of clones?)
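The two questions above correspond to the usual definitions: precision is the fraction of reported clones that are real, recall is the fraction of real clones that were reported, and F1 is their harmonic mean. A minimal sketch, assuming clones are represented as hashable pairs of fragment identifiers (a representation chosen here for illustration):

```python
def precision_recall_f1(reported, actual):
    """Compute precision, recall and F1 for a set of reported clone pairs."""
    reported, actual = set(reported), set(actual)
    true_positives = len(reported & actual)
    # Guard against empty sets to avoid dividing by zero
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, a detector that reports four pairs of which two are real, on a system with three real clone pairs, has precision 2/4 and recall 2/3.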
  • 64. CODE STRUCTURES: PDG NODES AND EDGES
      • Two Types of Nodes
          • Control Nodes (dashed ones), e.g., if, for, while, function calls...
          • Data Nodes, e.g., expressions, parameters...
      • Two Types of Edges (i.e., dependencies)
          • Control edges (dashed ones)
          • Data edges
      • [Figure: example PDG with the nodes while, call-site, arg, and expr]
  • 66. KERNELS FOR CODE STRUCTURES: PDG (GRAPH KERNELS FOR PDG)
      • Features of nodes:
          • Node Label, e.g., WHILE, CALL-SITE, EXPR, ...
          • Node Type, i.e., Data Node or Control Node
      • Features of edges:
          • Edge Type, i.e., Data Edge or Control Edge
      • Example: Node Label = WHILE, Node Type = Control Node
      • [Figure: example PDG with the nodes while, call-site, arg, expr, and expr, connected by edges marked as Control Edge or Data Edge]
  • 67. GRAPH KERNELS FOR PDG
      • Goal: Identify common subgraphs
      • Selectors: Compare nodes to each other and explore the subgraphs of only “compatible” nodes (i.e., nodes of the same type)
      • Context: The subgraph of a node (with paths whose lengths are at most L, to avoid loops)
      • [Figure: two PDGs being compared; the first has the nodes while, call-site, arg, expr, and expr, the second has the nodes while, call-site, arg, expr, and call-site]
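The selector and context ideas can be sketched as follows. A graph is stored as labelled, typed nodes plus typed edges; the context of a node is the multiset of label paths with at most L edges that start from it; the kernel only compares nodes of the same type and adds up the paths their contexts share. The `PDG` class, the path-counting kernel, and the default `max_length=2` are assumptions made for this illustration, not the kernel defined in the talk.

```python
from collections import Counter


class PDG:
    """Toy dependence graph: typed, labelled nodes and typed edges."""

    def __init__(self, nodes, edges):
        self.nodes = nodes  # node id -> (label, node_type)
        self.successors = {node_id: [] for node_id in nodes}
        for source, target, edge_type in edges:
            self.successors[source].append((edge_type, target))

    def context(self, node_id, max_length):
        """Count label paths with at most max_length edges from node_id."""
        paths = Counter()

        def walk(current, path, remaining):
            paths[path] += 1
            if remaining == 0:
                return  # the length bound also stops walks around cycles
            for edge_type, target in self.successors[current]:
                step = (edge_type, self.nodes[target][0])
                walk(target, path + (step,), remaining - 1)

        walk(node_id, (self.nodes[node_id][0],), max_length)
        return paths


def pdg_kernel(graph_a, graph_b, max_length=2):
    """Sum the shared context paths over all pairs of compatible nodes."""
    total = 0
    for id_a, (_, type_a) in graph_a.nodes.items():
        for id_b, (_, type_b) in graph_b.nodes.items():
            if type_a != type_b:
                continue  # selector: only nodes of the same type are compared
            common = (graph_a.context(id_a, max_length)
                      & graph_b.context(id_b, max_length))
            total += sum(common.values())
    return total
```

Skipping incompatible node pairs is what keeps the comparison tractable, and bounding the path length keeps each context finite even when the graph contains cycles.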
  • 74. PROBLEM STATEMENT: (MODEL) CLONE DETECTION
      • Models: models are typically represented visually, as box-and-arrow diagrams, and the clones we are searching for are similar subgraphs of these diagrams.
      • Model Granularity: like source code, models can be represented at different levels of granularity, corresponding to different syntactic (and semantic) units.
      • Model clones are categorized into three different types.
  • 76. TYPE 1 CLONES: (MODEL) CLONE DETECTION
      • Type 1 (exact) model clones: Identical model fragments except for variations in visual presentation, layout and formatting.
  • 77. TYPE 2 CLONES: (MODEL) CLONE DETECTION
      • Type 2 (renamed) model clones: Structurally identical model fragments except for variations in labels, values, types, visual presentation, layout and formatting.
      • [Figure: model@Friction Mode Logic/Break Apart Detection and model@Friction Mode Logic/Lockup Detection/Required Friction for Lockup]
  • 78. TYPE 3 CLONES: (MODEL) CLONE DETECTION
      • Type 3 (near-miss) model clones: Model fragments with further modifications, such as changes in position or connection with respect to other model fragments and small additions or removals of blocks or lines, in addition to variations in labels, values, types, visual presentation, layout and formatting.
      • [Figure: model@Speed.speed_estimation and model@Throttle.throttle_estimation]
  • 80. THANK YOU Valerio Maggio Ph.D., University of Naples “Federico II” valerio.maggio@unina.it