A Tree Kernel Based Approach for Clone Detection
Anna Corazza (1), Sergio Di Martino (1), Valerio Maggio (1), Giuseppe Scanniello (2)
(1) University of Naples Federico II
(2) University of Basilicata
Outline
► Background
○ Clone detection definition
○ State of the Art Techniques Taxonomy
► Our Abstract Syntax Tree based Proposal
○ A Tree Kernel based approach for clone detection
► A preliminary evaluation
Code Clones
► Two code fragments form a clone if they are similar enough according to a given measure of similarity (I.D. Baxter, 1998)
► Similarity can be based on Program Text or on “Semantics”
► Clones based on Program Text are further distinguished by their degree of similarity [1]
○ Type 1 Clone: Exact Copy
○ Type 2 Clone: Parameter Substituted Clone
○ Type 3 Clone: Modified/Structure Substituted Clone
1. R. Tiarks, R. Koschke, and R. Falke, An assessment of type-3 clones as detected by state-of-the-art tools
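To make the three clone types concrete, here is a small sketch in Python. The four code fragments and the helpers `plain_dump` and `normalized_dump` are hypothetical. The helper renames every identifier to a placeholder using the standard `ast` module, which is one common way to make Type 2 clones compare equal. It is not the technique proposed in these slides.

```python
import ast

ORIGINAL = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

# Type 1: exact copy, up to whitespace and comments.
TYPE_1 = """
def total(values):
    result = 0          # running sum
    for v in values:
        result += v
    return result
"""

# Type 2: same structure, identifiers substituted.
TYPE_2 = """
def add_all(items):
    acc = 0
    for x in items:
        acc += x
    return acc
"""

# Type 3: copy with statements added.
TYPE_3 = """
def add_all(items):
    acc = 0
    for x in items:
        if x is None:
            continue
        acc += x
    return acc
"""


class _Rename(ast.NodeTransformer):
    """Replace every function, argument and variable name with '_'."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_Name(self, node):
        node.id = "_"
        return node


def plain_dump(source):
    """Textual form of the AST; comments and layout are already gone."""
    return ast.dump(ast.parse(source))


def normalized_dump(source):
    """Textual form of the AST after identifier normalisation."""
    return ast.dump(_Rename().visit(ast.parse(source)))
```

With these definitions the Type 1 copy has the same plain AST as the original, the Type 2 copy matches only after normalisation, and the Type 3 copy differs even then, which is why Type 3 detection needs a graded similarity measure rather than an equality test.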
State of the Art Techniques
► Classified in terms of Program Text representation [2]
○ String, token, syntax tree, control structures, metric vectors
► String/Token based Techniques
► Abstract Syntax Tree (AST) Techniques
► ...
► Combined Techniques (a.k.a. Hybrid)
○ Combine different representations
○ Combine different techniques
○ Combine different sources of information
● Tree Kernel based approach (our approach)
2. Roy, Cordy, Koschke, Comparison and Evaluation of Clone Detection Tools and Techniques, 2009
The Proposed Approach
The Goal
► Define an AST based technique able to detect up to Type 3 Clones
► The Key Ideas:
○ Enrich the information carried by ASTs by also adding lexical information
○ Define a proper measure to compute similarities among (sub)trees, exploiting such information
► As a measure we propose the use of a (Tree) Kernel Function
Kernels for Structured Data
► Kernels are a class of functions with many appealing features:
○ They are based on the idea that a complex object can be described in terms of its constituent parts
○ They can be easily tailored to a specific domain
► There exist different classes of Kernels:
○ String Kernels
○ Graph Kernels
○ …
○ Tree Kernels
● Applied to NLP Parse Trees (Collins and Duffy 2004)
Defining a new Tree Kernel
► The definition of a new Tree Kernel requires the specification of:
(1) A set of features to annotate nodes of compared trees
(2) A (primitive) Kernel Function to measure the similarity of each pair of nodes
(3) A proper Kernel Function to compare subparts of trees
(1) The defined features
► We annotate each node of the AST with 4 features:
○ Instruction Class
● e.g. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL, ...
○ Instruction
● e.g. FOR, WHILE, IF, RETURN, CONTINUE, ...
○ Context
● Instruction Class of the statement in which the node is enclosed
○ Lexemes
● Lexical information within the code
Context Feature
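► Rationale: two nodes are more similar if they appear in the same Instruction Class
for (int i=0; i<10; i++)
    x += i+2;
if (i<10)
    x += i+2;
while (i<10)
    x += i+2;
► The statement x += i+2 is the same in all three fragments, but only in the for and while fragments is it enclosed in the same Instruction Class (LOOP)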
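The four node features and the Context rationale can be sketched as a small data structure. This is a minimal illustration in Python, not the authors' implementation. The `Node` class, the `INSTRUCTION_CLASS` table, the `ASSIGN`/`ASSIGNMENT` names and the `annotate_context` helper are all assumptions, and the statement `x += i+2` is collapsed into a single node for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from Instruction to Instruction Class; the slides
# list only a few instructions and classes.
INSTRUCTION_CLASS = {
    "FOR": "LOOP",
    "WHILE": "LOOP",
    "IF": "CONDITIONAL CONTROL",
    "ASSIGN": "ASSIGNMENT",  # assumed name, not taken from the slides
}


@dataclass
class Node:
    instruction: str
    lexemes: frozenset = frozenset()
    children: list = field(default_factory=list)
    context: str = "NONE"  # filled in by annotate_context

    @property
    def instruction_class(self):
        return INSTRUCTION_CLASS[self.instruction]


def annotate_context(node, enclosing="NONE"):
    """Set each node's context to the Instruction Class of its enclosing statement."""
    node.context = enclosing
    for child in node.children:
        annotate_context(child, node.instruction_class)


def fragment(instruction):
    """Build `<instruction> (...) x += i+2;` and return the annotated body node."""
    body = Node("ASSIGN", frozenset({"x", "i", "2"}))
    root = Node(instruction, children=[body])
    annotate_context(root)
    return body


for_body = fragment("FOR")
if_body = fragment("IF")
while_body = fragment("WHILE")
```

Here `for_body` and `while_body` end up with the same context (`LOOP`), while `if_body` gets `CONDITIONAL CONTROL`, so the three otherwise identical statements are no longer indistinguishable.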
Lexemes Feature
► For leaf nodes:
○ It is the lexeme associated with the node
► For internal nodes:
○ It is the set of lexemes propagated recursively from the subtrees with minimum height
Lexemes Propagation
[Figure: AST of a block containing a while loop (condition x < 0, body block x %= y) followed by return y, annotated bottom-up over successive builds. Leaves carry their own lexeme; < gets {x, 0}; %= and its enclosing block get {x, y}; while gets {x, 0, while}; return and the root block get {y, return}.]
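The propagation rule can be sketched as follows, using a tree of the shape suggested by the slide. This is an illustrative Python sketch under two assumptions: the `Node` class and function names are hypothetical, and the `contributes_label` flag encodes our reading of the figure that keyword nodes such as `while` and `return` add their own lexeme while operators and blocks do not.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    contributes_label: bool = False  # assumed flag, see the lead-in
    lexemes: frozenset = frozenset()


def height(node):
    """Height of a subtree; a leaf has height 0."""
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children)


def propagate(node):
    """Annotate node.lexemes bottom-up and return the resulting set."""
    if not node.children:
        node.lexemes = frozenset({node.label})
        return node.lexemes
    for child in node.children:
        propagate(child)
    # Only the child subtrees with minimum height pass their lexemes up.
    lowest = min(height(child) for child in node.children)
    collected = set()
    for child in node.children:
        if height(child) == lowest:
            collected |= child.lexemes
    if node.contributes_label:
        collected.add(node.label)
    node.lexemes = frozenset(collected)
    return node.lexemes


# Tree: block( while( <(x, 0), block( %=(x, y) ) ), return(y) )
cond = Node("<", [Node("x"), Node("0")])
update = Node("%=", [Node("x"), Node("y")])
body = Node("block", [update])
loop = Node("while", [cond, body], contributes_label=True)
ret = Node("return", [Node("y")], contributes_label=True)
root = Node("block", [loop, ret])
propagate(root)
```

The `while` node takes its lexemes from the condition rather than the body because the condition subtree is the shallower of the two, and the root block takes them from `return` for the same reason.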
(2) Applying features in a Kernel
We exploit these features to compute the similarity between pairs of nodes, as follows:
► Instruction Class filters comparable nodes
○ We compare only nodes with the same Instruction Class
► Instruction, Context and Lexemes are used to define a similarity value between compared nodes
(Primitive) Kernel Function between nodes
\[
s(n_1, n_2) =
\begin{cases}
1.0 & \text{if the two nodes have the same values of features} \\
0.8 & \text{if the two nodes differ in lexemes (same instruction and context)} \\
0.7 & \text{if the two nodes share lexemes and are the same instruction} \\
0.5 & \text{if the two nodes share lexemes and are enclosed in the same context} \\
0.25 & \text{if the two nodes have at least one feature in common} \\
0.0 & \text{no match}
\end{cases}
\]
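The node-level function can be sketched as a cascade of checks evaluated top to bottom. This Python sketch makes several assumptions that the slide leaves open: "share lexemes" is read as a non-empty intersection of the two lexeme sets, "at least one feature in common" is read as referring to Instruction, Context or Lexemes (the Instruction Class is already equal after filtering), and nodes of different Instruction Class score 0.0. The `Node` class and the feature values used in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    instruction_class: str
    instruction: str
    context: str
    lexemes: frozenset


def node_similarity(a, b):
    """Primitive kernel s(n1, n2) under the assumptions stated above."""
    # Instruction Class acts as a filter: other pairs are not comparable.
    if a.instruction_class != b.instruction_class:
        return 0.0
    same_instruction = a.instruction == b.instruction
    same_context = a.context == b.context
    same_lexemes = a.lexemes == b.lexemes
    share_lexemes = bool(a.lexemes & b.lexemes)

    if same_instruction and same_context and same_lexemes:
        return 1.0
    if same_instruction and same_context:
        return 0.8
    if share_lexemes and same_instruction:
        return 0.7
    if share_lexemes and same_context:
        return 0.5
    if same_instruction or same_context or share_lexemes:
        return 0.25
    return 0.0


# Hypothetical feature values for the statement x += i+2 in different settings.
lex = frozenset({"x", "i", "2"})
in_for = Node("ASSIGNMENT", "PLUS_ASSIGN", "LOOP", lex)
in_while = Node("ASSIGNMENT", "PLUS_ASSIGN", "LOOP", lex)
in_if = Node("ASSIGNMENT", "PLUS_ASSIGN", "CONDITIONAL CONTROL", lex)
renamed = Node("ASSIGNMENT", "PLUS_ASSIGN", "LOOP", frozenset({"y", "j"}))
```

Under these assumptions the statement inside `for` and `while` scores 1.0, the same statement inside `if` scores 0.7, and a copy with all identifiers replaced scores 0.8, which is the behaviour one would want for Type 2 style edits.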
(3) Tree Kernel: Kernel on entire Tree Structures
► We apply node comparison recursively to compute the similarity between subtrees
► We aim to identify the maximum isomorphic tree/subtree
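The slide does not give the formula for the tree-level kernel, so the following Python sketch only illustrates the general idea of applying a node comparison recursively. Everything in it is an assumption: children are paired by position, recursion stops where the node comparison returns 0, the score is normalised by the size of the larger tree, and `node_similarity` is a placeholder that compares labels only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def node_similarity(a, b):
    """Placeholder primitive kernel: 1.0 for equal labels, else 0.0."""
    return 1.0 if a.label == b.label else 0.0


def size(node):
    """Number of nodes in the subtree."""
    return 1 + sum(size(child) for child in node.children)


def raw_similarity(a, b):
    """Sum node similarities over positionally paired nodes, stopping at mismatches."""
    score = node_similarity(a, b)
    if score == 0.0:
        return 0.0
    return score + sum(
        raw_similarity(x, y) for x, y in zip(a.children, b.children)
    )


def tree_similarity(a, b):
    """Normalise to [0, 1]; identical trees score 1.0."""
    return raw_similarity(a, b) / max(size(a), size(b))


def loop_tree(operator):
    """while (x < 0) { x <operator> y; } as a small tree."""
    cond = Node("<", [Node("x"), Node("0")])
    body = Node("block", [Node(operator, [Node("x"), Node("y")])])
    return Node("while", [cond, body])


t1 = loop_tree("%=")
t2 = loop_tree("+=")
```

With the 0/1 placeholder, the raw score counts the nodes of the largest common top-down matching part of the two trees, so `t1` and `t2` share 5 of their 8 nodes. A graded node function would replace the placeholder in a fuller sketch.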
Overall Process
1. Preprocessing
2. Extraction
3. Match Detection
4. Aggregation
A Preliminary evaluation
Evaluation Description
► We considered a small Java software system
○ We chose to identify clones at method level
► We checked the system for the presence of clones up to Type 3
○ We removed all detected clones through refactoring operations
► We manually and randomly injected a set of artificially created clones
○ One set for each type of clone
► We applied our prototype and CloneDigger* to the mutated systems
► We evaluated performance in terms of Precision, Recall and F1
* http://clonedigger.sourceforge.net/
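For reference, Precision, Recall and F1 over injected clones can be computed as in the sketch below. The function name and the method identifiers are hypothetical, and the pairs are made up to show the arithmetic. They are not data from the study.

```python
def precision_recall_f1(detected, injected):
    """Score detected clone pairs against the injected (ground-truth) pairs."""
    detected, injected = set(detected), set(injected)
    true_positives = len(detected & injected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical method-level clone pairs; a pair is unordered, hence frozenset.
injected = {
    frozenset({"A.m1", "B.m1"}),
    frozenset({"A.m2", "C.m4"}),
    frozenset({"B.m3", "C.m5"}),
    frozenset({"A.m6", "B.m7"}),
}
detected = {
    frozenset({"A.m1", "B.m1"}),
    frozenset({"A.m2", "C.m4"}),
    frozenset({"B.m3", "C.m5"}),
    frozenset({"C.m8", "C.m9"}),  # a false positive
}
p, r, f1 = precision_recall_f1(detected, injected)
```

In this toy case three of the four detected pairs are correct and three of the four injected pairs are found, so Precision, Recall and F1 all equal 0.75.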
Results (1)
► Type 1 and Type 2 Clones:
○ We were able to detect all clones without any false positives
○ CloneDigger obtained the same result
○ Both tools showed the potential of AST-based approaches
Results (2)
► Type 3 clones:
○ We classified results as “true Type 3 clones” according to different thresholds on similarity values
○ We measured performance at different thresholds
► We obtained the best results with a threshold of 0.70
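Threshold-based classification can be sketched as follows. The similarity scores, pair names and threshold values are invented purely to show the mechanics. They are not the study's data and say nothing about the reported 0.70 result.

```python
def classify(scores, threshold):
    """Return the pairs whose similarity reaches the threshold."""
    return {pair for pair, score in scores.items() if score >= threshold}


def f1_at(scores, true_clones, threshold):
    """F1 of the pairs accepted at the given threshold."""
    detected = classify(scores, threshold)
    true_positives = len(detected & true_clones)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)
    recall = true_positives / len(true_clones)
    return 2 * precision * recall / (precision + recall)


# Toy similarity scores, invented for illustration only.
scores = {"p1": 0.9, "p2": 0.75, "p3": 0.6, "p4": 0.4}
true_clones = {"p1", "p2"}
sweep = {t: f1_at(scores, true_clones, t) for t in (0.3, 0.5, 0.8)}
best = max(sweep, key=sweep.get)
```

A low threshold accepts every pair and hurts precision, a high one rejects true clones and hurts recall, so the best F1 in this toy sweep sits at the intermediate value.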
Conclusions and Future Work
► Measure performance on real systems and projects
○ Bellon's Benchmark
○ Investigate the best results obtained with 0.7 as threshold
○ Measure time performance
► Improve the scalability of the approach
○ Avoid comparing all pairs
► Improve similarity computation
○ Avoid manually weighting the features
► Extend Supported Languages
○ Currently we support Java, C, Python
Thank you for listening.
Questions?