Author Identification of Source Code Segments
Written by Multiple Authors
Using Stacking Ensemble Method
Parvez Mahbub (this@parvezmrobin.com), Naz Zarreen Oishie (nazzarreen05@gmail.com), S M Rafizul Haque (rafizul@cse.ku.ac.bd)
Overview
Introduction
• Objective
• Application
Background
• Related Works
Methodology
• Stacking Ensemble Classifier
• Datasets
Experimental Result
• Comparison With Other Methods
Conclusion
Introduction
Objective
• What: identifying the author of a source code segment written by multiple authors, based on previous training
• Why: of vast importance in digital forensics and plagiarism detection
• How: we designed a stacking ensemble method based on a DNN, random forests, and SVMs
Application
• Plagiarism detection
• Detecting the original author
• Law enforcement
• Cybercrime investigation
• Criminal prosecution
• Hacking
• Malicious software
• Corporate litigation
• Cracking
• Copyright infringement
• Code cloning
Background
Artificial Neural Network (ANN)
Random Forests
• An ensemble consisting of a number of decision trees
• Each tree in the forest is trained on a subset of the original data
• During classification, each tree casts a vote and the result is decided by majority vote
• CART and C4.5 are the most commonly used algorithms for generating the decision trees
(An illustrative code sketch follows.)
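As a minimal sketch of this idea, assuming scikit-learn (whose trees are CART-style; C4.5 is not available there) and synthetic stand-in data in place of real code-metric features:

```python
# Minimal random-forest sketch (assumes scikit-learn, whose trees are CART-style).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in feature vectors and author labels (placeholders for real code metrics).
X, y = make_classification(n_samples=500, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 100 trees, each trained on a bootstrap subset of the data;
# the predicted class is the majority vote across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
print("accuracy:", forest.score(X_te, y_te))
```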
Support Vector Machine
• Classifies data using an optimal linear hyperplane
• The hyperplane is generated from several support vectors by maximizing the margin
• For data that are not linearly separable, a kernel function maps the data into a higher-dimensional space where they become linearly separable
(An illustrative code sketch follows.)
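A hedged sketch of the two SVM formulations that later serve as base classifiers (C-SVM and ν-SVM), assuming scikit-learn; the RBF kernel and synthetic data are illustrative choices only:

```python
# Illustrative C-SVM and nu-SVM classifiers with an RBF kernel (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, NuSVC

X, y = make_classification(n_samples=500, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling matters for SVMs: the maximal margin is computed in feature space.
c_svm = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf"))
nu_svm = make_pipeline(StandardScaler(), NuSVC(nu=0.3, kernel="rbf"))

c_svm.fit(X_tr, y_tr)
nu_svm.fit(X_tr, y_tr)
print("C-SVM accuracy: ", c_svm.score(X_te, y_te))
print("nu-SVM accuracy:", nu_svm.score(X_te, y_te))
```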
Ensemble Methods
• A machine learning method that combines several machine learning classifiers
• Often more accurate than a single machine learning classifier
• When the outputs of the combined classifiers are fed to another classifier, it is called a stacking ensemble
Related Works
Approaches
N-gram based approaches
• Treat source code like an ordinary text document
• Cannot make use of programming features and grammar
Metric based approaches
• More recent approaches
• Extract several metrics from the source code, e.g., line length, identifier length, ratio of commenting
• Perform classification based on the extracted metrics
Stacking Ensemble Method for
Author Identification
Methodology
• A stacking ensemble method consists of several base classifiers and one meta classifier.
• The outputs of the base classifiers are used as the input to the meta classifier.
• Random forest, deep neural network (DNN), and support vector machine (SVM) classifiers perform best in general classification tasks [7].
• 1 DNN, 2 random forests, and 2 SVMs are chosen as the base classifiers.
• The meta classifier is another deep neural network.
(A minimal sketch of this architecture follows.)
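A minimal sketch of this five-base-classifier architecture, assuming scikit-learn: MLPClassifier stands in for the DNNs, two RandomForestClassifiers stand in for the CART and C4.5 forests (scikit-learn only grows CART-style trees), and synthetic data replaces the extracted code metrics:

```python
# Stacking ensemble sketch: 5 base classifiers + a neural-network meta classifier.
# Assumes scikit-learn; MLPClassifier approximates the DNNs described in the slides.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, NuSVC

X, y = make_classification(n_samples=800, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

base_classifiers = [
    ("dnn", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)),
    ("rf_a", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("rf_b", RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=1)),
    ("c_svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ("nu_svm", make_pipeline(StandardScaler(), NuSVC(probability=True, random_state=0))),
]

# The base classifiers' outputs become the meta classifier's input features.
stack = StackingClassifier(
    estimators=base_classifiers,
    final_estimator=MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    cv=5,
)
stack.fit(X_tr, y_tr)
print("stacking accuracy:", stack.score(X_te, y_te))
```

StackingClassifier builds the meta features via internal cross-validation, which matches the idea of feeding the base classifiers' outputs to the meta classifier rather than reusing their training-set predictions directly.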
Code Metrics
Line Length: measures the number of characters in one source code line.
Line Words: measures the number of words in one source code line.
Comments Frequency: calculates the relative frequency of line comments, block comments, and optionally doc-comments used by the programmer.
Identifiers Length: calculates the length of each identifier in a program.
Inline Space-Tab: calculates the whitespace that occurs in the interior of non-whitespace lines.
Trail Space-Tab: measures the whitespace and tab occurrences at the end of each non-whitespace line.
Indent Space-Tab: calculates the indentation whitespace used at the beginning of each non-whitespace line.
Underscores: measures the number of underscore characters used in identifiers.
(An illustrative extraction sketch follows.)
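An illustrative sketch of how a few of these metrics could be computed per line, written from the metric descriptions above; the slides do not give the extraction code, so the function name and details here are assumptions:

```python
# Hypothetical per-line metric extraction, based on the metric descriptions above.
import re

def line_metrics(line: str) -> dict:
    """Compute a handful of the code metrics listed above for one source line."""
    stripped = line.rstrip("\n")
    body = stripped.strip()
    indent = stripped[: len(stripped) - len(stripped.lstrip(" \t"))]
    trailing = stripped[len(stripped.rstrip(" \t")):]
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", body)
    return {
        "line_length": len(stripped),                             # characters in the line
        "line_words": len(body.split()),                          # whitespace-separated words
        "indent_space_tab": len(indent),                          # leading whitespace
        "trail_space_tab": len(trailing),                         # trailing whitespace
        "inline_space_tab": body.count(" ") + body.count("\t"),   # interior whitespace
        "underscores": sum(i.count("_") for i in identifiers),    # underscores in identifiers
        "identifier_lengths": [len(i) for i in identifiers],
    }

print(line_metrics("    total_count = foo_bar + 1  \n"))
```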
Block Diagram of The System
(Data-flow diagram: training source codes pass through the feature extractor to produce extracted features; these feed five base classifiers, namely a deep neural network, a random forest (CART), a random forest (C4.5), a C-SVM, and a ν-SVM; their outputs form the DNN, RF-CART, RF-C4.5, C-SVM, and ν-SVM meta features, which are fed to the meta classifier to produce the trained model.)
Algorithm
1. Extract code metrics from the training set
2. Convert the code metrics to feature vectors
3. For each model in {DNN, RF-CART, RF-C4.5, C-SVM, ν-SVM}:
4.     Train the model on the training features
5. Stack the outputs of each model to form meta features
6. Train the meta classifier on the meta features
7. Predict the authors of unknown samples using the classifiers
(A sketch of steps 3-6 follows.)
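A hedged sketch of steps 3-6, assuming scikit-learn: out-of-fold predictions from each base model are stacked column-wise to form the meta features on which the meta classifier is trained. The model choices and the synthetic data are illustrative stand-ins, not the paper's exact configuration:

```python
# Steps 3-6 as a manual sketch: build meta features from base-model predictions,
# then train the meta classifier on them. Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)

base_models = {
    "dnn": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(probability=True, random_state=0),
}

# Steps 3-5: out-of-fold class probabilities from each base model, stacked side by side.
meta_features = np.hstack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")
    for model in base_models.values()
])

# Step 6: the meta classifier is trained on the stacked meta features.
meta_clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
meta_clf.fit(meta_features, y)

# Step 7 (prediction) would refit the base models on all training data and feed
# their predict_proba outputs for unseen samples into meta_clf.predict.
```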
Challenges
• No unique author style
• Choosing a proper code metric set
  • Should be small
  • Relatively easy to compute
  • Programming-language independent
• Visualizing such high-dimensional data (196 features) and making decisions based on it
• Properly training each of the base models
• Choosing the proper configuration of the classifiers (an illustrative configuration sketch follows)
  • DNN: number of hidden layers, activation functions, loss, optimizer, regularizer
  • Random forest: number of trees, algorithm used to train the trees
  • SVM: bound on support vectors, kernel functions, bound on training error
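A hypothetical configuration dictionary showing the kinds of hyperparameter choices listed above; none of these values come from the slides, they only illustrate what has to be decided for each classifier:

```python
# Hypothetical hyperparameter choices of the kind described above (illustrative only).
configs = {
    "dnn": {
        "hidden_layers": [128, 64, 32],    # number and width of hidden layers
        "activation": "relu",
        "loss": "categorical_crossentropy",
        "optimizer": "adam",
        "regularizer": {"l2": 1e-4},
    },
    "random_forest": {
        "n_trees": 100,                    # number of trees in each forest
        "tree_algorithm": "CART",          # or "C4.5"
    },
    "svm": {
        "kernel": "rbf",
        "C": 1.0,                          # bound on training error (C-SVM)
        "nu": 0.5,                         # bound on support vectors (nu-SVM)
    },
}
```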
Datasets
• Open-source code from github.com
• 8 groups of authors
• 2270 contributors
• 6063 files
• Roughly 226 lines per file
• Author: the true owner of the source codes
• Contributor: someone who is not the owner of the source codes but willingly contributes to the project by writing or editing a segment of it

Class Label | Number of Authors | Number of Contributors
Azure | 3 | 136
GoogleCloudPlatform | 33 | 820
StackStorm | 2 | 147
dimagi | 2 | 101
enthought | 9 | 224
fp7-ofelia | 1 | 4
freenas | 2 | 126
sympy | 2 | 712
Experimental Result
Base classifiers:
Classifier | Accuracy
Deep neural network | 82%
CART-based random forest | 83%
C4.5-based random forest | 83%
C-support vector machine | 79%
ν-support vector machine | 79%
Stacking ensemble classifier:
• 87% accuracy
• F1-score: 0.86
Comparison With The Related Works
Method | Features | Language-independent features | Multiple authors | Number of classes | Number of source codes | Accuracy
SCAP | Byte-level n-gram | Yes | No | 30 | 333 | 69%
Information retrieval | Character-level n-gram | Yes | No | 100 | 1640 | 67%
Code metric histogram | 7 code metrics | Yes | No | 20 | 4068 | 55%
Genetic algorithm | 4 code metrics | Yes | No | 20 | NA | 75%
Deep neural network | 9 code metrics | No | No | 10, 10, 8, 5, 9 | 1644, 780, 475, 131, 520 | 93%, 93%, 93%, 78%, 89%
Support vector machine | 46 code metrics | No | No | 8, 53 | 8000, 502 | 98%, 80%
Stacking ensemble method | 7 code metrics | Yes | Yes | 8 (groups of authors) | 6063 | 87%
Conclusion
• In this paper, we designed a stacking ensemble classifier that can effectively identify the authorship of source code written by multiple authors.
• We found a relatively small set of code metrics that captures writing style while remaining language independent.
• We showed that a single machine learning classifier is not sufficient in the case of multiple writers.
• The accuracy we achieved is close to that of approaches that deal with a single author.
References
1. G. Frantzeskou, E. Stamatatos, E. Gritzalis, and S. Katsikas, “Source code
author identification based on n-gram author profiles,” in Artificial
Intelligence Applications and Innovations (K. K. B. M. Maglogiannis, I., ed.),
vol. 204, pp. 508-515, International Federation for Information Processing,
Springer, Boston, MA, 2006.
2. S. Burrows and S. Tahaghoghi, “Source code authorship attribution using n-
grams,” in Proceedings of the 12th Australasian Document Computing
Symposium (A. T. Amanda Spink and M. Wu, eds.), pp. 32-40, School of
Computer Science and Information Technology, RMIT University, 2007.
3. R. C. Lange and S. Mancoridis, “Using code metric histograms and genetic
algorithms to perform author identification for software forensics,” in
Proceedings of the 9th Annual Conference on Genetic and Evolutionary
Computation, London, (New York, NY, USA), pp. 2082-2089, ACM, 2007.
Reference Cont.
4. M. Shevertalov, J. Kothari, E. Stehle, and S. Mancoridis, “On the use of
discretized source code metrics for author identification,” in Proceedings of
51st International Symposium on Search Based Software Engineering (M. D.
Penta and S. Poulding, eds.), (Cumberland Lodge, Windsor, UK), pp. 69-78,
IEEE Computer Society, 2009.
5. U. Bandara and G. Wijayarathna, “Deep neural networks for source code
author identification,” in Proceedings of 20th International Conference on
Neural Information Processing (M. L. et al., ed.), vol. 2 of LNCS 8227,
pp. 368-375, Springer-Verlag Berlin Heidelberg, 2013.
6. C. Zhang, S. Wang, J. Wu, and Z. Niu, “Authorship identification of source
codes,” in Proceedings of Asia-Pacific Web (APWeb) and Web-Age Information
Management (WAIM) Joint Conference on Web and Big Data (C. L.,J. C., S. C.,
Y. X., and L. X., eds.), vol. 10366 of Lecture Notes in Computer Science,
(Cham, Switzerland), pp. 282-296, Springer, 2017.
Reference Cont.
7. R. Caruana, N. Karampatziakis, and A. Yessenalina, “An empirical evaluation
of supervised learning in high dimensions,” in Proceedings of the 25th
International Conference on Machine Learning, (Helsinki, Finland), pp. 96-
103, 2008.
8. S. Burrows, A. Uitdenbogerd, and A. Turpin, “Comparing techniques for
authorship attribution of source code,” Software: Practice and Experience,
vol. 44, 01 2014.
9. X. Yang, G. Xu, Q. Li, Y. Guo, and M. Zhang, “Authorship attribution of
source code by using back propagation neural network based on particle
swarm optimization,” PLOS ONE, vol. 12, pp. 1-18, 11 2017.
10. https://www.bbc.com/bengali/news/2016/03/160324_bangladesh_bank_cyb
er_heist_how_it_happened
Thank You
Equations
Forward pass:
$a = Wx + b$
$\hat{y} = g(a)$

Cost functions:
L2: $J = (\hat{y} - y)^2$
Logarithmic: $J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$
Categorical cross-entropy: $J = -\sum y \log \hat{y}$

Backward pass:
$dZ = A - y$
$dW = \frac{1}{m} \, dZ \cdot A^{T}$
$db = \frac{1}{m} \sum_{i=1}^{m} dZ^{[i]}$

Gradient descent:
$w = w - \alpha \, dw$

Adam:
$v_t = \beta_1 v_{t-1} + (1 - \beta_1) g_t$
$s_t = \beta_2 s_{t-1} + (1 - \beta_2) g_t^2$
$\Delta w_t = -\alpha \, \frac{v_t}{\sqrt{s_t} + \varepsilon}$
$w_{t+1} = w_t + \Delta w_t$
(A numerical sketch follows.)
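A small numerical sketch of the forward pass, backward pass, and Adam update above, using NumPy. The tiny logistic model and random data are placeholders, and the Adam form follows the equations as shown on the slide (no bias correction):

```python
# Numerical sketch of the forward/backward pass and the Adam update from the equations above.
import numpy as np

rng = np.random.default_rng(0)
m = 32                                     # batch size
X = rng.normal(size=(4, m))                # 4 input features, m examples (columns)
Y = (rng.random((1, m)) > 0.5).astype(float)

W = rng.normal(scale=0.1, size=(1, 4))
b = np.zeros((1, 1))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Adam state (first and second moments), matching the slide's v_t and s_t.
v_W = np.zeros_like(W)
s_W = np.zeros_like(W)
beta1, beta2, alpha, eps = 0.9, 0.999, 0.01, 1e-8

for t in range(1, 201):
    # Forward pass: a = Wx + b, y_hat = g(a)
    A = sigmoid(W @ X + b)
    # Backward pass: dZ = A - y, dW = (1/m) dZ A^T, db = (1/m) sum dZ
    dZ = A - Y
    dW = (dZ @ X.T) / m
    db = dZ.sum(axis=1, keepdims=True) / m
    # Adam: v_t and s_t accumulate the gradient and its square; update uses v / (sqrt(s) + eps)
    v_W = beta1 * v_W + (1 - beta1) * dW
    s_W = beta2 * s_W + (1 - beta2) * dW ** 2
    W += -alpha * v_W / (np.sqrt(s_W) + eps)
    b += -alpha * db                       # plain gradient descent on the bias, for brevity
```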
Machine Learning Training & Testing
(Training diagram: training input and training labels pass through the feature extractor; the extracted features and labels are fed to the ML algorithm, which produces the trained model. Testing diagram: test input passes through the feature extractor; the trained model predicts labels, which are compared with the test labels to compute accuracy.)
(Diagram of the base classifiers: a DNN; a CART-based random forest of 100 trees, CART 1 through CART 100; a C4.5-based random forest of 100 trees, C4.5 1 through C4.5 100; a C-SVM; and a ν-SVM.)
N-gram Based Approaches
Method | Author | Year | Limitation
SCAP [1] | Frantzeskou et al. | 2006 | Biased toward larger author profiles [8].
Information retrieval approach [2] | Burrows and Tahaghoghi | 2007 | Low accuracy (67%). Dataset consists of code from students, which is not standard [2].
Metric Based Approaches
Method | Author | Year | Limitation
Histogram and genetic algorithm [3] | Lange and Mancoridis | 2007 | Low accuracy (55%). Requires a new metric combination for each author [3]. Some of the features are unbounded [9].
Discretized source code metrics [4] | Shevertalov et al. | 2009 | Low accuracy (54% for files, 75% for projects). No metrics were programming-style based [4].
Metric Based Approaches Cont.
Method | Author | Year | Limitation
Deep neural network [5] | Bandara and Wijayarathna | 2013 | Feature set is not language independent.
Support vector machine [6] | Zhang et al. | 2017 | 18% accuracy gap from dataset to dataset. Feature set is not language independent.