Author Identification of Source Code Segments
Written by Multiple Authors
Using Stacking Ensemble Method
Parvez Mahbub (this@parvezmrobin.com), Naz Zarreen Oishie (nazzarreen05@gmail.com), S M Rafizul Haque (rafizul@cse.ku.ac.bd)
Overview
Introduction
• Objective
• Application
Background
• Related Works
Methodology
• Stacking Ensemble Classifier
• Datasets
Experimental Result
• Comparison With Other Methods
Conclusion
Introduction
Objective
• What: identifying the author of a source code segment written by multiple authors, based on previous training
• Why: of vast importance in digital forensics and plagiarism detection
• How: we designed a stacking ensemble method based on a DNN, random forests, and SVMs
Application
• Plagiarism detection
• Detecting the original author
• Law enforcement
• Cybercrime investigation
• Criminal prosecution
• Hacking
• Malicious software
• Corporate litigation
• Cracking
• Copyright infringement
• Code cloning
Background
Artificial Neural Network (ANN)
Random Forests
• An ensemble consisting of a number of decision trees
• Each tree in the forest is trained on a subset of the original data
• During classification, each tree casts a vote and the result is decided by majority vote
• CART and C4.5 are the most commonly used algorithms for generating the decision trees
(An illustrative code sketch follows.)
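As a minimal sketch of this idea, assuming scikit-learn (whose trees are CART-style; C4.5 is not available there) and synthetic stand-in data in place of real code-metric features:

```python
# Minimal random-forest sketch (assumes scikit-learn, whose trees are CART-style).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in feature vectors and author labels (placeholders for real code metrics).
X, y = make_classification(n_samples=500, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 100 trees, each trained on a bootstrap subset of the data;
# the predicted class is the majority vote across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
print("accuracy:", forest.score(X_te, y_te))
```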
Support Vector Machine
• Classifies data using an optimal linear hyperplane
• The hyperplane is generated from several support vectors by maximizing the margin
• For data that are not linearly separable, a kernel function maps the data into a higher-dimensional space where they become linearly separable
(An illustrative code sketch follows.)
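A hedged sketch of the two SVM formulations that later serve as base classifiers (C-SVM and ν-SVM), assuming scikit-learn; the RBF kernel and synthetic data are illustrative choices only:

```python
# Illustrative C-SVM and nu-SVM classifiers with an RBF kernel (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, NuSVC

X, y = make_classification(n_samples=500, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling matters for SVMs: the maximal margin is computed in feature space.
c_svm = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf"))
nu_svm = make_pipeline(StandardScaler(), NuSVC(nu=0.3, kernel="rbf"))

c_svm.fit(X_tr, y_tr)
nu_svm.fit(X_tr, y_tr)
print("C-SVM accuracy: ", c_svm.score(X_te, y_te))
print("nu-SVM accuracy:", nu_svm.score(X_te, y_te))
```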
Ensemble Methods
• A machine learning method that combines several machine learning classifiers
• Often more accurate than a single machine learning classifier
• When the outputs of the combined classifiers are fed to another classifier, it is called a stacking ensemble
Related Works
Approaches
N-gram based approaches
• Treat source code like an ordinary text document
• Cannot make use of programming features and grammar
Metric based approaches
• More recent approaches
• Extract several metrics from the source code, e.g., line length, identifier length, ratio of commenting
• Perform classification based on the extracted metrics
Stacking Ensemble Method for
Author Identification
Methodology
• A stacking ensemble method consists of several base classifiers and one meta classifier.
• The outputs of the base classifiers are used as the input to the meta classifier.
• Random forest, deep neural network (DNN), and support vector machine (SVM) classifiers perform best in general classification tasks [7].
• 1 DNN, 2 random forests, and 2 SVMs are chosen as the base classifiers.
• The meta classifier is another deep neural network.
(A minimal sketch of this architecture follows.)
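A minimal sketch of this five-base-classifier architecture, assuming scikit-learn: MLPClassifier stands in for the DNNs, two RandomForestClassifiers stand in for the CART and C4.5 forests (scikit-learn only grows CART-style trees), and synthetic data replaces the extracted code metrics:

```python
# Stacking ensemble sketch: 5 base classifiers + a neural-network meta classifier.
# Assumes scikit-learn; MLPClassifier approximates the DNNs described in the slides.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, NuSVC

X, y = make_classification(n_samples=800, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

base_classifiers = [
    ("dnn", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)),
    ("rf_a", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("rf_b", RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=1)),
    ("c_svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ("nu_svm", make_pipeline(StandardScaler(), NuSVC(probability=True, random_state=0))),
]

# The base classifiers' outputs become the meta classifier's input features.
stack = StackingClassifier(
    estimators=base_classifiers,
    final_estimator=MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    cv=5,
)
stack.fit(X_tr, y_tr)
print("stacking accuracy:", stack.score(X_te, y_te))
```

StackingClassifier builds the meta features via internal cross-validation, which matches the idea of feeding the base classifiers' outputs to the meta classifier rather than reusing their training-set predictions directly.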
Code Metrics
Line Length: measures the number of characters in one source code line.
Line Words: measures the number of words in one source code line.
Comments Frequency: calculates the relative frequency of line comments, block comments, and optionally doc-comments used by the programmer.
Identifiers Length: calculates the length of each identifier in a program.
Inline Space-Tab: calculates the whitespace that occurs in the interior of non-whitespace lines.
Trail Space-Tab: measures the whitespace and tab occurrences at the end of each non-whitespace line.
Indent Space-Tab: calculates the indentation whitespace used at the beginning of each non-whitespace line.
Underscores: measures the number of underscore characters used in identifiers.
(An illustrative extraction sketch follows.)
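An illustrative sketch of how a few of these metrics could be computed per line, written from the metric descriptions above; the slides do not give the extraction code, so the function name and details here are assumptions:

```python
# Hypothetical per-line metric extraction, based on the metric descriptions above.
import re

def line_metrics(line: str) -> dict:
    """Compute a handful of the code metrics listed above for one source line."""
    stripped = line.rstrip("\n")
    body = stripped.strip()
    indent = stripped[: len(stripped) - len(stripped.lstrip(" \t"))]
    trailing = stripped[len(stripped.rstrip(" \t")):]
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", body)
    return {
        "line_length": len(stripped),                             # characters in the line
        "line_words": len(body.split()),                          # whitespace-separated words
        "indent_space_tab": len(indent),                          # leading whitespace
        "trail_space_tab": len(trailing),                         # trailing whitespace
        "inline_space_tab": body.count(" ") + body.count("\t"),   # interior whitespace
        "underscores": sum(i.count("_") for i in identifiers),    # underscores in identifiers
        "identifier_lengths": [len(i) for i in identifiers],
    }

print(line_metrics("    total_count = foo_bar + 1  \n"))
```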
Block Diagram of The System
(Data-flow diagram: training source codes pass through the feature extractor to produce extracted features; these feed five base classifiers, namely a deep neural network, a random forest (CART), a random forest (C4.5), a C-SVM, and a ν-SVM; their outputs form the DNN, RF-CART, RF-C4.5, C-SVM, and ν-SVM meta features, which are fed to the meta classifier to produce the trained model.)
Algorithm
1. Extract code metrics from the training set
2. Convert the code metrics to feature vectors
3. For each model in {DNN, RF-CART, RF-C4.5, C-SVM, ν-SVM}:
4.     Train the model on the training features
5. Stack the outputs of each model to form meta features
6. Train the meta classifier on the meta features
7. Predict the authors of unknown samples using the classifiers
(A sketch of steps 3-6 follows.)
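A hedged sketch of steps 3-6, assuming scikit-learn: out-of-fold predictions from each base model are stacked column-wise to form the meta features on which the meta classifier is trained. The model choices and the synthetic data are illustrative stand-ins, not the paper's exact configuration:

```python
# Steps 3-6 as a manual sketch: build meta features from base-model predictions,
# then train the meta classifier on them. Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)

base_models = {
    "dnn": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(probability=True, random_state=0),
}

# Steps 3-5: out-of-fold class probabilities from each base model, stacked side by side.
meta_features = np.hstack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")
    for model in base_models.values()
])

# Step 6: the meta classifier is trained on the stacked meta features.
meta_clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
meta_clf.fit(meta_features, y)

# Step 7 (prediction) would refit the base models on all training data and feed
# their predict_proba outputs for unseen samples into meta_clf.predict.
```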
Challenges
• No unique author style
• Choosing a proper code metric set
  • Should be small
  • Relatively easy to compute
  • Programming-language independent
• Visualizing such high-dimensional data (196 features) and making decisions based on it
• Properly training each of the base models
• Choosing the proper configuration of the classifiers (an illustrative configuration sketch follows)
  • DNN: number of hidden layers, activation functions, loss, optimizer, regularizer
  • Random forest: number of trees, algorithm used to train the trees
  • SVM: bound on support vectors, kernel functions, bound on training error
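A hypothetical configuration dictionary showing the kinds of hyperparameter choices listed above; none of these values come from the slides, they only illustrate what has to be decided for each classifier:

```python
# Hypothetical hyperparameter choices of the kind described above (illustrative only).
configs = {
    "dnn": {
        "hidden_layers": [128, 64, 32],    # number and width of hidden layers
        "activation": "relu",
        "loss": "categorical_crossentropy",
        "optimizer": "adam",
        "regularizer": {"l2": 1e-4},
    },
    "random_forest": {
        "n_trees": 100,                    # number of trees in each forest
        "tree_algorithm": "CART",          # or "C4.5"
    },
    "svm": {
        "kernel": "rbf",
        "C": 1.0,                          # bound on training error (C-SVM)
        "nu": 0.5,                         # bound on support vectors (nu-SVM)
    },
}
```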
Datasets
• Open-source code from github.com
• 8 groups of authors
• 2270 contributors
• 6063 files
• Roughly 226 lines per file
• Author: the true owner of the source codes
• Contributor: someone who is not the owner of the source codes but willingly contributes to the project by writing or editing a segment of it

Class Label | Number of Authors | Number of Contributors
Azure | 3 | 136
GoogleCloudPlatform | 33 | 820
StackStorm | 2 | 147
dimagi | 2 | 101
enthought | 9 | 224
fp7-ofelia | 1 | 4
freenas | 2 | 126
sympy | 2 | 712
Experimental Result
Base classifiers:
Classifier | Accuracy
Deep neural network | 82%
CART-based random forest | 83%
C4.5-based random forest | 83%
C-support vector machine | 79%
ν-support vector machine | 79%
Stacking ensemble classifier:
• 87% accuracy
• F1-score: 0.86
Comparison With The Related Works
Method | Features | Language-independent features | Multiple authors | Number of classes | Number of source codes | Accuracy
SCAP | Byte-level n-gram | Yes | No | 30 | 333 | 69%
Information retrieval | Character-level n-gram | Yes | No | 100 | 1640 | 67%
Code metric histogram | 7 code metrics | Yes | No | 20 | 4068 | 55%
Genetic algorithm | 4 code metrics | Yes | No | 20 | NA | 75%
Deep neural network | 9 code metrics | No | No | 10, 10, 8, 5, 9 | 1644, 780, 475, 131, 520 | 93%, 93%, 93%, 78%, 89%
Support vector machine | 46 code metrics | No | No | 8, 53 | 8000, 502 | 98%, 80%
Stacking ensemble method | 7 code metrics | Yes | Yes | 8 (groups of authors) | 6063 | 87%
Conclusion
• In this paper, we designed a stacking ensemble classifier that can effectively identify the authorship of source code written by multiple authors.
• We found a relatively small set of code metrics that captures writing style while remaining language independent.
• We showed that a single machine learning classifier is not sufficient in the case of multiple writers.
• The accuracy we achieved is close to that of approaches that deal with a single author.
References
1. G. Frantzeskou, E. Stamatatos, E. Gritzalis, and S. Katsikas, “Source code
author identification based on n-gram author profiles,” in Artificial
Intelligence Applications and Innovations (K. K. B. M. Maglogiannis, I., ed.),
vol. 204, pp. 508-515, International Federation for Information Processing,
Springer, Boston, MA, 2006.
2. S. Burrows and S. Tahaghoghi, “Source code authorship attribution using n-
grams,” in Proceedings of the 12th Australasian Document Computing
Symposium (A. T. Amanda Spink and M. Wu, eds.), pp. 32-40, School of
Computer Science and Information Technology, RMIT University, 2007.
3. R. C. Lange and S. Mancoridis, “Using code metric histograms and genetic
algorithms to perform author identification for software forensics,” in
Proceedings of the 9th Annual Conference on Genetic and Evolutionary
Computation, London, (New York, NY, USA), pp. 2082-2089, ACM, 2007.
Reference Cont.
4. M. Shevertalov, J. Kothari, E. Stehle, and S. Mancoridis, “On the use of
discretized source code metrics for author identification,” in Proceedings of
51st International Symposium on Search Based Software Engineering (M. D.
Penta and S. Poulding, eds.), (Cumberland Lodge, Windsor, UK), pp. 69-78,
IEEE Computer Society, 2009.
5. U. Bandara and G. Wijayarathna, “Deep neural networks for source code
author identification,” in Proceedings of 20th International Conference on
Neural Information Processing (M. L. et al., ed.), vol. 2 of LNCS 8227,
pp. 368-375, Springer-Verlag Berlin Heidelberg, 2013.
6. C. Zhang, S. Wang, J. Wu, and Z. Niu, “Authorship identification of source
codes,” in Proceedings of Asia-Pacific Web (APWeb) and Web-Age Information
Management (WAIM) Joint Conference on Web and Big Data (C. L.,J. C., S. C.,
Y. X., and L. X., eds.), vol. 10366 of Lecture Notes in Computer Science,
(Cham, Switzerland), pp. 282-296, Springer, 2017.
Reference Cont.
7. R. Caruana, N. Karampatziakis, and A. Yessenalina, “An empirical evaluation
of supervised learning in high dimensions,” in Proceedings of the 25th
International Conference on Machine Learning, (Helsinki, Finland), pp. 96-
103, 2008.
8. S. Burrows, A. Uitdenbogerd, and A. Turpin, “Comparing techniques for
authorship attribution of source code,” Software: Practice and Experience,
vol. 44, 01 2014.
9. X. Yang, G. Xu, Q. Li, Y. Guo, and M. Zhang, “Authorship attribution of
source code by using back propagation neural network based on particle
swarm optimization,” PLOS ONE, vol. 12, pp. 1-18, 11 2017.
10. https://www.bbc.com/bengali/news/2016/03/160324_bangladesh_bank_cyb
er_heist_how_it_happened
Thank You
Equations
Forward pass:
$a = Wx + b$
$\hat{y} = g(a)$

Cost functions:
L2: $J = (\hat{y} - y)^2$
Logarithmic: $J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$
Categorical cross-entropy: $J = -\sum y \log \hat{y}$

Backward pass:
$dZ = A - y$
$dW = \frac{1}{m} \, dZ \cdot A^{T}$
$db = \frac{1}{m} \sum_{i=1}^{m} dZ^{[i]}$

Gradient descent:
$w = w - \alpha \, dw$

Adam:
$v_t = \beta_1 v_{t-1} + (1 - \beta_1) g_t$
$s_t = \beta_2 s_{t-1} + (1 - \beta_2) g_t^2$
$\Delta w_t = -\alpha \, \frac{v_t}{\sqrt{s_t} + \varepsilon}$
$w_{t+1} = w_t + \Delta w_t$
(A numerical sketch follows.)
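A small numerical sketch of the forward pass, backward pass, and Adam update above, using NumPy. The tiny logistic model and random data are placeholders, and the Adam form follows the equations as shown on the slide (no bias correction):

```python
# Numerical sketch of the forward/backward pass and the Adam update from the equations above.
import numpy as np

rng = np.random.default_rng(0)
m = 32                                     # batch size
X = rng.normal(size=(4, m))                # 4 input features, m examples (columns)
Y = (rng.random((1, m)) > 0.5).astype(float)

W = rng.normal(scale=0.1, size=(1, 4))
b = np.zeros((1, 1))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Adam state (first and second moments), matching the slide's v_t and s_t.
v_W = np.zeros_like(W)
s_W = np.zeros_like(W)
beta1, beta2, alpha, eps = 0.9, 0.999, 0.01, 1e-8

for t in range(1, 201):
    # Forward pass: a = Wx + b, y_hat = g(a)
    A = sigmoid(W @ X + b)
    # Backward pass: dZ = A - y, dW = (1/m) dZ A^T, db = (1/m) sum dZ
    dZ = A - Y
    dW = (dZ @ X.T) / m
    db = dZ.sum(axis=1, keepdims=True) / m
    # Adam: v_t and s_t accumulate the gradient and its square; update uses v / (sqrt(s) + eps)
    v_W = beta1 * v_W + (1 - beta1) * dW
    s_W = beta2 * s_W + (1 - beta2) * dW ** 2
    W += -alpha * v_W / (np.sqrt(s_W) + eps)
    b += -alpha * db                       # plain gradient descent on the bias, for brevity
```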
Machine Learning Training & Testing
(Training diagram: training input and training labels pass through the feature extractor; the extracted features and labels are fed to the ML algorithm, which produces the trained model. Testing diagram: test input passes through the feature extractor; the trained model predicts labels, which are compared with the test labels to compute accuracy.)
(Diagram of the base classifiers: a DNN; a CART-based random forest of 100 trees, CART 1 through CART 100; a C4.5-based random forest of 100 trees, C4.5 1 through C4.5 100; a C-SVM; and a ν-SVM.)
N-gram Based Approaches
Method | Author | Year | Limitation
SCAP [1] | Frantzeskou et al. | 2006 | Biased toward larger author profiles [8].
Information retrieval approach [2] | Burrows and Tahaghoghi | 2007 | Low accuracy (67%). Dataset consists of code from students, which is not standard [2].
Metric Based Approaches
Method | Author | Year | Limitation
Histogram and genetic algorithm [3] | Lange and Mancoridis | 2007 | Low accuracy (55%). Requires a new metric combination for each author [3]. Some of the features are unbounded [9].
Discretized source code metrics [4] | Shevertalov et al. | 2009 | Low accuracy (54% for files, 75% for projects). No metrics were programming-style based [4].
Metric Based Approaches Cont.
Method | Author | Year | Limitation
Deep neural network [5] | Bandara and Wijayarathna | 2013 | Feature set is not language independent.
Support vector machine [6] | Zhang et al. | 2017 | 18% accuracy gap from dataset to dataset. Feature set is not language independent.