Advanced Computing: An International Journal ( ACIJ ), Vol.3, No.4, July 2012
DOI: 10.5121/acij.2012.3403
‘CodeAliker’ - Plagiarism Detection on the Cloud
Nitish Upreti¹ and Rishi Kumar²
¹Department of Computer Science and Engineering, AMITY University, Noida, India.
nitishupreti@gmail.com
²Department of Computer Science and Engineering, AMITY University, Noida, India.
rishikumar182000@gmail.com
Abstract
Plagiarism is a pressing problem that academics face at every level of the educational system.
With the advent of digital content, the challenge of ensuring the integrity of academic work has
been amplified. This paper discusses arriving at a precise definition of plagiarized computer
code, the various solutions available for detecting plagiarism, and the building of a cloud
platform for plagiarism disclosure.
‘CodeAliker’, the application we developed, automates the submission of assignments and the
associated review process for essay text as well as computer code. It has been made available
under the GNU General Public License as Free and Open Source Software.
Keywords
Plagiarism, String matching, Cloud Computing
1. Introduction
A close look at academic integrity and its implications gives us the major motivation for
pursuing the subject. The issue holds utmost significance because the intellectual standing of
an individual pursuing an academic career is established around his or her ability to produce
authoritative work. Plagiarism is thus deeply damaging. Every year a large number of students
and scholars submit a huge volume of material to their respective mentors and professors. Due
to the sheer amount of text involved, manual scrutiny is infeasible. Analyzing the situation, we
found no existing work in the public domain that solved the problem faced by educational
institutes worldwide. Most of the alternatives were either closed source or catered to only a
fraction of the entire problem.
Working on this issue, we first explore the sensitive question of classifying documents as
‘authentic’ or ‘plagiarized’. We then analyze numerous approaches to plagiarism detection,
before advancing to our chief goal of implementing an engine and leveraging the cloud platform
for scalable and robust plagiarism detection. MOSS [1], developed by Alex Aiken, is chosen as
the key approach for building the application. Results and conclusions follow, where we present
our observations and learning.
2. Classification of Text
Broadly, the text submitted to such a system is either an essay, that is, plain text in a
natural language, or computer code in any popular language such as C, C++, Java or Ruby. It is
easy to figure out whether an essay has been plagiarized; source code copying, however, is a
delicate issue, with only a fine line drawn between ‘code reuse’, ‘collaboration’,
‘non-citation’ and ‘plagiarism’. With learning themes such as ‘Code Reuse’ in an OOP system and
‘Don’t re-invent the wheel’ coding philosophies, the distinction is even more blurred. With
hardly any definition in place, designing a system capable of accurately identifying copying
instances is infeasible. Hence concrete definitions need to be in place. Little work has been
done on the topic; the only concrete input comes from the work of Georgina Cosma and Mike Joy
[2]. Their work follows a survey-based approach among U.K. academics to find the right answers
and opinions. However, to implement a practical solution we need our own precise judgment on
the problem rather than a crude hypothesis.
A submitted code assignment consists of comments and the actual source code. Comments, which
occur in almost all sophisticated code bases, could be a major resource for identifying
plagiarism instances. Nonetheless, they present a major pitfall and could lead to
suspicious-looking false positives, as copyright statements occur frequently as comments. For
the purpose of CodeAliker we choose to strip off comments so as to avoid any such issues. Moving
forward, we address a multitude of other questions. Is using an external library or API an
instance of plagiarism? For most cases we found that library use without citation is
legitimate, as libraries are a central part of any sophisticated computer program. CodeAliker
thus filters out import statements and library includes. An intuitive User Interface design can
also be part of a submission; code for the design is put under scrutiny by CodeAliker, but
looking at the design manually is much more effective than plain UI code checking.
3. Approaches to Plagiarism Detection
Various approaches to plagiarism detection exist, and their performance and speed vary to a
great extent. Also, certain plagiarism detection schemes are more suitable for data sets with a
specific structure and nature. A rich taxonomy can be summarized in the diagram below.
Figure 3.1: Plagiarism Detection Approaches
  Web Scraping Based
  Local Database Based
    Structured
    Non-Structured
      Fingerprinting
      String Matching
      Parameterized Matching
Web scraping based approaches use the World Wide Web to check for plagiarism instances against
a large corpus of data. The scope of web scraping is huge and a lot of published work exists on
such systems. Our focus for this research is on systems based on a local database compiled from
assignments submitted by students taking the classes and from past years' submissions.
Local Database Based approaches can be either Structured or Non-Structured. The Structured
approach creates a graph model of the information in the document. This approach is used mostly
with code-based assignments.
Non-Structured techniques are the most popular and are useful on a wide variety of text
material. They are classified by the algorithm used: Document Fingerprinting, String Matching
and Parameterized Matching are the popular ones [3].
Tools based on the fingerprint approach work by creating “fingerprints” for each file, which
consist of statistical information about the file, such as the average number of terms per
line, the number of unique terms, and the number of keywords [4]. The DUP tool [5] is based on a
parameterized matching algorithm, which detects identical and near-duplicate sections of source
code by matching source-code sections whose identifiers have been substituted (renamed)
systematically [3].
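To make the idea of systematic renaming concrete, the sketch below encodes each identifier by
its distance to the previous occurrence of the same identifier (0 on first use), an encoding
associated with parameterized matching. Two token sequences that differ only by a consistent
renaming then encode identically. This is only an illustration of the principle, not DUP's
actual implementation; the reserved-word list and the helper names `parameter?`, `prev_encode`
and `p_match?` are assumptions made for this example.

```ruby
# Hypothetical reserved words; anything else that looks like a name is
# treated as a renameable parameter.
RESERVED_WORDS = %w[if else for while return int].freeze

def parameter?(token)
  token.match?(/\A[A-Za-z_]\w*\z/) && !RESERVED_WORDS.include?(token)
end

# Replace every parameter by the distance to its previous occurrence,
# or 0 if it has not been seen before. Other tokens are kept unchanged.
def prev_encode(tokens)
  last_seen = {}
  tokens.each_with_index.map do |token, i|
    next token unless parameter?(token)

    distance = last_seen.key?(token) ? i - last_seen[token] : 0
    last_seen[token] = i
    distance
  end
end

# Two token sequences match if they are equal up to a consistent renaming.
def p_match?(a, b)
  prev_encode(a) == prev_encode(b)
end

original = %w[x = x + y ;]
renamed  = %w[a = a + b ;]   # x -> a, y -> b throughout
altered  = %w[a = b + b ;]   # not a consistent renaming of the original

puts p_match?(original, renamed)  # prints "true"
puts p_match?(original, altered)  # prints "false"
```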
String matching algorithms are quite popular and effective. MOSS [1], YAP3 [6], JPlag [7] and
Sherlock [8] are some of the popular tools available. CodeAliker is based on MOSS [1], which
employs string matching on k-grams, where a k-gram is a contiguous substring of length k.
Winnowing, a local fingerprinting algorithm, is also used to ensure that matches of a certain
length are detected.
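As a small illustration of the k-gram notion, the following sketch lists every contiguous
substring of length k of a string. The function name `k_grams` is chosen for this example and
is not taken from the CodeAliker source.

```ruby
# Return every k-gram (contiguous substring of length k) of text,
# in order of occurrence. Texts shorter than k have no k-grams.
def k_grams(text, k)
  return [] if text.length < k

  (0..text.length - k).map { |i| text[i, k] }
end

p k_grams("adorunrun", 5)
# => ["adoru", "dorun", "orunr", "runru", "unrun"]
```

A text of length n yields n - k + 1 overlapping k-grams, which is why hashing them one by one
benefits from a rolling hash.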
4. Designing the Engine with Ruby
There were various motives for choosing MOSS as the core of CodeAliker’s engine. Ruby was used
to implement the engine after considering several important factors. The language provides
excellent text processing libraries and encourages an agile development methodology and Test
Driven Development (TDD). Moreover, it is ready for the web, with excellent frameworks
available.
MOSS is highly effective for plagiarism detection on text of differing nature. It can also be
scaled to handle a large volume of data. MOSS also guarantees that matches of a certain length
are detected [1].
The engine consists of three major modules: Text Filter, Hasher and Winnower. All of the
components can be customized with easy-to-write configuration files.
The Text Filter has a key role to play when processing code assignments. Based on the approach
MOSS suggests, comments are stripped off, text is lowercased, identifiers are replaced with a
dummy symbol, language-specific keywords are removed and punctuation with no semantic meaning
is stripped off. Filtered text with the noise eliminated is thus obtained.
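The filtering steps can be sketched roughly as follows for C-like input. This is a minimal
illustration under stated assumptions, not CodeAliker's actual filter: the keyword list, the
dummy symbol `v`, the choice of `;` and `,` as noise punctuation and the function name
`filter_code` are all hypothetical, and the regular expressions do not handle comment markers
inside string literals.

```ruby
# Hypothetical settings; the paper states only that such values live in
# configuration files, not what they contain.
FILTER_KEYWORDS   = %w[int float char void return if else for while].freeze
DUMMY_SYMBOL      = "v"
NOISE_PUNCTUATION = [";", ","].freeze

def filter_code(source)
  text = source.gsub(%r{/\*.*?\*/}m, " ")   # strip /* block */ comments
  text = text.gsub(%r{//[^\n]*}, " ")       # strip // line comments
  # Drop include and import lines, as described in Section 2.
  text = text.lines.reject { |line| line =~ /\A\s*(?:\#include|import)\b/ }.join
  text = text.downcase

  # Split into names, numbers and single punctuation characters.
  tokens = text.scan(/[a-z_][a-z0-9_]*|\d+|[^\sa-z0-9_]/)
  tokens = tokens.reject { |t| FILTER_KEYWORDS.include?(t) }      # remove keywords
  tokens = tokens.map { |t| t =~ /\A[a-z_]/ ? DUMMY_SYMBOL : t }  # identifiers -> dummy
  tokens = tokens.reject { |t| NOISE_PUNCTUATION.include?(t) }    # noise punctuation
  tokens.join
end

a = "#include <stdio.h>\nint sum(int a, int b) { // add\n  return a + b; }\n"
b = "int total(int x, int y) { return x + y; }"
puts filter_code(a)  # prints "v(vv){v+v}"
puts filter_code(b)  # prints "v(vv){v+v}"
```

Both inputs reduce to the same filtered text even though names, comments and layout differ,
which is the purpose of the filter. A filter that also has to report line numbers would need to
carry the original line of each token along; that bookkeeping is omitted here.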
The filtered text is then fed to the Hasher, which calculates hashes for the k-grams of the
text. A rolling hash function based on the well-known Rabin-Karp algorithm is employed to
calculate hashes quickly. With each hash value calculated, the line number where the
corresponding text occurred is stored. This later helps in presenting the user with the
locations where plagiarized text is present.
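A rolling hash in the Rabin-Karp style can be sketched as below. Each k-gram hash is derived
from the previous one in constant time by removing the contribution of the leading character
and appending the next one. The radix of 256 and the function name `rolling_hashes` are
assumptions for this example; the modulus 10001 is the ‘q’ value reported in the paper. Here the
line number recorded for a k-gram is the line on which it starts, which is one possible
convention.

```ruby
BASE    = 256      # assumed radix; the paper does not state one
MODULUS = 10_001   # the 'q' value reported in the paper

# Returns [[hash, line_number], ...] for every k-gram of text.
# Newlines are not hashed; they only advance the line counter.
def rolling_hashes(text, k, base: BASE, mod: MODULUS)
  chars = []
  line = 1
  text.each_char do |c|
    if c == "\n"
      line += 1
    else
      chars << [c.ord, line]
    end
  end
  return [] if chars.length < k

  high = base.pow(k - 1, mod)   # weight of the leading character, modulo mod
  hash = 0
  chars.first(k).each { |code, _| hash = (hash * base + code) % mod }
  result = [[hash, chars[0][1]]]

  (k...chars.length).each do |i|
    leading = chars[i - k][0]
    # Drop the leading character, shift, and append the new character.
    hash = ((hash - leading * high) * base + chars[i][0]) % mod
    result << [hash, chars[i - k + 1][1]]
  end
  result
end

rolling_hashes("ab\ncdefgh", 5).each do |hash, line_number|
  puts "hash #{hash} starts on line #{line_number}"
end
```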
The Winnower is an implementation of the ‘Robust Winnowing’ algorithm defined by MOSS. A subset
of the hashes is chosen as the fingerprint of a document. Line number information is still
preserved.
The Winnower needs to be configured with a value ‘k’ for the k-gram length, a threshold value
‘t’ and a modulus value ‘q’. If there is a substring match at least as long as the guarantee
threshold ‘t’, then this match is detected, and we do not detect any matches shorter than the
noise threshold ‘k’ [1]. The computed hash values are too large and hinder a scalable
implementation; hence a value ‘q’ is used as the modulus.
For CodeAliker we found the sweet spot with the values k = 5, t = 8 and q = 10001.
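The selection step can be sketched as follows, using the window size w = t - k + 1 from the
winnowing paper [1], so that k = 5 and t = 8 give w = 4. In every window of w consecutive
hashes the minimum is selected; under robust winnowing a tie is broken in favour of the hash
already selected by the previous window, and otherwise the rightmost minimum is taken. The
function name `winnow` and the `[hash, line]` pair layout are assumptions for this example, not
CodeAliker's actual interface.

```ruby
# hashes: array of [hash, line] pairs in document order.
# Returns the selected fingerprints, each with its line number.
def winnow(hashes, k:, t:)
  return [] if hashes.empty?

  w = [t - k + 1, hashes.length].min   # short inputs form a single window
  fingerprints = []
  prev = nil                           # index selected by the previous window

  (0..hashes.length - w).each do |start|
    window = start...(start + w)
    min = window.map { |i| hashes[i][0] }.min

    pick =
      if prev && prev >= start && hashes[prev][0] == min
        prev                           # robust rule: keep the earlier selection
      else
        window.to_a.reverse.find { |i| hashes[i][0] == min }  # rightmost minimum
      end

    fingerprints << hashes[pick] if pick != prev
    prev = pick
  end
  fingerprints
end

values = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
hashes = values.each_with_index.map { |h, i| [h, i] }  # position stands in for the line
p winnow(hashes, k: 5, t: 8)
# => [[17, 3], [17, 6], [8, 8], [39, 11], [17, 15]]
```

Because every window contributes at least one selected hash, any match spanning a full window
of k-grams, that is, at least t characters, shares a fingerprint in both documents.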
The documents are compared based on their final fingerprints, with plagiarism instances being
reported line by line. The check for essay-based assignments is very similar, with the Filter
step omitted.
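One simple way to turn two fingerprints into a line-by-line report is to index one document's
hashes and look up the other's, as sketched below. The function name `matching_lines` and the
`[hash, line]` pair layout are assumptions for this example; how CodeAliker scores or ranks the
matches is not described in the paper and is not modelled here.

```ruby
# primary, other: fingerprints as [[hash, line], ...].
# Returns sorted [primary_line, other_line] pairs that share a hash.
def matching_lines(primary, other)
  index = Hash.new { |h, key| h[key] = [] }
  other.each { |hash, line| index[hash] << line }

  matches = []
  primary.each do |hash, line|
    # fetch with a default avoids creating empty entries for misses
    index.fetch(hash, []).each { |other_line| matches << [line, other_line] }
  end
  matches.uniq.sort
end

primary = [[17, 3], [8, 9], [39, 12]]
other   = [[8, 2], [50, 4], [17, 7], [17, 8]]
p matching_lines(primary, other)
# => [[3, 7], [3, 8], [9, 2]]
```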
5. Building a Cloud Application
The most interesting part of our research is building a cloud application around the engine.
For building the web application we employ the Ruby on Rails platform.
Ruby on Rails, a full-stack framework for Ruby, is excellent for agile development and
sustainable productivity. It boasts a highly modular design, excellent package management
capabilities, database abstraction through an ORM (Object Relational Mapping) library and ease
of deployment.
The application is built with the MVC (Model-View-Controller) design pattern inherent in the
Rails framework. CodeAliker aims to ease the workflow involved by automating the entire
process. To achieve this, an authentication-based system is introduced for professors, where
the assignments for each class they teach are available to them as a separate group.
The professor can mark any one of the assignments as primary and check all the possible
plagiarized instances with respect to that assignment, thus getting a complete view of the
scenario effortlessly and in a systematic fashion. Academics can also manually supervise the
submissions and reviews.
While the traditional delivery of software services has been mainstream, bringing the cloud
into perspective changes the entire scenario. The cloud tends to centralize our resources, code
base and data onto an always-available depot. Hardware resources can be utilized efficiently,
load on a high-demand system can be catered to and the system can be easily scaled.
Configuration management is also made effortless. Any organization with requirements for such a
system stands to benefit from the cloud. The platform allows changes to be pushed onto the
codebase with the push of a button, rather than relying on extensive upgrade packs.
For plagiarism detection in a university or an academic institute the needs are critical and
point towards the cloud. The computing requirements are thus addressed. Also, our system is
centralized and easily scaled.
For hosting CodeAliker on the cloud, Heroku, a cloud platform, has been employed.
The source code is publicly available at: https://github.com/Myth17/CodeAliker
The application is available for free use at: http://codealiker.heroku.com/
6. Results
The results page of CodeAliker displays the assignment marked as primary on the left, while the
suspected plagiarism instances are previewed stacked onto each other on the right. To present a
clearer picture, the plagiarized instances are marked with line numbers in order to aid manual
scrutiny and present a more cohesive report.
Figure 6.1
Figure 6.2
7. Conclusion
We have analyzed the entire scenario of plagiarism detection, while figuring out the problems
and solutions involved in developing an application for the purpose. The major accomplishments
of the research are our ability to arrive at a precise definition of code plagiarism, an
understanding of the different approaches towards plagiarism detection, a practical
implementation of the MOSS engine with fine-tuned parameters, and the building of a scalable
web application hosted on a cloud platform.
References
[1] Saul Schleimer, Daniel S. Wilkerson and Alex Aiken, “Winnowing: Local Algorithms for Document
Fingerprinting”, SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on
Management of Data, pp. 76-85, 2003.
[2] Georgina Cosma and Mike Joy, “Towards a definition of Source Code Plagiarism”, IEEE
Transactions on Education, vol. 51, no. 2, May 2008.
[3] Georgina Cosma and Mike Joy, “An Approach to Source-Code Plagiarism Detection and
Investigation Using Latent Semantic Analysis”, IEEE Transactions on Computers, vol. 61, no. 3,
p. 379, March 2012.
[4] M. Mozgovoy, “Desktop Tools for Offline Plagiarism Detection in Computer Programs”,
Informatics in Education, vol. 5, no. 1, pp. 97-112, 2006.
[5] B. Baker, “On Finding Duplication and Near-Duplication in Large Software Systems”, Proc. IEEE
Second Working Conf. Reverse Eng., pp. 85-95, 1995.
[6] M.J. Wise, “YAP3: Improved Detection of Similarities in Computer Program and Other Texts”,
Proc. 27th SIGCSE Technical Symp., pp. 130-134, 1996.
[7] L. Prechelt, G. Malpohl, and M. Philippsen, “Finding Plagiarisms Among a Set of Programs with
JPlag”, J. Universal Computer Science, vol. 8, no. 11, pp. 1016-1038, 2002.
[8] M. Joy and M. Luck, “Plagiarism in Programming Assignments”, IEEE Trans. Education, vol. 42,
no. 2, pp. 129-133, May 1999.
Authors
Nitish Upreti is a computer science student at Amity School of Engineering and
Technology, Noida, India. His fields of interest include Algorithm Design, Artificial
Intelligence and building scalable web applications.
Rishi Kumar is a member of the computer science faculty at Amity School of Engineering and
Technology, Noida, India. His fields of interest include Artificial Intelligence,
Expert Systems and Image Processing.

More Related Content

PDF
The Plagiarism Detection Systems for Higher Education - A Case Study in Saudi...
PDF
Convolutional Neural Networks
PDF
A framework for plagiarism
PDF
Developing an arabic plagiarism detection corpus
PDF
csmalware_malware
PDF
Automated server-side model for recognition of security vulnerabilities in sc...
PDF
Hybrid Feature Classification Approach for Malicious JavaScript Attack Detect...
PDF
A New Metric for Code Readability
The Plagiarism Detection Systems for Higher Education - A Case Study in Saudi...
Convolutional Neural Networks
A framework for plagiarism
Developing an arabic plagiarism detection corpus
csmalware_malware
Automated server-side model for recognition of security vulnerabilities in sc...
Hybrid Feature Classification Approach for Malicious JavaScript Attack Detect...
A New Metric for Code Readability

What's hot (20)

PDF
IRJET- An Effective Analysis of Anti Troll System using Artificial Intell...
PDF
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
PDF
A Hybrid Method of Long Short-Term Memory and AutoEncoder Architectures for S...
PDF
Software Birthmark for Theft Detection of JavaScript Programs: A Survey
PDF
Deepcoder to Self-Code with Machine Learning
PDF
Vulnerability Assessment and Penetration Testing using Webkill
PDF
IRJET- Analysis and Detection of E-Mail Phishing using Pyspark
PDF
Performance analysis on secured data method in natural language steganography
PDF
A PPLICATION OF C LASSICAL E NCRYPTION T ECHNIQUES FOR S ECURING D ATA -...
PPT
Mining Unstructured Software Repositories Using IR Models
PDF
M phil-computer-science-cryptography-projects
PDF
IRJET- A Survey for an Efficient Secure Guarantee in Network Flow
PDF
COMPARISON OF MALWARE CLASSIFICATION METHODS USING CONVOLUTIONAL NEURAL NETWO...
PDF
Securing Cloud Using Fog: A Review
PPTX
New enterprise application and data security challenges and solutions apr 2...
PDF
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
PDF
Detection of multiword from a wordnet is complex
PDF
A survey of cloud based secured web application
PDF
IRJET- QUEZARD : Question Wizard using Machine Learning and Artificial Intell...
PDF
Privacy-Preserving Updates to Anonymous and Confidential Database
IRJET- An Effective Analysis of Anti Troll System using Artificial Intell...
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A Hybrid Method of Long Short-Term Memory and AutoEncoder Architectures for S...
Software Birthmark for Theft Detection of JavaScript Programs: A Survey
Deepcoder to Self-Code with Machine Learning
Vulnerability Assessment and Penetration Testing using Webkill
IRJET- Analysis and Detection of E-Mail Phishing using Pyspark
Performance analysis on secured data method in natural language steganography
A PPLICATION OF C LASSICAL E NCRYPTION T ECHNIQUES FOR S ECURING D ATA -...
Mining Unstructured Software Repositories Using IR Models
M phil-computer-science-cryptography-projects
IRJET- A Survey for an Efficient Secure Guarantee in Network Flow
COMPARISON OF MALWARE CLASSIFICATION METHODS USING CONVOLUTIONAL NEURAL NETWO...
Securing Cloud Using Fog: A Review
New enterprise application and data security challenges and solutions apr 2...
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
Detection of multiword from a wordnet is complex
A survey of cloud based secured web application
IRJET- QUEZARD : Question Wizard using Machine Learning and Artificial Intell...
Privacy-Preserving Updates to Anonymous and Confidential Database
Ad

Similar to ‘CodeAliker’ - Plagiarism Detection on the Cloud (20)

PDF
A Survey On Plagiarism Detection
PDF
A Tool to Detect Plagiarism in Java Source Code.pdf
PPTX
A tool for Detecting Source Code Plagarism-SourcePlag
PDF
A Literature Review on Plagiarism Detection in Computer Programming Assignments
PPTX
Plagiarism introduction
PPTX
Codequiry: A Reliable Solution for Code Plagiarism Detection.pptx
PDF
Codequiry: A Code Similarity Checker Every Developer Should Know
PPTX
Code Plagiarism Checker - Code Quiry
PDF
Check Code For Plagiarism With Codequiry
PDF
Advanced Code Plagiarism Detection: Codequiry
PPT
Plag detection
PDF
Review of plagiarism detection and control & copyrights in India
PDF
Advanced Coding Plagiarism Checker By Codequiry
PDF
A Review Of Plagiarism Detection Based On Lexical And Semantic Approach
PDF
Codequiry Advanced Source Code Plagiarism Checker
PDF
IRJET - Online Assignment Plagiarism Checking using Data Mining and NLP
PDF
Codequiry: Advance Source Code Plagiarism Checker
PDF
Why Codequiry is the Best Coding Plagiarism Checker for Developers
PDF
How a Code Plagiarism Checker Protects Originality in Programming
PDF
Advanced Code Similarity Checker Python By Codequiry
A Survey On Plagiarism Detection
A Tool to Detect Plagiarism in Java Source Code.pdf
A tool for Detecting Source Code Plagarism-SourcePlag
A Literature Review on Plagiarism Detection in Computer Programming Assignments
Plagiarism introduction
Codequiry: A Reliable Solution for Code Plagiarism Detection.pptx
Codequiry: A Code Similarity Checker Every Developer Should Know
Code Plagiarism Checker - Code Quiry
Check Code For Plagiarism With Codequiry
Advanced Code Plagiarism Detection: Codequiry
Plag detection
Review of plagiarism detection and control & copyrights in India
Advanced Coding Plagiarism Checker By Codequiry
A Review Of Plagiarism Detection Based On Lexical And Semantic Approach
Codequiry Advanced Source Code Plagiarism Checker
IRJET - Online Assignment Plagiarism Checking using Data Mining and NLP
Codequiry: Advance Source Code Plagiarism Checker
Why Codequiry is the Best Coding Plagiarism Checker for Developers
How a Code Plagiarism Checker Protects Originality in Programming
Advanced Code Similarity Checker Python By Codequiry
Ad

More from acijjournal (20)

PDF
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
PDF
July 2025-Top 10 Read articles ACIJ Advanced Computing: An International Jour...
PDF
MODEL AND ALGORITHM FOR INCREASING THE EFFICIENCY OF REMOTE SERVICE SYSTEMS S...
PDF
15th International Conference on Computer Science, Engineering and Applicatio...
PDF
4th International Conference on Computer Science and Information Technology (...
PDF
APPLICATION AND ANALYSIS OF ENSEMBLE ALGORITHMS IN SOLVING REGRESSION PROBLEMS
PDF
4th International Conference on Computer Science and Information Technology (...
PDF
Application and Analysis of Ensemble Algorithms in Solving Regression Problems
PDF
17th International Conference on Networks & Communications (NeTCoM 2025)
PDF
METHODS AND ALGORITHMS FOR ASSESSING COMPUTER NETWORK PERFORMANCE
PDF
Advanced Computing: An International Journal (ACIJ)
PDF
6 th International Conference on Data Mining and Software Engineering (DMSE 2...
PDF
ARTICLE :OVERVIEW OF STRUCTURE FROM MOTION
PDF
14th International Conference on Advanced Information Technologies and Applic...
PDF
2nd International Conference on Information Technology Convergence Services &...
PDF
Advanced Computing: An International Journal ( ACIJ )
PDF
3rd International Conference on Computer Science, Engineering and Artificia...
PDF
6th International Conference on Big Data and Machine Learning (BDML 2025)
PDF
METHODS AND ALGORITHMS FOR ASSESSING COMPUTER NETWORK PERFORMANCE
PDF
4th International Conference on Computing and Information Technology Trends (...
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
July 2025-Top 10 Read articles ACIJ Advanced Computing: An International Jour...
MODEL AND ALGORITHM FOR INCREASING THE EFFICIENCY OF REMOTE SERVICE SYSTEMS S...
15th International Conference on Computer Science, Engineering and Applicatio...
4th International Conference on Computer Science and Information Technology (...
APPLICATION AND ANALYSIS OF ENSEMBLE ALGORITHMS IN SOLVING REGRESSION PROBLEMS
4th International Conference on Computer Science and Information Technology (...
Application and Analysis of Ensemble Algorithms in Solving Regression Problems
17th International Conference on Networks & Communications (NeTCoM 2025)
METHODS AND ALGORITHMS FOR ASSESSING COMPUTER NETWORK PERFORMANCE
Advanced Computing: An International Journal (ACIJ)
6 th International Conference on Data Mining and Software Engineering (DMSE 2...
ARTICLE :OVERVIEW OF STRUCTURE FROM MOTION
14th International Conference on Advanced Information Technologies and Applic...
2nd International Conference on Information Technology Convergence Services &...
Advanced Computing: An International Journal ( ACIJ )
3rd International Conference on Computer Science, Engineering and Artificia...
6th International Conference on Big Data and Machine Learning (BDML 2025)
METHODS AND ALGORITHMS FOR ASSESSING COMPUTER NETWORK PERFORMANCE
4th International Conference on Computing and Information Technology Trends (...

Recently uploaded (20)

DOCX
573137875-Attendance-Management-System-original
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PDF
PPT on Performance Review to get promotions
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPTX
bas. eng. economics group 4 presentation 1.pptx
PDF
composite construction of structures.pdf
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPTX
additive manufacturing of ss316l using mig welding
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PDF
Well-logging-methods_new................
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
573137875-Attendance-Management-System-original
CYBER-CRIMES AND SECURITY A guide to understanding
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPT on Performance Review to get promotions
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
bas. eng. economics group 4 presentation 1.pptx
composite construction of structures.pdf
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Operating System & Kernel Study Guide-1 - converted.pdf
additive manufacturing of ss316l using mig welding
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Well-logging-methods_new................
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
R24 SURVEYING LAB MANUAL for civil enggi
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx

‘CodeAliker’ - Plagiarism Detection on the Cloud

  • 1. Advanced Computing: An International Journal ( ACIJ ), Vol.3, No.4, July 2012 DOI : 10.5121/acij.2012.3403 21 ‘CodeAliker’ - Plagiarism Detection on the Cloud Nitish Upreti1 and Rishi Kumar2 Department of Computer Science and Engineering, AMITY University, Noida, India. nitishupreti@gmail.com Department of Computer Science and Engineering, AMITY University, Noida, India. rishikumar182000@gmail.com Abstract Plagiarism is a burning problem that academics have been facing in all of the varied levels of the educational system. With the advent of digital content, the challenge to ensure the integrity of academic work has been amplified. This paper discusses on defining a precise definition of plagiarized computer code, various solutions available for detecting plagiarism and building a cloud platform for plagiarism disclosure. ‘CodeAliker’, our application thus developed automates the submission of assignments and the review process associated for essay text as well as computer code. It has been made available under the GNU’s General Public License as a Free and Open Source Software. Keywords Plagiarism, String matching, Cloud Computing 1.Introduction An insightful look into the scenario of academic integrity and its implications give us the major motivation for pursuing the subject. The issue holds utmost significance as the intellectual standards of an individual pursuing an academia a reestablished around his ability to produce authoritative work. Plagiarism is thus lethal. Every year a large number of students and scholars submit a huge volume of material to their respective mentors and professors. Due to the sheer amount of text involved, a manual scrutiny is infeasible. Analyzing the situation, we found no existing work in the public domain that solved the problem faced by educational institutes worldwide. Most of the alternatives were either closed source or catered to only a fraction of the entire problem. 
Working on this issue, at the outset we explore the sensitive aspect of classification of documents as ‘authentic’ or ‘plagiarized’. We then analyze numerous approaches to Plagiarism detection. Advancing then to our chief goal of implementing an engine and leveraging the cloud platform for scalable and robust plagiarism detection. Alex Aliken’s MOSS[1] is chosen as the key approach for building the application. Result and conclusion follow where we present our observations and learning. 2. Classification of Text Broadly categorizing, the nature of text submitted to such a system can either be an Essay that is plain text in language or computer code in any of the popular language such as C, C++, Java or Ruby for instance. It is easy to figure out whether an essay text has been plagiarized however source code copying is a delicate issue with mostly a fine line drawn between ‘code reuse’,
  • 2. Advanced Computing: An International Journal ‘collaboration’, ‘non-citation’ and ‘plagiarism’ an OOP system and ‘Don’t re-invent the wheel more blurred. With hardly any definition identifying copying instances is infeasible. Hence little work has been done on the topic; the only concrete input comes from the work of Cosma and Mike Joy [2]. Their work follows a finding the right answers and opinions to have our own precise judgment on assignment submitted consists of in almost all the sophisticated code bases, could be a major potential resource for plagiarism instances. Nonetheless looking false positives as Copyright statements occur of CodeAliker we choose to strip off comments we address a multitude of other questions. Is using an external library or API an instance of plagiarism? For most of the cases are a central part of any sophisticated piece import statements and library includes. submission; code for the design is manually is much more effective than 3. Approaches to Plagiarism Detection Various different approaches to Plagiarism detection exist and their performance and speed to a great extent. Also certain plagiarism detection sche specific structure and nature. A rich t Plagiarism Web Scrapping Based String Matching Advanced Computing: An International Journal ( ACIJ ), Vol.3, No.4, July 2012 citation’ and ‘plagiarism’. With learning themes such as ‘Code Reuse’ in invent the wheel’ code philosophies, the distinction are With hardly any definition in place designing a system capable of accurately nstances is infeasible. Hence concrete definitions need to be in place. topic; the only concrete input comes from the work of [2]. Their work follows a survey-based approach in the U.K academics finding the right answers and opinions. However for implementing a practical solution we need judgment on the problem rather than a crude hypothesis submitted consists of comments and the actual source code. 
3. Approaches to Plagiarism Detection
Various approaches to plagiarism detection exist, and their performance and speed vary to a great extent. Also, certain plagiarism detection schemes are more suitable for data sets with a specific structure and nature. A rich taxonomy can be summarized in the diagram below.
[Figure 3.1: Plagiarism Detection Approaches. Plagiarism detection divides into Web Scraping Based and Local Database Based approaches; Local Database Based approaches are Structured or Non Structured; Non Structured approaches comprise Fingerprinting Techniques, String Matching Techniques and Parameterized Matching Techniques.]
Web scraping based approaches use the World Wide Web to check for plagiarism instances against a large corpus of data. The scope of Web scraping is huge, and much published work exists on such systems. Our focus in this research is on systems based on a local database compiled from assignments submitted by students taking the classes and from past years' submissions.
Local database based approaches can be either structured or non-structured. The structured approach creates a graph model of the information in the document and is used mostly with code-based assignments. Non-structured techniques are the most popular and are useful on a wide variety of text material. They are classified by the algorithm used; document fingerprinting, string matching and parameterized matching are the popular ones [3]. Tools based on the fingerprint approach work by creating “fingerprints” for each file, which consist of statistical information about the file, such as the average number of terms per line, the number of unique terms, and the number of keywords [4]. The DUP tool [5] is based on a parameterized matching algorithm, which detects identical and near-duplicate sections of source code by matching source-code sections whose identifiers have been substituted (renamed) systematically [3].
String matching algorithms are quite popular and effective. MOSS [1], YAP3 [6], JPlag [7] and Sherlock [8] are some of the popular tools available. CodeAliker is based on MOSS [1], which employs string matching using k-grams, where a k-gram is a contiguous substring of length k. Winnowing, a local fingerprinting algorithm, is also used to ensure that matches of a certain length are detected.
4. Designing the Engine with Ruby
There were various motives for choosing MOSS as the core of CodeAliker’s engine. Ruby was chosen to implement the engine after considering several important factors.
The language provides excellent text processing libraries and encourages an agile development methodology and Test Driven Development (TDD). Moreover, it is ready for the web, with excellent frameworks available. MOSS is highly effective for plagiarism detection on text of different natures and can be scaled to handle a large volume of data. It also guarantees that matches of a certain length are detected [1].
The engine consists of three major modules: Text Filter, Hasher and Winnower. All of the components can be customized with easy-to-write configuration files.
The Text Filter plays a key role when processing code assignments. Following the approach MOSS suggests, comments are stripped off, text is lowercased, identifiers are replaced with a dummy symbol, language-specific keywords are removed, and punctuation with no semantic meaning is stripped off. The result is filtered text with the noise eliminated.
The filtered text is then fed to the Hasher, which calculates hashes for the given text. A rolling hash function based on the well-known Rabin-Karp algorithm is employed to calculate hashes quickly. With each hash value, the line number where the corresponding text occurred is stored. This later helps in showing the user where plagiarized text is present.
The Winnower is an implementation of the ‘Robust Winnowing’ algorithm defined by MOSS. A set of hashes is chosen as the fingerprint of a document. Line number information is still preserved.
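The Hasher and Winnower steps can be sketched in Ruby roughly as follows. This is a minimal sketch under stated assumptions, not the CodeAliker implementation: the names `Fingerprint`, `kgram_hashes` and `winnow`, the hash base of 256, the skipping of whitespace, and attaching each hash to the line where its k-gram starts are all illustrative choices. The window size w = t - k + 1 and the tie-breaking rule follow the robust winnowing definition in [1]; the defaults for k, t and q mirror the values reported for CodeAliker.

```ruby
# Illustrative sketch only; names and the base value are assumptions.
Fingerprint = Struct.new(:value, :line)

# Rabin-Karp rolling hash over every k-gram of the text, reduced modulo q.
# Whitespace is skipped; each hash keeps the 1-based line where its k-gram starts.
def kgram_hashes(text, k: 5, q: 10_001, base: 256)
  chars = []
  text.each_line.with_index(1) do |line, number|
    line.each_char { |c| chars << [c.ord, number] unless c =~ /\s/ }
  end
  return [] if chars.length < k

  high = base.pow(k - 1, q) # base^(k-1) mod q, weight of the outgoing character
  hash = chars.first(k).reduce(0) { |acc, (code, _)| (acc * base + code) % q }
  hashes = [Fingerprint.new(hash, chars[0][1])]
  (k...chars.length).each do |i|
    outgoing = chars[i - k][0]
    hash = ((hash - outgoing * high) * base + chars[i][0]) % q
    hashes << Fingerprint.new(hash, chars[i - k + 1][1])
  end
  hashes
end

# Robust winnowing: slide a window of w = t - k + 1 hashes and select the
# minimum of each window. On ties, keep the previous selection if it is still
# a minimum of the current window; otherwise take the rightmost minimum.
def winnow(hashes, k: 5, t: 8)
  window = t - k + 1
  selected = []
  last = nil # index of the most recently selected hash
  (0..hashes.length - window).each do |start|
    range = (start...start + window)
    min = range.map { |i| hashes[i].value }.min
    next if last && range.cover?(last) && hashes[last].value == min
    last = range.to_a.reverse.find { |i| hashes[i].value == min }
    selected << hashes[last]
  end
  selected
end
```

A document's fingerprint would then be `winnow(kgram_hashes(filtered_text))`. In this sketch, a text that yields fewer than w hashes produces no fingerprints at all.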
The Winnower needs to be configured with a value ‘k’ for the k-gram size, a threshold value ‘t’ and a modulus value ‘q’. If there is a substring match at least as long as the guarantee threshold ‘t’, then this match is detected, and no matches shorter than the noise threshold ‘k’ are detected [1]. The hash values computed are too large and hinder a scalable implementation; hence the value ‘q’ is used as the modulus. For CodeAliker we found the sweet spot at k = 5, t = 8 and q = 10001. The documents are compared based on the final fingerprints, with plagiarism instances reported line by line. The check for essay-based assignments is very similar, with the Filter step omitted.
5. Building a Cloud Application
The most interesting part of our research is building a cloud application for the engine. For the web application we employ Ruby on Rails, a full-stack framework for Ruby that is excellent for agile development and sustainable productivity. It boasts a highly modular design, excellent package management capabilities, database abstraction through an ORM (Object Relational Mapping) library and ease of deployment. The application is built with the MVC (Model-View-Controller) design pattern inherent in the Rails framework.
CodeAliker aims to ease the workflow by automating the entire process. To achieve this, an authentication-based system is introduced for professors, in which the assignments for each class they teach are available to them as a separate group. The professor can mark any one assignment as primary and check all the possible plagiarized instances with respect to that assignment, thus getting a complete view of the scenario effortlessly rather than in an ad hoc fashion. Academics can also manually supervise the submissions and reviews.
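The primary-versus-rest check described above could be sketched in plain Ruby as a comparison of fingerprint lists. The sketch assumes each fingerprint is a `[hash_value, line_number]` pair and that the other submissions arrive as a hash keyed by file name. The function names, the data layout and the ordering by number of matches are hypothetical and do not reflect the actual Rails models in CodeAliker.

```ruby
# Hypothetical sketch: a fingerprint is a [hash_value, line_number] pair.

# Returns sorted [primary_line, other_line] pairs whose hash values coincide.
def matching_lines(primary, other)
  lines_by_hash = Hash.new { |table, key| table[key] = [] }
  other.each { |value, line| lines_by_hash[value] << line }
  pairs = primary.flat_map do |value, line|
    lines_by_hash.fetch(value, []).map { |other_line| [line, other_line] }
  end
  pairs.uniq.sort
end

# Compares the primary assignment with every other submission and lists the
# suspicious ones first, ordered by how many line pairs matched.
def check_against_primary(primary, others)
  reports = others.map { |name, prints| [name, matching_lines(primary, prints)] }
  reports.reject { |_, pairs| pairs.empty? }
         .sort_by { |_, pairs| -pairs.length }
end
```

Each report entry pairs a submission name with the line numbers to highlight on both sides, which is the information a side-by-side review screen needs.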
While the traditional delivery of software services has been mainstream, bringing the cloud into perspective changes the entire scenario. The cloud centralizes our resources, code base and data in an always-available depot. Hardware resources can be utilized efficiently, load on a high-demand system can be catered to, and the system can be scaled easily. Configuration management is also made effortless. Any organization with requirements for such a system is in need of the cloud. The platform allows changes to be pushed to the codebase at the push of a button, rather than relying on extensive upgrade packs. For plagiarism detection in a university or an academic institute the needs are critical and point towards the cloud: the computing requirements are addressed, and the system is centralized and easily scaled. For hosting CodeAliker on the cloud, Heroku, a cloud platform, has been employed.
The source code is available to the public at: https://github.com/Myth17/CodeAliker
The application is available for free use at: http://codealiker.heroku.com/
6. Results
[Figure 6.1 and Figure 6.2]
The results for CodeAliker display the assignment marked as primary on the left, while the suspected plagiarism instances are previewed stacked onto each other on the right. To present a clearer picture, the plagiarized instances are marked with line numbers in order to aid manual scrutiny and present a more cohesive report.
7. Conclusion
We have analyzed the entire scenario of plagiarism detection, while figuring out the problems and solutions for developing an application for the purpose. The major accomplishments of the research are our ability to state a precise definition of code plagiarism, an understanding of different approaches towards plagiarism detection, a practical implementation of the MOSS engine with fine-tuned parameters, and the building of a scalable web application hosted on a cloud platform.
References
[1] Saul Schleimer, Daniel S. Wilkerson and Alex Aiken, “Winnowing: Local Algorithms for Document Fingerprinting,” Proc. 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03), pp. 76-85, 2003.
[2] Georgina Cosma and Mike Joy, “Towards a Definition of Source Code Plagiarism,” IEEE Trans. Education, vol. 51, no. 2, May 2008.
[3] Georgina Cosma and Mike Joy, “An Approach to Source-Code Plagiarism Detection and Investigation Using Latent Semantic Analysis,” IEEE Trans. Computers, vol. 61, no. 3, p. 379, March 2012.
[4] M. Mozgovoy, “Desktop Tools for Offline Plagiarism Detection in Computer Programs,” Informatics in Education, vol. 5, no. 1, pp. 97-112, 2006.
[5] B. Baker, “On Finding Duplication and Near-Duplication in Large Software Systems,” Proc. IEEE Second Working Conf. Reverse Eng., pp. 85-95, 1995.
[6] M.J. Wise, “YAP3: Improved Detection of Similarities in Computer Program and Other Texts,” Proc. 27th SIGCSE Technical Symp., pp. 130-134, 1996.
[7] L. Prechelt, G. Malpohl, and M. Philippsen, “Finding Plagiarisms Among a Set of Programs with JPlag,” J. Universal Computer Science, vol. 8, no. 11, pp. 1016-1038, 2002.
[8] M. Joy and M. Luck, “Plagiarism in Programming Assignments,” IEEE Trans. Education, vol. 42, no. 2, pp. 129-133, May 1999.
Authors
Nitish Upreti is a computer science student at AMITY School of Engineering and Technology, Noida, India. His fields of interest include Algorithm Design, Artificial Intelligence and building scalable web applications.
Rishi Kumar is a computer science faculty member at Amity School of Engineering and Technology, Noida, India. His fields of interest include Artificial Intelligence, Expert Systems and Image Processing.