Automatic Plagiarism Detection System for Specialized Corpora
University Politehnica of Bucharest
Authors
Filip Cristian Buruiană
Adrian Scoică
Traian Rebedea – traian.rebedea@cs.pub.ro
Razvan Rughiniș
Overview
• Introduction
• System architecture
• Detection of plagiarism
• Algorithms for candidate selection
• Algorithms for detailed analysis
• Algorithms for post-processing
• Results
• Conclusions
22.09.13 Bachelor's Thesis Session - July 2012
Introduction
• Plagiarism: unauthorized appropriation of the language or
thoughts of another author and the representation of that
author's work as one's own, without giving proper credit
to the original author
• Lots of documents => automatic detection
needed
• Information Retrieval
– Stemming (e.g. beauty, beautiful, beautifulness => beauti)
– Vector Space Model
– tf-idf weighting, cosine similarity
• Measuring results
– precision, recall, granularity => F-measure
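The vector space model with tf-idf weighting and cosine similarity can be illustrated with a minimal sketch. This is not the AuthentiCop code: the stop-word list, the tokenizer and the sample sentences are assumptions made for the example, and stemming is left out to keep it short.

```python
import math
from collections import Counter

# Hypothetical toy stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "The Canny edge detector uses a Gaussian filter.",
    "A Gaussian filter is used by the Canny edge detector.",
    "Stemming maps beautiful and beauty to the same stem.",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # two sentences on the same topic
sim_02 = cosine(vecs[0], vecs[2])  # unrelated sentences, no shared terms
print(sim_01 > sim_02)
```

Without stemming, "uses" and "used" count as different terms, which is one reason a stemmer is normally applied before weighting.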
22.09.13 CSCS 2013 – Bucharest, Romania 3
Existing solutions
• Lots of commercial systems exist (Turnitin, Antiplag, Ephorus, etc.)
• They are general solutions, topic independent
• No open-source solutions that offer good results
• No solutions specialized for Computer Science
• Difficult to evaluate: a good corpus is needed (human annotation,
finding plagiarized documents, etc.)
• AuthentiCop – developed for specialized corpora, also evaluated on
general texts
• Corpora used:
– PAN 2011 (“evaluation lab on uncovering plagiarism, authorship, and
social software misuse” at CLEF)
– Bachelor theses @ A&C
System Architecture
• Web interface for accessing AuthentiCop
– Simple to add documents (text, PDF) and to highlight suspicious
elements
• Logical separation
– Front-end (PHP, JavaScript + AJAX, jQuery)
– Back-end (C++)
– Cross-Language Communication
• Scalable solution, easy to update
– Web server (front-end) and the plagiarism detection
modules (back-end) may run on different machines
– Plagiarism detection can be distributed on different
machines (distributed workers)
• Several external open-source libraries are used
(e.g. Apache Tika, CLucene, etc.)
• Example: sequence of steps for processing PDF files:
• Apache Tika is used for transforming PDFs into text
• Automatic build module for the back-end components
• Automatic deployment system for the solution
Detection of plagiarism
• Different problems
– Intrinsic plagiarism (analyze only the suspicious
document)
– External plagiarism (also has a reference collection
to check against)
• How large is the collection? Online sources?
• Source identification
• Text alignment
Steps for external plagiarism detection
1. Candidate selection
– Find pairs of suspicious texts
– Combines source identification with text
alignment
2. Detailed analysis
3. Post-processing
Algorithms for candidate selection
• Selection of the plausible plagiarism pairs
• Using stop-word elimination, tf-idf & cosine similarity
• Initial hypothesis
• “Similarity Search Problem”: All-Pairs,
ppjoin (Prefix Filtering with Positional
Information Join)
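The prefix-filtering idea behind All-Pairs and ppjoin can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it assumes documents are reduced to token sets compared with Jaccard similarity, and it omits the positional filter that ppjoin adds. The function names and sample records are hypothetical.

```python
import math
from collections import Counter, defaultdict

def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)

def prefix_filter_pairs(records, threshold):
    """Return pairs (i, j), i < j, whose Jaccard similarity is >= threshold.

    Tokens are sorted rarest first. If two sets reach the threshold, their
    short prefixes must share a token, so only records that collide on a
    prefix token are verified with the exact similarity.
    """
    freq = Counter(tok for rec in records for tok in rec)
    index = defaultdict(list)  # token -> ids of records with it in their prefix
    results = set()
    for i, rec in enumerate(records):
        if not rec:
            continue
        tokens = sorted(rec, key=lambda tok: (freq[tok], tok))
        # Small epsilon guards against float error in threshold * size.
        min_overlap = math.ceil(threshold * len(tokens) - 1e-9)
        prefix_len = len(tokens) - min_overlap + 1
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(i)
        for j in candidates:  # verification step
            if jaccard(records[j], rec) >= threshold:
                results.add((j, i))
    return results

records = [
    {"canny", "edge", "detector", "gaussian", "filter"},
    {"canny", "edge", "detector", "gaussian", "noise"},
    {"stemming", "vector", "space", "model"},
    {"vector", "space", "model", "cosine"},
    {"plagiarism"},
]
pairs = prefix_filter_pairs(records, 0.5)
print(sorted(pairs))
```

The benefit over comparing all pairs is that most dissimilar records never meet in the inverted index, so the exact similarity is computed only for a small candidate set.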
• FastDocode (presented at PAN 2010) + caching + sub-linear merging
• New approach
– Text segments => fingerprints & indexing with Apache
CLucene
– Compute the number of inversions
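A rough sketch of fingerprinting and indexing is given here, under loudly stated assumptions: the slides do not say how fingerprints are selected or how the inversion count is used. The sketch hashes word n-grams, keeps a hash when it is divisible by a hypothetical `keep_every` parameter (one possible reading of the "retention rate"), and stores the kept hashes in an in-memory dictionary that stands in for the CLucene index. Counting inversions over the matched source positions is interpreted as a check that shared fingerprints occur in the same order in both texts.

```python
import hashlib
from collections import defaultdict

def ngram_hashes(text, n):
    """Hash every word n-gram of `text`; returns a list of (position, hash)."""
    words = text.lower().split()
    out = []
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        digest = hashlib.md5(gram.encode("utf-8")).hexdigest()
        out.append((i, int(digest, 16)))
    return out

def fingerprint(text, n=4, keep_every=1):
    """Keep hashes divisible by keep_every (retention of roughly 1/keep_every)."""
    return [(pos, h) for pos, h in ngram_hashes(text, n) if h % keep_every == 0]

def build_index(sources, n=4, keep_every=1):
    """Map each kept hash to the (document id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in sources.items():
        for pos, h in fingerprint(text, n, keep_every):
            index[h].append((doc_id, pos))
    return index

def matched_positions(index, suspicious, n=4, keep_every=1):
    """Per source document, the source positions hit, in suspicious-text order."""
    hits = defaultdict(list)
    for _, h in fingerprint(suspicious, n, keep_every):
        for doc_id, pos in index.get(h, []):
            hits[doc_id].append(pos)
    return hits

def count_inversions(seq):
    """Number of pairs i < j with seq[i] > seq[j] (quadratic, fine for a sketch)."""
    return sum(1 for i in range(len(seq))
               for j in range(i + 1, len(seq)) if seq[i] > seq[j])

sources = {
    "wiki": "the raw image is convolved with a gaussian filter to reduce noise",
    "other": "stemming reduces words to a common stem before indexing",
}
index = build_index(sources)
hits = matched_positions(
    index, "first the raw image is convolved with a gaussian filter")
print(dict(hits), count_inversions(hits["wiki"]))
```

With `keep_every=1` every n-gram is kept, which makes the demo deterministic; a value such as 10 would keep about a tenth of the hashes and shrink the index accordingly.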
N-gram length   Segment dimension   Retention rate   TP     FP      FN      Time (h)   Plagdet
3               150                 10%              5413   44522   11469   ~ 1        0.162
4               150                 10%              4913   10297   11969   ~ 2        0.306
4               150                 30%              7633   35169   9249    ~ 4.5      0.256
5               150                 20%              5194   6256    11688   ~ 3        0.367
Method (used on 1000 documents)   TP    FP     FN     Prec.   Recall   Plagdet
Fingerprinting & indexing         685   494    761    0.581   0.474    0.522
FastDocode#3                      634   4097   812    0.134   0.438    0.205
FastDocode#4                      424   815    1022   0.342   0.293    0.316
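The precision, recall and Plagdet columns of the method comparison can be recomputed from the TP/FP/FN counts. The short check below assumes that Plagdet reduces to the plain F1 score at this stage (a granularity of 1), which is consistent with the reported values but is not stated on the slide.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from true/false positive counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# (TP, FP, FN) copied from the method comparison on 1000 documents.
rows = {
    "Fingerprinting & indexing": (685, 494, 761),
    "FastDocode#3": (634, 4097, 812),
    "FastDocode#4": (424, 815, 1022),
}
scores = {name: precision_recall_f1(*counts) for name, counts in rows.items()}
for name, (p, r, f) in scores.items():
    print(f"{name:28s} {p:.3f} {r:.3f} {f:.3f}")
```

The same F1 formula also reproduces the Plagdet column of the parameter table, for example 2·5194 / (2·5194 + 6256 + 11688) ≈ 0.367 for 5-grams with a 20% retention rate.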
Algorithms for detailed analysis
• DotPlot: “Sequence Alignment Problem”
• Modified FastDocode
– Extending the analysis to the right and to the left,
starting from common words/passages
– Using passages instead of words as seeds for the
comparison
– tf-idf weighting & cosine similarity
Image source: Wikipedia
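The seed-and-extend idea (start from a common passage, then grow the match to the left and to the right) might look like the sketch below. It is a simplified assumption rather than the modified FastDocode itself: seeds are exact runs of `k` words, extension is greedy, and plain term counts stand in for tf-idf weights in the cosine test. The threshold value is arbitrary.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the term-count vectors of two word lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(c * cb[t] for t, c in ca.items())
    na = math.sqrt(sum(c * c for c in ca.values()))
    nb = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_seeds(susp, src, k):
    """Positions (i, j) where k consecutive words match exactly."""
    where = {}
    for j in range(len(src) - k + 1):
        where.setdefault(tuple(src[j:j + k]), []).append(j)
    return [(i, j) for i in range(len(susp) - k + 1)
            for j in where.get(tuple(susp[i:i + k]), [])]

def extend(susp, src, i, j, k, threshold):
    """Grow both windows one word at a time while they stay similar.

    Windows are half-open ranges; returns ((start, end), (start, end)).
    """
    lo_s, hi_s, lo_r, hi_r = i, i + k, j, j + k
    grew = True
    while grew:
        grew = False
        if (lo_s > 0 and lo_r > 0 and
                cosine(susp[lo_s - 1:hi_s], src[lo_r - 1:hi_r]) >= threshold):
            lo_s, lo_r = lo_s - 1, lo_r - 1   # extend to the left
            grew = True
        if (hi_s < len(susp) and hi_r < len(src) and
                cosine(susp[lo_s:hi_s + 1], src[lo_r:hi_r + 1]) >= threshold):
            hi_s, hi_r = hi_s + 1, hi_r + 1   # extend to the right
            grew = True
    return (lo_s, hi_s), (lo_r, hi_r)

susp = ("we note that the raw image is convolved with a gaussian "
        "filter before thresholding").split()
src = ("in practice the raw image is convolved with a gaussian "
       "filter and then differentiated").split()
seeds = find_seeds(susp, src, k=3)
susp_span, src_span = extend(susp, src, *seeds[0], k=3, threshold=0.95)
print(" ".join(susp[susp_span[0]:susp_span[1]]))
```

A lower threshold lets the windows absorb small edits such as replaced words, at the price of less precise passage boundaries.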
Algorithms for post-processing
• Semantic analysis using LSA
– Built a semantic space with papers from Computer
Science (and pages from Wikipedia)
– Gensim framework in Python
• Smith-Waterman Algorithm
– Dynamic programming
– Similar to the longest common subsequence problem
– Insert and delete operations may have any cost
(they may be greater than 1)
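The Smith-Waterman step can be illustrated with a compact word-level version. The scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for the example; the slides only state that insert and delete costs are configurable. The sample sentences are shortened from the Canny edge detector example in the Results section.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of two token lists.

    Returns (score, aligned_a, aligned_b); "-" marks an inserted gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # token of a aligned to a gap
            left = score[i][j - 1] + gap    # token of b aligned to a gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the local score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = score[i][j]
        if s == score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif s == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]

a = "the result is a slightly blurred version of the original".split()
b = "so the result is a blurred version of the original image".split()
best, aligned_a, aligned_b = smith_waterman(a, b)
print(best)
print(" ".join(aligned_a))
print(" ".join(aligned_b))
```

Raising the gap penalty makes the alignment prefer shorter exact matches over long passages with inserted or deleted words.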
Results
• Corpus: PAN 2011 (~ 22k documents)
• Run time on laptop: ~ 20 hours
• Results:
• Official results from PAN 2011:
Plagdet          Recall           Precision        Granularity
0.221929185084   0.202996955425   0.366482242839   1.26150173611
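The four reported values are tied together by the PAN definition of plagdet: the F1 score of precision and recall, divided by log2(1 + granularity). A short check using the numbers from the table:

```python
import math

def plagdet(precision, recall, granularity):
    """PAN plagdet score: F1 divided by log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

# Official PAN 2011 values from the table above.
score = plagdet(0.366482242839, 0.202996955425, 1.26150173611)
print(round(score, 6))
```

A granularity above 1 means that some plagiarized passages were reported as several separate detections, which lowers the score even when precision and recall are unchanged.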
• Specific corpus for CS:
– 940 BSc theses + 8700 articles on CS from Wikipedia
• Detecting theses written in English: TextCat
– 307 BSc theses in English
Plagiarized text:
The Canny edge detector uses a filter based
on the first derivative of a Gaussian, because
it is susceptible to noise present on raw
unprocessed image data, so to begin with,
the raw image is convolved with a Gaussian
filter. The result is a slightly blurred version of
the original which is not affected by a single
noisy pixel to any significant degree.
Original text from Wikipedia:
Because the Canny edge detector is
susceptible to noise present in raw
unprocessed image data, it uses a filter based
on a Gaussian (bell curve), where the raw
image is convolved with a Gaussian filter. The
result is a slightly blurred version of the
original which is not affected by a single noisy
pixel to any significant degree.
• Some elements are incorrectly identified as
plagiarism: quotes, bibliographic references
Conclusions
• Improve the corpus
• The system uses several parameters that were
determined empirically => use machine
learning for finding the best values
• Increase the processing speed
• Improve the method: “bag of words” +
information about the position of the words
• Need better post-processing for real
documents (like scientific papers or theses)
Thank you!
• Questions?
• Discussion

