Automatic plagiarism detection system for specialized corpora

University Politehnica of Bucharest
Authors:
Filip Cristian Buruiană
Adrian Scoică
Traian Rebedea – traian.rebedea@cs.pub.ro
Razvan Rughiniș

Overview
• Introduction
• System architecture
• Detection of plagiarism
• Algorithms for candidate selection
• Algorithms for detailed analysis
• Algorithms for post-processing
• Results
• Conclusions

Introduction
• Plagiarism: unauthorized appropriation of the language or
thoughts of another author and the representation of that
author's work as pertaining to one's own without according
proper credit to the original author
• Lots of documents => automatic detection
needed
• Information Retrieval
– Stemming (e.g. beauty, beautiful, beautifulness => beauti)
– Vector Space Model
– tf-idf weighting, cosine similarity
• Measuring results
– precision, recall, granularity => F-measure
22.09.13 CSCS 2013 – Bucharest, Romania 3
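The vector space model with tf-idf weighting and cosine similarity mentioned above can be illustrated with a minimal sketch. It is not the AuthentiCop implementation: the function names, the crude tokenizer (no stemming) and the smoothed idf variant are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only; a real system would also stem.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf; this is one common variant, chosen here as an assumption.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two documents that share most of their terms get a cosine close to 1, while documents with no term in common get exactly 0.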

Existing solutions
• Lots of commercial systems exist (Turnitin, Antiplag, Ephorus, etc.)
• They are general solutions, topic independent
• No open-source solutions that offer good results
• No solutions specialized for Computer Science
• Difficult to evaluate: needs a good corpus (annotated by people;
plagiarized documents are hard to find, etc.)
• AuthentiCop – developed for specialized corpora, also evaluated on
general texts
• Used corpora:
– PAN 2011 (“evaluation lab on uncovering plagiarism, authorship, and
social software misuse” at CLEF)
– Bachelor theses @ A&C

System Architecture
• Web interface for accessing AuthentiCop
– Simple to add documents (text, pdf) and to highlight suspicious
elements

System architecture
• Logical separation
– Front-end (PHP, JavaScript + AJAX, jquery)
– Back-end (C++)
– Cross-Language Communication
• Scalable solution, easy to update
– Web server (front-end) and the plagiarism detection
modules (back-end) may run on different machines
– Plagiarism detection can be distributed on different
machines (distributed workers)
• Several external open-source libraries are used
(e.g. Apache Tika, CLucene, etc.)

System architecture
• Example: sequence of steps for processing PDF files
– Apache Tika is used for transforming PDFs into text
• Automatic build module for the back-end components
• Automatic deployment system for the solution

Detection of plagiarism
• Different problems
– Intrinsic plagiarism (analyze only the suspicious
document)
– External plagiarism (also has a reference collection
to check against)
• How large is the collection? Online sources?
• Source identification
• Text alignment

Detection of plagiarism
Steps for external plagiarism detection
1. Candidate selection
– Find pairs of suspicious texts
– Combines source identification with text
alignment
2. Detailed analysis
3. Post-processing

Algorithms for candidate selection
• Selection of the plausible plagiarism pairs
• Using stop-word elimination, tf-idf & cosine similarity
• Initial hypothesis
• “Similarity Search Problem”: All-Pairs, ppjoin (Prefix Filtering with
Positional Information Join)
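The prefix-filtering idea behind All-Pairs and ppjoin can be sketched as follows. This is a simplified illustration, not the system's code: it uses Jaccard similarity over token sets instead of tf-idf cosine, leaves out ppjoin's positional filtering, and all function names are hypothetical. The principle is that, once tokens are sorted in a global order (rarest first), two sets can only reach the threshold if their short prefixes share at least one token, so most pairs are never compared.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def candidate_pairs(token_sets, threshold):
    """Pairs (i, j), i < j, whose prefixes share a token (prefix filtering)."""
    # Global token order: rarest first, ties broken alphabetically.
    df = Counter(tok for s in token_sets for tok in s)
    index = defaultdict(list)  # token -> ids of earlier sets with it in their prefix
    candidates = set()
    for i, s in enumerate(token_sets):
        ordered = sorted(s, key=lambda tok: (df[tok], tok))
        # Jaccard >= t requires an overlap of at least ceil(t * |s|) tokens, so
        # only the first |s| - ceil(t * |s|) + 1 tokens need to be indexed.
        # The small epsilon guards against floating-point round-up.
        min_overlap = math.ceil(threshold * len(ordered) - 1e-9)
        prefix_len = len(ordered) - min_overlap + 1
        for tok in ordered[:prefix_len]:
            for j in index[tok]:
                candidates.add((j, i))
            index[tok].append(i)
    return candidates


def similar_pairs(token_sets, threshold):
    """Verify the candidates and keep the pairs that really reach the threshold."""
    return sorted(
        (i, j)
        for i, j in candidate_pairs(token_sets, threshold)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    )
```

The verification step still computes the exact similarity, so the filter only reduces the number of comparisons and does not change the result.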

Algorithms for candidate selection
• FastDocode (presented at PAN 2010)
+ caching + sub-linear merging
• New approach
– Text segments => fingerprints & indexing with Apache
CLucene
– Compute the number of inversions

N-gram length   Segment dimension   Retention rate   TP     FP      FN      Time (h)   Plagdet
3               150                 10%              5413   44522   11469   ~ 1        0.162
4               150                 10%              4913   10297   11969   ~ 2        0.306
4               150                 30%              7633   35169   9249    ~ 4.5      0.256
5               150                 20%              5194   6256    11688   ~ 3        0.367

Method (run on 1000 documents)   TP    FP     FN     Prec.   Recall   Plagdet
Fingerprinting & indexing        685   494    761    0.581   0.474    0.522
FastDocode#3                     634   4097   812    0.134   0.438    0.205
FastDocode#4                     424   815    1022   0.342   0.293    0.316
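The fingerprinting and indexing approach can be illustrated with a small in-memory sketch. Several points are assumptions, since the slides do not specify them: segments and n-grams are taken over words, the retention rate is applied by keeping n-grams whose hash falls below a cut-off, and a plain dictionary stands in for the CLucene index. The inversion-counting step is not covered.

```python
import hashlib
from collections import Counter, defaultdict


def segments(words, size):
    """Split a word list into consecutive segments of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def fingerprints(segment, n, retention):
    """Hash every word n-gram and keep roughly a `retention` share of them."""
    kept = set()
    for i in range(len(segment) - n + 1):
        gram = " ".join(segment[i:i + n])
        # md5 is used only as a stable hash (Python's hash() is salted per run).
        h = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16)
        # Selection depends only on the n-gram itself, so the same n-gram is
        # kept (or dropped) in every document.
        if h % 1000 < retention * 1000:
            kept.add(h)
    return kept


def build_index(sources, n=4, size=150, retention=0.1):
    """Map each fingerprint to the (doc_id, segment_no) pairs containing it."""
    index = defaultdict(set)
    for doc_id, text in sources.items():
        for seg_no, seg in enumerate(segments(text.lower().split(), size)):
            for fp in fingerprints(seg, n, retention):
                index[fp].add((doc_id, seg_no))
    return index


def candidates(index, suspicious_text, n=4, size=150, retention=0.1):
    """Count shared fingerprints between the suspicious text and each source segment."""
    hits = Counter()
    for seg in segments(suspicious_text.lower().split(), size):
        for fp in fingerprints(seg, n, retention):
            for location in index.get(fp, ()):
                hits[location] += 1
    return hits
```

Source segments with the highest counts would then be passed on to the detailed analysis step. The default values mirror the parameter ranges in the first table and are not tuned recommendations.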

Algorithms for detailed analysis
• DotPlot: “Sequence Alignment Problem”
• Modified FastDocode
– Extending the analysis to the right and to the left,
starting from common words/passages
– Using passages instead of words as seeds for the
comparison
– tf-idf weighting & cosine similarity
Image source: Wikipedia
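The seed-and-extend idea (start from a common passage, then grow the match to the left and to the right) can be sketched as below. This is a simplified illustration rather than the modified FastDocode: extension stops at the first differing word, whereas the slide suggests a tf-idf and cosine based comparison. The function name and the seed_len parameter are hypothetical.

```python
def seed_and_extend(susp, src, seed_len=3):
    """Return maximal exact word matches as (susp_start, src_start, length)."""
    # Index every passage of seed_len consecutive words in the source.
    positions = {}
    for j in range(len(src) - seed_len + 1):
        positions.setdefault(tuple(src[j:j + seed_len]), []).append(j)

    matches = set()
    for i in range(len(susp) - seed_len + 1):
        for j in positions.get(tuple(susp[i:i + seed_len]), []):
            # Extend to the left while the preceding words agree.
            start_a, start_b = i, j
            while start_a > 0 and start_b > 0 and susp[start_a - 1] == src[start_b - 1]:
                start_a -= 1
                start_b -= 1
            # Extend to the right while the following words agree.
            end_a, end_b = i + seed_len, j + seed_len
            while end_a < len(susp) and end_b < len(src) and susp[end_a] == src[end_b]:
                end_a += 1
                end_b += 1
            # Seeds inside the same passage collapse to one maximal match.
            matches.add((start_a, start_b, end_a - start_a))
    return sorted(matches)
```

Using passages of several words as seeds, rather than single words, keeps frequent words from triggering an extension at every occurrence.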

Algorithms for post-processing
• Semantic analysis using LSA
– Built a semantic space with papers from Computer
Science (and pages from Wikipedia)
– Gensim framework in Python
• Smith-Waterman Algorithm
– Dynamic programming
– Similar to the longest common subsequence
– Insert and delete operations may have any cost
(they may be greater than 1)
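A minimal word-level sketch of the Smith-Waterman local alignment follows. The scoring values (match, mismatch, gap) are illustrative assumptions, not the values used in the system. The gap parameter plays the role of the configurable insert/delete cost mentioned above.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of word lists a and b.

    Returns (score, (a_start, a_end), (b_start, b_end)) with end-exclusive spans.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming table; scores never drop below zero,
    # which is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1  # word deleted from a
        else:
            j -= 1  # word inserted in b
    return best, (i, end_i), (j, end_j)
```

With a gap cost that is small relative to the match reward, a passage with one inserted or deleted word still aligns as a single span instead of being split in two.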

Results
• Corpus: PAN 2011 (~ 22k documents)
• Run time on laptop: ~ 20 hours
• Official results from PAN 2011:
Plagdet          Recall           Precision        Granularity
0.221929185084   0.202996955425   0.366482242839   1.26150173611
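For reference, the reported measures are related as follows: precision and recall follow from the TP/FP/FN counts, and the PAN plagdet score divides the F-measure by log2(1 + granularity). The helper names below are hypothetical. The candidate-selection tables earlier in the deck appear consistent with a granularity of 1, in which case plagdet equals the F-measure.

```python
import math


def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision, recall, granularity=1.0):
    """PAN plagdet score: F1 penalized by how fragmented the detections are."""
    return f_measure(precision, recall) / math.log2(1 + granularity)


# Row "Fingerprinting & indexing" from the candidate-selection table.
p, r = precision_recall(tp=685, fp=494, fn=761)
print(round(p, 3), round(r, 3), round(plagdet(p, r), 3))

# Official PAN 2011 figures reported above.
print(plagdet(0.366482242839, 0.202996955425, 1.26150173611))
```

A granularity above 1 means that a single plagiarized passage was reported as several separate detections, which lowers the score even when precision and recall are unchanged.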

Results
• Specific corpus for CS:
– 940 BSc theses + 8700 articles on CS from Wikipedia
• Detecting theses written in English: TextCat
– 307 BSc theses in English
Plagiarized text:
“The Canny edge detector uses a filter based on the first derivative of a Gaussian, because it is susceptible to noise present on raw unprocessed image data, so to begin with, the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree.”
Original text from Wikipedia:
“Because the Canny edge detector is susceptible to noise present in raw unprocessed image data, it uses a filter based on a Gaussian (bell curve), where the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree.”
• Some elements are incorrectly identified as
plagiarism: quotes, bibliographic references

Conclusions
• Improving the corpus
• The system uses several parameters that were
determined empirically => use machine
learning for finding the best values
• Increase the speed of the processing
• Improve the method: “bag of words” +
information about the position of the words
• Need better post-processing for real
documents (like scientific papers or theses)

Thank you!
• Questions?
• Discussion
