Natural language processing

Saugata Bose
12204
M.Sc-II
Natural Language Processing:
Plagiarism Detection

Scope, Objectives, Significance
Objectives
• Propose a framework
• Investigate the role of machine learning in the proposed framework
Significance of the Project
• Degradation of education quality
Scope
• External plagiarism

• Plagiarism
“The action or practice of taking someone else's work, idea, etc., and passing it off as one's own; literary theft.”
Clough, Gaizauskas, Piao, and Wilks, “METER: Measuring TExt Reuse” (2000)
Direct copy or paraphrase of n-grams
Can be of various lengths
• Natural Language Processing
computer science + artificial intelligence + linguistics
Shallow
Deep
Information Retrieval
Word Segmentation
Sentence Breaking
Word Sense Disambiguation

Works That Influence Us
• SCAM (Shivakumar and Garcia-Molina, 1995, 1996)
• The more complex the metrics are, the more processing power is required (Lancaster and Culwin, 2003)
• PRAISE (Culwin and Lancaster, 2001)
• N-gram overlap method (Roman Tesar, Massimo Poesio, Vaclav Strnad, and Karel Jezek)
• Use of cosine similarity and tf-idf (Thade Nahnsen, Ozlem Uzuner, and Boris Katz)
• Plagiarism Pattern Checker (Nam Oh Kang, Alexander Gelbukh, and SangYong Han)
• Use of VSM (Benno Stein, Sven Meyer zu Eissen, and Martin Potthast)
Limitations!

Our Initiatives
• Frequency comparison approach
• N-gram similarity measure along with Jaccard index
• Shallow NLP

Experimental Setup
• Corpus of Plagiarised Short Answers (Clough & Stevenson, 2009)
• Original source documents: 5
• Plagiarised documents: 57
---------- Near copy: 19
---------- Light revision: 19
---------- Heavy revision: 19
• Non-plagiarised documents: 38

Experimental Setup (cont.)
[Figure: flow diagram of the framework. Box labels: Corpus; Original Documents; Suspicious Documents; Text Pre-processing & NLP Techniques; Comparison Methodologies; Feature Selection; Machine Learning; Construction of a Train Model; Test Model; Accuracy Score; Plagiarism Detection.]

• Text pre-processing & NLP techniques:
[Figure: tree of pre-processing combinations. Lower Case branches into Stop Word / Without Stop Word; each of these branches into Punctuation / No Punctuation; each of those branches into Stemming / No Stemming and Lemmatizing / No Lemmatizing.]
Sentence Segmentation
Input:
[ “To be or not to be– that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them. To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – 'tis a consummation devoutly to be wished.” ]
Output:
[ “To be or not to be– that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them.” ]
[ “To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – 'tis a consummation devoutly to be wished.” ]
Tokenization
“To be or not to be– that is the question:”
-> [To] [be] [or] [not] [to] [be] [–] [that] [is] [the] [question] [:]
Lower case
-> to be or not to be– that is the question
Stop-word removal
-> be or not be - question:
Punctuation removal
“Hello Dear, how are You?” -> Hello Dear how are you
Lemmatizing
Produced -> Produce
Stemming
Produced / Product / Produce -> Produc
Computational -> Comput
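A minimal sketch of the pre-processing steps above, using only the Python standard library. The stop-word list, the regular expressions, and the suffix-stripping rule in `crude_stem` are illustrative assumptions, not the tools used in the project; a real system would more likely use an established stemmer and lemmatizer.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the project's actual list.
STOP_WORDS = {"to", "that", "is", "the"}

def split_sentences(text):
    """Split after '.', '?' or '!' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]

def tokenize(sentence):
    """Words (optionally with an internal apostrophe) and single punctuation marks."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", sentence)

def lower_case(tokens):
    return [t.lower() for t in tokens]

def remove_punctuation(tokens):
    """Keep only tokens that start with a letter."""
    return [t for t in tokens if t[0].isalpha()]

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def crude_stem(token):
    """Strip one common suffix; a toy rule, much simpler than a real stemmer."""
    for suffix in ("ational", "ing", "ed", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("To be or not to be– that is the question:")
cleaned = remove_stop_words(remove_punctuation(lower_case(tokens)))
print(tokens)
print(cleaned)
print([crude_stem(w) for w in ("Produced", "Computational", "outside")])
```

Each step is a separate function so that the combinations in the pre-processing tree (with or without stop words, punctuation, stemming) can be switched on and off independently.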

• Comparison methodologies:
N-gram frequency-based similarity measure
N-gram similarity measure using Jaccard index
• Machine learning algorithm:
J48 classifier, Naïve Bayes classifier

N gram Similarity Measure
• 1-gram similarity measure (Pre-processing + NLP + Comparison)
Original Document: The girl is standing outside of PUCSD and talking with her friend
Suspicious Document: The boy is talking with his friend outside of Symbiosis
1-gram representation (panel labels: No SP and P / With SP and P)
Without stemming:
[[The] [girl] [is] [standing] [outside] [of] [PUCSD] [and] [talking] [with] [her] [friend]]
[[The] [boy] [is] [talking] [with] [his] [friend] [outside] [of] [Symbiosis]]
7/10 = 70% (7 of the 10 suspicious-document 1-grams also occur in the original)
3/10 = 30% (the remaining 3 do not)
With stemming:
[[The] [girl] [is] [stand] [outsid] [of] [PUCSD] [and] [talk] [with] [her] [friend]]
[[The] [boy] [is] [talk] [with] [his] [friend] [outsid] [of] [Symbiosis]]
7/10 = 70%
3/10 = 30%
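The 7/10 figure can be reproduced with a short sketch. The slide does not state the formula, so the function below assumes the score is the share of the suspicious document's 1-grams that also appear in the original (a containment-style measure); the names `ngrams` and `containment` are hypothetical.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def containment(original_tokens, suspicious_tokens, n=1):
    """Fraction of the suspicious document's n-grams found in the original.

    Assumption: the denominator is the suspicious document's n-gram count,
    inferred from the 7/10 on the slide.
    """
    original = set(ngrams(original_tokens, n))
    suspicious = ngrams(suspicious_tokens, n)
    if not suspicious:
        return 0.0
    shared = sum(1 for gram in suspicious if gram in original)
    return shared / len(suspicious)

original = "The girl is standing outside of PUCSD and talking with her friend".split()
suspicious = "The boy is talking with his friend outside of Symbiosis".split()

score = containment(original, suspicious)
print(f"similar: {score:.0%}, different: {1 - score:.0%}")
```

The shared 1-grams are The, is, talking, with, friend, outside, and of; boy, his, and Symbiosis are the three that do not match.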

N gram Similarity Measure using Jaccard Index
• 2-gram similarity measure (Pre-processing + NLP + Comparison)
Original Document: The girl is standing outside of PUCSD and talking with her friend
Suspicious Document: The boy is talking with his friend outside of Symbiosis
2-gram representation
[[The girl],[girl is],[is standing],[standing outside],[outside of],[of PUCSD],[PUCSD and],[and talking],[talking with],[with her],[her friend],[friend]]
[[The boy],[boy is],[is talking],[talking with],[with his],[his friend],[friend outside],[outside of],[of Symbiosis],[Symbiosis]]
Similarity Index
2/20 = 10% (2 shared 2-grams, [talking with] and [outside of], out of 20 distinct entries across both lists)
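The Jaccard index is the size of the intersection of two sets divided by the size of their union. The sketch below reproduces the 2/20 from the example; to match the lists on the slide it keeps the final word as a trailing entry of its own, which is an assumption read off the slide rather than a standard bigram definition. The function names are hypothetical.

```python
def bigrams_with_tail(tokens):
    """Adjacent word pairs, plus the last word alone (as listed on the slide)."""
    pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
    if tokens:
        pairs.append((tokens[-1],))
    return pairs

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the distinct items of a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

original = "The girl is standing outside of PUCSD and talking with her friend".split()
suspicious = "The boy is talking with his friend outside of Symbiosis".split()

grams_original = bigrams_with_tail(original)
grams_suspicious = bigrams_with_tail(suspicious)
index = jaccard(grams_original, grams_suspicious)
print(f"Jaccard index: {index:.0%}")
```

Without the trailing single-word entries the same documents give 2 shared bigrams out of 18 distinct ones, so the padding convention changes the score slightly.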

Experiment and Findings-1
Generating Decision Tree
95 instances, 121 attributes
Selecting Features
Build train model
Accuracy: 94.6809 % on J48
Accuracy: 65.9574 % on Naïve Bayes
Accuracy: 71.2766 % on Naïve Bayes
Accuracy: 86.3874 % on Naïve Bayes
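To illustrate how similarity scores can feed a classifier and how an accuracy figure is computed, here is a toy sketch. It is not J48 (Weka's implementation of the C4.5 decision tree) and not the project's model: it fits a single-threshold rule, the simplest possible decision tree, on one feature. The scores and labels are made-up values for illustration only, not data from the corpus.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def predict(scores, threshold):
    """One-split rule: a score at or above the threshold means plagiarised."""
    return ["plagiarised" if s >= threshold else "non-plagiarised" for s in scores]

def fit_stump(scores, labels):
    """Pick the threshold (among observed scores) with the best training accuracy."""
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(scores)):
        acc = accuracy(predict(scores, candidate), labels)
        if acc > best_accuracy:
            best_threshold, best_accuracy = candidate, acc
    return best_threshold, best_accuracy

# Hypothetical similarity scores and labels, invented for this example.
train_scores = [0.90, 0.70, 0.55, 0.30, 0.10, 0.05]
train_labels = ["plagiarised"] * 3 + ["non-plagiarised"] * 3

threshold, train_accuracy = fit_stump(train_scores, train_labels)
print(f"threshold = {threshold}, training accuracy = {train_accuracy:.2%}")
```

A real experiment would use many attributes per document (the slide reports 121) and would measure accuracy on held-out or cross-validated data rather than on the training set.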

Experiment and Findings-2
Generating Decision Tree
Use Filter Metrics
Build train model
Accuracy: 65.9574 % on Naïve Bayes
Accuracy: 71.2766 % on Naïve Bayes
Accuracy: 86.3874 % on Naïve Bayes

Future Improvements
• Integrate WordNet with the current framework
• Address paraphrasing
• Address multi-lingual plagiarism detection
