Saugata Bose
12204
M.Sc-II
Natural Language Processing:
Plagiarism Detection
Scope, Objectives, Significance
Objectives:
• Propose a framework
• Investigate the role of machine learning in the proposed framework
Significance of the Project: Degradation of education quality
Scope: External plagiarism
• Plagiarism
“The action or practice of taking someone else's work, idea, etc., and passing it off as one's own; literary theft.”
Direct copy or paraphrase of n-grams; can be of various lengths
Clough, Gaizauskas, Piao, and Wilks, “METER: Measuring TExt Reuse” (2000)
• Natural Language Processing
Computer science + artificial intelligence + linguistics
Shallow / Deep
Information Retrieval, Word Segmentation, Sentence Breaking, Word Sense Disambiguation
Works That Influenced Us
• SCAM (Shivakumar and Garcia-Molina, 1995, 1996)
• The more complex the metrics are, the more processing power is required (Lancaster and Culwin, 2003)
• PRAISE (Culwin and Lancaster, 2001)
• N-gram overlap method (Roman Tesar, Massimo Poesio, Vaclav Strnad, and Karel Jezek)
• Use of cosine similarity and tf-idf (Thade Nahnsen, Ozlem Uzuner, and Boris Katz)
• Plagiarism Pattern Checker (Nam Oh Kang, Alexander Gelbukh, and SangYong Han)
• Use of VSM (Benno Stein, Sven Meyer zu Eissen, and Martin Potthast)
Limitations!
Our Initiatives
• Frequency comparison approach
• N-gram similarity measure along with Jaccard Index
• Shallow NLP
Experimental Setup
• Corpus of Plagiarised Short Answers (Clough & Stevenson, 2009)
• Original source documents: 5
• Plagiarised documents: 57
   - Near copy: 19
   - Light revision: 19
   - Heavy revision: 19
• Non-plagiarised documents: 38
Experimental Setup (cont…)
[Figure: framework flowchart. Box labels: Corpus; Original Documents; Suspicious Documents; Text Pre-processing & NLP Techniques; Comparison Methodologies; Feature Selection; Machine Learning; Construction of a Train Model; Test Model; Accuracy Score; Plagiarism Detection]
Experimental Setup (cont…)
• Text pre-processing & NLP techniques:
[Figure: tree of pre-processing combinations. Lower Case branches into Stop Word / Without Stop Word; each of those branches into Punctuation / No Punctuation; each of those branches into Stemming / No Stemming and Lemmatizing / No Lemmatizing]
Sentence Segmentation
Input:
[ “To be or not to be – that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them. To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – 'tis a consummation devoutly to be wished.” ]
Output:
[ “To be or not to be – that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them.” ]
[ “To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – 'tis a consummation devoutly to be wished.” ]

Tokenization
“To be or not to be – that is the question:”
[To] [be] [or] [not] [to] [be] [–] [that] [is] [the] [question] [:]

Lower Case
“To be or not to be – that is the question:”
to be or not to be – that is the question

Stop Word Removal
“To be or not to be – that is the question:”
be or not be – question:

Punctuation Removal
“Hello Dear, how are You?”
Hello Dear how are you

Lemmatizing
Produced → Produce

Stemming
Produced / Product / Produce → Produc
Computational → Comput
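The pre-processing steps above can be sketched in a few lines of plain Python. This is only an illustration, not the implementation used in the project: the function names are made up for this example, the sentence splitter is a naive regular-expression rule, and the stop-word list is a hypothetical four-word list chosen so that the output matches the slide's example. Stemming and lemmatizing are left out because they normally rely on a dedicated library.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); the list actually used in the project is not given in the slides.
STOP_WORDS = {"to", "that", "is", "the"}

def split_sentences(text):
    """Naive sentence segmentation: split on whitespace after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Return words and individual punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def lower_case(tokens):
    """Lower-case every token."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Drop tokens whose lower-cased form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def remove_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

tokens = tokenize("To be or not to be – that is the question:")
print(tokens)
print(remove_stop_words(tokens))
print(remove_punctuation(tokenize("Hello Dear, how are You?")))
```

Each step takes and returns a plain list of tokens, so the steps can be switched on or off independently to produce the different pre-processing combinations.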
Experimental Setup (cont…)
• Comparison methodologies:
   - N-gram frequency-based similarity measure
   - N-gram similarity measure using Jaccard Index
• Machine learning algorithms:
   - J48 classifier
   - Naïve Bayes classifier
N-gram Similarity Measure
• 1-gram similarity measure (Pre-processing + NLP + Comparison)
Original Document: The girl is standing outside of PUCSD and talking with her friend
Suspicious Document: The boy is talking with his friend outside of Symbiosis

1-gram representation (No SP and P):
[[The] [girl] [is] [standing] [outside] [of] [PUCSD] [and] [talking] [with] [her] [friend]]
[[The] [boy] [is] [talking] [with] [his] [friend] [outside] [of] [Symbiosis]]
7/10 = 70% (1-grams of the suspicious document found in the original)
3/10 = 30% (1-grams of the suspicious document not found in the original)

1-gram representation (With SP and P):
[[The] [girl] [is] [stand] [outsid] [of] [PUCSD] [and] [talk] [with] [her] [friend]]
[[The] [boy] [is] [talk] [with] [his] [friend] [outsid] [of] [Symbiosis]]
7/10 = 70%
3/10 = 30%
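The 70% figure can be reproduced with a short sketch. The slide does not state the formula, so this assumes the score is the share of the suspicious document's 1-grams that also occur in the original; the function name `unigram_overlap` is made up for this example.

```python
def unigram_overlap(original, suspicious):
    """Return (matched, total) counted over the suspicious document's 1-grams.

    Assumption: a 1-gram counts as matched when the same token occurs
    anywhere in the original document.
    """
    original_tokens = set(original.split())
    suspicious_tokens = suspicious.split()
    matched = sum(1 for tok in suspicious_tokens if tok in original_tokens)
    return matched, len(suspicious_tokens)

original = "The girl is standing outside of PUCSD and talking with her friend"
suspicious = "The boy is talking with his friend outside of Symbiosis"

matched, total = unigram_overlap(original, suspicious)
print(f"{matched}/{total} = {matched / total:.0%}")  # prints "7/10 = 70%"
```

The three unmatched tokens are "boy", "his" and "Symbiosis", which gives the remaining 3/10 = 30%.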
N-gram Similarity Measure Using Jaccard Index
• 2-gram similarity measure (Pre-processing + NLP + Comparison)
Original Document: The girl is standing outside of PUCSD and talking with her friend
Suspicious Document: The boy is talking with his friend outside of Symbiosis

2-gram representation:
[[The girl], [girl is], [is standing], [standing outside], [outside of], [of PUCSD], [PUCSD and], [and talking], [talking with], [with her], [her friend], [friend]]
[[The boy], [boy is], [is talking], [talking with], [with his], [his friend], [friend outside], [outside of], [of Symbiosis], [Symbiosis]]

Similarity Index: 2/20 = 10% (2 shared 2-grams out of 20 distinct 2-grams)
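A minimal sketch of the Jaccard computation, |A ∩ B| / |A ∪ B|, over the two 2-gram sets. The helper names and the `keep_tail` flag are assumptions made for this illustration: the slide's lists end with a one-word entry ([friend], [Symbiosis]), and keeping that trailing entry is what yields 12 and 10 entries and therefore 2/20. Without it, the same sentences give 2/18.

```python
def ngrams(tokens, n, keep_tail=False):
    """Return the set of n-grams of a token list.

    keep_tail=True also keeps the shorter gram(s) at the end of the
    sequence, mirroring the slide's lists (an assumption of this sketch).
    """
    last_start = len(tokens) if keep_tail else len(tokens) - n + 1
    return {tuple(tokens[i:i + n]) for i in range(max(last_start, 0))}

def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

original = "The girl is standing outside of PUCSD and talking with her friend".split()
suspicious = "The boy is talking with his friend outside of Symbiosis".split()

a = ngrams(original, 2, keep_tail=True)
b = ngrams(suspicious, 2, keep_tail=True)
print(f"{len(a & b)}/{len(a | b)} = {jaccard(a, b):.0%}")  # prints "2/20 = 10%"
```

The two shared 2-grams are "talking with" and "outside of".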
Experiment and Findings-1
Generating Decision Tree
95 instances, 121 attributes
Selecting Features
Build train model
95 instances, 27 attributes
Accuracy: 94.6809 % on J48
Accuracy: 65.9574 % on Naïve Bayes
Accuracy: 71.2766 % on Naïve Bayes
Accuracy: 93.617 % on J48
Accuracy: 89.0052 % on J48
Accuracy: 86.3874 % on Naïve Bayes
Experiment and Findings-2
Generating Decision Tree
95 instances, 121 attributes
Use Filter Metrics
Build train model
95 instances, 26 attributes
Accuracy: 94.6809 % on J48
Accuracy: 65.9574 % on Naïve Bayes
Accuracy: 71.2766 % on Naïve Bayes
Accuracy: 93.617 % on J48
Accuracy: 89.0052 % on J48
Accuracy: 86.3874 % on Naïve Bayes
Future Improvements
• Integrate WordNet with the current framework
• Address paraphrasing
• Address multi-lingual plagiarism detection
