Toxic Comments Classification Using Python

Preliminary Screening
Project Title: Toxic Comment Detection
Presented by: Under the Supervision of:

Contents:
• Introduction
• Motivation
• Problem Statement
• Literature Survey
• Meeting Details
• Workload Distribution
• Project Planning
• Screenshot of Approval of the Certificate of the Project Report
• Methodology Used
• Solution Approach
• Algorithms and Framework
• Outcome Produced
• Proof of the Outcome

Introduction
• Toxic texts are those that are impolite, show disrespect, or tend to drive people away from a conversation.
• If such texts can be identified automatically, discussions on social networks, news websites, and online forums can be healthier.
• Highly toxic texts lead to personal insults, online abuse, and bullying, all of which harm a person's psychological health and emotional well-being.
• Many people stop expressing themselves online because they are afraid of harassment and bullying.
• An automated system is needed to identify, filter, or remove such harmful content from online sites, but building a toxicity identification system is a difficult task for online platform providers.
• Natural language processing (NLP) helps identify toxicity in content expressed as images or text.
• The detection of insulting comments is a critical area of research in natural language processing.
• The primary goal is to assess the toxicity expressed in words and their contexts.
• The objective of this work is to propose a model that classifies texts as toxic or non-toxic with high accuracy.

Motivation
• People refrain from expressing themselves because toxicity on social media affects their emotional and mental well-being.
• A system must be developed to identify such toxicity in texts.

Problem Statement
• To propose a model that helps users stay away from the toxic environment that exists on social media in the form of text.
• To propose a model that identifies toxicity, i.e., classifies texts as toxic or non-toxic.
• To propose a model with high accuracy.

Literature Survey
1. Keeping Children Safe Online with Limited Resources: Analyzing What Is Seen and What Is Heard
   Authors: Aleksandar Jevremovic, Mladen Veinovic, Milan Cabarkapa
   Objective: Designed a framework (Casper) that directly analyzes the content the user sees and hears.
   Algorithms: BERT; for images: CNN, LSTM, BLSTM
   Datasets: (1) Twitter sexism parsed, (2) YouTube parsed, (3) Toxicity parsed, (4) Attack parsed
   Conclusion: (1) Accuracy 95%, (2) Audio accuracy 91%
   Drawbacks: Online grooming and self-harm detection are their future focus.

2. Text Mining and Text Analytics of Research Articles
   Authors: Akshaya Udgave and Prasanna Kulkarni
   Objective: To analyze the use of text mining techniques and to explore recent developments in the field of design science.
   Algorithms: Text mining, NLP
   Conclusion: In the future, different design algorithms would help resolve various issues in the text mining field.
   Drawbacks: Integration of domain information, varying granularity principles, refinement of multilingual text, and ambiguity in handling natural language are major problems and challenges that emerge throughout the text extraction or mining phase.

3. Multilingual Sentiment Analysis and Toxicity Detection for Text Messages in Russian
   Authors: Darya Bogoradnikova, Olesia Makhnytkina, Anton Matveev, Anastasia Zakharova, Artem Akulov
   Objective: The paper discusses an approach to sentiment analysis and emotion identification for user comments.
   Algorithms: (1) Text pre-processing, (2) Data augmentation, (3) Sentiment analysis, (4) Detection of toxic comments, (5) Detection of toxic spans
   Dataset: 1,703 user reviews in Russian from two online education platforms, Coursera and Stepik.
   Conclusion: The authors achieved a complex solution for evaluating users' opinions about online courses.

4. Comment Toxicity Detection via a Multichannel Convolutional Bidirectional Gated Recurrent Unit
   Authors: Ashok Kumar J, Abirami, Tina Esther Trueman, Erik Cambria
   Objective: To detect toxicity using neural networks and ML algorithms.
   Algorithms: Natural language processing, MCBiGRU model, CNN
   Dataset: 223,549 instances with six labels: toxic, obscene, severe toxic, insult, threat, and identity hate. These labels define an instance as toxic or non-toxic.
   Conclusion: Achieves better training and testing accuracy than existing models that use only n-gram word embeddings. The proposed MCBiGRU model outperforms the existing results.
   Drawbacks: None listed.

5. Detecting Islamic Radicalism Arabic Tweets Using Natural Language Processing
   Authors: Khalid T. Mursi, Mohammad D. Alahmadi, Faisal S. Alsubaei, and Ahmed S. Alghamdi
   Objective: To automate the detection of hateful tweets, the authors used advanced machine learning (ML) techniques and performed sentiment analysis to capture the meaning of Arabic words in a suitable word embedding (Word2Vec).
   Algorithm: Word2Vec
   Dataset: 100,000 tweets from the last decade.
   Conclusion: Determined the most frequent terms in the radical tweets of each year, which include some jihadist groups, countries, and individuals. This work can help law enforcement analyze and detect extremism on social media.
   Drawbacks: Small dataset. The paper covers a narrow range of radical keywords.

6. Offensive Language Detection in Arabic Social Networks Using Evolutionary-Based Classifiers Learned From Fine-Tuned Embeddings
   Authors: Fatima Shannaq, Bassam Hammo, Hossam Faris, and Pedro A.
   Objective: Detect offensive tweets using SVM.
   Algorithms: XGBoost, SVM (Support Vector Machine)
   Dataset: ArCybC dataset
   Conclusion: An intelligent prediction system to detect offensive language in Arabic tweets has been presented.
   Drawbacks: The ArCybC dataset is small; effectiveness on a big dataset is yet to be measured.

7. A Framework for Hate Speech Detection Using Deep Convolutional Neural Network
   Author: Pradeep Kumar Roy
   Objective: To monitor users' posts and filter hate-speech-related posts before they spread.
   Algorithm: Deep Convolutional Neural Network (DCNN)
   Conclusion: With 10-fold cross-validation, the proposed DCNN achieved its best prediction recall of 0.88 for hate speech and 0.99 for non-hate speech.
   Drawbacks: It correctly predicts only 53% of the hate speech tweets in the dataset because of the imbalance in the dataset (bias towards non-hate tweets). Images could also be used for the same task.

8. An Assessment of Deep Learning Models and Word Embeddings for Toxicity Detection within Online Textual Comments
   Authors: Danilo Dessì, Diego Reforgiato Recupero, and Harald Sack
   Objective: Uses multiple deep learning models in multiple tests to check the toxicity of text.
   Algorithms: Natural language processing, sentiment analysis, emotion detection; CNN, BERT, LSTM
   Dataset: Kaggle-based dataset.
   Conclusion: The LSTM-based model is the first choice among the tested models for detecting toxicity. Different word embeddings may represent domain knowledge in different ways, and a single model for all cases might be insufficient.
   Drawbacks: Failure of BERT embeddings.

9. Toxic Comments Detection Using LSTM
   Authors: Krishna Dubey, Rahul Nair
   Objective: The paper aims to perform text mining and to use deep learning models that can classify, with near-accurate results, whether a given text is toxic or not.
   Algorithms: ML algorithm, LSTM, NLP, artificial neural network
   Conclusion: Accuracy 94%
   Drawbacks: The results could have been more precise, and the ELMo model has not been widely used for this problem.

10. Detecting Toxic Remarks in Online Conversation
   Author: Pushpit Gautam
   Objective: The project aims to establish a toxicity classification scheme for online comments based on vocabulary and other characteristics of a sentence.
   Dataset: Kaggle competition multi-label Wikipedia talk page edit dataset
   Algorithms: Naïve Bayes, Gaussian Naïve Bayes, Support Vector Machine, Backpropagation neural network
   Conclusion: The label powerset method with multinomial Naïve Bayes could be used to find toxic comments that carry more than one type of toxicity.
   Drawbacks: The dataset had more than 1.5 lakh comments, which caused the kernel to go down frequently with many errors. A further suggestion is an implementation of AdaBoost in the scikit-learn library that could be used directly for multilabel classification problems.
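The label powerset method noted in entry 10 turns a multi-label problem into a multi-class one by treating every distinct combination of labels as a single class. The sketch below is a minimal illustration only: the label order follows the six labels listed in entry 4, and the rows and the helper name `to_powerset_classes` are hypothetical.

```python
# Label powerset: each distinct combination of labels becomes one class id.
# Label names follow the six labels listed in entry 4; the rows are made up.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def to_powerset_classes(label_rows):
    """Map each 0/1 label tuple to a class id; return (class_ids, combo_to_id)."""
    combo_to_id = {}
    class_ids = []
    for row in label_rows:
        combo = tuple(row)
        if combo not in combo_to_id:
            # First time this combination is seen: give it the next free id.
            combo_to_id[combo] = len(combo_to_id)
        class_ids.append(combo_to_id[combo])
    return class_ids, combo_to_id


rows = [
    (0, 0, 0, 0, 0, 0),  # clean comment
    (1, 0, 1, 0, 1, 0),  # toxic + obscene + insult
    (1, 0, 0, 0, 0, 0),  # toxic only
    (1, 0, 1, 0, 1, 0),  # same combination as the second row
]
class_ids, combo_to_id = to_powerset_classes(rows)
print(class_ids)
```

Any multi-class classifier can then be trained on `class_ids`; a predicted class id is mapped back to its label combination by inverting `combo_to_id`.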

11. Detect Toxic Content to Improve Online Conversations
   Authors: Deepshi Mediratta, Nikhil Oswal
   Objective: Train on online text to detect offensive content.
   Algorithms: SVM, Naïve Bayes, GRU, and LSTM
   Conclusion: GRU using GloVe embeddings provided the best result (Accuracy = 89.49, F1 score = 0.72).
   Drawbacks: The dataset provided is highly imbalanced. The data also contains noise, including questions not classified correctly by humans.

12. Convolutional Neural Networks for Toxic Comment Classification
   Author: Spiros V. Georgakopoulos
   Objective: Perform text mining using CNN.
   Algorithms: Convolutional neural network, word2vec
   Conclusion: CNNs can outperform well-established methodologies, providing enough evidence that their use is appropriate for toxic comment classification.
   Drawbacks: The promising results motivate further development of CNN-based methodologies for text mining, in particular employing methods for adaptive learning and providing further comparisons with n-gram-based approaches.

13. Machine Learning Methods for Toxic Comment Classification: A Systematic Review
   Author: Darco Arcocez
   Objective: Toxic comment or reply detection using machine learning.
   Algorithms: RPART, SVM, and GLM
   Conclusion: Evaluated 62 classifiers representing 19 major algorithmic families against features extracted from the Jigsaw dataset of Wikipedia comments. Compared the classifiers based on statistically significant differences in accuracy and relative execution time.

14. A Study of Multilingual Toxic Text Detection Approaches under Imbalanced Sample Distribution
   Authors: Guizhe Song, Degen Huang, and Zhifeng Xiao
   Objective: Use machine learning for toxic text detection in an uneven dataset.
   Algorithms: XLM-RoBERTa; MBERT
   Conclusion: Part of the English training corpus is divided into multiple languages.
   Drawbacks: Sample size reconstruction is required.

Workload Distribution
Serial No.  Team Member       Role to be assigned
1.          Vishwajeet Kumar  Research work, coding, and documentation
2.          Ashwani Tyagi     Coding and related research, product design
3.          Arpit Rao         Research, testing, coding, and product review

Project Planning
1. Topic found
2. Research about the topic
3. Define problem statement
4. Workload distribution
5. Prioritize tasks
6. Read previous years' research papers
7. Implementation
Methodology
1. Collection of data
2. Data preprocessing
3. Data analysis
4. Detection of toxic words based on the proposed work
5. Performance analysis
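The performance analysis step can be made concrete with the usual binary classification metrics. The slides do not say which metrics were used, so the sketch below is an assumption: it computes accuracy, precision, recall, and F1 from hypothetical predictions, with 1 meaning toxic. The function name `binary_metrics` is illustrative.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced sample: 2 toxic comments out of 10.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # one toxic comment is missed
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data such as this sample, accuracy stays high even though half of the toxic comments are missed, which is why recall and F1 are worth reporting alongside accuracy.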

Solution Approach
Pipeline (figure): Raw data -> Text pre-processing -> Feature extraction -> Training data / Test data -> Classification -> Binary classification -> Toxic text / Non-toxic text

• RAW DATA: We first collected the dataset from Kaggle, selecting a Twitter dataset.
• PRE-PROCESSING: We edited, cleansed, and modified the data in this step, as shown.
• FEATURE EXTRACTION: We examined which features are present in the data before training and testing.
• TRAIN and TEST: We divided the dataset into two subsets, train and test.
• CLASSIFICATION: For classification we used Linear Regression, CNN, and LSTM.
• The classifier then labels each text as toxic or non-toxic.
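The first four steps above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the project's actual code: the cleaning rules (lowercasing, dropping URLs and @mentions), the bag-of-words features, the 25% test ratio, and the example tweets are all hypothetical choices.

```python
import random
import re
from collections import Counter


def preprocess(text):
    """Lowercase, drop URLs and @mentions, and keep alphabetic tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    return re.findall(r"[a-z']+", text)


def build_vocab(token_lists, min_count=1):
    """Map each sufficiently frequent word to a column index."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}


def vectorize(tokens, vocab):
    """Bag-of-words count vector; words outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec


def train_test_split(items, test_ratio=0.25, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical examples (1 = toxic, 0 = non-toxic).
data = [
    ("You are an IDIOT @user http://t.co/x", 1),
    ("Thanks for the helpful answer!", 0),
    ("Nobody likes you, idiot", 1),
    ("Great point, I agree", 0),
]
tokens = [preprocess(text) for text, _ in data]
labels = [label for _, label in data]
train, test = train_test_split(list(zip(tokens, labels)))

# The vocabulary is built from the training split only, so no
# information from the test split leaks into the features.
vocab = build_vocab([toks for toks, _ in train])
X_train = [vectorize(toks, vocab) for toks, _ in train]
X_test = [vectorize(toks, vocab) for toks, _ in test]
```

The resulting count vectors can be passed to any of the classifiers named in the classification step.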

Algorithms and Framework
• Machine Learning: Linear Regression
• Deep Learning: CNN, LSTM
• NLP: Semantic Analysis
Outcome Produced
The expected outcome of this project is a research paper, which we have submitted to IEEE Xplore.
