SMS spam classification

SMS spam classification using NLP: Methods, approaches, and applications
By Anisha Agarwal

Introduction
The easy accessibility and simplicity of SMS have made it
attractive to malicious users, which imposes unnecessary costs
on mobile users and jeopardizes secure mobile message
communication.
This article therefore identifies and reviews existing state-of-the-art
methods for SMS spam classification along the following
dimensions: ML and AI methods and techniques, approaches, and
deployment environment.

Implementation
1. Import the required libraries.
2. Data preprocessing.
3. Bag of words.
4. Adding new features, such as the length of the text,
the profanity of the text, and parts of speech (POS).
5. EDA of the dataset.
6. Word tokenization.
7. Implementing different ML classification models, such as
LogisticRegression, MultinomialNB,
RandomForestClassifier, LinearSVC, SGDClassifier, and
GradientBoostingClassifier, and comparing them to
find which model is best for this classification.

Data Preprocessing:
1. Removing unnecessary columns and renaming feature names.
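A minimal sketch of this step, assuming the raw data resembles a CSV export with a label column, a text column, and an empty trailing column. The column names `v1`, `v2`, and `Unnamed: 2` and the sample rows are illustrative assumptions, not taken from the slides.

```python
import pandas as pd

# Toy stand-in for the raw dataset; column names and rows are assumptions.
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["See you at lunch?", "WINNER! Call now to claim your prize", "Ok, on my way"],
    "Unnamed: 2": [None, None, None],
})

# Drop the empty column, then give the remaining columns descriptive names.
df = raw.drop(columns=["Unnamed: 2"]).rename(columns={"v1": "label", "v2": "text"})
print(df.columns.tolist())
```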

Data Preprocessing:
2. Encoding the categorical label (ham or spam) as a number.
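One way to do this with pandas is sketched below. The column names `label` and `label_num`, and the convention ham = 0 and spam = 1, are assumptions made for illustration.

```python
import pandas as pd

# Toy labels; the real dataset would supply this column.
df = pd.DataFrame({"label": ["ham", "spam", "ham"]})

# Map each category to an integer (assumed convention: ham -> 0, spam -> 1).
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})
print(df)
```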

Data Preprocessing:
3. Generating a corpus from the raw SMS messages (stopword removal, lowercasing, stemming).
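The sketch below shows one possible version of this cleaning step. It uses NLTK's `PorterStemmer`, which is an assumption since the slides do not name the stemmer, and a tiny hand-written stopword set in place of a full stopword list. The helper name `clean_message` is hypothetical.

```python
import re
from nltk.stem import PorterStemmer

# Tiny illustrative stopword set (assumption); a full English stopword
# list would normally be used instead.
STOPWORDS = {"you", "are", "the", "a", "to", "is", "and", "of", "your"}
stemmer = PorterStemmer()

def clean_message(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, and stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in STOPWORDS)

messages = ["You are WINNING free prizes", "See you at the office"]
corpus = [clean_message(m) for m in messages]
print(corpus)
```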

Data Preprocessing:
4. Creating a bag-of-words model using CountVectorizer.

Bag of Words: Code to Generate Bag of Words

Code to plot Word Cloud of Spam Words

Code to plot Word Cloud of Ham Words

New Features added: Length of Text
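A minimal sketch of the length feature, assuming the messages sit in a pandas column named `text` (a hypothetical name) and that length is measured in characters.

```python
import pandas as pd

df = pd.DataFrame({"text": ["Ok", "Call now to claim your prize"]})

# Character count of each message, including spaces.
df["length"] = df["text"].str.len()
print(df)
```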

New Features added: Profanity Check
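The slides do not say how profanity was detected. The sketch below is one simple possibility: a lookup against a word list. The list `PROFANE_WORDS` and the helper `has_profanity` are hypothetical placeholders; a real check would rely on a curated lexicon or a dedicated library.

```python
import re

# Hypothetical placeholder word list, kept deliberately mild.
PROFANE_WORDS = {"darn", "heck"}

def has_profanity(text):
    """Return 1 if any token of the message is in the word list, else 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return int(any(t in PROFANE_WORDS for t in tokens))

print(has_profanity("What the HECK is this"))
print(has_profanity("See you soon"))
```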

New Features added: Readability Score
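The slides do not name the readability metric. As an illustration, the sketch below computes the Flesch Reading Ease score, which can go negative for dense text. The syllable counter is a naive vowel-run heuristic and an assumption of this sketch, so its scores will differ from those of a dedicated readability library.

```python
import re

def count_syllables(word):
    # Naive heuristic: each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

simple = flesch_reading_ease("The cat sat on the mat.")
dense = flesch_reading_ease("Congratulations! Unbelievable international opportunity.")
print(simple, dense)
```

Higher scores indicate easier text; long, polysyllabic words push the score down.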

New Features added: Parts of Speech (POS)
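A sketch of turning POS tags into count features. It assumes each message has already been tagged with Penn Treebank tags (for example by a tagger such as NLTK's `pos_tag`, which needs a downloaded model), so the tagged tokens below are written by hand. The helper name `adjective_adverb_counts` is hypothetical.

```python
from collections import Counter

def adjective_adverb_counts(tagged_tokens):
    """Count adjectives (JJ*) and adverbs (RB*) in (word, tag) pairs."""
    counts = Counter()
    for _, tag in tagged_tokens:
        if tag.startswith("JJ"):
            counts["adjectives"] += 1
        elif tag.startswith("RB"):
            counts["adverbs"] += 1
    return counts

# Hand-tagged example message (assumption, not tagger output).
tagged = [("Claim", "VB"), ("your", "PRP$"), ("free", "JJ"),
          ("prize", "NN"), ("now", "RB")]
counts = adjective_adverb_counts(tagged)
print(counts)
```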

Maximum Length of the Text Plotted

Spam and Ham Text against the Length
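The plots themselves are not reproduced here. As a rough numeric counterpart, the sketch below summarizes message length per class with pandas. The rows are invented toy data and the column names are assumptions, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Invented toy messages, for illustration only.
df = pd.DataFrame({
    "label": ["ham", "ham", "spam", "spam"],
    "text": [
        "Ok",
        "See you at lunch",
        "WINNER! Call now to claim your free prize today",
        "Txt STOP to end. Claim your cash prize by calling now",
    ],
})
df["length"] = df["text"].str.len()

# Mean and maximum message length for each class.
summary = df.groupby("label")["length"].agg(["mean", "max"])
print(summary)
```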

Ham Tokenization for first 50 Words:
Output

Spam Tokenization for first 50 Words:
Output
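A minimal sketch of how the most frequent tokens per class could be obtained, assuming a simple regular-expression tokenizer rather than whichever tokenizer the author used. The helper `top_tokens` and the sample messages are hypothetical.

```python
import re
from collections import Counter

def top_tokens(messages, n=50):
    """Return the n most frequent lowercase alphabetic tokens with counts."""
    counts = Counter()
    for message in messages:
        counts.update(re.findall(r"[a-z]+", message.lower()))
    return counts.most_common(n)

# Invented toy messages, for illustration only.
spam = ["Call now to claim your prize", "Txt STOP to end, call to claim"]
ham = ["See you at lunch", "Ok, see you then"]

print(top_tokens(spam, 5))
print(top_tokens(ham, 5))
```

Without stopword removal, function words such as "to" dominate the counts, which is one reason to clean the corpus first.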

Classification Model Data Preparation:
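The sketch below illustrates one way to prepare a train/test split and compare several model pipelines built on `TfidfVectorizer`. The twelve messages are invented, only three of the listed classifiers are shown, and default hyperparameters are assumed, so the scores it prints are not comparable to the results reported in this deck.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy data: 1 = spam, 0 = ham.
texts = [
    "Call now to claim your free prize", "Txt STOP to end this offer",
    "You have won cash, claim today", "Free entry, txt WIN to claim",
    "Urgent! Call to claim your reward", "Win a prize, call this number now",
    "See you at lunch", "Ok, on my way home",
    "Can you pick up some milk", "Running late, start without me",
    "Thanks for dinner last night", "Are we still meeting tomorrow",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Stratify so both classes appear in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
}

scores = {}
for name, clf in models.items():
    # Each pipeline vectorizes the raw text, then fits the classifier.
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    pipe.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, pipe.predict(X_test))

print(scores)
```

Wrapping the vectorizer and classifier in one `Pipeline` keeps the vocabulary fitted on training data only, which avoids leaking test messages into the features.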

Inference
1. We refined the input text by removing stopwords and
punctuation and applying lemmatization, which helped
improve accuracy.
2. We used several model pipelines, each containing a TfidfVectorizer;
the SVM model gave the best accuracy score, 98%.
3. The top tokenized spam words are Call, Txt, Claim, Prize, Stop,
etc. These words indicate a commercial or spam SMS, since they
are uncommon in everyday messages.
4. Spam SMS messages tend to be longer than non-spam SMS
messages.
5. The readability score is lower, or even negative, for spam SMS
than for non-spam SMS.
6. Among the parts of speech considered (adjectives and adverbs),
adjectives are used more frequently in spam SMS than in
non-spam SMS.
