How to add custom stopwords and then remove them from text in NLTK

This recipe helps you add custom stopwords and then remove them from text in NLTK.
Last Updated: 19 Jan 2023

Recipe Objective

In any text, some words contribute little to the meaning, and it is often useful to remove them before further processing. NLTK ships a stopwords corpus that lists the most commonly used words of this kind. If we want to filter out additional words of our own, we can add a custom list to the NLTK list. Let's see how to do it.

Stopwords are words that add little meaning to a sentence or text and can usually be removed without changing its sense. Examples include the, is, have, has and many more.

Step 1 - Import nltk and download stopwords, and then import stopwords from NLTK

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

Step 2 - Let's see the stopword list that ships with NLTK, before adding our custom list

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Step 3 - Create a simple sentence

simple_text = "the city is beautiful, but due to traffic noice polution is increasing on daily basis which is hurting all the people"

Step 4 - Create our custom stopword list to add

new_stopwords = ["all", "due", "to", "on", "daily"]

Step 5 - Add the custom list to the NLTK stopword list

stpwrd = nltk.corpus.stopwords.words('english')
stpwrd.extend(new_stopwords)
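
Three of the five custom words ("all", "to" and "on") already appear in the NLTK list printed in Step 2, so extend() stores them a second time. This does not change the filtering result, but a set union is one way to keep each word once and speed up lookups. The sketch below assumes a short hand-written stand-in (base_stopwords) instead of the real NLTK list so that it runs without any downloads:

```python
# Stand-in for stopwords.words('english'); the real NLTK list is much longer.
base_stopwords = ["the", "is", "but", "which", "all", "to", "on"]
new_stopwords = ["all", "due", "to", "on", "daily"]

# extend() appends every item, so words already in the base list appear twice.
extended = list(base_stopwords)
extended.extend(new_stopwords)
duplicates = sorted({w for w in extended if extended.count(w) > 1})

# A set union keeps each word once and gives fast membership tests.
stopword_set = set(base_stopwords) | set(new_stopwords)

print(duplicates)                        # words stored twice after extend()
print(len(extended), len(stopword_set))  # list length vs. number of unique words
```

With the real list, the same idea would be set(stopwords.words('english')) | set(new_stopwords).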

Step 6 - Download and import the tokenizer from NLTK

# Newer NLTK releases may ask for nltk.download('punkt_tab') instead
nltk.download('punkt')
from nltk.tokenize import word_tokenize

Step 7 - Tokenize the simple text using the word tokenizer

text_tokens = word_tokenize(simple_text)
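
The example sentence is entirely lowercase, so its tokens can be compared with the stopword list directly. The NLTK English list shown in Step 2 contains only lowercase entries, so a token such as "The" would not match it. A minimal sketch of a case-insensitive comparison follows; the token list and the three-word stopword set are hypothetical stand-ins chosen for illustration:

```python
# Hypothetical tokens, as a tokenizer might return them for a capitalised sentence.
tokens = ["The", "city", "is", "beautiful", ",", "But", "traffic", "is", "increasing"]

# Small stand-in for the NLTK English stopword list, which is all lowercase.
stopword_set = {"the", "is", "but"}

# Exact comparison: "The" and "But" are kept because of their capital letters.
case_sensitive = [t for t in tokens if t not in stopword_set]

# Lowercasing each token only for the comparison removes them as well.
case_insensitive = [t for t in tokens if t.lower() not in stopword_set]

print(case_sensitive)
print(case_insensitive)
```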

Step 8 - Remove the stopwords, including the custom ones, and print the result

removing_custom_words = [word for word in text_tokens if word not in stpwrd]
print(removing_custom_words)

['city', 'beautiful', ',', 'traffic', 'noice', 'polution', 'increasing', 'basis', 'hurting', 'people']

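As we can see, every word in the extended stopword list has been removed from our text, including the custom words "due" and "daily" that are not in the default NLTK list. The comma is still present because punctuation is not part of the stopword list.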
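
The output of Step 8 still contains the "," token. If punctuation is unwanted too, the filter can be wrapped in a small helper. The sketch below is one possible approach, not part of NLTK: remove_stopwords, drop_punctuation and sample_stopwords are hypothetical names, and the tokens are written out by hand so that it runs without NLTK data.

```python
import string

def remove_stopwords(tokens, stopword_list, drop_punctuation=True):
    """Return the tokens that are not stopwords (hypothetical helper, not an NLTK API)."""
    stopword_set = set(stopword_list)  # set membership tests are fast
    kept = []
    for token in tokens:
        if token in stopword_set:
            continue
        # Drop tokens made up entirely of ASCII punctuation, such as ','.
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept

# Tokens of the example sentence, typed by hand instead of calling word_tokenize.
text_tokens = ["the", "city", "is", "beautiful", ",", "but", "due", "to",
               "traffic", "noice", "polution", "is", "increasing", "on",
               "daily", "basis", "which", "is", "hurting", "all", "the", "people"]

# Only the stopwords this sentence needs; in the recipe this would be stpwrd.
sample_stopwords = ["the", "is", "but", "which", "all", "due", "to", "on", "daily"]

cleaned = remove_stopwords(text_tokens, sample_stopwords)
print(cleaned)
```

Passing drop_punctuation=False keeps the comma, which reproduces the Step 8 result for this sentence.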
