How to Clean and Prepare Data Using Pandas in Python

Do you want to become a data scientist? Perhaps you have heard someone say, "I want to clean my data, but how?" If that sounds familiar, learning data analysis with Python is a good option for you, and we are here to make it simple.

Data often comes from different sources and is rarely clean. It may contain:

  • Missing values

  • Undesired formats

  • Duplicates

  • Incorrect data types

Working on messy data leads to incorrect results, so you must clean your data before feeding it to a model. This process of identifying and fixing errors, inaccuracies, and inconsistencies is called data cleaning.

Why Is Data Cleaning Essential?

Data cleaning is one of the most important tasks for a data science professional. Incorrect or poor-quality data can distort analyses and the processes that depend on them, while clean data improves productivity and gives decision-makers more reliable information.

Data Cleaning With Pandas

Pandas takes its name from "panel data" and is also a play on "Python data analysis." It is a widely used Python library for data processing, cleaning, manipulation, and analysis. It provides functions for reading and writing CSV files, among other formats. Many data cleaning tools are available, but Pandas offers a fast and effective way to manage and explore data through its two core structures, Series and DataFrame, which represent data efficiently and support many kinds of manipulation. We use a simple dataset for this walkthrough: the Iris species dataset.

1. Loading the Dataset

Load the Iris dataset using Pandas' read_csv() function:

python

import pandas as pd

column_names = ['id', 'sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

iris_data = pd.read_csv('data/Iris.csv', names=column_names, header=0)

iris_data.head()

Passing header=0 together with names tells Pandas that the first row of the file is a header row and replaces that row's labels with the ones in column_names.
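To see the effect of combining header=0 with names, here is a minimal sketch that reads a tiny CSV from an in-memory string instead of data/Iris.csv. The two sample rows and the original header labels are made up for illustration; your file may differ.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for data/Iris.csv.
csv_text = (
    "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n"
    "1,5.1,3.5,1.4,0.2,Iris-setosa\n"
    "2,4.9,3.0,1.4,0.2,Iris-setosa\n"
)

column_names = ['id', 'sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# header=0 marks the first line as a header row; names= replaces its labels.
sample = pd.read_csv(io.StringIO(csv_text), names=column_names, header=0)

print(sample.columns.tolist())
print(len(sample))  # the header line is not counted as a data row
```

Without header=0, the original header line would be read as an ordinary data row, which would turn every column into text.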

2. Explore the Dataset

Get insights about the dataset:

python

iris_data.info()  # info() prints its summary directly and returns None

print(iris_data.describe())

3. Checking Class Distribution

Check class distribution in categorical columns:

python

print(iris_data['species'].value_counts())
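Raw counts are easier to compare as proportions. The sketch below uses a small made-up Series rather than the real species column to show value_counts() with and without normalize=True.

```python
import pandas as pd

# Toy labels for illustration only, not the real Iris column.
species = pd.Series(['setosa', 'setosa', 'versicolor', 'virginica'], name='species')

counts = species.value_counts()                # absolute number of rows per class
shares = species.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

A strongly uneven split here is a hint that the classes are imbalanced, which is worth knowing before training a model.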

4. Removing Missing Values

The Iris dataset has no missing values, so we can skip this step here. If your dataset does have them, drop the affected rows with:

python

iris_data.dropna(inplace=True)
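Before dropping anything, it helps to count how many values are missing in each column. The following sketch runs on a made-up three-row frame and also shows one common alternative to dropping rows, filling a numeric column with its median. The choice of the median is an assumption for illustration, not a recommendation from this walkthrough.

```python
import pandas as pd

# Toy frame with deliberately missing entries (None becomes NaN in the float column).
toy = pd.DataFrame({
    'sepal_length': [5.1, None, 4.7],
    'species': ['setosa', 'setosa', None],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: keep the rows and fill the numeric gap with the column median.
filled = toy.copy()
filled['sepal_length'] = filled['sepal_length'].fillna(filled['sepal_length'].median())

print(len(dropped), len(filled))
```

Dropping is simplest but discards data; filling keeps every row at the cost of introducing estimated values.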

5. Removing Duplicates

Count the duplicate rows, then drop them, keeping the first occurrence of each:

python

duplicate_rows = iris_data.duplicated()

print("Number of duplicate rows:", duplicate_rows.sum())

iris_data = iris_data.drop_duplicates()
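One caveat: when a frame contains a unique id column, no two full rows can ever match, so duplicated() reports zero. A possible workaround, sketched here on made-up data, is to compare only the feature columns through the subset parameter.

```python
import pandas as pd

# Toy frame: rows 1 and 2 share the same measurements but have different ids.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'sepal_length': [5.1, 5.1, 4.9],
    'species': ['setosa', 'setosa', 'setosa'],
})

# Comparing whole rows finds nothing because the id values differ.
full_row_dupes = toy.duplicated().sum()

# Comparing only the feature columns reveals the repeated measurement.
feature_cols = ['sepal_length', 'species']
feature_dupes = toy.duplicated(subset=feature_cols).sum()

# Keep the first occurrence of each repeated measurement.
deduped = toy.drop_duplicates(subset=feature_cols, keep='first')

print(full_row_dupes, feature_dupes, len(deduped))
```

Whether repeated measurements are genuine duplicates or legitimate observations depends on the dataset, so check before removing them.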

6. One-Hot Encoding

Perform one-hot encoding on the species column:

python

encoded_species = pd.get_dummies(iris_data['species'], prefix='species', drop_first=False).astype('int')

iris_data = pd.concat([iris_data, encoded_species], axis=1)

iris_data.drop(columns=['species'], inplace=True)
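To make the result of get_dummies() concrete, here is a small sketch on a made-up species column. It also shows drop_first=True, which drops the first category so the remaining columns are not redundant; whether you need that depends on the model you plan to use.

```python
import pandas as pd

# Toy column for illustration only.
toy = pd.DataFrame({'species': ['setosa', 'versicolor', 'virginica', 'setosa']})

# One indicator column per category, converted to 0/1 integers.
encoded = pd.get_dummies(toy['species'], prefix='species').astype('int')
print(encoded)

# drop_first=True omits the first category (here species_setosa).
reduced = pd.get_dummies(toy['species'], prefix='species', drop_first=True).astype('int')
print(reduced.columns.tolist())
```

In the full encoding each row has exactly one 1; in the reduced encoding a row of all zeros stands for the dropped category.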

7. Normalization of Float Value Columns

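Standardize the numerical features. StandardScaler rescales each column to zero mean and unit variance (z-score standardization), which is often loosely called normalization: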

python

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

cols_to_normalize = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

iris_data[cols_to_normalize] = scaler.fit_transform(iris_data[cols_to_normalize])
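If you would rather not depend on scikit-learn, the same z-score scaling can be sketched in plain Pandas. StandardScaler divides by the population standard deviation, so ddof=0 is needed to match it (Pandas defaults to ddof=1). The numbers below are made up for illustration.

```python
import pandas as pd

# Toy measurements for illustration only.
values = pd.DataFrame({
    'sepal_length': [4.0, 5.0, 6.0],
    'petal_length': [1.0, 2.0, 3.0],
})

# z-score: subtract the column mean, divide by the population standard deviation.
standardized = (values - values.mean()) / values.std(ddof=0)

print(standardized)
print(standardized.mean())
print(standardized.std(ddof=0))
```

After scaling, every column has mean 0 and standard deviation 1 (up to floating-point rounding), so features measured on different scales become comparable.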

8. Save the Cleaned Dataset

Save the cleaned dataset:

python

iris_data.to_csv('cleaned_iris.csv', index=False)
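The index=False argument matters when the file is read back. This sketch writes a made-up two-row frame to in-memory buffers instead of cleaned_iris.csv and reloads it both ways to show the difference.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned frame.
cleaned = pd.DataFrame({'sepal_length': [0.5, -0.5], 'species_setosa': [1, 0]})

# With index=False only the real columns are written.
without_index = io.StringIO()
cleaned.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

# With the default index=True the row labels are written as an extra, unnamed column.
with_index = io.StringIO()
cleaned.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

print(reloaded.columns.tolist())
print(reloaded_with_index.columns.tolist())
```

Leaving the index out keeps the saved file identical in shape to the frame you cleaned.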

Data cleaning with Pandas is a key part of data analysis with Python. Once the cleaning is done, you can explore trends and visualize the data effectively, and the steps above make the resulting analysis more accurate.

Wrapping Up

Congratulations! You have cleaned your first dataset. You may face further challenges with more intricate datasets, but the basic techniques outlined here will help you get started with preparing data for analysis in Python. Still finding it difficult? Consider enrolling in the Free Online Data Analysis with Python Courses offered by Hadi E Learning. Learn programming from industry experts from home, for free. Enrol now, as seats are limited for each session.
