How to Clean and Prepare Data Using Pandas in Python
Do you want to become a data scientist? Then sooner or later you will hear someone say, "I need to clean my data." But how? Learning data analysis with Python is a good place to start, and this guide keeps it simple.
Data often comes from several different sources and is rarely clean. It may contain:
Missing values
Inconsistent or undesired formats
Duplicates
Incorrect data types (for example, numbers stored as text)
Working on messy data leads to incorrect results, so you must clean your data before feeding it to a model. This process of identifying and fixing potential errors, inaccuracies, and inconsistencies is called data cleaning.
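As a rough illustration of how the problems listed above can be spotted, here is a minimal sketch. The `raw` table and its values are invented for this example and are not part of the Iris walkthrough.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara"],
    "age": ["31", "45", "45", None],   # numbers stored as text, one value missing
    "city": ["Lahore", "Karachi", "Karachi", "Multan"],
})

missing_per_column = raw.isna().sum()           # missing values in each column
duplicate_count = int(raw.duplicated().sum())   # rows identical to an earlier row
column_types = raw.dtypes                       # 'age' is not a numeric type here

print(missing_per_column)
print("Duplicate rows:", duplicate_count)
print(column_types)
```

Each check maps to one item in the list: `isna()` finds missing values, `duplicated()` finds repeated rows, and `dtypes` reveals columns stored in the wrong type.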
Why Is Data Cleaning Essential?
Data cleaning with Pandas in Python is one of the most important tasks a data science professional undertakes. Incorrect or poor-quality data can undermine processes and analyses. Clean data improves productivity and gives decision-makers better-quality information.
Data Cleaning With Pandas
Pandas is an open-source Python library for data analysis; its name comes from "panel data" and is also a play on "Python data analysis." It is widely used for data processing, cleaning, manipulation, and analysis, and it provides functions for reading and writing CSV files along with many other formats. Many data cleaning tools are available, but Pandas offers a fast and effective way to manage and explore data. It does this through two core structures, Series and DataFrame, which represent data efficiently and let you manipulate it in many ways. We use a simple dataset for this walkthrough: the Iris species dataset.
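To make the two structures concrete, here is a small sketch. The values are illustrative only and are not read from the Iris file.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
sepal_length = pd.Series([5.1, 4.9, 4.7], name="sepal_length")

# A DataFrame is a table; each of its columns is a Series sharing one index.
flowers = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "species": ["setosa", "setosa", "setosa"],
})

print(sepal_length.mean())
print(flowers.shape)              # (rows, columns)
print(type(flowers["species"]))   # selecting one column gives back a Series
```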
1. Loading the Dataset
Load the Iris dataset using Pandas' read_csv() function:
python
import pandas as pd

# Readable column names that replace the ones in the file's header row
column_names = ['id', 'sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris_data = pd.read_csv('data/Iris.csv', names=column_names, header=0)
iris_data.head()
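The header=0 parameter tells Pandas that the first row of the file is a header row. Because names is also passed, that row is discarded and the columns are labelled with column_names instead.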
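The effect of header=0 is easier to see on a tiny in-memory file. In this sketch, `csv_text` and `new_names` are hypothetical stand-ins for the real file and column list.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with its own header line, for illustration only.
csv_text = "Id,SepalLengthCm,Species\n1,5.1,Iris-setosa\n2,4.9,Iris-setosa\n"
new_names = ["id", "sepal_length", "species"]

# header=0: the first line is treated as a header and replaced by new_names.
replaced = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

# No header argument: when names is given, every line is treated as data,
# so the file's own header line ends up as the first data row.
kept = pd.read_csv(io.StringIO(csv_text), names=new_names)

print(len(replaced))
print(len(kept))
print(kept.head())
```

Leaving out header=0 would therefore add a stray row of text to the data and turn the numeric columns into text.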
2. Explore the Dataset
Get insights about the dataset:
python
iris_data.info()  # prints dtypes and non-null counts itself; it returns None
print(iris_data.describe())
3. Checking Class Distribution
Check class distribution in categorical columns:
python
print(iris_data['species'].value_counts())
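Raw counts can be hard to compare, so proportions are sometimes more convenient for judging class balance. This is a minimal sketch on an invented `species` Series, not on the real file.

```python
import pandas as pd

# Hypothetical, deliberately imbalanced labels for illustration.
species = pd.Series(["setosa", "setosa", "setosa", "virginica"])

counts = species.value_counts()                 # absolute counts per class
shares = species.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares)
```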
4. Removing Missing Values
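This copy of the Iris dataset has no missing values (the non-null counts reported by info() match the row count), so we can skip this step. If your data does have missing values, you can drop the affected rows with: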
python
iris_data.dropna(inplace=True)
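Dropping rows is not the only option. The sketch below counts missing values and compares dropping with filling on a hypothetical three-row table. Using the median as the fill value is an assumption made for the example, not a recommendation from the original walkthrough.

```python
import pandas as pd

# Hypothetical measurements with one gap, for illustration only.
df = pd.DataFrame({
    "petal_length": [1.4, None, 1.3],
    "species": ["setosa", "setosa", "setosa"],
})

print(df.isna().sum())   # missing values per column

# Option 1: drop rows that contain any missing value (returns a new frame).
dropped = df.dropna()

# Option 2: keep every row and fill the gap with the column median.
filled = df.fillna({"petal_length": df["petal_length"].median()})

print(len(dropped), len(filled))
```

Dropping is simplest when only a few rows are affected. Filling keeps more data but introduces values that were never measured.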
5. Removing Duplicates
Check for duplicates:
python
duplicate_rows = iris_data.duplicated()
print("Number of duplicate rows:", duplicate_rows.sum())
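The check above only counts duplicates. A sketch of actually removing them follows, on an invented table. Note that when a frame has a unique id column, whole-row comparison will probably find no duplicates, so this example restricts the comparison with `subset`. Whether two flowers with identical measurements should count as duplicates is a judgment call for your data.

```python
import pandas as pd

# Hypothetical rows: ids are unique, but rows 1 and 3 share the same values.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "sepal_length": [5.1, 4.9, 5.1],
    "species": ["setosa", "setosa", "setosa"],
})

whole_row = int(df.duplicated().sum())   # the unique id hides the repeat

value_cols = ["sepal_length", "species"]
by_values = int(df.duplicated(subset=value_cols).sum())

# Keep the first occurrence of each repeated combination of values.
deduped = df.drop_duplicates(subset=value_cols, keep="first")

print(whole_row, by_values, len(deduped))
```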
6. One-Hot Encoding
Perform one-hot encoding on the species column:
python
encoded_species = pd.get_dummies(iris_data['species'], prefix='species', drop_first=False).astype('int')
iris_data = pd.concat([iris_data, encoded_species], axis=1)
iris_data.drop(columns=['species'], inplace=True)
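To see what the encoding produces, and what the drop_first flag changes, here is a small sketch on an invented `species` Series. With drop_first=True the first category is left out, and it is implied when all remaining columns are 0.

```python
import pandas as pd

# Hypothetical labels for illustration.
species = pd.Series(["setosa", "versicolor", "virginica", "setosa"])

# One column per category (the setting used in the walkthrough).
full = pd.get_dummies(species, prefix="species", drop_first=False).astype(int)

# One column fewer: the first category is dropped.
reduced = pd.get_dummies(species, prefix="species", drop_first=True).astype(int)

print(full)
print(reduced)
```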
7. Normalization of Float Value Columns
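Standardize the numerical features so that each has zero mean and unit variance (StandardScaler performs standardization rather than min-max normalization):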
python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
cols_to_normalize = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_data[cols_to_normalize] = scaler.fit_transform(iris_data[cols_to_normalize])
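For intuition, the same transformation can be written by hand as (x - mean) / std. This sketch uses three invented values and plain Pandas. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses, whereas the Pandas default is ddof=1.

```python
import pandas as pd

# Hypothetical values for illustration.
values = pd.Series([2.0, 4.0, 6.0], name="petal_length")

# Subtract the mean, then divide by the population standard deviation.
z = (values - values.mean()) / values.std(ddof=0)

print(z)
print(z.mean(), z.std(ddof=0))
```

After the transformation the column has a mean of 0 and a standard deviation of 1, up to floating-point rounding.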
8. Save the Cleaned Dataset
Save the cleaned dataset:
python
iris_data.to_csv('cleaned_iris.csv', index=False)
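A quick way to see why index=False matters is to write to a string instead of a file and read it back. The small frame here is hypothetical. Calling to_csv() without a path returns the CSV text.

```python
import io
import pandas as pd

# Hypothetical cleaned frame for illustration.
df = pd.DataFrame({"sepal_length": [0.5, -0.5], "species_setosa": [1, 0]})

# index=False: only the data columns are written.
restored = pd.read_csv(io.StringIO(df.to_csv(index=False)))

# Default index=True: the row index is written as an extra, unnamed column.
with_index = pd.read_csv(io.StringIO(df.to_csv()))

print(list(restored.columns))
print(list(with_index.columns))
```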
Data cleaning with Pandas is a key part of data analysis with Python. Once the data is clean, you can explore trends and visualize it effectively, and the results of your analysis become more reliable.
Wrapping Up
Congratulations! You have cleaned your first dataset. You may face further challenges with more intricate datasets, but the basic techniques outlined here will help you get started with data analysis in Python. Still finding it difficult? Consider enrolling in the Free Online Data Analysis with Python Courses offered by Hadi E Learning. Learn programming from industry experts from home, for free. Enrol now, as seats are limited for each session.