Data Preprocessing
What is Data Preprocessing?
Data preprocessing is a key aspect of data preparation. It refers to any processing
applied to raw data to ready it for further analysis or processing tasks.
Thus, data preprocessing may be defined as the process of converting raw data into
a format that can be processed more efficiently and accurately in tasks such as:
• Data analysis
• Machine learning
• Data science
• AI
Steps in Data Preprocessing
• Step 1: Data cleaning
• Handling missing values
• Removing duplicates
• Correcting inconsistent formats
• Step 2: Data integration
• Schema matching
• Data deduplication
• Step 3: Data transformation
• Scaling and normalization
• Encoding categorical variables
• Feature engineering and extraction
• Step 4: Data reduction
• Feature selection
• Principal component analysis (PCA), sketched briefly after this list
• Sampling methods
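Data reduction is not revisited later in these notes, so here is a minimal sketch of PCA with scikit-learn. The numeric values below are made up purely for illustration.
import numpy as np
from sklearn.decomposition import PCA
# Made-up data: 6 samples with 3 correlated numeric features
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.5],
              [2.3, 2.7, 1.0]])
# Keep only the 2 components that explain most of the variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced)                       # each row now has 2 values instead of 3
print(pca.explained_variance_ratio_)   # share of variance captured by each component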
Data Cleaning Tool:
Handling Missing Values
# Required imports (pandas and scikit-learn's SimpleImputer are assumed)
import pandas as pd
from sklearn.impute import SimpleImputer
# Creating a manual dataset
data = pd.DataFrame({
'name': ['John', 'Jane', 'Jack', 'John', None],
'age': [28, 34, None, 28, 22],
'purchase_amount': [100.5, None, 85.3, 100.5, 50.0],
'date_of_purchase': ['2023/12/01', '2023/12/02', '2023/12/01', '2023/12/01', '2023/12/03']
})
# Handling missing values using mean imputation for 'age' and 'purchase_amount'
imputer = SimpleImputer(strategy='mean')
data[['age', 'purchase_amount']] = imputer.fit_transform(data[['age', 'purchase_amount']])
# Removing duplicate rows
data = data.drop_duplicates()
print(data)
• Missing Values:
This occurs when a particular variable lacks data points, resulting in incomplete
information and potentially harming the accuracy and dependability of your
models.
Types of Missing Values
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)
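Whatever the missingness mechanism, a practical first step is to quantify how much data is missing in each column before choosing a strategy. A minimal sketch (the column names here are hypothetical):
import pandas as pd
import numpy as np
# Hypothetical DataFrame with some missing entries
df = pd.DataFrame({'age': [28, np.nan, 34, 22],
                   'salary': [50000, 60000, np.nan, np.nan]})
print(df.isnull().sum())          # number of missing values per column
print(df.isnull().mean() * 100)   # percentage of missing values per column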
Effective Strategies for Handling Missing Values in Data Analysis
import pandas as pd
import numpy as np
# Creating a sample DataFrame with missing values
data = {
'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln', '101 Elm St', np.nan, '222 Maple Rd', '444 Cedar Blvd', '555 Birch Dr'],
'City': ['Los Angeles', 'New York', 'Houston', 'Los Angeles', 'Miami', np.nan, 'Houston', 'New York'],
'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)
# Removing rows with missing values
df_cleaned = df.dropna()
# Displaying the DataFrame after removing missing values
print("\nDataFrame after removing rows with missing values:")
print(df_cleaned)
Imputation Methods
• Mean
• Median and
• Mode
Mean Imputation:
Step 1 - Import the libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing; SimpleImputer is its replacement
Step 2 - Setting up the Data
df = pd.DataFrame()
df['C0'] = [0.2601,0.2358,0.1429,0.1259,0.7526,
0.7341,0.4546,0.1426,0.1490,0.2500]
df['C1'] = [0.7154,np.nan,0.2615,0.5846,np.nan,
0.8308,0.4962,np.nan,0.5340,0.6731]
print(df)
Step 3 - Using SimpleImputer to fill the NaN values with the mean
• missing_values : the placeholder that marks missing entries; for pandas data this is np.nan.
• strategy : the imputation strategy, one of 'mean', 'median', 'most_frequent' or 'constant'. The default is 'mean'.
• fill_value : None by default. It is used only when strategy='constant', and is the value written into every missing position.
• Note: the older Imputer class also took an axis argument (0 for columns, 1 for rows); SimpleImputer always imputes column-wise.
miss_mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
miss_mean_imputer = miss_mean_imputer.fit(df)
imputed_df = miss_mean_imputer.transform(df.values)
print(imputed_df)
The NaN values have been filled with the mean of their respective columns.
Median Imputation:
• It is the middle value of a dataset when it is ordered from lowest to highest.
• If there is an even number of values, the median is the average of the two middle
values.
• Unlike the mean, the median is not affected by outliers, making it a more reliable measure for skewed distributions.
• Median: data=data.fillna(data.median())
#replacing missing values in quantity
# column with mean of that column
data['quantity'] = data['quantity'].fillna(data['quantity'].mean())
# replacing missing values in price column
# with median of that column
data['price'] = data['price'].fillna(data['price'].median())
print(data)
Mode imputation:
• Mode - The most common value
from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = stats.mode(speed)
print(x)
#The mode() method returns a ModeResult object that contains the mode
number (86), and count (how many times the mode number appeared (3)).
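The snippet above only computes the mode; to actually impute with it, a common pandas pattern is to fill missing entries with the column's mode. A small sketch (the 'city' column is hypothetical):
import pandas as pd
import numpy as np
data = pd.DataFrame({'city': ['Pune', 'Delhi', np.nan, 'Pune', np.nan]})
# Series.mode() can return several values if there is a tie, so take the first one
data['city'] = data['city'].fillna(data['city'].mode()[0])
print(data)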
Skewed Data
• "Skewed data" means that the distribution of data points in a dataset is uneven, with a noticeable concentration of values on one side of the distribution and a "tail" extending towards the other side, making the data appear distorted or asymmetrical when visualized on a graph.
• Types of skew:
• Positive skew (right skew): The tail extends towards the higher
values on the right side of the graph.
• Negative skew (left skew): The tail extends towards the lower values
on the left side of the graph.
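Skew can also be checked numerically: pandas' skew() (or scipy.stats.skew) returns a positive value for right skew and a negative value for left skew. A brief sketch on made-up data:
import pandas as pd
# Made-up right-skewed sample: mostly small values with a long tail of large ones
values = pd.Series([10, 12, 12, 13, 14, 15, 18, 20, 45, 90])
print(values.skew())                    # positive value indicates right (positive) skew
print(values.mean(), values.median())   # the mean is pulled above the median by the tail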
Data Integration
• Data integration is a process where data from many sources goes to a
single centralized location, which is often a data warehouse.
• The end location needs to be flexible enough to handle lots of different
kinds of data at potentially large volumes.
• Data integration is ideal for powering analytical use cases.
Data Loading
Load data from various sources using respective libraries
import pandas as pd
# Load CSV
df_csv = pd.read_csv('data.csv')
# Load Excel
df_excel = pd.read_excel('data.xlsx')
# Load JSON
df_json = pd.read_json('data.json')
Data Cleaning and Transformation
Use Pandas or other libraries to clean and normalize the data
# Drop null values
df = df.dropna()
# Rename columns
df = df.rename(columns={'OldName': 'NewName'})
# Standardize formats
df['date'] = pd.to_datetime(df['date'])
Combining Data
• Concatenation: Stack datasets vertically.
combined_df = pd.concat([df1, df2])
• Merging: Combine datasets based on a key.
merged_df = pd.merge(df1, df2, on='common_key')
Data transformation
Encoding
1.Label Encoding
2.One-hot Encoding
3.Ordinal Encoding
Label Encoding
• Label Encoding is a technique that is used to convert categorical columns into
numerical ones so that they can be fitted by machine learning models which only
take numerical data.
• It is an important pre-processing step in a machine-learning project.
Example Of Label Encoding
Steps for Label Encoding
Using sklearn.preprocessing.LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Sample categorical data
categories = ['cat', 'dog', 'mouse', 'dog', 'cat', 'mouse']
# Initialize LabelEncoder
encoder = LabelEncoder()
# Fit and transform the data
encoded_labels = encoder.fit_transform(categories)
print(encoded_labels) # Output: [0 1 2 1 0 2]
# Get the mapping
print(encoder.classes_) # Output: ['cat' 'dog' 'mouse']
# Decode back to original categories
decoded_labels = encoder.inverse_transform(encoded_labels)
print(decoded_labels) # Output: ['cat' 'dog' 'mouse' 'dog' 'cat' 'mouse']
Using pandas.factorize
import pandas as pd
# Sample categorical data
categories = ['cat', 'dog', 'mouse', 'dog', 'cat', 'mouse']
# Encode labels
encoded_labels, uniques = pd.factorize(categories)
print(encoded_labels) # Output: [0 1 2 1 0 2]
print(uniques) # Output: Index(['cat', 'dog', 'mouse'], dtype='object')
Key Points to Consider
1. Order Sensitivity:
•Label encoding assumes an ordinal relationship between categories (e.g., 0 < 1 < 2). This is fine
for ordered categories like "low", "medium", and "high".
•For unordered categories (e.g., "dog", "cat"), this may mislead algorithms into interpreting
numerical relationships. In such cases, use one-hot encoding instead.
2. Decoding:
If you need to map back to the original categories, ensure you retain the mapping (via LabelEncoder.classes_ or a similar mechanism).
One Hot Encoding
• One Hot Encoding in machine learning transforms categorical data into a
numerical format that machine learning algorithms can process without imposing
any ordinal relationships.
• It creates new binary columns (0s and 1s) for each category in the original
variable. Each category in the original column is represented as a separate column,
where a value of 1 indicates the presence of that category, and 0 indicates its
absence.
How One-Hot Encoding Works: An Example
• Wherever the fruit is “Apple,” the Apple column will have a value of 1, while the
other fruit columns (like Mango or Orange) will contain 0.
• This pattern ensures that each categorical value gets its own column, represented
with binary values (1 or 0), making it usable for machine learning models.
Fruit | Categorical value of fruit | Price
apple | 1 | 5
mango | 2 | 10
apple | 1 | 15
orange | 3 | 20
The output after applying one-hot encoding on the data is given as follows,
Fruit_apple | Fruit_mango | Fruit_orange | price
1 | 0 | 0 | 5
0 | 1 | 0 | 10
1 | 0 | 0 | 15
0 | 0 | 1 | 20
Using Pandas get_dummies()
import pandas as pd
# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)
# Applying one-hot encoding
df_encoded = pd.get_dummies(df, dtype=int)
# Displaying the encoded DataFrame
print(df_encoded)
Using Scikit-learn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Creating the encoder
enc = OneHotEncoder(handle_unknown='ignore')
# Sample data
X = [['Red'], ['Green'], ['Blue']]
# Fitting the encoder to the data
enc.fit(X)
# Transforming new data
result = enc.transform([['Red']]).toarray()
# Displaying the encoded result
print(result)
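The third technique listed earlier, ordinal encoding, maps categories to integers according to an explicit order. A minimal sketch using scikit-learn's OrdinalEncoder; the 'low'/'medium'/'high' categories are illustrative only:
from sklearn.preprocessing import OrdinalEncoder
# Explicit category order: low < medium < high
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X = [['low'], ['high'], ['medium'], ['low']]
encoded = encoder.fit_transform(X)
print(encoded)              # [[0.] [2.] [1.] [0.]]
print(encoder.categories_)  # the stored category order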
Data Normalization
• Data normalization is a technique used to transform the values of a dataset into a
common scale.
• This is important because many machine learning algorithms are sensitive to the
scale of the input features and can produce better results when the data is
normalized.
Need for Normalization
There are several different normalization techniques that can be used (a brief Z-score sketch follows this list):
1. Min-Max normalization
2. Z-score normalization
3. Decimal Scaling
4. Logarithmic transformation
5. Root transformation
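Min-Max scaling is detailed below; as a quick illustration of Z-score normalization (technique 2), each value is rescaled to (x - mean) / std, which scikit-learn's StandardScaler implements. A minimal sketch with made-up numbers:
import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[10.0], [20.0], [30.0], [40.0]])   # one numeric feature, made-up values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())                  # rescaled values with mean 0 and unit variance
print(X_scaled.mean(), X_scaled.std())   # approximately 0.0 and 1.0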
Min-Max scaling
• Min-Max scaling is a technique in data preprocessing where all values within a
feature are linearly transformed to fall within a specified range, usually between 0
and 1, by calculating the proportion of each value relative to the minimum and
maximum values in that feature
• Essentially, it scales the data based on its relative position within the original
range, making it suitable for algorithms sensitive to feature scale differences
Key points about Min-Max scaling:
• To normalize a value 'x' using min-max scaling, the formula is:
x_scaled = (x - min) / (max - min)
where 'min' is the minimum value in the feature and 'max' is the maximum value.
When to use Min-Max scaling:
• When you need to scale features to a specific range (like 0 to 1) and preserve
relative relationships between data points.
• When dealing with datasets where features have significantly different scales and
you want to ensure no single feature dominates the analysis.
Example:
Imagine a dataset with a feature "Temperature" ranging from 10 to 40 degrees
Celsius. To normalize this using min-max scaling:
• Min value: 10
• Max value: 40
• To normalize a temperature of 25 degrees: Calculation: (25 - 10) / (40 - 10) = 0.5.
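The same calculation in code, using the temperature values above; scikit-learn's MinMaxScaler is shown as one common implementation:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
temperatures = np.array([[10.0], [25.0], [40.0]])
# Manual formula for 25 degrees
print((25 - 10) / (40 - 10))   # 0.5
# Equivalent result with MinMaxScaler
scaler = MinMaxScaler()
print(scaler.fit_transform(temperatures).ravel())   # [0.  0.5 1. ]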
Benefits:
• Standardizes feature scales
• Preserves relative relationships
Drawbacks
• Outlier sensitivity
• Not robust to noise
Outliers
• Outliers are data points that are significantly different from the majority of other
data points in a set.
• They can be higher or lower than the other values in the set.
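A common way to flag such points is the interquartile range (IQR) rule: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are treated as outliers. A minimal sketch on made-up data:
import pandas as pd
values = pd.Series([12, 13, 14, 15, 15, 16, 17, 18, 95])   # 95 is an obvious outlier
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)   # expected to flag 95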
Important Questions:
• What is data preprocessing in the context of machine learning?
• Explain Data Cleaning with Python's pandas Library.
• What are common data quality issues you might encounter?
• How do you handle missing data within a dataset?
• What is the difference between imputation and deletion of missing values?
• What is one-hot encoding, and when should it be used? Explain with an example.
• Write a NumPy program to compute the mean, standard deviation, and median of a given array. Sample array: [0 1 2 3 4 5]
1. import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Age': [22, 25, 27, np.nan, 30, 35],
'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
'Salary': [50000, 60000, np.nan, 80000, 75000, 90000]
}
df = pd.DataFrame(data)
# Display the original data
print("Original Data:")
print(df)
# 1. Handling missing data: fill missing values in the 'Age' column with the mean
df['Age'].fillna(df['Age'].mean(), inplace=True)
# Fill missing values for 'Salary' column with the median
df['Salary'].fillna(df['Salary'].median(), inplace=True)
# 2. Encoding categorical data (e.g., the 'Gender' column)
# Convert 'Gender' to numerical values (Male = 0, Female = 1)
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
# 3. Feature scaling (standardization of 'Salary' column)
# Standardize the 'Salary' column using z-score normalization
df['Salary'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
# Display the preprocessed data
print("\nPreprocessed Data:")
print(df)
2.
• Handling missing values:
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, np.nan, 30, 22, np.nan],
'Salary': [50000, 60000, np.nan, 80000, 75000]}
df = pd.DataFrame(data)
# Checking for missing values
print(df.isnull()) # True for NaN values
• Filling Missing Data
# Filling missing values with the mean of the column
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Salary'].fillna(df['Salary'].mean(), inplace=True)
print("\nAfter Filling Missing Values:")
print(df)
• Dropping Missing Data
# Drop rows with missing values
df.dropna(inplace=True)
print("\nAfter Dropping Rows with Missing Data:")
print(df)
• Removing Duplicates
# Sample DataFrame with duplicates
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Alice'],
'Age': [25, 30, 35, 30, 25],
'Salary': [50000, 60000, 70000, 60000, 50000]}
df = pd.DataFrame(data)
# Removing duplicate rows
df.drop_duplicates(inplace=True)
print("\nAfter Removing Duplicates:")
print(df)
6.
import pandas as pd
# Sample data
data = pd.DataFrame({'Size': ['S', 'M', 'M', 'L', 'S', 'L']})
# One-hot encoding
one_hot_encoded = pd.get_dummies(data, columns=['Size'])
print(one_hot_encoded)
7.
import numpy as np
# Given array
arr = np.array([0, 1, 2, 3, 4, 5])
# Compute mean, standard deviation, and median
mean = np.mean(arr)
std_dev = np.std(arr)
median = np.median(arr)
# Print results
print("Mean:", mean)
print("Standard Deviation:", std_dev)
print("Median:", median)
Questions:
1. How do you calculate the mean, median, and mode using Python?
2. How do you calculate the weighted mean in Python?
3. How do you find the mode for categorical data in Python?
4. How do you identify if the data is skewed using mean and median?
How do you calculate the mean, median, and mode using Python?
import numpy as np
from scipy import stats
# Sample data
data = [10, 20, 20, 30, 40, 50, 100]
# Mean
mean = np.mean(data)
print("Mean:", mean)
# Median
median = np.median(data)
print("Median:", median)
# Mode
mode = stats.mode(data, keepdims=True)  # keepdims=True keeps array-style output in recent SciPy versions, so [0] indexing works
print("Mode:", mode.mode[0], "Count:", mode.count[0])
How do you calculate the weighted mean in Python?
import numpy as np
# Data and weights
data = [10, 20, 30, 40]
weights = [1, 2, 3, 4]
# Weighted mean
weighted_mean = np.average(data, weights=weights)
print("Weighted Mean:", weighted_mean)
How do you find the mode for categorical data in Python?
from collections import Counter
# Categorical data
categories = ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']
# Mode: the most frequent category and its count
# (Counter is used because scipy.stats.mode may not accept non-numeric data in recent SciPy versions)
mode_value, mode_count = Counter(categories).most_common(1)[0]
print("Mode:", mode_value, "Count:", mode_count)
How do you identify if the data is skewed using mean and median?
import numpy as np
data = [10, 20, 30, 40, 100] # Right-skewed data
mean = np.mean(data)
median = np.median(data)
if mean > median:
print("The data is right-skewed.")
elif mean < median:
print("The data is left-skewed.")
else:
print("The data is symmetric.")
1. Write Python code using the pandas library to create binary columns for each category, where the dataset is data = {'Vehicle': ['BUS', 'VAN', 'TRAIN', 'BUS', 'CYCLE']}
2. Write Python code using the pandas library to delete duplicate values.
3. Write a NumPy program to compute the mean and the median of a given array: data = [1, 2, 2, 3, 4, 5, 10]
4. Create a DataFrame where
data = pd.DataFrame({
'name': ['John', 'Jane', 'Jack', 'John', None],
'age': [28, 34, None, 28, 22],
'purchase_amount': [100.5, None, 85.3, 100.5, 50.0],
'date_of_purchase': ['2023/12/01', '2023/12/02', '2023/12/01', '2023/12/01', '2023/12/03']
})
Write a program in python using the Mean imputation method to replace the missing values.