Data Preprocessing
What is Data Preprocessing?
Data preprocessing is a key aspect of data preparation. It refers to any processing
applied to raw data to ready it for further analysis or processing tasks.
Thus, data preprocessing may be defined as the process of converting raw data into
a format that can be processed more efficiently and accurately in tasks such as:
• Data analysis
• Machine learning
• Data science
• AI
Steps in Data Preprocessing
• Step 1: Data cleaning
• Handling missing values
• Removing duplicates
• Correcting inconsistent formats
• Step 2: Data integration
• Schema matching
• Data deduplication
• Step 3: Data transformation
• Scaling and normalization
• Encoding categorical variables
• Feature engineering and extraction
• Step 4: Data reduction
• Feature selection
• Principal component analysis (PCA)
• Sampling methods
Data Cleaning Tool:
Handling Missing Values
import pandas as pd
from sklearn.impute import SimpleImputer

# Creating a manual dataset
data = pd.DataFrame({
    'name': ['John', 'Jane', 'Jack', 'John', None],
    'age': [28, 34, None, 28, 22],
    'purchase_amount': [100.5, None, 85.3, 100.5, 50.0],
    'date_of_purchase': ['2023/12/01', '2023/12/02', '2023/12/01', '2023/12/01', '2023/12/03']
})

# Handling missing values using mean imputation for 'age' and 'purchase_amount'
imputer = SimpleImputer(strategy='mean')
data[['age', 'purchase_amount']] = imputer.fit_transform(data[['age', 'purchase_amount']])

# Removing duplicate rows
data = data.drop_duplicates()
print(data)
• Missing Values:
These occur when a variable lacks data points for some observations, resulting in incomplete
information that can harm the accuracy and reliability of your models.
Types of Missing Values
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)
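Whatever the mechanism, a useful first step is to quantify how much data is missing in each column. A minimal sketch using pandas (the columns below are illustrative, not from a specific dataset):
import pandas as pd
import numpy as np
# Illustrative DataFrame with some missing entries
df = pd.DataFrame({'age': [28, 34, np.nan, 28, 22],
                   'purchase_amount': [100.5, np.nan, 85.3, 100.5, 50.0]})
# Count and percentage of missing values per column
print(df.isnull().sum())
print(df.isnull().mean() * 100)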
Effective Strategies for Handling Missing Values in Data
Analysis
import pandas as pd
import numpy as np
# Creating a sample DataFrame with missing values
data = {
'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln', '101 Elm St', np.nan, '222 Maple Rd', '444 Cedar Blvd', '555 Birch Dr'],
'City': ['Los Angeles', 'New York', 'Houston', 'Los Angeles', 'Miami', np.nan, 'Houston', 'New York'],
'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)
# Removing rows with missing values
df_cleaned = df.dropna()
# Displaying the DataFrame after removing missing values
print("nDataFrame after removing rows with missing values:")
print(df_cleaned)
Imputation Methods
• Mean
• Median and
• Mode
Mean Imputation:
Step 1 - Import the library
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer  # the old sklearn.preprocessing.Imputer was removed in recent scikit-learn versions
Step 2 - Setting up the Data
df = pd.DataFrame()
df['C0'] = [0.2601,0.2358,0.1429,0.1259,0.7526,
0.7341,0.4546,0.1426,0.1490,0.2500]
df['C1'] = [0.7154,np.nan,0.2615,0.5846,np.nan,
0.8308,0.4962,np.nan,0.5340,0.6731]
print(df)
Step 3 - Using SimpleImputer to fill the NaN values with the mean
• missing_values : the placeholder that marks missing entries; for pandas data this is np.nan,
which is also the default.
• strategy : the imputation strategy to use; it can be 'mean', 'median', 'most_frequent' or
'constant'. The default is 'mean'.
• fill_value : None by default. It is used only when the strategy is 'constant', in which case it is
the value written into every missing cell.
• Note: unlike the old Imputer, SimpleImputer has no axis parameter; it always imputes column by column.
miss_mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
miss_mean_imputer = miss_mean_imputer.fit(df)
imputed_df = miss_mean_imputer.transform(df)
print(imputed_df)
The NaN values have been filled with the mean of each column.
Median Imputation:
• It is the middle value of a dataset when it is ordered from lowest to highest.
• If there is an even number of values, the median is the average of the two middle
values.
• Unlike the mean, the median is not affected by outliers, making it a more reliable
measure for skewed distributions.
• Median imputation in pandas: data = data.fillna(data.median())
#replacing missing values in quantity
# column with mean of that column
data['quantity'] = data['quantity'].fillna(data['quantity'].mean())
# replacing missing values in price column
# with median of that column
data['price'] = data['price'].fillna(data['price'].median())
print(data)
Mode imputation:
• Mode - The most common value
from scipy import stats

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
x = stats.mode(speed)
print(x)
# stats.mode() returns a ModeResult object containing the mode value (86)
# and its count (how many times it appears: 3)
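The snippet above only computes the mode; to actually impute with it, fill the missing entries with the column's most frequent value. A minimal pandas sketch (the 'city' column is an assumed example):
import pandas as pd
data = pd.DataFrame({'city': ['Delhi', 'Mumbai', None, 'Delhi', None]})
# Fill missing entries with the most frequent value (the mode) of the column
data['city'] = data['city'].fillna(data['city'].mode()[0])
print(data)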
Data Skewed
• Data skewed" means that the distribution of data points in a dataset
is uneven, with a noticeable concentration of values on one side of the
distribution, creating a "tail" extending towards the other side, making
the data appear distorted or asymmetrical when visualized on a graph
• Types of skew:
• Positive skew (right skew): The tail extends towards the higher
values on the right side of the graph.
• Negative skew (left skew): The tail extends towards the lower values
on the left side of the graph.
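Skew can also be measured numerically: a positive skewness coefficient indicates right skew and a negative one indicates left skew. A minimal sketch using pandas (the sample values are illustrative):
import pandas as pd
data = pd.Series([10, 20, 20, 30, 40, 50, 100])
# Positive result => right (positive) skew; negative => left (negative) skew
print("Skewness:", data.skew())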
Data Integration
• Data integration is the process of bringing data from many sources into a single
centralized location, which is often a data warehouse.
• The destination needs to be flexible enough to handle many different kinds of data,
potentially at large volumes.
• Data integration is ideal for powering analytical use cases.
Data Loading
Load data from various sources using respective libraries
import pandas as pd

# Load CSV
df_csv = pd.read_csv('data.csv')

# Load Excel
df_excel = pd.read_excel('data.xlsx')

# Load JSON
df_json = pd.read_json('data.json')
Data Cleaning and Transformation
Use Pandas or other libraries to clean and normalize the data
# Drop null values
df = df.dropna()

# Rename columns
df = df.rename(columns={'OldName': 'NewName'})

# Standardize formats
df['date'] = pd.to_datetime(df['date'])
Combining Data
• Concatenation: Stack datasets vertically.
combined_df = pd.concat([df1, df2])
• Merging: Combine datasets based on a key.
merged_df = pd.merge(df1, df2, on='common_key')
Data transformation
Encoding
1. Label Encoding
2. One-hot Encoding
3. Ordinal Encoding (a short sketch of ordinal encoding follows the one-hot encoding examples below)
Label Encoding
• Label Encoding is a technique used to convert categorical columns into numerical ones
so that they can be used by machine learning models that accept only numerical input.
• It is an important preprocessing step in a machine learning project.
Example Of Label Encoding
Steps for Label Encoding
Using sklearn.preprocessing.LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Sample categorical data
categories = ['cat', 'dog', 'mouse', 'dog', 'cat', 'mouse']
# Initialize LabelEncoder
encoder = LabelEncoder()
# Fit and transform the data
encoded_labels = encoder.fit_transform(categories)
print(encoded_labels) # Output: [0 1 2 1 0 2]
# Get the mapping
print(encoder.classes_) # Output: ['cat' 'dog' 'mouse']
# Decode back to original categories
decoded_labels = encoder.inverse_transform(encoded_labels)
print(decoded_labels) # Output: ['cat' 'dog' 'mouse' 'dog' 'cat' 'mouse']
import pandas as pd
# Sample categorical data
categories = ['cat', 'dog', 'mouse', 'dog', 'cat', 'mouse']
# Encode labels
encoded_labels, uniques = pd.factorize(categories)
print(encoded_labels) # Output: [0 1 2 1 0 2]
print(uniques) # Output: Index(['cat', 'dog', 'mouse'], dtype='object')
Using pandas.factorize
Key Points to Consider
1. Order Sensitivity:
• Label encoding assumes an ordinal relationship between categories (e.g., 0 < 1 < 2). This is fine
for ordered categories like "low", "medium", and "high".
• For unordered categories (e.g., "dog", "cat"), this may mislead algorithms into interpreting
numerical relationships. In such cases, use one-hot encoding instead.
2. Decoding:
If you need to map back to the original categories, make sure you retain the mapping (via
LabelEncoder.classes_ or a similar mechanism).
One Hot Encoding
• One Hot Encoding in machine learning transforms categorical data into a
numerical format that machine learning algorithms can process without imposing
any ordinal relationships.
• It creates new binary columns (0s and 1s) for each category in the original
variable. Each category in the original column is represented as a separate column,
where a value of 1 indicates the presence of that category, and 0 indicates its
absence.
How One-Hot Encoding Works: An Example
• Wherever the fruit is “Apple,” the Apple column will have a value of 1, while the
other fruit columns (like Mango or Orange) will contain 0.
• This pattern ensures that each categorical value gets its own column, represented
with binary values (1 or 0), making it usable for machine learning models.
Fruit     Categorical value of fruit     Price
apple     1                              5
mango     2                              10
apple     1                              15
orange    3                              20
The output after applying one-hot encoding on the data is given as follows:
Fruit_apple    Fruit_mango    Fruit_orange    price
1              0              0               5
0              1              0               10
1              0              0               15
0              0              1               20
import pandas as pd
# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)
# Applying one-hot encoding
df_encoded = pd.get_dummies(df, dtype=int)
# Displaying the encoded DataFrame
print(df_encoded)
Using Pandas get_dummies()
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Creating the encoder
enc = OneHotEncoder(handle_unknown='ignore')
# Sample data
X = [['Red'], ['Green'], ['Blue']]
# Fitting the encoder to the data
enc.fit(X)
# Transforming new data
result = enc.transform([['Red']]).toarray()
# Displaying the encoded result
print(result)
Using Scikit-learn's OneHotEncoder
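Ordinal encoding was listed earlier but is not demonstrated above; here is a minimal sketch using scikit-learn's OrdinalEncoder, with an assumed category order (low < medium < high):
from sklearn.preprocessing import OrdinalEncoder
# Explicit category order so the integers respect low < medium < high
enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X = [['low'], ['high'], ['medium'], ['low']]
print(enc.fit_transform(X))  # one row per sample: 0, 2, 1, 0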
Data Normalization
• Data normalization is a technique used to transform the values of a dataset into a
common scale.
• This is important because many machine learning algorithms are sensitive to the
scale of the input features and can produce better results when the data is
normalized.
Need for Normalization
There are several normalization techniques that can be used:
1. Min-Max normalization
2. Z-score normalization (a brief sketch follows this list)
3. Decimal Scaling
4. Logarithmic transformation
5. Root transformation
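Min-max scaling is covered in detail next; as a point of comparison, here is a minimal z-score (standardization) sketch using scikit-learn, with illustrative values:
import numpy as np
from sklearn.preprocessing import StandardScaler
values = np.array([[10.0], [20.0], [30.0], [40.0]])
# Z-score normalization: (x - mean) / standard deviation, per feature
scaler = StandardScaler()
print(scaler.fit_transform(values))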
Min-Max scaling
• Min-Max scaling is a data preprocessing technique in which all values of a feature are
linearly transformed to fall within a specified range, usually between 0 and 1, by computing
each value's position relative to the minimum and maximum values of that feature.
• Essentially, it scales the data based on its relative position within the original range,
making it suitable for algorithms that are sensitive to differences in feature scale.
Key points about Min-Max scaling:
• To normalize a value x using min-max scaling, the formula is:
x_scaled = (x - min) / (max - min)
where min is the minimum value of the feature and max is the maximum value.
When to use Min-Max scaling:
• When you need to scale features to a specific range (like 0 to 1) and preserve
relative relationships between data points.
• When dealing with datasets where features have significantly different scales and
you want to ensure no single feature dominates the analysis.
Example:
Imagine a dataset with a feature "Temperature" ranging from 10 to 40 degrees
Celsius. To normalize this using min-max scaling:
• Min value: 10
• Max value: 40
• To normalize a temperature of 25 degrees: (25 - 10) / (40 - 10) = 0.5 (the same calculation with scikit-learn's MinMaxScaler is sketched below).
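A minimal sketch of the same calculation with scikit-learn's MinMaxScaler (using the temperatures above):
import numpy as np
from sklearn.preprocessing import MinMaxScaler
temps = np.array([[10.0], [25.0], [40.0]])
scaler = MinMaxScaler()             # default feature range is 0 to 1
print(scaler.fit_transform(temps))  # 10 -> 0.0, 25 -> 0.5, 40 -> 1.0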
Benefits:
• Standardizes feature scales
• Preserves relative relationships
Drawbacks:
• Outlier sensitivity
• Not robust to noise
Outliers
• Outliers are data points that are significantly different from the majority of other
data points in a set.
• They can be higher or lower than the other values in the set (a common IQR-based rule for flagging them is sketched below).
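A common rule of thumb flags values that lie more than 1.5 × IQR below the first quartile or above the third quartile. A minimal pandas sketch (the sample values are illustrative):
import pandas as pd
values = pd.Series([11, 12, 12, 13, 12, 11, 14, 13, 100])
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
# Keep only the points outside the 1.5 * IQR fences
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)  # flags the extreme value 100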
Feature Engineering
• Feature Engineering is the process of creating new features or transforming
existing features to improve the performance of a machine-learning model.
• It involves selecting relevant information from raw data and transforming it into a
format that can be easily understood by a model.
• The goal is to improve model accuracy by providing more meaningful and
relevant information.
What is a Feature?
• A feature (also known as a variable or attribute) is an individual measurable
property or characteristic of a data point that is used as input for a machine
learning algorithm.
• Features can be numerical, categorical, or text-based, and they represent different
aspects of the data that are relevant to the problem at hand.
• Feature engineering mainly consists of the following processes:
• Feature Creation:
• Feature Creation is the process of generating new features based on domain knowledge or
by observing patterns in the data.
• Feature Transformation:
• Feature Transformation is the process of transforming the features into a more suitable
representation for the machine learning model.
• Feature Extraction
• Feature Extraction is the process of creating new features from existing ones to provide
more relevant information to the machine learning model
• Feature Selection:
• Feature Selection is the process of selecting a subset of relevant features from the dataset
to be used in a machine-learning model.
• Feature selection methods have been traditionally grouped into
• filter methods
• wrapper methods
• embedded methods.
Chi-square Test for Feature Selection
• Chi-square test is used for categorical features in a dataset.
• Calculate Chi-square between each feature and the target and select the desired number of
features with best Chi-square scores.
• Features that show significant dependencies with the target variable are considered
important for prediction and can be selected for further analysis.
• Step 1: Null Hypothesis (H0): There is no significant association between the two
categorical variables.
• Step 2: Contingency table
• Step 3: Calculate the expected frequencies.
• Find the total count for each row (Ri) and each column (Cj); the total number of observations is 140.
• For example, the expected frequency for "Low Income" and "Subscribed" is
(row total × column total) ÷ N = (50 × 70) ÷ 140 = 25.
• Step 4: Calculate the Chi-Square Statistic
• Let’s summarize the observed and expected values into a table and calculate the Chi-
Square value:
Step 6: Interpretation
Now compare the calculated χ² value (3.747) with the critical value from the Chi-Square
distribution table (or any statistical software) at 2 degrees of freedom. If the χ² value is greater
than the critical value, you reject the null hypothesis. This suggests that there is a significant
association between "income level" and "subscription status", and that "income level" is a
relevant feature for predicting subscription status.
• Example:
• Let's say we are conducting a Chi-Square test with 2 degrees of freedom at the 0.05
significance level.
• Degree of Freedom (df) = 2
• Significance Level (α) = 0.05
• We look up the critical value in a Chi-Square table or calculate it using statistical
software. For df = 2 and α = 0.05, the critical value is approximately 5.991.
• If the computed Chi-Square statistic from our test is greater than 5.991, we reject the null
hypothesis; if it is less than 5.991, we fail to reject it (a minimal SciPy sketch of this test follows).
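A minimal sketch of the same kind of test with SciPy's chi2_contingency; the observed counts below are assumed for illustration and are not the table from the slides:
from scipy.stats import chi2_contingency
# Assumed observed counts (rows: low/medium/high income; columns: subscribed, not subscribed)
observed = [[20, 30],
            [30, 20],
            [20, 20]]
chi2, p_value, dof, expected = chi2_contingency(observed)
print("Chi-square:", chi2, "p-value:", p_value, "degrees of freedom:", dof)
# Reject H0 (no association) when p_value is below the chosen significance level (e.g., 0.05)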
1. Univariate Feature Selection (Using statistical tests)
In this method, you use statistical tests to select the features that have the strongest
relationship with the target variable.
2. Recursive Feature Elimination (RFE)
RFE works by recursively removing the least important features and building a model on
the remaining features. It uses a model to rank feature importance.
3. Random Forest Classification
Tree-based models such as Random Forest or XGBoost can be used to rank the importance of
each feature based on how useful they are in reducing impurity in the trees.
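A minimal sketch of recursive feature elimination with scikit-learn; the synthetic dataset and the logistic regression estimator are assumptions for illustration:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Synthetic data: 100 samples, 6 features, 3 of them informative
X, y = make_classification(n_samples=100, n_features=6, n_informative=3, random_state=42)
# Recursively drop the weakest features until 3 remain
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 marks a selected feature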
Important Questions:
• What is data preprocessing in the context of machine learning?
• Explain Data Cleaning with Python's pandas Library.
• What are common data quality issues you might encounter?
• How do you handle missing data within a dataset?
• What is the difference between imputation and deletion of missing values?
• What is one-hot encoding, and when should it be used? Explain with an example.
• Write a NumPy program to compute the mean, standard deviation and median of a given array.
Sample array: [0 1 2 3 4 5]
1.
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Age': [22, 25, 27, np.nan, 30, 35],
'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
'Salary': [50000, 60000, np.nan, 80000, 75000, 90000]
}
df = pd.DataFrame(data)
# Display the original data
print("Original Data:")
print(df)
# 1. Handling missing data
# Fill missing values in the 'Age' column with the mean
df['Age'].fillna(df['Age'].mean(), inplace=True)
# Fill missing values for 'Salary' column with the median
df['Salary'].fillna(df['Salary'].median(), inplace=True)
# 2. Encoding categorical data (e.g., the 'Gender' column)
# Convert 'Gender' to numerical values (Male = 0, Female = 1)
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
# 3. Feature scaling (standardization of 'Salary' column)
# Standardize the 'Salary' column using z-score normalization
df['Salary'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
# Display the preprocessed data
print("nPreprocessed Data:")
print(df)
2.
• Handling missing values:
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, np.nan, 30, 22, np.nan],
'Salary': [50000, 60000, np.nan, 80000, 75000]}
df = pd.DataFrame(data)
# Checking for missing values
print(df.isnull()) # True for NaN values
• Filling Missing Data
# Filling missing values with the mean of the column
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Salary'].fillna(df['Salary'].mean(), inplace=True)
print("nAfter Filling Missing Values:")
print(df)
• Dropping Missing Data
# Drop rows with missing values
df.dropna(inplace=True)
print("nAfter Dropping Rows with Missing Data:")
print(df)
• Removing Duplicates
# Sample DataFrame with duplicates
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Alice'],
'Age': [25, 30, 35, 30, 25],
'Salary': [50000, 60000, 70000, 60000, 50000]}
df = pd.DataFrame(data)
# Removing duplicate rows
df.drop_duplicates(inplace=True)
print("nAfter Removing Duplicates:")
print(df)
6.
import pandas as pd
# Sample data
data = pd.DataFrame({'Size': ['S', 'M', 'M', 'L', 'S', 'L']})
# One-hot encoding
one_hot_encoded = pd.get_dummies(data, columns=['Size'])
print(one_hot_encoded)
7.
import numpy as np
# Given array
arr = np.array([0, 1, 2, 3, 4, 5])
# Compute mean, standard deviation, and median
mean = np.mean(arr)
std_dev = np.std(arr)
median = np.median(arr)
# Print results
print("Mean:", mean)
print("Standard Deviation:", std_dev)
print("Median:", median)
Questions:
1. How do you calculate the mean, median, and mode using Python?
2. How do you calculate the weighted mean in Python?
3. How do you find the mode for categorical data in Python?
4. How do you identify if the data is skewed using mean and median?
How do you calculate the mean, median, and mode using Python?
import numpy as np
from scipy import stats
# Sample data
data = [10, 20, 20, 30, 40, 50, 100]
# Mean
mean = np.mean(data)
print("Mean:", mean)
# Median
median = np.median(data)
print("Median:", median)
# Mode
mode = stats.mode(data)
# In recent SciPy versions mode and count are scalars (older versions return one-element arrays)
print("Mode:", mode.mode, "Count:", mode.count)
How do you calculate the weighted mean in Python?
import numpy as np
# Data and weights
data = [10, 20, 30, 40]
weights = [1, 2, 3, 4]
# Weighted mean
weighted_mean = np.average(data, weights=weights)
print("Weighted Mean:", weighted_mean)
How do you find the mode for categorical data in Python?
import pandas as pd
# Categorical data
categories = ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']
# Recent SciPy versions no longer accept non-numeric data in stats.mode,
# so pandas value_counts() is a simple way to get the mode and its count
counts = pd.Series(categories).value_counts()
print("Mode:", counts.index[0], "Count:", counts.iloc[0])
How do you identify if the data is skewed using mean and median?
import numpy as np

data = [10, 20, 30, 40, 100]  # Right-skewed data
mean = np.mean(data)
median = np.median(data)
if mean > median:
print("The data is right-skewed.")
elif mean < median:
print("The data is left-skewed.")
else:
print("The data is symmetric.")
1. Write Python code using the pandas library to create binary columns for each category, where the
dataset is data = {'Vehicle': ['BUS', 'VAN', 'TRAIN', 'BUS', 'CYCLE']}
2. Write Python code using the pandas library to remove duplicate values.
3. Write a NumPy program to compute the mean and the median of a given array. data = [1, 2,
2, 3, 4, 5, 10]
4. Create a DataFrame where
data = pd.DataFrame({
'name': ['John', 'Jane', 'Jack', 'John', None],
'age': [28, 34, None, 28, 22],
'purchase_amount': [100.5, None, 85.3, 100.5, 50.0],
'date_of_purchase': ['2023/12/01', '2023/12/02', '2023/12/01', '2023/12/01', '2023/12/03']
})
Write a program in python using the Mean imputation method to replace the missing values.
