2. What is Data Preprocessing?
Data preprocessing is a key aspect of data preparation. It refers to any processing
applied to raw data to ready it for further analysis or processing tasks.
Thus, data preprocessing may be defined as the process of converting raw data into
a format that can be processed more efficiently and accurately in tasks such as:
• Data analysis
• Machine learning
• Data science
• AI
3. Steps in Data Preprocessing
• Step 1: Data cleaning
• Handling missing values
• Removing duplicates
• Correcting inconsistent formats
• Step 2: Data integration
• Schema matching
• Data deduplication
• Step 3: Data transformation
• Scaling and normalization
• Encoding categorical variables
• Feature engineering and extraction
• Step 4: Data reduction
• Feature selection
• Principal component analysis (PCA)
• Sampling methods
6. Data Cleaning Tool:
Handling Missing Values
import pandas as pd
from sklearn.impute import SimpleImputer

# Creating a manual dataset
data = pd.DataFrame({
'name': ['John', 'Jane', 'Jack', 'John', None],
'age': [28, 34, None, 28, 22],
'purchase_amount': [100.5, None, 85.3, 100.5, 50.0],
'date_of_purchase': ['2023/12/01', '2023/12/02', '2023/12/01', '2023/12/01', '2023/12/03']
})
# Handling missing values using mean imputation for 'age' and 'purchase_amount'
imputer = SimpleImputer(strategy='mean')
data[['age', 'purchase_amount']] = imputer.fit_transform(data[['age', 'purchase_amount']])
# Removing duplicate rows
data = data.drop_duplicates()
print(data)
8. • Missing Values:
This occurs when a particular variable lacks data points, resulting in incomplete
information and potentially harming the accuracy and dependability of your
models.
9. Types of Missing Values
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)
10. Effective Strategies for Handling Missing Values in Data Analysis
import pandas as pd
import numpy as np
# Creating a sample DataFrame with missing values
data = {
'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln', '101 Elm St', np.nan, '222 Maple Rd', '444 Cedar Blvd', '555 Birch Dr'],
'City': ['Los Angeles', 'New York', 'Houston', 'Los Angeles', 'Miami', np.nan, 'Houston', 'New York'],
'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)
12. # Removing rows with missing values
df_cleaned = df.dropna()
# Displaying the DataFrame after removing missing values
print("nDataFrame after removing rows with missing values:")
print(df_cleaned)
14. Mean Imputation:
Step 1 - Import the library
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
15. Step 2 - Setting up the Data
df = pd.DataFrame()
df['C0'] = [0.2601,0.2358,0.1429,0.1259,0.7526,
0.7341,0.4546,0.1426,0.1490,0.2500]
df['C1'] = [0.7154,np.nan,0.2615,0.5846,np.nan,
0.8308,0.4962,np.nan,0.5340,0.6731]
print(df)
16. Step 3 - Using SimpleImputer to fill the NaN values with the Mean
• missing_values : The placeholder for missing entries; for pandas data this is np.nan (the default).
• strategy : The imputation strategy to use: 'mean', 'median', 'most_frequent' or 'constant'. The default is 'mean'.
• fill_value : None by default. Used only when strategy is 'constant'; it is the value written into every missing cell.
• Note: the older Imputer class took an axis argument (0 for columns, 1 for rows); SimpleImputer always imputes column-wise. A short usage sketch follows this list.
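A minimal sketch of this step, assuming the df built in Step 2 and the SimpleImputer import from Step 1; the NaN entries in column C1 are replaced by that column's mean:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)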
20. Median Imputation:
• It is the middle value of a dataset when it is ordered from lowest to highest.
• If there is an even number of values, the median is the average of the two middle
values.
• Unlike the mean, the median is not affected by outliers, making it a more reliable measure for skewed distributions.
22. #replacing missing values in quantity
# column with mean of that column
data['quantity'] = data['quantity'].fillna(data['quantity'].mean())
# replacing missing values in price column
# with median of that column
data['price'] = data['price'].fillna(data['price'].median())
print(data)
23. Mode imputation:
• Mode - The most common value
from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = stats.mode(speed)
print(x)
# The mode() method returns a ModeResult object containing the mode value (86)
# and its count (how many times the mode appeared: 3).
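The snippet above only computes the mode. A minimal sketch of mode imputation itself, assuming a hypothetical categorical column 'city' with a missing entry, fills the gap with the most frequent value:
import pandas as pd

df = pd.DataFrame({'city': ['Pune', 'Delhi', 'Pune', None, 'Mumbai']})
# Replace missing entries with the column's mode (most frequent value)
df['city'] = df['city'].fillna(df['city'].mode()[0])
print(df)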
24. Skewed Data
• "Skewed data" means that the distribution of data points in a dataset is uneven: values are concentrated on one side of the distribution, with a "tail" extending towards the other side, so the data looks distorted or asymmetrical when visualized on a graph.
25. • Types of skew:
• Positive skew (right skew): The tail extends towards the higher
values on the right side of the graph.
• Negative skew (left skew): The tail extends towards the lower values
on the left side of the graph.
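Beyond comparing the mean and median, skew can be measured directly with pandas' skew() method; a quick sketch on an illustrative series:
import pandas as pd

s = pd.Series([10, 20, 20, 30, 40, 50, 100])  # a long right tail
print("Skewness:", s.skew())  # > 0: positive (right) skew, < 0: negative (left) skew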
28. Data Integration
• Data integration is a process where data from many sources goes to a
single centralized location, which is often a data warehouse.
• The end location needs to be flexible enough to handle lots of different
kinds of data at potentially large volumes.
• Data integration is ideal for powering analytical use cases.
29. Data Loading
Load data from various sources using respective libraries
import pandas as pd
# Load CSV
df_csv = pd.read_csv('data.csv')
# Load Excel
df_excel = pd.read_excel('data.xlsx')
# Load JSON
df_json = pd.read_json('data.json')
30. Data Cleaning and Transformation
Use Pandas or other libraries to clean and normalize the data
# Drop null values
df = df.dropna()
# Rename columns
df = df.rename(columns={'OldName': 'NewName'})
# Standardize formats
df['date'] = pd.to_datetime(df['date'])
31. Combining Data
• Concatenation: Stack datasets vertically.
combined_df = pd.concat([df1, df2])
• Merging: Combine datasets based on a key.
merged_df = pd.merge(df1, df2, on='common_key')
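A self-contained sketch of both operations, assuming two small illustrative DataFrames and a purchases table sharing a 'customer_id' key:
import pandas as pd

df1 = pd.DataFrame({'customer_id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'customer_id': [3, 4], 'name': ['Carol', 'Dan']})
purchases = pd.DataFrame({'customer_id': [1, 2], 'amount': [100.0, 250.0]})
# Concatenation: stack the rows of df1 and df2
combined_df = pd.concat([df1, df2], ignore_index=True)
# Merging: join df1 with purchases on the shared key
merged_df = pd.merge(df1, purchases, on='customer_id')
print(combined_df)
print(merged_df)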
33. Label Encoding
• Label Encoding is a technique that is used to convert categorical columns into
numerical ones so that they can be fitted by machine learning models which only
take numerical data.
• It is an important pre-processing step in a machine-learning project.
37. Key Points to Consider
1. Order Sensitivity:
•Label encoding assumes an ordinal relationship between categories (e.g., 0 < 1 < 2). This is fine
for ordered categories like "low", "medium", and "high".
•For unordered categories (e.g., "dog", "cat"), this may mislead algorithms into interpreting
numerical relationships. In such cases, use one-hot encoding instead.
2. Decoding:
If you need to map back to the original categories, ensure you retain the mapping (via
LabelEncoder.classes_ or a similar method).
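A minimal sketch with scikit-learn's LabelEncoder, using illustrative category values; note that the integer codes follow the alphabetical order of the classes, not any semantic order, and classes_ retains the mapping for decoding:
from sklearn.preprocessing import LabelEncoder

sizes = ['low', 'medium', 'high', 'medium', 'low']
le = LabelEncoder()
encoded = le.fit_transform(sizes)
print(encoded)                        # [1 2 0 2 1] -- codes are alphabetical, not low < medium < high
print(le.classes_)                    # ['high' 'low' 'medium'] -- mapping kept for decoding
print(le.inverse_transform(encoded))  # back to the original labels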
38. One Hot Encoding
• One Hot Encoding in machine learning transforms categorical data into a
numerical format that machine learning algorithms can process without imposing
any ordinal relationships.
• It creates new binary columns (0s and 1s) for each category in the original
variable. Each category in the original column is represented as a separate column,
where a value of 1 indicates the presence of that category, and 0 indicates its
absence.
39. How One-Hot Encoding Works: An Example
• Wherever the fruit is “Apple,” the Apple column will have a value of 1, while the
other fruit columns (like Mango or Orange) will contain 0.
• This pattern ensures that each categorical value gets its own column, represented
with binary values (1 or 0), making it usable for machine learning models.
41. The output after applying one-hot encoding on the data is given as follows,
Fruit_apple   Fruit_mango   Fruit_orange   price
     1             0              0            5
     0             1              0           10
     1             0              0           15
     0             0              1           20
42. import pandas as pd
# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)
# Applying one-hot encoding
df_encoded = pd.get_dummies(df, dtype=int)
# Displaying the encoded DataFrame
print(df_encoded)
Using Pandas get_dummies()
44. from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Creating the encoder
enc = OneHotEncoder(handle_unknown='ignore')
# Sample data
X = [['Red'], ['Green'], ['Blue']]
# Fitting the encoder to the data
enc.fit(X)
# Transforming new data
result = enc.transform([['Red']]).toarray()
# Displaying the encoded result
print(result)
Using Scikit-learn's OneHotEncoder
45. Data Normalization
• Data normalization is a technique used to transform the values of a dataset into a
common scale.
• This is important because many machine learning algorithms are sensitive to the
scale of the input features and can produce better results when the data is
normalized.
47. There are several different normalization techniques that can be
used
1. Min-Max normalization
2. Z-score normalization
3. Decimal Scaling
4. Logarithmic transformation
5. Root transformation
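A brief sketch of two of these techniques on an illustrative NumPy array (Min-Max scaling is demonstrated a few slides below):
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])
# Z-score normalization: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()
# Logarithmic transformation: compresses large values; log1p also handles zeros safely
x_log = np.log1p(x)
print(z)
print(x_log)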
48. Min-Max scaling
• Min-Max scaling is a technique in data preprocessing where all values within a
feature are linearly transformed to fall within a specified range, usually between 0
and 1, by calculating the proportion of each value relative to the minimum and
maximum values in that feature
• Essentially, it scales the data based on its relative position within the original
range, making it suitable for algorithms sensitive to feature scale differences
49. Key points about Min-Max scaling:
• To normalize a value x using min-max scaling, the formula is:
x_scaled = (x - min) / (max - min)
where min is the minimum value of the feature and max is the maximum value.
50. When to use Min-Max scaling:
• When you need to scale features to a specific range (like 0 to 1) and preserve
relative relationships between data points.
• When dealing with datasets where features have significantly different scales and
you want to ensure no single feature dominates the analysis.
51. Example:
Imagine a dataset with a feature "Temperature" ranging from 10 to 40 degrees
Celsius. To normalize this using min-max scaling:
• Min value: 10
• Max value: 40
• To normalize a temperature of 25 degrees: Calculation: (25 - 10) / (40 - 10) = 0.5.
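The same calculation, sketched with scikit-learn's MinMaxScaler on an illustrative temperature column; 25 degrees maps to 0.5 exactly as computed above:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

temps = np.array([[10.0], [25.0], [40.0]])  # degrees Celsius, one feature column
scaler = MinMaxScaler()                     # default feature_range is (0, 1)
print(scaler.fit_transform(temps))          # [[0.], [0.5], [1.]]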
52. Benefits:
• Standardizes feature scales
• Preserves relative relationships
Drawbacks
• Outlier sensitivity
• Not robust to noise
53. Outliers
• Outliers are data points that are significantly different from the majority of other
data points in a set.
• They can be higher or lower than the other values in the set.
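A common way to flag such points is the interquartile-range (IQR) rule; a minimal sketch on an illustrative series (the 1.5 x IQR threshold is the usual convention, not something stated on the slide):
import pandas as pd

s = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)  # 102 is flagged as an outlier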
56. Feature Engineering
• Feature Engineering is the process of creating new features or transforming
existing features to improve the performance of a machine-learning model.
• It involves selecting relevant information from raw data and transforming it into a
format that can be easily understood by a model.
• The goal is to improve model accuracy by providing more meaningful and
relevant information.
58. What is a Feature?
• A feature (also known as a variable or attribute) is an individual measurable
property or characteristic of a data point that is used as input for a machine
learning algorithm.
• Features can be numerical, categorical, or text-based, and they represent different
aspects of the data that are relevant to the problem at hand.
59. • Feature engineering mainly consists of the following processes:
• Feature Creation:
• Feature Creation is the process of generating new features based on domain knowledge or
by observing patterns in the data.
• Feature Transformation:
• Feature Transformation is the process of transforming the features into a more suitable
representation for the machine learning model.
• Feature Extraction
• Feature Extraction is the process of creating new features from existing ones to provide
more relevant information to the machine learning model
• Feature Selection:
• Feature Selection is the process of selecting a subset of relevant features from the dataset
to be used in a machine-learning model.
61. • Feature selection methods have been traditionally grouped into
• filter methods
• wrapper methods
• embedded methods.
62. Chi-square Test for Feature Selection
• Chi-square test is used for categorical features in a dataset.
• Calculate Chi-square between each feature and the target and select the desired number of
features with best Chi-square scores.
• Features that show significant dependencies with the target variable are considered
important for prediction and can be selected for further analysis.
63. • Step 1: Null Hypothesis (H0): There is no significant association between the two
categorical variables.
• Step 2: Contingency table
• Step 3: Calculate the expected frequencies.
• Find the total count for each row (Ri) and each column (Cj); the total number of observations N is 140.
• For example, the expected frequency for "Low Income" and "Subscribed" is E = (Ri × Cj) / N = (50 × 70) / 140 = 25.
• Step 4: Calculate the Chi-Square Statistic
• Summarize the observed (O) and expected (E) frequencies into a table and compute χ² = Σ (O − E)² / E.
66. Step 6: Interpretation
Now compare the calculated χ² value (3.747) with the critical value from the Chi-Square distribution table (or any statistical software) at 2 degrees of freedom. If the χ² value is greater than the critical value, you reject the null hypothesis; that would suggest a significant association between "income level" and "subscription status," making "income level" a relevant feature for predicting subscription status.
67. • Example:
• Let's say we are conducting a Chi-Square test with 2 degrees of freedom at the 0.05
significance level.
• Degree of Freedom (df) = 2
• Significance Level (α) = 0.05
• We look up the critical value in a Chi-Square table or calculate it using statistical
software. For df = 2 and α = 0.05, the critical value is approximately 5.991.
• If the computed Chi-Square statistic from our test is greater than 5.991, we reject the null
hypothesis. If it is less than 5.991, we fail to reject the null hypothesis.
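The critical value quoted above can also be obtained programmatically from SciPy's chi-square distribution; a quick sketch:
from scipy import stats

critical_value = stats.chi2.ppf(1 - 0.05, df=2)  # inverse CDF at the 95th percentile, 2 degrees of freedom
print(round(critical_value, 3))                  # approximately 5.991
# The computed statistic 3.747 is below 5.991, so at alpha = 0.05 we fail to reject the null hypothesis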
68. 1. Univariate Feature Selection (Using statistical tests)
In this method, you use statistical tests to select the features that have the strongest
relationship with the target variable.
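A minimal univariate-selection sketch with scikit-learn's SelectKBest and the chi2 score function (chi2 requires non-negative features); the Iris dataset here is purely illustrative:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 features with the best chi2 scores
X_selected = selector.fit_transform(X, y)
print(selector.scores_)   # chi2 score of each original feature
print(X_selected.shape)   # (150, 2)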
69. 2. Recursive Feature Elimination (RFE)
RFE works by recursively removing the least important features and building a model on
the remaining features. It uses a model to rank feature importance.
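A minimal RFE sketch with a logistic-regression estimator ranking the features; the synthetic dataset and parameter values are illustrative:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only a few of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=42)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 marks a selected feature; larger numbers were eliminated earlier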
70. 3. Random Forest Classification
Tree-based models such as Random Forest or XGBoost can be used to rank the importance of
each feature based on how useful they are in reducing impurity in the trees.
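A brief sketch of ranking features with a Random Forest's impurity-based feature_importances_; the Iris dataset is illustrative:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(iris.data, iris.target)
# Impurity-based importance of each feature; the values sum to 1
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(name, round(importance, 3))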
71. Important Questions:
• What is data preprocessing in the context of machine learning?
• Explain Data Cleaning with Python's pandas Library.
• What are common data quality issues you might encounter?
• How do you handle missing data within a dataset?
• What is the difference between imputation and deletion of missing values?
• What is one-hot encoding, and when should it be used? Explain with an example.
• Write a NumPy program to compute the mean, standard deviation, and median of a given array. Sample array: [0 1 2 3 4 5]
72. 1. import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Age': [22, 25, 27, np.nan, 30, 35],
'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
'Salary': [50000, 60000, np.nan, 80000, 75000, 90000]
}
df = pd.DataFrame(data)
# Display the original data
print("Original Data:")
print(df)
# 1. Handling missing data: fill missing values in the 'Age' column with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing values in the 'Salary' column with the median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
73. # 2. Encoding categorical data: convert the 'Gender' column to numerical values (Male = 0, Female = 1)
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
# 3. Feature scaling (standardization of 'Salary' column)
# Standardize the 'Salary' column using z-score normalization
df['Salary'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
# Display the preprocessed data
print("nPreprocessed Data:")
print(df)
74. 2.
• Handling missing values:
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, np.nan, 30, 22, np.nan],
'Salary': [50000, 60000, np.nan, 80000, 75000]}
df = pd.DataFrame(data)
# Checking for missing values
print(df.isnull()) # True for NaN values
75. • Filling Missing Data
# Filling missing values with the mean of the column
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("\nAfter Filling Missing Values:")
print(df)
76. • Dropping Missing Data
# Drop rows with missing values
df.dropna(inplace=True)
print("nAfter Dropping Rows with Missing Data:")
print(df)
78. 6.
import pandas as pd
# Sample data
data = pd.DataFrame({'Size': ['S', 'M', 'M', 'L', 'S', 'L']})
# One-hot encoding
one_hot_encoded = pd.get_dummies(data, columns=['Size'])
print(one_hot_encoded)
79. 7.
import numpy as np
# Given array
arr = np.array([0, 1, 2, 3, 4, 5])
# Compute mean, standard deviation, and median
mean = np.mean(arr)
std_dev = np.std(arr)
median = np.median(arr)
# Print results
print("Mean:", mean)
print("Standard Deviation:", std_dev)
print("Median:", median)
80. Questions:
1. How do you calculate the mean, median, and mode using Python?
2. How do you calculate the weighted mean in Python?
3. How do you find the mode for categorical data in Python?
4. How do you identify if the data is skewed using mean and median?
81. How do you calculate the mean, median, and mode using Python?
import numpy as np
from scipy import stats
# Sample data
data = [10, 20, 20, 30, 40, 50, 100]
# Mean
mean = np.mean(data)
print("Mean:", mean)
# Median
median = np.median(data)
print("Median:", median)
# Mode (keepdims=False returns a scalar mode and count in SciPy >= 1.9;
# older versions return arrays that need [0] indexing)
mode = stats.mode(data, keepdims=False)
print("Mode:", mode.mode, "Count:", mode.count)
82. How do you calculate the weighted mean in Python?
import numpy as np
# Data and weights
data = [10, 20, 30, 40]
weights = [1, 2, 3, 4]
# Weighted mean
weighted_mean = np.average(data, weights=weights)
print("Weighted Mean:", weighted_mean)
83. How do you find the mode for categorical data in Python?
import pandas as pd
# Categorical data
categories = ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']
# Recent SciPy versions no longer accept non-numeric data in stats.mode,
# so the mode is taken with pandas value_counts() instead
counts = pd.Series(categories).value_counts()
print("Mode:", counts.index[0], "Count:", counts.iloc[0])
84. How do you identify if the data is skewed using mean and median?
import numpy as np

data = [10, 20, 30, 40, 100]  # Right-skewed data
mean = np.mean(data)
median = np.median(data)
if mean > median:
    print("The data is right-skewed.")
elif mean < median:
    print("The data is left-skewed.")
else:
    print("The data is symmetric.")
85. 1. Write Python code using the pandas library to create binary columns for each category, where the
dataset is data = {'Vehicle': ['BUS', 'VAN', 'TRAIN', 'BUS', 'CYCLE']}
2. Write Python code using the pandas library to delete duplicate values.
3. Write a NumPy program to compute the mean and the median of a given array. data = [1, 2,
2, 3, 4, 5, 10]
4. Create a DataFrame where
data = pd.DataFrame({
'name': ['John', 'Jane', 'Jack', 'John', None],
'age': [28, 34, None, 28, 22],
'purchase_amount': [100.5, None, 85.3, 100.5, 50.0],
'date_of_purchase': ['2023/12/01', '2023/12/02', '2023/12/01', '2023/12/01', '2023/12/03']
})
Write a program in python using the Mean imputation method to replace the missing values.