2. Introduction (Definition)
Definition: Data preprocessing is the process of preparing and cleaning raw data before analysis or modeling. It involves transforming, correcting, and organizing data to ensure accuracy and consistency.
3. Importance
• Enhances data quality, making it suitable for analysis and machine learning.
• Reduces errors and inconsistencies that can impact predictions.
• Improves efficiency and reliability of data-driven decisions.
• Essential for achieving meaningful insights from data.
4. Characteristics of High-Quality Data
• Accuracy: Data should be correct and free from errors.
• Completeness: No missing values or gaps in important attributes.
• Timeliness: Data should be up-to-date and relevant.
• Relevance: Data should meet the specific needs of analysis.
• Consistency: Uniform formatting and structure across datasets.
• Integrity: Data should be trustworthy, ensuring relationships among different datasets remain intact.
5. Characteristics of High-Quality Data
Data quality relies on accuracy, completeness, consistency, timeliness, validity, and integrity
to ensure reliable and meaningful analysis.
6. Common Data Issues
Missing Values: Some data points are unavailable or left blank.
• Example: A dataset of patient records missing key medical history.
• Solution: Imputation techniques (mean, median, mode) or removing incomplete records (see the sketch below).
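A minimal sketch of these imputation options using pandas; the patient-style columns (age, blood_pressure, smoker) are made-up placeholders for illustration, not data from the slides:

```python
import pandas as pd
import numpy as np

# Hypothetical patient records with gaps in key attributes.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [34, np.nan, 51, 29],
    "blood_pressure": [120, 135, np.nan, 110],
    "smoker": ["no", "yes", None, "no"],
})

# Numeric attribute: impute with the mean (median works the same way).
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical attribute: impute with the mode (most frequent value).
df["smoker"] = df["smoker"].fillna(df["smoker"].mode()[0])

# Alternative: remove records that are still incomplete in a key attribute.
df = df.dropna(subset=["blood_pressure"])
print(df)
```

Imputation keeps every record at the cost of estimating values, while dropping rows keeps only observed data at the cost of a smaller sample; which trade-off is acceptable depends on the analysis.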
7. Common Data Issues
Duplicates: Repetitive entries in the dataset.
• Example: Multiple entries for the same patient with slight variations in spelling.
• Solution: Deduplication techniques and using unique identifiers (see the sketch below).
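A minimal sketch of deduplication with pandas; the patient table below is made up, and it assumes patient_id plays the role of the unique identifier mentioned above:

```python
import pandas as pd

# Hypothetical records: the same patient appears twice with a spelling variation.
df = pd.DataFrame({
    "patient_id": [101, 101, 102],
    "name": ["John Smith", "john smith ", "Ana Perez"],
    "visit_date": ["2024-01-05", "2024-01-05", "2024-02-10"],
})

# Normalize the text so slight spelling/case variations compare as equal.
df["name"] = df["name"].str.strip().str.lower()

# Drop repeated rows, keeping the first occurrence; the unique identifier
# plus the visit date defines what counts as the same record here.
df = df.drop_duplicates(subset=["patient_id", "visit_date"], keep="first")
print(df)
```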
8. Common Data Issues
Outliers: Extreme values that differ significantly from other observations.
• Example: A patient’s age recorded as 250 years.
• Solution: Statistical methods to detect and remove or correct anomalies (see the sketch below).
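A minimal sketch of one such statistical method, the 1.5 × IQR rule, applied to a made-up age column containing the 250-year error:

```python
import pandas as pd

ages = pd.Series([34, 41, 29, 57, 250, 38])  # 250 is clearly a recording error

# Interquartile range (IQR) fences: values far outside the middle 50% are flagged.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print("Detected outliers:", outliers.tolist())

# Option 1: remove the anomalous values.
cleaned = ages[(ages >= lower) & (ages <= upper)]

# Option 2: correct them, e.g. cap at the fences (winsorizing).
capped = ages.clip(lower=lower, upper=upper)
```

Z-scores are another common detection method when the data are roughly normal; an age of 250 would sit many standard deviations from the mean.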
9. Common Causes of Poor Data Quality
Poor data quality stems from errors, inconsistencies, missing values, outdated information, and a lack of validation, leading to unreliable analysis and flawed decisions.
12. Case Study: Wells Fargo Fake Accounts Scandal (2016)
Problem: The bank created fake customer accounts due to incorrect and duplicate data handling.
Consequence:
• Millions of fake accounts led to fraudulent fees.
• Loss of customer trust and legal penalties.
• A $3 billion settlement and regulatory actions.
Lesson Learned: High-quality data management is essential to prevent financial and reputational losses.
13. Data Transformation Methods
• Normalization: Scaling data to a specific range (e.g., 0 to 1).
• Standardization: Adjusting data to have a mean of 0 and a standard deviation of 1.
• Encoding Categorical Data: Label Encoding, One-Hot Encoding.
• Feature Engineering: Creating new meaningful features from existing data.
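A minimal sketch of the four methods using pandas and scikit-learn; the toy columns (income, age, city, loan_amount) are assumptions for illustration only:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

df = pd.DataFrame({
    "income": [30_000, 52_000, 75_000, 41_000],
    "age": [25, 47, 38, 31],
    "city": ["Lagos", "Accra", "Lagos", "Nairobi"],
    "loan_amount": [5_000, 12_000, 20_000, 8_000],
})

# Normalization: rescale income into the 0-to-1 range.
df["income_norm"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Standardization: shift/scale age to mean 0 and standard deviation 1.
df["age_std"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# Label Encoding: map each city to an integer code.
df["city_label"] = LabelEncoder().fit_transform(df["city"])

# One-Hot Encoding: one binary column per city category.
df = pd.get_dummies(df, columns=["city"])

# Feature Engineering: derive a new, meaningful feature from existing ones.
df["loan_to_income"] = df["loan_amount"] / df["income"]
print(df.head())
```

Label encoding is generally suited to ordinal categories or tree-based models, while one-hot encoding avoids implying an order among nominal categories.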
14. Summary
• Data preprocessing is essential for accurate analysis and decision-making.
• High-quality data ensures accuracy, consistency, and completeness.
• Addressing data issues improves overall data integrity and reliability.
• Poor data management can lead to significant risks and inefficiencies.