Data Preprocessing
Phase
2025
Introduction (Definition)
Definition: Data preprocessing is the process of preparing and cleaning raw data before analysis or modeling.
It involves transforming, correcting, and organizing data to ensure accuracy and consistency.
Importance
01 Enhances data quality, making it suitable for analysis and machine learning.
02 Reduces errors and inconsistencies that can impact predictions.
03 Improves efficiency and reliability of data-driven decisions.
04 Essential for achieving meaningful insights from data.
Characteristics of High-Quality Data
Accuracy: Data should be correct and free from errors.
Completeness: No missing values or gaps in important attributes.
Timeliness: Data should be up-to-date and relevant.
Relevance: Data should meet the specific needs of analysis.
Consistency: Uniform formatting and structure across datasets.
Integrity: Data should be trustworthy, ensuring relationships among different datasets remain intact.
Characteristics of High-Quality Data
Data quality relies on accuracy, completeness, consistency, timeliness, validity, and integrity
to ensure reliable and meaningful analysis.
Common Data Issues
Missing Values: Some data points are unavailable or left blank.
• Example: A dataset of patient records missing key medical history.
• Solution: Imputation techniques (mean, median, mode) or removing incomplete records.
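The two options named in the Solution bullet can be sketched in a few lines. This is a minimal illustration using only Python's standard library; the function names `impute` and `drop_incomplete`, the use of `None` to mark a missing value, and the sample ages are assumptions made for the example, not part of the original slides.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values.

    Hypothetical helper: None is assumed to mark a missing value.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def drop_incomplete(records, required):
    """Keep only records in which every required field is present and not None."""
    return [r for r in records if all(r.get(f) is not None for f in required)]

# Illustrative data only.
ages = [34, None, 51, 47, None, 34]
print(impute(ages, "median"))

patients = [
    {"id": 1, "history": "asthma"},
    {"id": 2, "history": None},
]
print(drop_incomplete(patients, required=["id", "history"]))
```

Imputation keeps every record but introduces estimated values; dropping records keeps only observed values but shrinks the dataset. Which trade-off is acceptable depends on how much data is missing and why.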
Duplicates: Repetitive entries in the dataset.
• Example: Multiple entries for the same patient with slight variations in spelling.
• Solution: Deduplication techniques and using unique identifiers.
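As a rough sketch of the Solution bullet: exact duplicates can be dropped by a unique identifier, while spelling variants need a similarity check before a person reviews them. The field name `patient_id`, the helper names, and the 0.85 threshold are assumptions chosen for illustration; `difflib.SequenceMatcher` is one simple standard-library way to score string similarity, not the only deduplication technique.

```python
from difflib import SequenceMatcher

def dedupe_by_id(records, key="patient_id"):
    """Keep the first record seen for each unique identifier (key name is hypothetical)."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

def similar_names(a, b, threshold=0.85):
    """Flag two names as likely the same person; the threshold is an illustrative choice."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold

# Illustrative data only.
records = [
    {"patient_id": 101, "name": "John Smith"},
    {"patient_id": 101, "name": "Jon Smith"},
    {"patient_id": 102, "name": "Mary Jones"},
]
print(dedupe_by_id(records))
print(similar_names("Jon Smith", "John Smith"))
```

When no reliable identifier exists, similarity matching can only suggest candidate duplicates; merging them automatically risks combining records of different people.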
Outliers: Extreme values that differ significantly from other observations.
• Example: A patient’s age recorded as 250 years.
• Solution: Statistical methods to detect and remove or correct anomalies.
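One common statistical method for the detection step is the interquartile range (IQR) rule, sketched below. The slides do not name a specific method, so the IQR rule, the conventional multiplier of 1.5, the function name `iqr_outliers`, and the sample ages are all assumptions for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile.

    k = 1.5 is a widely used convention, assumed here rather than taken from the slides.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Illustrative data only: one age is clearly a recording error.
ages = [29, 41, 35, 250, 52, 47, 38]
print(iqr_outliers(ages))
```

Detection only flags candidates. Whether a flagged value is removed or corrected still depends on domain knowledge, for example checking the source record for the true age.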
Common Causes of Poor Data Quality
Poor data quality stems from errors, inconsistencies, missing values, outdated info, and lack of
validation, leading to unreliable analysis and flawed decisions.
The Impact of Poor Data Quality on Business
Case Study – Poor Data Quality Impact
Case Study: Wells Fargo Fake Accounts Scandal (2016)
Problem: Bank employees opened customer accounts without the customers' authorization. The root cause was sales pressure and weak internal controls rather than incorrect or duplicate data handling alone, but the result was a large volume of invalid account records.
Consequence:
• Millions of fake accounts led to fraudulent fees.
• Loss of customer trust and legal penalties.
• A $3 billion settlement (reached in 2020) and regulatory actions.
Lesson Learned: High-quality data management is essential to prevent financial and reputational losses.
Data Transformation Methods
Normalization: Scaling data to a specific range (e.g., 0 to 1).
Standardization: Adjusting data to have a mean of 0 and a standard deviation of 1.
Encoding Categorical Data:
• Label Encoding
• One-Hot Encoding
Feature Engineering: Creating new meaningful features from existing data.
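The first three methods can be written out directly, which makes the definitions concrete. This is a minimal standard-library sketch; the function names are hypothetical, min-max scaling is assumed as the normalization formula, the population standard deviation is assumed for standardization, and category codes are assigned in sorted order as an arbitrary illustrative choice.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Scale values to the range 0 to 1 (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1 (population std assumed)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def label_encode(categories):
    """Map each category to an integer code; codes follow sorted order (illustrative choice)."""
    mapping = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping

def one_hot_encode(categories):
    """Map each category to a 0/1 vector with one position per distinct level."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories], levels

# Illustrative data only.
print(min_max_normalize([10, 20, 30]))
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
print(label_encode(["B", "A", "B"]))
print(one_hot_encode(["B", "A", "B"]))
```

Label encoding implies an order among the codes, which suits ordinal categories; one-hot encoding avoids that implied order at the cost of one column per level.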
Summary
• Data preprocessing is essential for accurate analysis and decision-making.
• High-quality data ensures accuracy, consistency, and completeness.
• Addressing data issues improves overall data integrity and reliability.
• Poor data management can lead to significant risks and inefficiencies.

Statistical Measures and Data Analysis - 8th Grade by Slidesgo.pptx
