Data Preprocessing
Content
• What is data preprocessing, and why preprocess the data?
• Data cleaning
• Data integration
• Data transformation
• Data reduction

PAAS Group
Data preprocessing is a data mining technique that transforms raw data into an understandable format.

Why preprocess the data?

Data Preprocessing
• Data in the real world is:
– incomplete: lacking attribute values or certain attributes of interest
– noisy: containing errors or outliers
– inconsistent: containing discrepancies between two or more facts (e.g., in codes or names)

• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– A data warehouse needs consistent integration of quality data

Measures of Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility

Data preprocessing techniques
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction

Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies

• Data integration
– Integration of multiple databases, data cubes, or files

• Data transformation
– Normalization and aggregation

• Data reduction
– Obtain a representation that is reduced in volume but produces the same or similar analytical results

Data Cleaning
“Data cleaning attempts to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in real-world data.”

Data Cleaning - Missing Values
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant
• Use the attribute mean
• Use the most probable value (decision tree, Bayesian formalism)
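Three of the strategies listed above (ignore the tuple, use a global constant, use the attribute mean) can be sketched in plain Python. This is only an illustration: the `records` list, the `income` attribute, and the constant `-1` are made-up assumptions, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing income value (an assumption).
records = [
    {"id": 1, "income": 30000},
    {"id": 2, "income": None},
    {"id": 3, "income": 50000},
    {"id": 4, "income": 40000},
]

def drop_missing(rows, attr):
    """Ignore the tuple: keep only rows where attr is present."""
    return [r for r in rows if r[attr] is not None]

def fill_constant(rows, attr, constant):
    """Use a global constant in place of every missing value."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Use the attribute mean, computed over the observed values only."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    avg = mean(observed)
    return [{**r, attr: avg if r[attr] is None else r[attr]} for r in rows]

dropped = drop_missing(records, "income")
by_constant = fill_constant(records, "income", -1)
by_mean = fill_mean(records, "income")
```

Each helper returns a new list and leaves `records` untouched, which makes it easy to compare strategies side by side. Filling with the most probable value would need a trained model and is not shown here.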

Data Cleaning - Noisy Data
• Binning
• Clustering
• Combined computer and human inspection
• Regression
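As one possible illustration of binning, the sketch below smooths values by bin means using equal-frequency bins. The slide does not say which binning variant is meant, so the choice of equal-frequency bins, the bin size, and the `prices` list are all assumptions.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the mean of its bin (equal-frequency binning)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical price data, chosen only for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
# Bins: [4, 8, 15], [21, 21, 24], [25, 28, 34] -> means 9.0, 22.0, 29.0
```

Smoothing by bin medians or bin boundaries would follow the same pattern with a different replacement value per bin.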

Data Cleaning - Inconsistent Data
• Manually, using external references
• Knowledge engineering tools

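A Few Important Terms
• Discrepancy detection: finding values that conflict with what is known about the data. Common causes:
– Human error (e.g., in data entry)
– Data decay (e.g., outdated addresses)
– Deliberate errors
• Metadata: data about the data, such as the domain and data type of each attribute
• Unique rule: every value of the attribute must differ from all other values of that attribute
• Null rule: specifies how null conditions (blanks, question marks, special characters) are recorded and handled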
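A rough sketch of how unique and null rules might be checked during discrepancy detection. It assumes a unique rule means that every value of an attribute must be distinct, and that a null rule names the markers that stand for a missing value. The marker set and the `customers` rows are hypothetical.

```python
# Assumed null conventions; a real null rule would come from the metadata.
NULL_MARKERS = {None, "", "?", "N/A"}

def check_unique_rule(rows, attr):
    """Return the values of attr that occur more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row[attr]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates

def check_null_rule(rows, attr):
    """Return the indices of rows whose attr holds an assumed null marker."""
    return [i for i, row in enumerate(rows) if row[attr] in NULL_MARKERS]

# Hypothetical customer rows.
customers = [
    {"cust_id": 101, "phone": "555-0100"},
    {"cust_id": 102, "phone": "?"},
    {"cust_id": 101, "phone": ""},
]
dup_ids = check_unique_rule(customers, "cust_id")
null_phones = check_null_rule(customers, "phone")
```

The checks only report discrepancies; deciding how to repair them is left to the cleaning step.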

Data Integration
“Data integration combines data from multiple sources into a coherent data store (e.g., a data warehouse).”

Data Integration - Issues
• Entity identification problem
• Redundancy
• Tuple duplication
• Detecting data value conflicts
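The issues above can be made concrete with a small, hypothetical example. Two sources describe the same customers under different attribute names and units (`customer_id` and `weight_kg` versus `cust_number` and `weight_lb`). All names, rows, and the two-decimal rounding are assumptions made for this sketch.

```python
LB_TO_KG = 0.45359237  # pounds to kilograms

def to_common_schema(row_b):
    """Entity identification: map source B's attribute names and units
    onto source A's schema."""
    return {
        "customer_id": row_b["cust_number"],
        "weight_kg": round(row_b["weight_lb"] * LB_TO_KG, 2),
    }

def integrate(rows_a, rows_b):
    """Merge both sources, dropping duplicate tuples and collecting
    data value conflicts for later inspection."""
    merged, conflicts = {}, []
    for row in rows_a + [to_common_schema(r) for r in rows_b]:
        key = row["customer_id"]
        if key not in merged:
            merged[key] = row
        elif merged[key] != row:
            conflicts.append((merged[key], row))
        # An identical row is a duplicate tuple and is simply skipped.
    return list(merged.values()), conflicts

source_a = [
    {"customer_id": 1, "weight_kg": 70.0},
    {"customer_id": 2, "weight_kg": 55.0},
]
source_b = [
    {"cust_number": 2, "weight_lb": 121.25},  # same as A after conversion
    {"cust_number": 3, "weight_lb": 176.37},  # only present in B
    {"cust_number": 1, "weight_lb": 200.0},   # disagrees with A
]
integrated, conflicts = integrate(source_a, source_b)
```

Here the first source wins when values conflict, which is an arbitrary policy; the conflicts are returned so that they can be resolved manually or with external references.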

Data Transformation
“Transforming or consolidating data into a form suitable for mining is known as data transformation.”

Handling Redundant Data in Data Integration
• Redundant data often occur when multiple databases are integrated
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table, e.g., annual revenue
Handling Redundant Data in Data Integration (continued)
• Redundant data can often be detected by correlation analysis
• Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
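For numeric attributes, one common form of correlation analysis is the Pearson correlation coefficient. The sketch below assumes that choice (the slide does not name a specific measure) and uses made-up attribute values. A coefficient near +1 or -1 suggests that one attribute may be derivable from the other.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical attributes from two integrated tables.
monthly_revenue = [10, 12, 9, 15, 11]
annual_revenue = [12 * m for m in monthly_revenue]  # a "derived" attribute
employees = [3, 9, 4, 2, 8]

r_derived = pearson(monthly_revenue, annual_revenue)  # close to 1: redundant
r_other = pearson(monthly_revenue, employees)         # weak correlation
```

A high correlation is only a hint of redundancy; whether to drop an attribute still depends on what the mining task needs.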
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
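A minimal sketch of aggregation, generalization, and min-max normalization (normalization is listed among the major tasks earlier in the deck). The `sales` rows and the city-to-country hierarchy are hypothetical, and min-max is just one of several possible normalization schemes.

```python
from collections import defaultdict

# Hypothetical sales rows: (city, quarter, amount).
sales = [
    ("Delhi", "Q1", 100), ("Delhi", "Q2", 150),
    ("Mumbai", "Q1", 200), ("Paris", "Q1", 300),
]

# Assumed concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Delhi": "India", "Mumbai": "India", "Paris": "France"}

def aggregate(rows, key_fn):
    """Aggregation: total amount per key."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_fn(row)] += row[2]
    return dict(totals)

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization; assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

per_city = aggregate(sales, lambda r: r[0])
# Generalization: climb the hierarchy from city to country before summing.
per_country = aggregate(sales, lambda r: CITY_TO_COUNTRY[r[0]])
normalized = min_max([r[2] for r in sales])
```

Climbing one level of the hierarchy shrinks the number of distinct keys, which is why generalization also serves as a data reduction step.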

Data Reduction
“Data reduction techniques are applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the base data.”

Data Reduction - Strategies
• Data cube aggregation
• Dimension reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation

Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree that tests A4 at the root and A1 and A6 below it, with leaves labeled Class 1 and Class 2]
Reduced attribute set: {A1, A4, A6}
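The idea behind this example is that attributes never tested in the induced tree can be dropped. The sketch below skips the induction step and hand-writes a tree that mirrors the slide. The nested-dictionary layout and the exact branch arrangement are assumptions, since the original figure is not available.

```python
# Hypothetical tree: internal nodes test an attribute, leaves are class labels.
tree = {
    "attr": "A4",
    "yes": {"attr": "A1", "yes": "Class 1", "no": "Class 2"},
    "no": {"attr": "A6", "yes": "Class 1", "no": "Class 2"},
}

def attributes_used(node):
    """Collect every attribute tested anywhere in the tree."""
    if not isinstance(node, dict):  # a leaf holds only a class label
        return set()
    return ({node["attr"]}
            | attributes_used(node["yes"])
            | attributes_used(node["no"]))

initial = ["A1", "A2", "A3", "A4", "A5", "A6"]
used = attributes_used(tree)
# Keep the original attribute order; A2, A3 and A5 never appear in the tree.
reduced = [a for a in initial if a in used]
```

In practice the tree would be learned from labeled data, and the same traversal would then yield the reduced attribute set.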
Histograms
• A popular data reduction technique
• Divide data into buckets and store the average (or sum) for each bucket
• Can be constructed optimally in one dimension
• Related to quantization problems
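A small sketch of an equal-width histogram that keeps only a count and a sum per bucket, from which the average follows. Equal-width bucketing is one assumed choice among several, and the `values` list is invented for illustration.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize values as num_buckets equal-width buckets.
    Assumes the values are not all identical (width would be zero)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    buckets = [
        {"low": lo + i * width, "high": lo + (i + 1) * width, "count": 0, "sum": 0}
        for i in range(num_buckets)
    ]
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        index = min(int((v - lo) / width), num_buckets - 1)
        buckets[index]["count"] += 1
        buckets[index]["sum"] += v
    return buckets

values = [1, 5, 8, 10, 12, 15, 18, 20, 25, 30]  # hypothetical data
histogram = equal_width_histogram(values, num_buckets=3)
averages = [b["sum"] / b["count"] for b in histogram if b["count"]]
```

Ten values are reduced to three bucket summaries; the reduction becomes more worthwhile as the data set grows while the number of buckets stays fixed.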

Clustering
• Partition the data set into clusters and store only the cluster representation
• Can be very effective if the data naturally forms clusters
• Clusterings can be hierarchical and stored in multidimensional index tree structures
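To illustrate storing only a cluster representation, the sketch below runs a very small one-dimensional k-means and keeps a centroid and a count per cluster. K-means is an assumed choice of clustering method, and the data and initial centroids are made up.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means; centroids is the assumed initial guess."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute centroids; keep the old one if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

values = [1, 2, 3, 10, 11, 12, 50, 51, 52]  # hypothetical, clearly clustered
centroids, clusters = kmeans_1d(values, centroids=[0, 10, 60])

# The reduced representation: one (centroid, count) pair per cluster.
summary = [
    {"centroid": c, "count": len(members)}
    for c, members in zip(centroids, clusters)
]
```

Nine values are replaced by three summaries. The approximation is good here because the data is tightly grouped; it would be much worse for evenly spread values.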

Sampling
• Allows a large data set to be represented by a much smaller random sample of the data.
• Let D be a large data set containing N tuples.
• Methods to reduce data set D:
– Simple random sample without replacement (SRSWOR)
– Simple random sample with replacement (SRSWR)
– Cluster sample
– Stratified sample
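The four sampling methods can be sketched with Python's standard `random` module. The data set `D`, the `age_group` strata, the cluster size, and the sampling fraction are all hypothetical. For the cluster sample, the tuples are assumed to be grouped into fixed-size clusters (for example, disk pages).

```python
import random
from collections import defaultdict

def srswor(data, n, rng):
    """Simple random sample without replacement: no tuple is drawn twice."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample with replacement: a tuple may be drawn again."""
    return rng.choices(data, k=n)

def cluster_sample(data, cluster_size, m, rng):
    """Group tuples into fixed-size clusters and take an SRSWOR of m clusters."""
    clusters = [data[i:i + cluster_size] for i in range(0, len(data), cluster_size)]
    chosen = rng.sample(clusters, m)
    return [item for cluster in chosen for item in cluster]

def stratified_sample(data, stratum_of, fraction, rng):
    """Take an SRSWOR of the same fraction from every stratum (at least one tuple)."""
    strata = defaultdict(list)
    for item in data:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Hypothetical data set D with N = 100 tuples: 80 "young", 20 "senior".
D = [{"id": i, "age_group": "young" if i < 80 else "senior"} for i in range(100)]

without = srswor(D, 10, rng)
with_repl = srswr(D, 10, rng)
by_cluster = cluster_sample(D, cluster_size=10, m=2, rng=rng)
by_stratum = stratified_sample(D, lambda t: t["age_group"], 0.1, rng)
```

The stratified sample keeps the 80/20 proportion of the two groups, which a small simple random sample is not guaranteed to do.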

Sampling
[Figure: SRSWOR (simple random sample without replacement) and SRSWR samples drawn from the raw data]

Sampling
[Figure: cluster/stratified sample drawn from the raw data]
