By V.Sakthi Priya, M.Sc (IT)
Department of CS & IT,
Nadar Saraswathi College of Arts and Science,
Theni.

Data Reduction
Data Reduction
1. Overview
2. The Curse of Dimensionality
3. Data Sampling
4. Binning and Reduction of Cardinality
Overview
 Data Reduction techniques are usually categorized into three main families:
 Dimensionality Reduction: reduces the number of attributes or random variables in the data set.
 Sample Numerosity Reduction: replaces the original data with an alternative, smaller data representation.
 Cardinality Reduction: applies transformations to obtain a reduced representation of the original data.
The Curse of Dimensionality
 Dimensionality becomes a serious obstacle to the efficiency of most DM algorithms.
 It has been estimated that as the number of dimensions increases, the sample size must increase exponentially in order to obtain an effective estimate of multivariate densities.
• Several dimension reducers have been developed over the years.
• Linear methods:
– Factor Analysis
– Multidimensional Scaling
– Principal Components Analysis
• Nonlinear methods:
– Locally Linear Embedding
– ISOMAP
The Curse of Dimensionality
 Feature Selection methods aim to eliminate irrelevant and redundant features, reducing the number of variables in the model.
 The goal is to find an optimal subset B that solves the optimization problem
J(B) = max { J(Z) : Z ⊆ A, |Z| = d }
where:
J(B)  criterion function
A  original set of features
Z  candidate subset of features
d  minimum number of features
Multidimensional Scaling:
 A method for situating a set of points in a low-dimensional space such that a classical distance measure (e.g., Euclidean) between the points matches their original pairwise distances as closely as possible.
 We can compute the distances in the high-dimensional space of the original data and then use this distance matrix as input; the method projects the points into a lower-dimensional space so as to preserve these distances.
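A minimal sketch of classical (metric) MDS in plain NumPy: double-center the squared distance matrix and keep the eigenvectors with the largest eigenvalues. When the distances really come from a k-dimensional Euclidean configuration, the recovered coordinates reproduce them exactly (up to rotation):

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (metric) MDS: given an n x n matrix of pairwise Euclidean
    distances D, recover k-dimensional coordinates whose pairwise
    distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]            # k largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Points that already live in 2-D: MDS reproduces their distances.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, k=2)
D2 = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.allclose(D, D2))  # True: the 2-D configuration preserves all distances
```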
Data Sampling
 To reduce the number of instances submitted to the DM algorithm.
 To support the selection of only those cases in which the response is relatively homogeneous.
 To assist with balancing the data and handling the occurrence of rare events.
 To divide a data set into three data sets for the subsequent analysis of DM algorithms.
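Two of these uses can be sketched together, assuming plain NumPy: a stratified sample that keeps the class proportions of an imbalanced label vector, and a simple three-way (train / validation / test) split:

```python
import numpy as np

def stratified_sample(y, frac, rng):
    """Return indices of a stratified sample: `frac` of each class."""
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        take = max(1, int(round(frac * members.size)))
        idx.extend(rng.choice(members, size=take, replace=False))
    return np.array(idx)

rng = np.random.default_rng(2)
y = np.array([0] * 90 + [1] * 10)        # imbalanced labels (rare class 1)
sample = stratified_sample(y, frac=0.2, rng=rng)
print(np.bincount(y[sample]))            # [18 2]: proportions preserved

# Dividing a data set into three sets by shuffling indices (60/20/20):
idx = rng.permutation(len(y))
train, val, test = np.split(idx, [int(0.6 * len(y)), int(0.8 * len(y))])
print(len(train), len(val), len(test))   # 60 20 20
```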
Data Sampling: Data Condensation
 Data condensation emerges from the fact that naive sampling methods, such as random sampling or stratified sampling, are not suitable for real-world problems with noisy data, since the performance of the algorithms may change unpredictably and significantly.
 Condensation methods attempt to obtain a minimal set which correctly classifies all the original examples.
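Hart's Condensed Nearest Neighbour is one classic method of this kind. A rough NumPy sketch (it finds a small consistent subset, not a provably minimal one):

```python
import numpy as np

def condense(X, y):
    """Condensed Nearest Neighbour: keep a small subset S such that
    1-NN on S classifies every original example correctly."""
    S = [0]                                  # start with one prototype
    changed = True
    while changed:                           # sweep until consistent
        changed = False
        for i in range(len(X)):
            d = np.linalg.norm(X[S] - X[i], axis=1)
            if y[S][np.argmin(d)] != y[i]:   # misclassified by S
                S.append(i)                  # absorb it as a prototype
                changed = True
    return np.array(S)

# Two well-separated classes: only a few prototypes are needed.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
S = condense(X, y)
print(len(S) < len(X))  # True: far fewer points, same 1-NN labels
```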
Data Clustering
• Partition the data set into clusters based on similarity, and store only a cluster representation (e.g., centroid and diameter)
• Can be very effective if the data is clustered, but not if the data is “smeared”
• Can use hierarchical clustering and be stored in multidimensional index tree structures
• There are many choices of clustering definitions and clustering algorithms; cluster analysis will be studied separately
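The centroid-and-diameter representation can be sketched with a plain Lloyd's k-means in NumPy. "Diameter" is crudely approximated here by the largest per-axis range within a cluster; this is an illustrative choice, not the only definition:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means: assign each point to its nearest centroid,
    then recompute centroids, repeatedly."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - C[None], axis=2), axis=1)
        # Keep a centroid unchanged if its cluster momentarily empties.
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return labels, C

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(8, 0.5, (100, 2))])
labels, C = kmeans(X, k=2)

# Reduced representation: per cluster, store only (centroid, diameter).
summary = [(C[j], np.ptp(X[labels == j], axis=0).max()) for j in range(2)]
print(len(X), "points ->", len(summary), "(centroid, diameter) pairs")
```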
Data Reduction Strategies
• Dimensionality reduction, e.g., remove unimportant attributes
– Wavelet transforms
– Principal Components Analysis (PCA)
– Feature subset selection, feature creation
• Numerosity reduction (some simply call it: data reduction)
– Regression and log-linear models
– Histograms, clustering, sampling
– Data cube aggregation
• Data compression
Wavelet Transformation
• Discrete Wavelet Transform (DWT) for linear signal processing, multi-resolution analysis
• Compressed approximation: store only a small fraction of the strongest wavelet coefficients
• Similar to the Discrete Fourier Transform (DFT), but gives better lossy compression and is localized in space
• Method:
– The length, L, must be an integer power of 2 (pad with 0s when necessary)
– Each transform has 2 functions: smoothing and difference
– The functions are applied to pairs of data, resulting in two sets of data of length L/2
– The two functions are applied recursively until the desired length is reached
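The method above can be sketched with the Haar wavelet, using pairwise averaging as the smoothing function and half-differences as the difference function (the unnormalized variant; orthonormal Haar divides by √2 instead). The input data here is just an illustrative length-8 sequence:

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar DWT: smoothing (pairwise average) and
    difference, each applied to pairs, giving two halves of length L/2.
    Requires len(x) to be a power of 2."""
    s = (x[0::2] + x[1::2]) / 2          # smoothing function
    d = (x[0::2] - x[1::2]) / 2          # difference (detail) function
    return s, d

def haar_full(x):
    """Apply the two functions recursively until length 1 is reached."""
    coeffs = []
    while len(x) > 1:
        x, d = haar_dwt(x)
        coeffs.append(d)
    coeffs.append(x)                     # final overall average
    return coeffs

x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])   # L = 8 = 2^3
s, d = haar_dwt(x)
print(s)  # [2. 1. 4. 4.]  -- length L/2
print(d)  # [ 0. -1. -1.  0.]
```

Compression then keeps only the strongest of the resulting coefficients and treats the rest as zero.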
• Clusters correspond to regions where points concentrate; weaker information at their boundaries is suppressed
• Effective removal of outliers; insensitive to noise and to input order
• Multi-resolution: detects arbitrarily shaped clusters at different scales
• Efficient: complexity O(N)
• Only applicable to low-dimensional data
Principal Component Analysis (Steps)
– Normalize the input data so that each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., the principal components
– Each input data vector is a linear combination of the k principal component vectors
– The principal components are sorted in order of decreasing “significance” or strength
– Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
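The steps above can be sketched in plain NumPy via the eigendecomposition of the covariance matrix (here on synthetic, nearly rank-1 data so that one component suffices to reconstruct a good approximation):

```python
import numpy as np

def pca_reduce(X, k):
    """Center the data, compute orthonormal components, sort them by
    decreasing variance, and keep the k strongest."""
    Xc = X - X.mean(axis=0)                      # center each attribute
    cov = np.cov(Xc, rowvar=False)
    w, V = np.linalg.eigh(cov)                   # orthonormal eigenvectors
    order = np.argsort(w)[::-1]                  # decreasing "significance"
    Vk = V[:, order[:k]]                         # strongest k components
    return Xc @ Vk, Vk

rng = np.random.default_rng(5)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t, -t]) + rng.normal(scale=0.01, size=(200, 3))
Z, Vk = pca_reduce(X, k=1)                       # 3 attributes -> 1 component
X_hat = Z @ Vk.T + X.mean(axis=0)                # reconstruct approximation
print(np.abs(X - X_hat).max() < 0.1)             # True: good approximation
```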
Attribute Subset Selection
• Another way to reduce the dimensionality of data
• Redundant attributes duplicate much or all of the information contained in one or more other attributes, e.g., the purchase price of a product and the amount of sales tax paid
• Irrelevant attributes contain no information that is useful for the data mining task at hand, e.g., a student's ID is often irrelevant to the task of predicting the student's GPA
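A simple correlation screen illustrates both cases on synthetic data modeled after the slide's examples (the `gpa` target and the 8% tax rate are made-up illustrations). Note this heuristic only detects linear relationships; it is not a full subset search:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
price = rng.uniform(10, 100, n)                     # purchase price
tax = 0.08 * price                                  # redundant: determined by price
student_id = rng.permutation(n).astype(float)       # irrelevant attribute
gpa = 0.02 * price + rng.normal(scale=0.1, size=n)  # hypothetical target

# Near-perfect correlation between two attributes flags redundancy;
# near-zero correlation with the target flags (linear) irrelevance.
is_redundant = abs(np.corrcoef(price, tax)[0, 1]) > 0.99
is_irrelevant = abs(np.corrcoef(student_id, gpa)[0, 1]) < 0.2
print(is_redundant, is_irrelevant)   # True True
```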
Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation
• Parametric methods (e.g., regression): assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers). Ex.: log-linear models, which obtain the value at a point in m-D space as the product on appropriate marginal subspaces
• Non-parametric methods: do not assume models. Major families: histograms, clustering, sampling
Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression: data modeled to fit a straight line; often uses the least-squares method to fit the line
• Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
• Log-linear model: approximates discrete multidimensional probability distributions
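A least-squares multiple regression illustrates the parametric idea on synthetic data: 100 data rows are replaced by just 4 stored parameters (3 coefficients plus an intercept), from which any response can be re-estimated:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))            # multidimensional feature vectors
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 4.0 + rng.normal(scale=0.1, size=100)

# Least-squares fit: store only the parameters, discard the data.
A = np.hstack([X, np.ones((100, 1))])    # append an intercept column
params, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(params, 1))               # close to [ 2.  -1.   0.5  4. ]
```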