A. D. Patel Institute Of Technology
Data Mining And Business Intelligence (2170715): A. Y. 2019-20
Data Compression – Numerosity Reduction
Prepared By :
Dhruv V. Shah (160010116053)
B.E. (IT) Sem - VII
Guided By :
Prof. Ravi D. Patel
(Dept. of IT, ADIT)
Department Of Information Technology
A.D. Patel Institute Of Technology (ADIT)
New Vallabh Vidyanagar, Anand, Gujarat
Outline
 Introduction
 Data Reduction Strategies
 Numerosity Reduction
 Numerosity Reduction Methods
1) Parametric Methods
1.1) Regression
1.2) Log-Linear Model
2) Non-Parametric Methods
2.1) Histograms
2.2) Clustering
2.3) Sampling
2.4) Data Cube Aggregation
 References
Introduction
 Why is Data Reduction needed?
 A database/data warehouse may store terabytes of data.
 Complex data analysis/mining may take a very long time to run on the complete data set.
 Data Reduction:
 Data Reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data.
 That is, mining on the reduced data set should be more efficient, yet produce the same
analytical results.
Data Reduction Strategies
 Data cube aggregation
 Attribute Subset Selection
 Numerosity reduction — e.g., fit data into models
 Dimensionality reduction — e.g., data compression
 Discretization and concept hierarchy generation
Numerosity Reduction
 What is Numerosity Reduction?
 These techniques replace the original data volume by alternative, smaller forms of data
representation.
 There are two categories of numerosity reduction methods:
1) Parametric
2) Non-Parametric
Numerosity Reduction Methods
1) Parametric Methods :
 A model is used to estimate the data, so that only the model parameters need to be stored, and
not the actual data.
 These methods assume the data fits some model, estimate the model's parameters, and store
only those parameters.
 The Regression and Log-Linear methods are used for creating such models.
 Regression :
 Regression can be a simple linear regression or multiple linear regression.
 When there is only a single independent attribute, the model is called simple linear
regression; if there are multiple independent attributes, it is called multiple linear
regression.
 In linear regression, the data are modeled to fit a straight line.
Cont.…
 For example,
a random variable y can be modeled as a linear function of another random variable x with the
equation y = ax + b, where a and b (the regression coefficients) specify the slope and y-intercept of
the line, respectively.
 In multiple linear regression, y is modeled as a linear function of two or more
predictor (independent) variables.
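As a sketch of how a parametric method replaces raw data, the two regression coefficients can be estimated by least squares and stored instead of the points themselves (the data values below are illustrative, not from the slides):

```python
# Fit y = a*x + b by least squares; only (a, b) need to be stored.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope a = cov(x, y) / var(x); intercept b = mean_y - a * mean_x.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]          # exactly y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)                     # -> 2.0 1.0
```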
 Log-Linear Model :
 Log-linear model can be used to estimate the probability of each data point in a
multidimensional space for a set of discretized attributes, based on a smaller subset of
dimensional combinations.
 This allows a higher-dimensional data space to be constructed from lower-dimensional
attributes.
 Regression and log-linear model can both be used on sparse data, although their application
may be limited.
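The simplest log-linear model is the independence model: a 2-D joint distribution is approximated from its 1-D marginals, so only the (much smaller) marginals need to be stored. A toy sketch with illustrative counts:

```python
# Approximate a 2-D joint distribution of two discretized attributes from its
# 1-D marginals (independence model). Counts below are made up for illustration.
joint = {
    ("low", "yes"): 30, ("low", "no"): 30,
    ("high", "yes"): 20, ("high", "no"): 20,
}
total = sum(joint.values())

pa, pb = {}, {}
for (a, b), c in joint.items():
    pa[a] = pa.get(a, 0) + c / total    # marginal over the first attribute
    pb[b] = pb.get(b, 0) + c / total    # marginal over the second attribute

# Reconstruct the higher-dimensional space from the lower-dimensional marginals.
est = {(a, b): pa[a] * pb[b] for a in pa for b in pb}
print(est[("low", "yes")])              # -> 0.3, matching the observed 30/100
```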
2) Non-Parametric Methods :
 These methods do not assume a model for the data.
 Methods for storing reduced representations of the data include histograms, clustering,
sampling, and data cube aggregation.
Cont.…
1) Histograms :
 Divide data into buckets and store average (sum) for each bucket.
 Partitioning rules:
1) Equal-width:
Equal bucket range
2) Equal-frequency (or equal-depth) :
Each bucket contains roughly the same number of contiguous data samples
 Binning Method :
 Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Smoothing by bin means:
Bin 1: 9, 9, 9, 9 (4 + 8 + 9 + 15)/4
Bin 2: 23, 23, 23, 23 (21+ 21+ 24 + 25)/4
Bin 3: 29, 29, 29, 29 (26 + 28 + 29 + 34)/4
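The equal-depth binning and smoothing by bin means above can be reproduced directly (function name is illustrative):

```python
# Equal-depth binning of the slide's price list, then smoothing by bin means.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def smooth_by_bin_means(data, n_bins):
    data = sorted(data)
    depth = len(data) // n_bins          # assumes len(data) divisible by n_bins
    smoothed = []
    for i in range(0, len(data), depth):
        bin_vals = data[i:i + depth]
        mean = round(sum(bin_vals) / len(bin_vals))
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means(prices, 3))
# -> [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
```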
Cont.…
3) V-optimal:
Of all possible histograms for a given number of buckets, choose the one with the least
variance (histogram variance is a weighted sum of the original values that each bucket represents)
4) MaxDiff:
Consider the difference between each pair of adjacent values; set a bucket boundary
between the pairs having the β−1 largest differences (β = number of buckets)
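A minimal sketch of the MaxDiff rule (function name and data are illustrative): sort the values, find the β−1 largest gaps between neighbours, and cut the buckets there.

```python
# MaxDiff sketch: bucket boundaries go at the (beta - 1) largest gaps
# between adjacent sorted values, where beta is the number of buckets.
def maxdiff_buckets(data, beta):
    data = sorted(data)
    gaps = [(data[i + 1] - data[i], i) for i in range(len(data) - 1)]
    # Indices after which to cut: the beta - 1 largest adjacent differences.
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:beta - 1])
    buckets, start = [], 0
    for c in cuts:
        buckets.append(data[start:c + 1])
        start = c + 1
    buckets.append(data[start:])
    return buckets

print(maxdiff_buckets([1, 2, 2, 3, 10, 11, 30], beta=3))
# -> [[1, 2, 2, 3], [10, 11], [30]]
```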
Cont….
 Multi-dimensional histogram
Fig. Histogram with Singleton buckets
Cont.…
Fig. Equal-width Histogram
 List of prices:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
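An equal-width histogram over that price list can be computed as follows (bucket edges 1–10, 11–20, 21–30 are an illustrative choice):

```python
# Equal-width histogram (width 10) over the slide's price list.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

def equal_width_histogram(data, low, high, n_buckets):
    width = (high - low) / n_buckets
    counts = [0] * n_buckets
    for v in data:
        # Clamp so a value on the top edge falls in the last bucket.
        idx = min(int((v - low) / width), n_buckets - 1)
        counts[idx] += 1
    return counts

print(equal_width_histogram(prices, 1, 31, 3))   # -> [13, 25, 14]
```

Only three counts need to be stored instead of 52 individual prices.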
2) Clustering :
 Clustering partitions the data into groups (clusters) of similar objects.
 In data reduction, the cluster representations of the data are used to replace the actual data.
 It also helps to detect outliers in the data.
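A minimal 1-D k-means sketch shows the idea: the data set is replaced by a few cluster centroids (in practice, per-cluster counts and diameters would be stored as well). The data and starting centroids are illustrative:

```python
# Tiny 1-D k-means: keep only the centroids as the reduced representation.
def kmeans_1d(data, centroids, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in data:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, centroids)]
    return centroids

data = [1, 2, 2, 3, 20, 21, 22, 40, 41]
print(kmeans_1d(data, centroids=[0.0, 25.0, 45.0]))   # -> [2.0, 21.0, 40.5]
```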
Fig. Clustering
3) Sampling :
 Sampling: obtaining a small sample s to represent the whole data set N
 Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the
data
 Choose a representative subset of the data
 Simple random sampling may have very poor performance in the presence of skew
 Develop adaptive sampling methods
 Stratified sampling
 Approximate the percentage of each class (or subpopulation of interest) in the
overall database.
 Used in conjunction with skewed data.
 Sampling may not reduce database I/Os (page at a time).
Sampling Techniques :
 Simple Random Sample Without Replacement (SRSWOR)
 Simple Random Sample With Replacement (SRSWR)
 Cluster Sample
 Stratified Sample
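The two simple random variants can be sketched with the standard library: `random.sample` draws without replacement, `random.choices` with replacement.

```python
import random

random.seed(42)                        # seeded only to make the example reproducible
N = list(range(100))                   # the "whole data set"

srswor = random.sample(N, 10)          # SRSWOR: no duplicates possible
srswr = random.choices(N, k=10)        # SRSWR: duplicates possible

assert len(set(srswor)) == 10          # all distinct by construction
```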
Sampling: Random Sample with or without Replacement
Fig. SRSWOR & SRSWR
Cluster Sample
 Tuples are grouped into M mutually disjoint clusters
 SRS of m clusters is taken where m < M
 Tuples in a database are retrieved a page at a time, so each page can be treated as a cluster
 Applying SRSWOR to the pages yields a cluster sample of the tuples
Stratified Sample
 Data is divided into mutually disjoint parts called strata
 SRS at each stratum
 Representative samples ensured even in the presence of skewed data
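A sketch of stratified sampling on skewed data (the function name, record layout, and numbers are illustrative): take a simple random sample inside each stratum so class proportions survive.

```python
import random

def stratified_sample(records, key, frac, seed=0):
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)   # group records into strata
    sample = []
    for members in strata.values():
        k = max(1, round(frac * len(members)))    # keep class proportions
        sample.extend(rng.sample(members, k))     # SRS within each stratum
    return sample

# 90 "young" vs 10 "senior" customers: skewed data.
records = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
s = stratified_sample(records, key=lambda r: r[0], frac=0.1)
print(len(s))        # -> 10  (9 young + 1 senior)
```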
Cluster and Stratified Sampling
Fig. Cluster & Stratified Sampling
Features of Sampling :
 Cost depends on size of sample.
 Sub-linear on size of data.
 Linear with respect to dimensions.
 Estimates answer to an aggregate query.
4) Data Cube Aggregation :
 A data cube is generally used to easily interpret data. It is especially useful when representing
data together with dimensions as certain measures of business requirements.
 Every dimension of a cube represents a certain characteristic of the database.
 Data Cubes store multidimensional aggregated information.
 Data cubes provide fast access to precomputed, summarized data, thereby benefiting online
analytical processing (OLAP) as well as data mining.
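The core operation is a roll-up: detailed records are aggregated to a coarser level of a dimension and only the precomputed totals are kept. A sketch with illustrative sales figures:

```python
# Roll up quarterly sales to yearly totals (time dimension: quarter -> year).
from collections import defaultdict

sales = [
    ("2019", "Q1", 224), ("2019", "Q2", 408),
    ("2019", "Q3", 350), ("2019", "Q4", 586),
    ("2020", "Q1", 310), ("2020", "Q2", 290),
]

yearly = defaultdict(int)
for year, quarter, amount in sales:
    yearly[year] += amount           # roll-up: drop the quarter level

print(dict(yearly))                  # -> {'2019': 1568, '2020': 600}
```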
Categories of Data Cube :
 Dimensions:
 Represent categories of data, such as time or location.
 Each dimension includes different levels of categories.
 Example : (figure omitted)
Categories of Data Cube :
 Measures:
 These are the actual data values that occupy the cells as defined by the dimensions selected.
 Measures include facts or variables typically stored as numerical fields.
 Example : (figure omitted)
References
 https://en.wikipedia.org/wiki/Data_cube
 https://www.geeksforgeeks.org/numerosity-reduction-in-data-mining/
 http://www.lastnightstudy.com/Show?id=44/Data-Reduction-In-Data-Mining