Discretization and Concept Hierarchy Generation
Discretization
 Types of attributes:
 Nominal — values from an unordered set, e.g., color, profession
 Ordinal — values from an ordered set, e.g., military or academic rank
 Continuous — numeric values (integers or reals), e.g., age, income
 Discretization:
 Divide the range of a continuous attribute into intervals
 Reduce data size by replacing values with interval labels
Discretization and Concept Hierarchy
 Discretization
 Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
 Interval labels can then be used to replace actual data values
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
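The replacement of raw values by interval labels described above can be sketched in plain Python. This is a minimal illustration, not a library routine; the bin count, labels, and age data are made up:

```python
# A minimal sketch of unsupervised discretization: divide the range of a
# continuous attribute into equal-width intervals and replace each value
# with an interval label. All data and labels here are illustrative.
def discretize_equal_width(values, n_bins, labels):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    result = []
    for v in values:
        # Index of the bin containing v; clamp the maximum into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        result.append(labels[idx])
    return result

ages = [13, 15, 16, 19, 20, 35, 40, 45, 52, 70]
print(discretize_equal_width(ages, 3, ["young", "middle-aged", "senior"]))
# -> ['young', 'young', 'young', 'young', 'young',
#     'middle-aged', 'middle-aged', 'middle-aged', 'senior', 'senior']
```

Applying the function again to the resulting labels (with coarser labels) is the recursive application the slide mentions.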
Concept Hierarchy
 Concept hierarchy formation
 Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)
 Detail is lost
 More meaningful
 Easier to interpret
 Mining becomes easier
 Several concept hierarchies can be defined for the same attribute
 Defined manually or implicitly
Discretization and Concept Hierarchy Generation for Numeric Data
 Typical methods:
 Binning
 Histogram analysis
 Clustering analysis
 Entropy-based discretization
 χ² merging
 Segmentation by natural partitioning
All the methods can be applied recursively
Techniques
 Binning
 Distribute values into bins
 Replace by bin mean / median
 Recursive application leads to concept hierarchies
 Unsupervised technique
 Histogram Analysis
 Partition based on the data distribution
 Equi-width — (0-100], (100-200], …
 Equi-depth
 Recursive
 Minimum interval size
 Unsupervised
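The binning variant above (equi-depth partitioning, then smoothing by bin means) can be sketched as follows. The price list and bin count are illustrative, and the sketch assumes the number of values divides evenly into the bins:

```python
# Sketch of equi-depth (equal-frequency) binning with replacement by bin
# means. Illustrative data; assumes len(values) is divisible by n_bins.
def equidepth_bin_means(values, n_bins):
    ordered = sorted(values)
    size = len(ordered) // n_bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins)]
    # Smooth: replace every value in a bin by that bin's mean.
    return [[sum(b) / len(b)] * len(b) for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equidepth_bin_means(prices, 3))
# -> [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```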
Techniques
 Cluster Analysis
 Clusters form the nodes of a concept hierarchy
 Clusters can be decomposed or combined
 Decomposition yields lower levels; combination yields higher levels of the hierarchy
Entropy-Based Discretization
 Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information requirement after partitioning is

    I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)

 Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

    Entropy(S1) = − Σi=1..m pi · log2(pi)

where pi is the probability of class i in S1
 The boundary that minimizes the expected information requirement over all possible boundaries is selected as the binary discretization
 The process is applied recursively to the resulting partitions until some stopping criterion is met
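The boundary search defined above can be sketched directly from the formulas. The (value, class) samples below are made up, chosen so that one boundary separates the classes perfectly:

```python
import math

# Sketch of entropy-based discretization: evaluate each candidate boundary
# T between consecutive distinct values and pick the one minimizing I(S, T).
def entropy(labels):
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def best_boundary(samples):
    samples = sorted(samples)              # sort (value, class) pairs by value
    n = len(samples)
    best_t, best_i = None, float("inf")
    for k in range(1, n):
        if samples[k][0] == samples[k - 1][0]:
            continue                       # no boundary between equal values
        t = (samples[k][0] + samples[k - 1][0]) / 2
        s1 = [c for v, c in samples if v <= t]
        s2 = [c for v, c in samples if v > t]
        i_st = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t, best_i

samples = [(1, "A"), (2, "A"), (3, "A"), (7, "B"), (8, "B"), (9, "B")]
print(best_boundary(samples))  # -> (5.0, 0.0): a pure split needs no information
```

Recursing on each side of the chosen boundary, until an interval is pure or a depth limit is reached, yields the full discretization.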
Entropy-Based Discretization (contd.)
 Reduces data size
 Class information is considered
 Improves accuracy
Interval Merging by χ² Analysis
 ChiMerge
 Bottom-up approach
 Finds the best neighbouring intervals and merges them to form larger intervals
 Supervised
 If two adjacent intervals have a similar distribution of classes, they can be merged
 Initially, each value is in a separate interval
 χ² tests are performed on adjacent intervals; those with the lowest χ² values are merged
 Can be repeated
 Stopping condition (threshold, number of intervals)
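The χ² statistic that ChiMerge computes for a pair of adjacent intervals can be sketched as below. The per-class counts are made-up data, and the helper name is our own; a full ChiMerge would repeat this over all adjacent pairs and merge the lowest-scoring pair until the stopping condition holds:

```python
# Sketch of the chi-square statistic for two adjacent intervals: compare
# observed per-class counts against the counts expected if both intervals
# shared one class distribution. Small chi2 -> similar -> merge candidates.
def chi2_adjacent(counts1, counts2):
    classes = set(counts1) | set(counts2)
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    total = n1 + n2
    chi2 = 0.0
    for c in classes:
        col = counts1.get(c, 0) + counts2.get(c, 0)      # class total
        for obs, n in ((counts1.get(c, 0), n1), (counts2.get(c, 0), n2)):
            exp = n * col / total                        # expected count
            if exp > 0:
                chi2 += (obs - exp) ** 2 / exp
    return chi2

# Similar class distributions -> tiny chi2 -> safe to merge.
print(chi2_adjacent({"A": 4, "B": 1}, {"A": 5, "B": 1}))
# Opposite distributions -> large chi2 -> keep the boundary.
print(chi2_adjacent({"A": 5, "B": 0}, {"A": 0, "B": 5}))  # -> 10.0
```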
Segmentation by Natural Partitioning
 A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals.
 If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
 If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
 If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
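The core decision of the rule, choosing 3, 4, or 5 equi-width intervals from the count of distinct most-significant-digit values, can be sketched as follows. The helper names are our own:

```python
import math

def msd_width(low, high):
    """Width of one step at the most significant digit of the range."""
    return 10 ** math.floor(math.log10(high - low))

# Sketch of the 3-4-5 rule's partitioning step for a single interval.
def partition_345(low, high):
    step = msd_width(low, high)
    distinct = round((high - low) / step)   # distinct msd values covered
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    else:                                   # 1, 5, or 10 distinct values
        n = 5
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(partition_345(-1000, 2000))  # 3 distinct msd values -> 3 intervals
print(partition_345(0, 1000))      # 1 distinct msd value  -> 5 intervals
```

Applying `partition_345` recursively to each resulting interval produces the multi-level segmentation shown in the worked example below.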
Segmentation by Natural Partitioning (contd.)
 Outliers could be present
 Consider only the majority of the values
 e.g., the 5th percentile to the 95th percentile
Example of 3-4-5 Rule
Suppose profits range from -$351 (Min) to $4,700 (Max).

Step 1: Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700
Step 2: msd = 1,000; rounding Low and High at the msd gives Low′ = -$1,000 and High′ = $2,000
Step 3: (-$1,000 – $2,000) covers 3 distinct values at the msd, so partition into 3 equi-width intervals: (-$1,000 – 0], (0 – $1,000], ($1,000 – $2,000]
Step 4: Adjust to the actual extremes, then recursively partition each top-level interval:
 Min > -$1,000, so the first interval shrinks to (-$400 – 0]; Max > $2,000, so ($2,000 – $5,000] is added, giving the overall range (-$400 – $5,000)
 (-$400 – 0]: 4 sub-intervals — (-$400 – -$300], (-$300 – -$200], (-$200 – -$100], (-$100 – 0]
 (0 – $1,000]: 5 sub-intervals — (0 – $200], ($200 – $400], ($400 – $600], ($600 – $800], ($800 – $1,000]
 ($1,000 – $2,000]: 5 sub-intervals — ($1,000 – $1,200], ($1,200 – $1,400], ($1,400 – $1,600], ($1,600 – $1,800], ($1,800 – $2,000]
 ($2,000 – $5,000]: 3 sub-intervals — ($2,000 – $3,000], ($3,000 – $4,000], ($4,000 – $5,000]
Concept Hierarchy Generation for Categorical Data
 Specification of a partial ordering of attributes explicitly at the schema level by users or experts
 User / expert defines the hierarchy
 street < city < state < country
 Specification of a portion of a hierarchy by explicit data grouping
 Manual
 Intermediate-level information is specified
 e.g., {industrial, agricultural, …}
Concept Hierarchy Generation for Categorical Data (contd.)
 Specification of a set of attributes but not their partial ordering
 Automatically infer the hierarchy
 Heuristic rule: high-level concepts contain a smaller number of distinct values
 Specification of only a partial set of attributes
 Embed data semantics
 Attributes with tight semantic connections are pinned together
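The distinct-value heuristic above is easy to sketch: count distinct values per attribute and order attributes from fewest (top of the hierarchy) to most (bottom). The location records below are illustrative:

```python
# Sketch of the heuristic for inferring a concept hierarchy: attributes
# with fewer distinct values are placed at higher levels. Toy data.
def infer_hierarchy(records, attributes):
    distinct = {a: len({r[a] for r in records}) for a in attributes}
    # Ascending distinct count = highest hierarchy level first.
    return sorted(attributes, key=lambda a: distinct[a])

records = [
    {"street": "5th Ave", "city": "New York", "state": "NY", "country": "USA"},
    {"street": "Main St", "city": "Boston",   "state": "MA", "country": "USA"},
    {"street": "Elm St",  "city": "Boston",   "state": "MA", "country": "USA"},
    {"street": "Oak St",  "city": "Chicago",  "state": "IL", "country": "USA"},
    {"street": "Pine St", "city": "Albany",   "state": "NY", "country": "USA"},
]
print(infer_hierarchy(records, ["street", "city", "state", "country"]))
# -> ['country', 'state', 'city', 'street']
```

Note the heuristic can fail, e.g., for an attribute like weekday (7 values) versus month (12 values), which is why the slides present it only as a rule of thumb.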