PRESENTED BY
R.RAMADEVI
I M.Sc. (CS & IT)
NADAR SARASWATHI COLLEGE OF ARTS & SCIENCE
THENI.
DATA GENERALIZATION AND SUMMARIZATION-BASED
CHARACTERIZATION
DATA GENERALIZATION AND SUMMARIZATION-BASED CHARACTERIZATION
• Data and objects in databases often contain detailed information at primitive
concept levels.
FOR EXAMPLE:
The item relation in a sales database may contain attributes describing
low-level item information such as item_id, name, brand, category, supplier,
place_made, and price.
• Summarizing such data at higher concept levels requires an important
functionality in data mining: data generalization.
DATA GENERALIZATION
• Data generalization is a process that abstracts a large set of task-relevant
data in a database from a relatively low conceptual level to higher
conceptual levels.
• Approaches to generalizing large data sets fall into two categories:
(1) The data cube (or OLAP) approach
(2) The attribute-oriented induction approach
ATTRIBUTE-ORIENTED INDUCTION
• The attribute-oriented induction (AOI) approach to data generalization and
summarization-based characterization was first proposed in 1989.
• The data cube approach can be considered a data warehouse-based,
precomputation-oriented, materialized-view approach.
• It performs off-line aggregation before an OLAP or data mining query is
submitted for processing.
• The attribute-oriented induction approach, by contrast, is a relational
database query-oriented, generalization-based, on-line data analysis technique.
ATTRIBUTE-ORIENTED INDUCTION
• The two approaches are not strictly separated by on-line versus off-line
computation: some aggregations in the data cube can be computed on-line,
• while off-line precomputation of the multidimensional space can speed up
attribute-oriented induction as well.
• The general idea of attribute-oriented induction is to first collect the
task-relevant data using a relational database query and then perform
generalization based on an examination of the number of distinct values of
each attribute in the relevant set of data.
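The "examination of the number of distinct values" step can be sketched in a few lines of Python. This is only an illustration: the helper name `distinct_value_counts` and the three sample tuples are assumptions made up for the example, not part of the slides.

```python
from collections import defaultdict

def distinct_value_counts(rows):
    """Return {attribute: number of distinct values} for a list of dict rows."""
    seen = defaultdict(set)
    for row in rows:
        for attribute, value in row.items():
            seen[attribute].add(value)
    return {attribute: len(values) for attribute, values in seen.items()}

# Hypothetical task-relevant tuples (invented for illustration).
students = [
    {"name": "Jim", "gender": "M", "major": "CS"},
    {"name": "Scott", "gender": "M", "major": "CS"},
    {"name": "Laura", "gender": "F", "major": "Physics"},
]

counts = distinct_value_counts(students)
print(counts)
```

Attributes with many distinct values (here `name`) are the candidates for removal or generalization; attributes with few (here `gender`) can be kept as they are.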
EXAMPLE
Specifying a data mining query for
characterization with DMQL:
• Suppose that a user would like to describe the general characteristics of
graduate students in the Big_University database.
• The attributes of interest are name, gender, major, birth_place,
birth_date, phone_no, and gpa.
use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_place, birth_date, phone_no, gpa
from student
where status in "graduate"
TRANSFORMING A DATA MINING QUERY TO A
RELATIONAL QUERY
• The transformed query is executed against the relational database
Big_University_DB and returns the task-relevant data.
• The resulting table is the initial working relation, on which induction
will be performed.
use Big_University_DB
select name, gender, major, birth_place, birth_date, phone_no, gpa
from student
where status in ('M.Sc', 'M.A', 'M.B.A', 'Ph.D')
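As a rough, runnable illustration of this step, the relational query can be tried against an in-memory SQLite database using Python's standard `sqlite3` module. The table layout, the column name `phone_no`, and the three sample rows are assumptions invented for the sketch; only the selected attributes and the list of graduate statuses come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (name TEXT, gender TEXT, major TEXT, "
    "birth_place TEXT, birth_date TEXT, phone_no TEXT, gpa REAL, status TEXT)"
)
# Hypothetical rows; the last one is an undergraduate and should be filtered out.
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("A", "F", "Physics", "Seattle", "1976-07-25", "555-0101", 3.8, "M.Sc"),
        ("B", "M", "CS", "Vancouver", "1979-08-12", "555-0102", 3.6, "Ph.D"),
        ("C", "M", "History", "Montreal", "1981-03-01", "555-0103", 3.2, "B.A"),
    ],
)

# "graduate" is expanded into the concrete status values it covers.
graduate_statuses = ("M.Sc", "M.A", "M.B.A", "Ph.D")
placeholders = ", ".join("?" for _ in graduate_statuses)
query = (
    "SELECT name, gender, major, birth_place, birth_date, phone_no, gpa "
    f"FROM student WHERE status IN ({placeholders})"
)

# The result plays the role of the initial working relation.
working_relation = conn.execute(query, graduate_statuses).fetchall()
for row in working_relation:
    print(row)
```

Binding the status values as parameters, rather than pasting them into the SQL string, is a design choice of the sketch and not something the slides prescribe.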
DATA GENERALIZATION: TWO TYPES
ATTRIBUTE REMOVAL:
• If there is a large set of distinct values for an attribute of the initial
working relation, the attribute should be removed when either
(1) there is no generalization operator on the attribute, or
(2) its higher-level concepts are expressed in terms of other attributes.
ATTRIBUTE GENERALIZATION
• If there is a large set of distinct values for an attribute in the initial
working relation and there exists a set of generalization operators on the
attribute, a generalization operator should be selected and applied to it.
• This corresponds to the generalization rule known as climbing generalization
trees in learning from examples, or concept tree ascension.
How far an attribute is generalized can be controlled by two techniques:
First technique: attribute generalization threshold control
Second technique: generalized relation threshold control
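The removal and generalization rules, combined with an attribute threshold, might be expressed as a small decision function. The sketch below is one possible reading: the function name, the threshold value of 5, and the distinct-value counts are all assumptions chosen for illustration.

```python
def generalization_action(distinct_count, threshold, has_generalization_operator,
                          expressed_by_other_attributes=False):
    """Choose what to do with one attribute of the initial working relation."""
    if distinct_count <= threshold:
        # Few enough distinct values: keep the attribute as it is.
        return "retain"
    if not has_generalization_operator or expressed_by_other_attributes:
        # Attribute removal: cases (1) and (2) of the removal rule.
        return "remove"
    # Attribute generalization: climb the concept hierarchy.
    return "generalize"

ATTRIBUTE_THRESHOLD = 5  # assumed value; the slides do not give one

actions = {
    "name": generalization_action(2000, ATTRIBUTE_THRESHOLD, False),
    "gender": generalization_action(2, ATTRIBUTE_THRESHOLD, True),
    "major": generalization_action(40, ATTRIBUTE_THRESHOLD, True),
}
print(actions)
```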
ATTRIBUTE-ORIENTED INDUCTION
For each attribute of the relation, the generalization proceeds as follows:
1. name: There is a large number of distinct values for name and no
generalization operation is defined on it, so the attribute is removed.
2. gender: There are only two distinct values, so the attribute is retained.
3. major: Suppose that a concept hierarchy has been defined that allows the
attribute major to be generalized to the values {arts_science, business}.
4. birth_place: The attribute has a large number of distinct values. Suppose a
concept hierarchy for birth_place is defined as
city < province_or_state < country, so the attribute can be generalized.
ATTRIBUTE-ORIENTED INDUCTION
5. birth_date: Suppose that a hierarchy exists that can generalize birth_date
to age, and age to age_range.
6. residence: The number of distinct values for number and street will likely
be very high, so these low-level components are removed.
7. phone: The attribute contains too many distinct values and is therefore
removed in generalization.
8. gpa: Suppose that a concept hierarchy exists for gpa that groups grade point
average values into numerical intervals like {3.75-4.0, 3.5-3.75, ...}.
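A minimal sketch of how such concept hierarchies could be applied to a single tuple is shown below. Everything concrete in it is an assumption made for illustration: the major-to-field mapping, the five-year age ranges, the quarter-point GPA buckets, the reference date, and the sample student.

```python
import math
from datetime import date

# Hypothetical concept hierarchy for major (the mapping is invented).
MAJOR_TO_FIELD = {"CS": "arts_science", "Physics": "arts_science", "Finance": "business"}

def age_range(birth_date, today, width=5):
    """Generalize birth_date -> age -> age_range (ranges of `width` years)."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def gpa_interval(gpa):
    """Group a GPA on a 4.0 scale into quarter-point intervals."""
    bucket = min(math.floor(gpa * 4), 15)  # a GPA of 4.0 falls in the top interval
    low = bucket / 4
    return f"{low:.2f}-{low + 0.25:.2f}"

student = {"major": "Physics", "birth_date": date(1976, 7, 25), "gpa": 3.8}
generalized = {
    "major": MAJOR_TO_FIELD[student["major"]],
    "age_range": age_range(student["birth_date"], today=date(2000, 1, 1)),
    "gpa": gpa_interval(student["gpa"]),
}
print(generalized)
```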
EFFICIENT IMPLEMENTATION OF
ATTRIBUTE-ORIENTED INDUCTION
Algorithm: attribute_oriented_induction. Mining generalized
characteristics in a relational database given a user's data mining request.
INPUT: (i) DB, a relational database
(ii) DMQuery, a data mining query
(iii) a_list, a list of attributes a_i
(iv) Gen(a_i), a set of concept hierarchies or generalization operators on
attributes a_i
(v) a_gen_thresh(a_i), attribute generalization thresholds for each a_i
OUTPUT & METHOD
Output: P, a prime_generalized_relation
Method: The method is outlined as follows:
1. W ← get_task_relevant_data(DMQuery, DB); the working relation W holds
the task-relevant data.
2. prepare_for_generalization(W):
(a) Scan W and collect the distinct values for each attribute a_i.
(b) For each attribute a_i, determine whether a_i should be removed and, if
not, compute its minimum desired level L_i.
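P ← GENERALIZATION(W)
• Step 3: The prime_generalized_relation P is derived by replacing each value v
in W with its corresponding generalized value while accumulating count and
computing any other aggregate values. This can be implemented in two ways:
(a) For each generalized tuple, insert the tuple into a sorted prime relation P
by a binary search.
(b) Alternatively, since in most cases the number of distinct values at the
prime relation level is small, P can be stored as an array indexed by the
generalized attribute values.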
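The generalize-and-merge step can be sketched as follows. Note that this sketch deliberately swaps in a hash table (`collections.Counter`) for the sorted relation with binary search described on the slide; the function name, the `generalizers` mapping, and the sample rows are assumptions made for the example.

```python
from collections import Counter

def generalize_relation(rows, generalizers):
    """Build a prime generalized relation as {generalized_tuple: count}.

    `generalizers` maps each retained attribute to a function that lifts a
    value to its generalized value; attributes not listed are removed.
    """
    prime = Counter()
    for row in rows:
        generalized = tuple(
            generalize(row[attribute]) for attribute, generalize in generalizers.items()
        )
        prime[generalized] += 1  # identical generalized tuples are merged
    return prime

# Hypothetical working relation and concept hierarchy.
FIELD = {"CS": "science", "Physics": "science", "Finance": "business"}
working_relation = [
    {"name": "A", "gender": "M", "major": "CS"},
    {"name": "B", "gender": "M", "major": "Physics"},
    {"name": "C", "gender": "F", "major": "Finance"},
]
generalizers = {
    "gender": lambda value: value,  # retained unchanged
    "major": FIELD.__getitem__,     # generalized one level up
}                                   # name is absent, so it is removed

prime_relation = generalize_relation(working_relation, generalizers)
print(dict(prime_relation))
```

Other aggregates (for example a running sum) could be accumulated alongside the count in the same loop.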
PRESENTATION OF THE DERIVED
GENERALIZATION
• Attribute-oriented induction generates one or a set of generalized
descriptions, which can be presented as a generalized relation:
location         item        sales   count
Asia             TV             15     300
Europe           TV             12     250
North America    TV             28     450
Asia             computer      120    1000
A CROSSTAB FOR THE SALES IN 1999
location \ item    TV              computer        both_items
                   sales   count   sales   count   sales   count
Asia                  15     300     120    1000     135    1300
Europe                12     250     150    1200     162    1450
all_regions           55    1000     470    4000     525    5000
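A crosstab of this kind can be derived from a generalized relation by summing each tuple into its own cell and into the marginal cells. The sketch below uses only the four tuples listed in the generalized relation on the previous slide, so its totals cover just those rows; the function name and the labels `all_regions` and `both_items` are naming choices for the example.

```python
from collections import defaultdict

# (location, item, sales, count) tuples from the generalized relation.
relation = [
    ("Asia", "TV", 15, 300),
    ("Europe", "TV", 12, 250),
    ("North America", "TV", 28, 450),
    ("Asia", "computer", 120, 1000),
]

def crosstab(rows):
    """Return table[location][item] = [sales, count], including margins."""
    table = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for location, item, sales, count in rows:
        # Each tuple contributes to its own cell and to three marginal cells.
        for row_key in (location, "all_regions"):
            for column_key in (item, "both_items"):
                cell = table[row_key][column_key]
                cell[0] += sales
                cell[1] += count
    return table

table = crosstab(relation)
print(table["Asia"]["both_items"])   # sales and count for Asia over both items
print(table["all_regions"]["TV"])    # sales and count for TV over all regions
```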
The t-weight is an interestingness measure that describes
the typicality of each disjunct in the rule.
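T-WEIGHT
• The t-weight for q_a is the percentage of tuples of the
target class from the initial working relation that are
covered by q_a:
t_weight = count(q_a) / (count(q_1) + ... + count(q_n))
where q_1, ..., q_n are the generalized tuples of the target class and q_a is
one of them.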
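As a small worked example, the t-weights can be computed for the TV tuples of the generalized relation, reading the denominator as the total count over all generalized tuples of the target class. The function name `t_weights` is an assumption; the counts (300, 250, 450) are the TV counts from the relation shown earlier.

```python
def t_weights(counts):
    """Return {tuple_label: t-weight} where t-weight = count / total count."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Counts of the TV tuples in the generalized relation.
tv_counts = {"Asia": 300, "Europe": 250, "North America": 450}

weights = t_weights(tv_counts)
for location, weight in weights.items():
    print(f"{location}: {weight:.2%}")
```

The weights sum to 1, so each can be read directly as the share of the target class covered by that disjunct.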
BAR CHART REPRESENTATION
[Bar chart with bars for TV, computer, and TV + computer; vertical axis from 0 to 200]
PIE CHART REPRESENTATION
TV sales: North America (50.91%), Asia (27.27%), Europe (21.82%)
Computer sales: North America (42%), Asia (25%), Europe (31%)
THANK YOU!!!
