data generalization and summarization

PRESENTED BY
R.RAMADEVI
I M.Sc. (CS & IT)
NADAR SARASWATHI COLLEGE OF ARTS & SCIENCE
THENI.
DATA GENERALIZATION AND SUMMARIZATION-BASED
CHARACTERIZATION

DATA GENERALIZATION AND SUMMARIZATION-BASED CHARACTERIZATION
• Data and objects in databases often contain detailed information at primitive concept levels.
FOR EXAMPLE:
The item relation in a sales database may contain attributes describing low-level item information such as item_id, name, brand, category, supplier, place_made, and price.
• Summarizing such data and presenting it at a higher conceptual level requires an important functionality in data mining: data generalization.

DATA GENERALIZATION
• Data generalization is a process that abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels.
• Methods for generalizing large data sets fall into two approaches:
(1) The data cube (or OLAP) approach
(2) The attribute-oriented induction approach

ATTRIBUTE-ORIENTED INDUCTION
• The attribute-oriented induction (AOI) approach to data generalization and summarization-based characterization was first proposed in 1989.
• The data cube approach can be considered a data warehouse-based, precomputation-oriented, materialized-view approach.
• It performs off-line aggregation before an OLAP or data mining query is submitted for processing.
• The attribute-oriented induction approach, in contrast, is a relational database query-oriented, generalization-based, on-line data analysis technique.

ATTRIBUTE-ORIENTED INDUCTION
• The two approaches are not strictly separated: some aggregations in the data cube can be computed on-line, while off-line precomputation of multidimensional space can speed up attribute-oriented induction as well.
• The general idea of attribute-oriented induction is to first collect the task-relevant data using a relational database query and then perform generalization based on the number of distinct values of each attribute in the relevant set of data.
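As a rough illustration of the second step, the sketch below counts the distinct values of each attribute in a small task-relevant relation. The helper name distinct_value_counts and the sample tuples are assumptions made for this example and do not come from the slides.

```python
from collections import defaultdict


def distinct_value_counts(rows):
    """Return {attribute: number of distinct values} for a list of dict rows."""
    seen = defaultdict(set)
    for row in rows:
        for attr, value in row.items():
            seen[attr].add(value)
    return {attr: len(values) for attr, values in seen.items()}


# Hypothetical task-relevant tuples (illustrative values only).
students = [
    {"name": "A. Lee", "gender": "F", "major": "physics"},
    {"name": "B. Roy", "gender": "M", "major": "physics"},
    {"name": "C. Kim", "gender": "F", "major": "history"},
]

counts = distinct_value_counts(students)
print(counts)
```

Attributes with many distinct values (here, name) are the candidates for removal or generalization; attributes with few (here, gender) can usually be kept as they are.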

EXAMPLE
Specifying a data mining query for characterization with DMQL:
• Suppose that a user would like to describe the general characteristics of graduate students in the Big_University database.
• The attributes of interest are name, gender, major, birth_place, birth_date, phone_no, and gpa.
use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_place, birth_date, phone_no, gpa
from student
where status in "graduate"

TRANSFORMING A DATA MINING QUERY TO A RELATIONAL QUERY
• The transformed query is executed against the relational database Big_University_DB and returns the task-relevant data.
• The resulting table is the initial working relation, on which induction will be performed.
• In the transformed query, the high-level value "graduate" is replaced by the corresponding low-level status values.
use Big_University_DB
select name, gender, major, birth_place, birth_date, phone_no, gpa
from student
where status in ('M.Sc', 'M.A', 'M.B.A', 'Ph.D')
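A minimal sketch of running such a transformed query, assuming Python's built-in sqlite3 module and a toy student table. The rows and the reduced column list are invented for illustration; only the status values come from the query above.

```python
import sqlite3

# Toy in-memory database standing in for Big_University_DB (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (name TEXT, gender TEXT, major TEXT, status TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?, ?)",
    [
        ("A. Lee", "F", "physics", "M.Sc", 3.8),
        ("B. Roy", "M", "history", "B.A", 3.1),
        ("C. Kim", "F", "finance", "M.B.A", 3.6),
    ],
)

# The high-level value "graduate" is expanded into its low-level status values.
GRADUATE_STATUSES = ("M.Sc", "M.A", "M.B.A", "Ph.D")
placeholders = ", ".join("?" for _ in GRADUATE_STATUSES)
query = (
    "SELECT name, gender, major, gpa FROM student "
    f"WHERE status IN ({placeholders}) ORDER BY name"
)

# The result plays the role of the initial working relation.
working_relation = conn.execute(query, GRADUATE_STATUSES).fetchall()
print(working_relation)
```

Only the two graduate rows survive the filter; the undergraduate row is excluded before any induction takes place.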

DATA GENERALIZATION: TWO TYPES
ATTRIBUTE REMOVAL:
• An attribute is removed if it has a large set of distinct values in the initial working relation, but either
(1) there is no generalization operator on the attribute, or
(2) its higher-level concepts are expressed in terms of other attributes.

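ATTRIBUTE GENERALIZATION
• If there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute.
• This corresponds to the generalization rule known as climbing generalization trees in learning from examples, or concept tree ascension.
Two techniques control how far generalization proceeds:
First technique: attribute generalization threshold control, which limits the number of distinct values allowed per attribute.
Second technique: generalized relation threshold control, which limits the number of tuples in the generalized relation.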
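The removal and generalization rules can be read as one decision per attribute. The sketch below is one possible encoding, assuming a single numeric threshold per attribute; the function name decide, its parameters, and the threshold value are hypothetical.

```python
def decide(distinct_count, threshold, has_generalization_operator,
           expressed_by_other_attributes=False):
    """Return 'retain', 'generalize', or 'remove' for one attribute."""
    if distinct_count <= threshold:
        return "retain"      # few enough distinct values already
    if not has_generalization_operator or expressed_by_other_attributes:
        return "remove"      # attribute removal rule
    return "generalize"      # attribute generalization rule


THRESHOLD = 5  # hypothetical attribute generalization threshold

name_decision = decide(1000, THRESHOLD, has_generalization_operator=False)
gender_decision = decide(2, THRESHOLD, has_generalization_operator=False)
major_decision = decide(40, THRESHOLD, has_generalization_operator=True)
print(name_decision, gender_decision, major_decision)
```

The distinct-value counts passed in here are made up; in practice they would come from a scan of the initial working relation.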

ATTRIBUTE-ORIENTED INDUCTION
For each attribute of the relation, the generalization proceeds as follows:
1. name: There are a large number of distinct values for name and no generalization operation is defined on it, so this attribute is removed.
2. gender: There are only two distinct values, so the attribute is retained and no generalization is performed on it.
3. major: Suppose that a concept hierarchy has been defined that allows the attribute major to be generalized to the values {arts_science, business}.
4. birth_place: This attribute has a large number of distinct values, so it should be generalized. Suppose that a concept hierarchy exists for birth_place, defined as city < province_or_state < country.
ATTRIBUTE-ORIENTED INDUCTION
5. birth_date: Suppose that a hierarchy exists that can generalize birth_date to age, and age to age_range.
6. residence: The number of distinct values for number and street will likely be very high, so these low-level components should be removed.
7. phone_no: This attribute contains too many distinct values and should therefore be removed in generalization.
8. gpa: Suppose that a concept hierarchy exists for gpa that groups grade point average values into numerical intervals like {3.75-4.0, 3.5-3.75, ...}.
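The per-attribute steps above can be sketched as a function that generalizes one student tuple. Everything beyond the attribute names is an assumption for illustration: the leaf values in the major hierarchy, the 5-year age buckets, the "below 3.5" interval, the reference date, and the sample tuple are all invented.

```python
from datetime import date

# Hypothetical concept hierarchy for major; the leaf values are invented.
MAJOR_HIERARCHY = {"physics": "arts_science", "history": "arts_science",
                   "finance": "business"}


def age_range(birth_date, today):
    """birth_date -> age -> age_range, using assumed 5-year buckets."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    low = (age // 5) * 5
    return f"{low}-{low + 4}"


def gpa_range(gpa):
    """Map a gpa to an interval; only the top two intervals come from the slide."""
    if gpa >= 3.75:
        return "3.75-4.0"
    if gpa >= 3.5:
        return "3.5-3.75"
    return "below 3.5"


def generalize_student(row, today):
    # name and phone_no are dropped; gender is kept unchanged.
    return {
        "gender": row["gender"],
        "major": MAJOR_HIERARCHY[row["major"]],
        "age_range": age_range(row["birth_date"], today),
        "gpa": gpa_range(row["gpa"]),
    }


student = {"name": "A. Lee", "gender": "F", "major": "physics",
           "birth_date": date(1975, 6, 15), "phone_no": "555-0100", "gpa": 3.8}
generalized = generalize_student(student, today=date(1999, 1, 1))
print(generalized)
```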

EFFICIENT IMPLEMENTATION OF ATTRIBUTE-ORIENTED INDUCTION
Algorithm: Attribute_oriented_induction. Mining generalized characteristics in a relational database given a user's data mining request.
INPUT: (i) DB, a relational database
(ii) DMQuery, a data mining query
(iii) a_list, a list of attributes a_i
(iv) Gen(a_i), a set of concept hierarchies or generalization operators on attributes a_i
(v) a_gen_thresh(a_i), attribute generalization thresholds for each a_i

OUTPUT & METHOD
Output: P, a prime_generalized_relation
Method: the method is outlined as follows
1. W ← get_task_relevant_data(DMQuery, DB); the working relation W holds the task-relevant data
2. prepare_for_generalization(W)
(a) Scan W and collect the distinct values for each attribute a_i
(b) For each attribute a_i, determine whether a_i should be removed and, if not, compute its minimum desired level L_i based on its attribute threshold

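3. P ← GENERALIZATION(W)
• The prime_generalized_relation P is derived by replacing each value v in W by its corresponding generalized value, while accumulating count and computing any other aggregate values.
This step can be implemented in either of two ways:
(a) For each generalized tuple, insert the tuple into a sorted prime relation P by a binary search; if the tuple is already in P, increase its count and other aggregate values instead.
(b) Since in most cases the number of distinct values at the prime relation level is small, the prime relation can be coded as an m-dimensional array, where m is the number of attributes in P.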
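Variation (a) can be sketched with Python's bisect module, which performs the binary search over a sorted list. This is a simplified reading of the step, not the algorithm's reference implementation: the function name generalize, the mappings argument, and the sample data are assumptions, and only count is accumulated.

```python
import bisect


def generalize(working_relation, mappings):
    """Build a sorted prime relation as a list of (generalized_tuple, count).

    mappings: {attribute: function mapping a value v to its generalized value}.
    Attributes missing from mappings are treated as removed.
    """
    attrs = list(mappings)
    keys = []    # sorted generalized tuples
    counts = []  # counts[i] is the count for keys[i]
    for row in working_relation:
        g = tuple(mappings[a](row[a]) for a in attrs)
        i = bisect.bisect_left(keys, g)      # binary search for g
        if i < len(keys) and keys[i] == g:
            counts[i] += 1                   # tuple already in P
        else:
            keys.insert(i, g)                # new tuple, keep P sorted
            counts.insert(i, 1)
    return list(zip(keys, counts))


# Hypothetical working relation and concept hierarchy.
W = [
    {"name": "A", "gender": "F", "major": "physics"},
    {"name": "B", "gender": "M", "major": "physics"},
    {"name": "C", "gender": "F", "major": "history"},
    {"name": "D", "gender": "F", "major": "finance"},
]
major_hierarchy = {"physics": "arts_science", "history": "arts_science",
                   "finance": "business"}
mappings = {"gender": lambda v: v, "major": major_hierarchy.__getitem__}

P = generalize(W, mappings)
print(P)
```

Because name has no entry in mappings, it is dropped, and the two female arts_science students merge into a single generalized tuple with count 2.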

PRESENTATION OF THE DERIVED GENERALIZATION
• Attribute-oriented induction generates one or a set of generalized descriptions, which can be presented as a generalized relation:
Location       Item      Sales  Count
Asia           TV           15    300
Europe         TV           12    250
North America  TV           28    450
Asia           computer    120   1000

A CROSSTAB FOR THE SALES IN 1999
Location \ Item       TV         Computer     Both items
                  Sales  Count  Sales  Count  Sales  Count
Asia                 15    300    120   1000    135   1300
Europe               12    250    150   1200    162   1450
North America        28    450    200   1800    228   2250
All regions          55   1000    470   4000    525   5000
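A small sketch of how the "both items" column of such a crosstab can be derived from generalized tuples. It uses only the Asia and Europe figures given in the table; the variable names and the dictionary layout are assumptions for this example.

```python
# (location, item, sales, count) tuples taken from the Asia and Europe rows.
rows = [
    ("Asia", "TV", 15, 300),
    ("Asia", "computer", 120, 1000),
    ("Europe", "TV", 12, 250),
    ("Europe", "computer", 150, 1200),
]

crosstab = {}
for location, item, sales, count in rows:
    cells = crosstab.setdefault(location, {})
    cells[item] = (sales, count)
    # Accumulate the per-location totals across items.
    total_sales, total_count = cells.get("both_items", (0, 0))
    cells["both_items"] = (total_sales + sales, total_count + count)

for location, cells in crosstab.items():
    print(location, cells)
```

Summing the Asia row this way gives a count of 1300 for both items, which is the sum of the two per-item counts.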
The t-weight is an interestingness measure that describes the typicality of each disjunct in the rule.

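T-WEIGHT
• The t-weight for q_a is the percentage of tuples of the target class from the initial working relation that are covered by q_a:
t_weight = count(q_a) / Σ_{i=1..n} count(q_i)
where q_1, ..., q_n are the generalized tuples (disjuncts) describing the target class and q_a is one of them.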
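As a worked instance of the formula, the sketch below computes t-weights for the three TV tuples, using the counts 300, 250, and 450 from the generalized relation. Treating TV as the target class and the function name t_weights are choices made for this example.

```python
def t_weights(counts):
    """Return {disjunct: count / total count over all disjuncts}."""
    total = sum(counts.values())
    return {disjunct: c / total for disjunct, c in counts.items()}


# Counts of the TV tuples per location, taken from the generalized relation.
tv_counts = {"Asia": 300, "Europe": 250, "North America": 450}

weights = t_weights(tv_counts)
for location, w in weights.items():
    print(f"{location}: {w:.2%}")
```

The weights always sum to 1, so each one can be read directly as the share of the target class covered by that disjunct.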

BAR CHART REPRESENTATION
[Figure: bar chart with bars for TV, computer, and TV + computer; vertical axis from 0 to 200]

PIE CHART REPRESENTATION
[Figure: two pie charts showing each region's share of sales]
TV sales: Asia (27.27%), Europe (21.82%), North America (50.91%)
Computer sales: Asia (25.53%), Europe (31.91%), North America (42.55%)
