Data Mining: 
Concepts and Techniques 
— Slides for Textbook — 
— Chapter 2 — 
October 8, 2014 Data Mining: Concepts and Techniques 1
Chapter 2: Data Warehousing and 
OLAP Technology for Data Mining 
 What is a data warehouse? 
 A multi-dimensional data model 
 Data warehouse architecture 
 Data warehouse implementation 
 Further development of data cube technology 
 From data warehousing to data mining 
What is a Data Warehouse? 
 Defined in many different ways, but not rigorously. 
 A decision support database that is maintained separately from 
the organization’s operational database 
 Supports information processing by providing a solid platform of 
consolidated, historical data for analysis. 
 “A data warehouse is a subject-oriented, integrated, time-variant, 
and nonvolatile collection of data in support of management’s 
decision-making process.”—W. H. Inmon 
 Data warehousing: 
 The process of constructing and using data warehouses 
Data Warehouse—Subject-Oriented 
 Organized around major subjects, such as customer, 
product, sales. 
 Focusing on the modeling and analysis of data for decision 
makers, not on daily operations or transaction processing. 
 Provide a simple and concise view around particular 
subject issues by excluding data that are not useful in the 
decision support process. 
Data Warehouse—Integrated 
 Constructed by integrating multiple, heterogeneous data 
sources 
 relational databases, flat files, on-line transaction 
records 
 Data cleaning and data integration techniques are 
applied. 
 Ensure consistency in naming conventions, encoding 
structures, attribute measures, etc. among different 
data sources 
 E.g., Hotel price: currency, tax, breakfast covered, etc. 
 When data is moved to the warehouse, it is 
converted. 
Data Warehouse—Time Variant 
 The time horizon for the data warehouse is significantly 
longer than that of operational systems. 
 Operational database: current value data. 
 Data warehouse data: provide information from a 
historical perspective (e.g., past 5-10 years) 
 Every key structure in the data warehouse 
 Contains an element of time, explicitly or implicitly 
 But the key of operational data may or may not contain 
“time element”. 
Data Warehouse—Non-Volatile 
 A physically separate store of data transformed from the 
operational environment. 
 Operational update of data does not occur in the data 
warehouse environment. 
 Does not require transaction processing, recovery, 
and concurrency control mechanisms 
 Requires only two operations in data accessing: 
 initial loading of data and access of data. 
Data Warehouse vs. Heterogeneous DBMS 
 Traditional heterogeneous DB integration: 
 Build wrappers/mediators on top of heterogeneous databases 
 Query driven approach 
 When a query is posed to a client site, a meta-dictionary is 
used to translate the query into queries appropriate for 
individual heterogeneous sites involved, and the results are 
integrated into a global answer set 
 Requires complex information filtering and competes with local sites for processing resources 
 Data warehouse: update-driven, high performance 
 Information from heterogeneous sources is integrated in advance 
and stored in warehouses for direct query and analysis 
Data Warehouse vs. Operational DBMS 
 OLTP (on-line transaction processing) 
 Major task of traditional relational DBMS 
 Day-to-day operations: purchasing, inventory, banking, 
manufacturing, payroll, registration, accounting, etc. 
 OLAP (on-line analytical processing) 
 Major task of data warehouse system 
 Data analysis and decision making 
 Distinct features (OLTP vs. OLAP): 
 User and system orientation: customer vs. market 
 Data contents: current, detailed vs. historical, consolidated 
 Database design: ER + application vs. star + subject 
 View: current, local vs. evolutionary, integrated 
 Access patterns: update vs. read-only but complex queries 
OLTP vs. OLAP 
Feature | OLTP | OLAP 
users | clerk, IT professional | knowledge worker 
function | day-to-day operations | decision support 
DB design | application-oriented | subject-oriented 
data | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated 
usage | repetitive | ad-hoc 
access | read/write; index/hash on prim. key | lots of scans 
unit of work | short, simple transaction | complex query 
# records accessed | tens | millions 
# users | thousands | hundreds 
DB size | 100MB-GB | 100GB-TB 
metric | transaction throughput | query throughput, response 
Why Separate Data Warehouse? 
 High performance for both systems 
 DBMS— tuned for OLTP: access methods, indexing, concurrency 
control, recovery 
 Warehouse—tuned for OLAP: complex OLAP queries, 
multidimensional view, consolidation. 
 Different functions and different data: 
 missing data: Decision support requires historical data which 
operational DBs do not typically maintain 
 data consolidation: DS requires consolidation (aggregation, 
summarization) of data from heterogeneous sources 
 data quality: different sources typically use inconsistent data 
representations, codes and formats which have to be reconciled 
From Tables and Spreadsheets to Data 
Cubes 
 A data warehouse is based on a multidimensional data model which 
views data in the form of a data cube 
 A data cube, such as sales, allows data to be modeled and viewed in 
multiple dimensions 
 Dimension tables, such as item (item_name, brand, type), or 
time(day, week, month, quarter, year) 
 Fact table contains measures (such as dollars_sold) and keys to 
each of the related dimension tables 
 In data warehousing literature, an n-D base cube is called a base 
cuboid. The topmost 0-D cuboid, which holds the highest level of 
summarization, is called the apex cuboid. The lattice of cuboids 
forms a data cube. 
Cube: A Lattice of Cuboids 
0-D (apex) cuboid: all 
1-D cuboids: time; item; location; supplier 
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier) 
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier) 
4-D (base) cuboid: (time, item, location, supplier) 
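The lattice for the four dimensions time, item, location, and supplier can be enumerated mechanically. The following is a minimal sketch, not part of the original slides; the helper name `cuboid_lattice` is hypothetical, and it ignores concept hierarchies within each dimension.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every subset of the dimensions by its size.

    Key 0 holds the apex cuboid (no dimensions, i.e. "all");
    key len(dimensions) holds the base cuboid.
    """
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {len(cuboids)}")

# With n dimensions and no hierarchies there are 2**n cuboids in total.
total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
```

For four dimensions this yields 1, 4, 6, 4, and 1 cuboids at levels 0-D through 4-D.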
Conceptual Modeling of Data Warehouses 
 Modeling data warehouses: dimensions & measures 
 Star schema: A fact table in the middle connected to a 
set of dimension tables 
 Snowflake schema: A refinement of star schema 
where some dimensional hierarchy is normalized into a 
set of smaller dimension tables, forming a shape 
similar to snowflake 
 Fact constellations: Multiple fact tables share 
dimension tables, viewed as a collection of stars, 
therefore called galaxy schema or fact constellation 
Example of Star Schema 
Sales Fact Table 
  keys: time_key, item_key, branch_key, location_key 
  measures: units_sold, dollars_sold, avg_sales 
Dimension tables 
  time (time_key, day, day_of_the_week, month, quarter, year) 
  item (item_key, item_name, brand, type, supplier_type) 
  branch (branch_key, branch_name, branch_type) 
  location (location_key, street, city, state_or_province, country) 
Example of Snowflake Schema 
Sales Fact Table 
  keys: time_key, item_key, branch_key, location_key 
  measures: units_sold, dollars_sold, avg_sales 
Dimension tables 
  time (time_key, day, day_of_the_week, month, quarter, year) 
  item (item_key, item_name, brand, type, supplier_key) 
    supplier (supplier_key, supplier_type) 
  branch (branch_key, branch_name, branch_type) 
  location (location_key, street, city_key) 
    city (city_key, city, state_or_province, country) 
Example of Fact Constellation 
Sales Fact Table 
  keys: time_key, item_key, branch_key, location_key 
  measures: units_sold, dollars_sold, avg_sales 
Shipping Fact Table 
  keys: time_key, item_key, shipper_key, from_location, to_location 
  measures: dollars_cost, units_shipped 
Dimension tables 
  time (time_key, day, day_of_the_week, month, quarter, year) 
  item (item_key, item_name, brand, type, supplier_type) 
  branch (branch_key, branch_name, branch_type) 
  location (location_key, street, city, province_or_state, country) 
  shipper (shipper_key, shipper_name, location_key, shipper_type) 
A Data Mining Query Language: DMQL 
 Cube Definition (Fact Table) 
define cube <cube_name> [<dimension_list>]: 
<measure_list> 
 Dimension Definition ( Dimension Table ) 
define dimension <dimension_name> as 
(<attribute_or_subdimension_list>) 
 Special Case (Shared Dimension Tables) 
 First time as “cube definition” 
 define dimension <dimension_name> as 
<dimension_name_first_time> in cube 
<cube_name_first_time> 
Defining a Star Schema in DMQL 
define cube sales_star [time, item, branch, location]: 
dollars_sold = sum(sales_in_dollars), avg_sales = 
avg(sales_in_dollars), units_sold = count(*) 
define dimension time as (time_key, day, day_of_week, 
month, quarter, year) 
define dimension item as (item_key, item_name, brand, 
type, supplier_type) 
define dimension branch as (branch_key, branch_name, 
branch_type) 
define dimension location as (location_key, street, city, 
province_or_state, country) 
Defining a Snowflake Schema in DMQL 
define cube sales_snowflake [time, item, branch, location]: 
dollars_sold = sum(sales_in_dollars), avg_sales = 
avg(sales_in_dollars), units_sold = count(*) 
define dimension time as (time_key, day, day_of_week, month, quarter, 
year) 
define dimension item as (item_key, item_name, brand, type, 
supplier(supplier_key, supplier_type)) 
define dimension branch as (branch_key, branch_name, branch_type) 
define dimension location as (location_key, street, city(city_key, 
province_or_state, country)) 
Defining a Fact Constellation in DMQL 
define cube sales [time, item, branch, location]: 
dollars_sold = sum(sales_in_dollars), avg_sales = 
avg(sales_in_dollars), units_sold = count(*) 
define dimension time as (time_key, day, day_of_week, month, quarter, year) 
define dimension item as (item_key, item_name, brand, type, supplier_type) 
define dimension branch as (branch_key, branch_name, branch_type) 
define dimension location as (location_key, street, city, province_or_state, 
country) 
define cube shipping [time, item, shipper, from_location, to_location]: 
dollars_cost = sum(cost_in_dollars), units_shipped = count(*) 
define dimension time as time in cube sales 
define dimension item as item in cube sales 
define dimension shipper as (shipper_key, shipper_name, location as location 
in cube sales, shipper_type) 
define dimension from_location as location in cube sales 
define dimension to_location as location in cube sales 
Measures: Three Categories 
 distributive: if the result derived by applying the function to 
n aggregate values is the same as that derived by applying 
the function on all the data without partitioning. 
 E.g., count(), sum(), min(), max(). 
 algebraic: if it can be computed by an algebraic function 
with M arguments (where M is a bounded integer), each of 
which is obtained by applying a distributive aggregate 
function. 
 E.g., avg(), min_N(), standard_deviation(). 
 holistic: if there is no constant bound on the storage size 
needed to describe a subaggregate. 
 E.g., median(), mode(), rank(). 
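The three categories can be illustrated on a small partitioned data set. This is a minimal sketch with made-up numbers; the helper `partition` and the variable names are hypothetical and do not come from the slides.

```python
from statistics import median


def partition(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


sales = [1, 2, 9, 3, 4, 8, 5, 6, 7]
parts = partition(sales, 3)  # three partitions of three values each

# Distributive: summing the partial sums gives the overall sum.
total = sum(sum(p) for p in parts)

# Algebraic: avg is computed from two distributive sub-aggregates
# per partition, (sum, count), so M = 2 arguments suffice.
sub_aggregates = [(sum(p), len(p)) for p in parts]
average = sum(s for s, _ in sub_aggregates) / sum(c for _, c in sub_aggregates)

# Holistic: no fixed-size sub-aggregate works for median in general.
# Combining the partition medians gives a different answer here.
median_of_medians = median(median(p) for p in parts)
true_median = median(sales)
```

Here the partition medians are 2, 4, and 6, so combining them gives 4, while the median of the full data is 5.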
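A Concept Hierarchy: Dimension (location) 
Levels: office < city < country < region < all 
all: all 
region: Europe, ..., North_America 
country: Germany, ..., Spain (in Europe); Canada, ..., Mexico (in North_America) 
city: Frankfurt, ... (in Germany); Vancouver, ..., Toronto (in Canada) 
office: L. Chan, ..., M. Wind 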
View of Warehouses and Hierarchies 
Specification of hierarchies 
 Schema hierarchy 
day < {month < quarter; week} < year 
 Set_grouping hierarchy 
{1..10} < inexpensive 
Multidimensional Data 
 Sales volume as a function of product, month, 
and region 
[Figure: a cube with axes Product, Region, and Month] 
Dimensions: Product, Location, Time 
Hierarchical summarization paths 
Product: Product < Category < Industry 
Location: Office < City < Country < Region 
Time: Day < {Month < Quarter; Week} < Year 
A Sample Data Cube 
[Figure: a 3-D sales cube. Axes: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), Country (U.S.A, Canada, Mexico, sum). One annotated cell: total annual sales of TV in U.S.A.] 
Cuboids Corresponding to the Cube 
0-D (apex) cuboid: all 
1-D cuboids: product; date; country 
2-D cuboids: (product, date); (product, country); (date, country) 
3-D (base) cuboid: (product, date, country) 
Browsing a Data Cube 
 Visualization 
 OLAP capabilities 
 Interactive manipulation 
Typical OLAP Operations 
 Roll up (drill-up): summarize data 
 by climbing up hierarchy or by dimension reduction 
 Drill down (roll down): reverse of roll-up 
 from higher level summary to lower level summary or detailed 
data, or introducing new dimensions 
 Slice and dice: 
 project and select 
 Pivot (rotate): 
 reorient the cube, visualization, 3D to series of 2D planes. 
 Other operations 
 drill across: involving (across) more than one fact table 
 drill through: through the bottom level of the cube to its back-end 
relational tables (using SQL) 
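The roll-up, slice, and dice operations can be sketched on a tiny in-memory cube. This is an illustrative sketch only: the cell values, the `city_to_country` mapping, and all function names are hypothetical, and a cube is modeled simply as a dictionary from (city, quarter, item) to units sold.

```python
from collections import defaultdict

# Hypothetical base cells: (city, quarter, item) -> units sold.
cells = {
    ("Vancouver", "Q1", "TV"): 10,
    ("Vancouver", "Q2", "TV"): 20,
    ("Toronto", "Q1", "TV"): 30,
    ("Toronto", "Q1", "PC"): 40,
    ("Frankfurt", "Q1", "PC"): 50,
}
city_to_country = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Frankfurt": "Germany",
}


def roll_up_city_to_country(cube):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), units in cube.items():
        out[(city_to_country[city], quarter, item)] += units
    return dict(out)


def drop_dimension(cube, axis):
    """Roll up by dimension reduction: aggregate one axis away."""
    out = defaultdict(int)
    for key, units in cube.items():
        out[key[:axis] + key[axis + 1:]] += units
    return dict(out)


def slice_cube(cube, axis, value):
    """Slice: select the cells with one fixed value on one dimension."""
    return {k: v for k, v in cube.items() if k[axis] == value}


def dice_cube(cube, allowed):
    """Dice: select on several dimensions; `allowed` maps axis -> values."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[axis] in values for axis, values in allowed.items())
    }


by_country = roll_up_city_to_country(cells)
by_city_item = drop_dimension(cells, 1)
q1_only = slice_cube(cells, 1, "Q1")
canadian_tv = dice_cube(cells, {0: {"Vancouver", "Toronto"}, 2: {"TV"}})
```

Drill-down is the inverse of `roll_up_city_to_country` and requires the more detailed cells to still be available, which is why the sketch keeps `cells` unchanged.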
A Star-Net Query Model 
[Figure: radial lines from a central point, one per dimension, with abstraction levels marked along each line. Shipping Method: AIR-EXPRESS, TRUCK. Customer Orders: CONTRACTS, ORDER. Customer. Product: PRODUCT LINE, PRODUCT GROUP, PRODUCT ITEM. Organization: SALES PERSON, DISTRICT, DIVISION. Promotion. Location: CITY, COUNTRY, REGION. Time: ANNUALLY, QTRLY, DAILY.] 
Each circle is called a footprint 
Design of a Data Warehouse: A Business 
Analysis Framework 
 Four views regarding the design of a data warehouse 
 Top-down view 
 allows selection of the relevant information necessary for the 
data warehouse 
 Data source view 
 exposes the information being captured, stored, and 
managed by operational systems 
 Data warehouse view 
 consists of fact tables and dimension tables 
 Business query view 
 sees the perspectives of data in the warehouse from the view 
of end-user 
Data Warehouse Design Process 
 Top-down, bottom-up approaches or a combination of both 
 Top-down: Starts with overall design and planning (mature) 
 Bottom-up: Starts with experiments and prototypes (rapid) 
 From software engineering point of view 
 Waterfall: structured and systematic analysis at each step before 
proceeding to the next 
 Spiral: rapid generation of increasingly functional systems, with short 
turnaround time 
 Typical data warehouse design process 
 Choose a business process to model, e.g., orders, invoices, etc. 
 Choose the grain (atomic level of data) of the business process 
 Choose the dimensions that will apply to each fact table record 
 Choose the measure that will populate each fact table record 
Multi-Tiered Architecture 
[Figure: four tiers, left to right. Data Sources: operational DBs and other sources. Data Storage: the data warehouse, metadata, and data marts, populated by Extract / Transform / Load / Refresh under a Monitor & Integrator. OLAP Engine: the OLAP server, which serves the front end. Front-End Tools: analysis, query, reports, data mining.] 
Three Data Warehouse Models 
 Enterprise warehouse 
 collects all of the information about subjects spanning 
the entire organization 
 Data Mart 
 a subset of corporate-wide data that is of value to a 
specific group of users. Its scope is confined to 
specific, selected groups, such as a marketing data mart 
 Independent vs. dependent (directly from warehouse) data mart 
 Virtual warehouse 
 A set of views over operational databases 
 Only some of the possible summary views may be 
materialized 
Data Warehouse Development: 
A Recommended Approach 
[Figure: starting from "Define a high-level corporate data model", model refinement leads to data marts on one side and an enterprise data warehouse on the other; these grow into distributed data marts and, finally, a multi-tier data warehouse.] 
OLAP Server Architectures 
 Relational OLAP (ROLAP) 
 Use relational or extended-relational DBMS to store and manage 
warehouse data and OLAP middleware to support missing pieces 
 Include optimization of DBMS backend, implementation of 
aggregation navigation logic, and additional tools and services 
 greater scalability 
 Multidimensional OLAP (MOLAP) 
 Array-based multidimensional storage engine (sparse matrix 
techniques) 
 fast indexing to pre-computed summarized data 
 Hybrid OLAP (HOLAP) 
 User flexibility, e.g., low level: relational, high-level: array 
 Specialized SQL servers 
 specialized support for SQL queries over star/snowflake schemas 
Efficient Data Cube Computation 
 Data cube can be viewed as a lattice of cuboids 
 The bottom-most cuboid is the base cuboid 
 The top-most cuboid (apex) contains only one cell 
 How many cuboids are there in an n-dimensional cube where dimension i has L_i levels? 
$$T = \prod_{i=1}^{n} (L_i + 1)$$ 
 Materialization of data cube 
 Materialize every (cuboid) (full materialization), none 
(no materialization), or some (partial materialization) 
 Selection of which cuboids to materialize 
 Based on size, sharing, access frequency, etc. 
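The cuboid count on this slide is easy to evaluate numerically. A minimal sketch follows, assuming each dimension is described only by its number of hierarchy levels; the function name `total_cuboids` and the example level counts are hypothetical.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """T = product over all dimensions of (L_i + 1).

    The extra 1 per dimension accounts for the virtual top level "all",
    i.e. the dimension being aggregated away entirely.
    """
    return prod(levels + 1 for levels in levels_per_dimension)


# Four dimensions with a single level each: 2**4 cuboids.
no_hierarchy = total_cuboids([1, 1, 1, 1])

# Three dimensions with 4, 3, and 4 levels: 5 * 4 * 5 cuboids.
with_hierarchies = total_cuboids([4, 3, 4])
```

The count grows multiplicatively with each added dimension or level, which is the motivation for partial materialization.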
Cube Operation 
 Cube definition and computation in DMQL 
define cube sales[item, city, year]: sum(sales_in_dollars) 
compute cube sales 
 Transform it into a SQL-like language (with a new operator 
cube by, introduced by Gray et al.’96) 
SELECT item, city, year, SUM (amount) 
FROM SALES 
CUBE BY item, city, year 
 Need to compute the following group-bys: 
(item, city, year), 
(item, city), (item, year), (city, year), 
(item), (city), (year), 
() 
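What a CUBE BY over item, city, and year has to produce can be sketched by brute force: one aggregation pass per group-by. This is a naive illustration with made-up rows, not one of the efficient cubing algorithms discussed later; the function name `cube_by` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cube_by(rows, dims, measure):
    """Return {group_by: {group_values: SUM(measure)}} for every subset of dims."""
    result = {}
    for k in range(len(dims), -1, -1):
        for group in combinations(dims, k):
            totals = defaultdict(float)
            for row in rows:
                totals[tuple(row[d] for d in group)] += row[measure]
            result[group] = dict(totals)
    return result


# Hypothetical fact rows.
rows = [
    {"item": "TV", "city": "Vancouver", "year": 2013, "amount": 100.0},
    {"item": "TV", "city": "Toronto", "year": 2013, "amount": 150.0},
    {"item": "PC", "city": "Vancouver", "year": 2014, "amount": 200.0},
]
cube = cube_by(rows, ("item", "city", "year"), "amount")
```

The result has 2^3 = 8 group-bys; the empty group-by `()` holds the grand total. Each group-by here rescans the base rows, which is exactly the cost the cube computation methods aim to avoid.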
Cube Computation: ROLAP-Based 
Method 
 Efficient cube computation methods 
 ROLAP-based cubing algorithms (Agarwal et al’96) 
 Array-based cubing algorithm (Zhao et al’97) 
 Bottom-up computation method (Beyer & Ramakrishnan’99) 
 H-cubing technique (Han, Pei, Dong & Wang:SIGMOD’01) 
 ROLAP-based cubing algorithms 
 Sorting, hashing, and grouping operations are applied to the 
dimension attributes in order to reorder and cluster related 
tuples 
 Grouping is performed on some sub-aggregates as a “partial 
grouping step” 
 Aggregates may be computed from previously computed 
aggregates, rather than from the base fact table 
Cube Computation: ROLAP-Based Method (2) 
 This is not in the textbook but in a research paper 
 Hash/sort based methods (Agarwal et al., VLDB’96) 
 Smallest-parent: computing a cuboid from the 
smallest, previously computed cuboid 
 Cache-results: caching results of a cuboid from 
which other cuboids are computed to reduce disk I/Os 
 Amortize-scans: computing as many cuboids as possible 
at the same time to amortize disk reads 
 Share-sorts: sharing sorting costs across multiple 
cuboids when a sort-based method is used 
 Share-partitions: sharing the partitioning cost across 
multiple cuboids when hash-based algorithms are used 
Multi-way Array Aggregation for Cube 
Computation 
 Partition arrays into chunks (a small subcube which fits in memory). 
 Compressed sparse array addressing: (chunk_id, offset) 
 Compute aggregates in “multiway” by visiting cube cells in the order 
which minimizes the # of times to visit each cell, and reduces memory 
access and storage cost. 
What is the best traversing order to do multi-way aggregation? 
[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 4 x 4 x 4 = 64 chunks. Chunks are numbered 1 to 64, first along A, then along B, then along C.] 
Multi-Way Array Aggregation for 
Cube Computation (Cont.) 
 Method: the planes should be sorted and computed 
according to their size in ascending order. 
 See the details of Example 2.12 (pp. 75-78) 
 Idea: keep the smallest plane in the main memory, 
fetch and compute only one chunk at a time for the 
largest plane 
 Limitation of the method: it performs well only for a 
small number of dimensions 
 If there are a large number of dimensions, “bottom-up 
computation” and iceberg cube computation 
methods can be explored 
Indexing OLAP Data: Bitmap Index 
 Index on a particular column 
 Each value in the column has a bit vector: bit-op is fast 
 The length of the bit vector: # of records in the base table 
 The i-th bit is set if the i-th row of the base table has the value for 
the indexed column 
 not suitable for high cardinality domains 
Base table 
Cust | Region | Type 
C1 | Asia | Retail 
C2 | Europe | Dealer 
C3 | Asia | Dealer 
C4 | America | Retail 
C5 | Europe | Dealer 
Index on Region 
RecID | Asia | Europe | America 
1 | 1 | 0 | 0 
2 | 0 | 1 | 0 
3 | 1 | 0 | 0 
4 | 0 | 0 | 1 
5 | 0 | 1 | 0 
Index on Type 
RecID | Retail | Dealer 
1 | 1 | 0 
2 | 0 | 1 
3 | 0 | 1 
4 | 1 | 0 
5 | 0 | 1 
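The bitmap tables for the five-row base table can be reproduced with Python integers as bit vectors. This is a minimal sketch; the function names are hypothetical, and bit i of each vector (counting from 0) stands for record i + 1.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bit vector."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_rec_ids(bitmap, n_rows):
    """Return the 1-based record IDs whose bit is set."""
    return [i + 1 for i in range(n_rows) if (bitmap >> i) & 1]


base_table = [
    {"Cust": "C1", "Region": "Asia", "Type": "Retail"},
    {"Cust": "C2", "Region": "Europe", "Type": "Dealer"},
    {"Cust": "C3", "Region": "Asia", "Type": "Dealer"},
    {"Cust": "C4", "Region": "America", "Type": "Retail"},
    {"Cust": "C5", "Region": "Europe", "Type": "Dealer"},
]
region_index = build_bitmap_index(base_table, "Region")
type_index = build_bitmap_index(base_table, "Type")

# A conjunctive selection becomes a single bitwise AND.
europe_dealers = matching_rec_ids(
    region_index["Europe"] & type_index["Dealer"], len(base_table)
)
```

One bit vector is kept per distinct value, so the index size grows with the column's cardinality, which is why the slide notes that bitmap indices do not suit high-cardinality domains.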
Indexing OLAP Data: Join Indices 
 Join index: JI(R-id, S-id) where R(R-id, …) ⋈ S(S-id, …) 
 Traditional indices map the values to a list of 
record ids 
 It materializes relational join in JI file and 
speeds up relational join — a rather costly 
operation 
 In data warehouses, a join index relates the values 
of the dimensions of a star schema to rows in 
the fact table. 
 E.g. fact table: Sales and two dimensions city 
and product 
 A join index on city maintains for each 
distinct city a list of R-IDs of the tuples 
recording the Sales in the city 
 Join indices can span multiple dimensions 
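The city example can be sketched as a dictionary from each distinct city to the record IDs of the matching Sales tuples. The rows and the function name `build_join_index` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def build_join_index(fact_rows, dimension_key):
    """Map each distinct dimension value to the 1-based R-IDs of its fact rows."""
    index = defaultdict(list)
    for rid, row in enumerate(fact_rows, start=1):
        index[row[dimension_key]].append(rid)
    return dict(index)


# Hypothetical Sales fact rows with two dimensions, city and product.
sales = [
    {"city": "Vancouver", "product": "TV", "dollars_sold": 100},
    {"city": "Toronto", "product": "TV", "dollars_sold": 150},
    {"city": "Vancouver", "product": "PC", "dollars_sold": 200},
]
city_index = build_join_index(sales, "city")

# The index locates the relevant fact rows without scanning or joining.
vancouver_total = sum(
    sales[rid - 1]["dollars_sold"] for rid in city_index["Vancouver"]
)
```

A join index spanning two dimensions could use (city, product) pairs as keys in the same way.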
Efficient Processing of OLAP Queries 
 Determine which operations should be performed on the 
available cuboids: 
 transform drill, roll, etc. into corresponding SQL and/or 
OLAP operations, e.g., dice = selection + projection 
 Determine to which materialized cuboid(s) the relevant 
operations should be applied. 
 Exploring indexing structures and compressed vs. dense 
array structures in MOLAP 
Metadata Repository 
 Meta data is the data defining warehouse objects. It has the following 
kinds 
 Description of the structure of the warehouse 
 schema, view, dimensions, hierarchies, derived data defn, data mart 
locations and contents 
 Operational meta-data 
 data lineage (history of migrated data and transformation path), 
currency of data (active, archived, or purged), monitoring information 
(warehouse usage statistics, error reports, audit trails) 
 The algorithms used for summarization 
 The mapping from operational environment to the data warehouse 
 Data related to system performance 
 warehouse schema, view and derived data definitions 
 Business data 
 business terms and definitions, ownership of data, charging policies 
Data Warehouse Back-End Tools and Utilities 
 Data extraction: 
 get data from multiple, heterogeneous, and external 
sources 
 Data cleaning: 
 detect errors in the data and rectify them when possible 
 Data transformation: 
 convert data from legacy or host format to warehouse 
format 
 Load: 
 sort, summarize, consolidate, compute views, check 
integrity, and build indices and partitions 
 Refresh 
 propagate the updates from the data sources to the 
warehouse 
Iceberg Cube 
 Computing only the cuboid cells whose count 
or other aggregates satisfy the condition: 
HAVING COUNT(*) >= minsup 
 Motivation 
 Only a small portion of cube cells may be 
“above the water’’ in a sparse cube 
 Only calculate “interesting” data—data 
above certain threshold 
 Suppose 100 dimensions, only 1 base cell. 
How many aggregate (non-base) cells if 
count >= 1? What about count >= 2? 
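The question can be checked by brute force on a small case and then scaled with a closed form. This is a sketch under the assumption that an aggregate cell is any cell with at least one dimension generalized to "*"; the function name `aggregate_cells` is hypothetical.

```python
from itertools import product


def aggregate_cells(base_cells, min_count):
    """Count non-base cells (at least one '*') whose count >= min_count."""
    n_dims = len(base_cells[0])
    counts = {}
    for cell in base_cells:
        for mask in product((False, True), repeat=n_dims):
            if not any(mask):
                continue  # the all-False mask is the base cell itself
            agg = tuple("*" if star else v for v, star in zip(cell, mask))
            counts[agg] = counts.get(agg, 0) + 1
    return sum(1 for c in counts.values() if c >= min_count)


# Three dimensions, one base cell: 2**3 - 1 aggregate cells, each with count 1.
small_count_1 = aggregate_cells([("a1", "a2", "a3")], 1)
small_count_2 = aggregate_cells([("a1", "a2", "a3")], 2)

# The same argument for 100 dimensions and a single base cell.
cells_100_dims = 2**100 - 1
```

With a single base cell every aggregate cell has count 1, so a threshold of count >= 2 leaves nothing to compute, while count >= 1 requires all 2^100 - 1 aggregate cells.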
Bottom-Up Computation (BUC) 
 BUC (Beyer & Ramakrishnan, 
SIGMOD’99) 
 Bottom-up vs. top-down?— 
depending on how you view it! 
 Apriori property: 
 Aggregate the data, 
then move to the 
next level 
 If minsup is not met, stop! 
 If minsup = 1 ⇒ compute full 
CUBE! 
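The pruning idea can be sketched as a short recursion over in-memory tuples. This is a simplified illustration in the spirit of BUC, not the published algorithm: it only handles COUNT(*), partitions with dictionaries instead of sorting, and the function name `buc` is hypothetical. The tuples are the (Month, City, Cust_grp) columns of the sample table used later in the deck.

```python
def buc(rows, n_dims, minsup):
    """Return {cell: count} for all cells with count >= minsup."""
    out = {}

    def recurse(part, cell, start):
        out[cell] = len(part)  # `part` is already known to meet minsup
        for d in range(start, n_dims):
            groups = {}
            for row in part:
                groups.setdefault(row[d], []).append(row)
            for value, group in groups.items():
                # Apriori pruning: if this partition fails minsup,
                # no more specific cell inside it can pass either.
                if len(group) >= minsup:
                    recurse(group, cell[:d] + (value,) + cell[d + 1:], d + 1)

    if len(rows) >= minsup:
        recurse(rows, ("*",) * n_dims, 0)
    return out


rows = [
    ("Jan", "Tor", "Edu"),
    ("Jan", "Tor", "Hhd"),
    ("Jan", "Tor", "Edu"),
    ("Feb", "Mon", "Bus"),
    ("Mar", "Van", "Edu"),
]
iceberg = buc(rows, 3, minsup=2)
full = buc(rows, 3, minsup=1)
```

With minsup = 2 the Feb and Mar partitions are dropped at the first level and never expanded; with minsup = 1 nothing is pruned and the full cube is computed.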
Partitioning 
 Usually, entire data set can’t 
fit in main memory 
 Sort distinct values, partition into blocks that fit 
 Continue processing 
 Optimizations 
 Partitioning 
 External Sorting, Hashing, Counting Sort 
 Ordering dimensions to encourage pruning 
 Cardinality, Skew, Correlation 
 Collapsing duplicates 
 Can’t do holistic aggregates anymore! 
Drawbacks of BUC 
 Requires a significant amount of memory 
 On par with most other CUBE algorithms though 
 Does not obtain good performance with dense CUBEs 
 Overly skewed data or a bad choice of dimension 
ordering reduces performance 
 Cannot compute iceberg cubes with complex measures 
CREATE CUBE Sales_Iceberg AS 
SELECT month, city, cust_grp, 
AVG(price), COUNT(*) 
FROM Sales_Infor 
CUBEBY month, city, cust_grp 
HAVING AVG(price) >= 800 AND 
COUNT(*) >= 50 
Non-Anti-Monotonic Measures 
 The cubing query with avg is non-anti-monotonic! 
 (Mar, *, *, 600, 1800) fails the HAVING clause 
 (Mar, *, Bus, 1300, 360) passes the clause 
CREATE CUBE Sales_Iceberg AS 
SELECT month, city, cust_grp, 
AVG(price), COUNT(*) 
FROM Sales_Infor 
CUBEBY month, city, cust_grp 
HAVING AVG(price) >= 800 AND 
COUNT(*) >= 50 
Month City Cust_grp Prod Cost Price 
Jan Tor Edu Printer 500 485 
Jan Tor Hhd TV 800 1200 
Jan Tor Edu Camera 1160 1280 
Feb Mon Bus Laptop 1500 2500 
Mar Van Edu HD 540 520 
… … … … … … 
Top-k Average 
 Let (*, Van, *) cover 1,000 records 
 Avg(price) is the average price of those 1000 sales 
 Avg50(price) is the average price of the top-50 sales 
(top-50 according to the sales price) 
 Top-k average is anti-monotonic 
 If the top 50 sales in Van. have avg(price) <= 800, then 
the top 50 sales in Van. during Feb. must also have 
avg(price) <= 800 
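The difference between avg and top-k avg can be seen on a toy cell. This is a minimal sketch with made-up prices and k = 2 instead of 50; the function name `top_k_avg` and the price lists are hypothetical.

```python
def top_k_avg(prices, k):
    """Average of the k largest prices, or None if there are fewer than k."""
    if len(prices) < k:
        return None
    return sum(sorted(prices, reverse=True)[:k]) / k


def avg(prices):
    return sum(prices) / len(prices)


# Hypothetical prices covered by a cell and by two of its descendant cells.
parent = [900, 850, 700, 300, 200]
child_low = [850, 300, 200]
child_high = [900, 850]

parent_top2 = top_k_avg(parent, 2)
child_low_top2 = top_k_avg(child_low, 2)
child_high_top2 = top_k_avg(child_high, 2)
```

Every descendant draws its top k values from the parent's records, so its top-k average can never exceed the parent's; a parent that fails the threshold can be pruned with all its descendants. The plain average gives no such guarantee: `avg(child_high)` is 875 while `avg(parent)` is 590.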
Binning for Top-k Average 
 Computing top-k avg is costly with large k 
 Binning idea 
 Avg50(c) >= 800 
 Large value collapsing: use a sum and a count 
to summarize records with measure >= 800 
 If count >= 50, no need to check “small” records 
 Small value binning: a group of bins 
 One bin covers a range, e.g., 600~800, 400~600, 
etc. 
 Register a sum and a count for each bin 
Approximate top-k average 
Suppose for (*, Van, *), we have 
Range | Sum | Count 
Over 800 | 28000 | 20 
600~800 | 10600 | 15 
400~600 | 15200 | 30 
… | … | … 
The top 50 records are the 20 + 15 = 35 records of the first two bins plus 15 records from the 400~600 bin, each bounded above by 600: 
Approximate avg50() = (28000 + 10600 + 600*15)/50 = 952 
The cell may pass the HAVING clause 
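The binned estimate can be reproduced in a few lines. This is a sketch of one way to read the slide's arithmetic: bins that fit entirely into the top k contribute their real sum, and the last, partially used bin contributes its upper bound per record. The function name `approx_top_k_avg` and the tuple layout are hypothetical.

```python
def approx_top_k_avg(bins, k):
    """Upper-bound estimate of the top-k average from binned (sum, count) data.

    `bins` lists (upper_bound, bin_sum, count) from the highest range down.
    Returns None when the bins hold fewer than k records.
    """
    total, remaining = 0.0, k
    for upper_bound, bin_sum, count in bins:
        if remaining == 0:
            break
        if count <= remaining:
            total += bin_sum  # whole bin is inside the top k
            remaining -= count
        else:
            total += upper_bound * remaining  # optimistic bound for the rest
            remaining = 0
    if remaining:
        return None
    return total / k


# Bins for (*, Van, *) from the slide; the open-ended top bin has no finite bound.
van_bins = [
    (float("inf"), 28000, 20),  # over 800
    (800, 10600, 15),           # 600~800
    (600, 15200, 30),           # 400~600
]
approx_avg50 = approx_top_k_avg(van_bins, 50)
```

Because the estimate never underestimates the real top-k average, a cell whose estimate falls below 800 can be pruned safely, while 952 only says that the cell may pass.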
Quant-info for Top-k Average 
Binning 
 Accumulate quant-info for cells to compute 
average iceberg cubes efficiently 
 Three pieces: sum, count, top-k bins 
 Use top-k bins to estimate/prune descendants 
 Use sum and count to consolidate current cell 
Measures, from weakest to strongest: 
 Approximate avg50(): anti-monotonic, can be computed efficiently 
 Real avg50(): anti-monotonic, but computationally costly 
 avg(): not anti-monotonic 
An Efficient Iceberg Cubing Method: 
Top-k H-Cubing 
 One can revise Apriori or BUC to compute a top-k avg 
iceberg cube. This leads to top-k-Apriori and top-k BUC. 
 Can we compute iceberg cube more efficiently? 
 Top-k H-cubing: an efficient method to compute iceberg 
cubes with average measure 
 H-tree: a hyper-tree structure 
 H-cubing: computing iceberg cubes using H-tree 
H-tree: A Prefix Hyper-tree 
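Sample table 
Month | City | Cust_grp | Prod | Cost | Price 
Jan | Tor | Edu | Printer | 500 | 485 
Jan | Tor | Hhd | TV | 800 | 1200 
Jan | Tor | Edu | Camera | 1160 | 1280 
Feb | Mon | Bus | Laptop | 1500 | 2500 
Mar | Van | Edu | HD | 540 | 520 
… | … | … | … | … | … 
Header table (columns: Attr. Val., Quant-Info, Side-link), one entry per attribute value: 
Edu (Sum: 2285 …), Hhd, Bus, …; Jan, Feb, …; Tor, Van, Mon, … 
Tree (levels from the root: Cust_grp, Month, City; each leaf carries quant-info): 
root -> edu -> Jan -> Tor (Quant-Info: Sum: 1765, Cnt: 2, bins) 
root -> edu -> Mar -> Van (Q.I.) 
root -> hhd -> Jan -> Tor (Q.I.) 
root -> bus -> Feb -> Mon (Q.I.) 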
Properties of H-tree 
 Construction cost: a single database scan 
 Completeness: It contains the complete 
information needed for computing the iceberg 
cube 
 Compactness: # of nodes ≤ n*m+1 
 n: # of tuples in the table 
 m: # of attributes 
October 8, 2014 Data Mining: Concepts and Techniques 65
Computing Cells Involving 
Dimension City 
From (*, *, Tor) to (*, Jan, Tor) 
[Figure: the H-tree (root; Edu, Hhd, Bus; Jan, Mar, Jan, Feb; Tor, Van, Tor, Mon; quant-info at the leaves, e.g. Sum: 1765, Cnt: 2, bins) and its header table (entries Edu (Sum: 2285), Hhd, Bus, …, Jan, Feb, …, Tor, Van, Mon, …) with the Tor entry highlighted, alongside a second header table HTor (columns Attr. Val., Q.I., Side-link; entries Edu, Hhd, Bus, …, Jan, Feb, …).]
October 8, 2014 Data Mining: Concepts and Techniques 66
Computing Cells Involving Month 
But No City 
[Figure: the H-tree (root; Edu, Hhd, Bus; Jan, Mar, Jan, Feb; Tor, Van, Tor, Mon) with quant-info (Q.I.) shown at the month-level nodes, next to the header table (entries Edu (Sum: 2285), Hhd, Bus, …, Jan, Feb, Mar, …, Tor, Van, Mon, …).]
1. Roll up quant-info
2. Compute cells involving month but no city
Top-k OK mark: if the Q.I. in a child passes the
top-k avg threshold, so do its parents.
No binning is needed! 
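Rolling up quant-info amounts to adding the sums and counts of child nodes into their parent. A minimal sketch, using plain dictionaries rather than the H-tree itself; the `topk_ok` field is a hypothetical stand-in for the Top-k OK mark, and the False values below are assumed for illustration.

```python
def roll_up(children):
    """Merge the quant-info of child cells into their parent cell."""
    return {
        "sum": sum(c["sum"] for c in children),
        "count": sum(c["count"] for c in children),
        # If any child already passes the top-k avg threshold, so does
        # the parent, and its bins need not be examined.
        "topk_ok": any(c["topk_ok"] for c in children),
    }


# City-level quant-info below (Edu, Jan) and (Edu, Mar).
edu_jan_tor = {"sum": 1765, "count": 2, "topk_ok": False}
edu_mar_van = {"sum": 520, "count": 1, "topk_ok": False}

edu_jan = roll_up([edu_jan_tor])
edu_mar = roll_up([edu_mar_van])
edu = roll_up([edu_jan, edu_mar])
print(edu["sum"], edu["count"])  # 2285 3
```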
October 8, 2014 Data Mining: Concepts and Techniques 67
Computing Cells Involving Only 
Cust_grp 
[Figure: the H-tree (root; edu, hhd, bus; Jan, Mar, Jan, Feb; Tor, Van, Tor, Mon) with quant-info (Q.I.) on its nodes, next to the header table (entries Edu (Sum: 2285), Hhd, Bus, …, Jan, Feb, Mar, …, Tor, Van, Mon, …).]
Check header table directly
October 8, 2014 Data Mining: Concepts and Techniques 68
Properties of H-Cubing 
 Space cost 
 an H-tree 
 a stack of up to (m-1) header tables 
 One database scan 
 Main memory-based tree traversal & side-link
updates
 Top-k_OK marking 
October 8, 2014 Data Mining: Concepts and Techniques 69
Scalability w.r.t. Count Threshold 
(No min_avg Setting) 
[Figure: runtime (seconds, 0 to 300) against count threshold (0.00% to 0.10%) for top-k H-Cubing and top-k BUC.]
October 8, 2014 Data Mining: Concepts and Techniques 70
Computing Iceberg Cubes with Other 
Complex Measures 
 Computing other complex measures 
 Key point: find a function which is weaker but ensures 
certain anti-monotonicity 
 Examples 
 Avg() ≤ v: avgk(c) ≤ v (bottom-k avg)
 Avg() ≥ v only (no count): max(price) ≥ v
 Sum(profit) (profit can be negative):
 p_sum(c) ≥ v if p_count(c) ≥ k; or otherwise, sumk(c) ≥ v
 Others: conjunctions of multiple conditions 
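As an illustration of the second example, the test max(price) ≥ v is weaker than avg(price) ≥ v but can be used for pruning, because the maximum of a cell never grows when tuples are removed. The function name and the prices below are assumptions made for this sketch.

```python
def may_contain_high_avg(prices, v):
    """Weaker test for avg(price) >= v.

    If even the maximum is below v, neither this cell nor any cell
    built from a subset of its tuples can reach an average of v.
    """
    return max(prices) >= v


cell = [300, 450, 700]  # hypothetical prices in one cell

print(may_contain_high_avg(cell, 800))  # False: prune the cell and its descendants
print(may_contain_high_avg(cell, 600))  # True: keep it, avg() must still be checked
```

Passing the weaker test is necessary but not sufficient: in the second call the cell is kept even though its own average is below 600, because a descendant holding only the 700 tuple would qualify.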
October 8, 2014 Data Mining: Concepts and Techniques 71
Discussion: Other Issues 
 Computing iceberg cubes with more complex measures? 
 No general answer for holistic measures, e.g., median, 
mode, rank 
 A research theme even for complex algebraic functions, 
e.g., standard_dev, variance 
 Dynamic vs. static computation of iceberg cubes
 v and k are only available at query time 
 Setting reasonably low parameters for most nontrivial 
cases 
 Memory hog? What if the cubing does not fit in memory?
Use projection and then cubing.
October 8, 2014 Data Mining: Concepts and Techniques 72
Condensed Cube 
 W. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective 
Approach to Reducing Data Cube Size. ICDE’02. 
 Iceberg cubes cannot solve all the problems
 Suppose 100 dimensions, only 1 base cell with count = 10. 
How many aggregate (non-base) cells if count >= 10? 
 Condensed cube 
 Only need to store one cell (a1, a2, …, a100, 10), which represents 
all the corresponding aggregate cells 
 Advantages
 Fully precomputed cube without compression 
 Efficient computation of the minimal condensed cube 
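The question about 100 dimensions has a direct answer: every aggregate cell that generalizes the single base cell replaces at least one dimension value with '*' and inherits the count of 10, so all 2^100 − 1 of them pass count >= 10. A short sketch (the function names are chosen for illustration) checks the formula against brute-force enumeration on a small case.

```python
from itertools import product


def count_aggregate_cells(n_dims):
    """Aggregate cells covering one base cell: each dimension is either
    kept or replaced by '*', excluding the base cell itself."""
    return 2 ** n_dims - 1


def enumerate_aggregate_cells(base):
    """Brute-force enumeration, only practical for a few dimensions."""
    cells = []
    for mask in product([False, True], repeat=len(base)):
        if any(mask):  # at least one '*', so it is not the base cell
            cells.append(tuple("*" if m else v for v, m in zip(base, mask)))
    return cells


print(len(enumerate_aggregate_cells(("a1", "a2", "a3"))))  # 7
print(count_aggregate_cells(100))  # 1267650600228229401496703205375
```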
October 8, 2014 Data Mining: Concepts and Techniques 73
Chapter 2: Data Warehousing and 
OLAP Technology for Data Mining 
 What is a data warehouse? 
 A multi-dimensional data model 
 Data warehouse architecture 
 Data warehouse implementation 
 Further development of data cube technology 
 From data warehousing to data mining 
October 8, 2014 Data Mining: Concepts and Techniques 74
Data Warehouse Usage 
 Three kinds of data warehouse applications 
 Information processing 
 supports querying, basic statistical analysis, and reporting 
using crosstabs, tables, charts and graphs 
 Analytical processing 
 multidimensional analysis of data warehouse data 
 supports basic OLAP operations, slice-dice, drilling, pivoting 
 Data mining 
 knowledge discovery from hidden patterns 
 supports associations, constructing analytical models, 
performing classification and prediction, and presenting the 
mining results using visualization tools. 
 Differences among the three tasks 
October 8, 2014 Data Mining: Concepts and Techniques 75
From On-Line Analytical Processing 
to On-Line Analytical Mining
(OLAM) 
 Why online analytical mining? 
 High quality of data in data warehouses 
 DW contains integrated, consistent, cleaned data 
 Available information processing structure surrounding data 
warehouses 
 ODBC, OLEDB, Web accessing, service facilities, reporting 
and OLAP tools 
 OLAP-based exploratory data analysis 
 mining with drilling, dicing, pivoting, etc. 
 On-line selection of data mining functions 
 integration and swapping of multiple mining functions, 
algorithms, and tasks. 
 Architecture of OLAM 
October 8, 2014 Data Mining: Concepts and Techniques 76
An OLAM Architecture 
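[Figure: the OLAM architecture in four layers, top to bottom]
 Layer 4, User Interface: mining queries go in and mining results come back through the User GUI API
 Layer 3, OLAP/OLAM: the OLAM Engine and the OLAP Engine, sitting on the Data Cube API
 Layer 2, MDDB: the multidimensional database (MDDB) and Meta Data, fed through the Database API with filtering & integration
 Layer 1, Data Repository: Databases and the Data Warehouse, with data cleaning, data integration and filtering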
October 8, 2014 Data Mining: Concepts and Techniques 77
Discovery-Driven Exploration of Data 
Cubes 
 Hypothesis-driven 
 exploration by user, huge search space 
 Discovery-driven (Sarawagi, et al.’98) 
 Effective navigation of large OLAP data cubes 
 pre-compute measures indicating exceptions, guide 
user in the data analysis, at all levels of aggregation 
 Exception: significantly different from the value 
anticipated, based on a statistical model 
 Visual cues such as background color are used to 
reflect the degree of exception of each cell 
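To make the idea of an exception concrete, the sketch below scores each cell of a small two-dimensional table by its residual against an anticipated value. The additive row-plus-column model used here is a simple stand-in chosen for illustration, not the statistical model of Sarawagi et al.; the function name and the sales figures are likewise assumptions.

```python
def exception_scores(table):
    """Residual of each cell against an additive row + column model."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_rows * n_cols)
    row_mean = [sum(r) / n_cols for r in table]
    col_mean = [sum(table[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    # Anticipated value = row mean + column mean - grand mean.
    return [[table[i][j] - (row_mean[i] + col_mean[j] - grand)
             for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical sales by region (rows) and quarter (columns).
sales = [
    [10, 12, 11],
    [20, 22, 21],
    [30, 32, 60],  # the last cell is unusually high
]
scores = exception_scores(sales)
i, j = max(((i, j) for i in range(3) for j in range(3)),
           key=lambda ij: abs(scores[ij[0]][ij[1]]))
print((i, j))  # (2, 2)
```

A viewer could map the absolute score of each cell to a background color, which is the kind of visual cue the slide describes.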
October 8, 2014 Data Mining: Concepts and Techniques 78
Kinds of Exceptions and their Computation 
 Parameters 
 SelfExp: surprise of cell relative to other cells at same 
level of aggregation 
 InExp: surprise beneath the cell 
 PathExp: surprise beneath cell for each drill-down 
path 
 Computation of exception indicators (model fitting and
computing SelfExp, InExp, and PathExp values) can be
overlapped with cube construction
 Exceptions themselves can be stored, indexed and
retrieved like precomputed aggregates
October 8, 2014 Data Mining: Concepts and Techniques 79
Examples: Discovery-Driven Data 
Cubes 
October 8, 2014 Data Mining: Concepts and Techniques 80
Complex Aggregation at Multiple 
Granularities: Multi-Feature Cubes 
 Multi-feature cubes (Ross, et al. 1998): Compute complex queries 
involving multiple dependent aggregates at multiple granularities 
 Ex. Grouping by all subsets of {item, region, month}, find the 
maximum price in 1997 for each group, and the total sales among all 
maximum price tuples 
select item, region, month, max(price), sum(R.sales) 
from purchases 
where year = 1997 
cube by item, region, month: R 
such that R.price = max(price) 
 Continuing the last example, among the max price tuples, find the
min and max shelf life, and find the fraction of the total sales due to
tuples that have min shelf life within the set of all max price tuples
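The first query can be mimicked in plain Python to show what "dependent aggregates" means: the sum of sales depends on the max price computed for the same group. This is a brute-force sketch over made-up rows, not the evaluation strategy of Ross et al.; `multi_feature_cube` and the sample data are assumptions.

```python
from itertools import combinations


def multi_feature_cube(rows, dims):
    """For every subset of `dims`, group the rows and report, per group,
    the maximum price and the total sales of the tuples at that price."""
    result = {}
    for r in range(len(dims) + 1):
        for group_by in combinations(dims, r):
            groups = {}
            for row in rows:
                key = tuple(row[d] for d in group_by)
                groups.setdefault(key, []).append(row)
            for key, members in groups.items():
                max_price = max(m["price"] for m in members)
                sales = sum(m["sales"] for m in members
                            if m["price"] == max_price)
                result[(group_by, key)] = (max_price, sales)
    return result


# Hypothetical purchases, already restricted to year = 1997.
purchases = [
    {"item": "TV", "region": "West", "month": "Jan", "price": 500, "sales": 10},
    {"item": "TV", "region": "West", "month": "Feb", "price": 600, "sales": 4},
    {"item": "TV", "region": "East", "month": "Jan", "price": 600, "sales": 6},
    {"item": "PC", "region": "West", "month": "Jan", "price": 900, "sales": 3},
]
cube = multi_feature_cube(purchases, ("item", "region", "month"))
print(cube[(("item",), ("TV",))])  # (600, 10)
print(cube[((), ())])              # (900, 3)
```

For the group item = TV, two tuples share the maximum price of 600, so their sales (4 and 6) are added; the 500-priced tuple is excluded.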
October 8, 2014 Data Mining: Concepts and Techniques 81
Cube-Gradient (Cubegrade) 
 Analysis of changes of sophisticated measures 
in multi-dimensional spaces 
 Query: changes of average house price in
Vancouver in ’00 compared with ’99
 Answer: Apts in West went down 20%, 
houses in Metrotown went up 10% 
 Cubegrade problem by Imielinski et al. 
 Changes in dimensions → changes in
measures
 Drill-down, roll-up, and mutation 
October 8, 2014 Data Mining: Concepts and Techniques 82
From Cubegrade to Multi-dimensional 
Constrained Gradients in Data Cubes 
 Significantly more expressive than association rules 
 Capture trends in user-specified measures 
 Serious challenges 
 Many trivial cells in a cube → “significance constraint”
to prune trivial cells
 Enumerate pairs of cells → “probe constraint” to select
a subset of cells to examine
 Only interesting changes wanted → “gradient
constraint” to capture significant changes
October 8, 2014 Data Mining: Concepts and Techniques 83
MD Constrained Gradient Mining 
 Significance constraint Csig: (cnt ≥ 100)
 Probe constraint Cprb: (city=“Van”, cust_grp=“busi”, prod_grp=“*”)
 Gradient constraint Cgrad(cg, cp): (avg_price(cg)/avg_price(cp) ≥ 1.3)

      Dimensions                      Measures
cid   Yr  City  Cst_grp  Prd_grp     Cnt     Avg_price
c1    00  Van   Busi     PC          300     2100
c2    *   Van   Busi     PC          2800    1800
c3    *   Tor   Busi     PC          7900    2350
c4    *   *     busi     PC          58600   2250

Figure annotations: base cell, aggregated cell, siblings, ancestor; “Probe cell: satisfied Cprb”; “(c4, c2) satisfies Cgrad!”
October 8, 2014 Data Mining: Concepts and Techniques 84
A LiveSet-Driven Algorithm 
 Compute probe cells using Csig and Cprb 
 The set of probe cells P is often very small 
 Use probe P and constraints to find gradients 
 Pushing selection deeply 
 Set-oriented processing for probe cells 
 Iceberg growing from low to high dimensionalities 
 Dynamic pruning of probe cells during growth
 Incorporating efficient iceberg cubing method 
October 8, 2014 Data Mining: Concepts and Techniques 85
Summary 
 Data warehouse 
 A multi-dimensional model of a data warehouse 
 Star schema, snowflake schema, fact constellations 
 A data cube consists of dimensions & measures 
 OLAP operations: drilling, rolling, slicing, dicing and pivoting 
 OLAP servers: ROLAP, MOLAP, HOLAP 
 Efficient computation of data cubes 
 Partial vs. full vs. no materialization 
 Multiway array aggregation 
 Bitmap index and join index implementations 
 Further development of data cube technology 
 Discovery-driven and multi-feature cubes
 From OLAP to OLAM (on-line analytical mining) 
October 8, 2014 Data Mining: Concepts and Techniques 86

Data Mining Concepts and Techniques

  • 1. Data Mining: Concepts and Techniques — Slides for Textbook — — Chapter 2 — October 8, 2014 Data Mining: Concepts and Techniques 1
  • 2. Chapter 2: Data Warehousing and OLAP Technology for Data Mining  What is a data warehouse?  A multi-dimensional data model  Data warehouse architecture  Data warehouse implementation  Further development of data cube technology  From data warehousing to data mining October 8, 2014 Data Mining: Concepts and Techniques 2
  • 3. What is Data Warehouse?  Defined in many different ways, but not rigorously.  A decision support database that is maintained separately from the organization’s operational database  Support information processing by providing a solid platform of consolidated, historical data for analysis.  “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon  Data warehousing:  The process of constructing and using data warehouses October 8, 2014 Data Mining: Concepts and Techniques 3
  • 4. Data Warehouse—Subject-Oriented  Organized around major subjects, such as customer, product, sales.  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.  Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. October 8, 2014 Data Mining: Concepts and Techniques 4
  • 5. Data Warehouse—Integrated  Constructed by integrating multiple, heterogeneous data sources  relational databases, flat files, on-line transaction records  Data cleaning and data integration techniques are applied.  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources  E.g., Hotel price: currency, tax, breakfast covered, etc.  When data is moved to the warehouse, it is converted. October 8, 2014 Data Mining: Concepts and Techniques 5
  • 6. Data Warehouse—Time Variant  The time horizon for the data warehouse is significantly longer than that of operational systems.  Operational database: current value data.  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)  Every key structure in the data warehouse  Contains an element of time, explicitly or implicitly  But the key of operational data may or may not contain “time element”. October 8, 2014 Data Mining: Concepts and Techniques 6
  • 7. Data Warehouse—Non-Volatile  A physically separate store of data transformed from the operational environment.  Operational update of data does not occur in the data warehouse environment.  Does not require transaction processing, recovery, and concurrency control mechanisms  Requires only two operations in data accessing:  initial loading of data and access of data. October 8, 2014 Data Mining: Concepts and Techniques 7
  • 8. Data Warehouse vs. Heterogeneous DBMS  Traditional heterogeneous DB integration:  Build wrappers/mediators on top of heterogeneous databases  Query driven approach  When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set  Complex information filtering, compete for resources  Data warehouse: update-driven, high performance  Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis October 8, 2014 Data Mining: Concepts and Techniques 8
  • 9. Data Warehouse vs. Operational DBMS  OLTP (on-line transaction processing)  Major task of traditional relational DBMS  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.  OLAP (on-line analytical processing)  Major task of data warehouse system  Data analysis and decision making  Distinct features (OLTP vs. OLAP):  User and system orientation: customer vs. market  Data contents: current, detailed vs. historical, consolidated  Database design: ER + application vs. star + subject  View: current, local vs. evolutionary, integrated  Access patterns: update vs. read-only but complex queries October 8, 2014 Data Mining: Concepts and Techniques 9
  • 10. OLTP vs. OLAP OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date detailed, flat relational isolated historical, summarized, multidimensional integrated, consolidated usage repetitive ad-hoc access read/write index/hash on prim. key lots of scans unit of work short, simple transaction complex query # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response October 8, 2014 Data Mining: Concepts and Techniques 10
  • 11. Why Separate Data Warehouse?  High performance for both systems  DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery  Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.  Different functions and different data:  missing data: Decision support requires historical data which operational DBs do not typically maintain  data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources  data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled October 8, 2014 Data Mining: Concepts and Techniques 11
  • 12. Chapter 2: Data Warehousing and OLAP Technology for Data Mining  What is a data warehouse?  A multi-dimensional data model  Data warehouse architecture  Data warehouse implementation  Further development of data cube technology  From data warehousing to data mining October 8, 2014 Data Mining: Concepts and Techniques 12
  • 13. From Tables and Spreadsheets to Data Cubes  A data warehouse is based on a multidimensional data model which views data in the form of a data cube  A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions  Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)  Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables  In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube. October 8, 2014 Data Mining: Concepts and Techniques 13
  • 14. Cube: A Lattice of Cuboids all time item location supplier time,item time,location item,location time,supplier item,supplier 1-D cuboids location,supplier time,item,location time,location,supplier time,item,supplier item,location,supplier time, item, location, supplier 0-D(apex) cuboid 2-D cuboids 3-D cuboids 4-D(base) cuboid October 8, 2014 Data Mining: Concepts and Techniques 14
  • 15. Conceptual Modeling of Data Warehouses  Modeling data warehouses: dimensions & measures  Star schema: A fact table in the middle connected to a set of dimension tables  Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake  Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation October 8, 2014 Data Mining: Concepts and Techniques 15
  • 16. Example of Star Schema time time_key day day_of_the_week month quarter year item item_key item_name brand type supplier_type location location_key street city state_or_province country Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales branch branch_key branch_name branch_type Measures October 8, 2014 Data Mining: Concepts and Techniques 16
  • 17. Example of Snowflake Schema time time_key day day_of_the_week month quarter year item item_key item_name brand type supplier_key location location_key street city_key Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales branch branch_key branch_name branch_type Measures supplier supplier_key supplier_type city city_key city state_or_province country October 8, 2014 Data Mining: Concepts and Techniques 17
  • 18. Example of Fact Constellation time time_key day day_of_the_week month quarter year item item_key item_name brand type supplier_type location location_key street city province_or_state country Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales branch branch_key branch_name branch_type Measures Shipping Fact Table time_key item_key shipper_key from_location to_location dollars_cost units_shipped shipper shipper_key shipper_name location_key shipper_type October 8, 2014 Data Mining: Concepts and Techniques 18
  • 19. A Data Mining Query Language: DMQL  Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]: <measure_list>  Dimension Definition ( Dimension Table ) define dimension <dimension_name> as (<attribute_or_subdimension_list>)  Special Case (Shared Dimension Tables)  First time as “cube definition”  define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time> October 8, 2014 Data Mining: Concepts and Techniques 19
  • 20. Defining a Star Schema in DMQL define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) October 8, 2014 Data Mining: Concepts and Techniques 20
  • 21. Defining a Snowflake Schema in DMQL define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country)) October 8, 2014 Data Mining: Concepts and Techniques 21
  • 22. Defining a Fact Constellation in DMQL define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales October 8, 2014 Data Mining: Concepts and Techniques 22
  • 23. Measures: Three Categories  distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.  E.g., count(), sum(), min(), max().  algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.  E.g., avg(), min_N(), standard_deviation().  holistic: if there is no constant bound on the storage size needed to describe a subaggregate.  E.g., median(), mode(), rank(). October 8, 2014 Data Mining: Concepts and Techniques 23
  • 24. A Concept Hierarchy: Dimension (location) all Europe ... North_America Germany ... Spain Canada ... Mexico ... Vancouver ... city Frankfurt Toronto L. Chan ... M. Wind all region country office October 8, 2014 Data Mining: Concepts and Techniques 24
  • 25. View of Warehouses and Hierarchies Specification of hierarchies  Schema hierarchy day < {month < quarter; week} < year  Set_grouping hierarchy {1..10} < inexpensive October 8, 2014 Data Mining: Concepts and Techniques 25
  • 26. Multidimensional Data  Sales volume as a function of product, month, and region  Dimensions: Product, Location, Time  Hierarchical summarization paths  Product: Industry > Category > Product  Location: Region > Country > City > Office  Time: Year > Quarter > Month or Week > Day
  • 27. A Sample Data Cube [Figure: 3-D cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum); one highlighted cell is labeled "Total annual sales of TV in U.S.A."]
  • 28. Cuboids Corresponding to the Cube  0-D (apex) cuboid: all  1-D cuboids: product; date; country  2-D cuboids: product, date; product, country; date, country  3-D (base) cuboid: product, date, country
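The lattice of cuboids for the three dimensions product, date, and country can be enumerated with a short Python sketch: every subset of the dimension set is one cuboid, from the empty set (apex) to the full set (base). The variable names are illustrative.

```python
from itertools import combinations

dimensions = ("product", "date", "country")

# Every subset of the dimensions corresponds to one cuboid (one group-by).
cuboids = [
    combo
    for k in range(len(dimensions) + 1)
    for combo in combinations(dimensions, k)
]

# Group the cuboids by dimensionality: 0-D (apex) up to 3-D (base).
by_level = {
    k: [c for c in cuboids if len(c) == k]
    for k in range(len(dimensions) + 1)
}
```

With three dimensions and no hierarchies this gives 2^3 = 8 cuboids: one apex, three 1-D, three 2-D, and one base cuboid.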
  • 29. Browsing a Data Cube  Visualization  OLAP capabilities  Interactive manipulation October 8, 2014 Data Mining: Concepts and Techniques 29
  • 30. Typical OLAP Operations  Roll up (drill-up): summarize data  by climbing up hierarchy or by dimension reduction  Drill down (roll down): reverse of roll-up  from higher level summary to lower level summary or detailed data, or introducing new dimensions  Slice and dice:  project and select  Pivot (rotate):  reorient the cube, visualization, 3D to series of 2D planes.  Other operations  drill across: involving (across) more than one fact table  drill through: through the bottom level of the cube to its back-end relational tables (using SQL) October 8, 2014 Data Mining: Concepts and Techniques 30
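Roll-up and slice can be sketched on a tiny in-memory cube. The cell values and the helper names `roll_up_to_country` and `slice_quarter` are hypothetical and only illustrate the two operations; a real OLAP engine would work on materialized cuboids.

```python
from collections import defaultdict

# Hypothetical base cells: (city, country, quarter) -> sales.
base = {
    ("Vancouver", "Canada", "Q1"): 100,
    ("Toronto", "Canada", "Q1"): 150,
    ("Toronto", "Canada", "Q2"): 50,
    ("Chicago", "USA", "Q1"): 200,
}

def roll_up_to_country(cells):
    """Roll up: climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, country, quarter), sales in cells.items():
        out[(country, quarter)] += sales
    return dict(out)

def slice_quarter(cells, quarter):
    """Slice: select one value of the quarter dimension and project it away."""
    return {
        (city, country): sales
        for (city, country, q), sales in cells.items()
        if q == quarter
    }

by_country = roll_up_to_country(base)
q1_only = slice_quarter(base, "Q1")
```

Drill-down is the reverse direction of `roll_up_to_country`: it needs the finer-grained cells (here `base`) to be available.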
  • 31. A Star-Net Query Model [Figure: star-net with one radial line per dimension and footprints along each line. Shipping Method: AIR-EXPRESS, TRUCK; Customer Orders: CONTRACTS, ORDER; Customer; Product: PRODUCT LINE, PRODUCT GROUP, PRODUCT ITEM; Organization: SALES PERSON, DISTRICT, DIVISION; Promotion; Location: CITY, COUNTRY, REGION; Time: ANNUALLY, QTRLY, DAILY]  Each circle is called a footprint
  • 33. Design of a Data Warehouse: A Business Analysis Framework  Four views regarding the design of a data warehouse  Top-down view  allows selection of the relevant information necessary for the data warehouse  Data source view  exposes the information being captured, stored, and managed by operational systems  Data warehouse view  consists of fact tables and dimension tables  Business query view  sees the perspectives of data in the warehouse from the view of end-user October 8, 2014 Data Mining: Concepts and Techniques 33
  • 34. Data Warehouse Design Process  Top-down, bottom-up approaches or a combination of both  Top-down: Starts with overall design and planning (mature)  Bottom-up: Starts with experiments and prototypes (rapid)  From software engineering point of view  Waterfall: structured and systematic analysis at each step before proceeding to the next  Spiral: rapid generation of increasingly functional systems with short turnaround time  Typical data warehouse design process  Choose a business process to model, e.g., orders, invoices, etc.  Choose the grain (atomic level of data) of the business process  Choose the dimensions that will apply to each fact table record  Choose the measure that will populate each fact table record
  • 35. Multi-Tiered Architecture [Figure: data flows from the Data Sources (operational DBs, other sources) through Extract / Transform / Load / Refresh, with a Monitor & Integrator and Metadata, into Data Storage (data warehouse, data marts); Data Storage serves the OLAP Engine (OLAP server), which feeds the Front-End Tools (analysis, query, reports, data mining)]
  • 36. Three Data Warehouse Models  Enterprise warehouse  collects all of the information about subjects spanning the entire organization  Data Mart  a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart  Independent vs. dependent (directly from warehouse) data mart  Virtual warehouse  A set of views over operational databases  Only some of the possible summary views may be materialized
  • 37. Data Warehouse Development: A Recommended Approach [Figure: define a high-level corporate data model; by model refinement, build data marts and an enterprise data warehouse; these lead to distributed data marts and a multi-tier data warehouse]
  • 38. OLAP Server Architectures  Relational OLAP (ROLAP)  Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces  Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services  greater scalability  Multidimensional OLAP (MOLAP)  Array-based multidimensional storage engine (sparse matrix techniques)  fast indexing to pre-computed summarized data  Hybrid OLAP (HOLAP)  User flexibility, e.g., low level: relational, high-level: array  Specialized SQL servers  specialized support for SQL queries over star/snowflake schemas
  • 40. Efficient Data Cube Computation  Data cube can be viewed as a lattice of cuboids  The bottom-most cuboid is the base cuboid  The top-most cuboid (apex) contains only one cell  How many cuboids in an n-dimensional cube with L levels? $T = \prod_{i=1}^{n} (L_i + 1)$, where $L_i$ is the number of levels of dimension $i$  Materialization of data cube  Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)  Selection of which cuboids to materialize  Based on size, sharing, access frequency, etc.
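The cuboid count is the product, over all n dimensions, of (L_i + 1), where L_i is the number of hierarchy levels of dimension i and the extra 1 accounts for the virtual top level "all". A minimal Python sketch (the function name `total_cuboids` is illustrative):

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Number of cuboids: product of (L_i + 1) over all dimensions.

    The "+ 1" accounts for the top level "all" of each dimension.
    """
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions with a single level each: the plain 2^3 lattice.
flat_cube = total_cuboids([1, 1, 1])

# Ten dimensions with four levels each: 5^10 cuboids.
deep_cube = total_cuboids([4] * 10)
```

The second case shows why full materialization quickly becomes impractical: ten dimensions with four levels each already give almost ten million cuboids.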
  • 41. Cube Operation  Cube definition and computation in DMQL
      define cube sales[item, city, year]: sum(sales_in_dollars)
      compute cube sales
   Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
      SELECT item, city, year, SUM (amount) FROM SALES CUBE BY item, city, year
   Need to compute the following group-bys: (city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()
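What CUBE BY item, city, year has to produce can be sketched in plain Python: one group-by per subset of the dimensions, with the aggregated-away dimensions shown as "*". The fact rows and the helper `cube_by` are hypothetical; this is a naive illustration that rescans the rows for every group-by, not one of the efficient cubing algorithms.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (item, city, year, amount).
rows = [
    ("TV", "Vancouver", 2000, 100),
    ("TV", "Toronto", 2000, 200),
    ("PC", "Vancouver", 2001, 300),
]
dims = ("item", "city", "year")

def cube_by(rows, dims):
    """Compute SUM(amount) for every group-by over the given dimensions."""
    cube = {}
    n = len(dims)
    for k in range(n + 1):
        for kept in combinations(range(n), k):
            groups = defaultdict(int)
            for row in rows:
                # Dimensions outside this group-by collapse to "*" (ALL).
                key = tuple(row[i] if i in kept else "*" for i in range(n))
                groups[key] += row[n]
            cube[tuple(dims[i] for i in kept)] = dict(groups)
    return cube

cube = cube_by(rows, dims)
```

The result has one entry per group-by, eight in total for three dimensions, and the apex group-by `()` holds the single grand total.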
  • 42. Cube Computation: ROLAP-Based Method  Efficient cube computation methods  ROLAP-based cubing algorithms (Agarwal et al.’96)  Array-based cubing algorithm (Zhao et al.’97)  Bottom-up computation method (Beyer & Ramakrishnan’99)  H-cubing technique (Han, Pei, Dong & Wang: SIGMOD’01)  ROLAP-based cubing algorithms  Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples  Grouping is performed on some sub-aggregates as a “partial grouping step”  Aggregates may be computed from previously computed aggregates, rather than from the base fact table
  • 43. Cube Computation: ROLAP-Based Method (2)  This is not in the textbook but in a research paper  Hash/sort based methods (Agarwal et al., VLDB’96)  Smallest-parent: computing a cuboid from the smallest, previously computed cuboid  Cache-results: caching results of a cuboid from which other cuboids are computed to reduce disk I/Os  Amortize-scans: computing as many cuboids as possible at the same time to amortize disk reads  Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used  Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used
  • 44. Multi-way Array Aggregation for Cube Computation  Partition arrays into chunks (a small subcube which fits in memory).  Compressed sparse array addressing: (chunk_id, offset)  Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost.  What is the best traversing order to do multi-way aggregation? [Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
  • 45-46. Multi-way Array Aggregation for Cube Computation [Figure repeated on both slides: 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
  • 47. Multi-Way Array Aggregation for Cube Computation (Cont.)  Method: the planes should be sorted and computed according to their size in ascending order.  See the details of Example 2.12 (pp. 75-78)  Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane  Limitation of the method: it works well only for a small number of dimensions  If there are a large number of dimensions, “bottom-up computation” and iceberg cube computation methods can be explored
  • 48. Indexing OLAP Data: Bitmap Index  Index on a particular column  Each value in the column has a bit vector: bit-op is fast  The length of the bit vector: # of records in the base table  The i-th bit is set if the i-th row of the base table has the value for the indexed column  not suitable for high cardinality domains
      Base table:
        Cust  Region   Type
        C1    Asia     Retail
        C2    Europe   Dealer
        C3    Asia     Dealer
        C4    America  Retail
        C5    Europe   Dealer
      Index on Region:
        RecID  Asia  Europe  America
        1      1     0       0
        2      0     1       0
        3      1     0       0
        4      0     0       1
        5      0     1       0
      Index on Type:
        RecID  Retail  Dealer
        1      1       0
        2      0       1
        3      0       1
        4      1       0
        5      0       1
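The bitmap index for the slide's base table can be built with a few lines of Python. The helper `build_bitmap_index` is illustrative; bit vectors are kept as plain lists of 0/1 for readability, whereas a real system would pack them into machine words.

```python
# Base table from the slide: (Cust, Region, Type).
table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """One bit vector per distinct value; bit i is 1 if row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[column], [0] * len(rows))[i] = 1
    return index

region_idx = build_bitmap_index(table, 1)
type_idx = build_bitmap_index(table, 2)

# "Region = Europe AND Type = Dealer" becomes a bitwise AND of two vectors.
hits = [a & b for a, b in zip(region_idx["Europe"], type_idx["Dealer"])]
matching = [table[i][0] for i, bit in enumerate(hits) if bit]
```

Each distinct value needs its own vector of length "number of rows", which is why the slide notes that the technique is not suitable for high-cardinality domains.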
  • 49. Indexing OLAP Data: Join Indices  Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)  Traditional indices map the values to a list of record ids  It materializes relational join in JI file and speeds up relational join, a rather costly operation  In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table.  E.g. fact table: Sales and two dimensions city and product  A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city  Join indices can span multiple dimensions
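A join index on city can be sketched as a mapping from each distinct city to the record ids of the matching fact rows. The fact rows and the helper `build_join_index` below are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical fact table Sales: record id -> (city, product, dollars_sold).
sales = {
    1: ("Vancouver", "TV", 500),
    2: ("Toronto", "PC", 900),
    3: ("Vancouver", "PC", 700),
}

def build_join_index(fact, dim_position):
    """Map each distinct dimension value to the list of fact record ids."""
    index = defaultdict(list)
    for rid, row in fact.items():
        index[row[dim_position]].append(rid)
    return dict(index)

city_ji = build_join_index(sales, 0)

# Answer "total sales in Vancouver" by following the index, not scanning.
vancouver_total = sum(sales[rid][2] for rid in city_ji["Vancouver"])
```

A join index spanning two dimensions would use a composite key such as (city, product) in the same way.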
  • 50. Efficient Processing OLAP Queries  Determine which operations should be performed on the available cuboids:  transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g, dice = selection + projection  Determine to which materialized cuboid(s) the relevant operations should be applied.  Exploring indexing structures and compressed vs. dense array structures in MOLAP October 8, 2014 Data Mining: Concepts and Techniques 50
  • 51. Metadata Repository  Meta data is the data defining warehouse objects. It has the following kinds  Description of the structure of the warehouse  schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents  Operational meta-data  data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)  The algorithms used for summarization  The mapping from operational environment to the data warehouse  Data related to system performance  warehouse schema, view and derived data definitions  Business data  business terms and definitions, ownership of data, charging policies October 8, 2014 Data Mining: Concepts and Techniques 51
  • 52. Data Warehouse Back-End Tools and Utilities  Data extraction:  get data from multiple, heterogeneous, and external sources  Data cleaning:  detect errors in the data and rectify them when possible  Data transformation:  convert data from legacy or host format to warehouse format  Load:  sort, summarize, consolidate, compute views, check integrity, and build indices and partitions  Refresh  propagate the updates from the data sources to the warehouse
  • 54. Iceberg Cube  Compute only the cuboid cells whose count or other aggregate satisfies a condition such as: HAVING COUNT(*) >= minsup  Motivation  Only a small portion of cube cells may be “above the water” in a sparse cube  Only calculate “interesting” data: data above a certain threshold  Suppose 100 dimensions, only 1 base cell. How many aggregate (non-base) cells if count >= 1? What about count >= 2?
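The closing question can be worked through with a small sketch. An aggregate cell is obtained from the base cell by replacing at least one dimension value with "*", so a single base cell over n dimensions yields 2^n - 1 aggregate cells, each with count 1. The helper `aggregate_cells` is illustrative and is only run on a 3-dimensional cell; the 100-dimensional figure is computed arithmetically.

```python
from itertools import product

def aggregate_cells(base_cell):
    """All cells obtained by replacing at least one value with '*'."""
    options = [(value, "*") for value in base_cell]
    return [cell for cell in product(*options) if cell != tuple(base_cell)]

# Brute force on a small case: 2^3 - 1 = 7 aggregate cells.
small = aggregate_cells(("a1", "a2", "a3"))

# With 100 dimensions and one base cell, every aggregate cell has count 1.
cells_with_count_ge_1 = 2 ** 100 - 1
cells_with_count_ge_2 = 0  # no cell covers more than the single base tuple
```

So a threshold of count >= 2 already removes every aggregate cell in this extreme case, which is the motivation for computing only the iceberg.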
  • 55. Bottom-Up Computation (BUC)  BUC (Beyer & Ramakrishnan, SIGMOD’99)  Bottom-up vs. top-down? It depends on how you view it!  Apriori property:  Aggregate the data, then move to the next level  If minsup is not met, stop!  If minsup = 1 ⇒ compute full CUBE!
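The Apriori-style pruning can be sketched as a short recursive function. This is a simplified illustration in the spirit of BUC, not the published algorithm: it works on an in-memory list, partitions with a dictionary instead of sorting, and only supports the count measure. The function name `buc` and the sample tuples (month, city, cust_grp) are assumptions for the example.

```python
def buc(rows, n_dims, minsup, cell=None, start=0, out=None):
    """Expand cells one dimension at a time, pruning on count < minsup."""
    if cell is None:
        cell = ["*"] * n_dims
    if out is None:
        out = {}
    if len(rows) < minsup:
        return out  # Apriori pruning: no more specific cell can reach minsup
    out[tuple(cell)] = len(rows)
    for d in range(start, n_dims):
        # Partition the current rows on dimension d.
        partitions = {}
        for row in rows:
            partitions.setdefault(row[d], []).append(row)
        for value, part in partitions.items():
            cell[d] = value
            buc(part, n_dims, minsup, cell, d + 1, out)
        cell[d] = "*"  # restore before moving on to the next dimension
    return out

# Hypothetical tuples: (month, city, cust_grp).
rows = [
    ("Jan", "Tor", "Edu"),
    ("Jan", "Tor", "Hhd"),
    ("Jan", "Tor", "Edu"),
    ("Feb", "Mon", "Bus"),
    ("Mar", "Van", "Edu"),
]
iceberg = buc(rows, 3, minsup=2)
full_cube = buc(rows, 3, minsup=1)
```

With minsup = 2 only eight cells survive; with minsup = 1 nothing is pruned and the full cube is computed, as the slide points out.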
  • 56. Partitioning  Usually, entire data set can’t fit in main memory  Sort distinct values, partition into blocks that fit  Continue processing  Optimizations  Partitioning  External Sorting, Hashing, Counting Sort  Ordering dimensions to encourage pruning  Cardinality, Skew, Correlation  Collapsing duplicates  Can’t do holistic aggregates anymore! October 8, 2014 Data Mining: Concepts and Techniques 56
  • 57. Drawbacks of BUC  Requires a significant amount of memory  On par with most other CUBE algorithms though  Does not obtain good performance with dense CUBEs  Overly skewed data or a bad choice of dimension ordering reduces performance  Cannot compute iceberg cubes with complex measures CREATE CUBE Sales_Iceberg AS SELECT month, city, cust_grp, AVG(price), COUNT(*) FROM Sales_Infor CUBEBY month, city, cust_grp HAVING AVG(price) >= 800 AND COUNT(*) >= 50 October 8, 2014 Data Mining: Concepts and Techniques 57
  • 58. Non-Anti-Monotonic Measures  The cubing query with avg is non-anti-monotonic!  (Mar, *, *, 600, 1800) fails the HAVING clause  (Mar, *, Bus, 1300, 360) passes the clause
      CREATE CUBE Sales_Iceberg AS
      SELECT month, city, cust_grp, AVG(price), COUNT(*)
      FROM Sales_Infor
      CUBEBY month, city, cust_grp
      HAVING AVG(price) >= 800 AND COUNT(*) >= 50
      Sales_Infor:
        Month  City  Cust_grp  Prod     Cost  Price
        Jan    Tor   Edu       Printer  500   485
        Jan    Tor   Hhd       TV       800   1200
        Jan    Tor   Edu       Camera   1160  1280
        Feb    Mon   Bus       Laptop   1500  2500
        Mar    Van   Edu       HD       540   520
        …      …     …         …        …     …
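The two cells quoted on the slide can be checked against the HAVING clause with a tiny sketch. The function name `passes` is illustrative; the thresholds come from the query, and the cell values (average price, count) from the slide.

```python
def passes(avg_price, count, min_avg=800, min_count=50):
    """HAVING AVG(price) >= 800 AND COUNT(*) >= 50."""
    return avg_price >= min_avg and count >= min_count

# (Mar, *, *) has avg 600 over 1800 tuples; (Mar, *, Bus) has avg 1300 over 360.
parent_ok = passes(600, 1800)
child_ok = passes(1300, 360)
```

The more general cell fails while one of its descendants passes, so a failing cell cannot be used to prune its descendants: the avg condition is not anti-monotonic.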
  • 59. Top-k Average  Let (*, Van, *) cover 1,000 records  Avg(price) is the average price of those 1000 sales  Avg50(price) is the average price of the top-50 sales (top-50 according to the sales price)  Top-k average is anti-monotonic  If the top 50 sales in Van. have avg(price) <= 800, then the top 50 sales in Van. during Feb. must also have avg(price) <= 800 (Sales table columns: Month, City, Cust_grp, Prod, Cost, Price)
  • 60. Binning for Top-k Average  Computing top-k avg is costly with large k  Binning idea  Avg50(c) >= 800  Large value collapsing: use a sum and a count to summarize records with measure >= 800  If count >= 50 (that is, k), no need to check “small” records  Small value binning: a group of bins  One bin covers a range, e.g., 600~800, 400~600, etc.  Register a sum and a count for each bin
  • 61. Approximate top-k average  Suppose for (*, Van, *), we have
        Range     Sum    Count
        Over 800  28000  20
        600~800   10600  15
        400~600   15200  30
        …         …      …
   The top 50 records span the first two bins and 15 records of the third  Approximate avg50() = (28000 + 10600 + 600*15)/50 = 952  The cell may pass the HAVING clause (Sales table columns: Month, City, Cust_grp, Prod, Cost, Price)
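The arithmetic on this slide can be reproduced with a short sketch. Bins that fall entirely within the top k contribute their exact sum; the last, partially used bin contributes its upper bound times the number of records still needed, which makes the result an optimistic estimate. The function name `approx_top_k_avg` and the (upper bound, sum, count) tuple layout are assumptions for illustration.

```python
def approx_top_k_avg(bins, k):
    """Upper-bound estimate of the top-k average from binned quant-info.

    bins: list of (upper_bound, total, count), highest range first.
    """
    remaining, acc = k, 0.0
    for upper, total, count in bins:
        if count <= remaining:
            acc += total              # whole bin lies within the top k
            remaining -= count
        else:
            acc += upper * remaining  # partial bin: assume the bin's upper bound
            remaining = 0
        if remaining == 0:
            return acc / k
    raise ValueError("fewer than k records in the bins")

# Quant-info for (*, Van, *) from the slide; the open-ended bin has no
# finite upper bound, which is harmless here because it is used in full.
van_bins = [
    (float("inf"), 28000, 20),  # over 800
    (800, 10600, 15),           # 600~800
    (600, 15200, 30),           # 400~600
]
estimate = approx_top_k_avg(van_bins, 50)
```

Because the estimate never underestimates the real top-50 average, a cell whose estimate is below the threshold can be pruned safely, while a cell whose estimate is above it only "may" pass.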
  • 62. Quant-info for Top-k Average Binning  Accumulate quant-info for cells to compute average iceberg cubes efficiently  Three pieces: sum, count, top-k bins  Use top-k bins to estimate/prune descendants  Use sum and count to consolidate current cell  Conditions, from weakest to strongest:  Approximate avg50(): anti-monotonic, can be computed efficiently  Real avg50(): anti-monotonic, but computationally costly  avg(): not anti-monotonic
  • 63. An Efficient Iceberg Cubing Method: Top-k H-Cubing  One can revise Apriori or BUC to compute a top-k avg iceberg cube. This leads to top-k-Apriori and top-k BUC.  Can we compute iceberg cube more efficiently?  Top-k H-cubing: an efficient method to compute iceberg cubes with average measure  H-tree: a hyper-tree structure  H-cubing: computing iceberg cubes using H-tree October 8, 2014 Data Mining: Concepts and Techniques 63
  • 64. H-tree: A Prefix Hyper-tree
      Source tuples:
        Month  City  Cust_grp  Prod     Cost  Price
        Jan    Tor   Edu       Printer  500   485
        Jan    Tor   Hhd       TV       800   1200
        Jan    Tor   Edu       Camera   1160  1280
        Feb    Mon   Bus       Laptop   1500  2500
        Mar    Van   Edu       HD       540   520
        …      …     …         …        …     …
      Header table (columns Attr. Val., Quant-Info, Side-link): Edu (Sum: 2285, …), Hhd, Bus, …, Jan, Feb, …, Tor, Van, Mon, …
      [Figure: H-tree with root; cust_grp nodes edu, hhd, bus; month nodes Jan and Mar under edu, Jan under hhd, Feb under bus; city leaves Tor, Van, Tor, Mon. Each leaf carries quant-info; the edu-Jan-Tor leaf holds Sum: 1765, Cnt: 2, bins]
  • 65. Properties of H-tree  Construction cost: a single database scan  Completeness: It contains the complete information needed for computing the iceberg cube  Compactness: # of nodes ≤ n*m+1  n: # of tuples in the table  m: # of attributes
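The single-scan construction and the node bound can be illustrated with a simplified prefix tree. This sketch keeps only sum and count at the leaves and omits the header table, side-links, and top-k bins of the real H-tree; the class `HNode`, the function `build_h_tree`, and the attribute order (cust_grp, month, city) are assumptions for illustration.

```python
class HNode:
    """One node of a simplified H-tree (no header table or side-links)."""

    def __init__(self):
        self.children = {}
        self.sum = 0
        self.count = 0

def build_h_tree(tuples):
    """Insert each tuple along a shared prefix path in a single scan."""
    root = HNode()
    node_count = 1  # the root
    for *attrs, price in tuples:
        node = root
        for value in attrs:
            if value not in node.children:
                node.children[value] = HNode()
                node_count += 1
            node = node.children[value]
        node.sum += price   # quant-info is accumulated at the leaf
        node.count += 1
    return root, node_count

# The five sample tuples, ordered as (cust_grp, month, city, price).
tuples = [
    ("Edu", "Jan", "Tor", 485),
    ("Hhd", "Jan", "Tor", 1200),
    ("Edu", "Jan", "Tor", 1280),
    ("Bus", "Feb", "Mon", 2500),
    ("Edu", "Mar", "Van", 520),
]
root, node_count = build_h_tree(tuples)
leaf = root.children["Edu"].children["Jan"].children["Tor"]
```

For these n = 5 tuples and m = 3 attributes the tree has 12 nodes, within the bound n*m + 1 = 16, because tuples with a common prefix share nodes.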
  • 66. Computing Cells Involving Dimension City  From (*, *, Tor) to (*, Jan, Tor) [Figure: the H-tree (root; Edu., Hhd., Bus.; Jan., Mar., Jan., Feb.; Tor., Van., Tor., Mon.) with its header table, in which the Tor entry is highlighted, plus a local header table HTor with entries Edu, Hhd, Bus, …, Jan, Feb, …; the Edu-Jan-Tor leaf holds quant-info Sum: 1765, Cnt: 2, bins]
  • 67. Computing Cells Involving Month But No City  1. Roll up quant-info  2. Compute cells involving month but no city  Top-k OK mark: if Q.I. in a child passes top-k avg threshold, so does its parents. No binning is needed! [Figure: the H-tree with quant-info rolled up from the city leaves (Tor., Van., Tor., Mont.) to the month nodes (Jan., Mar., Jan., Feb.), alongside the header table with entries Edu. (Sum: 2285), Hhd., Bus., Jan., Feb., Mar., Tor., Van., Mont.]
  • 68. Computing Cells Involving Only Cust_grp  Check header table directly [Figure: the H-tree (root; edu, hhd, bus; Jan, Mar, Jan, Feb; Tor, Van, Tor, Mon) alongside the header table with entries Edu (Sum: 2285), Hhd, Bus, Jan, Feb, Mar, Tor, Van, Mon]
  • 69. Properties of H-Cubing  Space cost  an H-tree  a stack of up to (m-1) header tables  One database scan  Main memory-based tree traversal & side-links updates  Top-k_OK marking October 8, 2014 Data Mining: Concepts and Techniques 69
  • 70. Scalability w.r.t. Count Threshold (No min_avg Setting) [Figure: runtime in seconds (0 to 300) versus count threshold (0.00% to 0.10%) for top-k H-Cubing and top-k BUC]
  • 71. Computing Iceberg Cubes with Other Complex Measures  Computing other complex measures  Key point: find a function which is weaker but ensures certain anti-monotonicity  Examples  Avg() ≤ v: avgk(c) ≤ v (bottom-k avg)  Avg() ≥ v only (no count): max(price) ≥ v  Sum(profit) (profit can be negative):  p_sum(c) ≥ v if p_count(c) ≥ k; or otherwise, sumk(c) ≥ v  Others: conjunctions of multiple conditions
  • 72. Discussion: Other Issues  Computing iceberg cubes with more complex measures?  No general answer for holistic measures, e.g., median, mode, rank  A research theme even for complex algebraic functions, e.g., standard_dev, variance  Dynamic vs . static computation of iceberg cubes  v and k are only available at query time  Setting reasonably low parameters for most nontrivial cases  Memory-hog? what if the cubing is too big to fit in memory? —projection and then cubing October 8, 2014 Data Mining: Concepts and Techniques 72
  • 73. Condensed Cube  W. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE’02.  Iceberg cube cannot solve all the problems  Suppose 100 dimensions, only 1 base cell with count = 10. How many aggregate (non-base) cells if count >= 10?  Condensed cube  Only need to store one cell (a1, a2, …, a100, 10), which represents all the corresponding aggregate cells  Adv.  Fully precomputed cube without compression  Efficient computation of the minimal condensed cube
  • 75. Data Warehouse Usage  Three kinds of data warehouse applications  Information processing  supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs  Analytical processing  multidimensional analysis of data warehouse data  supports basic OLAP operations, slice-dice, drilling, pivoting  Data mining  knowledge discovery from hidden patterns  supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.  Differences among the three tasks October 8, 2014 Data Mining: Concepts and Techniques 75
  • 76. From On-Line Analytical Processing to On Line Analytical Mining (OLAM)  Why online analytical mining?  High quality of data in data warehouses  DW contains integrated, consistent, cleaned data  Available information processing structure surrounding data warehouses  ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools  OLAP-based exploratory data analysis  mining with drilling, dicing, pivoting, etc.  On-line selection of data mining functions  integration and swapping of multiple mining functions, algorithms, and tasks.  Architecture of OLAM October 8, 2014 Data Mining: Concepts and Techniques 76
  • 77. An OLAM Architecture [Figure: four layers. Layer 1, Data Repository: databases and data warehouse with meta data, fed through data cleaning, data integration, and filtering via a Database API. Layer 2, MDDB: multidimensional database built by filtering & integration, accessed through a Data Cube API. Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine. Layer 4, User Interface: User GUI API, receiving mining queries and returning mining results]
  • 78. Discovery-Driven Exploration of Data Cubes  Hypothesis-driven  exploration by user, huge search space  Discovery-driven (Sarawagi, et al.’98)  Effective navigation of large OLAP data cubes  pre-compute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation  Exception: significantly different from the value anticipated, based on a statistical model  Visual cues such as background color are used to reflect the degree of exception of each cell October 8, 2014 Data Mining: Concepts and Techniques 78
  • 79. Kinds of Exceptions and their Computation  Parameters  SelfExp: surprise of cell relative to other cells at same level of aggregation  InExp: surprise beneath the cell  PathExp: surprise beneath cell for each drill-down path  Computation of exception indicator (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction  Exceptions themselves can be stored, indexed and retrieved like precomputed aggregates
  • 80. Examples: Discovery-Driven Data Cubes October 8, 2014 Data Mining: Concepts and Techniques 80
  • 81. Complex Aggregation at Multiple Granularities: Multi-Feature Cubes  Multi-feature cubes (Ross, et al. 1998): Compute complex queries involving multiple dependent aggregates at multiple granularities  Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum price tuples
      select item, region, month, max(price), sum(R.sales)
      from purchases
      where year = 1997
      cube by item, region, month: R
      such that R.price = max(price)
   Continuing the last example, among the max price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max price tuples
  • 82. Cube-Gradient (Cubegrade)  Analysis of changes of sophisticated measures in multi-dimensional spaces  Query: changes of average house price in Vancouver in ’00 compared against ’99  Answer: Apts in West went down 20%, houses in Metrotown went up 10%  Cubegrade problem by Imielinski et al.  Changes in dimensions → changes in measures  Drill-down, roll-up, and mutation
  • 83. From Cubegrade to Multi-dimensional Constrained Gradients in Data Cubes  Significantly more expressive than association rules  Capture trends in user-specified measures  Serious challenges  Many trivial cells in a cube → “significance constraint” to prune trivial cells  Numerous pairs of cells → “probe constraint” to select a subset of cells to examine  Only interesting changes wanted → “gradient constraint” to capture significant changes
  • 84. MD Constrained Gradient Mining  Significance constraint Csig: (cnt ≥ 100)  Probe constraint Cprb: (city=“Van”, cust_grp=“busi”, prod_grp=“*”)  Gradient constraint Cgrad(cg, cp): (avg_price(cg)/avg_price(cp) ≥ 1.3)  Probe cell: satisfied Cprb  (c4, c2) satisfies Cgrad!
             Dimensions                    Measures
        cid  Yr  City  Cst_grp  Prd_grp    Cnt    Avg_price
        c1   00  Van   Busi     PC         300    2100
        c2   *   Van   Busi     PC         2800   1800
        c3   *   Tor   Busi     PC         7900   2350
        c4   *   *     busi     PC         58600  2250
      (Figure labels on the table: base cell, aggregated cell, siblings, ancestor)
  • 85. A LiveSet-Driven Algorithm  Compute probe cells using Csig and Cprb  The set of probe cells P is often very small  Use probe P and constraints to find gradients  Pushing selection deeply  Set-oriented processing for probe cells  Iceberg growing from low to high dimensionalities  Dynamic pruning probe cells during growth  Incorporating efficient iceberg cubing method October 8, 2014 Data Mining: Concepts and Techniques 85
  • 86. Summary  Data warehouse  A multi-dimensional model of a data warehouse  Star schema, snowflake schema, fact constellations  A data cube consists of dimensions & measures  OLAP operations: drilling, rolling, slicing, dicing and pivoting  OLAP servers: ROLAP, MOLAP, HOLAP  Efficient computation of data cubes  Partial vs. full vs. no materialization  Multiway array aggregation  Bitmap index and join index implementations  Further development of data cube technology  Discovery-driven and multi-feature cubes  From OLAP to OLAM (on-line analytical mining)