Data Mining Lecture 3
Define multi-dimensional data
model
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a
multidimensional data model, which views
data in the form of a data cube
A data cube, such as sales, allows data to be
modeled and viewed in multiple dimensions
- Dimension tables, such as item (item_name,
brand, type), or time (day, week, month,
quarter, year)
- Fact table contains measures (such as
Euros_sold) and keys to each of the related
dimension tables
Definitions
- An n-dimensional base cube is called a base
cuboid
- The topmost 0-D cuboid, which holds the
highest level of summarisation, is called the
apex cuboid
- The lattice of cuboids forms a data cube
Cube: A Lattice of Cuboids
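The lattice can be made concrete with a short sketch. The Python below is an illustration only, assuming the four dimensions of the sales cube used later in these slides and ignoring concept hierarchies; the function name cuboid_lattice is hypothetical.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every cuboid (a tuple of dimension names) by its dimensionality.

    Level 0 holds the apex cuboid (), level n holds the base cuboid
    containing all n dimensions.
    """
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

dims = ["time", "item", "branch", "location"]
lattice = cuboid_lattice(dims)
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids:", cuboids)

# Without concept hierarchies, n dimensions give 2**n cuboids.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

Each level of the dictionary corresponds to one row of the lattice, from the apex cuboid at level 0 down to the base cuboid at level 4.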
Conceptual Modeling of Data
Warehouses
Modeling data warehouses: dimensions & measures
Star schema
- A fact table in the middle connected to a set
of dimension tables
Snowflake schema
- A refinement of star schema where some
dimensional hierarchy is normalized into a set
of smaller dimension tables, forming a shape
similar to a snowflake
Fact constellations
- Multiple fact tables share dimension tables,
viewed as a collection of stars, therefore
called galaxy schema or fact constellation
DMQL: Language Primitives
Cube Definition (Fact Table)
- define cube <cube_name> [<dimension_list>]:
<measure_list>
Dimension Definition (Dimension Table)
- define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
- First time: defined as part of a “cube definition”
- Subsequently: define dimension <dimension_name> as
<dimension_name_first_time> in cube
<cube_name_first_time>
Defining a Star Schema in
DMQL
define cube sales_star [time, item, branch, location]:
Euros_sold = sum(sales_in_Euros),
avg_sales = avg(sales_in_Euros),
units_sold = count(*)
define dimension time as
(time_key, day, day_of_week, month,
quarter, year)
define dimension item as
(item_key, item_name, brand, type,
supplier_type)
define dimension branch as
(branch_key, branch_name, branch_type)
define dimension location as
(location_key, street, city, county, province,
country)
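As a rough illustration of how the star schema above might map onto relational tables, the following Python sketch uses the standard sqlite3 module. It is an assumption-laden example, not part of DMQL: only the item and location dimensions are created to keep it short, the table name sales_fact is hypothetical, and all rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    county TEXT, province TEXT, country TEXT);
-- Fact table: one key per dimension plus the raw measure.
CREATE TABLE sales_fact (
    time_key INTEGER,
    item_key INTEGER REFERENCES item(item_key),
    branch_key INTEGER,
    location_key INTEGER REFERENCES location(location_key),
    sales_in_Euros REAL);
""")

# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?, ?)", [
    (1, "Laptop A", "BrandX", "computer", "wholesale"),
    (2, "Phone B", "BrandY", "phone", "retail"),
])
conn.executemany("INSERT INTO location VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "1 Main St", "Dublin", "Dublin", "Leinster", "Ireland"),
    (2, "2 Rue A", "Lyon", "Rhone", "Auvergne-Rhone-Alpes", "France"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 1000.0),
    (1, 1, 1, 1, 1200.0),
    (2, 2, 1, 2, 500.0),
])

# The three measures of the cube, grouped by two dimension attributes.
rows = conn.execute("""
SELECT l.country, i.brand,
       SUM(f.sales_in_Euros) AS Euros_sold,
       AVG(f.sales_in_Euros) AS avg_sales,
       COUNT(*)              AS units_sold
FROM sales_fact f
JOIN item i     ON i.item_key = f.item_key
JOIN location l ON l.location_key = f.location_key
GROUP BY l.country, i.brand
ORDER BY l.country, i.brand
""").fetchall()
for row in rows:
    print(row)
```

Every dimension is reached from the fact table with a single join, which is the defining property of the star layout.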
Defining a Snowflake Schema in
DMQL
define cube sales_snowflake [time, item, branch,
location]:
Euros_sold = sum(sales_in_Euros),
avg_sales = avg(sales_in_Euros),
units_sold = count(*)
define dimension time as
(time_key, day, day_of_week, month,
quarter, year)
define dimension item as
(item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as
(branch_key, branch_name, branch_type)
define dimension location as
(location_key, street, city(city_key, county,
province, country))
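The difference from the star schema is that supplier becomes its own table, so reaching supplier_type costs an extra join. The sketch below shows this with Python's sqlite3 module; the rows are made-up sample data and only the item and supplier tables are modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- supplier is normalised out of item, as in the snowflake definition.
CREATE TABLE supplier (
    supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT,
    supplier_key INTEGER REFERENCES supplier(supplier_key));
INSERT INTO supplier VALUES (10, 'wholesale'), (20, 'retail');
INSERT INTO item VALUES
    (1, 'Laptop A', 'BrandX', 'computer', 10),
    (2, 'Phone B',  'BrandY', 'phone',    20),
    (3, 'Phone C',  'BrandY', 'phone',    20);
""")

# supplier_type is no longer a column of item, so a join is required.
items_per_supplier_type = conn.execute("""
SELECT s.supplier_type, COUNT(*)
FROM item i
JOIN supplier s ON s.supplier_key = i.supplier_key
GROUP BY s.supplier_type
ORDER BY s.supplier_type
""").fetchall()
print(items_per_supplier_type)
```

The normalised form stores each supplier_type once per supplier rather than once per item, at the price of the additional join at query time.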
Defining a Fact Constellation in
DMQL
define cube sales [time, item, branch, location]:
Euros_sold = sum(sales_in_Euros), avg_sales =
avg(sales_in_Euros),
units_sold = count(*)
define dimension time as (time_key, day,
day_of_week, month, quarter, year)
define dimension item as (item_key, item_name,
brand, type, supplier_type)
define dimension branch as (branch_key,
branch_name, branch_type)
define dimension location as (location_key, street,
city, province_or_state,
country)
define cube shipping [time, item, shipper,
from_location, to_location]:
Euro_cost = sum(cost_in_Euros), unit_shipped =
count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key,
shipper_name, location as location
in cube sales, shipper_type)
define dimension from_location as location in cube
sales
define dimension to_location as location in cube sales
Measures: Three Categories
Distributive
- if the result derived by applying the function
to n aggregate values is the same as that
derived by applying the function on all the
data without partitioning.
o E.g., count(), sum(), min(), max()
Algebraic
- if it can be computed by an algebraic function
with M arguments (where M is a bounded
integer), each of which is obtained by
applying a distributive aggregate function.
o E.g., avg(), min_N(),
standard_deviation()
Holistic
- if there is no constant bound on the storage
size needed to describe a sub-aggregate.
o E.g., median(), mode(), rank()
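The three categories can be checked on a small example. The Python sketch below uses made-up numbers split into three partitions; it shows that sum() can be combined from partial results, that avg() can be derived from two distributive values (so M = 2), and that combining per-partition medians does not in general give the true median.

```python
from statistics import median

# Made-up data, split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum over all data.
partial_sums = [sum(part) for part in partitions]
assert sum(partial_sums) == sum(all_values)

# count() is also distributive; partial counts are combined by adding them.
partial_counts = [len(part) for part in partitions]
total_count = sum(partial_counts)

# Algebraic: avg() is computed from two distributive values, sum and count.
avg = sum(partial_sums) / total_count

# Holistic: the median of the per-partition medians is not the true median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print("avg:", avg)
print("median of partition medians:", median_of_medians)
print("true median:", true_median)
```

This is why distributive and algebraic measures are cheap to maintain in pre-computed cuboids, while holistic measures usually need the underlying data or an approximation.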
A Concept Hierarchy: Dimension
(location)
A concept hierarchy allows data to be handled at
varying levels of abstraction.
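A minimal sketch of a concept hierarchy for the location dimension follows. It assumes the levels city < province < country < all; the PARENT mapping, the function names and the sales figures are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical hierarchy: each value points to its parent one level up.
PARENT = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Canada": "all",
}
LEVELS = ["city", "province", "country", "all"]

def generalise(city, level):
    """Climb from a city up to the requested level of the hierarchy."""
    value = city
    for _ in range(LEVELS.index(level)):
        value = PARENT[value]
    return value

# Made-up sales figures at the lowest level (city).
sales_by_city = {"Vancouver": 100, "Victoria": 50, "Toronto": 200}

def aggregate(level):
    """Sum the city-level sales at a higher level of abstraction."""
    totals = defaultdict(int)
    for city, amount in sales_by_city.items():
        totals[generalise(city, level)] += amount
    return dict(totals)

for level in LEVELS:
    print(level, aggregate(level))
```

The same city-level data is viewed at four levels of abstraction simply by changing the level argument.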
Multidimensional Data
Sales volume as a function of product,
month, and Country
A Sample Data Cube
Cuboids Corresponding to the Cube
Browsing a Data Cube
Typical OLAP Operations
Roll up (drill-up): summarise data
- by climbing up a hierarchy or by dimension
reduction
Drill down (roll down): reverse of roll-up
- from higher level summary to lower level
summary or detailed data, or introducing new
dimensions
Slice and dice
- project and select
Pivot (rotate)
- reorient the cube, visualisation, 3D to a series
of 2D planes.
Other operations
- drill across: involving (across) more than one
fact table
- drill through: through the bottom level of the
cube to its back-end relational tables (using
SQL)
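The core operations can be sketched on a tiny in-memory cube. The Python below is an illustration under stated assumptions: the cube is a plain dictionary keyed by (product, month, country), the values and the month-to-quarter mapping are made up, and the function names are hypothetical rather than taken from any OLAP product.

```python
from collections import defaultdict

DIMS = ("product", "month", "country")

# Made-up cube: (product, month, country) -> sales volume.
cube = {
    ("TV", "Jan", "Ireland"): 10,
    ("TV", "Feb", "Ireland"): 12,
    ("TV", "Jan", "France"): 7,
    ("PC", "Jan", "Ireland"): 5,
    ("PC", "Feb", "France"): 9,
}
QUARTER = {"Jan": "Q1", "Feb": "Q1"}

def roll_up_drop(cube, dim):
    """Roll-up by dimension reduction: sum out one dimension."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)

def roll_up_month_to_quarter(cube):
    """Roll-up by climbing the hierarchy month -> quarter."""
    out = defaultdict(int)
    for (product, month, country), value in cube.items():
        out[(product, QUARTER[month], country)] += value
    return dict(out)

def slice_(cube, dim, value):
    """Slice: fix one dimension to a single value and drop it from the key."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}

def dice(cube, **allowed):
    """Dice: keep the cells whose values fall in the allowed sets."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}

def pivot(cube, order):
    """Pivot: reorder the axes of every cell key."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cube.items()}

print(roll_up_drop(cube, "month"))
print(roll_up_month_to_quarter(cube))
print(slice_(cube, "month", "Jan"))
print(dice(cube, product={"TV"}, country={"Ireland"}))
print(pivot(cube, ("country", "product", "month")))
```

Drill-down is the reverse of the two roll-up functions and needs the more detailed cells to be available, which is why it is not derived here from the summarised output.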
A Star-Net Query Model
Design of a Data Warehouse: A Business
Analysis Framework
Four views regarding the design of a data
warehouse
- Top-down view
o allows selection of the relevant
information necessary for the data
warehouse
- Data source view
o exposes the information being
captured, stored, and managed by
operational systems
- Data warehouse view
o consists of fact tables and dimension
tables
- Business query view
o sees the perspectives of data in the
warehouse from the viewpoint of the end user
Data Warehouse Design Process
Top-down, bottom-up approaches or a
combination of both
- Top-down: starts with overall design and
planning (mature)
- Bottom-up: starts with experiments and
prototypes (rapid)
From a software engineering point of view
- Steps: planning, data collection, DW design,
testing and evaluation, DW deployment
- Waterfall: structured and systematic analysis
at each step before proceeding to the next
- Spiral: rapid generation of increasingly
functional systems, with short turnaround time
Typical data warehouse design process
- Choose a business process to model, e.g.,
orders, invoices, etc.
- Choose the grain (atomic level of data) of the
business process
- Choose the dimensions that will apply to each
fact table record
- Choose the measures that will populate each
fact table record
Multi-Tiered Architecture
Three Data Warehouse Models
Enterprise warehouse
- collects all of the information about subjects
spanning the entire organisation
Data Mart
- a subset of corporate-wide data that is of
value to a specific group of users. Its scope is
confined to specific, selected groups, such as
a marketing data mart
o Independent vs. dependent (directly from
warehouse) data mart
Virtual warehouse
- A set of views over operational databases
- Only some of the possible summary views
may be materialised
Data Warehouse Development:
A Recommended Approach
OLAP Server Architectures
Relational OLAP (ROLAP)
- Uses a relational or extended-relational DBMS
to store and manage warehouse data, and
OLAP middleware to support missing pieces
- Includes optimisation of the DBMS backend,
implementation of aggregation navigation
logic, and additional tools and services
- Greater scalability
Multidimensional OLAP (MOLAP)
- Array-based multidimensional storage engine
(sparse matrix techniques)
- Fast indexing to pre-computed summarised
data
Hybrid OLAP (HOLAP)
- User flexibility, e.g., low level: relational,
high level: array
Specialised SQL servers
- Specialised support for SQL queries over
star/snowflake schemas
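The contrast between ROLAP and MOLAP storage can be hinted at with a toy example. The Python below is a simplified sketch and not how any real OLAP server is implemented: the ROLAP side is a list of fact rows aggregated at query time, the MOLAP side is a dense 2-D array with a pre-computed summary, and all names and figures are made up.

```python
# ROLAP-style storage: one row per recorded fact (product, month, sales).
fact_rows = [("TV", "Jan", 10), ("TV", "Feb", 12), ("PC", "Jan", 5)]

def rolap_total(product):
    """Aggregate at query time by scanning the fact rows."""
    return sum(sales for p, _, sales in fact_rows if p == product)

# MOLAP-style storage: a dense array addressed by dimension positions.
products = ["TV", "PC"]
months = ["Jan", "Feb"]
array = [[0] * len(months) for _ in products]
for p, m, sales in fact_rows:
    array[products.index(p)][months.index(m)] = sales

# Pre-computed summary: total sales per product.
total_by_product = [sum(row) for row in array]

def molap_total(product):
    """Look up the pre-computed summary by array position."""
    return total_by_product[products.index(product)]

print(rolap_total("TV"), molap_total("TV"))
print(rolap_total("PC"), molap_total("PC"))
print(array)
```

The cell for (PC, Feb) has no fact row yet still occupies a slot in the dense array, which is the kind of waste that sparse matrix techniques are meant to address.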
Homework 1a
Suppose that a data warehouse for a Big University
consists of the following 4 dimensions: student,
module, semester, and lecturer, and 2 measures: count
and avg_grade. At the lowest conceptual level
(e.g., for a given student, module, semester, and
lecturer combination), the avg_grade measure stores
the actual module grade of the student. At higher
conceptual levels, avg_grade stores the average grade
for the given combination.
1. Draw a snowflake schema diagram for the data
warehouse.
2. Starting with the base cuboid [student, module,
semester, lecturer], what specific OLAP operations
(e.g., roll-up from semester to year) should one
perform in order to list the average grade of CS
modules for each Big University student?
3. If each dimension has 5 levels (including all), such
as “student < major < status < university < all”, how
many cuboids will this cube contain (including
the base and apex cuboids)?

Data mining 3 - Data Models and Data Warehouse Design (cheat sheet - printable)
