17th Nov. 2016, ER'2016 @ Gifu, Japan
Multi-Dimensional Database Modeling and
Querying: Methods, Experiences and
Challenging Problems
Alfredo Cuzzocrea University of Trieste & ICAR
Rim Moussa LaTICE Lab. & University of Carthage
The 35th Intl. Conference on Conceptual Modeling
@ Gifu, JAPAN
17th of November, 2016
Tutorial Outline
Part I: Data Warehouse Systems
Decision Support Systems
DWS Architectures
DW Schemas
OLAP cube and OLAP operations
DSS Benchmarks
OLAP Mandate
Part II: Multi-dimensional design
Part III: Challenging Problems
Conclusion
Translating Data into Insights & Opportunities!
From DIKW Pyramid To Decision Support System
»Data: descriptions of things, events, activities and transactions; internal or external
»Information: consolidated and organized data that has meaning, value and purpose; "know-thats"
»Knowledge: processed data that conveys understanding, experience or learning applicable to a problem or activity
»Wisdom: the ability to increase effectiveness; knowing what to do and how to act
Decision Support System
»A collection of integrated software applications and hardware that form the
backbone of an organization’s decision making process.
»Performs different types of analyses
»What-if analysis
»Simulation analysis
»Goal-seeking analysis
Business Intelligence
»BI encompasses a variety of tools, applications and methodologies that enable
organizations to collect data from internal systems and external sources,
prepare it for analysis, develop and run queries against the data, and create
reports, dashboards and data visualizations to make the analytical results
available to corporate decision makers as well as operational workers.
»Integration (ETL) tools, data warehousing tools, OLAP technologies, OLAP
clients, mining structures, reporting tools, dashboards’ design tools
»Integration workflows design methods, Multi-dimensional design methods, …
Decision Support System
OLTP vs OLAP (1/3)
OLTP: On-Line Transaction Processing
»Users/ applications interacting with database in real-time
»e.g.: on-line banking, on-line payment
»Required properties of memory access
ACID properties: Atomicity, Consistency, Isolation, Durability
Coherence
»Benchmarks: TPC-C, TPC-E
OLAP: On-Line Analytical Processing
»Experts doing offline data analysis
»e.g.: data warehouses, decision support systems
»Memory patterns: Indexed and Sequential access patterns
»Benchmarks: TPC-H, TPC-DS
Decision Support System
OLTP vs OLAP (3/3)
OLTP
 Users/apps interacting with the DB in real-time
 E.g., customers buying/selling books at Amazon
 Many concurrent users, queries, connections
 DB is in 3NF
 Simple queries, often predetermined
 Relatively little data (~GB)
 Each query touches little data
 Little, simple computation per query
 Data is continually updated
 Accuracy and recovery are important, hence strict transactions
 Throughput is most important
OLAP
 Business experts doing offline data analysis
 E.g., bestsellers at Amazon
 Few concurrent users, queries, connections
 DB is denormalized
 Complex ad-hoc queries
 Very large data sets (~TB)
 Queries touch large data sets
 Mining is compute-intensive, with complex operations
 Data is mostly read-only
 Strict transactional semantics are not needed
 Latency is more important
Data Integration
Data Integration is the process of integrating data from multiple sources
to provide a single view over all these sources
Integration can be physical (copy the data to warehouse) or virtual
(keep the data at the sources)
Data Acquisition: This is the process of moving company data from
the source systems into the warehouse. Data acquisition is an
ongoing, scheduled process, which is executed to keep the warehouse
current to a pre-determined period in time.
Changed Data Capture: The periodic update of the warehouse from
the transactional system(s) is complicated by the difficulty of
identifying which records in the source have changed since the last
update.
Data Cleansing: This is typically performed in conjunction with data
acquisition. It is a complicated process that validates and, if necessary,
corrects the data before it is inserted into the warehouse
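The changed-data-capture step can be sketched by diffing the current source extract against the previous snapshot. Below is a minimal Python illustration; the `id` key and row layout are invented for the example and not tied to any specific ETL tool.

```python
# Snapshot-diff CDC sketch: compare the previous extract with the current
# one to classify rows as inserted, updated or deleted.
def capture_changes(previous, current, key="id"):
    prev_by_key = {row[key]: row for row in previous}
    inserts, updates = [], []
    for row in current:
        old = prev_by_key.get(row[key])
        if old is None:
            inserts.append(row)       # new record
        elif old != row:
            updates.append(row)       # changed record
    deleted_keys = set(prev_by_key) - {row[key] for row in current}
    return inserts, updates, sorted(deleted_keys)

prev = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
curr = [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}]
ins, upd, dele = capture_changes(prev, curr)
# ins -> [{"id": 3, "qty": 1}], upd -> [{"id": 2, "qty": 9}], dele -> [1]
```

In practice, CDC tools avoid full-snapshot diffs by reading transaction logs or timestamps, but the classification of records is the same.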
Data Integration
Software Solutions
»Known as ETL tools, Integration services, ...
»Vendors (non-exhaustive list): IBM (InfoSphere Data Event Publisher),
Informatica (Informatica Enterprise Data Integration), Information
Builders (iWay DataMigrator), Microsoft (Microsoft's SQL Server
Integration Services), Oracle (Oracle GoldenGate), Pentaho (Pentaho DI -
prev. Kettle), SAP (BusinessObjects Data Integrator), SAS (SAS DataFlux)
and Talend (Talend Open Studio).
Kettle (Pentaho) and Talend offer open-source versions.
Microsoft bundles integration services with its database products.
Integrating Heterogeneous data sources: Lazy or Eager?
Lazy Integration aka Query-driven approach
»Accept a query, determine which sources can answer it, and devise the
execution tree: generate appropriate sub-queries for each source, obtain
result sets, perform the required post-processing (translation, merging and
filtering), and return the answer to the application
Eager Integration aka Warehouse approach
»Information of each source of interest is extracted in-advance, translated, filtered
and processed as appropriate, then merged with information from other sources
and stored
»Accept a query, answer query from repository
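The lazy approach can be sketched as a mediator fanning a query out to per-source wrappers, then merging and filtering the partial results. The `Wrapper`/`Mediator` classes and sample sources below are illustrative assumptions, not a real integration system.

```python
class Wrapper:
    """Translates the mediator's query into a source-specific lookup."""
    def __init__(self, source):
        self.source = source  # a list of dicts standing in for a data source

    def query(self, predicate):
        return [row for row in self.source if predicate(row)]

class Mediator:
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, predicate):
        results = []
        for w in self.wrappers:          # sub-query each source
            results.extend(w.query(predicate))
        # post-processing: merge and de-duplicate on a common key
        seen, merged = set(), []
        for row in results:
            if row["id"] not in seen:
                seen.add(row["id"])
                merged.append(row)
        return merged

src_a = [{"id": 1, "amount": 10}, {"id": 2, "amount": 99}]
src_b = [{"id": 2, "amount": 99}, {"id": 3, "amount": 50}]
mediator = Mediator([Wrapper(src_a), Wrapper(src_b)])
answer = mediator.query(lambda r: r["amount"] >= 50)
# answer -> [{"id": 2, "amount": 99}, {"id": 3, "amount": 50}]
```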
Data Integration
Eager vs. Lazy ?
Eager
 Integrate in advance, i.e., before queries
 Copy data from sources
 Query answer set could be stale; need to refresh the data warehouse
 Operates even when sources are unavailable
 High query performance through pre-built data summaries; local
processing at the sources is unaffected
 OLAP queries are not visible outside the warehouse
Lazy
 Integrate on demand, i.e., at query time
 Leave data at the sources
 Query answer set is up to date
 No copy, so no refresh or storage cost
 Out of service if sources are unavailable
 Sources are drained: interference with their local processing
Lazy Data Integration
 Query-driven Architecture
[Diagram: client queries go to a MEDIATOR, which dispatches sub-queries to a WRAPPER per Source]
Eager Data Integration
 Warehouse System Architecture
Data Warehouse Definition
By Bill Inmon "A warehouse is a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of management's decision
making process"
»Subject-oriented: The data in the database is organized so that all the data
elements relating to the same real-world event or object are linked together;
e.g. finance, human resources, sales …
»Time-variant: The changes to the data in the database are tracked and recorded
so that reports can be produced showing changes over time. Hence, all addresses
of a customer are in the DW.
»Non-volatile: Data in the database is never over-written or deleted - once
committed, the data is static, read-only, but retained for future reporting;
»Integrated: Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole. There is only a single way to identify
a product
By Ralph Kimball: “A copy of transaction data specifically structured for
query and analysis”
Inmon's top-down approach vs. Kimball's bottom-up approach
Constellation Schema
»Galaxy schema
»Multiple fact tables which
share dimensions
Snowflake Schema
»Hierarchical relationships exist
among Dimensions Tables
»Normalized schema
Star Schema
»A single large central fact table
surrounded by multiple dimension tables
»All Dimensions Tables have a hierarchical
relationship with the Fact Table
Data Warehouse Schemas
[Diagrams: a star schema (one fact table linked to dimension tables 1-4), a snowflake schema (a fact table with dimension tables 1-7, some normalized off others), and a constellation schema (fact tables 1 and 2 sharing dimension tables 1-6)]
Multi-dimensional Data Analysis
»refers to the process of summarizing data across multiple levels (called
dimensions) and then presenting the results in a multi-dimensional grid
format.
OLAP Cube
»Multi-dimensional representation of data
»aka Crosstab, Data Pivot
Facts are the objects that represent the subject of the desired analyses.
»Examples: sales records, weather records, cab trips, …
»The fact table contains three types of attributes: measured attributes, foreign keys
to dimension tables, and degenerate dimensions
Dimension(s):
»Levels are individual values that make up dimensions
»Examples
»Date dimension (quarter, month, day)
»Time dimension (hour, min, sec)
»Geography dimension (Country, city, postal code)
Measure(s):
»Examples: revenue, lost revenue, sold quantities, expenses, forecasts, …
»Use aggregate functions: min, max, count, distinct-count, sum, average, …
OLAP Cube
OLAP Operations
On-Line Analytical Processing
OLAP Operations
Drill-down
stepping down to lower level data or introducing new dimensions
Roll-up
summarizes data by climbing up hierarchy or by dimension reduction
Pivot
rotate, reorienting the cube
Slice
selecting data on one dimension
Dice
selecting data on multiple dimensions
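The operations above can be illustrated on a toy cube held as a Python dict keyed by (year, country, product); the data and dimension names are invented for the example.

```python
# Toy cube: (year, country, product) -> sales.
cube = {
    (2015, "JP", "book"): 10, (2015, "JP", "pen"): 4,
    (2015, "FR", "book"): 6,  (2016, "JP", "book"): 8,
    (2016, "FR", "pen"): 3,
}

def slice_(cube, axis, value):
    # Slice: fix one dimension to a single value.
    return {k: v for k, v in cube.items() if k[axis] == value}

def dice(cube, predicate):
    # Dice: select on multiple dimensions at once.
    return {k: v for k, v in cube.items() if predicate(*k)}

def roll_up(cube, axis):
    # Roll-up by dimension reduction: aggregate one dimension away.
    out = {}
    for k, v in cube.items():
        reduced = k[:axis] + k[axis + 1:]
        out[reduced] = out.get(reduced, 0) + v
    return out

sales_2015 = slice_(cube, 0, 2015)                             # 3 cells
jp_2015 = dice(cube, lambda y, c, p: y == 2015 and c == "JP")  # 2 cells
by_year_country = roll_up(cube, 2)                             # drop product
```

Drill-down is the inverse of roll-up: it requires the finer-grained cells (here, the original `cube`) to still be available.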
OLAP Operations
Roll-up & Drill-down
OLAP Operations
Slice & Dice
Derived Data
OLAP Indices
»Inverted lists
»Bitmap Indices
»Join indices
»Text indices
Materialization of a data cube is a way to
pre-compute and store multi-dimensional
aggregates so that multi-dimensional
analysis can be performed on-the-fly [Li
et al., 2004]
Calculated Attributes
Aggregate Tables (aka materialized tables)
Data synopses: histograms and sketches
Trade-off: improved query performance vs. maintenance & refresh cost vs. storage requirements
Cuboids Lattice
What to materialize?
[Lattice diagram: the 0-D apex cuboid (All), 1-D cuboids (d1:Time, d2:Item, d3:Customer, d4:Supplier), 2-D cuboids (d1,d2 … d3,d4), 3-D cuboids, and the 4-D base cuboid d1,d2,d3,d4]
For an n-dimensional cube
»2^n cuboids
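The lattice is just the power set of the dimension set, which is easy to enumerate; a short Python check of the 2^n count:

```python
from itertools import combinations

def cuboids(dimensions):
    """Yield every cuboid (group-by subset), from the apex () to the base."""
    for r in range(len(dimensions) + 1):
        for subset in combinations(dimensions, r):
            yield subset

dims = ("d1:Time", "d2:Item", "d3:Customer", "d4:Supplier")
lattice = list(cuboids(dims))
# len(lattice) == 2**4 == 16: one 0-D apex, four 1-D, six 2-D, four 3-D, one 4-D
```

Because the count doubles with every dimension, materializing the full lattice quickly becomes infeasible, hence the selection problem.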
ROLAP Server
»Relational OLAP
»Uses relational or extended-relational DBMS to store and manage warehouse
data and OLAP middleware
»Includes optimization of DBMS back-end, implementation of aggregation
navigation logic, and additional tools and services
»High scalability
»E.g. Mondrian (Pentaho BI suite)
MOLAP Server
»Multidimensional OLAP
»Sparse array-based multidimensional storage engine
»Fast indexing to pre-computed summarized data
»E.g. Palo
HOLAP Server
»Low-level: relational / high-level: array
»High flexibility
»E.g. MSAS (Microsoft)
OLAP Servers
Structured Query Language (SQL)
»Relational and static schema
»Data Definition, data Manipulation, and Data Control Language
»Analytic Functions (window functions over partition by …)
»Cube, roll-up and grouping sets operators
MultiDimensional eXpressions (MDX)
»Invented by Microsoft in 1997
»For querying and manipulating the multidimensional data stored in OLAP cubes
»Static schema
Data Flow programming language
»Google Sawzall, Apache Pig Latin, IBM InfoSphere Streams
»Dynamic schema
»After data is loaded, multiple operators are applied on that data before the final
output is stored.
Query Languages
Load Data → Apply Schema → Apply Filter → Group Data → Apply Aggregate Function → Sort Data → Store Output
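The data-flow style can be mimicked by chaining plain functions over a list of records; a hedged Python sketch with invented field names and data:

```python
# Data-flow pipeline sketch: load -> apply schema -> filter -> group ->
# aggregate -> sort. Each stage consumes the previous stage's output.
raw = ["JP,book,10", "JP,pen,4", "FR,book,6"]          # "loaded" raw lines

def apply_schema(lines):
    # Assign field names only after loading (dynamic schema).
    return [dict(zip(("country", "product", "qty"), l.split(","))) for l in lines]

def apply_filter(rows):
    return [r for r in rows if r["product"] == "book"]

def group_and_aggregate(rows):
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0) + int(r["qty"])
    return totals

result = dict(sorted(group_and_aggregate(apply_filter(apply_schema(raw))).items()))
# result == {"FR": 6, "JP": 10}
```

In Pig Latin the same stages would be LOAD, FOREACH … GENERATE, FILTER, GROUP, an aggregate, ORDER and STORE.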
Query Languages
SQL – Q16 of TPC-H benchmark
Query Languages
MDX –Q16 of TPC-H Benchmark
WITH SET [Brands] AS 'Except({[Part Brand].Members}, {[Part
Brand].[Brand#45 ]})'
SET [Types] AS 'Filter({[Part Type].Members}, (NOT ([Part
Type].CurrentMember.Name MATCHES "(?i)MEDIUM POLISHED.*")))'
SET [Sizes] AS 'Filter({[Part Size].Members}, ([Part Size].CurrentMember IN
{[Part Size].[3], [Part Size].[9], [Part Size].[14], [Part Size].[19], [Part Size].[23],
[Part Size].[36], [Part Size].[45], [Part Size].[49]}))'
SELECT [Measures].[Supplier Count] ON COLUMNS,
nonemptyCrossjoin(nonemptyCrossjoin([Brands], [Types]), [Sizes]) ON ROWS
FROM [Cube16]
Query Languages
Data Flow –Pig Latin script for Q16 of TPC-H benchmark
Decision Support Systems Benchmarks
Non-TPC Benchmarks
Real datasets
»Open data or proprietary data
»fixed size
»Devise a workload or trace the proprietary workload
APB-1: no scale factor
TPC Benchmarks
The Transaction Processing Performance Council (TPC) was founded in 1988 to define benchmarks
In 2009, the TPC Technology Conference (TPCTC) was established as an international
conference series on performance evaluation and benchmarking
Examples of benchmarks relevant for benchmarking decision support
systems: TPC-H, TPC-DS and TPC-DI
Common characteristics of TPC benchmarks
»Synthetic data
»Scale factor allowing generation of different volumes, from 1 GB to 1 PB
Decision Support Systems Benchmarks
TPC-H Benchmark Schema (1/2)
TPC-H Benchmark: complex schema
22 ad-hoc SQL statements (star queries, nested queries, …) + refresh functions
Decision Support Systems Benchmarks
TPC-H Benchmark (2/2)
TPC-H Benchmark Metrics
2 Metrics
»QphH@Size is the number of queries per hour that the system
under test can handle for a fixed load
»$/QphH@Size represents the ratio of cost to performance, where the
cost is the cost of ownership of the SUT (hardware, software,
maintenance).
Variants of TPC-H Benchmarks
TPC-H*d Benchmark –detailed in Part II
»Turning TPC-H benchmark into a Multi-dimensional benchmark
»Few schema changes
»No update to workload requirement
»MDX workload for OLAP cubes and OLAP queries
SSB: Star Schema Benchmark
»Turning TPC-H benchmark into star-schema
»Workload composed of 12 queries
TPC-H translated into Pig Latin (Apache Hadoop Ecosystem)
»22 Pig Latin scripts which load and process TPC-H raw data files (.tbl files)
Decision Support Systems Benchmarks
TPC-DS Benchmark (1/2)
TPC-DS Benchmark: 7 data marts
Decision Support Systems Benchmarks
TPC-DS Benchmark (2/2)
TPC-DS Benchmark Workload
Hundreds of queries (99 query templates)
OLAP, windowing functions, mining, and reporting queries
Concurrent data maintenance
TPC-DS Benchmark Metrics
3 Metrics
»QphDS@Size is the number of queries per hour that the
system under test can handle for a fixed load.
»Data maintenance and load times are measured and included.
»$/QphDS@Size represents the ratio of cost to performance, where the
cost is a 3-year cost of ownership of the SUT (hardware, software,
maintenance)
»System Availability Date: the date when the system is available to
customers.
TPC-DS implementations
TPC-DS v2.0
»Extension for non-relational systems such as Hadoop/Spark big data
systems
Decision Support Systems Benchmarks
TPC-DI Benchmark (1/3)
For benchmarking Data Integration technologies
Synthetic data of a fictitious retail brokerage firm
»Internal Trading system data, Internal Human resources data, Internal
CRM System and External data
»Different data scales
»Data extracted from different sources:
»Structured (CSV)
»Semi-structured (XML)
»Multi-record files
»Change Data Capture (CDC)
Complex Data Integration Tasks
Load large volumes of historical data
Load incremental updates
Execute complex transformations
Check and ensure consistency of data
TPC-DI Benchmark
Complex Transformations (2/3)
TPC-DI implements 18 complex transformations, which include the
following tasks:
Transform XML into relational data
Detect changes in dimension data, and apply appropriate tracking
mechanisms for history-keeping dimensions
Filter input data according to pre-defined conditions
Identify new, deleted and updated records in input data
Merge multiple input files of the same structure
Join data of one input file to data from another input file with different
structure
Standardize entries of the input files
Join data from input file to dimension table
Join data from multiple input files with separate structures
Consolidate multiple change records per day and identify the most current
Perform extensive arithmetic calculations
Read data from files with variable type records
Check data for errors or for adherence to business rules
Detect changes in fact data, and journal updates to reflect the current state
TPC-DI Benchmark
Metrics (3/3)
Metrics
Performance Metric
TPC_DI_RPS = Trunc(GeoMean(TH, Min(TI1 , TI2)))
TH: Throughput of Historical Load
TI1: Throughput of Incremental Update 1
TI2: Throughput of Incremental Update 2
Price/Performance Metric
Price-per-TPC_DI_RPS = $ / TPC_DI_RPS
$ is the total 3-year pricing
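The arithmetic of the metric is straightforward; a small Python sketch with invented throughput figures (the benchmark specification defines precisely how TH, TI1 and TI2 are measured):

```python
import math

def tpc_di_rps(th, ti1, ti2):
    # TPC_DI_RPS = Trunc(GeoMean(TH, Min(TI1, TI2)))
    geomean = math.sqrt(th * min(ti1, ti2))   # geometric mean of two values
    return math.trunc(geomean)

def price_per_rps(total_3yr_price, rps):
    # Price/performance metric: total 3-year pricing divided by TPC_DI_RPS.
    return total_3yr_price / rps

rps = tpc_di_rps(th=9000, ti1=4000, ti2=4500)   # trunc(sqrt(9000 * 4000)) = 6000
```

Taking the minimum of the two incremental-update throughputs penalizes systems whose update performance degrades between runs.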
OLAP Mandate
12 Rules for Evaluating OLAP products by E.F. Codd et al.
Rule#1: Multidimensional Conceptual View
»The multidimensional conceptual view facilitates OLAP model design and
analysis, as well as inter and intra dimensional calculations.
Rule#2: Transparency
»Whether OLAP is or is not part of the user’s customary front-end product,
that fact should be transparent to the user. If OLAP is provided within the
context of a client server architecture, then this fact should be transparent to
the user-analyst as well.
Rule#3: Accessibility
»The OLAP tool must map its own logical schema to heterogeneous physical
data stores, access the data, and perform any conversions necessary to
present a single, coherent and consistent user view. Moreover, the tool and
not the end-user analyst must be concerned about where or from which type
of systems the physical data is actually coming
OLAP Mandate
12 Rules for Evaluating OLAP products by E.F. Codd et al.
Rule#4: Consistent Reporting Performance
»As the number of dimensions or the size of the database increases, the OLAP
user-analyst should not perceive any significant degradation in reporting
performance.
Rule#5: Client-Server Architecture
»Most data currently requiring on-line analytical processing is stored on
mainframe systems and accessed via personal computers. It is imperative that
the server component of OLAP tools be sufficiently intelligent such that
various clients can be attached with minimum effort and integration
programming.
Rule#6: Generic Dimensionality
»Every data dimension must be equivalent in both its structure and
operational capabilities. Dimensions are symmetric, so the basic data
structure, formulae, and reporting formats should not be biased toward any
one data dimension.
OLAP Mandate
12 Rules for Evaluating OLAP products by E.F. Codd et al.
Rule#7: Dynamic Sparse Matrix Handling
»The OLAP tools’ physical schema must adapt fully to the specific analytical
model being created to provide optimal sparse matrix handling.
»By adapting its physical data schema to the specific analytical model, OLAP
tools can empower user analysts to easily perform types of analysis which
previously have been avoided because of their perceived complexity.
Rule#8: Multi-User Support
»OLAP tools must provide concurrent access (retrieval and update), integrity,
and security.
Rule#9: Unrestricted Cross-dimensional Operations
»The various roll-up levels within consolidation paths, due to their inherent
hierarchical nature, represent in outline form, the majority of 1:1, 1:M, and
dependent relationships in an OLAP model or application. Accordingly, the
tool itself should infer the associated calculations and not require the user-
analyst to explicitly define these inherent calculations.
OLAP Mandate
12 Rules for Evaluating OLAP products by E.F. Codd et al.
Rule#10: Intuitive Data Manipulation
»Consolidation path re-orientation, drilling down across columns or rows,
zooming out, and other manipulation inherent in the consolidation path
outlines should be accomplished via direct action upon the cells of the
analytical model, and should neither require the use of a menu nor multiple
trips across the user interface.
Rule#11: Flexible Reporting
»Reporting must be capable of presenting data to be synthesized, or
information resulting from animation of the data model according to any
possible orientation. This means that the rows, columns, or page headings
must each be capable of containing/displaying from 0 to N dimensions each,
where N is the number of dimensions in the entire analytical model.
Rule#12: Unlimited Dimensions and Aggregation Levels
»An OLAP tool should be able to accommodate at least fifteen and preferably
twenty data dimensions within a common analytical model.
References
 M. Fricke, The Knowledge Pyramid: A Critique of the DIKW Hierarchy. Journal of Information
Science. 2009.
 E.F. Codd, S.B. Codd and C.T. Salley, Providing OLAP to User Analysts: an IT mandate, 1993.
 J. Widom, Integrating Heterogeneous Databases: Lazy or Eager? ACM Computing Surveys,
28(4es), 1996
 Y.R. Cho, Data Warehouse and OLAP Operations www.ecs.baylor.edu/faculty/cho/4352
 TPC homepage http://www.tpc.org/
 M. Poess, T. Rabl and B. Caufield: TPC-DI: The First Industry Benchmark for Data
Integration. PVLDB 7(13): 1367-1378 (2014)
http://www.vldb.org/pvldb/vol7/p1367-poess.pdf
 X. Li, J. Han, H. Gonzalez: High-Dimensional OLAP: A Minimal Cubing Approach. VLDB 2004.
 C. Imhoff, N. Galemmo, J. G. Geiger. Mastering Data Warehouse Design: Relational and
Dimensional Techniques. 2003.
 R. Kimball, M. Ross, W. Thornthwaite, J. Mundy, B. Becker. The Data Warehouse
Lifecycle Toolkit. 2nd Edition.
 R. Kimball, M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. 2nd Edition.
 H. Garcia-Molina. Data Warehousing Overview: Issues, Terminology, Products.
www.cs.uh.edu/~ceick/6340/dw-olap.ppt (slides)
Thank you for your Attention
Q & A
Multi-Dimensional Database Modeling and Querying: Methods,
Experiences and Challenging Problems
Alfredo Cuzzocrea and Rim Moussa
17th of November, 2016
Multi-Dimensional Database Modeling and
Querying: Methods, Experiences and
Challenging Problems
Part II
Multi-dimensional Benchmark Design
Alfredo Cuzzocrea University of Trieste & ICAR
Rim Moussa University of Carthage & LaTICE
The 35th Intl. Conference on Conceptual Modeling
@ Gifu, JAPAN
17th of November, 2016
Tutorial Outline
Introduction
Part I: State-of-the-Art
Part II: Experiences
TPC-H*d Experience
AutoMDB
TPC-DS*d
Part III: Challenging Problems
Conclusion
Given,
A relational Warehouse schema
A workload, i.e., a set of SQL statements,
W = {Q1, Q2, …, Qn}
where Qi is a parameterized query
How to design the Multi-dimensional DB Schema?
How to define cubes?
Will there be a single cube or multiple cubes? Are there any rules for merging of
cubes? Are there any rules for definition of virtual cubes?
Which optimizations are suitable for performance tuning?
Derived data computation & refresh?
Data partitioning & parallel cube building?
Problem
Idea
Map each business question to an OLAP cube
>> Obtain a multi-dimensional DB schema
Recommend & Test Optimizations
>> Derived Data
>> Data partitioning
>> Cube Merging
SELECT t1.col_a, t1.col_b, …, tn.col_a, tn.col_z,
aggregate_function(column) as measure_1, …,
aggregate_function(expression) as measure_m
FROM table_1 t1, table_2 t2, …, table_n tn
WHERE ti.col_x operator $query_parameter$
AND ti.col_y = tj.col_z
AND …
GROUP BY t1.col_a, t1.col_b, …, tn.col_a, tn.col_z
aggregate_function: min, max, sum, avg, count, count-distinct …
Operator: =, < , <=, >=, !=
SQL Statement Template
Measures feature aggregate functions,
e.g. min, max, count, count-distinct, sum, average, …
Simple Measure
Defined over a single attribute,
e.g. SUM(l_extendedprice),
Measure expressions
Involve more than one attribute,
e.g. SUM(l_extendedprice*(1 - l_discount))
Computed Members
Involve already defined measures or measure expressions,
e.g. M1=SUM(l_extendedprice), M2=COUNT(l_orderkey),
CM = M1 / M2
OLAP Cube Design: Measures’ Definition
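The three kinds of measures can be illustrated over a couple of invented lineitem rows, with Python standing in for the aggregation an OLAP engine would perform:

```python
# Two made-up TPC-H-style lineitem rows.
lineitem = [
    {"l_orderkey": 1, "l_extendedprice": 100.0, "l_discount": 0.10},
    {"l_orderkey": 2, "l_extendedprice": 200.0, "l_discount": 0.05},
]

# Simple measure: one aggregate over a single attribute.
m1 = sum(r["l_extendedprice"] for r in lineitem)

# Measure expression: aggregate over an expression of several attributes.
revenue = sum(r["l_extendedprice"] * (1 - r["l_discount"]) for r in lineitem)

# Computed member: defined over already-defined measures (CM = M1 / M2).
m2 = len({r["l_orderkey"] for r in lineitem})   # COUNT(l_orderkey)
cm = m1 / m2
# m1 == 300.0, revenue == 280.0, m2 == 2, cm == 150.0
```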
All attributes involved in measures and measure expressions
belong to the fact table,
Example: Q10 of TPC-H benchmark
OLAP Cube Design
Fact Table Definition (1/9)
When measurable attributes belong to different tables, the fact
table is a view defined as the join of the relations to which the
attributes belong.
Example 1: Q9 of TPC-H benchmark, where l_extendedprice,
l_discount and l_quantity belong to lineitem, and
ps_supplycost belongs to partsupp.
The fact table is the join of lineitem and partsupp tables. Only
attributes needed for joins with dimension tables (namely l_partkey,
l_orderkey, l_suppkey) and measurable attributes (namely
l_extendedprice, l_discount, l_quantity,
ps_supplycost) are selected.
OLAP Cube Design
Fact Table Definition over multiple tables (2/9)
Q9 SQL statement
OLAP Cube Design
Fact Table Definition over multiple tables (3/9)
OLAP Cube Design
Fact Table Definition over multiple tables (4/9)
Q14 SQL statement
OLAP Cube Design
Fact Table Definition over multiple tables (5/9)
Example 2: Q14 of TPC-H benchmark, where l_extendedprice
and l_discount belong to lineitem, and p_type belongs to
part.
The fact table is the join of lineitem and part tables
Filters Processing: the fact table is defined as a view of facts
with filters
»Extract all filters involving the fact table from the WHERE
clause, such as
(attr_i operator attr_j), where both attr_i and attr_j
belong to the fact table,
(attr_k operator $value$), such that attr_k belongs to the
fact table,
[not] exists (select … from … where attr_k …), such
that attr_k belongs to the fact table,
attr_k [not] in (list of values), such that attr_k
belongs to the fact table,
Example 1: Q10 of TPC-H benchmark
Example 2: Q16 of TPC-H benchmark
Example 3: Q21 of TPC-H benchmark
OLAP Cube Design
Fact Table Definition and filters’ processing (6/9)
Q10 SQL statement
OLAP Cube Design
Fact Table Definition and filters’ processing (7/9)
Q16 SQL statement
OLAP Cube Design
Fact Table Definition and filters’ processing (8/9)
Q21 SQL statement
OLAP Cube Design
Fact Table Definition and filters’ processing (9/9)
First, consider all attributes in the SELECT, WHERE and GROUP
BY clauses,
»Discard measurable attributes, which appear in
measures, measure expressions, or computed members,
»Discard attributes which appear in the WHERE clause
and are used for joining tables or filtering the fact table
with static values,
»Compose time dimension along well known hierarchies,
»Year, quarter, month
»Compose geography dimension along well known
hierarchies,
»Region, nation, city
OLAP Cube Design
Dimension Definition (1/7)
Example: Q10 of TPC-H benchmark
All highlighted attributes are considered for building dimensions
The time dimension over o_orderdate requires order_year and order_quarter
levels
OLAP Cube Design
Dimension Definition (2/7)
Second, find out hierarchical relations, i.e., one-to-many
relationships, and re-organize attributes along hierarchies to
form dimensions’ hierarchies,
»Example: Q10 of TPC-H benchmark
 each customer can be related to at most one nation, but
a nation may be related to many customers
customer_dim:
Customer nation → n_name
Customer details → c_custkey, c_name, c_acctbal, c_address,
c_phone, c_comment
order_dim:
order_year → order_quarter
OLAP Cube Design
Dimension Definition (3/7)
Third, distinguish levels from properties.
Properties are in functional dependency with levels,
Example: Q10 of TPC-H benchmark
 For customer_dim, c_custkey is the level, and all of c_name,
c_acctbal, c_address, c_phone, c_comment are
properties of the c_custkey level.
OLAP Cube Design
Dimension Definition (3/7)
Filters Processing: not all tuples in the dimension table should
be considered, so we extract from the WHERE clause the filters
defined over dimension tables that are not useful for
multi-dimensional design,
Example 1: Q12 of TPC-H benchmark
For each line shipping mode and year, count the number of high-
priority orders (high line count) and the number of non-high-priority
orders (low line count) over orders' facts, considering only lines
such that l_commit_date < l_receipt_date and
l_ship_date < l_commit_date. These are filters over the
dimension table.
Example 2: Q19 of TPC-H benchmark
Calculate revenue for particular parts
OLAP Cube Design
Dimension Definition and Filters’ processing (4/7)
Example 1: Q12 of TPC-H Benchmark
OLAP Cube Design
Dimension Definition and Filters’ processing (5/7)
Example 2: Q19 of TPC-H Benchmark.
OLAP Cube Design
Dimension Definition and Filters’ processing (6/7)
OLAP Cube Design
Dimension Definition and Filters’ processing (7/7)
TPC-H*d
Truly OLAP variant of TPC-H benchmark
TPC-H SQL workload translated into MDX (MultiDimensional
eXpressions)
The workload is composed of 23 MDX statements for OLAP
cubes and 23 MDX statements for OLAP business queries.
Each business question of TPC-H benchmark is mapped to an OLAP
cube
TPC-H*d
Q8: From SQL statement to OLAP cube
TPC-H*d
TPC-H*d OLAP Cube C8
Market Share for each supplier nation within a region of customers,
for each year and each part type
TPC-H*d
TPC-H*d OLAP Query Q8
Market share of RUSSIAN suppliers within the AMERICA region,
over the years 1995 and 1996, for part type ECO. ANODIZED STEEL
Open-source software implemented in Java
Parses MDB schema (.xml) files using the SAX library
Performs comparisons of OLAP cubes' characteristics
»For each pair of OLAP cubes,
»show whether they have the same fact table or not
»compute the number of shared | different | coalescable dimensions
»Dimensions are coalescable if they are extracted from the same dimension
table and their hierarchies are coalescable
»compute the number of shared | different measures
»Run merges of OLAP cubes using different similarity functions
»Simple distance function: have or not the same fact table
»K-means clustering
»Distance function is computed with weights assigned to dimensions
»Propose virtual cubes
»Auto-generate a new MDB schema (.xml)
»Create an MDB schema from a SQL workload
»On-going; tests include the TPC-DS benchmark
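The pairwise comparison step can be sketched as set operations over cube descriptors. The descriptor format and field names below are illustrative, not AutoMDB's actual data model:

```python
# Pairwise cube comparison as described above, on minimal cube descriptors
# (the descriptor format and dimension names are illustrative).
c8  = {"fact": "lineorder",
       "dims": {"Year", "PartType", "CustRegion", "SuppNation"},
       "measures": {"revenue", "mkt_share"}}
c19 = {"fact": "lineorder",
       "dims": {"PartType", "Brand", "Container"},
       "measures": {"revenue"}}

def compare(a, b):
    return {
        "same_fact": a["fact"] == b["fact"],
        "shared_dims": a["dims"] & b["dims"],
        "different_dims": a["dims"] ^ b["dims"],   # symmetric difference
        "shared_measures": a["measures"] & b["measures"],
    }

report = compare(c8, c19)
print(report["same_fact"], sorted(report["shared_dims"]))  # True ['PartType']
```

Cubes that share a fact table and many dimensions are candidates for merging into a virtual cube.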
AutoMDB
AutoMDB
Load OLAP Cubes defined in xml file
AutoMDB
Compare OLAP Cubes – same fact table or not
AutoMDB
Compare Cubes – group cubes which share the same fact table
AutoMDB
Compare Cubes – auto-generate a new MDB schema
References
Modeling Multidimensional Databases (non-exhaustive list)
M. Gyssens and L. V.S. Lakshmanan. A Foundation for Multi-Dimensional Databases.
VLDB’1997.
R. Agrawal, A. Gupta and S. Sarawagi. Modeling Multidimensional Databases.
ICDE’1997.
J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Data Cube: A Relational Aggregation
Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. ICDE’1996.
P. Vassiliadis. Modeling Multidimensional Databases, Cubes and Cube Operations.
SSDBM’1998.
L. Cabibbo and R. Torlone. A Logical Approach to Multidimensional Databases.
EDBT’1998.
D. Cheung, B. Zhou, B. Kao, H. Lu, T. Lam and H. Ting. Requirement-based data cube
schema design. CIKM’1999.
T. Niemi, J. Nummenmaa and P. Thanisch. Constructing OLAP cubes based on Queries.
DOLAP’2001.
O. Teste. Towards Conceptual Multidimensional Design in Decision Support Systems.
DEXA’2010.
A. Cuzzocrea and R. Moussa. Multidimensional Database Design via Schema
Transformation: Turning TPC-H into the TPC-H*d Multidimensional Benchmark.
COMAD’2013.
Thank you for your Attention
Q & A
Multi-Dimensional Database Modeling and Querying: Methods,
Experiences and Challenging Problems
Alfredo Cuzzocrea and Rim Moussa
17th of November, 2016
Multi-Dimensional Database Modeling and
Querying: Methods, Experiences and
Challenging Problems
Part III
Challenging Problems
Alfredo Cuzzocrea University of Trieste & ICAR
Rim Moussa University of Carthage & LaTICE
The 35th Intl. Conference on Conceptual Modeling
@ Gifu, JAPAN
17th of November, 2016
Tutorial Outline
Introduction
Part I: State-of-the-Art
Part II: Experiences
Part III: Challenges
Big Data Integration
Flexible Schema Model
Curse of dimensionality
Systems which scale-out
Intelligent Recommenders
Real-time OLAP
Advanced Visualization
Conclusion
Big Data
Volume
»Volume refers to the amount of data; hence the challenge is
integration at scale,
Velocity
»Velocity refers to the speed at which new data is generated; hence
the challenge is to integrate and analyze data while it is being generated,
Variety
»Variety refers to different types of data, e.g. structured (relational data),
semi-structured (XML, JSON, BSON), unstructured (text); hence the
challenge is the integration of different types of data,
Veracity
»Veracity refers to the messiness or trustworthiness of the data;
hence the challenge is to cope with uncertain data quality in the data
sources,
Value
»Value refers to our ability to turn data into value.
Advanced and high-performance technologies
»Load and extract different data formats, and perform
complex transformations
»Solve heterogeneity
»data type heterogeneity (a phone is stored as a number or a string),
»semantic heterogeneity (column title vs. column job title),
»value heterogeneity (Pr. vs. Prof. vs. Professor)
»entity resolution: identifying the same entity when values are misspelled,
synonymous or abbreviated, or originate from different systems
(date formats) or different domains (true is 1)
»Syncing Across Data Sources
»data copies migrated from a wide range of sources at different rates
and schedules can rapidly get out of synchronization
»Perform data integration at scale
Low Solution Cost
TPC-DI is a good start for benchmarking integration
technologies
TPC-DI implementation is on-going
Challenge #1: Big Data Integration
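A toy cleansing step for the value-heterogeneity cases listed above (titles, booleans, date formats). The synonym table, truthy set and record fields are invented for illustration:

```python
# Value-heterogeneity normalization of the kind listed above:
# Pr. / Prof. / Professor, "1" vs. true, and two date formats.
from datetime import datetime

TITLE_SYNONYMS = {"pr.": "Professor", "prof.": "Professor", "professor": "Professor"}
TRUTHY = {"1", "true", "yes", "y"}

def normalize(record):
    out = dict(record)
    # Map title synonyms/abbreviations onto one canonical value.
    out["title"] = TITLE_SYNONYMS.get(record["title"].strip().lower(), record["title"])
    # Domain heterogeneity: "1" and "true" both mean boolean true.
    out["active"] = record["active"].strip().lower() in TRUTHY
    # Accept two common date formats and emit ISO 8601.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            out["hired"] = datetime.strptime(record["hired"], fmt).date().isoformat()
            break
        except ValueError:
            pass
    return out

r = normalize({"title": "Pr.", "active": "1", "hired": "05/11/2016"})
print(r)  # {'title': 'Professor', 'active': True, 'hired': '2016-11-05'}
```

Real ETL tools apply rule sets of exactly this shape, at scale and with provenance tracking.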
Challenge #2: Data Schema Model
Schema-based data
»E.g. the relational model, the multi-dimensional model
»They define which columns appear, their names, and their datatypes.
Schemaless Data and Dynamic Schema
»Non-uniform data: custom fields and non-uniform data types
»Parse data and query on-the-fly
»NoSQL systems (e.g. Apache Pig Latin, SQL-on-Hadoop systems)
Data Lake
»All data is loaded from source systems, i.e., no data is turned away.
»Data is stored at the leaf level in an untransformed or nearly
untransformed state.
»Data is transformed and a schema is applied to fulfill the needs of analysis.
»Load the raw data as-is; give it a structure at processing time:
schema-on-read
»High agility: configuration and reconfiguration of models, queries and
applications as needed, on-the-fly
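Schema-on-read can be sketched in a few lines: raw records are stored as-is, and a (name, type, default) schema is imposed only at query time. The records and field names below are made up:

```python
# Schema-on-read, minimally: store raw records untouched, apply a structure
# only when a query reads them.
import json

raw = [  # what landed in the "lake": non-uniform records, custom fields
    '{"user": "a", "amount": 10}',
    '{"user": "b", "amount": "25", "coupon": "X1"}',
    '{"user": "a"}',
]

def read_with_schema(lines, schema):
    """Apply (name, type, default) triples while reading, not while loading."""
    for line in lines:
        rec = json.loads(line)
        yield {name: typ(rec.get(name, default)) for name, typ, default in schema}

schema = [("user", str, ""), ("amount", float, 0)]
total = sum(r["amount"] for r in read_with_schema(raw, schema))
print(total)  # 35.0
```

A different analysis can read the same raw lines with a different schema (e.g. including `coupon`) without reloading anything.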
Challenge #3: Curse of Dimensionality
Datasets in applications like bioinformatics and text processing are
characterized by:
Large column sets, a.k.a. high-dimensional data
»100 columns of data → a 100-dimension hyperspace
»A data cube of 100 dimensions, where each dimension has cardinality 10,
yields 11^100 aggregate cells (each dimension contributes its 10 values
plus the ALL level, hence (10+1)^100)
Moderate size
»One million tuples
Proposed Solutions
Minimal Cubing Approach [Li et al., 2004]
Dimension Reduction & Feature Selection
»Principal Component Analysis (PCA)
»Linear Discriminant Analysis (LDA)
»Canonical Correlation Analysis (CCA)
»Latent Semantic Indexing (LSI) for text data
»Complex and not scalable
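As a minimal sketch of the dimension-reduction idea, PCA via SVD projects high-dimensional points onto the top-k directions of largest variance. The 50-dimensional data below is synthetic and rank-2 by construction (numpy assumed available):

```python
# PCA by SVD: project n-dimensional points onto the top-k directions of
# largest variance.
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))          # the true 2-D structure
basis = rng.normal(size=(2, 50))            # embedding into 50 dimensions
X = latent @ basis + 0.01 * rng.normal(size=(200, 50))  # plus small noise

Xc = X - X.mean(axis=0)                     # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z = Xc @ Vt[:k].T                           # reduced 200 x 2 representation
explained = float((S[:k] ** 2).sum() / (S ** 2).sum())
print(Z.shape)  # (200, 2)
```

Here two components capture nearly all the variance; on a cube, one would then build dimensions over the reduced space rather than over all 50 raw columns.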
Challenge #4: Scale-out
Data Fragmentation
Parallel IO
Parallel Processing
Technologies
Parallel Cube Calculus
Distributed Relational Data Warehouses + Mid-tier for parallel
cube calculus, e.g. OLAP* framework
SQL-on-Hadoop Systems
e.g. Hive, Spark SQL, Drill, Impala, BigInsights
Scalable and Distributed Data Structures
k-RP*s, SkipTree, SkipWebs, PN-Tree, …
Recommenders for performance
tuning
»Automated selection of materialized
views and indexes for SQL
workloads
»AutoAdmin research project at
Microsoft, which explores novel
techniques to make databases
self-tuning [Agrawal et al., 2000]
»MS Database Tuning Advisor
»Offline techniques
»DBA is completely out of the
picture
Challenge #5: Intelligent Recommenders
Indexes and Materialized Views are physical structures that can
significantly accelerate performance.
Challenge #5: Intelligent Recommenders
Alerter Approach [Hose et al., 2008]
»supports the aggregate configuration of an OLAP server by (1)
continuously monitoring information about the workload and the
benefit of aggregation tables and (2) alerting the DBA if changes to the
current configuration would be beneficial
Semi-Automatic Index Tuning: keeping DBAs in the loop
[Schnaitter and Polyzotis, 2012]
»Online workload analysis, with decisions delegated to the DBA
»The solution takes index interactions into account
»Index interactions: two indices a and b interact if the benefit of a
depends on the presence of b.
Challenge #6: Real-time Processing
Retrospective Analysis
Traditional Architectures
»DWSs are deployed as part of an OLAP system, separated from the OLTP system,
»Data propagates down to the OLAP system, but typically after some lag,
»This is sufficient for retrospective analytics, but does not suffice in situations
that require real-time analytics.
Real-time analytics use cases
»Intelligent road-traffic management
»Remote health-care monitoring
»Complex event processing systems
New Data Processing Architectures
Architecture for Big Data processing at Scale
»Lambda Architecture by Nathan Marz
»Batch processing system + Speed processing system
»Kappa Architecture
»No batch processing systems
»New software for architecting DW systems!
Challenge #6: Real-time Processing
DSS Real-time Systems Characteristics
Data stream systems
»Streams are pushed at the system
»Long-running (continuous) queries
»Stream characteristics are often unknown and time-varying
»On-line profiling and adaptation are necessary
Continuously arriving data streams
»Up to gigabits per second in network monitoring
»We would like to run continuous OLAP/mining queries
The working set of typical systems might fit in memory;
disk is mostly for archiving purposes
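A continuous query in miniature: a sliding-window count maintained online as each event arrives, rather than computed over stored data. The window width and event timestamps are arbitrary:

```python
# Sliding-window count over an unbounded stream, updated per event.
from collections import deque

class SlidingWindowCount:
    """Counts events seen in the last `width` time units, maintained online."""
    def __init__(self, width):
        self.width = width
        self.events = deque()

    def observe(self, t):
        self.events.append(t)
        # Evict events that fell out of the window.
        while self.events and self.events[0] <= t - self.width:
            self.events.popleft()
        return len(self.events)

win = SlidingWindowCount(width=10)
counts = [win.observe(t) for t in [1, 3, 5, 12, 14, 30]]
print(counts)  # [1, 2, 3, 3, 3, 1]
```

Stream engines generalize this pattern to windowed joins and aggregates, with the working set kept in memory as noted above.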
Challenge #6: Real-time Processing
Lambda Architecture: Big Picture (1/4)
Challenge #6: Real-time Processing
Lambda Architecture: Batch Layer (2/4)
What does it do?
»Focuses on the ingest and storage of large quantities of data and the
calculation of views from that data;
»Stores an immutable, append-only, constantly expanding master copy
of the system’s data;
»Computes views, i.e. derivative data for consumption by the serving
layer.
Technologies
»Batch System: Apache Hadoop/MapReduce, Apache Pig latin, Apache
Spark
»Batch view databases: ElephantDB, SploutSQL
Challenge #6: Real-time Processing
Lambda Architecture: Speed Layer (3/4)
What does it do?
»Processes raw stream data into views and deploys those views on the
Serving Layer.
»Stream processing
»Continuous computation
Technologies
»Apache Storm, Apache Spark Streaming,
»Speed Layer views
»The views need to be stored in a random writable database and are
Read & Write.
Challenge #6: Real-time Processing
Lambda Architecture: Service Layer (4/4)
What does it do?
»Focus on serving up views of the data as quickly as possible.
»Queries the batch & real-time views and merges them,
»Should meet requirements of scalability and fault-tolerance.
Technologies
»Read-only: ElephantDB, Druid
»Read & write: Riak, Cassandra, Redis
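The serving-layer merge can be sketched as a sum of the (complete but stale) batch view and the (fresh, incremental) speed view, keyed on the same dimension. The view contents below are invented:

```python
# Serving-layer merge: a query answer combines the batch view with the
# real-time increments accumulated since the last batch run.
batch_view = {"2016-11-16": 1200, "2016-11-17": 800}   # recomputed periodically
speed_view = {"2016-11-17": 45, "2016-11-18": 12}      # real-time increments

def serve(views):
    merged = {}
    for view in views:
        for key, value in view.items():
            merged[key] = merged.get(key, 0) + value
    return merged

answer = serve([batch_view, speed_view])
print(answer)  # {'2016-11-16': 1200, '2016-11-17': 845, '2016-11-18': 12}
```

Summation works here because the measure is additive; non-additive measures need merge functions chosen per metric.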
Challenge #7: Visualization
Introduction
Part I: Methods & State-of-the-Art
Part II: Experiences
Part III: Challenging Problems
Conclusion
Goal: Visualize multidimensional data sets by capturing the
dimensionality of data
Data cube
Is a representation of a multidimensional data set
OLAP client technologies
»Pivot table or cross-tab
»OLAP ops: drill-down, roll-up, pivot, slice, dice
»Bar charts, pie charts, and time series
»e.g. jPivot, Saiku
Challenges for sophisticated visualization techniques
»Interactive
»Reflect current data: updateable
»Support large datasets
»Ergonomic
Challenge #7: Visualization
Visualization Techniques
Tree-map: displays hierarchical
(tree-structured) data as a set
of nested rectangles
Scatter plot: uses Cartesian
coordinates to display values
for typically two variables for a
set of data
Challenge #7: Visualization
Visualization Techniques
Parallel coordinate plots:
visualize high-dimensional
geometry and analyze
multivariate data
Choropleth maps: thematic
maps in which areas are
shaded or patterned in
proportion to the
measurement of the statistical
variable being displayed on
the map
Challenge #7: Visualization
Multidimensional Networks (1/2)
Networks’ application domains: social networks (LinkedIn, Facebook, …), the
web, communication networks, …
A multidimensional network has
a graph structure and multidimensional attributes
Example: Graph Cube [Zhao et al. 2011]
(in the example graph, nodes 1 and 2 are friends)
Challenge #7: Visualization
Graph Cube: example queries (2/2)
Single multidimensional space (cuboid queries)
Q1: What’s the network structure between
different genders?
Q2: What’s the network structure between
the various gender and location
combinations?
Multiple multidimensional spaces (crossboid queries)
What is the network structure between the
user with ID = 3 and various locations?
What is the network structure between users
grouped by gender vs. users grouped by
location?
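Q1-style aggregation (collapsing the network onto the gender dimension) can be sketched as follows. The toy graph is invented, and this is a conceptual sketch, not Graph Cube's actual implementation:

```python
# Graph-cube style aggregation: collapse a user network onto one attribute
# dimension, counting edges between the resulting groups.
nodes = {1: "M", 2: "F", 3: "F", 4: "M"}          # user id -> gender
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]          # friendships

def aggregate(edges, dim):
    agg = {}
    for u, v in edges:
        key = tuple(sorted((dim[u], dim[v])))     # undirected group pair
        agg[key] = agg.get(key, 0) + 1
    return agg

summary = aggregate(edges, nodes)
print(summary)  # {('F', 'M'): 3, ('F', 'F'): 1}
```

Q2 is the same operation with a composite key (gender, location); crossboid queries mix an aggregated side with an individual node.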
Challenge #7: Visualization
Nanocubes by AT&T (1/3)
Real-time exploratory visualization of large spatiotemporal and
multidimensional datasets
Real-time: extremely fast queries
Spatial: the spatial region can be either a rectangle covering most of
the world or a heatmap of activity
Multidimensional: besides latitude, longitude, and time, there are
other attributes such as tweet device (Android or iPhone) and tweet
language (eng, fr, it, sp, ru, …)
Nanocube [Lins et al., 2013]
Algorithm outline
for every object oi:
»find the finest address of the schema S hit by this object,
»update the time series associated with this address,
»update, in a deepest-first fashion, all coarser addresses hit by oi
Cons: memory usage
when indexing all six dimensions (latitude, longitude, time, language, device,
application), the 210 million points from Twitter take around 45 GB of memory
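The update rule above can be sketched with a prefix-based index: each inserted point increments the count at its finest spatial address and at every coarser prefix, per time bucket. The quadtree-like addresses and buckets here are toy values, not the actual nanocube data structure:

```python
# Miniature nanocube-style update: maintain counts per (address, time bucket),
# where every coarser prefix of an address is updated as well.
from collections import defaultdict

index = defaultdict(lambda: defaultdict(int))  # address -> time bucket -> count

def insert(point_address, t):
    """point_address is a quadtree-like path, e.g. ('q0', 'q2')."""
    for depth in range(len(point_address), -1, -1):  # finest first, then coarser
        index[point_address[:depth]][t] += 1

insert(("q0", "q2"), t=1)
insert(("q0", "q3"), t=1)
insert(("q0", "q2"), t=2)

# The root address () sees every point; ('q0','q2') only its own.
print(dict(index[()]), dict(index[("q0", "q2")]))  # {1: 2, 2: 1} {1: 1, 2: 1}
```

Materializing all coarser addresses is what makes multi-resolution queries fast, and also what drives the memory cost noted above.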
Challenge #7: Visualization
Nanocubes by AT&T (2/3)
Which device is more popular for tweeting?
Is one device more popular in certain areas than in others?
How has this popularity changed over time?
lspatial1 is coarser than lspatial2
Challenge #7: Visualization
Nanocubes by AT&T (3/3)
Intermediate nanocubes generated after each tweet is inserted
Conclusion
References
M. Fowler. Schemaless Data Structures. 2013. http://martinfowler.com/articles/schemaless/
N. Marz and J. Warren. Big Data: Principles and Best Practices of Scalable Realtime Data
Systems, 1st Edition.
S. Agrawal, S. Chaudhuri and V. Narasayya. Automated Selection of Materialized Views and
Indexes for SQL Databases. VLDB’2000.
http://www.research.microsoft.com/dmx/AutoAdmin
K. Hose, D. Klan, M. Marx and K. Sattler. When is it Time to Rethink the Aggregate
Configuration of Your OLAP Server?. VLDB’2008.
K. Schnaitter and N. Polyzotis. Semi-Automatic Index Tuning: Keeping DBAs in the
Loop. VLDB’2012.
P. Zhao, X. Li, D. Xin and J. Han. Graph Cube: On Warehousing and OLAP
Multidimensional Networks. SIGMOD’2011.
L. D. Lins, J. T. Klosowski and C. E. Scheidegger. Nanocubes for Real-Time Exploration
of Spatiotemporal Datasets. IEEE Trans. Vis. Comput. Graph. 2013.
https://github.com/laurolins/nanocube
Thank you for your Attention
Q & A
Multi-Dimensional Database Modeling and Querying: Methods,
Experiences and Challenging Problems
Alfredo Cuzzocrea and Rim Moussa
17th of November, 2016
More Related Content

PDF
exercices business intelligence
PDF
PDF
Resume de BI
PDF
HDFS Architecture
PDF
Reporting avec JasperServer & iReport
PPTX
Règles d’association
PPTX
La reconnaissance gestuelle
PPTX
Chp2 - Cahier des Charges
exercices business intelligence
Resume de BI
HDFS Architecture
Reporting avec JasperServer & iReport
Règles d’association
La reconnaissance gestuelle
Chp2 - Cahier des Charges

What's hot (20)

PDF
EDM Creating Formulas for Formula Profile & RTP Interface
PPT
Projet BI - 2 - Conception base de données
PDF
Sap sd quest_answer_2009061511245119496
PDF
Tp1 - OpenERP (1)
PDF
Cours Big Data Chap5
PPTX
Les Base de Données NOSQL -Presentation -
PPT
Projet BI - 1 - Analyse des besoins
PPT
Projet Bi - 3 - Alimentation des données
PPTX
Chp2 - Les Entrepôts de Données
PPTX
SGBDR vs NoSQL, Différences et Uses Cases. Focus sur ArangoDB
PDF
Business Intelligence : Transformer les données en information.
PDF
Cours datamining
PDF
Business Intelligence
PDF
Partie3BI-DW-OLAP2019
PDF
Chapitre 1 les entrepôts de données
PDF
TD4-UML-Correction
PPT
Cours data warehouse
PDF
Partie2BI-DW2019
PDF
Lsmw step by- step
EDM Creating Formulas for Formula Profile & RTP Interface
Projet BI - 2 - Conception base de données
Sap sd quest_answer_2009061511245119496
Tp1 - OpenERP (1)
Cours Big Data Chap5
Les Base de Données NOSQL -Presentation -
Projet BI - 1 - Analyse des besoins
Projet Bi - 3 - Alimentation des données
Chp2 - Les Entrepôts de Données
SGBDR vs NoSQL, Différences et Uses Cases. Focus sur ArangoDB
Business Intelligence : Transformer les données en information.
Cours datamining
Business Intelligence
Partie3BI-DW-OLAP2019
Chapitre 1 les entrepôts de données
TD4-UML-Correction
Cours data warehouse
Partie2BI-DW2019
Lsmw step by- step
Ad

Similar to ER 2016 Tutorial (20)

PDF
Meetup070416 Presentations
PPT
eScience: A Transformed Scientific Method
PDF
Apache CarbonData+Spark to realize data convergence and Unified high performa...
PPTX
Data lake-itweekend-sharif university-vahid amiry
PDF
Minimizing the Complexities of Machine Learning with Data Virtualization
PPT
Analysis technologies - day3 slides Lecture notesppt
PPTX
Flink Meetup Septmeber 2017 2018
PPTX
Big Data Session 1.pptx
PDF
Unlock Your Data for ML & AI using Data Virtualization
PPTX
Big Process for Big Data @ PNNL, May 2013
PPTX
Lecture1
PPT
Big data.ppt
PDF
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
PDF
INF2190_W1_2016_public
PDF
Data Infrastructure for a World of Music
PDF
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
PDF
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
PPTX
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
PDF
Introduction to Data streaming - 05/12/2014
PPTX
Big Data Analytics PPT - S1 working .pptx
Meetup070416 Presentations
eScience: A Transformed Scientific Method
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Data lake-itweekend-sharif university-vahid amiry
Minimizing the Complexities of Machine Learning with Data Virtualization
Analysis technologies - day3 slides Lecture notesppt
Flink Meetup Septmeber 2017 2018
Big Data Session 1.pptx
Unlock Your Data for ML & AI using Data Virtualization
Big Process for Big Data @ PNNL, May 2013
Lecture1
Big data.ppt
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
INF2190_W1_2016_public
Data Infrastructure for a World of Music
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
Introduction to Data streaming - 05/12/2014
Big Data Analytics PPT - S1 working .pptx
Ad

More from Rim Moussa (19)

PDF
data pipelines complexity human expertise and LLM era
PDF
customized eager lazy data cleansing for satisfactory big data veracity
PDF
doc oriented stores for mailing lists using elastic stack
PDF
scalable air quality analytics with apache spark and apache sedona
PDF
polystore_NYC_inrae_sysinfo2021-1.pdf
PDF
Big Data Projects
PDF
ISNCC 2017
PDF
EMR AWS Demo
PDF
BICOD-2017
PDF
Asd 2015
PDF
Ismis2014 dbaas expert
PDF
Parallel Sequence Generator
PDF
Hadoop ensma poitiers
PDF
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
PDF
Automation of MultiDimensional DB Design (poster)
PDF
TPC-H analytics' scenarios and performances on Hadoop data clouds
PDF
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
PDF
highly available distributed databases (poster)
PDF
parallel OLAP
data pipelines complexity human expertise and LLM era
customized eager lazy data cleansing for satisfactory big data veracity
doc oriented stores for mailing lists using elastic stack
scalable air quality analytics with apache spark and apache sedona
polystore_NYC_inrae_sysinfo2021-1.pdf
Big Data Projects
ISNCC 2017
EMR AWS Demo
BICOD-2017
Asd 2015
Ismis2014 dbaas expert
Parallel Sequence Generator
Hadoop ensma poitiers
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
Automation of MultiDimensional DB Design (poster)
TPC-H analytics' scenarios and performances on Hadoop data clouds
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
highly available distributed databases (poster)
parallel OLAP

Recently uploaded (20)

PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
Week 4 Term 3 Study Techniques revisited.pptx
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
Pre independence Education in Inndia.pdf
PDF
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
PPTX
Cell Structure & Organelles in detailed.
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PPTX
Institutional Correction lecture only . . .
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PPTX
master seminar digital applications in india
PPTX
PPH.pptx obstetrics and gynecology in nursing
PDF
01-Introduction-to-Information-Management.pdf
PPTX
Pharma ospi slides which help in ospi learning
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
Microbial diseases, their pathogenesis and prophylaxis
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Week 4 Term 3 Study Techniques revisited.pptx
Abdominal Access Techniques with Prof. Dr. R K Mishra
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Pre independence Education in Inndia.pdf
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
Cell Structure & Organelles in detailed.
O7-L3 Supply Chain Operations - ICLT Program
FourierSeries-QuestionsWithAnswers(Part-A).pdf
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Institutional Correction lecture only . . .
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
O5-L3 Freight Transport Ops (International) V1.pdf
master seminar digital applications in india
PPH.pptx obstetrics and gynecology in nursing
01-Introduction-to-Information-Management.pdf
Pharma ospi slides which help in ospi learning
VCE English Exam - Section C Student Revision Booklet
Anesthesia in Laparoscopic Surgery in India
Microbial diseases, their pathogenesis and prophylaxis

ER 2016 Tutorial

  • 1. 17th Nov. 2016 ER'2016@Gifu.Japan 1 Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems Alfredo Cuzzocrea University of Trieste & ICAR Rim Moussa LaTICE Lab. & University of Carthage The 35th Intl. Conference on Conceptual Modeling @ Gifu, JAPAN 17th of November, 2016
  • 2. 17th Nov. 2016 ER'2016@Gifu.Japan 2 Tutorial Outline Part I: Data Warehouse Systems Decision Support Systems DWS Architectures DW Schemas OLAP cube and OLAP operations DSS Benchmarks OLAP Mandate Part II: Multi-dimensional design Part III: Challenging Problems Conclusion
  • 3. 17th Nov. 2016 ER'2016@Gifu.Japan 3 Translating Data into Insights & Opportunities! Source:
  • 4. 17th Nov. 2016 ER'2016@Gifu.Japan 4 From DIKW Pyramid To Decision Support System SourceSource Source »Descriptions of things, events, activities and transactions »internal or external »Consolidated and organized data that has meaning, value and purpose »Know-thats »Processed data that conveys understanding, experience or learning applicable to a problem|activity »Ability to increase effectiveness »What to do, act … Wisdom Knowledge Information Data
  • 5. 17th Nov. 2016 ER'2016@Gifu.Japan 5 Decision Support System Decision Support System »A collection of integrated software applications and hardware that form the backbone of an organization’s decision making process. »Performs different types of analyses »What-if analysis »Simulation analysis »Goal-seeking analysis Business Intelligence »BI encompasses a variety of tools, applications and methodologies that enable organizations to collect data from internal systems and external sources, prepare it for analysis, develop and run queries against the data, and create reports, dashboards and data visualizations to make the analytical results available to corporate decision makers as well as operational workers. »Integrations (ETL) tools, data warehousing tools, OLAP technologies, OLAP clients, mining structures, reporting tools, dashboards’ design tools »Integration workflows design methods, Multi-dimensional design methods, …
  • 6. 17th Nov. 2016 ER'2016@Gifu.Japan 6 Decision Support System OLTP vs OLAP (1/3) OLTP: On-Line Transaction Processing »Users/ applications interacting with database in real-time »e.g.: on-line banking, on-line payment »Required properties of memory access ACID properties: Atomicity, Consistency, Isolation, Durability Coherence »Benchmarks: TPC-C, TPC-E OLAP: On-Line Analytical Processing »Experts doing offline data analysis »e.g.: data warehouses, decision support systems »Memory patterns: Indexed and Sequential access patterns »Benchmarks: TPC-H, TPC-DS
  • 7. 17th Nov. 2016 ER'2016@Gifu.Japan 7 Decision Support System OLTP vs OLAP (2/3) OLTP: On-Line Transaction Processing »Users/ applications interacting with database in real-time »e.g.: on-line banking, on-line payment »Required properties of memory access ACID properties: Atomicity, Consistency, Isolation, Durability Coherence »Benchmarks: TPC-C, TPC-E OLAP: On-Line Analytical Processing »Experts doing offline data analysis »e.g.: data warehouses, decision support systems »Memory patterns: Indexed and Sequential access patterns »Benchmarks: TPC-H, TPC-DS
  • 8. 17th Nov. 2016 ER'2016@Gifu.Japan 8 Decision Support System OLTP vs OLAP (3/3)  Users/apps interacting with DB in real-time  E.g., customers buying/selling books at Amazon  Many concurrent users, queries, connections  DB is 3rd NF  Simple queries, often predetermined  Relatively little data (~GB)  Each query touches little data  Little computation per query  Simple computation  Data is continually updated  Accuracy and recovery important, hence strict transactions  Throughput is most important  Business Experts doing offline data Analysis  E.g., bestseller at Amazon  Few concurrent users, queries, connections  DB is denormalized  Complex ad-hoc queries  Very large data sets (~TB)  Queries touch large data sets  Mining is compute intensive  Complex operations in mining  Data is mostly read-only  Strict transactional  semantics is not needed  Latency is more important OLTP OLAP
  • 9. 17th Nov. 2016 ER'2016@Gifu.Japan 9 Data Integration Data Integration is the process of integrating data from multiple sources and probably have a single view over all these sources Integration can be physical (copy the data to warehouse) or virtual (keep the data at the sources) Data Acquisition: This is the process of moving company data from the source systems into the warehouse. Data acquisition is an ongoing, scheduled process, which is executed to keep the warehouse current to a pre-determined period in time. Changed Data Capture: The periodic update of the warehouse from the transactional system(s) is complicated by the difficulty of identifying which records in the source have changed since the last update. Data Cleansing: This is typically performed in conjunction with data acquisition. It is a complicated process that validates and, if necessary, corrects the data before it is inserted into the warehouse
  • 10. 17th Nov. 2016 ER'2016@Gifu.Japan 10 Data Integration Software solutions Software Solutions »Known as ETL tools, Integration services, ... »Vendors (non-exhaustive list): IBM (InfoSphere Data Event Publisher), Informatica (Informatica Enterprise Data Integration), Information Builders (iWay DataMigrator), Microsoft (Microsoft's SQL Server Integration Services), Oracle (Oracle GoldenGate), Pentaho (Pentaho DI - prev. Ketlle), SAP (BusinessObjects Data Integrator), SAS (SAS DataFlux) and Talend (Talend Open Studio), Kettle (Pentaho) and Talend offer open-source versions, MicroSoft bundels integration services with database products
  • 11. 17th Nov. 2016 ER'2016@Gifu.Japan 11 Integrating Heterogeneous data sources: Lazy or Eager? Lazy Integration aka Query-driven approach »Accept a query, determine the sources which answer the query, devise the execution tree: generate appropriate sub-queries for each source, obtain resultsets, perform required post-processing: translation, merging and filtering and return answer to application Eager Integration aka Warehouse approach »Information of each source of interest is extracted in-advance, translated, filtered and processed as appropriate, then merged with information from other sources and stored »Accept a query, answer query from repository
  • 12. 17th Nov. 2016 ER'2016@Gifu.Japan 12 Data Integration Eager vs. Lazy ?  Integrate in-advance i.e. before queries  Copy data from sources  Query answer-set could be stale  Need to refresh the data warehouse  Operates when sources are unavailable  High query performance: through building data summaries and local processing at sources unaffected  OLAP queries are not visible outside the warehouse  Integrate on-demand i.e. at query time  Leave data at sources  Query answer-set is up-to-date  No copy so no refresh and storage cost  Out of service if sources unavailable  Sources are drained: interference with local processing Eager Lazy
  • 13. 17th Nov. 2016 ER'2016@Gifu.Japan 13 Lazy Data Integration  Query-driven Architecture Source WRAPPER WRAPPER MEDIATOR Source WRAPPER
  • 14. 17th Nov. 2016 ER'2016@Gifu.Japan 14 Eager Data Integration  Warehouse System Architecture
  • 15. 17th Nov. 2016 ER'2016@Gifu.Japan 15 Data Warehouse Definition By Bill Inmon "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process" »Subject-oriented: The data in the database is organized so that all the data elements relating to the same real-world event or object are linked together; e.g. finance, human resources, sales … »Time-variant: The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time. Hence, all addresses of a customer are in the DW, »Non-volatile: Data in the database is never over-written or deleted - once committed, the data is static, read-only, but retained for future reporting; »Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. There is only a single way to identify a product By Ralph Kimbal “A copy of transaction data specifically structured for query and analysis” Inmon top-down approach vs. Kimbal bottom-up approach
  • 16. 17th Nov. 2016 ER'2016@Gifu.Japan Constellation Schema »Galaxy schema »Multiple fact tables which share dimensions Snowflake Schema »Hierarchical relationships exist among Dimensions Tables »Normalized schema Star Schema »A single, large and central table Fact table surrounded by multiple Dimension tables »All Dimensions Tables have a hierarchical relationship with the Fact Table 16 Data Warehouse Schemas Fact Table Dim Table 1 Dim Table 3 Dim Table 4 Dim Table 2 Fact Table Dim Table 5 Dim Table 4 Dim Table 2 Dim Table 6 Dim Table 1 Dim Table 3 Dim Table 7 Dim Table 4 Fact Table 2 Fact Table 1 Dim Table 5 Dim Table 2 Dim Table 6 Dim Table 1 Dim Table 3
  • 17. 17th Nov. 2016 ER'2016@Gifu.Japan 17 Multi-dimensional Data Analysis Multi-dimensional Data Analysis »refers to the process of summarizing data across multiple levels (called dimensions) and then presenting the results in a multi-dimensional grid format. OLAP Cube »Multi-dimensional representation of data »aka Crosstab, Data Pivot
  • 18. 17th Nov. 2016 ER'2016@Gifu.Japan Facts: are the objects that represent the subject of the desired analyses. »Examples: sales records, weather records, cab trips, … »The fact table contains three types of attributes: measure attributes, foreign keys to dimension tables, and degenerate dimensions Dimension(s): »Levels are individual values that make up dimensions »Examples »Date dimension (quarter, month, day) »Time dimension (hour, min, sec) »Geography dimension (Country, city, postal code) Measure(s): »Examples: revenue, lost revenue, sold quantities, expenses, forecasts, … »Use aggregate functions: min, max, count, distinct-count, sum, average, … 18 OLAP Cube
  • 19. 17th Nov. 2016 ER'2016@Gifu.Japan 19 OLAP Operations On-Line Analytical Processing
  • 20. 17th Nov. 2016 ER'2016@Gifu.Japan 20 OLAP Operations Drill-down stepping down to lower level data or introducing new dimensions Roll-up summarizes data by climbing up hierarchy or by dimension reduction Pivot rotate, reorienting the cube Slice selecting data on one dimension Dice selecting data on multiple dimensions
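These operations can be sketched over a toy fact table in plain Python (an illustrative sketch, not tied to any OLAP engine; the facts, column order and `roll_up` helper are invented for the example):

```python
from collections import defaultdict

# Toy fact table: (year, month, country, product, amount)
facts = [
    (2015, 1, "JP", "tea", 100), (2015, 2, "JP", "coffee", 80),
    (2015, 1, "FR", "tea", 60),  (2016, 3, "FR", "coffee", 90),
    (2016, 4, "JP", "tea", 70),
]

def roll_up(rows, dims):
    """Aggregate SUM(amount) over the given dimension positions (dimension reduction)."""
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[d] for d in dims)] += row[4]
    return dict(out)

# Roll-up: drop the month level, keep only (year, country)
by_year_country = roll_up(facts, dims=(0, 2))

# Slice: fix one dimension (year = 2015), then aggregate by country
by_country_2015 = roll_up([r for r in facts if r[0] == 2015], dims=(2,))

# Dice: restrict several dimensions (here: country = "JP"), then aggregate by year
by_year_jp = roll_up([r for r in facts if r[2] == "JP"], dims=(0,))

print(by_year_country)  # {(2015, 'JP'): 180, (2015, 'FR'): 60, (2016, 'FR'): 90, (2016, 'JP'): 70}
```

Drill-down is the inverse direction: starting from `by_year_country`, re-aggregating with `dims=(0, 1, 2)` would re-introduce the month level.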
  • 21. 17th Nov. 2016 ER'2016@Gifu.Japan 21 OLAP Operations Roll-up & Drill-down
  • 22. 17th Nov. 2016 ER'2016@Gifu.Japan 22 OLAP Operations Slice & Dice
  • 23. 17th Nov. 2016 ER'2016@Gifu.Japan 23 Derived Data OLAP Indices »Inverted lists »Bitmap Indices »Join indices »Text indices Materialization of a data cube is a way to pre-compute and store multi-dimensional aggregates so that multi-dimensional analysis can be performed on-the-fly [Li et al., 2004] Calculated Attributes Aggregate Tables (aka materialized tables) Data synopsis: histograms and sketches (Trade-off triangle: improved query performance vs. maintenance & refresh vs. storage requirements.)
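The maintenance & refresh trade-off can be illustrated with a minimal sketch: an aggregate table built once from the base facts and then refreshed incrementally on insert, instead of rescanning the base table at every query (the `AggregateTable` class and sample rows are hypothetical):

```python
from collections import defaultdict

class AggregateTable:
    """Materialized SUM aggregate over one grouping key, refreshed incrementally."""
    def __init__(self, rows, key):
        self.key = key
        self.sums = defaultdict(float)
        for row in rows:                     # one full scan at build time
            self.sums[row[key]] += row["amount"]

    def insert(self, row):
        # Incremental refresh: update the aggregate instead of rescanning the base
        self.sums[row[self.key]] += row["amount"]

    def query(self, value):
        # Answered from the pre-computed aggregate, not the base table
        return self.sums.get(value, 0.0)

base = [{"year": 2015, "amount": 10.0}, {"year": 2015, "amount": 5.0},
        {"year": 2016, "amount": 7.0}]
agg = AggregateTable(base, key="year")
agg.insert({"year": 2016, "amount": 3.0})
print(agg.query(2015), agg.query(2016))  # 15.0 10.0
```

The price paid is the extra storage for `sums` and the obligation to keep it in sync with every base-table update — exactly the triangle of query performance, refresh cost and storage listed above.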
  • 24. 17th Nov. 2016 ER'2016@Gifu.Japan 24 Cuboids Lattice What to materialize? (Lattice over d1:Time, d2:Item, d3:Customer, d4:Supplier — from the 0-D apex cuboid (All), through the 1-D cuboids (d1 … d4), 2-D cuboids (d1,d2 … d3,d4) and 3-D cuboids (d1,d2,d3 … d2,d3,d4), down to the 4-D base cuboid (d1,d2,d3,d4).) For an n-dimensional cube »2^n cuboids
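The lattice can be enumerated mechanically: the cuboids are exactly the subsets of the dimension set, so an n-dimensional cube yields 2^n of them. A small sketch (the `cuboids` helper is illustrative):

```python
from itertools import combinations

def cuboids(dims):
    """Enumerate all 2^n cuboids of the lattice, grouped by level (0-D apex .. n-D base)."""
    n = len(dims)
    return {k: [frozenset(c) for c in combinations(dims, k)] for k in range(n + 1)}

lattice = cuboids(["Time", "Item", "Customer", "Supplier"])
total = sum(len(level) for level in lattice.values())
print(total)  # 16 == 2**4
```

Choosing which of these 16 cuboids to materialize (beyond the mandatory base cuboid) is the view-selection problem hinted at by "What to materialize?".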
  • 25. 17th Nov. 2016 ER'2016@Gifu.Japan ROLAP Server »Relational OLAP »Uses relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware »Includes optimization of DBMS back-end, implementation of aggregation navigation logic, and additional tools and services »High scalability »E.g. Mondrian (Pentaho BI suite) MOLAP Server »Multidimensional OLAP »Sparse array-based multidimensional storage engine »Fast indexing to pre-computed summarized data »E.g. Palo HOLAP Server »Low-level: relational / high-level: array »High flexibility »E.g. MSAS (Microsoft) 25 OLAP Servers
  • 26. 17th Nov. 2016 ER'2016@Gifu.Japan Structured Query Language (SQL) »Relational and static schema »Data Definition, Data Manipulation and Data Control Languages »Analytic Functions (window functions over partition by …) »Cube, roll-up and grouping sets operators MultiDimensional eXpressions (MDX) »Invented by Microsoft in 1997 »For querying and manipulating the multidimensional data stored in OLAP cubes »Static schema Data Flow programming languages »Google Sawzall, Apache Pig Latin, IBM InfoSphere Streams »Dynamic schema »After data is loaded, multiple operators are applied on that data before the final output is stored. 26 Query Languages (Pipeline: Load Data → Apply Schema → Apply Filter → Group Data → Apply Aggregate Function → Sort Data → Store Output)
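The data-flow style (load → schema → filter → group → aggregate → sort → store) can be mimicked in a few lines of plain Python, as a rough stand-in for what a Pig Latin script expresses; the data and threshold are invented for illustration:

```python
from collections import defaultdict

raw = "a,3\nb,5\na,2\nc,4\nb,1\n"

# Load + apply schema: parse each line into a (key, int value) tuple
rows = [(k, int(v)) for k, v in (line.split(",") for line in raw.strip().splitlines())]

# Apply filter: keep only values >= 2
rows = [(k, v) for k, v in rows if v >= 2]

# Group + apply aggregate function: SUM per key
sums = defaultdict(int)
for k, v in rows:
    sums[k] += v

# Sort + store output: descending by aggregated value
output = sorted(sums.items(), key=lambda kv: -kv[1])
print(output)  # [('a', 5), ('b', 5), ('c', 4)]
```

Each step corresponds to one operator in the pipeline; the schema is applied on-the-fly at parse time rather than declared up front, which is the "dynamic schema" point above.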
  • 27. 17th Nov. 2016 ER'2016@Gifu.Japan 27 Query Languages SQL – Q16 of TPC-H benchmark
  • 28. 17th Nov. 2016 ER'2016@Gifu.Japan 28 Query Languages MDX – Q16 of TPC-H Benchmark
WITH
SET [Brands] AS 'Except({[Part Brand].Members}, {[Part Brand].[Brand#45]})'
SET [Types] AS 'Filter({[Part Type].Members}, (NOT ([Part Type].CurrentMember.Name MATCHES "(?i)MEDIUM POLISHED.*")))'
SET [Sizes] AS 'Filter({[Part Size].Members}, ([Part Size].CurrentMember IN {[Part Size].[3], [Part Size].[9], [Part Size].[14], [Part Size].[19], [Part Size].[23], [Part Size].[36], [Part Size].[45], [Part Size].[49]}))'
SELECT [Measures].[Supplier Count] ON COLUMNS,
NonEmptyCrossJoin(NonEmptyCrossJoin([Brands], [Types]), [Sizes]) ON ROWS
FROM [Cube16]
  • 29. 17th Nov. 2016 ER'2016@Gifu.Japan 29 Query Languages Data Flow –Pig Latin script for Q16 of TPC-H benchmark
  • 30. 17th Nov. 2016 ER'2016@Gifu.Japan 30 Decision Support Systems Benchmarks Non-TPC Benchmarks Real datasets »Open data or proprietary data »Fixed size »Devise a workload or trace the proprietary workload APB-1: no scale factor TPC Benchmarks The Transaction Processing Performance Council (TPC) was founded in 1988 to define benchmarks In 2009, TPCTC was set up as an international conference series on performance evaluation and benchmarking Examples of benchmarks relevant for benchmarking decision support systems: TPC-H, TPC-DS and TPC-DI Common characteristics of TPC benchmarks »Synthetic data »A scale factor allowing generation of different volumes, from 1GB to 1PB
  • 31. 17th Nov. 2016 ER'2016@Gifu.Japan 31 Decision Support Systems Benchmarks TPC-H Benchmark Schema (1/2) TPC-H Benchmark: complex schema 22 ad-hoc SQL statements (star queries, nested queries, …) + refresh functions
  • 32. 17th Nov. 2016 ER'2016@Gifu.Japan 32 Decision Support Systems Benchmarks TPC-H Benchmark (2/2) TPC-H Benchmark Metrics 2 Metrics »QphH@Size is the number of queries processed per hour that the system under test can handle for a fixed load »$/QphH@Size represents the ratio of cost to performance, where the cost is the cost of ownership of the SUT (hardware, software, maintenance). Variants of TPC-H Benchmarks TPC-H*d Benchmark – detailed in Part II »Turning the TPC-H benchmark into a multi-dimensional benchmark »Few schema changes »No update to the workload requirement »MDX workload for OLAP cubes and OLAP queries SSB: Star Schema Benchmark »Turning the TPC-H benchmark into a star schema »Workload composed of 12 queries TPC-H translated into Pig Latin (Apache Hadoop ecosystem) »22 Pig Latin scripts which load and process TPC-H raw data files (.tbl files)
  • 33. 17th Nov. 2016 ER'2016@Gifu.Japan 33 Decision Support Systems Benchmarks TPC-DS Benchmark (1/2) TPC-DS Benchmark: 7 data marts
  • 34. 17th Nov. 2016 ER'2016@Gifu.Japan 34 Decision Support Systems Benchmarks TPC-DS Benchmark (2/2) TPC-DS Benchmark Workload Hundreds of queries (99 query templates) OLAP, windowing functions, mining, and reporting queries Concurrent data maintenance TPC-DS Benchmark Metrics 3 Metrics »QphDS@Size is the number of queries processed per hour that the system under test can handle for a fixed load; data maintenance and load time enter into its calculation »$/QphDS@Size represents the ratio of cost to performance, where the cost is a 3-year cost of ownership of the SUT (hardware, software, maintenance) »System Availability Date: the date when the system is available to customers. TPC-DS implementations TPC-DS v2.0 »Extension for non-relational systems such as Hadoop/Spark big data systems
  • 35. 17th Nov. 2016 ER'2016@Gifu.Japan 35 Decision Support Systems Benchmarks TPC-DI Benchmark (1/3) For benchmarking Data Integration technologies Synthetic Data of a Fictitious Retail Brokerage Firm »Internal trading system data, internal human resources data, internal CRM system data and external data »Different data scales »Data extracted from different sources: »Structured (csv) »Semi-structured data (xml) »Multi-record »Change Data Capture (CDC) Complex Data Integration Tasks Load large volumes of historical data Load incremental updates Execute complex transformations Check and ensure consistency of data
  • 36. 17th Nov. 2016 ER'2016@Gifu.Japan 36 TPC-DI Benchmark Complex Transformations (2/3) TPC-DI implements 18 complex transformations which feature the following characteristics: Transform XML into relational data Detect changes in dimension data, and apply appropriate tracking mechanisms for history-keeping dimensions Filter input data according to pre-defined conditions Identify new, deleted and updated records in input data Merge multiple input files of the same structure Join data of one input file to data from another input file with a different structure Standardize entries of the input files Join data from an input file to a dimension table Join data from multiple input files with separate structures Consolidate multiple change records per day and identify the most current Perform extensive arithmetic calculations Read data from files with variable-type records Check data for errors or for adherence to business rules Detect changes in fact data, and journal updates to reflect the current state
  • 37. 17th Nov. 2016 ER'2016@Gifu.Japan 37 TPC-DI Benchmark Metrics (3/3) Metrics Performance Metric TPC_DI_RPS = Trunc(GeoMean(TH, Min(TI1 , TI2))) TH: Throughput of Historical Load TI1: Throughput of Incremental Update 1 TI2: Throughput of Incremental Update 2 Price/Performance Metric Price-per-TPC_DI_RPS = $ / TPC_DI_RPS $ is the total 3-year pricing
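As a sanity check of the formula, a small sketch computing the metric from throughput figures (the numbers are purely illustrative, not from any published result):

```python
import math

def tpc_di_rps(th, ti1, ti2):
    """TPC_DI_RPS = Trunc(GeoMean(TH, Min(TI1, TI2)))."""
    # Geometric mean of two values is the square root of their product
    return math.trunc(math.sqrt(th * min(ti1, ti2)))

def price_per_rps(total_3yr_price, th, ti1, ti2):
    """Price/Performance: total 3-year pricing divided by TPC_DI_RPS."""
    return total_3yr_price / tpc_di_rps(th, ti1, ti2)

print(tpc_di_rps(10000, 4000, 6000))  # trunc(sqrt(10000 * 4000)) = 6324
```

Taking the minimum of the two incremental-update throughputs means a system cannot score well by excelling at only one of the update phases.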
  • 38. 17th Nov. 2016 ER'2016@Gifu.Japan 38 OLAP Mandate 12 Rules for Evaluating OLAP products by E.F. Codd et al. Rule#1: Multidimensional Conceptual View »The multidimensional conceptual view facilitates OLAP model design and analysis, as well as inter and intra dimensional calculations. Rule#2: Transparency »Whether OLAP is or is not part of the user’s customary front-end product, that fact should be transparent to the user. If OLAP is provided within the context of a client server architecture, then this fact should be transparent to the user-analyst as well. Rule#3: Accessibility »The OLAP tool must map its own logical schema to heterogeneous physical data stores, access the data, and perform any conversions necessary to present a single, coherent and consistent user view. Moreover, the tool and not the end-user analyst must be concerned about where or from which type of systems the physical data is actually coming
  • 39. 17th Nov. 2016 ER'2016@Gifu.Japan 39 OLAP Mandate 12 Rules for Evaluating OLAP products by E.F. Codd et al. Rule#4: Consistent Reporting Performance »As the number of dimensions or the size of the database increases, the OLAP user-analyst should not perceive any significant degradation in reporting performance. Rule#5: Client-Server Architecture »Most data currently requiring on-line analytical processing is stored on mainframe systems and accessed via personal computers. It is imperative that the server component of OLAP tools be sufficiently intelligent that various clients can be attached with minimum effort and integration programming. Rule#6: Generic Dimensionality »Every data dimension must be equivalent in both its structure and operational capabilities. Dimensions are symmetric, so the basic data structure, formulae, and reporting formats should not be biased toward any one data dimension.
  • 40. 17th Nov. 2016 ER'2016@Gifu.Japan 40 OLAP Mandate 12 OLAP Rules by E.F. Codd Rule#7: Dynamic Sparse Matrix Handling »The OLAP tools’ physical schema must adapt fully to the specific analytical model being created to provide optimal sparse matrix handling. »By adapting its physical data schema to the specific analytical model, OLAP tools can empower user analysts to easily perform types of analysis which previously have been avoided because of their perceived complexity. Rule#8: Multi-User Support »OLAP tools must provide concurrent access (retrieval and update), integrity, and security. Rule#9: Unrestricted Cross-dimensional Operations »The various roll-up levels within consolidation paths, due to their inherent hierarchical nature, represent in outline form, the majority of 1:1, 1:M, and dependent relationships in an OLAP model or application. Accordingly, the tool itself should infer the associated calculations and not require the user- analyst to explicitly define these inherent calculations.
  • 41. 17th Nov. 2016 ER'2016@Gifu.Japan 41 OLAP Mandate 12 OLAP Rules by E.F. Codd Rule#10: Intuitive Data Manipulation »Consolidation path re-orientation, drilling down across columns or rows, zooming out, and other manipulation inherent in the consolidation path outlines should be accomplished via direct action upon the cells of the analytical model, and should neither require the use of a menu nor multiple trips across the user interface. Rule#11: Flexible Reporting »Reporting must be capable of presenting data to be synthesized, or information resulting from animation of the data model according to any possible orientation. This means that the rows, columns, or page headings must each be capable of containing/displaying from 0 to N dimensions each, where N is the number of dimensions in the entire analytical model. Rule#12: Unlimited Dimensions and Aggregation Levels »An OLAP tool should be able to accommodate at least fifteen and preferably twenty data dimensions within a common analytical model.
  • 42. 17th Nov. 2016 ER'2016@Gifu.Japan 42 References  M. Fricke. The Knowledge Pyramid: A Critique of the DIKW Hierarchy. Journal of Information Science, 2009.  E.F. Codd, S.B. Codd and C.T. Salley. Providing OLAP to User Analysts: An IT Mandate, 1993.  J. Widom. Integrating Heterogeneous Databases: Lazy or Eager? ACM Computing Surveys 28(4es), 1996.  Y.R. Cho. Data Warehouse and OLAP Operations. www.ecs.baylor.edu/faculty/cho/4352  TPC homepage http://www.tpc.org/  M. Poess, T. Rabl and B. Caufield. TPC-DI: The First Industry Benchmark for Data Integration. PVLDB 7(13): 1367-1378, 2014. http://www.vldb.org/pvldb/vol7/p1367-poess.pdf  X. Li, J. Han and H. Gonzalez. High-Dimensional OLAP: A Minimal Cubing Approach. VLDB 2004.  C. Imhoff, N. Galemmo and J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. 2003.  R. Kimball, M. Ross, W. Thornthwaite, J. Mundy and B. Becker. The Data Warehouse Lifecycle Toolkit. 2nd Edition.  R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2nd Edition.  H. G. Molia. Data Warehousing Overview: Issues, Terminology, Products. www.cs.uh.edu/~ceick/6340/dw-olap.ppt (slides)
  • 43. 17th Nov. 2016 ER'2016@Gifu.Japan 43 Thank you for your Attention Q & A Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems Alfredo Cuzzocrea and Rim Moussa 17th of November, 2016
  • 44. 17th Nov. 2016 ER'2016@Gifu.Japan 1 Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems Part II Multi-dimensional Benchmark Design Alfredo Cuzzocrea University of Trieste & ICAR Rim Moussa University of Carthage & LaTICE The 35th Intl. Conference on Conceptual Modeling @ Gifu, JAPAN 17th of November, 2016
  • 45. 17th Nov. 2016 ER'2016@Gifu.Japan 2 Tutorial Outline Introduction Part I: State-of-the-Art Part II: Experiences TPC-H*d Experience AutoMDB TPC-DS*d Part III: Challenging Problems Conclusion
  • 46. 17th Nov. 2016 ER'2016@Gifu.Japan Given, A relational warehouse schema A workload: a set of SQL statements, W = {Q1, Q2, …, Qn} where Qi is a parameterized query How to design the multi-dimensional DB schema? How to define cubes? Will there be a single cube or multiple cubes? Are there any rules for merging cubes? Are there any rules for the definition of virtual cubes? Which optimizations are suitable for performance tuning? Derived data calculus & refresh? Data partitioning & parallel cube building? # 3 Problem
  • 47. 17th Nov. 2016 ER'2016@Gifu.Japan # 4 Idea Map each business question to an OLAP cube >> Obtain a multi-dimensional DB schema Recommend & Test Optimizations >> Derived Data >> Data partitioning >> Cube Merging
  • 48. 17th Nov. 2016 ER'2016@Gifu.Japan
SELECT t1.col_a, t1.col_b, …, tn.col_a, tn.col_z, aggregate_function(column) AS measure_1, …, aggregate_function(expression) AS measure_m
FROM table_1 t1, table_2 t2, …, table_n tn
WHERE ti.col_x operator $query_parameter$ AND ti.col_y = tj.col_z AND …
GROUP BY t1.col_a, t1.col_b, …, tn.col_a, tn.col_z
aggregate_function: min, max, sum, avg, count, count-distinct …
operator: =, <, <=, >=, != # 5 SQL Statement Template
  • 49. 17th Nov. 2016 ER'2016@Gifu.Japan Measures feature aggregate functions, e.g. min, max, count, count-distinct, sum, average, … Simple Measure Defined over a single attribute, e.g. SUM(l_extendedprice), Measure expressions Involve more than one attribute, e.g. SUM(l_extendedprice*(1 - l_discount)) Computed Members Involve already defined measures or measure expressions, e.g. M1=SUM(l_extendedprice), M2=COUNT(l_orderkey), CM = M1 / M2 # 6 OLAP Cube Design: Measures’ Definition
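The three kinds of measures can be illustrated on a handful of hypothetical lineitem-like facts (the rows and values are invented; the attribute names follow TPC-H):

```python
facts = [
    {"l_orderkey": 1, "l_extendedprice": 100.0, "l_discount": 0.05},
    {"l_orderkey": 2, "l_extendedprice": 200.0, "l_discount": 0.10},
    {"l_orderkey": 2, "l_extendedprice": 50.0,  "l_discount": 0.00},
]

# Simple measure: aggregate over a single attribute
m1 = sum(f["l_extendedprice"] for f in facts)                      # SUM(l_extendedprice)

# Measure expression: aggregate over an expression of several attributes
revenue = sum(f["l_extendedprice"] * (1 - f["l_discount"]) for f in facts)

# Computed member: combines measures that are already defined
m2 = len(facts)          # COUNT of facts
cm = m1 / m2             # e.g. average extended price per fact

print(m1, revenue, cm)
```

The distinction matters for cube design: simple measures and measure expressions are evaluated per fact before aggregation, while computed members are derived from aggregated measures and so cannot be pushed into the fact scan.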
  • 50. 17th Nov. 2016 ER'2016@Gifu.Japan All attributes involved in measures and measure expressions belong to the fact table, Example: Q10 of TPC-H benchmark # 7 OLAP Cube Design Fact Table Definition (1/9)
  • 51. 17th Nov. 2016 ER'2016@Gifu.Japan When measurable attributes belong to different tables, the fact table is a view defined as the join of the relations to which the attributes belong! Example 1: Q9 of TPC-H benchmark, where l_extendedprice, l_discount and l_quantity belong to lineitem, and ps_supplycost belongs to partsupp. The fact table is the join of the lineitem and partsupp tables. Only the attributes needed for joins with dimension tables (namely l_partkey, l_orderkey, l_suppkey) and the measurable attributes (namely l_extendedprice, l_discount, l_quantity, ps_supplycost) are selected. # 8 OLAP Cube Design Fact Table Definition over multiple tables (2/9)
  • 52. 17th Nov. 2016 ER'2016@Gifu.Japan # 9 Q9 SQL statement OLAP Cube Design Fact Table Definition over multiple tables (3/9)
  • 53. 17th Nov. 2016 ER'2016@Gifu.Japan # 10 OLAP Cube Design Fact Table Definition over multiple tables (4/9)
  • 54. 17th Nov. 2016 ER'2016@Gifu.Japan # 11 Q14 SQL statement OLAP Cube Design Fact Table Definition over multiple tables (5/9) Example 2: Q14 of TPC-H benchmark, where l_extendedprice and l_discount belong to lineitem, and p_type belongs to part. The fact table is the join of the lineitem and part tables
  • 55. 17th Nov. 2016 ER'2016@Gifu.Japan Filters Processing: the fact table is defined as a view of facts with filters »Extract all filters involving the fact table from the WHERE clause, such as (attr_i operator attr_j), where both attr_i and attr_j belong to the fact table, (attr_k operator $value$), such that attr_k belongs to the fact table, [not] exists (select … from … where attr_k …), such that attr_k belongs to the fact table, attr_k [not] in (list of values), such that attr_k belongs to the fact table Example 1: Q10 of TPC-H benchmark Example 2: Q16 of TPC-H benchmark Example 3: Q21 of TPC-H benchmark # 12 OLAP Cube Design Fact Table Definition and filters’ processing (6/9)
  • 56. 17th Nov. 2016 ER'2016@Gifu.Japan # 13 Q10 SQL statement OLAP Cube Design Fact Table Definition and filters’ processing (7/9)
  • 57. 17th Nov. 2016 ER'2016@Gifu.Japan # 14 Q16 SQL statement OLAP Cube Design Fact Table Definition and filters’ processing (8/9)
  • 58. 17th Nov. 2016 ER'2016@Gifu.Japan # 15 Q21 SQL statement OLAP Cube Design Fact Table Definition and filters’ processing (9/9)
  • 59. 17th Nov. 2016 ER'2016@Gifu.Japan First, consider all attributes in the SELECT, WHERE and GROUP BY clauses, »Discard measurable attributes, which appear in measures, measure expressions, or computed members, »Discard attributes which appear in the WHERE clause and are used for joining tables or filtering the fact table with static values, »Compose the time dimension along well-known hierarchies, »Year, quarter, month »Compose the geography dimension along well-known hierarchies, »Region, nation, city # 16 OLAP Cube Design Dimension Definition (1/7)
  • 60. 17th Nov. 2016 ER'2016@Gifu.Japan Example: Q10 of TPC-H benchmark All highlighted attributes are considered for building dimensions Time dimension: o_orderdate requires order_year and order_quarter levels # 17 OLAP Cube Design Dimension Definition (2/7)
  • 61. 17th Nov. 2016 ER'2016@Gifu.Japan Second, find out hierarchical relations, i.e., one-to-many relationships, and re-organize attributes along hierarchies to form the dimensions’ hierarchies, »Example: Q10 of TPC-H benchmark: each customer can be related to at most one nation, but a nation may be related to many customers, customer_dim: Customer nation → n_name; Customer details → c_custkey, c_name, c_acctbal, c_address, c_phone, c_comment order_dim: order_year → order_quarter # 18 OLAP Cube Design Dimension Definition (3/7)
  • 62. 17th Nov. 2016 ER'2016@Gifu.Japan Third, distinguish levels from properties. Properties are in functional dependency with levels, Example: Q10 of TPC-H benchmark: For customer_dim, c_custkey is the level, and all of the c_name, c_acctbal, c_address, c_phone, c_comment attributes are properties of the c_custkey level. # 19 OLAP Cube Design Dimension Definition (3/7)
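The level/property distinction rests on functional dependency: an attribute is a property of a level if each level value maps to exactly one attribute value. A small sketch of such a check (the helper name and sample rows are invented):

```python
def functionally_determines(rows, lhs, rhs):
    """True if column `lhs` functionally determines column `rhs` (lhs -> rhs)."""
    seen = {}
    for row in rows:
        key, val = row[lhs], row[rhs]
        # setdefault stores the first value seen for this key; any later
        # conflicting value means the dependency does not hold
        if seen.setdefault(key, val) != val:
            return False
    return True

customers = [
    {"c_custkey": 1, "c_name": "Ann", "n_name": "FRANCE"},
    {"c_custkey": 2, "c_name": "Bob", "n_name": "FRANCE"},
    {"c_custkey": 1, "c_name": "Ann", "n_name": "FRANCE"},
]
# c_name is a property of the c_custkey level (c_custkey -> c_name holds)
print(functionally_determines(customers, "c_custkey", "c_name"))   # True
# n_name does not determine c_custkey: one nation, many customers
print(functionally_determines(customers, "n_name", "c_custkey"))   # False
```

In practice the dependencies would be read from declared keys rather than inferred from data, but the same test also explains why customer → nation is a valid hierarchy step (many-to-one) while the reverse is not.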
  • 63. 17th Nov. 2016 ER'2016@Gifu.Japan Filters Processing: not all tuples in the dimension table should be considered, so we have to extract, from the WHERE clause, filters defined over dimension tables that are not useful for multi-dimensional design, Example 1: Q12 of TPC-H benchmark For each line shipping mode and year, count the number of high-priority orders (high line count) and the number of not-high-priority orders (low line count) over orders’ facts, and consider only lines such that l_commit_date < l_receipt_date and l_ship_date < l_commit_date. These are filters over a dimension table. Example 2: Q19 of TPC-H benchmark Calculate revenue for particular parts # 20 OLAP Cube Design Dimension Definition and Filters’ processing (4/7)
  • 64. 17th Nov. 2016 ER'2016@Gifu.Japan Example 1: Q12 of TPC-H Benchmark # 21 OLAP Cube Design Dimension Definition and Filters’ processing (5/7)
  • 65. 17th Nov. 2016 ER'2016@Gifu.Japan Example 2: Q19 of TPC-H Benchmark. # 22 OLAP Cube Design Dimension Definition and Filters’ processing (6/7)
  • 66. 17th Nov. 2016 ER'2016@Gifu.Japan # 23 OLAP Cube Design Dimension Definition and Filters’ processing (7/7)
  • 67. 17th Nov. 2016 ER'2016@Gifu.Japan # 24 TPC-H*d Truly OLAP variant of TPC-H benchmark TPC-H SQL workload translated into MDX (MultiDimensional eXpressions) The workload is composed of 23 MDX statements for OLAP cubes and 23 MDX statements for OLAP business queries. Each business question of TPC-H benchmark is mapped to an OLAP cube
  • 68. 17th Nov. 2016 ER'2016@Gifu.Japan # 25 TPC-H*d Q8: From SQL statement to OLAP cube
  • 69. 17th Nov. 2016 ER'2016@Gifu.Japan # 26 TPC-H*d TPC-H*d OLAP Cube C8 Market Share for each supplier nation within a region of customers, for each year and each part type
  • 70. 17th Nov. 2016 ER'2016@Gifu.Japan # 27 TPC-H*d TPC-H*d OLAP Query Q8 Market share of RUSSIA suppliers within the AMERICA region, over the years 1995 and 1996, for part type ECONOMY ANODIZED STEEL
  • 71. 17th Nov. 2016 ER'2016@Gifu.Japan Open-source software implemented in Java Parses MDB schema (.xml) files using the SAX library Performs comparisons of OLAP cubes' characteristics. »For each pair of OLAP cubes, »show whether they have the same fact table or not »compute the number of shared | different | coalescable dimensions »Dimensions are coalescable if they are extracted from the same dimension table and their hierarchies are coalescable »compute the number of shared | different measures »Run merge of OLAP cubes using different similarity functions »Simple distance function: have or not the same fact table »K-means clustering »Distance function computed with weights assigned to dimensions »Propose virtual cubes »Auto-generate a new MDB schema (.xml) »Create an MDB schema from an SQL workload »On-going tests include the TPC-DS benchmark # 28 AutoMDB
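AutoMDB's exact distance functions are not spelled out here; as an illustration only, a weighted distance over fact-table identity and Jaccard similarity of the dimension and measure sets might look like this (the weights, cube descriptions and attribute names are hypothetical):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 for two empty sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cube_distance(c1, c2, w_fact=0.5, w_dims=0.3, w_meas=0.2):
    """Weighted distance between two cube descriptions; 0.0 means identical."""
    fact = 0.0 if c1["fact"] == c2["fact"] else 1.0
    return (w_fact * fact
            + w_dims * (1 - jaccard(c1["dims"], c2["dims"]))
            + w_meas * (1 - jaccard(c1["measures"], c2["measures"])))

c8 = {"fact": "lineitem", "dims": {"time", "customer_nation", "part_type"},
      "measures": {"revenue"}}
c10 = {"fact": "lineitem", "dims": {"time", "customer_nation"},
       "measures": {"revenue"}}
d = cube_distance(c8, c10)
print(round(d, 3))  # 0.1 — close cubes, good candidates for merging
```

Pairs with small distance (same fact table, largely overlapping dimensions and measures) are the natural candidates for merging into one cube or exposing as a virtual cube.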
  • 72. 17th Nov. 2016 ER'2016@Gifu.Japan # 29 AutoMDB Load OLAP Cubes defined in xml file
  • 73. 17th Nov. 2016 ER'2016@Gifu.Japan # 30 AutoMDB Compare OLAP Cubes –have or not same fact table
  • 74. 17th Nov. 2016 ER'2016@Gifu.Japan # 31 AutoMDB Compare Cubes –Group cubes which have same fact table
  • 75. 17th Nov. 2016 ER'2016@Gifu.Japan # 32 AutoMDB Compare Cubes –Auto-generate a new MDB schema
  • 76. 17th Nov. 2016 ER'2016@Gifu.Japan # 33 References Modeling Multidimensional Databases (non-exhaustive list) M. Gyssens and L. V.S. Lakshmanan. A Foundation for Multi-Dimensional Databases. VLDB’1997. R. Agrawal, A. Gupta and S. Sarawagi. Modeling Multidimensional Databases. ICDE’1997. J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. ICDE’1996. P. Vassiliadis. Modeling Multidimensional Databases, Cubes and Cube Operations. SSDBM’1998. L. Cabibbo and R. Torlone. A Logical Approach to Multidimensional Databases. EDBT’1998. D. Cheung, B. Zhou, B. Kao, H. Lu, T. Lam and H. Ting. Requirement-based data cube schema design. CIKM’1999. T. Niemi, J. Nummenmaa and P. Thanisch. Constructing OLAP cubes based on Queries. DOLAP’2001. O. Teste. Towards Conceptual Multidimensional Design in Decision Support Systems. DEXA’2010. A. Cuzzocrea and R. Moussa. Multidimensional Database Design via Schema Transformation: Turning TPC-H into the TPC-H*d Multidimensional Benchmark. COMAD’2013.
  • 77. 17th Nov. 2016 ER'2016@Gifu.Japan 34 Thank you for your Attention Q & A Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems Alfredo Cuzzocrea and Rim Moussa 17th of November, 2016
  • 78. 17th Nov. 2016 ER'2016@Gifu.Japan 1 Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems Part III Challenging Problems Alfredo Cuzzocrea University of Trieste & ICAR Rim Moussa University of Carthage & LaTICE The 35th Intl. Conference on Conceptual Modeling @ Gifu, JAPAN 17th of November, 2016
  • 79. 17th Nov. 2016 ER'2016@Gifu.Japan 2 Tutorial Outline Introduction Part I: State-of-the-Art Part II: Experiences Part III: Challenges Big Data Integration Flexible Schema Model Curse of dimensionality Systems which scale-out Intelligent Recommenders Real-time OLAP Advanced Visualization Conclusion
  • 80. 17th Nov. 2016 ER'2016@Gifu.Japan 3 Big Data Volume »Volume refers to the amount of data, henceforth the challenge is integration at scale, Velocity »Velocity refers to the speed at which new data is generated, henceforth the challenge is to integrate and analyze data while it is being generated, Variety »Variety refers to different types of data; e.g. structured (relational data), semi-structured (XML, JSON, BSON), unstructured (text); henceforth the challenge is integration of different types of data, Veracity »Veracity refers to the messiness or trustworthiness of the data, henceforth the challenge is to integrate uncertain data quality in data sources, Value »Value refers to our ability to turn data into value.
  • 81. 17th Nov. 2016 ER'2016@Gifu.Japan 4 Advanced, high-performance technologies »Load and extract different data formats, and perform complex transformations »Solve heterogeneity »data type heterogeneity (phone type is a number or a string), »semantic heterogeneity (column title vs. column job title), »value heterogeneity (Pr. vs. Prof. vs. Professor) »entity resolution through identification of the same entities when values are misspelled, are synonyms or abbreviations, or originate from different systems (date formats) or different domains (true is 1) »Syncing Across Data Sources »data copies migrated from a wide range of sources at different rates and schedules can rapidly get out of synchronization »Perform data integration at scale Low Solution Cost TPC-DI is a good start for benchmarking integration technologies TPC-DI implementation is on-going Challenge #1: Big Data Integration
  • 82. 17th Nov. 2016 ER'2016@Gifu.Japan 5 Challenge #2: Data Schema Model Schema-based data »E.g. relational model, multi-dimensional model »They define what columns appear, their names, and their datatypes. Schemaless Data and Dynamic Schema »Non-uniform data: custom fields and non-uniform data types »Parse data and query on-the-fly »NoSQL systems (e.g. Apache Pig Latin, SQL-on-Hadoop) Data Lake »All data is loaded from source systems, i.e., no data is turned away. »Data is stored at the leaf level in an untransformed or nearly untransformed state. »Data is transformed and a schema is applied to fulfill the needs of analysis. »Load in the raw data as-is; at processing time, give the data a structure: schema-on-read »High agility: configuration and reconfiguration of models, queries and applications as needed, on-the-fly
  • 83. 17th Nov. 2016 ER'2016@Gifu.Japan 6 Challenge #3: Curse of Dimensionality Datasets in applications like bioinformatics and text processing are characterized by, Many columns, a.k.a. high-dimensional data »100 columns of data → a 100-dimension hyperspace »A data cube of 100 dimensions where each dimension has cardinality 10 → 11^100 aggregate cells Moderate size »One million tuples Proposed Solutions Minimal Cubing Approach [Li et al., 2004] Dimension Reduction & Feature Selection »Principal Component Analysis (PCA) »Linear Discriminant Analysis (LDA) »Canonical Correlation Analysis (CCA) »Latent Semantic Indexing (LSI) for text data »Complex and not scalable
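The 11^100 figure follows from counting full-cube aggregate cells: each dimension contributes its values plus the special 'all' value, so the total is the product of (cardinality + 1) over all dimensions. A quick check:

```python
def aggregate_cells(cardinalities):
    """Number of cells in a fully materialized cube: prod(c_i + 1) over dimensions."""
    total = 1
    for c in cardinalities:
        total *= (c + 1)  # the dimension's values plus the 'all' aggregate
    return total

# 100 dimensions, cardinality 10 each -> 11^100 aggregate cells
cells = aggregate_cells([10] * 100)
print(cells == 11 ** 100)  # True
```

With only a million base tuples, the overwhelming majority of those cells are empty, which is precisely why full materialization is hopeless and minimal-cubing or dimension-reduction approaches are needed.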
  • 84. 17th Nov. 2016 ER'2016@Gifu.Japan 7 Challenge #4: Scale-out Data Fragmentation Parallel IO Parallel Processing Technologies Parallel Cube Calculus Distributed Relational Data Warehouses + Mid-tier for parallel cube calculus, e.g. OLAP* framework SQL-on-Hadoop Systems e.g. Hive, Spark SQL, Drill, Impala, BigInsights Scalable and Distributed Data Structures K-RP*s, SkipTree, SkipWebs, PN-Tree
  • 85. 17th Nov. 2016 ER'2016@Gifu.Japan Recommenders for performance tuning »Automated selection of materialized views and indexes for SQL workloads »AutoAdmin research project at Microsoft, which explores novel techniques to make databases self-tuning [Agrawal et al., 2000] »MS Database Tuning Advisor »Offline techniques »DBA is completely out of the picture 8 Challenge #5: Intelligent Recommenders Indexes and Materialized Views are physical structures that can significantly accelerate performance.
  • 86. 17th Nov. 2016 ER'2016@Gifu.Japan 9 Challenge #5: Intelligent Recommenders Alerter Approach [Hose et al., 2008] »supports the aggregate configuration of an OLAP server by (1) continuously monitoring information about the workload and the benefit of aggregation tables and (2) alerting the DBA if changes to the current configuration would be beneficial Semi-Automatic Index Tuning: keeping DBAs in the loop [Schnaitter and Polyzotis, 2012] »Online workload analysis with decisions delegated to the DBA »The solution takes into account index interactions »Index interactions: two indices a and b interact if the benefit of a depends on the presence of b.
  • 87. 17th Nov. 2016 ER'2016@Gifu.Japan 10 Challenge #6: Real-time Processing Retrospective Analysis Traditional Architectures »DWs are deployed as part of an OLAP system separated from the OLTP system, »Data propagates down to the OLAP system, but typically after some lag, »This is sufficient for retrospective analytics, but does not suffice in situations that require real-time analytics. Real-time analytics use cases »Intelligent road-traffic management »Remote health-care monitoring »Complex event processing systems New Data Processing Architectures Architectures for Big Data processing at scale »Lambda Architecture by Nathan Marz »Batch processing system + speed processing system »Kappa Architecture »No batch processing system »New software for architecting DW systems!
  • 88. 17th Nov. 2016 ER'2016@Gifu.Japan 11 Challenge #6: Real-time Processing Real-time DSS Characteristics Data Stream Systems »Streams are pushed at the system »Long-running (continuous) queries »Stream characteristics are often unknown and time-varying, so on-line profiling and adaptation are necessary »Continuously arriving data streams, up to gigabits per second in network monitoring »Would like to run continuous OLAP/mining queries »The working set of typical systems might fit in memory; disk is mostly for archiving purposes
17th Nov. 2016 ER'2016@Gifu.Japan 12
Challenge #6: Real-time Processing
Lambda Architecture: Big Picture (1/4)
17th Nov. 2016 ER'2016@Gifu.Japan 13
Challenge #6: Real-time Processing
Lambda Architecture: Batch Layer (2/4)
What does it do?
»Focuses on the ingest and storage of large quantities of data and the computation of views from that data
»Storage of an immutable, append-only, constantly expanding master copy of the system's data
»Computation of views: derivative data for consumption by the serving layer
Technologies
»Batch systems: Apache Hadoop/MapReduce, Apache Pig, Apache Spark
»Batch view databases: ElephantDB, SploutSQL
17th Nov. 2016 ER'2016@Gifu.Japan 14
Challenge #6: Real-time Processing
Lambda Architecture: Speed Layer (3/4)
What does it do?
»Processes raw stream data into views and deploys those views on the serving layer
»Stream processing, continuous computation
Technologies
»Apache Storm, Apache Spark Streaming
»Speed-layer views need to be stored in a randomly writable database and are read & write
17th Nov. 2016 ER'2016@Gifu.Japan 15
Challenge #6: Real-time Processing
Lambda Architecture: Serving Layer (4/4)
What does it do?
»Focuses on serving up views of the data as quickly as possible
»Queries the batch and real-time views and merges them
»Should meet scalability and fault-tolerance requirements
Technologies
»Read-only: ElephantDB, Druid
»Read & write: Riak, Cassandra, Redis
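The three layers above can be sketched end to end in a few lines. This is a deliberately minimal illustration of the query-time merge, with invented data; in a real deployment the speed layer would be incremental (e.g. Storm) rather than a recomputation:

```python
# Lambda architecture in miniature: a batch view recomputed from the
# immutable master dataset, a speed view covering events that arrived
# after the last batch run, and a serving layer merging both at query time.

master_data = [("fr", 3), ("it", 2), ("fr", 1)]   # immutable, append-only log
recent_events = [("fr", 5), ("es", 4)]            # arrived since last batch run

def build_view(log):
    """Batch layer: full recomputation of a count view over a log."""
    view = {}
    for key, n in log:
        view[key] = view.get(key, 0) + n
    return view

def query(key, batch, speed):
    """Serving layer: merge the batch view with the real-time view."""
    return batch.get(key, 0) + speed.get(key, 0)

batch = build_view(master_data)
speed = build_view(recent_events)   # speed layer is incremental in reality
print(query("fr", batch, speed))    # 3 + 1 (batch) + 5 (speed) = 9
print(query("es", batch, speed))    # 0 (batch) + 4 (speed) = 4
```

When the next batch run absorbs `recent_events` into `master_data`, the speed view is discarded and rebuilt, which is the architecture's recipe for bounding the complexity of the real-time path.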
17th Nov. 2016 ER'2016@Gifu.Japan 16
Challenge #7: Visualization
Goal: visualize multidimensional data sets by capturing the dimensionality of the data
»A data cube is a representation of a multidimensional data set
OLAP client technologies
»Pivot table (cross-tab) with OLAP ops: drill-down, roll-up, pivot, slice, dice
»Bar charts, pie charts, and time series
»e.g., Apache jPivot, Saiku
Challenges for sophisticated visualization techniques
»Interactive
»Reflect current data: updateable
»Support of large datasets
»Ergonomic
17th Nov. 2016 ER'2016@Gifu.Japan 17
Challenge #7: Visualization
Visualization Techniques
»Tree-map: displays hierarchical (tree-structured) data as a set of nested rectangles
»Scatter plot: uses Cartesian coordinates to display values of typically two variables for a set of data
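The tree-map idea (areas proportional to values) can be sketched with the simplest layout, slice-and-dice on flat data; real tree-maps recurse over the hierarchy and usually use the squarified variant for better aspect ratios:

```python
# Slice-and-dice treemap layout sketch: split a rectangle along one axis
# in proportion to each item's value, so area encodes the measure.

def treemap(values, x, y, w, h, vertical=True):
    """Return one (x, y, width, height) rectangle per value."""
    total = sum(values)
    rects, offset = [], 0.0
    for v in values:
        frac = v / total
        if vertical:    # slice the width
            rects.append((x + offset, y, w * frac, h))
            offset += w * frac
        else:           # slice the height
            rects.append((x, y + offset, w, h * frac))
            offset += h * frac
    return rects

print(treemap([1, 1, 2], 0, 0, 100, 50))
# [(0.0, 0, 25.0, 50), (25.0, 0, 25.0, 50), (50.0, 0, 50.0, 50)]
```

For a hierarchy, the same routine is applied recursively inside each rectangle, alternating the slicing axis per level.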
17th Nov. 2016 ER'2016@Gifu.Japan 18
Challenge #7: Visualization
Visualization Techniques
»Parallel coordinate plot: visualizes high-dimensional geometry and analyzes multivariate data
»Choropleth map: a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map
17th Nov. 2016 ER'2016@Gifu.Japan 19
Challenge #7: Visualization
Multidimensional Networks (1/2)
»Networks' application domains: social networks (LinkedIn, Facebook, …), the web, communication networks, …
»A multidimensional network combines a graph structure with multidimensional attributes
»Example: Graph Cube [Zhao et al. 2011] (figure: -1- and -2- are friends)
17th Nov. 2016 ER'2016@Gifu.Japan 20
Challenge #7: Visualization
Graph Cube: Example Queries (2/2)
Cuboid queries (a single multidimensional space)
»Q1: What is the network structure between different genders?
»Q2: What is the network structure between the various gender and location combinations?
Crossboid queries (multiple multidimensional spaces)
»What is the network structure between the user with ID = 3 and the various locations?
»What is the network structure between users grouped by gender vs. users grouped by location?
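A cuboid query like Q1 amounts to collapsing vertices by an attribute and aggregating the edges between the resulting groups. A toy sketch of that aggregation (the user data is invented; this is the idea behind [Zhao et al. 2011], not their implementation):

```python
# Graph cube cuboid in miniature: collapse a friendship network onto the
# "gender" vertex dimension and count the edges between groups.

users = {1: "M", 2: "F", 3: "F", 4: "M"}          # user id -> gender
friendships = [(1, 2), (1, 3), (2, 3), (3, 4)]    # undirected edges

def cuboid(edges, attr):
    """Aggregate the graph by a vertex attribute; edge weight = #links."""
    agg = {}
    for u, v in edges:
        key = tuple(sorted((attr[u], attr[v])))   # undirected: order-free key
        agg[key] = agg.get(key, 0) + 1
    return agg

print(cuboid(friendships, users))
# {('F', 'M'): 3, ('F', 'F'): 1}
```

The result is itself a small weighted graph over the attribute values, which is exactly what a cuboid query returns for visualization.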
17th Nov. 2016 ER'2016@Gifu.Japan 21
Challenge #7: Visualization
Nanocubes by AT&T (1/3)
Real-time exploratory visualization of large spatiotemporal and multidimensional datasets
»Real-time: extremely fast queries
»Spatial: a spatial region that can be either a rectangle covering most of the world or a heatmap of activity
»Multidimensional: besides latitude, longitude, and time, there are other attributes such as tweet device (Android or iPhone) and tweet language (eng, fr, it, sp, ru, …)
Nanocube algorithm outline [Lins et al., 2013]
»For every object o_i, find the finest address of the schema S hit by this object and update the time series associated with this address
»Update, in a deepest-first fashion, all coarser addresses hit by o_i
Cons: memory usage; when indexing all six dimensions (latitude, longitude, time, language, device, application), the 210 million points from Twitter take around 45 GB of memory
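The insertion outline above can be sketched with quadtree-style address strings. This is a drastic simplification (counts instead of time series, one dimension, no sub-structure sharing, which is the part real nanocubes rely on to save memory):

```python
# Nanocube insertion sketch: each object updates the count at its finest
# spatial address and at every coarser prefix address, so that any region
# query later becomes a direct lookup instead of a scan.

counts = {}

def insert(address):
    """address: quadtree path as a digit string, e.g. '021'."""
    # update the finest address and, deepest first, all coarser prefixes
    for depth in range(len(address), -1, -1):
        prefix = address[:depth]
        counts[prefix] = counts.get(prefix, 0) + 1

insert("021")   # a point whose quadtree path is cell 0 -> 2 -> 1
insert("023")   # a nearby point sharing the coarser cell "02"
print(counts["02"], counts[""])   # 2 points under cell "02", 2 in total
```

The price of those O(1) lookups is exactly the memory cost the slide mentions: every inserted point touches one counter per level of every dimension hierarchy.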
17th Nov. 2016 ER'2016@Gifu.Japan 22
Challenge #7: Visualization
Nanocubes by AT&T (2/3)
»Which device is more popular for tweeting?
»Is one device more popular in certain areas than in others?
»How has this popularity changed over time?
(Figure: l_spatial1 is coarser than l_spatial2)
17th Nov. 2016 ER'2016@Gifu.Japan 23
Challenge #7: Visualization
Nanocubes by AT&T (3/3)
(Figure: intermediate nanocubes generated after each tweet is inserted)
17th Nov. 2016 ER'2016@Gifu.Japan 25
References
»M. Fowler. Schemaless Data Structures. 2013. http://martinfowler.com/articles/schemaless/
»N. Marz and J. Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems, 1st Edition.
»S. Agrawal, S. Chaudhuri and V. Narasayya. Automated Selection of Materialized Views and Indexes for SQL Databases. VLDB 2000. http://www.research.microsoft.com/dmx/AutoAdmin
»K. Hose, D. Klan, M. Marx and K. Sattler. When Is It Time to Rethink the Aggregate Configuration of Your OLAP Server? VLDB 2008.
»K. Schnaitter and N. Polyzotis. Semi-Automatic Index Tuning: Keeping DBAs in the Loop. VLDB 2012.
»P. Zhao, X. Li, D. Xin and J. Han. Graph Cube: On Warehousing and OLAP Multidimensional Networks. SIGMOD 2011.
»L. D. Lins, J. T. Klosowski and C. E. Scheidegger. Nanocubes for Real-Time Exploration of Spatiotemporal Datasets. IEEE Trans. Vis. Comput. Graph., 2013. https://github.com/laurolins/nanocube
17th Nov. 2016 ER'2016@Gifu.Japan 26
Thank You for Your Attention. Q & A
Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems
Alfredo Cuzzocrea and Rim Moussa
17th of November, 2016