17th Nov. 2016, ER'2016 @ Gifu, Japan
Multi-Dimensional Database Modeling and
Querying: Methods, Experiences and
Challenging Problems
Alfredo Cuzzocrea University of Trieste & ICAR
Rim Moussa LaTICE Lab. & University of Carthage
The 35th Intl. Conference on Conceptual Modeling
@ Gifu, JAPAN
17th of November, 2016
Tutorial Outline
Part I: Data Warehouse Systems
Decision Support Systems
DWS Architectures
DW Schemas
OLAP cube and OLAP operations
DSS Benchmarks
OLAP Mandate
Part II: Multi-dimensional design
Part III: Challenging Problems
Conclusion
Translating Data into Insights & Opportunities!
From DIKW Pyramid To Decision Support System
»Data: descriptions of things, events, activities and transactions; internal or external
»Information: consolidated and organized data that has meaning, value and purpose; "know-thats"
»Knowledge: processed data that conveys understanding, experience or learning applicable to a problem or activity
»Wisdom: the ability to increase effectiveness; knowing what to do and how to act
Decision Support System
»A collection of integrated software applications and hardware that form the
backbone of an organization’s decision making process.
»Performs different types of analyses
»What-if analysis
»Simulation analysis
»Goal-seeking analysis
Business Intelligence
»BI encompasses a variety of tools, applications and methodologies that enable
organizations to collect data from internal systems and external sources,
prepare it for analysis, develop and run queries against the data, and create
reports, dashboards and data visualizations to make the analytical results
available to corporate decision makers as well as operational workers.
»Integration (ETL) tools, data warehousing tools, OLAP technologies, OLAP
clients, mining structures, reporting tools, dashboards’ design tools
»Integration workflows design methods, Multi-dimensional design methods, …
Decision Support System
OLTP vs OLAP (1/3)
OLTP: On-Line Transaction Processing
»Users/ applications interacting with database in real-time
»e.g.: on-line banking, on-line payment
»Required properties of memory access
ACID properties: Atomicity, Consistency, Isolation, Durability
Coherence
»Benchmarks: TPC-C, TPC-E
OLAP: On-Line Analytical Processing
»Experts doing offline data analysis
»e.g.: data warehouses, decision support systems
»Memory patterns: Indexed and Sequential access patterns
»Benchmarks: TPC-H, TPC-DS
Decision Support System
OLTP vs OLAP (3/3)
OLTP
 Users/apps interacting with the DB in real-time
 E.g., customers buying/selling books at Amazon
 Many concurrent users, queries, connections
 DB is in 3NF
 Simple queries, often predetermined
 Relatively little data (~GB)
 Each query touches little data
 Little, simple computation per query
 Data is continually updated
 Accuracy and recovery are important, hence strict transactions
 Throughput is most important
OLAP
 Business experts doing offline data analysis
 E.g., bestsellers at Amazon
 Few concurrent users, queries, connections
 DB is denormalized
 Complex ad-hoc queries
 Very large data sets (~TB)
 Queries touch large data sets
 Mining is compute-intensive, with complex operations
 Data is mostly read-only
 Strict transactional semantics are not needed
 Latency is more important
Data Integration
Data Integration is the process of integrating data from multiple sources
to provide a single view over all these sources
Integration can be physical (copy the data to warehouse) or virtual
(keep the data at the sources)
Data Acquisition: This is the process of moving company data from
the source systems into the warehouse. Data acquisition is an
ongoing, scheduled process, which is executed to keep the warehouse
current to a pre-determined period in time.
Changed Data Capture: The periodic update of the warehouse from
the transactional system(s) is complicated by the difficulty of
identifying which records in the source have changed since the last
update.
Data Cleansing: This is typically performed in conjunction with data
acquisition. It is a complicated process that validates and, if necessary,
corrects the data before it is inserted into the warehouse
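The changed-data-capture step can be sketched by diffing the current source extract against the previous snapshot. Below is a minimal Python illustration; the `id` key and row layout are invented for the example and not tied to any specific ETL tool.

```python
# Snapshot-diff CDC sketch: compare the previous extract with the current
# one to classify rows as inserted, updated or deleted.
def capture_changes(previous, current, key="id"):
    prev_by_key = {row[key]: row for row in previous}
    inserts, updates = [], []
    for row in current:
        old = prev_by_key.get(row[key])
        if old is None:
            inserts.append(row)       # new record
        elif old != row:
            updates.append(row)       # changed record
    deleted_keys = set(prev_by_key) - {row[key] for row in current}
    return inserts, updates, sorted(deleted_keys)

prev = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
curr = [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}]
ins, upd, dele = capture_changes(prev, curr)
# ins -> [{"id": 3, "qty": 1}], upd -> [{"id": 2, "qty": 9}], dele -> [1]
```

In practice, CDC tools avoid full-snapshot diffs by reading transaction logs or timestamps, but the classification of records is the same.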
Data Integration
Software Solutions
»Known as ETL tools, Integration services, ...
»Vendors (non-exhaustive list): IBM (InfoSphere Data Event Publisher),
Informatica (Informatica Enterprise Data Integration), Information
Builders (iWay DataMigrator), Microsoft (Microsoft's SQL Server
Integration Services), Oracle (Oracle GoldenGate), Pentaho (Pentaho DI -
prev. Kettle), SAP (BusinessObjects Data Integrator), SAS (SAS DataFlux)
and Talend (Talend Open Studio).
Kettle (Pentaho) and Talend offer open-source versions.
Microsoft bundles integration services with its database products.
Integrating Heterogeneous data sources: Lazy or Eager?
Lazy Integration aka Query-driven approach
»Accept a query, determine which sources can answer it, and devise the
execution tree: generate appropriate sub-queries for each source, obtain
result sets, perform the required post-processing (translation, merging and
filtering), and return the answer to the application
Eager Integration aka Warehouse approach
»Information of each source of interest is extracted in-advance, translated, filtered
and processed as appropriate, then merged with information from other sources
and stored
»Accept a query, answer query from repository
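The lazy approach can be sketched as a mediator fanning a query out to per-source wrappers, then merging and filtering the partial results. The `Wrapper`/`Mediator` classes and sample sources below are illustrative assumptions, not a real integration system.

```python
class Wrapper:
    """Translates the mediator's query into a source-specific lookup."""
    def __init__(self, source):
        self.source = source  # a list of dicts standing in for a data source

    def query(self, predicate):
        return [row for row in self.source if predicate(row)]

class Mediator:
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, predicate):
        results = []
        for w in self.wrappers:          # sub-query each source
            results.extend(w.query(predicate))
        # post-processing: merge and de-duplicate on a common key
        seen, merged = set(), []
        for row in results:
            if row["id"] not in seen:
                seen.add(row["id"])
                merged.append(row)
        return merged

src_a = [{"id": 1, "amount": 10}, {"id": 2, "amount": 99}]
src_b = [{"id": 2, "amount": 99}, {"id": 3, "amount": 50}]
mediator = Mediator([Wrapper(src_a), Wrapper(src_b)])
answer = mediator.query(lambda r: r["amount"] >= 50)
# answer -> [{"id": 2, "amount": 99}, {"id": 3, "amount": 50}]
```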
Data Integration
Eager vs. Lazy ?
Eager
 Integrate in advance, i.e., before queries
 Copy data from sources
 Query answer set could be stale; need to refresh the data warehouse
 Operates even when sources are unavailable
 High query performance through pre-built data summaries; local
processing at the sources is unaffected
 OLAP queries are not visible outside the warehouse
Lazy
 Integrate on demand, i.e., at query time
 Leave data at the sources
 Query answer set is up to date
 No copy, so no refresh or storage cost
 Out of service if sources are unavailable
 Sources are drained: interference with their local processing
Lazy Data Integration
 Query-driven Architecture
[Diagram: client queries go to a MEDIATOR, which dispatches sub-queries to a WRAPPER per Source]
Eager Data Integration
 Warehouse System Architecture
Data Warehouse Definition
By Bill Inmon "A warehouse is a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of management's decision
making process"
»Subject-oriented: The data in the database is organized so that all the data
elements relating to the same real-world event or object are linked together;
e.g. finance, human resources, sales …
»Time-variant: The changes to the data in the database are tracked and recorded
so that reports can be produced showing changes over time. Hence, all addresses
of a customer are in the DW.
»Non-volatile: Data in the database is never over-written or deleted - once
committed, the data is static, read-only, but retained for future reporting;
»Integrated: Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole. There is only a single way to identify
a product
By Ralph Kimball: “A copy of transaction data specifically structured for
query and analysis”
Inmon's top-down approach vs. Kimball's bottom-up approach
Constellation Schema
»Galaxy schema
»Multiple fact tables which
share dimensions
Snowflake Schema
»Hierarchical relationships exist
among Dimensions Tables
»Normalized schema
Star Schema
»A single large central fact table
surrounded by multiple dimension tables
»All Dimensions Tables have a hierarchical
relationship with the Fact Table
Data Warehouse Schemas
[Diagrams: a star schema (one fact table linked to dimension tables 1-4), a snowflake schema (a fact table with dimension tables 1-7, some normalized off others), and a constellation schema (fact tables 1 and 2 sharing dimension tables 1-6)]
Multi-dimensional Data Analysis
»refers to the process of summarizing data across multiple levels (called
dimensions) and then presenting the results in a multi-dimensional grid
format.
OLAP Cube
»Multi-dimensional representation of data
»aka Crosstab, Data Pivot
Facts are the objects that represent the subject of the desired analyses.
»Examples: sales records, weather records, cab trips, …
»The fact table contains three types of attributes: measured attributes, foreign keys
to dimension tables, and degenerate dimensions
Dimension(s):
»Levels are individual values that make up dimensions
»Examples
»Date dimension (quarter, month, day)
»Time dimension (hour, min, sec)
»Geography dimension (Country, city, postal code)
Measure(s):
»Examples: revenue, lost revenue, sold quantities, expenses, forecasts, …
»Use aggregate functions: min, max, count, distinct-count, sum, average, …
OLAP Cube
OLAP Operations
On-Line Analytical Processing
OLAP Operations
Drill-down
stepping down to lower level data or introducing new dimensions
Roll-up
summarizes data by climbing up hierarchy or by dimension reduction
Pivot
rotate, reorienting the cube
Slice
selecting data on one dimension
Dice
selecting data on multiple dimensions
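The operations above can be illustrated on a toy cube held as a Python dict keyed by (year, country, product); the data and dimension names are invented for the example.

```python
# Toy cube: (year, country, product) -> sales.
cube = {
    (2015, "JP", "book"): 10, (2015, "JP", "pen"): 4,
    (2015, "FR", "book"): 6,  (2016, "JP", "book"): 8,
    (2016, "FR", "pen"): 3,
}

def slice_(cube, axis, value):
    # Slice: fix one dimension to a single value.
    return {k: v for k, v in cube.items() if k[axis] == value}

def dice(cube, predicate):
    # Dice: select on multiple dimensions at once.
    return {k: v for k, v in cube.items() if predicate(*k)}

def roll_up(cube, axis):
    # Roll-up by dimension reduction: aggregate one dimension away.
    out = {}
    for k, v in cube.items():
        reduced = k[:axis] + k[axis + 1:]
        out[reduced] = out.get(reduced, 0) + v
    return out

sales_2015 = slice_(cube, 0, 2015)                             # 3 cells
jp_2015 = dice(cube, lambda y, c, p: y == 2015 and c == "JP")  # 2 cells
by_year_country = roll_up(cube, 2)                             # drop product
```

Drill-down is the inverse of roll-up: it requires the finer-grained cells (here, the original `cube`) to still be available.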
OLAP Operations
Roll-up & Drill-down
OLAP Operations
Slice & Dice
Derived Data
OLAP Indices
»Inverted lists
»Bitmap Indices
»Join indices
»Text indices
Materialization of a data cube is a way to
pre-compute and store multi-dimensional
aggregates so that multi-dimensional
analysis can be performed on-the-fly [Li
et al., 2004]
Calculated Attributes
Aggregate Tables (aka materialized tables)
Data synopses: histograms and sketches
Trade-off: improved query performance vs. maintenance & refresh cost vs. storage requirements
Cuboids Lattice
What to materialize?
[Lattice diagram: the 0-D apex cuboid (All), 1-D cuboids (d1:Time, d2:Item, d3:Customer, d4:Supplier), 2-D cuboids (d1,d2 … d3,d4), 3-D cuboids, and the 4-D base cuboid d1,d2,d3,d4]
For an n-dimensional cube
»2^n cuboids
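The lattice is just the power set of the dimension set, which is easy to enumerate; a short Python check of the 2^n count:

```python
from itertools import combinations

def cuboids(dimensions):
    """Yield every cuboid (group-by subset), from the apex () to the base."""
    for r in range(len(dimensions) + 1):
        for subset in combinations(dimensions, r):
            yield subset

dims = ("d1:Time", "d2:Item", "d3:Customer", "d4:Supplier")
lattice = list(cuboids(dims))
# len(lattice) == 2**4 == 16: one 0-D apex, four 1-D, six 2-D, four 3-D, one 4-D
```

Because the count doubles with every dimension, materializing the full lattice quickly becomes infeasible, hence the selection problem.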
ROLAP Server
»Relational OLAP
»Uses relational or extended-relational DBMS to store and manage warehouse
data and OLAP middleware
»Includes optimization of DBMS back-end, implementation of aggregation
navigation logic, and additional tools and services
»High scalability
»E.g. Mondrian (Pentaho BI suite)
MOLAP Server
»Multidimensional OLAP
»Sparse array-based multidimensional storage engine
»Fast indexing to pre-computed summarized data
»E.g. Palo
HOLAP Server
»Low-level: relational / high-level: array
»High flexibility
»E.g. MSAS (Microsoft)
OLAP Servers
Structured Query Language (SQL)
»Relational and static schema
»Data Definition, data Manipulation, and Data Control Language
»Analytic Functions (window functions over partition by …)
»Cube, roll-up and grouping sets operators
MultiDimensional eXpressions (MDX)
»Invented by Microsoft in 1997
»For querying and manipulating the multidimensional data stored in OLAP cubes
»Static schema
Data Flow programming language
»Google Sawzall, Apache Pig Latin, IBM InfoSphere Streams
»Dynamic schema
»After data is loaded, multiple operators are applied on that data before the final
output is stored.
Query Languages
Load Data → Apply Schema → Apply Filter → Group Data → Apply Aggregate Function → Sort Data → Store Output
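The data-flow style can be mimicked by chaining plain functions over a list of records; a hedged Python sketch with invented field names and data:

```python
# Data-flow pipeline sketch: load -> apply schema -> filter -> group ->
# aggregate -> sort. Each stage consumes the previous stage's output.
raw = ["JP,book,10", "JP,pen,4", "FR,book,6"]          # "loaded" raw lines

def apply_schema(lines):
    # Assign field names only after loading (dynamic schema).
    return [dict(zip(("country", "product", "qty"), l.split(","))) for l in lines]

def apply_filter(rows):
    return [r for r in rows if r["product"] == "book"]

def group_and_aggregate(rows):
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0) + int(r["qty"])
    return totals

result = dict(sorted(group_and_aggregate(apply_filter(apply_schema(raw))).items()))
# result == {"FR": 6, "JP": 10}
```

In Pig Latin the same stages would be LOAD, FOREACH … GENERATE, FILTER, GROUP, an aggregate, ORDER and STORE.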
Query Languages
SQL – Q16 of TPC-H benchmark
Query Languages
MDX –Q16 of TPC-H Benchmark
WITH SET [Brands] AS 'Except({[Part Brand].Members}, {[Part
Brand].[Brand#45 ]})'
SET [Types] AS 'Filter({[Part Type].Members}, (NOT ([Part
Type].CurrentMember.Name MATCHES "(?i)MEDIUM POLISHED.*")))'
SET [Sizes] AS 'Filter({[Part Size].Members}, ([Part Size].CurrentMember IN
{[Part Size].[3], [Part Size].[9], [Part Size].[14], [Part Size].[19], [Part Size].[23],
[Part Size].[36], [Part Size].[45], [Part Size].[49]}))'
SELECT [Measures].[Supplier Count] ON COLUMNS,
nonemptyCrossjoin(nonemptyCrossjoin([Brands], [Types]), [Sizes]) ON ROWS
FROM [Cube16]
Query Languages
Data Flow –Pig Latin script for Q16 of TPC-H benchmark
Decision Support Systems Benchmarks
Non-TPC Benchmarks
Real datasets
»Open data or proprietary data
»fixed size
»Devise a workload or trace the proprietary workload
APB-1: no scale factor
TPC Benchmarks
The Transaction Processing Performance Council (TPC) was founded in 1988 to define benchmarks
In 2009, the TPC Technology Conference (TPCTC) was established as an international
conference series on performance evaluation and benchmarking
Examples of benchmarks relevant for benchmarking decision support
systems: TPC-H, TPC-DS and TPC-DI
Common characteristics of TPC benchmarks
»Synthetic data
»Scale factor allowing generation of different volumes, from 1 GB to 1 PB
Decision Support Systems Benchmarks
TPC-H Benchmark Schema (1/2)
TPC-H Benchmark: complex schema
22 ad-hoc SQL statements (star queries, nested queries, …) + refresh functions
Decision Support Systems Benchmarks
TPC-H Benchmark (2/2)
TPC-H Benchmark Metrics
2 Metrics
»QphH@Size is the number of queries per hour that the system
under test can handle for a fixed load
»$/QphH@Size represents the ratio of cost to performance, where the
cost is the cost of ownership of the SUT (hardware, software,
maintenance).
Variants of TPC-H Benchmarks
TPC-H*d Benchmark –detailed in Part II
»Turning TPC-H benchmark into a Multi-dimensional benchmark
»Few schema changes
»No update to workload requirement
»MDX workload for OLAP cubes and OLAP queries
SSB: Star Schema Benchmark
»Turning TPC-H benchmark into star-schema
»Workload composed of 12 queries
TPC-H translated into Pig Latin (Apache Hadoop Ecosystem)
»22 Pig Latin scripts which load and process TPC-H raw data files (.tbl files)
Decision Support Systems Benchmarks
TPC-DS Benchmark (1/2)
TPC-DS Benchmark: 7 data marts
Decision Support Systems Benchmarks
TPC-DS Benchmark (2/2)
TPC-DS Benchmark Workload
Hundreds of queries (99 query templates)
OLAP, windowing functions, mining, and reporting queries
Concurrent data maintenance
TPC-DS Benchmark Metrics
3 Metrics
»QphDS@Size is the number of queries per hour that the
system under test can handle for a fixed load.
»Data maintenance and load times are measured and included.
»$/QphDS@Size represents the ratio of cost to performance, where the
cost is a 3-year cost of ownership of the SUT (hardware, software,
maintenance)
»System Availability Date: the date when the system is available to
customers.
TPC-DS implementations
TPC-DS v2.0
»Extension for non-relational systems such as Hadoop/Spark big data
systems
Decision Support Systems Benchmarks
TPC-DI Benchmark (1/3)
For benchmarking Data Integration technologies
Synthetic data of a fictitious retail brokerage firm
»Internal Trading system data, Internal Human resources data, Internal
CRM System and External data
»Different data scales
»Data extracted from different sources:
»Structured (CSV)
»Semi-structured (XML)
»Multi-record files
»Change Data Capture (CDC)
Complex Data Integration Tasks
Load large volumes of historical data
Load incremental updates
Execute complex transformations
Check and ensure consistency of data
TPC-DI Benchmark
Complex Transformations (2/3)
TPC-DI implements 18 complex transformations, which include the
following tasks:
Transform XML into relational data
Detect changes in dimension data, and apply appropriate tracking
mechanisms for history-keeping dimensions
Filter input data according to pre-defined conditions
Identify new, deleted and updated records in input data
Merge multiple input files of the same structure
Join data of one input file to data from another input file with different
structure
Standardize entries of the input files
Join data from input file to dimension table
Join data from multiple input files with separate structures
Consolidate multiple change records per day and identify the most current
Perform extensive arithmetic calculations
Read data from files with variable type records
Check data for errors or for adherence to business rules
Detect changes in fact data, and journal updates to reflect the current state
TPC-DI Benchmark
Metrics (3/3)
Metrics
Performance Metric
TPC_DI_RPS = Trunc(GeoMean(TH, Min(TI1 , TI2)))
TH: Throughput of Historical Load
TI1: Throughput of Incremental Update 1
TI2: Throughput of Incremental Update 2
Price/Performance Metric
Price-per-TPC_DI_RPS = $ / TPC_DI_RPS
$ is the total 3-year pricing
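The arithmetic of the metric is straightforward; a small Python sketch with invented throughput figures (the benchmark specification defines precisely how TH, TI1 and TI2 are measured):

```python
import math

def tpc_di_rps(th, ti1, ti2):
    # TPC_DI_RPS = Trunc(GeoMean(TH, Min(TI1, TI2)))
    geomean = math.sqrt(th * min(ti1, ti2))   # geometric mean of two values
    return math.trunc(geomean)

def price_per_rps(total_3yr_price, rps):
    # Price/performance metric: total 3-year pricing divided by TPC_DI_RPS.
    return total_3yr_price / rps

rps = tpc_di_rps(th=9000, ti1=4000, ti2=4500)   # trunc(sqrt(9000 * 4000)) = 6000
```

Taking the minimum of the two incremental-update throughputs penalizes systems whose update performance degrades between runs.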
OLAP Mandate
12 Rules for Evaluating OLAP products by E.F. Codd et al.
Rule#1: Multidimensional Conceptual View
»The multidimensional conceptual view facilitates OLAP model design and
analysis, as well as inter and intra dimensional calculations.
Rule#2: Transparency
»Whether OLAP is or is not part of the user’s customary front-end product,
that fact should be transparent to the user. If OLAP is provided within the
context of a client server architecture, then this fact should be transparent to
the user-analyst as well.
Rule#3: Accessibility
»The OLAP tool must map its own logical schema to heterogeneous physical
data stores, access the data, and perform any conversions necessary to
present a single, coherent and consistent user view. Moreover, the tool and
not the end-user analyst must be concerned about where or from which type
of systems the physical data is actually coming
OLAP Mandate
12 Rules for Evaluating OLAP products by E.F. Codd et al.
Rule#4: Consistent Reporting Performance
»As the number of dimensions or the size of the database increases, the OLAP
user-analyst should not perceive any significant degradation in reporting
performance.
Rule#5: Client-Server Architecture
»Most data currently requiring on-line analytical processing is stored on
mainframe systems and accessed via personal computers. It is imperative that
the server component of OLAP tools be sufficiently intelligent such that
various clients can be attached with minimum effort and integration
programming.
Rule#6: Generic Dimensionality
»Every data dimension must be equivalent in both its structure and
operational capabilities. Dimensions are symmetric, so the basic data
structure, formulae, and reporting formats should not be biased toward any
one data dimension.
OLAP Mandate
12 Rules for Evaluating OLAP products by E.F. Codd et al.
Rule#7: Dynamic Sparse Matrix Handling
»The OLAP tools’ physical schema must adapt fully to the specific analytical
model being created to provide optimal sparse matrix handling.
»By adapting its physical data schema to the specific analytical model, OLAP
tools can empower user analysts to easily perform types of analysis which
previously have been avoided because of their perceived complexity.
Rule#8: Multi-User Support
»OLAP tools must provide concurrent access (retrieval and update), integrity,
and security.
Rule#9: Unrestricted Cross-dimensional Operations
»The various roll-up levels within consolidation paths, due to their inherent
hierarchical nature, represent in outline form, the majority of 1:1, 1:M, and
dependent relationships in an OLAP model or application. Accordingly, the
tool itself should infer the associated calculations and not require the user-
analyst to explicitly define these inherent calculations.
OLAP Mandate
12 Rules for Evaluating OLAP products by E.F. Codd et al.
Rule#10: Intuitive Data Manipulation
»Consolidation path re-orientation, drilling down across columns or rows,
zooming out, and other manipulation inherent in the consolidation path
outlines should be accomplished via direct action upon the cells of the
analytical model, and should neither require the use of a menu nor multiple
trips across the user interface.
Rule#11: Flexible Reporting
»Reporting must be capable of presenting data to be synthesized, or
information resulting from animation of the data model according to any
possible orientation. This means that the rows, columns, or page headings
must each be capable of containing/displaying from 0 to N dimensions each,
where N is the number of dimensions in the entire analytical model.
Rule#12: Unlimited Dimensions and Aggregation Levels
»An OLAP tool should be able to accommodate at least fifteen and preferably
twenty data dimensions within a common analytical model.
References
 M. Fricke, The Knowledge Pyramid: A Critique of the DIKW Hierarchy. Journal of Information
Science. 2009.
 E.F. Codd, S.B. Codd and C.T. Salley, Providing OLAP to User Analysts: an IT mandate, 1993.
 J. Widom, Integrating Heterogeneous Databases: Lazy or Eager? ACM Computing Surveys,
28(4es), 1996
 Y.R. Cho, Data Warehouse and OLAP Operations www.ecs.baylor.edu/faculty/cho/4352
 TPC homepage http://www.tpc.org/
 M. Poess, T. Rabl and B. Caufield: TPC-DI: The First Industry Benchmark for Data
Integration. PVLDB 7(13): 1367-1378 (2014)
http://www.vldb.org/pvldb/vol7/p1367-poess.pdf
 X. Li, J. Han, H. Gonzalez: High-Dimensional OLAP: A Minimal Cubing Approach. VLDB 2004.
 C. Imhoff, N. Galemmo, J. G. Geiger. Mastering Data Warehouse Design: Relational and
Dimensional Techniques. 2003.
 R. Kimball, M. Ross, W. Thornthwaite, J. Mundy, B. Becker. The Data Warehouse
Lifecycle Toolkit. 2nd Edition.
 R. Kimball, M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. 2nd Edition.
 H. Garcia-Molina. Data Warehousing Overview: Issues, Terminology, Products.
www.cs.uh.edu/~ceick/6340/dw-olap.ppt (slides)
Thank you for your Attention
Q & A
Multi-Dimensional Database Modeling and Querying: Methods,
Experiences and Challenging Problems
Alfredo Cuzzocrea and Rim Moussa
17th of November, 2016
Multi-Dimensional Database Modeling and
Querying: Methods, Experiences and
Challenging Problems
Part II
Multi-dimensional Benchmark Design
Alfredo Cuzzocrea University of Trieste & ICAR
Rim Moussa University of Carthage & LaTICE
The 35th Intl. Conference on Conceptual Modeling
@ Gifu, JAPAN
17th of November, 2016
Tutorial Outline
Introduction
Part I: State-of-the-Art
Part II: Experiences
TPC-H*d Experience
AutoMDB
TPC-DS*d
Part III: Challenging Problems
Conclusion
Given,
A relational Warehouse schema
A workload, i.e., a set of SQL statements,
W = {Q1, Q2, …, Qn}
where Qi is a parameterized query
How to design the Multi-dimensional DB Schema?
How to define cubes?
Will there be a single cube or multiple cubes? Are there any rules for merging of
cubes? Are there any rules for definition of virtual cubes?
Which optimizations are suitable for performance tuning?
Derived data computation & refresh?
Data partitioning & parallel cube building?
Problem
Idea
Map each business question to an OLAP cube
>> Obtain a multi-dimensional DB schema
Recommend & Test Optimizations
>> Derived Data
>> Data partitioning
>> Cube Merging
SELECT t1.col_a, t1.col_b, …, tn.col_a, tn.col_z,
aggregate_function(column) as measure_1, …,
aggregate_function(expression) as measure_m
FROM table_1 t1, table_2 t2, …, table_n tn
WHERE ti.col_x operator $query_parameter$
AND ti.col_y = tj.col_z
AND …
GROUP BY t1.col_a, t1.col_b, …, tn.col_a, tn.col_z
aggregate_function: min, max, sum, avg, count, count-distinct …
Operator: =, < , <=, >=, !=
SQL Statement Template
Measures feature aggregate functions,
e.g. min, max, count, count-distinct, sum, average, …
Simple Measure
Defined over a single attribute,
e.g. SUM(l_extendedprice),
Measure expressions
Involve more than one attribute,
e.g. SUM(l_extendedprice*(1 - l_discount))
Computed Members
Involve already defined measures or measure expressions,
e.g. M1=SUM(l_extendedprice), M2=COUNT(l_orderkey),
CM = M1 / M2
OLAP Cube Design: Measures’ Definition
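The three kinds of measures can be illustrated over a couple of invented lineitem rows, with Python standing in for the aggregation an OLAP engine would perform:

```python
# Two made-up TPC-H-style lineitem rows.
lineitem = [
    {"l_orderkey": 1, "l_extendedprice": 100.0, "l_discount": 0.10},
    {"l_orderkey": 2, "l_extendedprice": 200.0, "l_discount": 0.05},
]

# Simple measure: one aggregate over a single attribute.
m1 = sum(r["l_extendedprice"] for r in lineitem)

# Measure expression: aggregate over an expression of several attributes.
revenue = sum(r["l_extendedprice"] * (1 - r["l_discount"]) for r in lineitem)

# Computed member: defined over already-defined measures (CM = M1 / M2).
m2 = len({r["l_orderkey"] for r in lineitem})   # COUNT(l_orderkey)
cm = m1 / m2
# m1 == 300.0, revenue == 280.0, m2 == 2, cm == 150.0
```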
All attributes involved in measures and measure expressions
belong to the fact table,
Example: Q10 of TPC-H benchmark
OLAP Cube Design
Fact Table Definition (1/9)
When measurable attributes belong to different tables, the fact
table is a view defined as the join of the relations to which the
attributes belong.
Example 1: Q9 of TPC-H benchmark, where l_extendedprice,
l_discount and l_quantity belong to lineitem, and
ps_supplycost belongs to partsupp.
The fact table is the join of lineitem and partsupp tables. Only
attributes needed for joins with dimension tables (namely l_partkey,
l_orderkey, l_suppkey) and measurable attributes (namely
l_extendedprice, l_discount, l_quantity,
ps_supplycost) are selected.
OLAP Cube Design
Fact Table Definition over multiple tables (2/9)
Q9 SQL statement
OLAP Cube Design
Fact Table Definition over multiple tables (3/9)
OLAP Cube Design
Fact Table Definition over multiple tables (4/9)
Q14 SQL statement
OLAP Cube Design
Fact Table Definition over multiple tables (5/9)
Example 2: Q14 of TPC-H benchmark, where l_extendedprice
and l_discount belong to lineitem, and p_type belongs to
part.
The fact table is the join of lineitem and part tables
Filters Processing: the fact table is defined as a view of facts
with filters
»Extract all filters involving the fact table from the WHERE
clause, such as
(attr_i operator attr_j), where both attr_i and attr_j
belong to the fact table,
(attr_k operator $value$), such that attr_k belongs to the
fact table,
[not] exists (select … from … where attr_k …), such
that attr_k belongs to the fact table,
attr_k [not] in (list of values), such that attr_k
belongs to the fact table,
Example 1: Q10 of TPC-H benchmark
Example 2: Q16 of TPC-H benchmark
Example 3: Q21 of TPC-H benchmark
OLAP Cube Design
Fact Table Definition and filters’ processing (6/9)
Q10 SQL statement
OLAP Cube Design
Fact Table Definition and filters’ processing (7/9)
Q16 SQL statement
OLAP Cube Design
Fact Table Definition and filters’ processing (8/9)
Q21 SQL statement
OLAP Cube Design
Fact Table Definition and filters’ processing (9/9)
First, consider all attributes in the SELECT, WHERE and GROUP
BY clauses,
»Discard measurable attributes, which appear in
measures, measure expressions, or computed members,
»Discard attributes which appear in the WHERE clause
and are used for joining tables or filtering the fact table
with static values,
»Compose time dimension along well known hierarchies,
»Year, quarter, month
»Compose geography dimension along well known
hierarchies,
»Region, nation, city
OLAP Cube Design
Dimension Definition (1/7)
Example: Q10 of TPC-H benchmark
All highlighted attributes are considered for building dimensions
The time dimension over o_orderdate requires order_year and order_quarter
levels
OLAP Cube Design
Dimension Definition (2/7)
Second, find out hierarchical relations, i.e., one-to-many
relationships, and re-organize attributes along hierarchies to
form dimensions’ hierarchies,
»Example: Q10 of TPC-H benchmark
 each customer can be related to at most one nation, but
a nation may be related to many customers
customer_dim:
Customer nation → n_name
Customer details → c_custkey, c_name, c_acctbal, c_address,
c_phone, c_comment
order_dim:
order_year → order_quarter
OLAP Cube Design
Dimension Definition (3/7)
Third, distinguish levels from properties.
Properties are in functional dependency with levels,
Example: Q10 of TPC-H benchmark
 For customer_dim, c_custkey is the level, and all of c_name,
c_acctbal, c_address, c_phone, c_comment are
properties of the c_custkey level.
OLAP Cube Design
Dimension Definition (3/7)
Filters Processing: not all tuples in the dimension table should
be considered, so we extract from the WHERE clause the filters
defined over dimension tables that are not useful for
multi-dimensional design,
Example 1: Q12 of TPC-H benchmark
For each line shipping mode and year, count the number of high-
priority orders (high line count) and the number of non-high-priority
orders (low line count) over orders' facts, considering only lines
such that l_commit_date < l_receipt_date and
l_ship_date < l_commit_date. These are filters over the
dimension table.
Example 2: Q19 of TPC-H benchmark
Calculate revenue for particular parts
OLAP Cube Design
Dimension Definition and Filters’ processing (4/7)
Example 1: Q12 of TPC-H Benchmark
OLAP Cube Design
Dimension Definition and Filters’ processing (5/7)
Example 2: Q19 of TPC-H Benchmark.
OLAP Cube Design
Dimension Definition and Filters’ processing (6/7)
OLAP Cube Design
Dimension Definition and Filters’ processing (7/7)
TPC-H*d
Truly OLAP variant of TPC-H benchmark
TPC-H SQL workload translated into MDX (MultiDimensional
eXpressions)
The workload is composed of 23 MDX statements for OLAP
cubes and 23 MDX statements for OLAP business queries.
Each business question of TPC-H benchmark is mapped to an OLAP
cube
TPC-H*d
Q8: From SQL statement to OLAP cube
TPC-H*d
TPC-H*d OLAP Cube C8
Market Share for each supplier nation within a region of customers,
for each year and each part type
TPC-H*d
TPC-H*d OLAP Query Q8
Market share of RUSSIAN suppliers within the AMERICA region,
over the years 1995 and 1996, for part type ECO. ANODIZED STEEL
Open-source software implemented in Java
Parses MDB schema (.xml) files using the SAX library
Performs comparisons of OLAP cubes' characteristics
»For each pair of OLAP cubes,
»show whether they have the same fact table or not
»compute the number of shared | different | coalescable dimensions
»Dimensions are coalescable if they are extracted from the same dimension
table and their hierarchies are coalescable
»compute the number of shared | different measures
»Run merges of OLAP cubes using different similarity functions
»Simple distance function: have or not the same fact table
»K-means clustering
»Distance function is computed with weights assigned to dimensions
»Propose virtual cubes
»Auto-generate a new MDB schema (.xml)
»Create an MDB schema from a SQL workload
»On-going; tests include the TPC-DS benchmark
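The pairwise comparison step can be sketched as set operations over cube descriptors. The descriptor format and field names below are illustrative, not AutoMDB's actual data model:

```python
# Pairwise cube comparison as described above, on minimal cube descriptors
# (the descriptor format and dimension names are illustrative).
c8  = {"fact": "lineorder",
       "dims": {"Year", "PartType", "CustRegion", "SuppNation"},
       "measures": {"revenue", "mkt_share"}}
c19 = {"fact": "lineorder",
       "dims": {"PartType", "Brand", "Container"},
       "measures": {"revenue"}}

def compare(a, b):
    return {
        "same_fact": a["fact"] == b["fact"],
        "shared_dims": a["dims"] & b["dims"],
        "different_dims": a["dims"] ^ b["dims"],   # symmetric difference
        "shared_measures": a["measures"] & b["measures"],
    }

report = compare(c8, c19)
print(report["same_fact"], sorted(report["shared_dims"]))  # True ['PartType']
```

Cubes that share a fact table and many dimensions are candidates for merging into a virtual cube.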
AutoMDB
AutoMDB
Load OLAP Cubes defined in xml file
AutoMDB
Compare OLAP Cubes – same fact table or not
AutoMDB
Compare Cubes – group cubes which share the same fact table
AutoMDB
Compare Cubes – auto-generate a new MDB schema
References
Modeling Multidimensional Databases (non-exhaustive list)
M. Gyssens and L. V.S. Lakshmanan. A Foundation for Multi-Dimensional Databases.
VLDB’1997.
R. Agrawal, A. Gupta and S. Sarawagi. Modeling Multidimensional Databases.
ICDE’1997.
J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Data Cube: A Relational Aggregation
Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. ICDE’1996.
P. Vassiliadis. Modeling Multidimensional Databases, Cubes and Cube Operations.
SSDBM’1998.
L. Cabibbo and R. Torlone. A Logical Approach to Multidimensional Databases.
EDBT’1998.
D. Cheung, B. Zhou, B. Kao, H. Lu, T. Lam and H. Ting. Requirement-based data cube
schema design. CIKM’1999.
T. Niemi, J. Nummenmaa and P. Thanisch. Constructing OLAP cubes based on Queries.
DOLAP’2001.
O. Teste. Towards Conceptual Multidimensional Design in Decision Support Systems.
DEXA’2010.
A. Cuzzocrea and R. Moussa. Multidimensional Database Design via Schema
Transformation: Turning TPC-H into the TPC-H*d Multidimensional Benchmark.
COMAD’2013.
Thank you for your Attention
Q & A
Multi-Dimensional Database Modeling and Querying: Methods,
Experiences and Challenging Problems
Alfredo Cuzzocrea and Rim Moussa
17th of November, 2016
Multi-Dimensional Database Modeling and
Querying: Methods, Experiences and
Challenging Problems
Part III
Challenging Problems
Alfredo Cuzzocrea University of Trieste & ICAR
Rim Moussa University of Carthage & LaTICE
The 35th Intl. Conference on Conceptual Modeling
@ Gifu, JAPAN
17th of November, 2016
Tutorial Outline
Introduction
Part I: State-of-the-Art
Part II: Experiences
Part III: Challenges
Big Data Integration
Flexible Schema Model
Curse of dimensionality
Systems which scale-out
Intelligent Recommenders
Real-time OLAP
Advanced Visualization
Conclusion
Big Data
Volume
»Volume refers to the amount of data; hence the challenge is
integration at scale,
Velocity
»Velocity refers to the speed at which new data is generated; hence
the challenge is to integrate and analyze data while it is being generated,
Variety
»Variety refers to different types of data, e.g. structured (relational data),
semi-structured (XML, JSON, BSON), unstructured (text); hence the
challenge is the integration of different types of data,
Veracity
»Veracity refers to the messiness or trustworthiness of the data;
hence the challenge is to cope with uncertain data quality in the data
sources,
Value
»Value refers to our ability to turn data into value.
Advanced and high-performance technologies
»Load and extract different data formats, and perform
complex transformations
»Solve heterogeneity
»data type heterogeneity (a phone is stored as a number or a string),
»semantic heterogeneity (column title vs. column job title),
»value heterogeneity (Pr. vs. Prof. vs. Professor)
»entity resolution: identifying the same entity when values are misspelled,
synonymous or abbreviated, or originate from different systems
(date formats) or different domains (true is 1)
»Syncing Across Data Sources
»data copies migrated from a wide range of sources at different rates
and schedules can rapidly get out of synchronization
»Perform data integration at scale
Low Solution Cost
TPC-DI is a good start for benchmarking integration
technologies
TPC-DI implementation is on-going
Challenge #1: Big Data Integration
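A toy cleansing step for the value-heterogeneity cases listed above (titles, booleans, date formats). The synonym table, truthy set and record fields are invented for illustration:

```python
# Value-heterogeneity normalization of the kind listed above:
# Pr. / Prof. / Professor, "1" vs. true, and two date formats.
from datetime import datetime

TITLE_SYNONYMS = {"pr.": "Professor", "prof.": "Professor", "professor": "Professor"}
TRUTHY = {"1", "true", "yes", "y"}

def normalize(record):
    out = dict(record)
    # Map title synonyms/abbreviations onto one canonical value.
    out["title"] = TITLE_SYNONYMS.get(record["title"].strip().lower(), record["title"])
    # Domain heterogeneity: "1" and "true" both mean boolean true.
    out["active"] = record["active"].strip().lower() in TRUTHY
    # Accept two common date formats and emit ISO 8601.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            out["hired"] = datetime.strptime(record["hired"], fmt).date().isoformat()
            break
        except ValueError:
            pass
    return out

r = normalize({"title": "Pr.", "active": "1", "hired": "05/11/2016"})
print(r)  # {'title': 'Professor', 'active': True, 'hired': '2016-11-05'}
```

Real ETL tools apply rule sets of exactly this shape, at scale and with provenance tracking.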
Challenge #2: Data Schema Model
Schema-based data
»E.g. the relational model, the multi-dimensional model
»They define which columns appear, their names, and their datatypes.
Schemaless Data and Dynamic Schema
»Non-uniform data: custom fields and non-uniform data types
»Parse data and query on-the-fly
»NoSQL systems (e.g. Apache Pig Latin, SQL-on-Hadoop systems)
Data Lake
»All data is loaded from source systems, i.e., no data is turned away.
»Data is stored at the leaf level in an untransformed or nearly
untransformed state.
»Data is transformed and a schema is applied to fulfill the needs of analysis.
»Load the raw data as-is; give it a structure at processing time:
schema-on-read
»High agility: configuration and reconfiguration of models, queries and
applications as needed, on-the-fly
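Schema-on-read can be sketched in a few lines: raw records are stored as-is, and a (name, type, default) schema is imposed only at query time. The records and field names below are made up:

```python
# Schema-on-read, minimally: store raw records untouched, apply a structure
# only when a query reads them.
import json

raw = [  # what landed in the "lake": non-uniform records, custom fields
    '{"user": "a", "amount": 10}',
    '{"user": "b", "amount": "25", "coupon": "X1"}',
    '{"user": "a"}',
]

def read_with_schema(lines, schema):
    """Apply (name, type, default) triples while reading, not while loading."""
    for line in lines:
        rec = json.loads(line)
        yield {name: typ(rec.get(name, default)) for name, typ, default in schema}

schema = [("user", str, ""), ("amount", float, 0)]
total = sum(r["amount"] for r in read_with_schema(raw, schema))
print(total)  # 35.0
```

A different analysis can read the same raw lines with a different schema (e.g. including `coupon`) without reloading anything.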
Challenge #3: Curse of Dimensionality
Datasets in applications like bioinformatics and text processing are
characterized by:
Large column sets, a.k.a. high-dimensional data
»100 columns of data → a 100-dimension hyperspace
»A data cube of 100 dimensions, where each dimension has cardinality 10,
yields 11^100 aggregate cells (each dimension contributes its 10 values
plus the ALL level, hence (10+1)^100)
Moderate size
»One million tuples
Proposed Solutions
Minimal Cubing Approach [Li et al., 2004]
Dimension Reduction & Feature Selection
»Principal Component Analysis (PCA)
»Linear Discriminant Analysis (LDA)
»Canonical Correlation Analysis (CCA)
»Latent Semantic Indexing (LSI) for text data
»Complex and not scalable
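As a minimal sketch of the dimension-reduction idea, PCA via SVD projects high-dimensional points onto the top-k directions of largest variance. The 50-dimensional data below is synthetic and rank-2 by construction (numpy assumed available):

```python
# PCA by SVD: project n-dimensional points onto the top-k directions of
# largest variance.
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))          # the true 2-D structure
basis = rng.normal(size=(2, 50))            # embedding into 50 dimensions
X = latent @ basis + 0.01 * rng.normal(size=(200, 50))  # plus small noise

Xc = X - X.mean(axis=0)                     # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z = Xc @ Vt[:k].T                           # reduced 200 x 2 representation
explained = float((S[:k] ** 2).sum() / (S ** 2).sum())
print(Z.shape)  # (200, 2)
```

Here two components capture nearly all the variance; on a cube, one would then build dimensions over the reduced space rather than over all 50 raw columns.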
Challenge #4: Scale-out
Data Fragmentation
Parallel IO
Parallel Processing
Technologies
Parallel Cube Calculus
Distributed Relational Data Warehouses + Mid-tier for parallel
cube calculus, e.g. OLAP* framework
SQL-on-Hadoop Systems
e.g. Hive, Spark SQL, Drill, Impala, BigInsights
Scalable and Distributed Data Structures
k-RP*s, SkipTree, SkipWebs, PN-Tree, …
Recommenders for performance
tuning
»Automated selection of materialized
views and indexes for SQL
workloads
»AutoAdmin research project at
Microsoft, which explores novel
techniques to make databases
self-tuning [Agrawal et al., 2000]
»MS Database Tuning Advisor
»Offline techniques
»DBA is completely out of the
picture
Challenge #5: Intelligent Recommenders
Indexes and Materialized Views are physical structures that can
significantly accelerate performance.
Challenge #5: Intelligent Recommenders
Alerter Approach [Hose et al., 2008]
»supports the aggregate configuration of an OLAP server by (1)
continuously monitoring information about the workload and the
benefit of aggregation tables and (2) alerting the DBA if changes to the
current configuration would be beneficial
Semi-Automatic Index Tuning: keeping DBAs in the loop
[Schnaitter and Polyzotis, 2012]
»Online workload analysis, with decisions delegated to the DBA
»The solution takes index interactions into account
»Index interactions: two indices a and b interact if the benefit of a
depends on the presence of b.
Challenge #6: Real-time Processing
Retrospective Analysis
Traditional Architectures
»DWSs are deployed as part of an OLAP system, separated from the OLTP system,
»Data propagates down to the OLAP system, but typically after some lag,
»This is sufficient for retrospective analytics, but does not suffice in situations
that require real-time analytics.
Real-time analytics use cases
»Intelligent road-traffic management
»Remote health-care monitoring
»Complex event processing systems
New Data Processing Architectures
Architecture for Big Data processing at Scale
»Lambda Architecture by Nathan Marz
»Batch processing system + Speed processing system
»Kappa Architecture
»No batch processing systems
»New software for architecting DW systems!
Challenge #6: Real-time Processing
DSS Real-time Systems Characteristics
Data stream systems
»Streams are pushed at the system
»Long-running (continuous) queries
»Stream characteristics are often unknown and time-varying
»On-line profiling and adaptation are necessary
Continuously arriving data streams
»Up to gigabits per second in network monitoring
»We would like to run continuous OLAP/mining queries
The working set of typical systems might fit in memory;
disk is mostly for archiving purposes
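A continuous query in miniature: a sliding-window count maintained online as each event arrives, rather than computed over stored data. The window width and event timestamps are arbitrary:

```python
# Sliding-window count over an unbounded stream, updated per event.
from collections import deque

class SlidingWindowCount:
    """Counts events seen in the last `width` time units, maintained online."""
    def __init__(self, width):
        self.width = width
        self.events = deque()

    def observe(self, t):
        self.events.append(t)
        # Evict events that fell out of the window.
        while self.events and self.events[0] <= t - self.width:
            self.events.popleft()
        return len(self.events)

win = SlidingWindowCount(width=10)
counts = [win.observe(t) for t in [1, 3, 5, 12, 14, 30]]
print(counts)  # [1, 2, 3, 3, 3, 1]
```

Stream engines generalize this pattern to windowed joins and aggregates, with the working set kept in memory as noted above.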
Challenge #6: Real-time Processing
Lambda Architecture: Big Picture (1/4)
Challenge #6: Real-time Processing
Lambda Architecture: Batch Layer (2/4)
What does it do?
»Focuses on the ingest and storage of large quantities of data and the
calculation of views from that data;
»Stores an immutable, append-only, constantly expanding master copy
of the system’s data;
»Computes views, i.e. derivative data for consumption by the serving
layer.
Technologies
»Batch System: Apache Hadoop/MapReduce, Apache Pig latin, Apache
Spark
»Batch view databases: ElephantDB, SploutSQL
Challenge #6: Real-time Processing
Lambda Architecture: Speed Layer (3/4)
What does it do?
»Processes raw stream data into views and deploys those views on the
Serving Layer.
»Stream processing
»Continuous computation
Technologies
»Apache Storm, Apache Spark Streaming,
»Speed Layer views
»The views need to be stored in a random writable database and are
Read & Write.
Challenge #6: Real-time Processing
Lambda Architecture: Service Layer (4/4)
What does it do?
»Focus on serving up views of the data as quickly as possible.
»Queries the batch & real-time views and merges them,
»Should meet requirements of scalability and fault-tolerance.
Technologies
»Read-only: ElephantDB, Druid
»Read & write: Riak, Cassandra, Redis
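The serving-layer merge can be sketched as a sum of the (complete but stale) batch view and the (fresh, incremental) speed view, keyed on the same dimension. The view contents below are invented:

```python
# Serving-layer merge: a query answer combines the batch view with the
# real-time increments accumulated since the last batch run.
batch_view = {"2016-11-16": 1200, "2016-11-17": 800}   # recomputed periodically
speed_view = {"2016-11-17": 45, "2016-11-18": 12}      # real-time increments

def serve(views):
    merged = {}
    for view in views:
        for key, value in view.items():
            merged[key] = merged.get(key, 0) + value
    return merged

answer = serve([batch_view, speed_view])
print(answer)  # {'2016-11-16': 1200, '2016-11-17': 845, '2016-11-18': 12}
```

Summation works here because the measure is additive; non-additive measures need merge functions chosen per metric.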
Challenge #7: Visualization
Introduction
Part I: Methods & State-of-the-Art
Part II: Experiences
Part III: Challenging Problems
Conclusion
Goal: Visualize multidimensional data sets by capturing the
dimensionality of data
Data cube
Is a representation of a multidimensional data set
OLAP client technologies
»Pivot table or cross-tab
»OLAP ops: drill-down, roll-up, pivot, slice, dice
»Bar charts, pie charts, and time series
»e.g. jPivot, Saiku
Challenges for sophisticated visualization techniques
»Interactive
»Reflect current data: updateable
»Support large datasets
»Ergonomic
Challenge #7: Visualization
Visualization Techniques
Tree-map: displays hierarchical
(tree-structured) data as a set
of nested rectangles
Scatter plot: uses Cartesian
coordinates to display values
for typically two variables for a
set of data
Challenge #7: Visualization
Visualization Techniques
Parallel coordinate plots:
visualize high-dimensional
geometry and analyze
multivariate data
Choropleth maps: thematic
maps in which areas are
shaded or patterned in
proportion to the
measurement of the statistical
variable being displayed on
the map
Challenge #7: Visualization
Multidimensional Networks (1/2)
Networks’ application domains: social networks (LinkedIn, Facebook, …), the
web, communication networks, …
A multidimensional network has
a graph structure and multidimensional attributes
Example: Graph Cube [Zhao et al. 2011]
(in the example graph, nodes 1 and 2 are friends)
Challenge #7: Visualization
Graph Cube: example queries (2/2)
Single multidimensional space (cuboid queries)
Q1: What’s the network structure between
different genders?
Q2: What’s the network structure between
the various gender and location
combinations?
Multiple multidimensional spaces (crossboid queries)
What is the network structure between the
user with ID = 3 and various locations?
What is the network structure between users
grouped by gender vs. users grouped by
location?
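Q1-style aggregation (collapsing the network onto the gender dimension) can be sketched as follows. The toy graph is invented, and this is a conceptual sketch, not Graph Cube's actual implementation:

```python
# Graph-cube style aggregation: collapse a user network onto one attribute
# dimension, counting edges between the resulting groups.
nodes = {1: "M", 2: "F", 3: "F", 4: "M"}          # user id -> gender
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]          # friendships

def aggregate(edges, dim):
    agg = {}
    for u, v in edges:
        key = tuple(sorted((dim[u], dim[v])))     # undirected group pair
        agg[key] = agg.get(key, 0) + 1
    return agg

summary = aggregate(edges, nodes)
print(summary)  # {('F', 'M'): 3, ('F', 'F'): 1}
```

Q2 is the same operation with a composite key (gender, location); crossboid queries mix an aggregated side with an individual node.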
Challenge #7: Visualization
Nanocubes by AT&T (1/3)
Real-time exploratory visualization of large spatiotemporal and
multidimensional datasets
Real-time: extremely fast queries
Spatial: the spatial region can be either a rectangle covering most of
the world or a heatmap of activity
Multidimensional: besides latitude, longitude, and time, there are
other attributes such as tweet device (Android or iPhone) and tweet
language (eng, fr, it, sp, ru, …)
Nanocube [Lins et al., 2013]
Algorithm outline
for every object oi:
»find the finest address of the schema S hit by this object,
»update the time series associated with this address,
»update, in a deepest-first fashion, all coarser addresses hit by oi
Cons: memory usage
when indexing all six dimensions (latitude, longitude, time, language, device,
application), the 210 million points from Twitter take around 45 GB of memory
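The update rule above can be sketched with a prefix-based index: each inserted point increments the count at its finest spatial address and at every coarser prefix, per time bucket. The quadtree-like addresses and buckets here are toy values, not the actual nanocube data structure:

```python
# Miniature nanocube-style update: maintain counts per (address, time bucket),
# where every coarser prefix of an address is updated as well.
from collections import defaultdict

index = defaultdict(lambda: defaultdict(int))  # address -> time bucket -> count

def insert(point_address, t):
    """point_address is a quadtree-like path, e.g. ('q0', 'q2')."""
    for depth in range(len(point_address), -1, -1):  # finest first, then coarser
        index[point_address[:depth]][t] += 1

insert(("q0", "q2"), t=1)
insert(("q0", "q3"), t=1)
insert(("q0", "q2"), t=2)

# The root address () sees every point; ('q0','q2') only its own.
print(dict(index[()]), dict(index[("q0", "q2")]))  # {1: 2, 2: 1} {1: 1, 2: 1}
```

Materializing all coarser addresses is what makes multi-resolution queries fast, and also what drives the memory cost noted above.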
Challenge #7: Visualization
Nanocubes by AT&T (2/3)
Which device is more popular for tweeting?
Is one device more popular in certain areas than in others?
How has this popularity changed over time?
lspatial1 is coarser than lspatial2
Challenge #7: Visualization
Nanocubes by AT&T (3/3)
Intermediate nanocubes generated after each tweet is inserted
Conclusion
References
M. Fowler. Schemaless Data Structures. 2013. http://martinfowler.com/articles/schemaless/
N. Marz and J. Warren. Big Data: Principles and Best Practices of Scalable Realtime Data
Systems, 1st Edition.
S. Agrawal, S. Chaudhuri and V. Narasayya. Automated Selection of Materialized Views and
Indexes for SQL Databases. VLDB’2000.
http://www.research.microsoft.com/dmx/AutoAdmin
K. Hose, D. Klan, M. Marx and K. Sattler. When is it Time to Rethink the Aggregate
Configuration of Your OLAP Server?. VLDB’2008.
K. Schnaitter and N. Polyzotis. Semi-Automatic Index Tuning: Keeping DBAs in the
Loop. VLDB’2012.
P. Zhao, X. Li, D. Xin and J. Han. Graph Cube: On Warehousing and OLAP
Multidimensional Networks. SIGMOD’2011.
L. D. Lins, J. T. Klosowski and C. E. Scheidegger. Nanocubes for Real-Time Exploration
of Spatiotemporal Datasets. IEEE Trans. Vis. Comput. Graph. 2013.
https://github.com/laurolins/nanocube
Thank you for your Attention
Q & A
Multi-Dimensional Database Modeling and Querying: Methods,
Experiences and Challenging Problems
Alfredo Cuzzocrea and Rim Moussa
17th of November, 2016
More Related Content

PDF
exercices business intelligence
PDF
PDF
Resume de BI
PDF
HDFS Architecture
PDF
Reporting avec JasperServer & iReport
PPTX
Règles d’association
PPTX
La reconnaissance gestuelle
PPTX
Chp2 - Cahier des Charges
exercices business intelligence
Resume de BI
HDFS Architecture
Reporting avec JasperServer & iReport
Règles d’association
La reconnaissance gestuelle
Chp2 - Cahier des Charges

What's hot (20)

PDF
EDM Creating Formulas for Formula Profile & RTP Interface
PPT
Projet BI - 2 - Conception base de données
PDF
Sap sd quest_answer_2009061511245119496
PDF
Tp1 - OpenERP (1)
PDF
Cours Big Data Chap5
PPTX
Les Base de Données NOSQL -Presentation -
PPT
Projet BI - 1 - Analyse des besoins
PPT
Projet Bi - 3 - Alimentation des données
PPTX
Chp2 - Les Entrepôts de Données
PPTX
SGBDR vs NoSQL, Différences et Uses Cases. Focus sur ArangoDB
PDF
Business Intelligence : Transformer les données en information.
PDF
Cours datamining
PDF
Business Intelligence
PDF
Partie3BI-DW-OLAP2019
PDF
Chapitre 1 les entrepôts de données
PDF
TD4-UML-Correction
PPT
Cours data warehouse
PDF
Partie2BI-DW2019
PDF
Lsmw step by- step
EDM Creating Formulas for Formula Profile & RTP Interface
Projet BI - 2 - Conception base de données
Sap sd quest_answer_2009061511245119496
Tp1 - OpenERP (1)
Cours Big Data Chap5
Les Base de Données NOSQL -Presentation -
Projet BI - 1 - Analyse des besoins
Projet Bi - 3 - Alimentation des données
Chp2 - Les Entrepôts de Données
SGBDR vs NoSQL, Différences et Uses Cases. Focus sur ArangoDB
Business Intelligence : Transformer les données en information.
Cours datamining
Business Intelligence
Partie3BI-DW-OLAP2019
Chapitre 1 les entrepôts de données
TD4-UML-Correction
Cours data warehouse
Partie2BI-DW2019
Lsmw step by- step
Ad

Similar to ER 2016 Tutorial (20)

PDF
Meetup070416 Presentations
PPT
eScience: A Transformed Scientific Method
PDF
Apache CarbonData+Spark to realize data convergence and Unified high performa...
PPTX
Data lake-itweekend-sharif university-vahid amiry
PDF
Minimizing the Complexities of Machine Learning with Data Virtualization
PPT
Analysis technologies - day3 slides Lecture notesppt
PPTX
Flink Meetup Septmeber 2017 2018
PPTX
Big Data Session 1.pptx
PDF
Unlock Your Data for ML & AI using Data Virtualization
PPTX
Big Process for Big Data @ PNNL, May 2013
PPTX
Lecture1
PPT
Big data.ppt
PDF
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
PDF
INF2190_W1_2016_public
PDF
Data Infrastructure for a World of Music
PDF
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
PDF
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
PPTX
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
PDF
Introduction to Data streaming - 05/12/2014
PPTX
Big Data Analytics PPT - S1 working .pptx
Meetup070416 Presentations
eScience: A Transformed Scientific Method
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Data lake-itweekend-sharif university-vahid amiry
Minimizing the Complexities of Machine Learning with Data Virtualization
Analysis technologies - day3 slides Lecture notesppt
Flink Meetup Septmeber 2017 2018
Big Data Session 1.pptx
Unlock Your Data for ML & AI using Data Virtualization
Big Process for Big Data @ PNNL, May 2013
Lecture1
Big data.ppt
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
INF2190_W1_2016_public
Data Infrastructure for a World of Music
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
Introduction to Data streaming - 05/12/2014
Big Data Analytics PPT - S1 working .pptx
Ad

More from Rim Moussa (19)

PDF
data pipelines complexity human expertise and LLM era
PDF
customized eager lazy data cleansing for satisfactory big data veracity
PDF
doc oriented stores for mailing lists using elastic stack
PDF
scalable air quality analytics with apache spark and apache sedona
PDF
polystore_NYC_inrae_sysinfo2021-1.pdf
PDF
Big Data Projects
PDF
ISNCC 2017
PDF
EMR AWS Demo
PDF
BICOD-2017
PDF
Asd 2015
PDF
Ismis2014 dbaas expert
PDF
Parallel Sequence Generator
PDF
Hadoop ensma poitiers
PDF
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
PDF
Automation of MultiDimensional DB Design (poster)
PDF
TPC-H analytics' scenarios and performances on Hadoop data clouds
PDF
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
PDF
highly available distributed databases (poster)
PDF
parallel OLAP
data pipelines complexity human expertise and LLM era
customized eager lazy data cleansing for satisfactory big data veracity
doc oriented stores for mailing lists using elastic stack
scalable air quality analytics with apache spark and apache sedona
polystore_NYC_inrae_sysinfo2021-1.pdf
Big Data Projects
ISNCC 2017
EMR AWS Demo
BICOD-2017
Asd 2015
Ismis2014 dbaas expert
Parallel Sequence Generator
Hadoop ensma poitiers
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
Automation of MultiDimensional DB Design (poster)
TPC-H analytics' scenarios and performances on Hadoop data clouds
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
highly available distributed databases (poster)
parallel OLAP

Recently uploaded (20)

PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
Week 4 Term 3 Study Techniques revisited.pptx
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
Pre independence Education in Inndia.pdf
PDF
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
PPTX
Cell Structure & Organelles in detailed.
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PPTX
Institutional Correction lecture only . . .
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PPTX
master seminar digital applications in india
PPTX
PPH.pptx obstetrics and gynecology in nursing
PDF
01-Introduction-to-Information-Management.pdf
PPTX
Pharma ospi slides which help in ospi learning
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
Microbial diseases, their pathogenesis and prophylaxis
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Week 4 Term 3 Study Techniques revisited.pptx
Abdominal Access Techniques with Prof. Dr. R K Mishra
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Pre independence Education in Inndia.pdf
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
Cell Structure & Organelles in detailed.
O7-L3 Supply Chain Operations - ICLT Program
FourierSeries-QuestionsWithAnswers(Part-A).pdf
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Institutional Correction lecture only . . .
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
O5-L3 Freight Transport Ops (International) V1.pdf
master seminar digital applications in india
PPH.pptx obstetrics and gynecology in nursing
01-Introduction-to-Information-Management.pdf
Pharma ospi slides which help in ospi learning
VCE English Exam - Section C Student Revision Booklet
Anesthesia in Laparoscopic Surgery in India
Microbial diseases, their pathogenesis and prophylaxis

ER 2016 Tutorial

  • 1. 17th Nov. 2016 ER'2016@Gifu.Japan 1 Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems Alfredo Cuzzocrea University of Trieste & ICAR Rim Moussa LaTICE Lab. & University of Carthage The 35th Intl. Conference on Conceptual Modeling @ Gifu, JAPAN 17th of November, 2016
  • 2. 17th Nov. 2016 ER'2016@Gifu.Japan 2 Tutorial Outline Part I: Data Warehouse Systems Decision Support Systems DWS Architectures DW Schemas OLAP cube and OLAP operations DSS Benchmarks OLAP Mandate Part II: Multi-dimensional design Part III: Challenging Problems Conclusion
  • 3. 17th Nov. 2016 ER'2016@Gifu.Japan 3 Translating Data into Insights & Opportunities! Source:
  • 4. 17th Nov. 2016 ER'2016@Gifu.Japan 4 From DIKW Pyramid To Decision Support System SourceSource Source »Descriptions of things, events, activities and transactions »internal or external »Consolidated and organized data that has meaning, value and purpose »Know-thats »Processed data that conveys understanding, experience or learning applicable to a problem|activity »Ability to increase effectiveness »What to do, act … Wisdom Knowledge Information Data
  • 5. 17th Nov. 2016 ER'2016@Gifu.Japan 5 Decision Support System Decision Support System »A collection of integrated software applications and hardware that form the backbone of an organization’s decision making process. »Performs different types of analyses »What-if analysis »Simulation analysis »Goal-seeking analysis Business Intelligence »BI encompasses a variety of tools, applications and methodologies that enable organizations to collect data from internal systems and external sources, prepare it for analysis, develop and run queries against the data, and create reports, dashboards and data visualizations to make the analytical results available to corporate decision makers as well as operational workers. »Integrations (ETL) tools, data warehousing tools, OLAP technologies, OLAP clients, mining structures, reporting tools, dashboards’ design tools »Integration workflows design methods, Multi-dimensional design methods, …
  • 6. 17th Nov. 2016 ER'2016@Gifu.Japan 6 Decision Support System OLTP vs OLAP (1/3) OLTP: On-Line Transaction Processing »Users/ applications interacting with database in real-time »e.g.: on-line banking, on-line payment »Required properties of memory access ACID properties: Atomicity, Consistency, Isolation, Durability Coherence »Benchmarks: TPC-C, TPC-E OLAP: On-Line Analytical Processing »Experts doing offline data analysis »e.g.: data warehouses, decision support systems »Memory patterns: Indexed and Sequential access patterns »Benchmarks: TPC-H, TPC-DS
  • 7. 17th Nov. 2016 ER'2016@Gifu.Japan 7 Decision Support System OLTP vs OLAP (2/3) OLTP: On-Line Transaction Processing »Users/ applications interacting with database in real-time »e.g.: on-line banking, on-line payment »Required properties of memory access ACID properties: Atomicity, Consistency, Isolation, Durability Coherence »Benchmarks: TPC-C, TPC-E OLAP: On-Line Analytical Processing »Experts doing offline data analysis »e.g.: data warehouses, decision support systems »Memory patterns: Indexed and Sequential access patterns »Benchmarks: TPC-H, TPC-DS
  • 8. 17th Nov. 2016 ER'2016@Gifu.Japan 8 Decision Support System OLTP vs OLAP (3/3)  Users/apps interacting with DB in real-time  E.g., customers buying/selling books at Amazon  Many concurrent users, queries, connections  DB is 3rd NF  Simple queries, often predetermined  Relatively little data (~GB)  Each query touches little data  Little computation per query  Simple computation  Data is continually updated  Accuracy and recovery important, hence strict transactions  Throughput is most important  Business Experts doing offline data Analysis  E.g., bestseller at Amazon  Few concurrent users, queries, connections  DB is denormalized  Complex ad-hoc queries  Very large data sets (~TB)  Queries touch large data sets  Mining is compute intensive  Complex operations in mining  Data is mostly read-only  Strict transactional  semantics is not needed  Latency is more important OLTP OLAP
  • 9. 17th Nov. 2016 ER'2016@Gifu.Japan 9 Data Integration Data Integration is the process of integrating data from multiple sources and probably have a single view over all these sources Integration can be physical (copy the data to warehouse) or virtual (keep the data at the sources) Data Acquisition: This is the process of moving company data from the source systems into the warehouse. Data acquisition is an ongoing, scheduled process, which is executed to keep the warehouse current to a pre-determined period in time. Changed Data Capture: The periodic update of the warehouse from the transactional system(s) is complicated by the difficulty of identifying which records in the source have changed since the last update. Data Cleansing: This is typically performed in conjunction with data acquisition. It is a complicated process that validates and, if necessary, corrects the data before it is inserted into the warehouse
  • 10. 17th Nov. 2016 ER'2016@Gifu.Japan 10 Data Integration Software solutions Software Solutions »Known as ETL tools, Integration services, ... »Vendors (non-exhaustive list): IBM (InfoSphere Data Event Publisher), Informatica (Informatica Enterprise Data Integration), Information Builders (iWay DataMigrator), Microsoft (Microsoft's SQL Server Integration Services), Oracle (Oracle GoldenGate), Pentaho (Pentaho DI - prev. Ketlle), SAP (BusinessObjects Data Integrator), SAS (SAS DataFlux) and Talend (Talend Open Studio), Kettle (Pentaho) and Talend offer open-source versions, MicroSoft bundels integration services with database products
  • 11. 17th Nov. 2016 ER'2016@Gifu.Japan 11 Integrating Heterogeneous data sources: Lazy or Eager? Lazy Integration aka Query-driven approach »Accept a query, determine the sources which answer the query, devise the execution tree: generate appropriate sub-queries for each source, obtain resultsets, perform required post-processing: translation, merging and filtering and return answer to application Eager Integration aka Warehouse approach »Information of each source of interest is extracted in-advance, translated, filtered and processed as appropriate, then merged with information from other sources and stored »Accept a query, answer query from repository
  • 12. 17th Nov. 2016 ER'2016@Gifu.Japan 12 Data Integration Eager vs. Lazy ?  Integrate in-advance i.e. before queries  Copy data from sources  Query answer-set could be stale  Need to refresh the data warehouse  Operates when sources are unavailable  High query performance: through building data summaries and local processing at sources unaffected  OLAP queries are not visible outside the warehouse  Integrate on-demand i.e. at query time  Leave data at sources  Query answer-set is up-to-date  No copy so no refresh and storage cost  Out of service if sources unavailable  Sources are drained: interference with local processing Eager Lazy
  • 13. 17th Nov. 2016 ER'2016@Gifu.Japan 13 Lazy Data Integration  Query-driven Architecture Source WRAPPER WRAPPER MEDIATOR Source WRAPPER
  • 14. 17th Nov. 2016 ER'2016@Gifu.Japan 14 Eager Data Integration  Warehouse System Architecture
  • 15. 17th Nov. 2016 ER'2016@Gifu.Japan 15 Data Warehouse Definition By Bill Inmon "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process" »Subject-oriented: The data in the database is organized so that all the data elements relating to the same real-world event or object are linked together; e.g. finance, human resources, sales … »Time-variant: The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time. Hence, all addresses of a customer are in the DW, »Non-volatile: Data in the database is never over-written or deleted - once committed, the data is static, read-only, but retained for future reporting; »Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. There is only a single way to identify a product By Ralph Kimbal “A copy of transaction data specifically structured for query and analysis” Inmon top-down approach vs. Kimbal bottom-up approach
  • 16. 17th Nov. 2016 ER'2016@Gifu.Japan Constellation Schema »Galaxy schema »Multiple fact tables which share dimensions Snowflake Schema »Hierarchical relationships exist among Dimensions Tables »Normalized schema Star Schema »A single, large and central table Fact table surrounded by multiple Dimension tables »All Dimensions Tables have a hierarchical relationship with the Fact Table 16 Data Warehouse Schemas Fact Table Dim Table 1 Dim Table 3 Dim Table 4 Dim Table 2 Fact Table Dim Table 5 Dim Table 4 Dim Table 2 Dim Table 6 Dim Table 1 Dim Table 3 Dim Table 7 Dim Table 4 Fact Table 2 Fact Table 1 Dim Table 5 Dim Table 2 Dim Table 6 Dim Table 1 Dim Table 3
  • 17. 17th Nov. 2016 ER'2016@Gifu.Japan 17 Multi-dimensional Data Analysis Multi-dimensional Data Analysis »refers to the process of summarizing data across multiple levels (called dimensions) and then presenting the results in a multi-dimensional grid format. OLAP Cube »Multi-dimensional representation of data »aka Crosstab, Data Pivot
  • 18. 17th Nov. 2016 ER'2016@Gifu.Japan Facts: are the objects that represent the subject of the desired analyses. »Examples: sales records, weather records, cab trips, … »The fact table contains three types of attributes: measure attributes, foreign keys to dimension tables, and degenerate dimensions Dimension(s): »Levels are individual values that make up dimensions »Examples »Date dimension (quarter, month, day) »Time dimension (hour, min, sec) »Geography dimension (Country, city, postal code) Measure(s): »Examples: revenue, lost revenue, sold quantities, expenses, forecasts, … »Use aggregate functions: min, max, count, distinct-count, sum, average, … 18 OLAP Cube
  • 19. 17th Nov. 2016 ER'2016@Gifu.Japan 19 OLAP Operations On-Line Analytical Processing
  • 20. 17th Nov. 2016 ER'2016@Gifu.Japan 20 OLAP Operations Drill-down stepping down to lower level data or introducing new dimensions Roll-up summarizes data by climbing up hierarchy or by dimension reduction Pivot rotate, reorienting the cube Slice selecting data on one dimension Dice selecting data on multiple dimensions
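These operations can be sketched over a toy fact table in plain Python (an illustrative sketch, not tied to any OLAP engine; the facts, column order and `roll_up` helper are invented for the example):

```python
from collections import defaultdict

# Toy fact table: (year, month, country, product, amount)
facts = [
    (2015, 1, "JP", "tea", 100), (2015, 2, "JP", "coffee", 80),
    (2015, 1, "FR", "tea", 60),  (2016, 3, "FR", "coffee", 90),
    (2016, 4, "JP", "tea", 70),
]

def roll_up(rows, dims):
    """Aggregate SUM(amount) over the given dimension positions (dimension reduction)."""
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[d] for d in dims)] += row[4]
    return dict(out)

# Roll-up: drop the month level, keep only (year, country)
by_year_country = roll_up(facts, dims=(0, 2))

# Slice: fix one dimension (year = 2015), then aggregate by country
by_country_2015 = roll_up([r for r in facts if r[0] == 2015], dims=(2,))

# Dice: restrict several dimensions (here: country = "JP"), then aggregate by year
by_year_jp = roll_up([r for r in facts if r[2] == "JP"], dims=(0,))

print(by_year_country)  # {(2015, 'JP'): 180, (2015, 'FR'): 60, (2016, 'FR'): 90, (2016, 'JP'): 70}
```

Drill-down is the inverse direction: starting from `by_year_country`, re-aggregating with `dims=(0, 1, 2)` would re-introduce the month level.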
  • 21. 17th Nov. 2016 ER'2016@Gifu.Japan 21 OLAP Operations Roll-up & Drill-down
  • 22. 17th Nov. 2016 ER'2016@Gifu.Japan 22 OLAP Operations Slice & Dice
  • 23. 17th Nov. 2016 ER'2016@Gifu.Japan 23 Derived Data OLAP Indices »Inverted lists »Bitmap Indices »Join indices »Text indices Materialization of a data cube is a way to pre-compute and store multi-dimensional aggregates so that multi-dimensional analysis can be performed on-the-fly [Li et al., 2004] Calculated Attributes Aggregate Tables (aka materialized tables) Data synopsis: histograms and sketches (Trade-off triangle: improved query performance vs. maintenance & refresh vs. storage requirements.)
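The maintenance & refresh trade-off can be illustrated with a minimal sketch: an aggregate table built once from the base facts and then refreshed incrementally on insert, instead of rescanning the base table at every query (the `AggregateTable` class and sample rows are hypothetical):

```python
from collections import defaultdict

class AggregateTable:
    """Materialized SUM aggregate over one grouping key, refreshed incrementally."""
    def __init__(self, rows, key):
        self.key = key
        self.sums = defaultdict(float)
        for row in rows:                     # one full scan at build time
            self.sums[row[key]] += row["amount"]

    def insert(self, row):
        # Incremental refresh: update the aggregate instead of rescanning the base
        self.sums[row[self.key]] += row["amount"]

    def query(self, value):
        # Answered from the pre-computed aggregate, not the base table
        return self.sums.get(value, 0.0)

base = [{"year": 2015, "amount": 10.0}, {"year": 2015, "amount": 5.0},
        {"year": 2016, "amount": 7.0}]
agg = AggregateTable(base, key="year")
agg.insert({"year": 2016, "amount": 3.0})
print(agg.query(2015), agg.query(2016))  # 15.0 10.0
```

The price paid is the extra storage for `sums` and the obligation to keep it in sync with every base-table update — exactly the triangle of query performance, refresh cost and storage listed above.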
  • 24. 17th Nov. 2016 ER'2016@Gifu.Japan 24 Cuboids Lattice What to materialize? (Lattice over d1:Time, d2:Item, d3:Customer, d4:Supplier — from the 0-D apex cuboid (All), through the 1-D cuboids (d1 … d4), 2-D cuboids (d1,d2 … d3,d4) and 3-D cuboids (d1,d2,d3 … d2,d3,d4), down to the 4-D base cuboid (d1,d2,d3,d4).) For an n-dimensional cube »2^n cuboids
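The lattice can be enumerated mechanically: the cuboids are exactly the subsets of the dimension set, so an n-dimensional cube yields 2^n of them. A small sketch (the `cuboids` helper is illustrative):

```python
from itertools import combinations

def cuboids(dims):
    """Enumerate all 2^n cuboids of the lattice, grouped by level (0-D apex .. n-D base)."""
    n = len(dims)
    return {k: [frozenset(c) for c in combinations(dims, k)] for k in range(n + 1)}

lattice = cuboids(["Time", "Item", "Customer", "Supplier"])
total = sum(len(level) for level in lattice.values())
print(total)  # 16 == 2**4
```

Choosing which of these 16 cuboids to materialize (beyond the mandatory base cuboid) is the view-selection problem hinted at by "What to materialize?".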
  • 25. 17th Nov. 2016 ER'2016@Gifu.Japan ROLAP Server »Relational OLAP »Uses relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware »Includes optimization of DBMS back-end, implementation of aggregation navigation logic, and additional tools and services »High scalability »E.g. Mondrian (Pentaho BI suite) MOLAP Server »Multidimensional OLAP »Sparse array-based multidimensional storage engine »Fast indexing to pre-computed summarized data »E.g. Palo HOLAP Server »Low-level: relational / high-level: array »High flexibility »E.g. MSAS (Microsoft) 25 OLAP Servers
  • 26. 17th Nov. 2016 ER'2016@Gifu.Japan Structured Query Language (SQL) »Relational and static schema »Data Definition, Data Manipulation and Data Control Languages »Analytic Functions (window functions over partition by …) »Cube, roll-up and grouping sets operators MultiDimensional eXpressions (MDX) »Invented by Microsoft in 1997 »For querying and manipulating the multidimensional data stored in OLAP cubes »Static schema Data Flow programming languages »Google Sawzall, Apache Pig Latin, IBM InfoSphere Streams »Dynamic schema »After data is loaded, multiple operators are applied on that data before the final output is stored. 26 Query Languages (Pipeline: Load Data → Apply Schema → Apply Filter → Group Data → Apply Aggregate Function → Sort Data → Store Output)
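The data-flow style (load → schema → filter → group → aggregate → sort → store) can be mimicked in a few lines of plain Python, as a rough stand-in for what a Pig Latin script expresses; the data and threshold are invented for illustration:

```python
from collections import defaultdict

raw = "a,3\nb,5\na,2\nc,4\nb,1\n"

# Load + apply schema: parse each line into a (key, int value) tuple
rows = [(k, int(v)) for k, v in (line.split(",") for line in raw.strip().splitlines())]

# Apply filter: keep only values >= 2
rows = [(k, v) for k, v in rows if v >= 2]

# Group + apply aggregate function: SUM per key
sums = defaultdict(int)
for k, v in rows:
    sums[k] += v

# Sort + store output: descending by aggregated value
output = sorted(sums.items(), key=lambda kv: -kv[1])
print(output)  # [('a', 5), ('b', 5), ('c', 4)]
```

Each step corresponds to one operator in the pipeline; the schema is applied on-the-fly at parse time rather than declared up front, which is the "dynamic schema" point above.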
  • 27. 17th Nov. 2016 ER'2016@Gifu.Japan 27 Query Languages SQL – Q16 of TPC-H benchmark
  • 28. 17th Nov. 2016 ER'2016@Gifu.Japan 28 Query Languages MDX – Q16 of TPC-H Benchmark
WITH
SET [Brands] AS 'Except({[Part Brand].Members}, {[Part Brand].[Brand#45]})'
SET [Types] AS 'Filter({[Part Type].Members}, (NOT ([Part Type].CurrentMember.Name MATCHES "(?i)MEDIUM POLISHED.*")))'
SET [Sizes] AS 'Filter({[Part Size].Members}, ([Part Size].CurrentMember IN {[Part Size].[3], [Part Size].[9], [Part Size].[14], [Part Size].[19], [Part Size].[23], [Part Size].[36], [Part Size].[45], [Part Size].[49]}))'
SELECT [Measures].[Supplier Count] ON COLUMNS,
NonEmptyCrossJoin(NonEmptyCrossJoin([Brands], [Types]), [Sizes]) ON ROWS
FROM [Cube16]
  • 29. 17th Nov. 2016 ER'2016@Gifu.Japan 29 Query Languages Data Flow –Pig Latin script for Q16 of TPC-H benchmark
  • 30. 17th Nov. 2016 ER'2016@Gifu.Japan 30 Decision Support Systems Benchmarks Non-TPC Benchmarks Real datasets »Open data or proprietary data »Fixed size »Devise a workload or trace the proprietary workload APB-1: no scale factor TPC Benchmarks The Transaction Processing Performance Council (TPC) was founded in 1988 to define benchmarks In 2009, TPCTC was set up as an international conference series on performance evaluation and benchmarking Examples of benchmarks relevant for benchmarking decision support systems: TPC-H, TPC-DS and TPC-DI Common characteristics of TPC benchmarks »Synthetic data »A scale factor allowing generation of different volumes, from 1GB to 1PB
  • 31. 17th Nov. 2016 ER'2016@Gifu.Japan 31 Decision Support Systems Benchmarks TPC-H Benchmark Schema (1/2) TPC-H Benchmark: complex schema 22 ad-hoc SQL statements (star queries, nested queries, …) + refresh functions
  • 32. 17th Nov. 2016 ER'2016@Gifu.Japan 32 Decision Support Systems Benchmarks TPC-H Benchmark (2/2) TPC-H Benchmark Metrics 2 Metrics »QphH@Size is the number of queries processed per hour that the system under test can handle for a fixed load »$/QphH@Size represents the ratio of cost to performance, where the cost is the cost of ownership of the SUT (hardware, software, maintenance). Variants of TPC-H Benchmarks TPC-H*d Benchmark – detailed in Part II »Turning the TPC-H benchmark into a multi-dimensional benchmark »Few schema changes »No update to the workload requirement »MDX workload for OLAP cubes and OLAP queries SSB: Star Schema Benchmark »Turning the TPC-H benchmark into a star schema »Workload composed of 12 queries TPC-H translated into Pig Latin (Apache Hadoop ecosystem) »22 Pig Latin scripts which load and process TPC-H raw data files (.tbl files)
  • 33. 17th Nov. 2016 ER'2016@Gifu.Japan 33 Decision Support Systems Benchmarks TPC-DS Benchmark (1/2) TPC-DS Benchmark: 7 data marts
  • 34. 17th Nov. 2016 ER'2016@Gifu.Japan 34 Decision Support Systems Benchmarks TPC-DS Benchmark (2/2) TPC-DS Benchmark Workload Hundreds of queries (99 query templates) OLAP, windowing functions, mining, and reporting queries Concurrent data maintenance TPC-DS Benchmark Metrics 3 Metrics »QphDS@Size is the number of queries processed per hour that the system under test can handle for a fixed load; data maintenance and load time enter into its calculation »$/QphDS@Size represents the ratio of cost to performance, where the cost is a 3-year cost of ownership of the SUT (hardware, software, maintenance) »System Availability Date: the date when the system is available to customers. TPC-DS implementations TPC-DS v2.0 »Extension for non-relational systems such as Hadoop/Spark big data systems
  • 35. 17th Nov. 2016 ER'2016@Gifu.Japan 35 Decision Support Systems Benchmarks TPC-DI Benchmark (1/3) For benchmarking Data Integration technologies Synthetic Data of a Fictitious Retail Brokerage Firm »Internal trading system data, internal human resources data, internal CRM system data and external data »Different data scales »Data extracted from different sources: »Structured (csv) »Semi-structured data (xml) »Multi-record »Change Data Capture (CDC) Complex Data Integration Tasks Load large volumes of historical data Load incremental updates Execute complex transformations Check and ensure consistency of data
  • 36. 17th Nov. 2016 ER'2016@Gifu.Japan 36 TPC-DI Benchmark Complex Transformations (2/3) TPC-DI implements 18 complex transformations which feature the following characteristics: Transform XML into relational data Detect changes in dimension data, and apply appropriate tracking mechanisms for history-keeping dimensions Filter input data according to pre-defined conditions Identify new, deleted and updated records in input data Merge multiple input files of the same structure Join data of one input file to data from another input file with a different structure Standardize entries of the input files Join data from an input file to a dimension table Join data from multiple input files with separate structures Consolidate multiple change records per day and identify the most current Perform extensive arithmetic calculations Read data from files with variable-type records Check data for errors or for adherence to business rules Detect changes in fact data, and journal updates to reflect the current state
  • 37. 17th Nov. 2016 ER'2016@Gifu.Japan 37 TPC-DI Benchmark Metrics (3/3) Metrics Performance Metric TPC_DI_RPS = Trunc(GeoMean(TH, Min(TI1 , TI2))) TH: Throughput of Historical Load TI1: Throughput of Incremental Update 1 TI2: Throughput of Incremental Update 2 Price/Performance Metric Price-per-TPC_DI_RPS = $ / TPC_DI_RPS $ is the total 3-year pricing
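As a sanity check of the formula, a small sketch computing the metric from throughput figures (the numbers are purely illustrative, not from any published result):

```python
import math

def tpc_di_rps(th, ti1, ti2):
    """TPC_DI_RPS = Trunc(GeoMean(TH, Min(TI1, TI2)))."""
    # Geometric mean of two values is the square root of their product
    return math.trunc(math.sqrt(th * min(ti1, ti2)))

def price_per_rps(total_3yr_price, th, ti1, ti2):
    """Price/Performance: total 3-year pricing divided by TPC_DI_RPS."""
    return total_3yr_price / tpc_di_rps(th, ti1, ti2)

print(tpc_di_rps(10000, 4000, 6000))  # trunc(sqrt(10000 * 4000)) = 6324
```

Taking the minimum of the two incremental-update throughputs means a system cannot score well by excelling at only one of the update phases.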
  • 38. 17th Nov. 2016 ER'2016@Gifu.Japan 38 OLAP Mandate 12 Rules for Evaluating OLAP products by E.F. Codd et al. Rule#1: Multidimensional Conceptual View »The multidimensional conceptual view facilitates OLAP model design and analysis, as well as inter and intra dimensional calculations. Rule#2: Transparency »Whether OLAP is or is not part of the user’s customary front-end product, that fact should be transparent to the user. If OLAP is provided within the context of a client server architecture, then this fact should be transparent to the user-analyst as well. Rule#3: Accessibility »The OLAP tool must map its own logical schema to heterogeneous physical data stores, access the data, and perform any conversions necessary to present a single, coherent and consistent user view. Moreover, the tool and not the end-user analyst must be concerned about where or from which type of systems the physical data is actually coming
  • 39. 17th Nov. 2016 ER'2016@Gifu.Japan 39 OLAP Mandate 12 Rules for Evaluating OLAP products by E.F. Codd et al. Rule#4: Consistent Reporting Performance »As the number of dimensions or the size of the database increases, the OLAP user-analyst should not perceive any significant degradation in reporting performance. Rule#5: Client-Server Architecture »Most data currently requiring on-line analytical processing is stored on mainframe systems and accessed via personal computers. It is imperative that the server component of OLAP tools be sufficiently intelligent that various clients can be attached with minimum effort and integration programming. Rule#6: Generic Dimensionality »Every data dimension must be equivalent in both its structure and operational capabilities. Dimensions are symmetric, so the basic data structure, formulae, and reporting formats should not be biased toward any one data dimension.
  • 40. 17th Nov. 2016 ER'2016@Gifu.Japan 40 OLAP Mandate 12 OLAP Rules by E.F. Codd Rule#7: Dynamic Sparse Matrix Handling »The OLAP tools’ physical schema must adapt fully to the specific analytical model being created to provide optimal sparse matrix handling. »By adapting its physical data schema to the specific analytical model, OLAP tools can empower user analysts to easily perform types of analysis which previously have been avoided because of their perceived complexity. Rule#8: Multi-User Support »OLAP tools must provide concurrent access (retrieval and update), integrity, and security. Rule#9: Unrestricted Cross-dimensional Operations »The various roll-up levels within consolidation paths, due to their inherent hierarchical nature, represent in outline form, the majority of 1:1, 1:M, and dependent relationships in an OLAP model or application. Accordingly, the tool itself should infer the associated calculations and not require the user- analyst to explicitly define these inherent calculations.
  • 41. 17th Nov. 2016 ER'2016@Gifu.Japan 41 OLAP Mandate 12 OLAP Rules by E.F. Codd Rule#10: Intuitive Data Manipulation »Consolidation path re-orientation, drilling down across columns or rows, zooming out, and other manipulation inherent in the consolidation path outlines should be accomplished via direct action upon the cells of the analytical model, and should neither require the use of a menu nor multiple trips across the user interface. Rule#11: Flexible Reporting »Reporting must be capable of presenting data to be synthesized, or information resulting from animation of the data model according to any possible orientation. This means that the rows, columns, or page headings must each be capable of containing/displaying from 0 to N dimensions each, where N is the number of dimensions in the entire analytical model. Rule#12: Unlimited Dimensions and Aggregation Levels »An OLAP tool should be able to accommodate at least fifteen and preferably twenty data dimensions within a common analytical model.
  • 42. 17th Nov. 2016 ER'2016@Gifu.Japan 42 References  M. Fricke. The Knowledge Pyramid: A Critique of the DIKW Hierarchy. Journal of Information Science, 2009.  E.F. Codd, S.B. Codd and C.T. Salley. Providing OLAP to User Analysts: An IT Mandate, 1993.  J. Widom. Integrating Heterogeneous Databases: Lazy or Eager? ACM Computing Surveys 28(4es), 1996.  Y.R. Cho. Data Warehouse and OLAP Operations. www.ecs.baylor.edu/faculty/cho/4352  TPC homepage http://www.tpc.org/  M. Poess, T. Rabl and B. Caufield. TPC-DI: The First Industry Benchmark for Data Integration. PVLDB 7(13): 1367-1378, 2014. http://www.vldb.org/pvldb/vol7/p1367-poess.pdf  X. Li, J. Han and H. Gonzalez. High-Dimensional OLAP: A Minimal Cubing Approach. VLDB 2004.  C. Imhoff, N. Galemmo and J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. 2003.  R. Kimball, M. Ross, W. Thornthwaite, J. Mundy and B. Becker. The Data Warehouse Lifecycle Toolkit. 2nd Edition.  R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2nd Edition.  H. G. Molia. Data Warehousing Overview: Issues, Terminology, Products. www.cs.uh.edu/~ceick/6340/dw-olap.ppt (slides)
  • 43. 17th Nov. 2016 ER'2016@Gifu.Japan 43 Thank you for your Attention Q & A Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems Alfredo Cuzzocrea and Rim Moussa 17th of November, 2016
  • 44. 17th Nov. 2016 ER'2016@Gifu.Japan 1 Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems Part II Multi-dimensional Benchmark Design Alfredo Cuzzocrea University of Trieste & ICAR Rim Moussa University of Carthage & LaTICE The 35th Intl. Conference on Conceptual Modeling @ Gifu, JAPAN 17th of November, 2016
  • 45. 17th Nov. 2016 ER'2016@Gifu.Japan 2 Tutorial Outline Introduction Part I: State-of-the-Art Part II: Experiences TPC-H*d Experience AutoMDB TPC-DS*d Part III: Challenging Problems Conclusion
  • 46. 17th Nov. 2016 ER'2016@Gifu.Japan Given, A relational warehouse schema A workload: a set of SQL statements, W = {Q1, Q2, …, Qn} where Qi is a parameterized query How to design the multi-dimensional DB schema? How to define cubes? Will there be a single cube or multiple cubes? Are there any rules for merging cubes? Are there any rules for the definition of virtual cubes? Which optimizations are suitable for performance tuning? Derived data calculus & refresh? Data partitioning & parallel cube building? # 3 Problem
  • 47. 17th Nov. 2016 ER'2016@Gifu.Japan # 4 Idea Map each business question to an OLAP cube >> Obtain a multi-dimensional DB schema Recommend & Test Optimizations >> Derived Data >> Data partitioning >> Cube Merging
  • 48. 17th Nov. 2016 ER'2016@Gifu.Japan
SELECT t1.col_a, t1.col_b, …, tn.col_a, tn.col_z, aggregate_function(column) AS measure_1, …, aggregate_function(expression) AS measure_m
FROM table_1 t1, table_2 t2, …, table_n tn
WHERE ti.col_x operator $query_parameter$ AND ti.col_y = tj.col_z AND …
GROUP BY t1.col_a, t1.col_b, …, tn.col_a, tn.col_z
aggregate_function: min, max, sum, avg, count, count-distinct …
operator: =, <, <=, >=, != # 5 SQL Statement Template
  • 49. 17th Nov. 2016 ER'2016@Gifu.Japan Measures feature aggregate functions, e.g. min, max, count, count-distinct, sum, average, … Simple Measure Defined over a single attribute, e.g. SUM(l_extendedprice), Measure expressions Involve more than one attribute, e.g. SUM(l_extendedprice*(1 - l_discount)) Computed Members Involve already defined measures or measure expressions, e.g. M1=SUM(l_extendedprice), M2=COUNT(l_orderkey), CM = M1 / M2 # 6 OLAP Cube Design: Measures’ Definition
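The three kinds of measures can be illustrated on a handful of hypothetical lineitem-like facts (the rows and values are invented; the attribute names follow TPC-H):

```python
facts = [
    {"l_orderkey": 1, "l_extendedprice": 100.0, "l_discount": 0.05},
    {"l_orderkey": 2, "l_extendedprice": 200.0, "l_discount": 0.10},
    {"l_orderkey": 2, "l_extendedprice": 50.0,  "l_discount": 0.00},
]

# Simple measure: aggregate over a single attribute
m1 = sum(f["l_extendedprice"] for f in facts)                      # SUM(l_extendedprice)

# Measure expression: aggregate over an expression of several attributes
revenue = sum(f["l_extendedprice"] * (1 - f["l_discount"]) for f in facts)

# Computed member: combines measures that are already defined
m2 = len(facts)          # COUNT of facts
cm = m1 / m2             # e.g. average extended price per fact

print(m1, revenue, cm)
```

The distinction matters for cube design: simple measures and measure expressions are evaluated per fact before aggregation, while computed members are derived from aggregated measures and so cannot be pushed into the fact scan.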
  • 50. 17th Nov. 2016 ER'2016@Gifu.Japan All attributes involved in measures and measure expressions belong to the fact table, Example: Q10 of TPC-H benchmark # 7 OLAP Cube Design Fact Table Definition (1/9)
  • 51. 17th Nov. 2016 ER'2016@Gifu.Japan When measurable attributes belong to different tables, the fact table is a view defined as the join of the relations to which the attributes belong! Example 1: Q9 of TPC-H benchmark, where l_extendedprice, l_discount and l_quantity belong to lineitem, and ps_supplycost belongs to partsupp. The fact table is the join of the lineitem and partsupp tables. Only the attributes needed for joins with dimension tables (namely l_partkey, l_orderkey, l_suppkey) and the measurable attributes (namely l_extendedprice, l_discount, l_quantity, ps_supplycost) are selected. # 8 OLAP Cube Design Fact Table Definition over multiple tables (2/9)
  • 52. 17th Nov. 2016 ER'2016@Gifu.Japan # 9 Q9 SQL statement OLAP Cube Design Fact Table Definition over multiple tables (3/9)
  • 53. 17th Nov. 2016 ER'2016@Gifu.Japan # 10 OLAP Cube Design Fact Table Definition over multiple tables (4/9)
  • 54. 17th Nov. 2016 ER'2016@Gifu.Japan # 11 Q14 SQL statement OLAP Cube Design Fact Table Definition over multiple tables (5/9) Example 2: Q14 of TPC-H benchmark, where l_extendedprice and l_discount belong to lineitem, and p_type belongs to part. The fact table is the join of the lineitem and part tables
  • 55. 17th Nov. 2016 ER'2016@Gifu.Japan Filters Processing: the fact table is defined as a view of facts with filters »Extract all filters involving the fact table from the WHERE clause, such as (attr_i operator attr_j), where both attr_i and attr_j belong to the fact table, (attr_k operator $value$), such that attr_k belongs to the fact table, [not] exists (select … from … where attr_k …), such that attr_k belongs to the fact table, attr_k [not] in (list of values), such that attr_k belongs to the fact table Example 1: Q10 of TPC-H benchmark Example 2: Q16 of TPC-H benchmark Example 3: Q21 of TPC-H benchmark # 12 OLAP Cube Design Fact Table Definition and filters’ processing (6/9)
  • 56. 17th Nov. 2016 ER'2016@Gifu.Japan # 13 Q10 SQL statement OLAP Cube Design Fact Table Definition and filters’ processing (7/9)
  • 57. 17th Nov. 2016 ER'2016@Gifu.Japan # 14 Q16 SQL statement OLAP Cube Design Fact Table Definition and filters’ processing (8/9)
  • 58. 17th Nov. 2016 ER'2016@Gifu.Japan # 15 Q21 SQL statement OLAP Cube Design Fact Table Definition and filters’ processing (9/9)
  • 59. 17th Nov. 2016 ER'2016@Gifu.Japan First, consider all attributes in the SELECT, WHERE and GROUP BY clauses, »Discard measurable attributes, which appear in measures, measure expressions, or computed members, »Discard attributes which appear in the WHERE clause and are used for joining tables or filtering the fact table with static values, »Compose the time dimension along well-known hierarchies, »Year, quarter, month »Compose the geography dimension along well-known hierarchies, »Region, nation, city # 16 OLAP Cube Design Dimension Definition (1/7)
  • 60. 17th Nov. 2016 ER'2016@Gifu.Japan Example: Q10 of TPC-H benchmark All highlighted attributes are considered for building dimensions Time dimension: o_orderdate requires order_year and order_quarter levels # 17 OLAP Cube Design Dimension Definition (2/7)
  • 61. 17th Nov. 2016 ER'2016@Gifu.Japan Second, find out hierarchical relations, i.e., one-to-many relationships, and re-organize attributes along hierarchies to form the dimensions’ hierarchies, »Example: Q10 of TPC-H benchmark: each customer can be related to at most one nation, but a nation may be related to many customers, customer_dim: Customer nation → n_name; Customer details → c_custkey, c_name, c_acctbal, c_address, c_phone, c_comment order_dim: order_year → order_quarter # 18 OLAP Cube Design Dimension Definition (3/7)
  • 62. 17th Nov. 2016 ER'2016@Gifu.Japan Third, distinguish levels from properties. Properties are in functional dependency with levels, Example: Q10 of TPC-H benchmark: For customer_dim, c_custkey is the level, and all of the c_name, c_acctbal, c_address, c_phone, c_comment attributes are properties of the c_custkey level. # 19 OLAP Cube Design Dimension Definition (3/7)
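The level/property distinction rests on functional dependency: an attribute is a property of a level if each level value maps to exactly one attribute value. A small sketch of such a check (the helper name and sample rows are invented):

```python
def functionally_determines(rows, lhs, rhs):
    """True if column `lhs` functionally determines column `rhs` (lhs -> rhs)."""
    seen = {}
    for row in rows:
        key, val = row[lhs], row[rhs]
        # setdefault stores the first value seen for this key; any later
        # conflicting value means the dependency does not hold
        if seen.setdefault(key, val) != val:
            return False
    return True

customers = [
    {"c_custkey": 1, "c_name": "Ann", "n_name": "FRANCE"},
    {"c_custkey": 2, "c_name": "Bob", "n_name": "FRANCE"},
    {"c_custkey": 1, "c_name": "Ann", "n_name": "FRANCE"},
]
# c_name is a property of the c_custkey level (c_custkey -> c_name holds)
print(functionally_determines(customers, "c_custkey", "c_name"))   # True
# n_name does not determine c_custkey: one nation, many customers
print(functionally_determines(customers, "n_name", "c_custkey"))   # False
```

In practice the dependencies would be read from declared keys rather than inferred from data, but the same test also explains why customer → nation is a valid hierarchy step (many-to-one) while the reverse is not.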
  • 63. 17th Nov. 2016 ER'2016@Gifu.Japan Filters Processing: not all tuples in the dimension table should be considered, so we have to extract, from the WHERE clause, filters defined over dimension tables that are not useful for multi-dimensional design, Example 1: Q12 of TPC-H benchmark For each line shipping mode and year, count the number of high-priority orders (high line count) and the number of not-high-priority orders (low line count) over orders’ facts, and consider only lines such that l_commit_date < l_receipt_date and l_ship_date < l_commit_date. These are filters over a dimension table. Example 2: Q19 of TPC-H benchmark Calculate revenue for particular parts # 20 OLAP Cube Design Dimension Definition and Filters’ processing (4/7)
  • 64. 17th Nov. 2016 ER'2016@Gifu.Japan Example 1: Q12 of TPC-H Benchmark # 21 OLAP Cube Design Dimension Definition and Filters’ processing (5/7)
  • 65. 17th Nov. 2016 ER'2016@Gifu.Japan Example 2: Q19 of TPC-H Benchmark. # 22 OLAP Cube Design Dimension Definition and Filters’ processing (6/7)
  • 66. 17th Nov. 2016 ER'2016@Gifu.Japan # 23 OLAP Cube Design Dimension Definition and Filters’ processing (7/7)
  • 67. 17th Nov. 2016 ER'2016@Gifu.Japan # 24 TPC-H*d Truly OLAP variant of TPC-H benchmark TPC-H SQL workload translated into MDX (MultiDimensional eXpressions) The workload is composed of 23 MDX statements for OLAP cubes and 23 MDX statements for OLAP business queries. Each business question of TPC-H benchmark is mapped to an OLAP cube
  • 68. 17th Nov. 2016 ER'2016@Gifu.Japan # 25 TPC-H*d Q8: From SQL statement to OLAP cube
  • 69. 17th Nov. 2016 ER'2016@Gifu.Japan # 26 TPC-H*d TPC-H*d OLAP Cube C8 Market Share for each supplier nation within a region of customers, for each year and each part type
  • 70. 17th Nov. 2016 ER'2016@Gifu.Japan # 27 TPC-H*d TPC-H*d OLAP Query Q8 Market share of RUSSIA suppliers within the AMERICA region, over the years 1995 and 1996, for part type ECONOMY ANODIZED STEEL
  • 71. 17th Nov. 2016 ER'2016@Gifu.Japan Open-source software implemented in Java Parses MDB schema (.xml) files using the SAX library Performs comparisons of OLAP cubes' characteristics. »For each pair of OLAP cubes, »show whether they have the same fact table or not »compute the number of shared | different | coalescable dimensions »Dimensions are coalescable if they are extracted from the same dimension table and their hierarchies are coalescable »compute the number of shared | different measures »Run merge of OLAP cubes using different similarity functions »Simple distance function: have or not the same fact table »K-means clustering »Distance function computed with weights assigned to dimensions »Propose virtual cubes »Auto-generate a new MDB schema (.xml) »Create an MDB schema from an SQL workload »On-going tests include the TPC-DS benchmark # 28 AutoMDB
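AutoMDB's exact distance functions are not spelled out here; as an illustration only, a weighted distance over fact-table identity and Jaccard similarity of the dimension and measure sets might look like this (the weights, cube descriptions and attribute names are hypothetical):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 for two empty sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cube_distance(c1, c2, w_fact=0.5, w_dims=0.3, w_meas=0.2):
    """Weighted distance between two cube descriptions; 0.0 means identical."""
    fact = 0.0 if c1["fact"] == c2["fact"] else 1.0
    return (w_fact * fact
            + w_dims * (1 - jaccard(c1["dims"], c2["dims"]))
            + w_meas * (1 - jaccard(c1["measures"], c2["measures"])))

c8 = {"fact": "lineitem", "dims": {"time", "customer_nation", "part_type"},
      "measures": {"revenue"}}
c10 = {"fact": "lineitem", "dims": {"time", "customer_nation"},
       "measures": {"revenue"}}
d = cube_distance(c8, c10)
print(round(d, 3))  # 0.1 — close cubes, good candidates for merging
```

Pairs with small distance (same fact table, largely overlapping dimensions and measures) are the natural candidates for merging into one cube or exposing as a virtual cube.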
  • 72. 17th Nov. 2016 ER'2016@Gifu.Japan # 29 AutoMDB Load OLAP Cubes defined in xml file
  • 73. 17th Nov. 2016 ER'2016@Gifu.Japan # 30 AutoMDB Compare OLAP Cubes –have or not same fact table
  • 74. 17th Nov. 2016 ER'2016@Gifu.Japan # 31 AutoMDB Compare Cubes –Group cubes which have same fact table
  • 75. 17th Nov. 2016 ER'2016@Gifu.Japan # 32 AutoMDB Compare Cubes –Auto-generate a new MDB schema
  • 76. 17th Nov. 2016 ER'2016@Gifu.Japan # 33 References Modeling Multidimensional Databases (non-exhaustive list) M. Gyssens and L. V.S. Lakshmanan. A Foundation for Multi-Dimensional Databases. VLDB’1997. R. Agrawal, A. Gupta and S. Sarawagi. Modeling Multidimensional Databases. ICDE’1997. J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. ICDE’1996. P. Vassiliadis. Modeling Multidimensional Databases, Cubes and Cube Operations. SSDBM’1998. L. Cabibbo and R. Torlone. A Logical Approach to Multidimensional Databases. EDBT’1998. D. Cheung, B. Zhou, B. Kao, H. Lu, T. Lam and H. Ting. Requirement-based data cube schema design. CIKM’1999. T. Niemi, J. Nummenmaa and P. Thanisch. Constructing OLAP cubes based on Queries. DOLAP’2001. O. Teste. Towards Conceptual Multidimensional Design in Decision Support Systems. DEXA’2010. A. Cuzzocrea and R. Moussa. Multidimensional Database Design via Schema Transformation: Turning TPC-H into the TPC-H*d Multidimensional Benchmark. COMAD’2013.
  • 77. 17th Nov. 2016 ER'2016@Gifu.Japan 34 Thank you for your Attention Q & A Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems Alfredo Cuzzocrea and Rim Moussa 17th of November, 2016
  • 78. 17th Nov. 2016 ER'2016@Gifu.Japan 1 Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems Part III Challenging Problems Alfredo Cuzzocrea University of Trieste & ICAR Rim Moussa University of Carthage & LaTICE The 35th Intl. Conference on Conceptual Modeling @ Gifu, JAPAN 17th of November, 2016
  • 79. 17th Nov. 2016 ER'2016@Gifu.Japan 2 Tutorial Outline Introduction Part I: State-of-the-Art Part II: Experiences Part III: Challenges Big Data Integration Flexible Schema Model Curse of dimensionality Systems which scale-out Intelligent Recommenders Real-time OLAP Advanced Visualization Conclusion
  • 80. 17th Nov. 2016 ER'2016@Gifu.Japan 3 Big Data Volume »Volume refers to the amount of data, henceforth the challenge is integration at scale, Velocity »Velocity refers to the speed at which new data is generated, henceforth the challenge is to integrate and analyze data while it is being generated, Variety »Variety refers to different types of data; e.g. structured (relational data), semi-structured (XML, JSON, BSON), unstructured (text); henceforth the challenge is integration of different types of data, Veracity »Veracity refers to the messiness or trustworthiness of the data, henceforth the challenge is to integrate uncertain data quality in data sources, Value »Value refers to our ability to turn data into value.
  • 81. 17th Nov. 2016 ER'2016@Gifu.Japan 4 Advanced, high-performance technologies »Load and extract different data formats, and perform complex transformations »Solve heterogeneity »data type heterogeneity (phone type is a number or a string), »semantic heterogeneity (column title vs. column job title), »value heterogeneity (Pr. vs. Prof. vs. Professor) »entity resolution through identification of the same entities when values are misspelled, are synonyms or abbreviations, or originate from different systems (date formats) or different domains (true is 1) »Syncing Across Data Sources »data copies migrated from a wide range of sources at different rates and schedules can rapidly get out of synchronization »Perform data integration at scale Low Solution Cost TPC-DI is a good start for benchmarking integration technologies TPC-DI implementation is on-going Challenge #1: Big Data Integration
  • 82. 17th Nov. 2016 ER'2016@Gifu.Japan 5 Challenge #2: Data Schema Model Schema-based data »E.g. relational model, multi-dimensional model »They define what columns appear, their names, and their datatypes. Schemaless Data and Dynamic Schema »Non-uniform data: custom fields and non-uniform data types »Parse data and query on-the-fly »NoSQL systems (e.g. Apache Pig Latin, SQL-on-Hadoop) Data Lake »All data is loaded from source systems, i.e., no data is turned away. »Data is stored at the leaf level in an untransformed or nearly untransformed state. »Data is transformed and a schema is applied to fulfill the needs of analysis. »Load in the raw data as-is; at processing time, give the data a structure: schema-on-read »High agility: configuration and reconfiguration of models, queries and applications as needed, on-the-fly
  • 83. 17th Nov. 2016 ER'2016@Gifu.Japan 6 Challenge #3: Curse of Dimensionality Datasets in applications like bioinformatics and text processing are characterized by, Many columns, a.k.a. high-dimensional data »100 columns of data → a 100-dimension hyperspace »A data cube of 100 dimensions where each dimension has cardinality 10 → 11^100 aggregate cells Moderate size »One million tuples Proposed Solutions Minimal Cubing Approach [Li et al., 2004] Dimension Reduction & Feature Selection »Principal Component Analysis (PCA) »Linear Discriminant Analysis (LDA) »Canonical Correlation Analysis (CCA) »Latent Semantic Indexing (LSI) for text data »Complex and not scalable
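The 11^100 figure follows from counting full-cube aggregate cells: each dimension contributes its values plus the special 'all' value, so the total is the product of (cardinality + 1) over all dimensions. A quick check:

```python
def aggregate_cells(cardinalities):
    """Number of cells in a fully materialized cube: prod(c_i + 1) over dimensions."""
    total = 1
    for c in cardinalities:
        total *= (c + 1)  # the dimension's values plus the 'all' aggregate
    return total

# 100 dimensions, cardinality 10 each -> 11^100 aggregate cells
cells = aggregate_cells([10] * 100)
print(cells == 11 ** 100)  # True
```

With only a million base tuples, the overwhelming majority of those cells are empty, which is precisely why full materialization is hopeless and minimal-cubing or dimension-reduction approaches are needed.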
  • 84. 17th Nov. 2016 ER'2016@Gifu.Japan 7 Challenge #4: Scale-out Data Fragmentation Parallel IO Parallel Processing Technologies Parallel Cube Calculus Distributed Relational Data Warehouses + Mid-tier for parallel cube calculus, e.g. OLAP* framework SQL-on-Hadoop Systems e.g. Hive, Spark SQL, Drill, Impala, BigInsights Scalable and Distributed Data Structures K-RP*s, SkipTree, SkipWebs, PN-Tree
  • 85. 17th Nov. 2016 ER'2016@Gifu.Japan Recommenders for performance tuning »Automated selection of materialized views and indexes for SQL workloads »AutoAdmin research project at Microsoft, which explores novel techniques to make databases self-tuning [Agrawal et al., 2000] »MS Database Tuning Advisor »Offline techniques »DBA is completely out of the picture 8 Challenge #5: Intelligent Recommenders Indexes and Materialized Views are physical structures that can significantly accelerate performance.
  • 86. 17th Nov. 2016 ER'2016@Gifu.Japan 9 Challenge #5: Intelligent Recommenders Alerter Approach [Hose et al., 2008] »supports the aggregate configuration of an OLAP server by (1) continuously monitoring information about the workload and the benefit of aggregation tables and (2) alerting the DBA if changes to the current configuration would be beneficial Semi-Automatic Index Tuning: keeping DBAs in the loop [Schnaitter and Polyzotis, 2012] »Online workload analysis with decisions delegated to the DBA »The solution takes into account index interactions »Index interactions: two indices a and b interact if the benefit of a depends on the presence of b.
  • 87. 17th Nov. 2016 ER'2016@Gifu.Japan 10 Challenge #6: Real-time Processing Retrospective Analysis Traditional Architectures »DWs are deployed as part of an OLAP system separated from the OLTP system, »Data propagates down to the OLAP system, but typically after some lag, »This is sufficient for retrospective analytics, but does not suffice in situations that require real-time analytics. Real-time analytics use cases »Intelligent road-traffic management »Remote health-care monitoring »Complex event processing systems New Data Processing Architectures Architectures for Big Data processing at scale »Lambda Architecture by Nathan Marz »Batch processing system + speed processing system »Kappa Architecture »No batch processing system »New software for architecting DW systems!
  • 88. 17th Nov. 2016 ER'2016@Gifu.Japan 11 Challenge #6: Real-time Processing Real-time DSS Characteristics Data Stream Systems »Streams are pushed at the system »Long-running (continuous) queries »Stream characteristics are often unknown and time-varying, so on-line profiling and adaptation are necessary »Continuously arriving data streams, up to gigabits per second in network monitoring »Would like to run continuous OLAP/mining queries »The working set of typical systems might fit in memory; disk is mostly for archiving purposes
17th Nov. 2016 ER'2016@Gifu.Japan 12
Challenge #6: Real-time Processing
Lambda Architecture: Big Picture (1/4)
17th Nov. 2016 ER'2016@Gifu.Japan 13
Challenge #6: Real-time Processing
Lambda Architecture: Batch Layer (2/4)
What does it do?
»Focuses on the ingest and storage of large quantities of data and the computation of views from that data
»Storage of an immutable, append-only, constantly expanding master copy of the system's data
»Computation of views: derivative data for consumption by the serving layer
Technologies
»Batch systems: Apache Hadoop/MapReduce, Apache Pig, Apache Spark
»Batch view databases: ElephantDB, SploutSQL
17th Nov. 2016 ER'2016@Gifu.Japan 14
Challenge #6: Real-time Processing
Lambda Architecture: Speed Layer (3/4)
What does it do?
»Processes raw stream data into views and deploys those views on the serving layer
»Stream processing, continuous computation
Technologies
»Apache Storm, Apache Spark Streaming
»Speed-layer views need to be stored in a randomly writable database and are read & write
17th Nov. 2016 ER'2016@Gifu.Japan 15
Challenge #6: Real-time Processing
Lambda Architecture: Serving Layer (4/4)
What does it do?
»Focuses on serving up views of the data as quickly as possible
»Queries the batch and real-time views and merges them
»Should meet scalability and fault-tolerance requirements
Technologies
»Read-only: ElephantDB, Druid
»Read & write: Riak, Cassandra, Redis
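The three layers above can be sketched end to end in a few lines. This is a deliberately minimal illustration of the query-time merge, with invented data; in a real deployment the speed layer would be incremental (e.g. Storm) rather than a recomputation:

```python
# Lambda architecture in miniature: a batch view recomputed from the
# immutable master dataset, a speed view covering events that arrived
# after the last batch run, and a serving layer merging both at query time.

master_data = [("fr", 3), ("it", 2), ("fr", 1)]   # immutable, append-only log
recent_events = [("fr", 5), ("es", 4)]            # arrived since last batch run

def build_view(log):
    """Batch layer: full recomputation of a count view over a log."""
    view = {}
    for key, n in log:
        view[key] = view.get(key, 0) + n
    return view

def query(key, batch, speed):
    """Serving layer: merge the batch view with the real-time view."""
    return batch.get(key, 0) + speed.get(key, 0)

batch = build_view(master_data)
speed = build_view(recent_events)   # speed layer is incremental in reality
print(query("fr", batch, speed))    # 3 + 1 (batch) + 5 (speed) = 9
print(query("es", batch, speed))    # 0 (batch) + 4 (speed) = 4
```

When the next batch run absorbs `recent_events` into `master_data`, the speed view is discarded and rebuilt, which is the architecture's recipe for bounding the complexity of the real-time path.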
17th Nov. 2016 ER'2016@Gifu.Japan 16
Challenge #7: Visualization
Goal: visualize multidimensional data sets by capturing the dimensionality of the data
»A data cube is a representation of a multidimensional data set
OLAP client technologies
»Pivot table (cross-tab) with OLAP ops: drill-down, roll-up, pivot, slice, dice
»Bar charts, pie charts, and time series
»e.g., Apache jPivot, Saiku
Challenges for sophisticated visualization techniques
»Interactive
»Reflect current data: updateable
»Support of large datasets
»Ergonomic
17th Nov. 2016 ER'2016@Gifu.Japan 17
Challenge #7: Visualization
Visualization Techniques
»Tree-map: displays hierarchical (tree-structured) data as a set of nested rectangles
»Scatter plot: uses Cartesian coordinates to display values of typically two variables for a set of data
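The tree-map idea (areas proportional to values) can be sketched with the simplest layout, slice-and-dice on flat data; real tree-maps recurse over the hierarchy and usually use the squarified variant for better aspect ratios:

```python
# Slice-and-dice treemap layout sketch: split a rectangle along one axis
# in proportion to each item's value, so area encodes the measure.

def treemap(values, x, y, w, h, vertical=True):
    """Return one (x, y, width, height) rectangle per value."""
    total = sum(values)
    rects, offset = [], 0.0
    for v in values:
        frac = v / total
        if vertical:    # slice the width
            rects.append((x + offset, y, w * frac, h))
            offset += w * frac
        else:           # slice the height
            rects.append((x, y + offset, w, h * frac))
            offset += h * frac
    return rects

print(treemap([1, 1, 2], 0, 0, 100, 50))
# [(0.0, 0, 25.0, 50), (25.0, 0, 25.0, 50), (50.0, 0, 50.0, 50)]
```

For a hierarchy, the same routine is applied recursively inside each rectangle, alternating the slicing axis per level.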
17th Nov. 2016 ER'2016@Gifu.Japan 18
Challenge #7: Visualization
Visualization Techniques
»Parallel coordinate plot: visualizes high-dimensional geometry and analyzes multivariate data
»Choropleth map: a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map
17th Nov. 2016 ER'2016@Gifu.Japan 19
Challenge #7: Visualization
Multidimensional Networks (1/2)
»Networks' application domains: social networks (LinkedIn, Facebook, …), the web, communication networks, …
»A multidimensional network combines a graph structure with multidimensional attributes
»Example: Graph Cube [Zhao et al. 2011] (figure: -1- and -2- are friends)
17th Nov. 2016 ER'2016@Gifu.Japan 20
Challenge #7: Visualization
Graph Cube: Example Queries (2/2)
Cuboid queries (a single multidimensional space)
»Q1: What is the network structure between different genders?
»Q2: What is the network structure between the various gender and location combinations?
Crossboid queries (multiple multidimensional spaces)
»What is the network structure between the user with ID = 3 and the various locations?
»What is the network structure between users grouped by gender vs. users grouped by location?
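A cuboid query like Q1 amounts to collapsing vertices by an attribute and aggregating the edges between the resulting groups. A toy sketch of that aggregation (the user data is invented; this is the idea behind [Zhao et al. 2011], not their implementation):

```python
# Graph cube cuboid in miniature: collapse a friendship network onto the
# "gender" vertex dimension and count the edges between groups.

users = {1: "M", 2: "F", 3: "F", 4: "M"}          # user id -> gender
friendships = [(1, 2), (1, 3), (2, 3), (3, 4)]    # undirected edges

def cuboid(edges, attr):
    """Aggregate the graph by a vertex attribute; edge weight = #links."""
    agg = {}
    for u, v in edges:
        key = tuple(sorted((attr[u], attr[v])))   # undirected: order-free key
        agg[key] = agg.get(key, 0) + 1
    return agg

print(cuboid(friendships, users))
# {('F', 'M'): 3, ('F', 'F'): 1}
```

The result is itself a small weighted graph over the attribute values, which is exactly what a cuboid query returns for visualization.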
17th Nov. 2016 ER'2016@Gifu.Japan 21
Challenge #7: Visualization
Nanocubes by AT&T (1/3)
Real-time exploratory visualization of large spatiotemporal and multidimensional datasets
»Real-time: extremely fast queries
»Spatial: a spatial region that can be either a rectangle covering most of the world or a heatmap of activity
»Multidimensional: besides latitude, longitude, and time, there are other attributes such as tweet device (Android or iPhone) and tweet language (eng, fr, it, sp, ru, …)
Nanocube algorithm outline [Lins et al., 2013]
»For every object o_i, find the finest address of the schema S hit by this object and update the time series associated with this address
»Update, in a deepest-first fashion, all coarser addresses hit by o_i
Cons: memory usage; when indexing all six dimensions (latitude, longitude, time, language, device, application), the 210 million points from Twitter take around 45 GB of memory
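The insertion outline above can be sketched with quadtree-style address strings. This is a drastic simplification (counts instead of time series, one dimension, no sub-structure sharing, which is the part real nanocubes rely on to save memory):

```python
# Nanocube insertion sketch: each object updates the count at its finest
# spatial address and at every coarser prefix address, so that any region
# query later becomes a direct lookup instead of a scan.

counts = {}

def insert(address):
    """address: quadtree path as a digit string, e.g. '021'."""
    # update the finest address and, deepest first, all coarser prefixes
    for depth in range(len(address), -1, -1):
        prefix = address[:depth]
        counts[prefix] = counts.get(prefix, 0) + 1

insert("021")   # a point whose quadtree path is cell 0 -> 2 -> 1
insert("023")   # a nearby point sharing the coarser cell "02"
print(counts["02"], counts[""])   # 2 points under cell "02", 2 in total
```

The price of those O(1) lookups is exactly the memory cost the slide mentions: every inserted point touches one counter per level of every dimension hierarchy.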
17th Nov. 2016 ER'2016@Gifu.Japan 22
Challenge #7: Visualization
Nanocubes by AT&T (2/3)
»Which device is more popular for tweeting?
»Is one device more popular in certain areas than in others?
»How has this popularity changed over time?
(Figure: l_spatial1 is coarser than l_spatial2)
17th Nov. 2016 ER'2016@Gifu.Japan 23
Challenge #7: Visualization
Nanocubes by AT&T (3/3)
(Figure: intermediate nanocubes generated after each tweet is inserted)
17th Nov. 2016 ER'2016@Gifu.Japan 25
References
»M. Fowler. Schemaless Data Structures. 2013. http://martinfowler.com/articles/schemaless/
»N. Marz and J. Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems, 1st Edition.
»S. Agrawal, S. Chaudhuri and V. Narasayya. Automated Selection of Materialized Views and Indexes for SQL Databases. VLDB 2000. http://www.research.microsoft.com/dmx/AutoAdmin
»K. Hose, D. Klan, M. Marx and K. Sattler. When Is It Time to Rethink the Aggregate Configuration of Your OLAP Server? VLDB 2008.
»K. Schnaitter and N. Polyzotis. Semi-Automatic Index Tuning: Keeping DBAs in the Loop. VLDB 2012.
»P. Zhao, X. Li, D. Xin and J. Han. Graph Cube: On Warehousing and OLAP Multidimensional Networks. SIGMOD 2011.
»L. D. Lins, J. T. Klosowski and C. E. Scheidegger. Nanocubes for Real-Time Exploration of Spatiotemporal Datasets. IEEE Trans. Vis. Comput. Graph., 2013. https://github.com/laurolins/nanocube
17th Nov. 2016 ER'2016@Gifu.Japan 26
Thank You for Your Attention. Q & A
Multi-Dimensional Database Modeling and Querying: Methods, Experiences and Challenging Problems
Alfredo Cuzzocrea and Rim Moussa
17th of November, 2016