The convergence of reporting and interactive BI on Hadoop

Gustavo Arocena
June 19, 2018
Db2 Big SQL

SQL-based Interactive BI on Hadoop
Timeline (2008, 2011, 2011, 2012):
• The good old times: EDW, Analytic DB
• Not there yet: EDW, HDFS, Analytic DB, SQL on Hadoop, “Big Data”
• It works, but …: EDW, HDFS, Analytic DB, SQL on Hadoop, “Big Data”
• Everyone happy?: EDW, HDFS, Analytic DB, SQL on Hadoop, BI Accelerator, “Big Data”

BI Accelerators – The Value Prop
• Enable offloading of Interactive BI to Hadoop
• Interactive BI on Big Data
• Varying degree of autonomics (auto-creation, auto-refresh)
• Fast response for analytic queries

SELECT p.category, max(s.amount)
FROM products p, sales s
WHERE p.id = s.pid
GROUP BY p.category

BI Accelerators – The Small Print
• Duplication
  • Licensing
  • Skills
  • Vendors for service/support
• Complexity
  • Data architecture
  • Data copying & refreshing
• Narrow scope
  • Only repetitive, tool-generated queries
• Low integration with Hadoop platform

BI Acceleration Techniques

CREATE HADOOP TABLE sales
  (id integer,
   city string,
   amount double)

SELECT sum(amount)
FROM sales
WHERE amount < 500
AND city = 'Toronto'

Techniques applied to this query:
• Columnar Storage
• Cubing
• Indexing
• Caching (1st use, 2nd use)
  • Data
  • Columnar stats
  • Query results
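
The query-result flavor of caching can be sketched as follows. This is a minimal illustration, not how any BI accelerator or Big SQL implements it: it uses Python's built-in sqlite3 module as a stand-in engine, the sample rows are made up, and `run_cached`, `result_cache` and `stats` are hypothetical names introduced here. The first use executes the query and fills the cache; the second use is served from the cache without touching the table.

```python
import sqlite3

# Hypothetical in-memory stand-in for the slide's `sales` table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "Toronto", 120.0), (2, "Toronto", 640.0),
     (3, "Ottawa", 75.0), (4, "Toronto", 310.0)],
)

QUERY = "SELECT sum(amount) FROM sales WHERE amount < 500 AND city = 'Toronto'"

result_cache = {}                    # query text -> result rows
stats = {"hits": 0, "misses": 0}     # counters to make cache behavior visible


def run_cached(sql):
    """Return the rows for `sql`, executing it only on a cache miss."""
    if sql in result_cache:
        stats["hits"] += 1
        return result_cache[sql]
    stats["misses"] += 1
    rows = conn.execute(sql).fetchall()
    result_cache[sql] = rows
    return rows


first = run_cached(QUERY)    # 1st use: scans the table and fills the cache
second = run_cached(QUERY)   # 2nd use: answered from the cache
print(first, second, stats)
```

The sketch keys the cache on the exact query text and never invalidates entries, so it only helps with repeated identical queries over unchanged data; a real engine would also need to handle refresh when the underlying table changes.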

Why Not in SQL on Hadoop?
SQL on Hadoop Maturity Levels (6-7 years):
• Interactive BI
• Concurrent workloads
• Enterprise features
• Core SQL processing

Reducing IO and Computation
[Figure: three snapshots (2014, 2018, 2020?) showing how the techniques cost-based optimization, partitioning, columnar storage, cubing, caching and indexing are divided between SQL on Hadoop and BI Accelerators]

The Convergence
[Figure: BI performance over time (~2009, ~2012, ~2020) for SQL on Hadoop and BI Accelerators. Techniques labeled on the chart: partitioning, cost-based optimization, columnar storage, cubing, indexing, on-disk caching, in-memory caching]

Jethro
• Engine: Multiple instances of single-node SMP engine
• Acceleration techniques: Indexing, Cubing, Caching
• Unique features:
  • Computes cubes “bottom up” on demand
  • Creates inverted indexes for all columns
  • Re-ingests all the data into proprietary format

AtScale
• Engine: Not an engine
• Acceleration techniques: Cubing, Caching, Approximate answers (e.g. count distinct)
• Unique features:
  • Imposes star schema on all data
  • Automatic and manual cubes
  • Uses another engine to execute queries

Kylin
• Engine: MOLAP engine, storing cube cells in HBase
• Acceleration techniques: Cubing, Cost-based Optimizer
• Unique features:
  • Brute force cube building
  • Routes query to Hive when not in cube
  • Uses Spark to speed up cube building

[Diagrams: each product shown alongside EDW, HDFS and Analytic DB; one diagram adds Spark, Hive and HBase, another adds Hive/SparkSQL/…]

Using MPP Engines for Interactive BI
• “MPP is a full scan architecture”
  Response: “MPP is a parallel architecture. Full scans are the worst-case scenario, not the norm”
• “You can do BI on Hadoop without table scans”
  Response: “You need scans for queries that can’t be answered using cube/cache/index”
• “MPP SQL engines do not scale to many users”
  Response: “MPP engines scale better than non-MPP ones”

IBM Db2 Big SQL
• Top performance on complex workloads
• Deep Platform Integration with no Lock-In
• Reporting AND Interactive BI
• Built-in federation to Oracle, Netezza, Db2
• Workload management
[Architecture diagram: one Big SQL Head and five Big SQL Workers, alongside HDFS, Hive MS (metastore) and Hadoop NN (NameNode)]

Big SQL Performance and Resource Utilization on Complex Workloads
Hadoop DS @ 100TB, 4 concurrent streams
• Elapsed Time (hours): Big SQL 13.7, Spark SQL 43.2 (1/3)
• CPU Utilization (%): Big SQL 76.4, Spark SQL 88.2 (-15%)
• Disk Reads (MB/sec): Big SQL 107, Spark SQL 388 (1/3)
• Disk Writes (MB/sec): Big SQL 25, Spark SQL 237 (1/9)
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
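
The rounded annotations on this slide (1/3, -15%, 1/3, 1/9) can be checked against the raw figures with a few lines of Python. This is only a worked-arithmetic sketch: the numbers are the ones shown on the slide, and the `metrics` and `ratios` names are introduced here for illustration.

```python
# Figures as read from the slide: (Big SQL, Spark SQL) for each metric.
metrics = {
    "Elapsed time (hours)": (13.7, 43.2),
    "CPU utilization (%)": (76.4, 88.2),
    "Disk reads (MB/sec)": (107.0, 388.0),
    "Disk writes (MB/sec)": (25.0, 237.0),
}

# Ratio of the Big SQL figure to the Spark SQL figure for each metric.
ratios = {name: big / spark for name, (big, spark) in metrics.items()}

for name, r in ratios.items():
    # Show the ratio and the relative difference versus Spark SQL.
    print(f"{name}: Big SQL / Spark SQL = {r:.3f} ({(r - 1) * 100:+.1f}%)")
```

Elapsed time and disk reads come out near 0.32 and 0.28 (roughly one third), and disk writes near 0.11 (roughly one ninth). For CPU utilization the result depends on the baseline: Big SQL is about 13% lower relative to Spark SQL, which is the same as Spark SQL being about 15% higher relative to Big SQL.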

Accelerating BI with MQTs

Hadoop MQT for join (denormalization)
• High Cardinality
• Stored on HDFS as ORC
• Partitioned
• Sorted for PPD (predicate pushdown)

CREATE HADOOP TABLE joinMQT AS (
  SELECT p_type, p_color, lo_quantity, d_year
  FROM lineorder, part, dwdate
  WHERE p_partkey = lo_partkey
  AND lo_orderdate = d_datekey
  AND d_year BETWEEN 2000 AND 2010)
PARTITION BY d_year
SORT BY p_type
STORED AS ORC
DATA INITIALLY DEFERRED…;

Answered using joinMQT:

SELECT AVG(lo_quantity), p_color
FROM lineorder, part, dwdate
WHERE p_partkey = lo_partkey
AND lo_orderdate = d_datekey
AND d_year = 2007
AND p_type = 'outdoor'
GROUP BY p_color

Native MQT for aggregation (cubing)
• Low Cardinality
• Stored on head node in “native” format
• Can be indexed (turns MQT into true cube)

CREATE TABLE aggMQT AS (
  SELECT sum(lo_revenue), c_city
  FROM lineorder, customer
  WHERE c_custkey = lo_custkey
  GROUP BY c_city)
DATA INITIALLY DEFERRED…;

CREATE UNIQUE INDEX aggMQTidx ON
aggMQT(c_city);

Answered using aggMQT and aggMQTidx:

SELECT sum(lo_revenue)
FROM lineorder, customer
WHERE c_custkey = lo_custkey
AND c_city = 'Toronto'
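
The idea behind the aggregation MQT can be tried out on a laptop with a rough sketch like the one below. It is not Big SQL: it uses Python's built-in sqlite3 module, which has no MQTs and no automatic query rewrite, so the summary table is built by hand and the query is redirected to it manually. The sample rows and the `revenue` column alias are assumptions made for illustration; the table and index names follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_city TEXT);
CREATE TABLE lineorder (lo_custkey INTEGER, lo_revenue REAL);
INSERT INTO customer VALUES (1, 'Toronto'), (2, 'Ottawa'), (3, 'Toronto');
INSERT INTO lineorder VALUES (1, 100.0), (1, 50.0), (2, 70.0), (3, 25.0);

-- Hand-built stand-in for aggMQT: one pre-aggregated row per city.
CREATE TABLE aggMQT AS
  SELECT c_city, sum(lo_revenue) AS revenue
  FROM lineorder, customer
  WHERE c_custkey = lo_custkey
  GROUP BY c_city;
CREATE UNIQUE INDEX aggMQTidx ON aggMQT (c_city);
""")

# Original query: joins and aggregates the base tables at query time.
base = conn.execute(
    "SELECT sum(lo_revenue) FROM lineorder, customer "
    "WHERE c_custkey = lo_custkey AND c_city = ?",
    ("Toronto",),
).fetchone()[0]

# Manually rewritten query: a single-row lookup in the summary table.
rewritten = conn.execute(
    "SELECT revenue FROM aggMQT WHERE c_city = ?",
    ("Toronto",),
).fetchone()[0]

print(base, rewritten)
```

Both queries return the same total, but the second one reads one pre-aggregated row instead of joining and summing the fact table. In an engine with MQT support the optimizer is expected to make that substitution itself, so the BI tool keeps issuing the original query.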

Speeding Up Dashboards in Big SQL
• Interactive during design ≠ interactive in production
• Speed up dashboard design by using just a fraction of the data
• Speed up production version using MQTs and indexes

STEPS
1. Tune Big SQL, partition data, use ORC format
2. Copy a fraction/sample of the data to BI tool (e.g. using Tableau extracts)
3. Prototype Dashboard using sample data, to get interactive response during design/prototyping
4. Once Dashboard is stable, export SQL queries from BI tool
5. Create necessary MQTs and indexes to speed up the dashboard queries, to get interactive response in production
6. Point BI tool to Big SQL (ODBC)
7. Run Dashboard in Production

SQL on Hadoop in 2018
[Diagram: 2018 (Big SQL with EDW, HDFS, Analytic DB and “Big Data”) vs. 2012 (SQL on Hadoop with EDW, HDFS, Analytic DB and “Big Data”)]

Big SQL vs BI Accelerators
Compared on:
• Reporting queries
• Predictable interactive queries
• Complex hand-written queries
• One-off queries
• Heavy workloads
• Integration with Hadoop ecosystem
• Auto-cubes
• Full indexing

Options for Interactive BI on Hadoop in 2018
• “Upload” to Analytic DB
  • Expensive
  • Painful
• BI Accelerator
  • Duplication
  • Complexities
  • Narrow scope
  • Lack of platform integration
• SQL on Hadoop
  • Autonomics

Big SQL Roadmap
• Caching
• Autonomics
• Security & Governance (Ranger/Atlas)
• Star schema joins
• Interoperability with Hive ACID
• Integration with HDP 3.0 (Ambari/YARN)

Conclusions
• Understand BI Acceleration techniques and trade-offs
• SQL on Hadoop 2018
• Cubing, indexing, caching and more
• Reporting AND Interactive BI in a single engine
• SQL on Hadoop still evolving fast!

