The Convergence of Reporting and Interactive BI on Hadoop
Gustavo Arocena
June 19, 2018
Db2 Big SQL
IBM Canada Lab
Toronto
SQL-based Interactive BI on Hadoop
Timeline (2008, 2011, 2011, 2012):
• The good old times: EDW, Analytic DB
• Not there yet: EDW, HDFS, Analytic DB, SQL on Hadoop, “Big Data”
• It works, but …: EDW, HDFS, Analytic DB, SQL on Hadoop, “Big Data”
• Everyone Happy?: EDW, HDFS, Analytic DB, SQL on Hadoop, BI Accelerator, “Big Data”
BI Accelerators – The Value Prop
• Enable offloading of Interactive BI to Hadoop
• Interactive BI on Big Data
• Varying degree of autonomics (auto-creation, auto-refresh)
• Fast response for analytic queries
SELECT p.category, max(s.amount)
FROM products p, sales s
WHERE p.id = s.pid
GROUP BY p.category
BI Accelerators – The Small Print
• Duplication
  • Licensing
  • Skills
  • Vendors for service/support
• Complexity
  • Data architecture
  • Data copying & refreshing
• Narrow scope
  • Only repetitive, tool-generated queries
• Low integration with Hadoop platform
BI Acceleration Techniques
CREATE HADOOP TABLE sales
(id integer,
city string,
amount double);
SELECT sum(amount)
FROM sales
WHERE amount < 500
AND city = 'Toronto';
Techniques illustrated:
• Columnar Storage
• Cubing
• Indexing
• Caching (the 1st use fills the cache, the 2nd use is served from it)
  • Data
  • Columnar stats
  • Query results
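The "1st use / 2nd use" idea behind query-result caching can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite database as a stand-in for the SQL engine; `ResultCache` and its `run` method are hypothetical names, not part of Big SQL or any BI accelerator.

```python
import sqlite3


class ResultCache:
    """Hypothetical query-result cache: the 1st use runs the query, later uses reuse the rows."""

    def __init__(self, conn):
        self.conn = conn
        self.store = {}  # whitespace-normalized SQL text -> cached rows
        self.hits = 0
        self.misses = 0

    def run(self, sql):
        key = " ".join(sql.split())  # naive normalization: collapse whitespace only
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        rows = self.conn.execute(sql).fetchall()
        self.store[key] = rows
        return rows


# Assumed sample data, mirroring the slide's `sales` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "Toronto", 100.0), (2, "Toronto", 700.0), (3, "Ottawa", 50.0)],
)

cache = ResultCache(conn)
query = "SELECT sum(amount) FROM sales WHERE amount < 500 AND city = 'Toronto'"
first = cache.run(query)   # 1st use: executed against the table
second = cache.run(query)  # 2nd use: served from the cache
```

A real result cache also has to be invalidated when the underlying data changes, which this sketch deliberately ignores.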
Why Not in SQL on Hadoop?
SQL on Hadoop Maturity Levels (6-7 years)
• Interactive BI
• Concurrent workloads
• Enterprise features
• Core SQL processing
Reducing IO and Computation
[Figure: three panels (2014, 2018, 2020?) showing how cost-based optimization, partitioning, columnar storage, cubing, caching and indexing are divided between SQL on Hadoop and BI Accelerators]
The Convergence
[Figure: BI performance over time (~2009, ~2012, ~2020) for SQL on Hadoop and BI Accelerators, annotated with partitioning, cost-based optimization, columnar storage, cubing, indexing, on-disk caching and in-memory caching]
BI Accelerators compared: Jethro, AtScale, Kylin
Engine
• Jethro: Multiple instances of single node SMP engine
• AtScale: Not an engine
• Kylin: MOLAP engine, storing cube cells in HBase
Acceleration techniques
Indexing
Cubing
Caching
Cubing
Caching
Approximate answers (e.g. count distinct)
Cubing
Cost-based Optimizer
Unique features
Jethro
• Computes cubes “bottom up” on demand
• Creates inverted indexes for all columns
• Re-ingests all the data into a proprietary format
AtScale
• Imposes star schema on all data
• Automatic and manual cubes
• Uses another engine to execute queries
Kylin
• Brute force cube building
• Routes query to Hive when not in cube
• Uses Spark to speed up cube building
Using MPP Engines for Interactive BI
Claim: “MPP is a full scan architecture”
Response: “MPP is a parallel architecture. Full scans are the worst-case scenario, not the norm”
Claim: “You can do BI on Hadoop without table scans”
Response: “You need scans for queries that can’t be answered using cube/cache/index”
Claim: “MPP SQL engines do not scale to many users”
Response: “MPP engines scale better than non-MPP ones”
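The point that scans are still needed for queries a cube, cache or index cannot answer can be sketched as a simple router. This is an illustrative sketch only: the `FACTS` rows, the `CUBE` dictionary and `total_amount` are hypothetical and do not reflect how any particular engine routes queries.

```python
# Hypothetical fact rows: (city, category, amount)
FACTS = [
    ("Toronto", "outdoor", 100.0),
    ("Toronto", "indoor", 40.0),
    ("Ottawa", "outdoor", 60.0),
]

# Pre-aggregated "cube" that covers only the city dimension.
CUBE = {}
for city, _category, amount in FACTS:
    CUBE[city] = CUBE.get(city, 0.0) + amount


def total_amount(city=None, category=None):
    """Return (total, path), where path is 'cube' or 'scan'."""
    if city is not None and category is None:
        # The cube holds one cell per city, so no scan is needed.
        return CUBE.get(city, 0.0), "cube"
    # The cube has no category dimension: fall back to scanning the facts.
    total = sum(
        amount
        for c, cat, amount in FACTS
        if (city is None or c == city) and (category is None or cat == category)
    )
    return total, "scan"


by_city = total_amount(city="Toronto")          # answered from the cube
by_category = total_amount(category="outdoor")  # needs a scan
```

Only the first call is covered by the pre-aggregated cells; the second shows why an engine that can scan efficiently is still needed for queries outside the cube.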
IBM Db2 Big SQL
• Top performance on complex workloads
• Reporting AND Interactive BI
• Built-in federation to Oracle, Netezza, Db2
• Workload management
• Deep Platform Integration with no Lock-In
[Architecture diagram: one Big SQL Head and five Big SQL Workers, alongside HDFS, the Hive metastore (Hive MS) and the Hadoop NameNode (Hadoop NN)]
Big SQL Performance and Resource Utilization on Complex Workloads
Hadoop DS @ 100TB, 4 concurrent streams
Metric                 Big SQL   Spark SQL   Big SQL vs. Spark SQL
Elapsed Time (hours)   13.7      43.2        1/3
CPU Utilization (%)    76.4      88.2        -15%
Disk Reads (MB/sec)    107       388         1/3
Disk Writes (MB/sec)   25        237         1/9
Source: https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
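The relative figures quoted on the slide (1/3, -15%, 1/3, 1/9) can be checked against the raw numbers with a few lines of arithmetic. The sketch below only restates the slide's values; the `results` dictionary and `ratio` helper are illustrative names. The slide's labels are rounded: the exact ratios come out at roughly 0.32, 0.87, 0.28 and 0.11.

```python
# Figures as reported on the slide: (Big SQL, Spark SQL).
results = {
    "Elapsed Time (hours)": (13.7, 43.2),
    "CPU Utilization (%)": (76.4, 88.2),
    "Disk Reads (MB/sec)": (107, 388),
    "Disk Writes (MB/sec)": (25, 237),
}


def ratio(big_sql, spark_sql):
    """Big SQL value expressed as a fraction of the Spark SQL value."""
    return big_sql / spark_sql


ratios = {metric: round(ratio(*pair), 3) for metric, pair in results.items()}
for metric, value in ratios.items():
    print(f"{metric}: Big SQL / Spark SQL = {value}")
```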
Accelerating BI with MQTs

Hadoop MQT for join (denormalization)
• High Cardinality
• Stored on HDFS as ORC
• Partitioned
• Sorted for predicate pushdown (PPD)

CREATE HADOOP TABLE joinMQT AS (
SELECT p_type, p_color, lo_quantity, d_year
FROM lineorder, part, dwdate
WHERE p_partkey = lo_partkey
AND lo_orderdate = d_datekey
AND d_year BETWEEN 2000 AND 2010)
PARTITION BY d_year
SORT BY p_type
STORED AS ORC
DATA INITIALLY DEFERRED…;

Query answered using joinMQT:
SELECT AVG(lo_quantity), p_color
FROM lineorder, part, dwdate
WHERE p_partkey = lo_partkey
AND lo_orderdate = d_datekey
AND d_year = 2007
AND p_type = 'outdoor'
GROUP BY p_color;

Native MQT for aggregation (cubing)
• Low Cardinality
• Stored on head node in “native” format
• Can be indexed (turns MQT into true cube)

CREATE TABLE aggMQT AS (
SELECT sum(lo_revenue), c_city
FROM lineorder, customer
WHERE c_custkey = lo_custkey
GROUP BY c_city)
DATA INITIALLY DEFERRED…;

CREATE UNIQUE INDEX aggMQTidx ON
aggMQT(c_city);

Query answered using aggMQT and aggMQTidx:
SELECT sum(lo_revenue)
FROM lineorder, customer
WHERE c_custkey = lo_custkey
AND c_city = 'Toronto';
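The aggregation MQT with its unique index can be imitated in miniature. The sketch below uses SQLite purely as a stand-in: SQLite has no MQTs and no automatic query rewrite, so the summary table is built by hand and queried directly, whereas the slide describes queries against the base tables being answered using aggMQT. The sample rows and the `total_revenue` column name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_city TEXT);
CREATE TABLE lineorder (lo_custkey INTEGER, lo_revenue REAL);
INSERT INTO customer VALUES (1, 'Toronto'), (2, 'Ottawa');
INSERT INTO lineorder VALUES (1, 10.0), (1, 15.0), (2, 7.0);

-- Hand-built summary table standing in for aggMQT.
CREATE TABLE aggMQT AS
SELECT c_city, sum(lo_revenue) AS total_revenue
FROM lineorder, customer
WHERE c_custkey = lo_custkey
GROUP BY c_city;

-- One row per city, so a unique index makes the lookup a key probe.
CREATE UNIQUE INDEX aggMQTidx ON aggMQT(c_city);
""")

# The original query: joins and aggregates the base tables.
base = conn.execute(
    "SELECT sum(lo_revenue) FROM lineorder, customer "
    "WHERE c_custkey = lo_custkey AND c_city = 'Toronto'"
).fetchone()[0]

# The same answer read from the pre-aggregated table.
summary = conn.execute(
    "SELECT total_revenue FROM aggMQT WHERE c_city = 'Toronto'"
).fetchone()[0]
```

Both queries return the same value; the second touches one pre-aggregated row instead of joining and summing the fact table.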
Speeding Up Dashboards in Big SQL
• Interactive during design ≠ interactive in production
• Speed up dashboard design by using just a fraction of the data
• Speed up production version using MQTs and indexes
STEPS
1. Tune Big SQL, partition data, use ORC format
2. Copy a fraction/sample of the data to BI tool (e.g. using Tableau extracts)
3. Prototype Dashboard using sample data, to get interactive response during design/prototyping
4. Once Dashboard is stable, export SQL queries from BI tool
5. Create necessary MQTs and indexes to speed up the dashboard queries, to get interactive response in production
6. Point BI tool to Big SQL (ODBC)
7. Run Dashboard in Production
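Step 2 (copying a fraction of the data for prototyping) could look like the sketch below. It is an illustration under stated assumptions: `sample_rows` is a hypothetical helper that applies Bernoulli sampling with a fixed seed, and it is not a Big SQL or Tableau feature.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Keep each row independently with probability `fraction` (Bernoulli sampling)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the prototype data set reproducible
    return [row for row in rows if rng.random() < fraction]


# Assumed stand-in for a large fact table: (id, city, amount).
rows = [(i, "Toronto" if i % 2 else "Ottawa", float(i)) for i in range(10_000)]
sample = sample_rows(rows, fraction=0.01)
```

Because the seed is fixed, repeated extracts yield the same sample, which keeps dashboard prototypes comparable between design sessions.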
SQL on Hadoop in 2018
[Figure: Big SQL in 2018 vs. SQL on Hadoop in 2012, each shown alongside EDW, HDFS, Analytic DB and “Big Data”]
Big SQL vs BI Accelerators
The slide compares Big SQL and BI Accelerators on:
• Reporting queries
• Predictable interactive queries
• Complex hand-written queries
• One-off queries
• Heavy workloads
• Integration with Hadoop ecosystem
• Auto-cubes
• Full indexing
Options for Interactive BI on Hadoop in 2018
• “Upload” to Analytic DB
  • Expensive
  • Painful
• BI Accelerator
  • Duplication
  • Complexities
  • Narrow scope
  • Lack of platform integration
• SQL on Hadoop
  • Autonomics
Big SQL Roadmap
• Caching
• Autonomics
• Security & Governance (Ranger/Atlas)
• Star schema joins
• Interoperability with Hive ACID
• Integration with HDP 3.0 (Ambari/YARN)
Conclusions
• Understand BI Acceleration techniques and trade-offs
• SQL on Hadoop 2018
  • Cubing, indexing, caching and more
  • Reporting AND Interactive BI in a single engine
• SQL on Hadoop still evolving fast!
Information in these presentations (including information relating to products that have not yet been
announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include
unintentional technical or typographical errors. IBM shall have no responsibility to update this information.
This document is distributed “as is” without any warranty, either express or implied. In no event,
shall IBM be liable for any damage arising from the use of this information, including but not
limited to, loss of data, business interruption, loss of profit or loss of opportunity.
IBM products and services are warranted per the terms and conditions of the agreements under which
they are provided.
IBM products are manufactured from new parts or new and used parts.
In some cases, a product may not be new and may have been previously installed. Regardless, our
warranty terms apply.
Any statements regarding IBM's future direction, intent or product plans are subject to change or
withdrawal without notice.
Performance data contained herein was generally obtained in a controlled,
isolated environment. Customer examples are presented as illustrations of how those
customers have used IBM products and the results they may have achieved. Actual performance, cost,
savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to
make such products, programs or services available in all countries in which IBM operates or does
business.
Workshops, sessions and associated materials may have been prepared by independent session
speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for
informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or
advice to any individual participant or their specific situation.
It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain
advice of competent legal counsel as to the identification and interpretation of any relevant laws and
regulatory requirements that may affect the customer’s business and any actions the customer may need
to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its
services or products will ensure that the customer follows any law.
Notices and Disclaimers
Information concerning non-IBM products was obtained from the suppliers of those products,
their published announcements or other publicly available sources. IBM has not tested
those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of
non-IBM products should be addressed to the suppliers of those products. IBM does not
warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or
implied, including but not limited to, the implied warranties of merchantability and
fitness for a purpose.
The provision of the information contained herein is not intended to, and does not, grant any
right or license under any IBM patents, copyrights, trademarks or other intellectual
property right.
IBM, the IBM logo, ibm.com and Big SQL are trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service names
might be trademarks of IBM or other companies. A current list of IBM trademarks is available
on the Web at "Copyright and trademark information"
at: www.ibm.com/legal/copytrade.shtml