1 © Hortonworks Inc. 2011–2018. All rights reserved
Accelerating query processing with
materialized views in Apache Hive
Jesús Camacho Rodríguez
DataWorks Summit Berlin
April 18, 2018
Apache Hive
• Initial use case: batch processing
• Read-only data
• HiveQL (SQL-like query language)
• MapReduce
• Effort to take Hive beyond its batch processing roots
• Started in Apache Hive 0.10.0 (January 2013)
• Upcoming release: Apache Hive 3.0 (May 2018)
• Extensive renovation to improve three different axes
• Latency: allow interactive and sub-second queries
• Scalability: from TB to PB of data
• SQL support: move from HiveQL to SQL standard
Apache Hive
Important internal improvements
• Multiple execution engines: Apache Tez and Apache Spark
• More efficient join execution algorithms
• Vectorized query execution
• Integration with columnar storage formats: Apache ORC, Apache Parquet
• LLAP (Live Long and Process)
• Persistent daemons for low-latency queries
• Rule-based and cost-based optimizer
• Better statistics
• Tighter integration with other data processing systems: Druid
Accelerating query processing
Optimization based on access patterns
• Change data physical properties (distribute, sort)
• Filter rows
• Denormalize
• Preaggregate
Accelerating query processing
Optimization based on access patterns
• Establish relationship between original and new tables
• Has a similar table already been created?
• Rewrite your queries to use new tables
• What happens when access patterns change?
• Maintain your new tables when original tables change
• Do I have to fully rebuild new tables?
Currently, Hive users have to do it manually
Materialized views
• A materialized view is an entity that contains the result of evaluating a query
• Important property: awareness of the materialized view definition semantics
• Optimizer can exploit them for automatic query rewriting
• System can handle maintenance of the materialized views
• Generally, materializations can be created in different forms depending on the scope
• DBA writes “CREATE MATERIALIZED VIEW” statement
• Daemon creates materialized view based on recent query activity
• Cached result of previous similar query
• Query factorization identifies common pieces within a single query
Possible workflow
1. Create materialized view using Hive tables (DBA, recommendation system)
• Stored by Hive or Druid
CREATE MATERIALIZED VIEW `ssb_mv`
ENABLE REWRITE
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
<query>;
2. User or dashboard sends queries to Hive (dashboards, BI tools)
• Hive rewrites queries using available materialized views
• Execute rewritten query
Materialized views in Apache Hive
• First implementation will be part of Apache Hive 3.0
• Multiple storage options: Hive, Druid
• Automatic rewriting of incoming queries to use materialized views
• Efficient view maintenance
• Incremental refresh
• Multiple options to control the materialized view lifecycle
Management of
materialized views in Hive
Materialized view creation
• CREATE MATERIALIZED VIEW statement
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
[ENABLE REWRITE | DISABLE REWRITE]
[COMMENT materialized_view_comment]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
AS
<query>;
⇢ Supports custom table properties, storage format, etc.
Materialized view creation (stored in Druid)
• CREATE MATERIALIZED VIEW statement
CREATE MATERIALIZED VIEW druid_wiki_mv  -- Hive materialized view name
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'  -- Hive storage handler class name
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
Other operations for materialized view management
DROP MATERIALIZED VIEW [db_name.]materialized_view_name;
SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards'];
DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name;
⇢ More operations to be added and extended
Materialized view-based
query rewriting
Materialized view-based rewriting algorithm
• Automatically rewrite incoming queries using materialized views
• Optimizer exploits materialized view definition semantics
• Built on the ideas presented in [GL01] using Apache Calcite
• Supports queries containing TableScan, Project, Filter, Join, Aggregate operators
• Includes some extensions
• Generation of additional rewritings without needing to do join permutation
• Partial rewritings using union operators
• More information about the rewriting coverage
• http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information
[GL01] Jonathan Goldstein and Per-Åke Larson. Optimizing queries using materialized views: A practical, scalable solution. In Proc. ACM SIGMOD Conf., 2001.
Enable materialized view-based rewriting
• Global property to enable materialized view rewriting for queries
SET hive.materializedview.rewriting=true;
• Users can selectively enable or disable individual materialized views for rewriting
• Materialized views are enabled by default for rewriting
• Behavior can be altered after materialized view has been created
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;
Materialized view-based rewriting (example)
• Materialized view definition: employees hired in 2016 or later
CREATE MATERIALIZED VIEW mv
AS
SELECT empid, deptname, hire_date
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01';
• Query: employees hired last quarter (Q1 2018)
SELECT empid, deptname
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31';
• Materialized view-based rewriting
SELECT empid, deptname
FROM mv
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31';
mv contents
empid   deptname   hire_date
10001   IT         2016-03-01
10002   IT         2017-01-02
10003   HR         2017-07-01
10004   Finance    2018-01-15
10005   HR         2018-02-02
Query results
empid   deptname
10004   Finance
10005   HR
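The equivalence claimed by this example can be checked outside Hive. The following is a minimal sketch, assuming SQLite as a stand-in engine: SQLite has no materialized views, so a plain table built from the view definition plays the role of mv. The deptno values and the 2015 employee are hypothetical and only serve to give the filters something to exclude.

```python
import sqlite3

# SQLite stands in for Hive; a plain table built from the view
# definition plays the role of the materialized view mv.
SETUP = """
CREATE TABLE depts (deptno INTEGER PRIMARY KEY, deptname TEXT);
CREATE TABLE emps (empid INTEGER PRIMARY KEY, deptno INTEGER, hire_date TEXT);
INSERT INTO depts VALUES (1, 'IT'), (2, 'HR'), (3, 'Finance');
INSERT INTO emps VALUES
    (10000, 1, '2015-06-01'),  -- hypothetical row, filtered out of mv
    (10001, 1, '2016-03-01'),
    (10002, 1, '2017-01-02'),
    (10003, 2, '2017-07-01'),
    (10004, 3, '2018-01-15'),
    (10005, 2, '2018-02-02');
CREATE TABLE mv AS
    SELECT empid, deptname, hire_date
    FROM emps JOIN depts ON (emps.deptno = depts.deptno)
    WHERE hire_date >= '2016-01-01';
"""

ORIGINAL_QUERY = """
SELECT empid, deptname
FROM emps JOIN depts ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01' AND hire_date <= '2018-03-31'
ORDER BY empid
"""

REWRITTEN_QUERY = """
SELECT empid, deptname
FROM mv
WHERE hire_date >= '2018-01-01' AND hire_date <= '2018-03-31'
ORDER BY empid
"""


def compare_rewriting():
    """Run the original and the rewritten query and return both results."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    original = conn.execute(ORIGINAL_QUERY).fetchall()
    rewritten = conn.execute(REWRITTEN_QUERY).fetchall()
    conn.close()
    return original, rewritten


if __name__ == "__main__":
    original, rewritten = compare_rewriting()
    print(original == rewritten, rewritten)
```

The rewriting is valid because the query's date range is contained in the view's range (hire_date >= '2016-01-01'), so only a residual filter on mv is needed.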
Materialized view-based rewriting (example 2)
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT <dims>,
lo_revenue,
lo_extendedprice * lo_discount AS d_price,
lo_revenue - lo_supplycost
FROM
customer, dates, lineorder, part, supplier
WHERE
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and lo_custkey = c_custkey;
• Query
SELECT sum(lo_extendedprice*lo_discount)
FROM
lineorder, dates
WHERE
lo_orderdate = d_datekey
and d_year = 2013
and lo_discount between 1 and 3;
• Materialized view-based rewriting
SELECT SUM(d_price)
FROM mv
WHERE
d_year = 2013
and lo_discount between 1 and 3;
• The rewriting exploits SQL PK-FK and NOT NULL constraints: they guarantee that the extra joins in the view (customer, part, supplier) neither drop nor duplicate lineorder rows
mv contents (<dims> includes d_year and lo_discount)
d_year   lo_discount   <dims>   d_price
2013     2             ...      7.55
2014     4             ...      432.60
2013     2             ...      34.45
2012     2             ...      2.05
…        …             ...      …
Query results
sum
42.0
…
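To see why the PK-FK and NOT NULL constraints matter, here is a small sketch under stated assumptions: SQLite stands in for Hive, the schema is reduced to two dimensions (dates and supplier), and all rows and the helper name totals are hypothetical. With valid foreign keys the view-based rewriting returns the same sum as the original query; a single lineorder row with a NULL lo_suppkey is silently dropped by the view's join, and the two results diverge.

```python
import sqlite3

# Reduced, hypothetical version of the star schema from the example.
SCHEMA = """
CREATE TABLE dates (d_datekey INTEGER PRIMARY KEY, d_year INTEGER NOT NULL);
CREATE TABLE supplier (s_suppkey INTEGER PRIMARY KEY);
CREATE TABLE lineorder (
    lo_orderdate INTEGER NOT NULL,
    lo_suppkey INTEGER,          -- NOT NULL deliberately omitted for the demo
    lo_extendedprice INTEGER,
    lo_discount INTEGER);
INSERT INTO dates VALUES (20130101, 2013), (20140101, 2014);
INSERT INTO supplier VALUES (1);
"""

BASE_ROWS = [
    (20130101, 1, 100, 2),
    (20130101, 1, 50, 3),
    (20140101, 1, 70, 2),   # wrong year
    (20130101, 1, 10, 5),   # discount out of range
]


def totals(extra_rows=()):
    """Return (original query result, view-based rewriting result)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO lineorder VALUES (?, ?, ?, ?)",
                     BASE_ROWS + list(extra_rows))
    # A plain table stands in for the materialized view.
    conn.execute("""
        CREATE TABLE mv AS
        SELECT d_year, lo_discount,
               lo_extendedprice * lo_discount AS d_price
        FROM dates, lineorder, supplier
        WHERE lo_orderdate = d_datekey AND lo_suppkey = s_suppkey""")
    original = conn.execute("""
        SELECT SUM(lo_extendedprice * lo_discount)
        FROM lineorder, dates
        WHERE lo_orderdate = d_datekey
          AND d_year = 2013 AND lo_discount BETWEEN 1 AND 3""").fetchone()[0]
    rewritten = conn.execute("""
        SELECT SUM(d_price) FROM mv
        WHERE d_year = 2013 AND lo_discount BETWEEN 1 AND 3""").fetchone()[0]
    conn.close()
    return original, rewritten


if __name__ == "__main__":
    print(totals())                               # constraints hold
    print(totals([(20130101, None, 20, 1)]))      # NULL foreign key
```

An optimizer can only drop the supplier join from consideration when it knows such NULL or dangling keys cannot occur, which is what the declared constraints tell it.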
Materialized view-based rewriting (example 3)
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT floor(time to minute) AS __time, page,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY floor(time to minute), page;
• Query
SELECT floor(time to month),
SUM(added) AS c_added
FROM wiki
GROUP BY floor(time to month);
• Materialized view-based rewriting
SELECT floor(__time to month),
SUM(c_added) as c_added
FROM mv
GROUP BY floor(__time to month);
mv contents
__time                page     c_added   c_rmv
2011-01-01 01:05:00   Justin   1800      25
2011-01-20 19:00:00   Justin   2912      42
2011-01-01 11:06:00   Ke$ha    1953      17
2011-02-02 13:15:00   Ke$ha    3194      170
2011-01-02 18:00:00   Miley    2232      34
Query results
__time                c_added
2011-01-01 00:00:00   8897
2011-02-01 00:00:00   3194
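The rollup from minute to month granularity can be reproduced with a short sketch, again assuming SQLite in place of Hive. SQLite has no floor(... to ...) syntax, so strftime is used to truncate timestamps; the column name ts and the raw events are hypothetical, chosen so that the per-minute aggregates match the mv contents shown above.

```python
import sqlite3

# Hypothetical raw events; the first two fall into the same minute.
EVENTS = [
    ("2011-01-01 01:05:10", "Justin", 1000, 5),
    ("2011-01-01 01:05:40", "Justin", 800, 20),
    ("2011-01-20 19:00:30", "Justin", 2912, 42),
    ("2011-01-01 11:06:05", "Ke$ha", 1953, 17),
    ("2011-02-02 13:15:00", "Ke$ha", 3194, 170),
    ("2011-01-02 18:00:00", "Miley", 2232, 34),
]


def monthly_added():
    """Return (result on the base table, result on the minute-level view)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wiki (ts TEXT, page TEXT, added INTEGER, removed INTEGER)")
    conn.executemany("INSERT INTO wiki VALUES (?, ?, ?, ?)", EVENTS)
    # Plain table standing in for the materialized view (minute granularity).
    conn.execute("""
        CREATE TABLE mv AS
        SELECT strftime('%Y-%m-%d %H:%M:00', ts) AS __time, page,
               SUM(added) AS c_added, SUM(removed) AS c_rmv
        FROM wiki
        GROUP BY strftime('%Y-%m-%d %H:%M:00', ts), page""")
    original = conn.execute("""
        SELECT strftime('%Y-%m-01 00:00:00', ts), SUM(added)
        FROM wiki
        GROUP BY strftime('%Y-%m-01 00:00:00', ts)
        ORDER BY 1""").fetchall()
    # Rewriting: re-aggregate the partial sums at the coarser granularity.
    rewritten = conn.execute("""
        SELECT strftime('%Y-%m-01 00:00:00', __time), SUM(c_added)
        FROM mv
        GROUP BY strftime('%Y-%m-01 00:00:00', __time)
        ORDER BY 1""").fetchall()
    conn.close()
    return original, rewritten


if __name__ == "__main__":
    print(monthly_added())
```

This works because SUM can be computed from partial sums and every minute belongs to exactly one month; an aggregate such as COUNT(DISTINCT ...) could not be rolled up this way.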
Materialized view
maintenance
Rebuilding materialized views
• Rebuild needs to be triggered manually by user
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
• Incremental materialized view maintenance
• Only refresh data that has changed in source tables
• Multiple benefits
• Decrease rebuild step execution time
• Preserves LLAP cache for existing data
• Materialized view should only use transactional tables (micromanaged or ACID)
• Current implementation only supports incremental rebuild for insert operations
• Update/delete operations force full rebuild
• Optimizer will attempt an incremental rebuild
• Otherwise, it falls back to a full rebuild (INSERT OVERWRITE with MV definition)
Incremental view maintenance algorithm
• Relies on materialized view rewriting algorithm
• Materialized view stores write ID for its tables when it is created/refreshed
• Write ID associates rows with transactions
• When rebuild is triggered, introduce filter condition on write ID column in MV definition
• Read only new rows from source tables
• Execute materialized view rewriting
• Rewrite INSERT OVERWRITE (full rebuild) into more efficient plan
• INSERT (table scan, filter, project, join)
• MERGE (table scan, filter, project, join, aggregate)
Incremental view maintenance algorithm (example)
CREATE MATERIALIZED VIEW mv1 AS
SELECT page, user,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY page, user;
mv1 contents
page     user    c_added   c_rmv
Justin   Boxer   1800      25
Justin   Reach   2912      42
Ke$ha    Xeno    1953      17
Ke$ha    Helz    3194      170
Miley    Ashu    2232      34
wiki contents (new records)
page     user   …   added   removed   …   writeID
…        …      …   …       …         …   …
Miley    Ashu   …   68      16        …   10000
Justin   Zaka   …   392     239       …   10000
⇢ ALTER MATERIALIZED VIEW mv1 REBUILD;
① Rebuild statement rewriting
-- Roll up the existing view contents together with the newly aggregated rows
INSERT OVERWRITE mv1
SELECT page, user, SUM(c_added) AS c_added, SUM(c_rmv) AS c_rmv
FROM (
SELECT page, user, c_added, c_rmv
FROM mv1
UNION ALL
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM wiki
WHERE writeID > 9999
GROUP BY page, user) subq
GROUP BY page, user;
② Rewrite INSERT OVERWRITE into MERGE statement
MERGE INTO mv1
USING (
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM wiki
WHERE writeID > 9999
GROUP BY page, user) src
ON mv1.page = src.page AND mv1.user = src.user
WHEN MATCHED
THEN UPDATE SET c_added = mv1.c_added + src.c_added,
c_rmv = mv1.c_rmv + src.c_rmv
WHEN NOT MATCHED
THEN INSERT VALUES (src.page, src.user, src.c_added, src.c_rmv);
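The effect of the MERGE-based plan can be illustrated in plain Python. This is only a sketch of the idea, not Hive's implementation: the view is modelled as a dictionary keyed by (page, user), the function names are hypothetical, and the older wiki rows are assumed to carry write ID 9999. It checks that merging only the rows written after the stored write ID gives the same contents as a full rebuild.

```python
from collections import defaultdict


def aggregate(rows):
    """SUM(added), SUM(removed) grouped by (page, user)."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        key = (row["page"], row["user"])
        totals[key][0] += row["added"]
        totals[key][1] += row["removed"]
    return {key: tuple(vals) for key, vals in totals.items()}


def incremental_rebuild(mv, wiki, last_write_id):
    """Merge rows written after last_write_id into a copy of mv."""
    delta = aggregate(r for r in wiki if r["write_id"] > last_write_id)
    merged = dict(mv)
    for key, (added, removed) in delta.items():
        if key in merged:                      # WHEN MATCHED THEN UPDATE
            old_added, old_removed = merged[key]
            merged[key] = (old_added + added, old_removed + removed)
        else:                                  # WHEN NOT MATCHED THEN INSERT
            merged[key] = (added, removed)
    return merged


def row(page, user, added, removed, write_id):
    return {"page": page, "user": user, "added": added,
            "removed": removed, "write_id": write_id}


# Rows already reflected in mv1 (write ID 9999 is an assumption).
OLD_ROWS = [
    row("Justin", "Boxer", 1800, 25, 9999),
    row("Justin", "Reach", 2912, 42, 9999),
    row("Ke$ha", "Xeno", 1953, 17, 9999),
    row("Ke$ha", "Helz", 3194, 170, 9999),
    row("Miley", "Ashu", 2232, 34, 9999),
]
# New records from the example.
NEW_ROWS = [
    row("Miley", "Ashu", 68, 16, 10000),
    row("Justin", "Zaka", 392, 239, 10000),
]

if __name__ == "__main__":
    mv1 = aggregate(OLD_ROWS)
    refreshed = incremental_rebuild(mv1, OLD_ROWS + NEW_ROWS, last_write_id=9999)
    print(refreshed == aggregate(OLD_ROWS + NEW_ROWS))
```

Only the two new rows are aggregated; the five existing groups are read from the view rather than recomputed from wiki, which is where the savings come from on a large base table.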
Materialized view lifecycle
Management of materialized view lifecycle
• Do not accept stale data (default)
• If content of the materialized view is not fresh, we do not use it for automatic query rewriting
• Still possible to trigger partial rewritings that read both the stale materialized view and new data from source tables
• Accept stale data
• Freshness defined as a time parameter
• If the MV has not been rebuilt within a certain time period and there were changes in base tables, ignore it for rewriting
• SET hive.materializedview.rewriting.time.window=10min;
• Can also be overridden for a specific materialized view using table properties
• Periodically rebuild materialized view, e.g., every 5 minutes
Timeline: create MV at t=0min, then rebuild at t=5min, t=10min, t=15min, and t=20min
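The freshness rule above can be read as a small decision function. The sketch below is an interpretation of the bullets, not Hive's code: the function name and parameters are hypothetical, window=None models the default "do not accept stale data" mode, and partial rewritings over stale views are left out.

```python
from datetime import datetime, timedelta


def usable_for_rewriting(last_rebuild, base_tables_changed, now, window=None):
    """Decide whether a materialized view may be used for full rewriting.

    last_rebuild        -- datetime of the last create/rebuild
    base_tables_changed -- True if source tables changed since then
    now                 -- current datetime
    window              -- timedelta staleness window, or None to reject
                           any stale view (the default behaviour)
    """
    if not base_tables_changed:
        return True          # contents are fresh
    if window is None:
        return False         # stale data not accepted
    return now - last_rebuild <= window


if __name__ == "__main__":
    rebuilt = datetime(2018, 4, 18, 12, 0)
    ten_min = timedelta(minutes=10)
    # Rebuilt 5 minutes ago, base tables changed, 10-minute window: usable.
    print(usable_for_rewriting(rebuilt, True, rebuilt + timedelta(minutes=5), ten_min))
    # Same view 15 minutes after the rebuild: ignored.
    print(usable_for_rewriting(rebuilt, True, rebuilt + timedelta(minutes=15), ten_min))
```

With a 10-minute window and a rebuild every 5 minutes, as in the timeline, the view stays within the window between rebuilds and remains available for rewriting.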
27 © Hortonworks Inc. 2011–2018. All rights reserved
Road ahead
• Improvements to current materialized views implementation
• Rewriting performance and scalability
• Single/many MVs
• Control physical distribution of data
• DISTRIBUTE BY, SORT BY, CLUSTER BY
• Increase incremental view maintenance coverage
• Support update/delete in source tables
• Materialized view recommender
• Ease the identification of access patterns for a given workload
Demo
Thank you
https://cwiki.apache.org/confluence/display/Hive/Materialized+views

More Related Content

PDF
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
PDF
Dongwon Kim – A Comparative Performance Evaluation of Flink
PPTX
How to understand and analyze Apache Hive query execution plan for performanc...
PDF
Apache Calcite (a tutorial given at BOSS '21)
PDF
How Impala Works
PDF
Hive Anatomy
PPT
Introduction to mongodb
PDF
Advanced MySQL Query Tuning
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Dongwon Kim – A Comparative Performance Evaluation of Flink
How to understand and analyze Apache Hive query execution plan for performanc...
Apache Calcite (a tutorial given at BOSS '21)
How Impala Works
Hive Anatomy
Introduction to mongodb
Advanced MySQL Query Tuning

What's hot (20)

PPTX
Apache HBase Performance Tuning
PDF
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
PPTX
PostgreSQL Database Slides
PPTX
Apache hive
PPTX
Session 14 - Hive
PPTX
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
PPTX
SHACL by example
PDF
Hadoop Distributed File System
PDF
Histogram-in-Parallel-universe-of-MySQL-and-MariaDB
PDF
Morel, a Functional Query Language
PDF
Using The Mysql Binary Log As A Change Stream
PDF
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PDF
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
PPTX
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
PPTX
제2회 한글형태소분석기 기술 세니마 발표(solr 활용 입문) by 김지훈
PDF
MySQL Performance Schema in Action
PDF
Spark SQL
PDF
Introduction to Apache Calcite
PDF
Inside Parquet Format
PDF
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache HBase Performance Tuning
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
PostgreSQL Database Slides
Apache hive
Session 14 - Hive
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
SHACL by example
Hadoop Distributed File System
Histogram-in-Parallel-universe-of-MySQL-and-MariaDB
Morel, a Functional Query Language
Using The Mysql Binary Log As A Change Stream
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
제2회 한글형태소분석기 기술 세니마 발표(solr 활용 입문) by 김지훈
MySQL Performance Schema in Action
Spark SQL
Introduction to Apache Calcite
Inside Parquet Format
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Ad

Similar to Accelerating query processing with materialized views in Apache Hive (20)

PPTX
Accelerating query processing
PDF
Accelerating query processing with materialized views in Apache Hive
PPTX
Discardable In-Memory Materialized Queries With Hadoop
PPTX
Discardable In-Memory Materialized Query for Hadoop
PPTX
Hive Performance Dataworks Summit Melbourne February 2019
PDF
Fast SQL on Hadoop, Really?
PDF
What's New in Apache Hive 3.0 - Tokyo
PDF
What's New in Apache Hive 3.0?
PPTX
What's new in apache hive
PPT
materialized view description presentation
PDF
Autonomous ETL with Materialized Views
PDF
Selection & Maintenance of Materialized View and It’s Application for Fast Qu...
PDF
Fast SQL on Hadoop, really?
PPTX
Improve data warehouse performance by preprocessing
PDF
PGConf.ASIA 2019 Bali - Toward Implementing Incremental View Maintenance on P...
PDF
Cassandra Materialized Views
PDF
Fg33950952
PDF
Fg33950952
PDF
Flexviews materialized views for my sql
PDF
Data Warehousing 101(and a video)
Accelerating query processing
Accelerating query processing with materialized views in Apache Hive
Discardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Query for Hadoop
Hive Performance Dataworks Summit Melbourne February 2019
Fast SQL on Hadoop, Really?
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0?
What's new in apache hive
materialized view description presentation
Autonomous ETL with Materialized Views
Selection & Maintenance of Materialized View and It’s Application for Fast Qu...
Fast SQL on Hadoop, really?
Improve data warehouse performance by preprocessing
PGConf.ASIA 2019 Bali - Toward Implementing Incremental View Maintenance on P...
Cassandra Materialized Views
Fg33950952
Fg33950952
Flexviews materialized views for my sql
Data Warehousing 101(and a video)
Ad

More from DataWorks Summit (20)

PPTX
Data Science Crash Course
PPTX
Floating on a RAFT: HBase Durability with Apache Ratis
PPTX
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
PDF
HBase Tales From the Trenches - Short stories about most common HBase operati...
PPTX
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
PPTX
Managing the Dewey Decimal System
PPTX
Practical NoSQL: Accumulo's dirlist Example
PPTX
HBase Global Indexing to support large-scale data ingestion at Uber
PPTX
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
PPTX
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
PPTX
Supporting Apache HBase : Troubleshooting and Supportability Improvements
PPTX
Security Framework for Multitenant Architecture
PDF
Presto: Optimizing Performance of SQL-on-Anything Engine
PPTX
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
PPTX
Extending Twitter's Data Platform to Google Cloud
PPTX
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
PPTX
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
PPTX
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
PDF
Computer Vision: Coming to a Store Near You
PPTX
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Data Science Crash Course
Floating on a RAFT: HBase Durability with Apache Ratis
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
HBase Tales From the Trenches - Short stories about most common HBase operati...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Managing the Dewey Decimal System
Practical NoSQL: Accumulo's dirlist Example
HBase Global Indexing to support large-scale data ingestion at Uber
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Security Framework for Multitenant Architecture
Presto: Optimizing Performance of SQL-on-Anything Engine
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Extending Twitter's Data Platform to Google Cloud
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Computer Vision: Coming to a Store Near You
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Recently uploaded (20)

PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Encapsulation theory and applications.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
Electronic commerce courselecture one. Pdf
PDF
Approach and Philosophy of On baking technology
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Machine learning based COVID-19 study performance prediction
PDF
cuic standard and advanced reporting.pdf
PPTX
Cloud computing and distributed systems.
PDF
Empathic Computing: Creating Shared Understanding
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Dropbox Q2 2025 Financial Results & Investor Presentation
Encapsulation theory and applications.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
The AUB Centre for AI in Media Proposal.docx
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
“AI and Expert System Decision Support & Business Intelligence Systems”
The Rise and Fall of 3GPP – Time for a Sabbatical?
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Electronic commerce courselecture one. Pdf
Approach and Philosophy of On baking technology
Mobile App Security Testing_ A Comprehensive Guide.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Machine learning based COVID-19 study performance prediction
cuic standard and advanced reporting.pdf
Cloud computing and distributed systems.
Empathic Computing: Creating Shared Understanding
Unlocking AI with Model Context Protocol (MCP)
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf

Accelerating query processing with materialized views in Apache Hive

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved Accelerating query processing with materialized views in Apache Hive Jesús Camacho Rodríguez DataWorks Summit Berlin April 18, 2018
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved Apache Hive • Initial use case: batch processing • Read-only data • HiveQL (SQL-like query language) • MapReduce • Effort to take Hive beyond its batch processing roots • Started in Apache Hive 0.10.0 (January 2013) • Upcoming release: Apache Hive 3.0 (May 2018) • Extensive renovation to improve three different axes • Latency: allow interactive and sub-second queries • Scalability: from TB to PB of data • SQL support: move from HiveQL to SQL standard
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved Apache Hive • Multiple execution engines: Apache Tez and Apache Spark • More efficient join execution algorithms • Vectorized query execution • Integration with columnar storage formats: Apache ORC, Apache Parquet • LLAP (Live Long and Process) • Persistent deamons for low-latency queries • Rule-based and cost-based optimizer • Better statistics • Tighter integration with other data processing systems: Druid Important internals improvements
  • 4. 4 © Hortonworks Inc. 2011–2018. All rights reserved Accelerating query processing • Change data physical properties (distribute, sort) • Filter rows • Denormalize • Preaggregate Optimization based on access patterns
  • 5. 5 © Hortonworks Inc. 2011–2018. All rights reserved Accelerating query processing • Establish relationship between original and new tables • Has a similar table already been created? • Rewrite your queries to use new tables • What happens when access patterns change? • Maintain your new tables when original tables change • Do I have to fully rebuild new tables? Optimization based on access patterns Currently, Hive users have to do it manually
  • 6. 6 © Hortonworks Inc. 2011–2018. All rights reserved Materialized views • A materialized view is an entity that contains the result of a evaluating a query • Important property  Awareness of the materialized view definition semantics • Optimizer can exploit them for automatic query rewriting • System can handle maintenance of the materialized views • Generally, materializations can be created in different forms depending on the scope • DBA writes “CREATE MATERIALIZED VIEW” statement • Daemon creates materialized view based on recent query activity • Cached result of previous similar query • Query factorization identifies common pieces within a single query
  • 7. 7 © Hortonworks Inc. 2011–2018. All rights reserved Possible workflow 1. Create materialized view using Hive tables • Stored by Hive or Druid 2. User or dashboard sends queries to Hive • Hive rewrites queries using available materialized views • Execute rewitten query Dashboards, BI tools CREATE MATERIALIZED VIEW `ssb_mv` STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler' ENABLE REWRITE AS <query>; DBA, recommendation system ① ② Data Queries
  • 8. 8 © Hortonworks Inc. 2011–2018. All rights reserved Materialized views in Apache Hive • First implementation will be part of Apache Hive 3.0 • Multiple storage options: Hive, Druid • Automatic rewriting of incoming queries to use materialized views • Efficient view maintenance • Incremental refresh • Multiple options to control materialized views lifecycle
  • 9. 9 © Hortonworks Inc. 2011–2018. All rights reserved Management of materialized views in Hive
  • 10. 10 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view creation • CREATE MATERIALIZED VIEW statement CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name [ENABLE REWRITE | DISABLE REWRITE] [COMMENT materialized_view_comment] [ [ROW FORMAT row_format] [STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] ] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, ...)] AS <query>; ⇢ Supports custom table properties, storage format, etc.
  • 11. 11 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view creation (stored in Druid) • CREATE MATERIALIZED VIEW statement CREATE MATERIALIZED VIEW druid_wiki_mv STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler' AS SELECT __time, page, user, c_added, c_removed FROM src; Hive materialized view name Hive storage handler classname
  • 12. 12 © Hortonworks Inc. 2011–2018. All rights reserved Other operations for materialized view management DROP MATERIALIZED VIEW [db_name.]materialized_view_name; SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards’]; DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name; ⇢ More operations to be added and extended
  • 13. 13 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based query rewriting
  • 14. 14 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based rewriting algorithm • Automatically rewrite incoming queries using materialized views • Optimizer exploits materialized view definition semantics • Built on the ideas presented in [GL01] using Apache Calcite • Supports queries containing TableScan, Project, Filter, Join, Aggregate operators • Includes some extensions • Generation of additional rewritings without needing to do join permutation • Partial rewritings using union operators • More information about the rewriting coverage • http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information [GL01] Jonathan Goldstein and Per-åke Larson. Optimizing queries using materialized views: A practical, scalable solution. In Proc. ACM SIGMOD Conf., 2001.
  • 15. 15 © Hortonworks Inc. 2011–2018. All rights reserved Enable materialized view-based rewriting • Global property to enable materialized view rewriting for queries SET hive.materializedview.rewriting=true; • User can selectively use enable/disable materialized views for rewriting • Materialized views are enabled by default for rewriting • Behavior can be altered after materialized view has been created ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;
  • 16. 16 © Hortonworks Inc. 2011–2018. All rights reserved depts Materialized view-based rewriting (example) • Materialized view definition Employees that were hired after 2016 CREATE MATERIALIZED VIEW mv AS SELECT empid, deptname, hire_date FROM emps JOIN depts ON (emps.deptno = depts.deptno) WHERE hire_date >= '2016-01-01'; • Query Employees that were hired last quarter SELECT empid, deptname FROM emps JOIN depts ON (emps.deptno = depts.deptno) WHERE hire_date >= '2018-01-01' AND hire_date <= '2018-03-31'; • Materialized view-based rewriting SELECT empid, deptname FROM mv WHERE hire_date >= '2018-01-01' AND hire_date <= '2018-03-31'; deptsemps empid depname hire_date 10001 IT 2016-03-01 10002 IT 2017-01-02 10003 HR 2017-07-01 10004 Finance 2018-01-15 10005 HR 2018-02-02 mv contents empid depname 10004 Finance 10005 HR Query results
  • 17. 17 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based rewriting (example 2) • Materialized view definition CREATE MATERIALIZED VIEW mv AS SELECT <dims>, lo_revenue, lo_extprice * lo_disc AS d_price, lo_revenue - lo_supplycost, FROM customer, dates, lineorder, part, supplier WHERE lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and lo_custkey = c_custkey; • Query SELECT sum(lo_extendedprice*lo_discount) FROM lineorder, dates WHERE lo_orderdate = d_datekey and d_year = 2013 and lo_discount between 1 and 3; • Materialized view-based rewriting SELECT SUM(d_price) FROM mv WHERE d_year = 2013 and lo_discount between 1 and 3; supplier part dates customerlineorder Exploit SQL PK-FK and NOT NULL constraints d_year lo_discount <dims> d_price 2013 2 ... 7.55 2014 4 ... 432.60 2013 2 ... 34.45 2012 2 ... 2.05 … … ... … mv contents sum 42.0 … Query results
  • 18. 18 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based rewriting (example 3) • Materialized view definition CREATE MATERIALIZED VIEW mv AS SELECT floor(time to minute), page, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki GROUP BY floor(time to minute), page; • Query SELECT floor(time to month), SUM(added) AS c_added FROM wiki GROUP BY floor(time to month); • Materialized view-based rewriting SELECT floor(time to month), SUM(c_added) as c_added FROM mv GROUP BY floor(time to month); wiki __time page c_added c_rmv 2011-01-01 01:05:00 Justin 1800 25 2011-01-20 19:00:00 Justin 2912 42 2011-01-01 11:06:00 Ke$ha 1953 17 2011-02-02 13:15:00 Ke$ha 3194 170 2011-01-02 18:00:00 Miley 2232 34 mv contents __time c_added 2011-01-01 00:00:00 8897 2011-02-01 00:00:00 3194 Query results
  • 19. Materialized view maintenance
  • 20. Rebuilding materialized views
      • Rebuild needs to be triggered manually by the user
          ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
      • Incremental materialized view maintenance
          • Only refresh data that has changed in source tables
          • Multiple benefits
              • Decreases rebuild step execution time
              • Preserves LLAP cache for existing data
          • Materialized view should only use transactional tables (micromanaged or ACID)
          • Current implementation only supports incremental rebuild for insert operations
              • Update/delete operations force full rebuild
      • Optimizer will attempt incremental rebuild
          • Otherwise, fall back to full rebuild (INSERT OVERWRITE with MV definition)
  • 21. Incremental view maintenance algorithm
      • Relies on materialized view rewriting algorithm
      • Materialized view stores write ID for its tables when it is created/refreshed
          • Write ID associates rows with transactions
      • When rebuild is triggered, introduce filter condition on write ID column in MV definition
          • Read only new rows from source tables
      • Execute materialized view rewriting
      • Rewrite INSERT OVERWRITE (full rebuild) into more efficient plan
          • INSERT when the MV definition uses only table scan, filter, project, join
          • MERGE when the MV definition also uses aggregate
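The choice between the three rebuild strategies described on the last two slides can be summarized as a small decision function. This is a toy Python model of the rules as stated in the talk, not Hive's optimizer code; the names RebuildPlan and choose_rebuild_plan and the way changes are represented are hypothetical.

```python
from enum import Enum


class RebuildPlan(Enum):
    INSERT = "INSERT"           # append newly computed rows
    MERGE = "MERGE"             # combine new partial aggregates with stored ones
    FULL = "INSERT OVERWRITE"   # recompute the whole view


def choose_rebuild_plan(source_changes, mv_has_aggregate):
    """Pick a rebuild plan.

    source_changes: kinds of operations applied to the source tables since
    the last rebuild, e.g. {"insert"} or {"insert", "delete"} (hypothetical
    representation).
    mv_has_aggregate: whether the MV definition contains an aggregate.
    """
    # Anything other than inserts forces a full rebuild
    if not set(source_changes) <= {"insert"}:
        return RebuildPlan.FULL
    # Insert-only changes: MERGE if the view aggregates, plain INSERT otherwise
    return RebuildPlan.MERGE if mv_has_aggregate else RebuildPlan.INSERT


print(choose_rebuild_plan({"insert"}, mv_has_aggregate=True))
```

For instance, a join-only view over insert-only changes maps to INSERT, the same view after a delete maps to the full INSERT OVERWRITE, and an aggregating view over insert-only changes maps to MERGE.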
  • 22. Incremental view maintenance algorithm (example)
      • Materialized view definition
          CREATE MATERIALIZED VIEW mv1 AS
          SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
          FROM wiki
          GROUP BY page, user;
      • mv1 contents
          page    user   c_added  c_rmv
          Justin  Boxer  1800     25
          Justin  Reach  2912     42
          Ke$ha   Xeno   1953     17
          Ke$ha   Helz   3194     170
          Miley   Ashu   2232     34
      • wiki contents (the last two rows are new records)
          page    user  …  added  removed  …  writeID
          …       …     …  …      …        …  …
          Miley   Ashu  …  68     16       …  10000
          Justin  Zaka  …  392    239      …  10000
      • Rebuild statement
          ALTER MATERIALIZED VIEW mv1 REBUILD;
  • 23. Incremental view maintenance algorithm (example)
      • Materialized view definition
          CREATE MATERIALIZED VIEW mv1 AS
          SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
          FROM wiki
          GROUP BY page, user;
      • ① Rebuild statement rewriting
          INSERT OVERWRITE mv1
          SELECT page, user, SUM(c_added) AS c_added, SUM(c_rmv) AS c_rmv
          FROM (
            SELECT page, user, c_added, c_rmv
            FROM mv1
            UNION ALL
            SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
            FROM wiki
            WHERE writeID > 9999
            GROUP BY page, user) subq
          GROUP BY page, user;
      • The outer GROUP BY rolls up the existing mv1 data together with the aggregates over the new records
      • mv1 contents and wiki contents (with new records) as on slide 22
  • 24. Incremental view maintenance algorithm (example)
      • Materialized view definition
          CREATE MATERIALIZED VIEW mv1 AS
          SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
          FROM wiki
          GROUP BY page, user;
      • ② Rewrite INSERT OVERWRITE into MERGE statement
          MERGE INTO mv1
          USING (
            SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
            FROM wiki
            WHERE writeID > 9999
            GROUP BY page, user) src
          ON mv1.page = src.page AND mv1.user = src.user
          WHEN MATCHED THEN UPDATE SET
            c_added = mv1.c_added + src.c_added,
            c_rmv = mv1.c_rmv + src.c_rmv
          WHEN NOT MATCHED THEN INSERT VALUES (src.page, src.user, src.c_added, src.c_rmv);
      • mv1 contents and wiki contents (with new records) as on slide 22
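The incremental rebuild in this example can be emulated end to end. The sketch below uses Python's sqlite3 module as a stand-in for Hive (an assumption). SQLite has no MERGE statement, so the matched and not-matched branches are emulated with an UPDATE followed by an INSERT when no row was updated. The writeID values of the old rows and the variable snapshot_write_id are hypothetical; the new records follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wiki (page TEXT, user TEXT, added INTEGER, removed INTEGER,
                   writeID INTEGER);
-- Rows already reflected in mv1 (their write IDs are hypothetical)
INSERT INTO wiki VALUES
  ('Justin', 'Boxer', 1800, 25, 9995),
  ('Justin', 'Reach', 2912, 42, 9996),
  ('Ke$ha', 'Xeno', 1953, 17, 9997),
  ('Ke$ha', 'Helz', 3194, 170, 9998),
  ('Miley', 'Ashu', 2232, 34, 9999);
-- Stand-in for the materialized view
CREATE TABLE mv1 AS
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM wiki GROUP BY page, user;
-- New records arriving after the view was built
INSERT INTO wiki VALUES
  ('Miley', 'Ashu', 68, 16, 10000),
  ('Justin', 'Zaka', 392, 239, 10000);
""")

snapshot_write_id = 9999  # highest write ID the view has already seen

# Aggregate only the rows written after the snapshot
delta = conn.execute("""
    SELECT page, user, SUM(added), SUM(removed)
    FROM wiki WHERE writeID > ?
    GROUP BY page, user
""", (snapshot_write_id,)).fetchall()

# Emulated MERGE: update matching groups, insert the rest
for page, user, c_added, c_rmv in delta:
    cur = conn.execute(
        "UPDATE mv1 SET c_added = c_added + ?, c_rmv = c_rmv + ? "
        "WHERE page = ? AND user = ?",
        (c_added, c_rmv, page, user))
    if cur.rowcount == 0:
        conn.execute("INSERT INTO mv1 VALUES (?, ?, ?, ?)",
                     (page, user, c_added, c_rmv))

incremental = sorted(conn.execute(
    "SELECT page, user, c_added, c_rmv FROM mv1").fetchall())
# Reference result: what a full rebuild would produce
full = sorted(conn.execute(
    "SELECT page, user, SUM(added), SUM(removed) "
    "FROM wiki GROUP BY page, user").fetchall())

print(incremental == full)
```

Only the two new rows are scanned during the rebuild, while the full recomputation reads all of wiki; the saving grows with the size of the source table.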
  • 25. Materialized view lifecycle
  • 26. Management of materialized view lifecycle
      • Do not accept stale data (default)
          • If content of the materialized view is not fresh, we do not use it for automatic query rewriting
          • Still possible to trigger partial rewritings that read both the stale materialized view and new data from source tables
      • Accept stale data
          • Freshness defined as a time parameter
          • If MV was not rebuilt for a certain time period and there were changes in base tables, ignore it for rewriting
          • SET hive.materializedview.rewriting.time.window=10min;
          • Can also be overridden for a specific materialized view using table properties
          • Periodically rebuild materialized view, e.g., every 5 minutes
      • Timeline: create MV at t=0min, then rebuild at t=5min, t=10min, t=15min and t=20min
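The two freshness policies on this slide can be expressed as a single check. The function below is a toy Python model of the rule as described in the talk, not Hive's implementation; the name usable_for_rewriting and its parameters are hypothetical, and time_window=None is used here to represent the default of not accepting stale data.

```python
from datetime import datetime, timedelta


def usable_for_rewriting(last_rebuild, now, base_tables_changed,
                         time_window=None):
    """Decide whether a materialized view may be used for automatic rewriting.

    time_window=None models "do not accept stale data"; a timedelta models
    the accepted staleness window.
    """
    # An up-to-date view is always usable
    if not base_tables_changed:
        return True
    # Stale view and staleness is not accepted
    if time_window is None:
        return False
    # Stale view: usable only if it was rebuilt recently enough
    return now - last_rebuild <= time_window


rebuilt = datetime(2018, 4, 18, 12, 0)
window = timedelta(minutes=10)
print(usable_for_rewriting(rebuilt, rebuilt + timedelta(minutes=7), True, window))
```

With a 10 minute window and a rebuild every 5 minutes, as in the timeline, the view never falls outside the window, so it stays eligible for rewriting even while the base tables keep changing.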
  • 27. Road ahead
  • 28. Road ahead
      • Improvements to current materialized views implementation
          • Rewriting performance and scalability
              • Single/many MVs
          • Control physical distribution of data
              • DISTRIBUTE BY, SORT BY, CLUSTER BY
          • Increase incremental view maintenance coverage
              • Support update/delete in source tables
      • Materialized view recommender
          • Ease the identification of access patterns for a given workload
  • 29. Demo
  • 30. Thank you https://cwiki.apache.org/confluence/display/Hive/Materialized+views

Editor's Notes

  • #3: Intro-Hive evolution from batch to interactive (modify a bit original slide)
  • #4: Important improvements in Hive in general. Integration with Druid. Mention improvements to optimization too, to link with the next slide about accelerating query processing using materializations.
  • #5: Access patterns
  • #6: Access patterns
  • #7: A traditional technique for accelerating query execution is the precalculation of materialized views. Awareness of semantics enables materialized view rewriting and automatic maintenance of materialized views.
  • #8: Possible workflow Three important points. You can query the materialized view as with any other table. Druid integration goes beyond materialized views: you can just query Druid from Hive. Materialized views do not work exclusively with Druid; in fact, we expect them to play well with LLAP.
  • #9: Work that we have done in Hive, main goals
  • #15: Implemented in Calcite, based on paper
  • #16: How to enable? Enabled by default in Hive 3.0; a materialized view can be altered to enable or disable it.
  • #18: Example 2 (materialized views exploit constraints)
  • #19: Example 3 (rollup based on time, richer semantics)
  • #21: Manual rebuild: full vs incremental
  • #22: Manual rebuild: full vs incremental
  • #23: Manual rebuild: full vs incremental
  • #24: Manual rebuild: full vs incremental
  • #25: Manual rebuild: full vs incremental
  • #27: Data freshness different options: fresh data vs accept data staleness
  • #29: Control physical distribution of data (DISTRIBUTE BY, SORT BY, CLUSTER BY). MV recommender. From other slides, e.g., scaling as the number of materialized views grows.