Accelerating query processing with materialized views in Apache Hive

1 © Hortonworks Inc. 2011–2018. All rights reserved
Accelerating query processing with
materialized views in Apache Hive
Jesús Camacho Rodríguez
DataWorks Summit Berlin
April 18, 2018

Apache Hive
• Initial use case: batch processing
• Read-only data
• HiveQL (SQL-like query language)
• MapReduce
• Effort to take Hive beyond its batch processing roots
• Started in Apache Hive 0.10.0 (January 2013)
• Upcoming release: Apache Hive 3.0 (May 2018)
• Extensive renovation to improve three different axes
• Latency: allow interactive and sub-second queries
• Scalability: from TB to PB of data
• SQL support: move from HiveQL to SQL standard

Apache Hive
• Multiple execution engines: Apache Tez and Apache Spark
• More efficient join execution algorithms
• Vectorized query execution
• Integration with columnar storage formats:
Apache ORC, Apache Parquet
• LLAP (Live Long and Process)
• Persistent deamons for low-latency queries
• Rule-based and cost-based optimizer
• Better statistics
• Tighter integration with other data processing systems: Druid
Important internals improvements

Accelerating query processing
• Change data physical properties (distribute, sort)
• Filter rows
• Denormalize
• Preaggregate
Optimization based on access patterns

Accelerating query processing
• Establish relationship between original and new tables
• Has a similar table already been created?
• Rewrite your queries to use new tables
• What happens when access patterns change?
• Maintain your new tables when original tables change
• Do I have to fully rebuild new tables?
Optimization based on access patterns
Currently, Hive users
have to do it manually

Materialized views
• A materialized view is an entity that contains the result of a evaluating a query
• Important property  Awareness of the materialized view definition semantics
• Optimizer can exploit them for automatic query rewriting
• System can handle maintenance of the materialized views
• Generally, materializations can be created in different forms depending on the scope
• DBA writes “CREATE MATERIALIZED VIEW” statement
• Daemon creates materialized view based on recent query activity
• Cached result of previous similar query
• Query factorization identifies common pieces within a single query

Possible workflow
1. Create materialized view using Hive tables
• Stored by Hive or Druid
2. User or dashboard sends queries to Hive
• Hive rewrites queries using available materialized views
• Execute rewitten query
Dashboards, BI tools
CREATE MATERIALIZED VIEW `ssb_mv`
STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler'
ENABLE REWRITE
AS
<query>;
DBA, recommendation system
①
②
Data
Queries

Materialized views in Apache Hive
• First implementation will be part of Apache Hive 3.0
• Multiple storage options: Hive, Druid
• Automatic rewriting of incoming queries to use materialized views
• Efficient view maintenance
• Incremental refresh
• Multiple options to control materialized views lifecycle

Management of
materialized views in Hive

Materialized view creation
• CREATE MATERIALIZED VIEW statement
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
[ENABLE REWRITE | DISABLE REWRITE]
[COMMENT materialized_view_comment]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
AS
<query>;
⇢ Supports custom table properties, storage format, etc.

Materialized view creation (stored in Druid)
• CREATE MATERIALIZED VIEW statement
CREATE MATERIALIZED VIEW druid_wiki_mv
STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
Hive materialized view name
Hive storage handler classname

Other operations for materialized view management
DROP MATERIALIZED VIEW [db_name.]materialized_view_name;
SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards’];
DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name;
⇢ More operations to be added and extended

Materialized view-based
query rewriting

Materialized view-based rewriting algorithm
• Automatically rewrite incoming queries using materialized views
• Optimizer exploits materialized view definition semantics
• Built on the ideas presented in [GL01] using Apache Calcite
• Supports queries containing TableScan, Project, Filter, Join, Aggregate operators
• Includes some extensions
• Generation of additional rewritings without needing to do join permutation
• Partial rewritings using union operators
• More information about the rewriting coverage
• http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information
[GL01] Jonathan Goldstein and Per-åke Larson. Optimizing queries using materialized views: A practical,
scalable solution. In Proc. ACM SIGMOD Conf., 2001.

Enable materialized view-based rewriting
• Global property to enable materialized view rewriting for queries
SET hive.materializedview.rewriting=true;
• User can selectively use enable/disable materialized views for rewriting
• Materialized views are enabled by default for rewriting
• Behavior can be altered after materialized view has been created
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;

depts
Materialized view-based rewriting (example)
• Materialized view definition
Employees that were hired after 2016
CREATE MATERIALIZED VIEW mv
AS
SELECT empid, deptname, hire_date
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01';
• Query
Employees that were hired last quarter
SELECT empid, deptname
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31';
• Materialized view-based rewriting
SELECT empid, deptname
FROM mv
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31';
deptsemps
empid depname hire_date
10001 IT 2016-03-01
10002 IT 2017-01-02
10003 HR 2017-07-01
10004 Finance 2018-01-15
10005 HR 2018-02-02
mv contents
empid depname
10004 Finance
10005 HR
Query results

Materialized view-based rewriting (example 2)
CREATE MATERIALIZED VIEW mv AS
SELECT <dims>,
lo_revenue,
lo_extprice * lo_disc AS d_price,
lo_revenue - lo_supplycost,
FROM
customer, dates, lineorder, part, supplier
WHERE
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and lo_custkey = c_custkey;
• Query
SELECT sum(lo_extendedprice*lo_discount)
FROM
lineorder, dates
WHERE
lo_orderdate = d_datekey
and d_year = 2013
and lo_discount between 1 and 3;
SELECT SUM(d_price)
FROM mv
WHERE
d_year = 2013
and lo_discount between 1 and 3;
supplier
part
dates
customerlineorder
Exploit SQL PK-FK and
NOT NULL constraints
d_year lo_discount <dims> d_price
2013 2 ... 7.55
2014 4 ... 432.60
2013 2 ... 34.45
2012 2 ... 2.05
… … ... …
mv contents
sum
42.0
…
Query results

Materialized view-based rewriting (example 3)
CREATE MATERIALIZED VIEW mv AS
SELECT floor(time to minute), page,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY floor(time to minute), page;
• Query
SELECT floor(time to month),
SUM(added) AS c_added
FROM wiki
GROUP BY floor(time to month);
SELECT floor(time to month),
SUM(c_added) as c_added
FROM mv
GROUP BY floor(time to month);
wiki
__time page c_added c_rmv
2011-01-01 01:05:00 Justin 1800 25
2011-01-20 19:00:00 Justin 2912 42
2011-01-01 11:06:00 Ke$ha 1953 17
2011-02-02 13:15:00 Ke$ha 3194 170
2011-01-02 18:00:00 Miley 2232 34
mv contents
__time c_added
2011-01-01 00:00:00 8897
2011-02-01 00:00:00 3194
Query results

Materialized view
maintenance

Rebuilding materialized views
• Rebuild needs to be triggered manually by user
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
• Incremental materialized view maintenance
• Only refresh data that has changed in source tables
• Multiple benefits
• Decrease rebuild step execution time
• Preserves LLAP cache for existing data
• Materialized view should only use transactional tables (micromanaged or ACID)
• Current implementation only supports incremental rebuild for insert operations
• Update/delete operations force full rebuild
• Optimizer will attempt incremental rebuild
• Otherwise, fallback to full rebuild (INSERT OVERWRITE with MV definition)

Incremental view maintenance algorithm
• Relies on materialized view rewriting algorithm
• Materialized view stores write ID for its tables when it is created/refreshed
• Write ID associates rows with transactions
• When rebuild is triggered, introduce filter condition on write ID column in MV definition
• Read only new rows from source tables
• Execute materialized view rewriting
• Rewrite INSERT OVERWRITE (full rebuild) into more efficient plan
• INSERT (table scan, filter, project, join)
• MERGE (table scan, filter, project, join, aggregate)

CREATE MATERIALIZED VIEW mv1 AS
SELECT page, user,
FROM wiki
GROUP BY page, user;
Incremental view maintenance algorithm (example)
mv1 contents
page user c_added c_rmv
Justin Boxer 1800 25
Justin Reach 2912 42
Ke$ha Xeno 1953 17
Ke$ha Helz 3194 170
Miley Ashu 2232 34
page user … added removed … writeID
… … … … … … …
Miley Ashu … 68 16 … 10000
Justin Zaka … 392 239 … 10000
wiki contents
New records
⇢ ALTER MATERIALIZED VIEW mv1 REBUILD;

SELECT page, user,
FROM wiki
① Rebuild statement rewriting
INSERT OVERWRITE mv1
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM (
SELECT page, user, c_added, c_removed
FROM mv1
UNION ALL
FROM wiki
WHERE writeID > 9999
GROUP BY page, user) subq
Rollup data
mv1 contents
Ke$ha Xeno 1953 17
Ke$ha Helz 3194 170
Miley Ashu 2232 34
… … … … … … …
Miley Ashu … 68 16 … 10000
Justin Zaka … 392 239 … 10000
wiki contents
New records

SELECT page, user,
FROM wiki
② Rewrite INSERT OVERWRITE into MERGE statement
MERGE INTO mv1
USING (
FROM wiki
WHERE writeID > 9999
GROUP BY page, user) src
ON mv1.page = src.page AND mv1.user = src.user
WHEN MATCHED
THEN UPDATE SET c_added = mv1.c_added + src.c_added,
c_removed = mv1.c_removed + src.c_rmv
WHEN NOT MATCHED
THEN INSERT VALUES (page, user, c_added, c_rmv);
mv1 contents
Ke$ha Xeno 1953 17
Ke$ha Helz 3194 170
Miley Ashu 2232 34
… … … … … … …
Miley Ashu … 68 16 … 10000
Justin Zaka … 392 239 … 10000
wiki contents
New records

Materialized view lifecycle

Management of materialized view lifecycle
• Do not accept stale data (default)
• If content of the materialized view is not fresh, we do not use it for automatic query rewriting
• Still possible to trigger partial rewritings that read both the stale materialized view and new data
from source tables
• Accept stale data
• Freshness defined as a time parameter
• If MV was not rebuilt for a certain time period and there were changes in base tables, ignore
• SET hive.materializedview.rewriting.time.window=10min;
• Can also be overriden by a certain materialized view using table properties
• Periodically rebuild materialized view, e.g., every 5 minutes
t=0min t=10min t=20min
Create MV Rebuild Rebuild Rebuild Rebuild
t=5min t=15min

Road ahead

Road ahead
• Improvements to current materialized views implementation
• Rewriting performance and scalability
• Single/many MVs
• Control physical distribution of data
• DISTRIBUTE BY, SORT BY, CLUSTER BY
• Increase incremental view maintenance coverage
• Support update/delete in source tables
• Materialized view recommender
• Ease the identification of access patterns for a given workload

Demo

Thank you
https://cwiki.apache.org/confluence/display/Hive/Materialized+views

Accelerating query processing with materialized views in Apache Hive

More Related Content

What's hot (20)

Similar to Accelerating query processing with materialized views in Apache Hive (20)

More from DataWorks Summit (20)

Recently uploaded (20)

Accelerating query processing with materialized views in Apache Hive

Editor's Notes