What's New in Apache Hive

© Hortonworks Inc. 2011–2018. All rights reserved.
What is new in Apache Hive?
Ashutosh Chauhan

Apache Hive – Distant Past – First Five Years
• Initial use case: batch processing
• Circa 2008
• Read-only data
• MapReduce
• HiveQL

Apache Hive – Past 5 Years
• Effort to take Hive beyond its batch processing roots
• Started in Apache Hive 0.10.0 (January 2013)
• Latest released version: Apache Hive 3.0 (May 2018)
• Extensive renovation along four axes:
• Runtime: sub-second queries with LLAP
• Compiler: cost-based optimizer (CBO)
• SQL support: improved coverage of SQL syntax
• Transactional support: ACID

Hive – Today
• Comprehensive ANSI SQL including all TPC-DS Queries.
• The only Hadoop SQL with ACID MERGE for easy updates.
• In-Memory caching for MPP performance at Hadoop scale.
• Enables Per-User dynamic row and column security.
• Enables Replication and DR for critical workloads.
• Compatible with every major BI Tool.
• Proven at 300+ PB Scale.

Apache Hive: Fast Facts
• Most queries per hour: 100,000
• Analytics performance: 100 million rows/s per node
• Largest Hive warehouse: 300+ PB raw storage
• Largest cluster: 4,500+ nodes

Hive: Serving ETL Workloads to BI Systems
• Query result cache: results return from HDFS/cache directly; reduces load from repetitive queries
• Workload management: allows more queries to be run in parallel; reduces resource starvation in large clusters; also Active/Passive HA
• Materialized views, improved stats, constraints: more "tools" for the optimizer to use; more "tools" for DBAs to tune/optimize; tuning of the DB is invisible from the users' perspective
• ACID v2: as fast as regular tables

• SIGMOD Software Systems Award
• “For developing seminal software systems that served to bring relational-style
declarative programming to the Hadoop ecosystem.”
• Past recipients include Postgres, SQLite and MonetDB

Hive – How Did We Get Here?
• LLAP Enhancements
• CBO Enhancements
• ACID Enhancements

Materialized Views in Hive

Accelerating Query Processing
• Change data physical properties (distribute, sort)
• Filter rows
• Denormalize
• Preaggregate
Optimization based on access patterns

Materialized Views to the Rescue
• Speed up aggregates and joins via MVs
• View navigation via CBO/Calcite
• Optionally allow rewrites against out-of-date materializations

Materialized Views in Hive 3
• Multiple storage options: Hive, Druid
• Multiple options to control materialized views lifecycle

Materialized View-based Rewriting
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT <dims>,
lo_revenue,
lo_extendedprice * lo_discount AS d_price,
lo_revenue - lo_supplycost
FROM
customer, dates, lineorder, part, supplier
WHERE
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and lo_custkey = c_custkey;
• Query
SELECT sum(lo_extendedprice*lo_discount)
FROM
lineorder, dates
WHERE
lo_orderdate = d_datekey
and d_year = 2013
and lo_discount between 1 and 3;
• Materialized view-based rewriting
SELECT SUM(d_price)
FROM mv
WHERE
d_year = 2013
and lo_discount between 1 and 3;
[Diagram: source tables supplier, part, dates, customer and lineorder; mv contents; query results]
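The rewriting above can be checked on toy data. The following is a minimal sketch, not Hive itself: it uses SQLite from Python, emulates the materialized view with a plain `CREATE TABLE ... AS`, and joins only `lineorder` and `dates` instead of all five tables. The sample rows are invented for illustration. It shows that the query over the base tables and the rewritten query over `mv` return the same sum.

```python
import sqlite3

# Toy star schema; the row values are made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dates (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
CREATE TABLE lineorder (lo_orderdate INTEGER,
                        lo_extendedprice INTEGER,
                        lo_discount INTEGER);
INSERT INTO dates VALUES (20130101, 2013), (20130102, 2013), (20140101, 2014);
INSERT INTO lineorder VALUES
  (20130101, 100, 2), (20130102, 200, 3),
  (20130102, 300, 5), (20140101, 400, 1);

-- SQLite has no materialized views, so a plain table stands in for one.
CREATE TABLE mv AS
  SELECT d_year, lo_discount,
         lo_extendedprice * lo_discount AS d_price
  FROM lineorder, dates
  WHERE lo_orderdate = d_datekey;
""")

# Query as the user writes it, against the base tables.
original = conn.execute("""
  SELECT SUM(lo_extendedprice * lo_discount)
  FROM lineorder, dates
  WHERE lo_orderdate = d_datekey
    AND d_year = 2013
    AND lo_discount BETWEEN 1 AND 3
""").fetchone()[0]

# Query as the optimizer could rewrite it, against the precomputed join.
rewritten = conn.execute("""
  SELECT SUM(d_price)
  FROM mv
  WHERE d_year = 2013
    AND lo_discount BETWEEN 1 AND 3
""").fetchone()[0]

print(original, rewritten)
```

The rewrite is only valid because the view keeps every column the query filters on (`d_year`, `lo_discount`) and stores exactly the expression the query aggregates.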

Rebuilding Materialized Views
• ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
• Incremental materialized view maintenance
• Only refresh data that has changed in source tables

Accelerating Query Processing with Materialized Views in Apache Hive
Jesus Camacho Rodriguez
Tuesday, June 19
2:50 PM - 3:30 PM
Executive Ballroom 210A/E

Workload Management

Overview
• Effectively share LLAP cluster resources
• Resource allocation per user policy; separate ETL and BI, etc.
• Resource-based guardrails
• Protect against long-running queries and high memory usage
• Improved, query-aware scheduling
• Scheduler is aware of query characteristics, types, etc.
• Fragments are easier to pre-empt than containers
• Queries get guaranteed fractions of the cluster, but can use empty space

Resource Plans
• Resource plan is a workload management configuration for a cluster
• Switching is allowed without stopping queries, e.g. based on time of day
• Cluster is divided into query pools (optionally nested)
• Each pool defines query parallelism, cluster resources percentage
• Queries are automatically routed to pools based on user name, app, etc.
• Rules (Triggers) to kill, move, or deprioritize queries based on DFS usage, runtime, etc.
• Example:
CREATE RESOURCE PLAN daytime;
CREATE POOL bi IN daytime (resource_percent=75, concurrent_queries=5);
CREATE POOL etl IN daytime (resource_percent=25, concurrent_queries=10);
CREATE RULE downgrade IN daytime WHEN total_runtime > 120 THEN MOVE etl;
ADD RULE downgrade TO bi IN daytime;
CREATE MAPPING tableau IN daytime (application='Tableau', pool=bi);
ALTER PLAN daytime SET default_pool='etl';
APPLY PLAN daytime;
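The routing and trigger behaviour described by this plan can be sketched in a few lines. This is an illustrative model only, not Hive code: the `Query` class, the `PLAN` dictionary layout and both functions are hypothetical names, and the unit of `total_runtime` is left as given on the slide.

```python
from dataclasses import dataclass


@dataclass
class Query:
    user: str
    application: str
    total_runtime: int = 0  # same counter name as in the slide's rule
    pool: str = ""


# Hypothetical in-memory form of the "daytime" plan from the example.
PLAN = {
    "pools": {
        "bi": {"resource_percent": 75, "concurrent_queries": 5},
        "etl": {"resource_percent": 25, "concurrent_queries": 10},
    },
    "mappings": {"Tableau": "bi"},  # application name -> pool
    "default_pool": "etl",
    # Per pool: (counter, limit, pool to move to) for each attached rule.
    "triggers": {"bi": [("total_runtime", 120, "etl")]},
}


def route(query, plan=PLAN):
    """Assign a pool from the application mapping, else the default pool."""
    query.pool = plan["mappings"].get(query.application, plan["default_pool"])
    return query.pool


def apply_triggers(query, plan=PLAN):
    """Move the query if a rule attached to its current pool fires."""
    for counter, limit, target in plan["triggers"].get(query.pool, []):
        if getattr(query, counter) > limit:
            query.pool = target
            break
    return query.pool


dashboard = Query(user="alice", application="Tableau")
batch = Query(user="bob", application="nightly-load")
print(route(dashboard), route(batch))

dashboard.total_runtime = 121  # exceeds the limit of the "downgrade" rule
print(apply_triggers(dashboard))
```

A dashboard query starts in `bi` and is moved to `etl` once it exceeds the runtime limit, while a query from an unmapped application goes straight to the default pool.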

Decentralized Guaranteed Resources
• One guaranteed task per unit of resource (executor slot)
• HS2 gives N guaranteed tasks to each AM based on the configured resource plan
• Each AM marks N of its most important tasks as guaranteed at any given time
• Guaranteed tasks pre-empt speculative tasks

Guaranteed Tasks – BI and ETL Example
[Diagram: HS2 and three LLAP daemons, each with executors and a wait queue; 18 executors total]
• BI pool (80% = 14 guaranteed), Query 1: 10 active tasks (running): 10 guaranteed (running), 4 unused for now
• ETL pool (20% = 4 guaranteed), Query 2: 19 active tasks (8 running): 4 guaranteed (4 running), 15 speculative (4 running)
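The numbers in this example can be reproduced with a short calculation. The slides do not say how fractional shares are rounded, so the largest-remainder rule below is an assumption that happens to give 14 and 4 for 18 executors. The function name is hypothetical.

```python
from math import floor


def split_guarantees(total_executors, pool_fractions):
    """Split executor slots into whole guaranteed tasks per pool.

    Assumption: largest-remainder rounding. Each pool gets the floor of
    its exact share, and leftover slots go to the pools with the largest
    fractional parts.
    """
    exact = {pool: total_executors * frac for pool, frac in pool_fractions.items()}
    alloc = {pool: floor(share) for pool, share in exact.items()}
    leftover = total_executors - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda pool: exact[pool] - alloc[pool], reverse=True)
    for pool in by_remainder[:leftover]:
        alloc[pool] += 1
    return alloc


guarantees = split_guarantees(18, {"bi": 0.80, "etl": 0.20})
print(guarantees)

# Query 1 (BI) currently needs only 10 of its 14 guaranteed slots.
unused_bi = guarantees["bi"] - 10
# Query 2 (ETL) runs its 4 guaranteed tasks plus speculative tasks in the gap.
etl_running = guarantees["etl"] + unused_bi
print(unused_bi, etl_running)
```

The 4 speculative ETL tasks occupy exactly the slots BI is not using. They are the ones that would be pre-empted if Query 1 started more guaranteed tasks.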

Caching

Caching for BI Workloads
• Fine-grained (columnar), compact (dictionary, RLE encoded)
• Important due to projections over many wide EDW tables
• Prioritized – indexes are cached with higher priority
• Important to make use of predicate pushdown
• Off-heap (no extra GC), supports SSD
• LRFU replacement policy avoids the damage from large scans
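To see why an LRFU (least recently/frequently used) policy limits the damage from large scans, here is a minimal sketch of the textbook algorithm. It is not LLAP's implementation: the class name, the logical clock and the `lam` decay parameter are illustrative assumptions. Each entry carries a score that grows by 1 on every access and decays exponentially with age; the entry with the lowest decayed score is evicted.

```python
class LRFUCache:
    """Toy LRFU cache: score = sum over past accesses of 0.5 ** (lam * age)."""

    def __init__(self, capacity, lam=0.01):
        self.capacity = capacity
        self.lam = lam      # lam near 0 behaves like LFU, larger lam like LRU
        self.clock = 0      # logical time, advanced on every operation
        self.entries = {}   # key -> (score_at_last_access, last_access, value)

    def _decay(self, age):
        return 0.5 ** (self.lam * age)

    def _score_now(self, key):
        score, last, _ = self.entries[key]
        return score * self._decay(self.clock - last)

    def _touch(self, key, value):
        score, last, _ = self.entries[key]
        new_score = 1.0 + score * self._decay(self.clock - last)
        self.entries[key] = (new_score, self.clock, value)

    def get(self, key):
        self.clock += 1
        if key not in self.entries:
            return None
        self._touch(key, self.entries[key][2])
        return self.entries[key][2]

    def put(self, key, value):
        self.clock += 1
        if key in self.entries:
            self._touch(key, value)
            return
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=self._score_now)
            del self.entries[victim]
        self.entries[key] = (1.0, self.clock, value)


cache = LRFUCache(capacity=3)
cache.put("hot", "frequently read column chunk")
for _ in range(5):
    cache.get("hot")

# A one-off scan touches many chunks exactly once.
for chunk in ["s1", "s2", "s3", "s4"]:
    cache.put(chunk, "scan data")

print(sorted(cache.entries))
```

Under plain LRU the scan would have pushed `hot` out as the least recently used entry. Here the scan chunks evict each other because each has been accessed only once.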

Caching for BI Workloads – Formats, Zero-ETL
• ORC, Parquet
• Cached natively
• Zero-ETL analytics on CSV and JSON data with text caching
• Text is efficiently encoded in background; once cached, queries speed up

In-memory Processing – Native Columnar (ORC)
[Diagram: I/O threads (read planner, compression codec, ORC decoder) read compressed data from the distributed FS into compact encoded column data (col1, col2) in the off-heap cache and SSD cache, governed by a replacement policy; execution threads run fragments whose Hive operators perform vectorized processing on native data vectors]

Running Hive queries fast in the cloud
Nita Dembla
Wednesday, June 20
4:00 PM - 4:40 PM
Grand Ballroom 220C

Druid + Apache Hive
Layer | Data Access Pattern | Features
Hive layer | Large-scale analytics | Joins, subqueries, windowing functions, transformations, complex aggregations, advanced sorting, UDFs
Druid layer | Needle-in-a-haystack queries with large numbers of dimensions | Dimensional aggregates, top-N queries, min/max values, timeseries queries, approximate distinct count, approximate histograms

Druid Integration
• Pushdown of aggregate queries
• Pushdown of complex expressions
• Improvements in Druid to support SQL-standard NULL semantics
• Store materialized views in Druid

Hive 3: Real-time Ingestion
[Diagram: Kafka-Druid-Hive ingest; Hive and Druid serving real-time analytics]
• Druid answers in near real-time

Druid and Hive Together: Interactive Realtime Analytics at Scale
Nishant Bangarwa
Tuesday, June 19
4:50 PM - 5:30 PM
Grand Ballroom 220B

ACID v2
• New on-disk storage format for ACID tables
• Run major compactions before you upgrade
• Update = Delete + Insert
• Performance on par with non-ACID tables
• Support for load statements
• New Streaming ingestion library
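The "Update = Delete + Insert" idea can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Hive's storage format: the `AcidTable` class is hypothetical, rows get a single integer identifier, and insert and delete deltas are kept in memory rather than in files. An update never modifies a stored row. It records a delete for the old row identifier and inserts a new row, and readers filter out deleted identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class AcidTable:
    """Toy append-only table with insert deltas and delete deltas."""

    next_row_id: int = 0
    inserts: dict = field(default_factory=dict)  # row_id -> row (never rewritten)
    deletes: set = field(default_factory=set)    # row_ids marked as deleted

    def insert(self, row):
        row_id = self.next_row_id
        self.next_row_id += 1
        self.inserts[row_id] = dict(row)
        return row_id

    def delete(self, row_id):
        self.deletes.add(row_id)

    def update(self, row_id, **changes):
        # Update = Delete + Insert: the old version stays on "disk" untouched.
        new_row = {**self.inserts[row_id], **changes}
        self.delete(row_id)
        return self.insert(new_row)

    def scan(self):
        # Readers merge the two kinds of delta: inserted rows minus deleted ids.
        return [row for rid, row in self.inserts.items() if rid not in self.deletes]


table = AcidTable()
first = table.insert({"id": 1, "qty": 5})
second = table.update(first, qty=7)
print(table.scan())
```

Because existing data is never edited in place, writers only append. The space taken by superseded rows is reclaimed later, which is the role compaction plays in the real system.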

Insert-only Tables
• Transactional Semantics for non-ORC tables
• For INSERT INTO and INSERT OVERWRITE
• With near-zero overhead
• No rename() - Cloud friendly

Transactional Operations in Apache Hive
Eugene Koifman
Wednesday, June 20
11:50 AM - 12:30 PM
Executive Ballroom 210A/E

Disaster Recovery for Hive Data
[Diagram: two on-premise data centers, (a) and (b), under centralized security and governance; data set A is replicated by Scheduled Policy (A) (2am, 10am, 6pm daily) and data set B by Scheduled Policy (B) (2am daily)]
1. Data replication with scheduled policy
2. Disaster takes down Data Center (b)
3. Failover to Data Center (a); data set B made active
4. Active data set B changes to B’ in Data Center (a)

Hive-based Replication
• Replv2 introduces new REPL commands
• Incremental replication: only copy delta changes
• Point-in-time replication
• Hive maintains the replication state
• Additional support for other database objects, e.g. functions and constraints
• Reduced number of copies

Seamless Replication and Disaster Recovery for Apache Hive Warehouse
Sankar Hariappan
Thursday, June 21
9:30 AM - 10:10 AM
Meeting Room 211A/B/C/D

One Metastore to Rule Them All
[Diagram: HMS alongside Atlas, Ranger and SR; engines Hive LLAP, Hive on Tez and Spark; storage HDFS/S3 and Kafka]

Between Us and the Grand Vision
• Make HMS separable from Hive
• Standalone Metastore
• Unify HMS and Schema Registry so batch and streaming can see each other’s data
• Also reduces the number of metadata systems admins have to install and maintain

Sharing Metadata Across the Data Lake and Streams
Alan Gates
Wednesday, June 20
11:50 AM - 12:30 PM
Meeting Room 230A

External Access – Spark LLAP

External Access – Relational View for Everyone
• Hive-on-Tez and other DAG executors can use LLAP directly
• LLAP also provides a "relational datanode" view of the data
• Anyone (with access) can push (approved) code in, from complex query fragments to simple data reads
• E.g. a Spark DataFrame can be created with LlapInputFormat
• Gives external services access to:
• Hive data: centralized, secure data access
• Ability to read all Hive table types, like ACID transactional tables
• Hive features: from column-level security, to LLAP columnar cache

Support Row/Column-level Security in Spark
spark-shell
pyspark

What Is Required?
• Apache Ranger
• Apache Hive with LLAP
• Spark-LLAP
• A library to integrate the above with Spark SQL

HiveServer2 + LLAP + Ranger
[Diagram: a client app sends the SQL query "select name from users" to HiveServer2 in a YARN cluster; plan generation consults Ranger dynamic policies and produces TableScan: users, Filter: state = ‘CA’, Projection: mask(name); the Hive query coordinator passes the plan to the LLAP daemons]
1. Client sends query to HiveServer2.
2. HiveServer2 generates the query plan. Ranger security policies are applied, and the plan is modified based on dynamic security policies.
3. Query plan is sent to the query coordinator.
4. Query plan is sent to the LLAP daemons for execution. Filtering/masking is performed.
5. Results are consolidated and sent to the client.
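The effect of the policy-modified plan on "select name from users" can be sketched as follows. This is an illustrative model, not Ranger or Hive code: `POLICIES`, `secured_scan` and the sample rows are hypothetical, and `mask` is a simplified stand-in that assumes the default behaviour of Hive's `mask()` function (uppercase letters become X, lowercase letters x, digits n).

```python
def mask(value):
    """Simplified stand-in for a masking function (assumed defaults)."""
    out = []
    for ch in value:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("n")
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical policy store: one row filter and one column mask for "users".
POLICIES = {
    "users": {
        "row_filter": lambda row: row["state"] == "CA",
        "column_masks": {"name": mask},
    }
}


def secured_scan(table_name, rows, columns, policies=POLICIES):
    """Apply the table's row filter, then project and mask the columns."""
    policy = policies.get(table_name, {})
    row_filter = policy.get("row_filter", lambda row: True)
    masks = policy.get("column_masks", {})
    result = []
    for row in rows:
        if not row_filter(row):
            continue
        result.append({col: masks.get(col, lambda v: v)(row[col]) for col in columns})
    return result


users = [
    {"name": "Alice", "state": "CA"},
    {"name": "Bob", "state": "NY"},
]

# The user asked only for "select name from users"; the policy adds the rest.
print(secured_scan("users", users, ["name"]))
```

The user's query never mentions the filter or the mask. Both are injected while the plan is built, so the filtering and masking happen inside the execution engine rather than in the client.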

HiveServer2 + LLAP + Ranger (LLAP InputFormat)
[Diagram: a client app using the LLAP InputFormat talks to HiveServer2 in a YARN cluster; plan generation consults Ranger dynamic policies and produces TableScan: users, Filter: state = ‘CA’, Projection: mask(name) for the SQL query "select name from users"; the client then reads from the LLAP daemons]
1. Client requests data locations, known as “splits”, from HiveServer2.
2. HiveServer2 generates the query plan. Ranger security policies are applied, and the plan is modified based on dynamic security policies.
3. Splits, which include the signed query plan, are returned to the client.
4. Client uses the LLAP splits to securely submit the query plan to LLAP. Filtering/masking is performed. Data is returned to the client.

“Other” Improvements
• Query reoptimization
• Constraints
• Vectorization
• Query Cache
• Active/Passive HS2 HA for LLAP
• HLL BitVectors
• CachedStore
• Numerous enhancements in Spark Integration

Future
• Standalone Metastore
• Materialized Views – Automatic Recommendations
• Better integration with cloud storage
• HS2 scalability

Thanks to the open source community for continued success over the last 10 years.
Now, onwards to the next 10 years.
