Apache Hive 2.0: SQL, Speed, Scale

Alan Gates
Hive PMC Member
Co-founder Hortonworks
May 2016

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Acknowledgements
• The Apache Hive community for building all this awesome tech
• Content of some of these slides is based on earlier presentations by Sergey Shelukhin and Siddharth Seth
• alias Hive='Apache Hive'
alias Hadoop='Apache Hadoop'
alias Spark='Apache Spark'
alias Tez='Apache Tez'
alias Parquet='Apache Parquet'
alias ORC='Apache ORC'
alias Omid='Apache Omid (incubating)'
alias Calcite='Apache Calcite'

Apache Hive History
• Initially Hive provided SQL on Hadoop
– Provided a table view instead of a file view of data
– Translated SQL to MapReduce
– Mostly used for ETL (Extract, Transform, Load)
– Big, batch, high start-up time
• Around 2012 it became clear users wanted to do all data warehousing on Hadoop, not just batch ETL
• Hive has shifted over time to focus on traditional data warehousing problems
– Still does large ETL well
– Now can also be used for analytics and reporting
– Work is being done to better support BI (Business Intelligence) tools
• Not OLTP, very focused on backend analytics

Hive 1.x and 2.x
• New feature development in Hive is moving at a fast pace
– Stressful for those who use Hive for its original purpose (ETL-type SQL on MapReduce)
– Realizing the full potential of Hive as a data warehouse on Hadoop requires more changes
• Compromise: follow Hadoop’s example, split into stable and new feature lines
• 1.x
– Stable
– Backwards compatible
– Ongoing bug fixes
• 2.x
– Major new features
– Backwards compatible where possible, but some things will be broken
– Hive 2.0 released February 15, 2016 – Not considered production ready
– Hive 2.1 released June 20, 2016 – Getting closer, but still beta

Hive 2.0 New Features Overview
• 1039 JIRAs resolved with 2.0 as fix version
– 666 bugs
– 140 improvements or new features
– 625 more issues resolved in 2.1, almost all bug fixes
• HPLSQL
• LLAP
• HBase Metastore
• Hive-On-Spark Improvements
• Cost Based Optimizer Improvements
• Many, many new features and bug fixes I will not have time to cover

Adding Procedural SQL: HPLSQL
• Procedural SQL, akin to Oracle’s PL/SQL and Teradata’s stored procedures
– Adds cursors, loops (FOR, WHILE, LOOP), branches (IF), HPLSQL procedures, exceptions (SIGNAL)
• Aims to be compatible with all major dialects of procedural SQL to maximize re-use of existing scripts
• Currently external to Hive, communicates with Hive via JDBC
– User runs commands using the hplsql binary
– Goal is to integrate it tightly so that Hive’s parser can execute HPLSQL, store HPLSQL procedures, etc.

Sub-second Queries in Hive: LLAP (Live Long and Process)
• Persistent daemons
– Saves time on process start-up (eliminates container allocation and JVM start-up time)
– All code JITed within a query or two
• Data caching with an async I/O elevator
– Hot data cached in memory (columnar aware, so only hot columns are cached)
– When possible, work is scheduled on a node that has the data cached; otherwise it runs on another node
• Operators can be executed inside LLAP when it makes sense
– Large, ETL-style queries usually don’t make sense
– User code is not run in LLAP, for security
• Working on an interface to allow other data engines to read securely in parallel
• Beta in 2.0
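
The cache-aware placement described above can be sketched in a few lines of Python. This is only an illustration of the idea, not LLAP's actual scheduler: the function `pick_node`, the `cache_index` and `load` structures, and the tie-breaking by lowest load are all hypothetical and assumed for this sketch.

```python
from typing import Dict, List, Set


def pick_node(split: str,
              nodes: List[str],
              cache_index: Dict[str, Set[str]],
              load: Dict[str, int]) -> str:
    """Choose a daemon for one unit of work.

    Prefer nodes that already hold `split` in their cache; if none does,
    fall back to any node. Ties are broken by the lowest current load
    (an assumption made for this sketch, not taken from the slides).
    """
    if not nodes:
        raise ValueError("no nodes available")
    cached = [n for n in nodes if split in cache_index.get(n, set())]
    candidates = cached or nodes  # fall back to all nodes on a cache miss
    return min(candidates, key=lambda n: load.get(n, 0))


# Hypothetical cluster state: only node2 has sales.part0 cached.
nodes = ["node1", "node2", "node3"]
cache_index = {"node2": {"sales.part0"}}
load = {"node1": 0, "node2": 5, "node3": 1}

local_choice = pick_node("sales.part0", nodes, cache_index, load)
fallback_choice = pick_node("sales.part1", nodes, cache_index, load)
```

In this sketch a busy node with the data cached still wins over an idle node without it, which mirrors the "when possible" wording; a real scheduler would presumably weigh waiting time against locality.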

Hive With LLAP Execution Options
[Diagram: three execution options shown side by side: Tez Only, LLAP + Tez, and LLAP only. Box labels in the diagram: AM, T, R, M.]

LLAP Performance
[Chart: LLAP vs Hive 1.x, 10TB Scale. Y-axis: time (seconds), 0 to 50. X-axis: query3, query12, query20, query21, query26, query27, query42, query52, query55, query73, query89, query91, query98. Series: LLAP, Hive 1.x.]

LLAP Performance Continued
[Chart: Hive / LLAP, Hive 1.2.1 Query Times. Y-axis: time (seconds), 0 to 500. Series: LLAP, Hive 1.2.1.]
38 out of 61 queries ran 50% faster
1 query ran 90% faster

LLAP Limitations
• Currently in beta
• Read only, no write path yet
• Does not work with ACID yet (see previous bullet)
• User must decide whether a query runs fully in LLAP, in mixed mode, or not in LLAP at all
– Should be handled by the CBO
• Currently only reads ORC files
• Currently only integrates with Tez as an engine

Speeding up Query Planning: HBase Metastore
• Adds an option to use HBase to store Hive’s metadata
• Why?
– Planning a query that reads several thousand partitions in Hive 1.2 takes 5+ seconds, mostly for metadata acquisition
– ORM layer produces a complex, slow schema (40+ tables)
– The need to work across 5 different databases limits performance optimizations and enlarges the test matrix for developers
– Caching opportunities are limited, as we cannot store too much data in a single-node RDBMS
– The need to limit the number of concurrent connections forces all metadata operations to be done during query planning
– HBase addresses each of these
• Goal: cut metadata access time for a query with thousands of partitions to 200 milliseconds
– Not there yet, currently at 1-1.5 seconds
• Challenges
– HBase lacks transactions; this is being addressed via Apache Omid (incubating)
• Alpha in Hive 2.0
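
To put the figures above in perspective, the short Python calculation below compares the Hive 1.2 number with the current and target ones. It is a rough sketch under two assumptions that are not in the slides: "5+ seconds" is treated as exactly 5 seconds, and that whole time is treated as metadata access (the slide says only "mostly"). The variable names are illustrative.

```python
# Figures taken from the slide; "5+ seconds" is treated as exactly 5.0 (assumption).
baseline_s = 5.0            # Hive 1.2, query over several thousand partitions
current_s = (1.0, 1.5)      # HBase metastore today (best case, worst case)
target_s = 0.2              # stated goal: 200 milliseconds

# How much faster than the baseline each figure is.
current_speedup = tuple(baseline_s / t for t in current_s)
target_speedup = baseline_s / target_s

# How far the current figures still are from the goal.
remaining_gap = tuple(t / target_s for t in current_s)

print(f"current speedup: {current_speedup[1]:.1f}x to {current_speedup[0]:.1f}x")
print(f"target speedup:  {target_speedup:.0f}x")
print(f"still {remaining_gap[0]:.1f}x to {remaining_gap[1]:.1f}x slower than the goal")
```

Because the baseline is a lower bound, the real improvement over Hive 1.2 is at least what this calculation reports.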

Improvements to Hive on Spark
• Dynamic partition pruning
• Make use of Spark persistence for self-join, self-union, and CTEs
• Vectorized map-join and other map-join improvements
• Parallel order by
• Pre-warming of containers
• Support for Spark 1.5
• Many bug fixes

Cost Based Optimizer (CBO) Improvements
• Hive’s CBO uses Calcite
– Not all optimization rules have been migrated yet, but 2.0 continues work towards that
• CBO on by default in 2.0 (wasn’t in 1.x)
• Main focus of CBO work has been BI queries (using TPC-DS as a guide)
– Some work on machine-generated queries, since tools generate some funky queries
• Focus on improving stats collection and estimating stats more accurately between operators in the plan

And Many, Many More
• SQL Standard Auth is the default authorization (actually works)
• CLI mode for beeline (WIP to replace and deprecate CLI in Hive 2.*)
• Codahale-based metrics (also in 1.3)
• HS2 Web UI
• Stability Improvements and bugfixes for ACID (almost production ready now)
• Native vectorized mapjoin, vectorized reducesink, improved vectorized GBY, etc.
• Improvements to Parquet performance (PPD, memory manager, etc.)
• ORC schema evolution (beta)
• Improvements to windowing functions, refactoring ORC before split, SIMD optimizations, new LIMIT syntax, parallel compilation in HS2, improvements to Tez session management, many more

Hive 2.0 Incompatibilities
• Java 7 & 8 supported, 6 no longer supported
• Requires Hadoop 2.x, Hadoop 1.x no longer supported
• MapReduce deprecated, Tez or Spark recommended instead
– At some future date MR will be removed
• Some configuration defaults changed, e.g.
– bucketing enforced by default
– metadata schema no longer created if it is missing
– SQL Standard authorization used by default
• We plan to remove Hive CLI in the future and replace it with beeline CLI
– Why?
  • Makes it easier for users to deploy secure clusters where all access is via [OJ]DBC
  • It is cleaner to maintain one code path
– Does not require a separate HiveServer2; HS2 can run embedded in beeline
Thank You
