Apache Hive 2.0:
SQL, Speed, Scale
Alan Gates
Hive PMC Member
Co-founder Hortonworks
May 2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Acknowledgements
• The Apache Hive community for building all this awesome tech
• Content of some of these slides is based on earlier presentations by Sergey Shelukhin and Siddharth Seth
• alias Hive='Apache Hive'
alias Hadoop='Apache Hadoop'
alias Spark='Apache Spark'
alias Tez='Apache Tez'
alias Parquet='Apache Parquet'
alias ORC='Apache ORC'
alias Omid='Apache Omid (incubating)'
alias Calcite='Apache Calcite'

Apache Hive History
• Initially Hive provided SQL on Hadoop
– Provided a table view instead of file view of data
– Translated SQL to MapReduce
– Mostly used for ETL (Extract Transform Load)
– Big, batch, high start up time
• Around 2012 it became clear users wanted to do all data warehousing on Hadoop, not just batch ETL
• Hive has shifted over time to focus on traditional data warehousing problems
– Still does large ETL well
– Now also can be used for analytics, reporting
– Work being done to better support BI (Business Intelligence) tools
• Not OLTP, very focused on backend analytics

Hive 1.x and 2.x
• New feature development in Hive moving at a fast pace
– Stressful for those who use Hive for its original purpose (ETL type SQL on MapReduce)
– Realizing the full potential of Hive as data warehouse on Hadoop requires more changes
• Compromise: follow Hadoop’s example, split into stable and new feature lines
• 1.x
– Stable
– Backwards compatible
– Ongoing bug fixes
• 2.x
– Major new features
– Backwards compatible where possible, but some things will be broken
– Hive 2.0 released February 15, 2016 – Not considered production ready
– Hive 2.1 released June 20, 2016 – Getting closer, but still beta

Hive 2.0 New Features Overview
• 1039 JIRAs resolved with 2.0 as fix version
– 666 bugs
– 140 improvements or new features
– 625 more issues resolved in 2.1, almost all bug fixes
• HPLSQL
• LLAP
• HBase Metastore
• Hive-On-Spark Improvements
• Cost Based Optimizer Improvements
• Many, many new features and bug fixes I will not have time to cover

Adding Procedural SQL: HPLSQL
• Procedural SQL, akin to Oracle’s PL/SQL and Teradata’s stored procedures
– Adds cursors, loops (FOR, WHILE, LOOP), branches (IF), HPLSQL procedures, exceptions (SIGNAL)
• Aims to be compatible with all major dialects of procedural SQL to maximize re-use of existing scripts
• Currently external to Hive, communicates with Hive via JDBC
– User runs commands using the hplsql binary
– Goal is to tightly integrate it so that Hive’s parser can execute HPLSQL, store HPLSQL procedures, etc.

Sub-second Queries in Hive: LLAP (Live Long and Process)
• Persistent daemons
– Saves time on process start up (eliminates container allocation and JVM start up time)
– All code JITed within a query or two
• Data caching with an async I/O elevator
– Hot data cached in memory (columnar aware, so only hot columns cached)
– When possible, work is scheduled on a node that has the data cached; otherwise it runs on another node
• Operators can be executed inside LLAP when it makes sense
– Usually does not make sense for large, ETL style queries
– User code not run in LLAP for security
• Working on an interface to allow other data engines to read securely in parallel
• Beta in 2.0
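
The cache-affinity rule from the data caching bullet (prefer a node that already holds the data, fall back to another node otherwise) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LLAP's scheduler: `pick_node`, `cache_index`, and `free_slots` are hypothetical names, and the real system likely weighs more factors than the two shown here.

```python
from typing import Dict, Optional, Set


def pick_node(split: str,
              cache_index: Dict[str, Set[str]],
              free_slots: Dict[str, int]) -> Optional[str]:
    """Choose a node for the task that reads `split`.

    cache_index maps a node name to the splits it holds in memory and
    free_slots maps a node name to its free executor slots. Both
    structures are assumptions made for this sketch.
    """
    # First choice: a node that already caches the split and has capacity.
    for node in sorted(free_slots):
        if free_slots[node] > 0 and split in cache_index.get(node, set()):
            return node
    # Fallback: any node with capacity; the data is then read without a cache hit.
    for node in sorted(free_slots):
        if free_slots[node] > 0:
            return node
    return None  # no capacity anywhere; the caller has to wait or queue


cache_index = {"node1": {"sales/part-0"}, "node2": {"sales/part-1"}}
free_slots = {"node1": 0, "node2": 2, "node3": 1}

local_choice = pick_node("sales/part-1", cache_index, free_slots)
fallback_choice = pick_node("sales/part-0", cache_index, free_slots)
print(local_choice, fallback_choice)
```

In the example, `sales/part-1` lands on the node that caches it, while `sales/part-0` is cached only on a node with no free slots, so the task runs elsewhere.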

Hive With LLAP Execution Options
[Diagram: three execution layouts labeled Tez Only, LLAP + Tez, and LLAP only, drawn with boxes marked AM, T, R, and M]

LLAP Performance
[Chart: LLAP vs Hive 1.x, 10TB Scale. Time (seconds, axis from 0 to 50) for query3, query12, query20, query21, query26, query27, query42, query52, query55, query73, query89, query91, and query98; series: LLAP and Hive 1.x]

LLAP Performance Continued
[Chart: Hive / LLAP, Hive 1.2.1 Query Times. Time (seconds, axis from 0 to 500); series: LLAP and Hive 1.2.1]
38 out of 61 queries ran 50% faster
25 out of 61 queries ran 70% faster
12 out of 61 queries ran 80% faster
1 query ran 90% faster
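
For readers who prefer shares to counts, the four result lines convert to percentages of the 61 queries as shown in this small Python sketch. It only restates the numbers on the slide; the slide does not say whether the four groups overlap, and the sketch does not assume either way.

```python
TOTAL_QUERIES = 61

# (number of queries, improvement level as worded on the slide)
faster_counts = [
    (38, "50% faster"),
    (25, "70% faster"),
    (12, "80% faster"),
    (1, "90% faster"),
]

# Share of the benchmark's queries that reached each level, in percent.
shares = {label: round(100 * count / TOTAL_QUERIES, 1)
          for count, label in faster_counts}

for label, share in shares.items():
    print(f"{label}: {share}% of the {TOTAL_QUERIES} queries")
```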

LLAP Limitations
• Currently in Beta
• Read only, no write path yet
• Does not work with ACID yet (see previous bullet)
• User must decide whether query runs fully in LLAP, mixed mode, or not at all
– This should be handled by the CBO
• Currently only reads ORC files
• Currently only integrates with Tez as an engine

Speeding up Query Planning: HBase Metastore
• Add option to use HBase to store Hive’s metadata
• Why?
– Planning a query that reads several thousand partitions in Hive 1.2 takes 5+ seconds, mostly for metadata acquisition
– ORM layer produces complex, slow schema (40+ tables)
– The need to work across 5 different databases limits performance optimizations and enlarges the test matrix for developers
– Limits caching opportunities as we cannot store too much data in a single node RDBMS
– The need to limit the number of concurrent connections forces all metadata operations to be done during query planning
– HBase addresses each of these
• Goal: cut metadata access time for a query with thousands of partitions to 200 milliseconds
– Not there yet, currently at 1-1.5 seconds
• Challenges
– HBase lacks transactions, addressing via Apache Omid (incubating)
• Alpha in Hive 2.0
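
The distance between the stated goal and the current state is easier to see as speedup factors. The Python sketch below is plain arithmetic on the slide's numbers, with two simplifications that are assumptions: "5+ seconds" is treated as exactly 5 seconds, and planning time is compared directly with metadata access time even though the slide says only that metadata acquisition accounts for most of the planning time.

```python
baseline_s = 5.0              # Hive 1.2 planning time; the slide says "5+ seconds"
target_s = 0.2                # stated goal: 200 milliseconds
current_range_s = (1.0, 1.5)  # stated current state: 1-1.5 seconds

# Speedup the goal implies relative to the Hive 1.2 baseline.
target_speedup = baseline_s / target_s

# Speedup already achieved at each end of the current range.
current_speedups = [baseline_s / t for t in current_range_s]

# Factor still missing to reach the goal at each end of the current range.
remaining_factors = [t / target_s for t in current_range_s]

print(f"goal: {target_speedup:.0f}x faster than the baseline")
print("so far: " + ", ".join(f"{s:.1f}x" for s in current_speedups))
print("still needed: " + ", ".join(f"{r:.1f}x" for r in remaining_factors))
```

Under these assumptions the goal is a 25x improvement, of which roughly 3.3x to 5x has been reached, leaving a further 5x to 7.5x.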

Improvements to Hive on Spark
• Dynamic partition pruning
• Make use of Spark persistence for self-join, self-union, and CTEs
• Vectorized map-join and other map-join improvements
• Parallel order by
• Pre-warming of containers
• Support for Spark 1.5
• Many bug fixes
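
The editor's notes for this deck describe dynamic partition pruning as two sequential jobs: the first evaluates the pruning side and saves the resulting values on HDFS, and the second uses those values to skip unwanted partitions. The Python sketch below imitates that flow on toy data. It is not Hive or Spark code: the tables, the `job_one` and `job_two` functions, and the local JSON file standing in for HDFS are all hypothetical.

```python
import json
import os
import tempfile

# Hypothetical data: a small dimension table and a fact table partitioned by date.
dim_dates = [
    {"date_key": "2016-01-01", "is_holiday": True},
    {"date_key": "2016-01-02", "is_holiday": False},
    {"date_key": "2016-01-03", "is_holiday": True},
]
fact_partitions = {
    "2016-01-01": [10, 20],
    "2016-01-02": [30],
    "2016-01-03": [40, 50],
    "2016-01-04": [60],
}


def job_one(path: str) -> None:
    """Evaluate the filter on the small side and save the surviving keys."""
    keys = sorted(row["date_key"] for row in dim_dates if row["is_holiday"])
    with open(path, "w") as handle:
        json.dump(keys, handle)  # stands in for writing the values to HDFS


def job_two(path: str) -> int:
    """Read the saved keys and scan only the matching fact partitions."""
    with open(path) as handle:
        wanted = set(json.load(handle))
    scanned = [part for part in fact_partitions if part in wanted]
    return sum(sum(fact_partitions[part]) for part in scanned)


with tempfile.TemporaryDirectory() as tmp:
    values_file = os.path.join(tmp, "pruning_values.json")
    job_one(values_file)
    holiday_total = job_two(values_file)

print(holiday_total)
```

Only two of the four fact partitions are read in the second job, which is the point of the technique: the set of partitions to scan is not known until the first job has run.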

Cost Based Optimizer (CBO) Improvements
• Hive’s CBO uses Calcite
– Not all optimization rules migrated yet, but 2.0 continues work towards that
• CBO on by default in 2.0 (wasn’t in 1.x)
• Main focus of CBO work has been BI queries (using TPC-DS as guide)
– Some work on machine generated queries, since tools generate some funky queries
• Focus on improving stats collection and estimating stats more accurately between operators in the plan
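
To make "estimating stats between operators" concrete, here is a minimal Python sketch of how row-count estimates can be propagated from a filter into a join. It uses the classic textbook formulas (uniform-distribution selectivity for an equality filter, and |L| × |R| / max(distinct values) for an equi-join). These are assumptions chosen for illustration; the slides do not say which formulas Hive or Calcite apply, and the table sizes are made up.

```python
def filter_rows(rows: int, ndv: int) -> int:
    """Estimated rows after `col = constant`, assuming uniform values.

    ndv is the estimated number of distinct values in the filtered column.
    """
    return max(1, rows // ndv)


def join_rows(left_rows: int, left_ndv: int,
              right_rows: int, right_ndv: int) -> int:
    """Estimated rows after an equi-join, where ndv refers to the join key."""
    return max(1, left_rows * right_rows // max(left_ndv, right_ndv))


# Hypothetical base-table statistics.
sales_rows, sales_key_ndv = 1_000_000, 1_000
dates_rows, dates_key_ndv, dates_year_ndv = 1_000, 1_000, 5

# Step 1: filter the dimension table on one year out of five.
filtered_dates = filter_rows(dates_rows, dates_year_ndv)

# Step 2: the filter output cannot have more distinct keys than rows.
filtered_key_ndv = min(dates_key_ndv, filtered_dates)

# Step 3: feed the derived stats, not the base-table stats, into the join.
joined = join_rows(sales_rows, sales_key_ndv, filtered_dates, filtered_key_ndv)

print(filtered_dates, filtered_key_ndv, joined)
```

The join estimate depends on the filter estimate feeding into it, so an error made early in the plan compounds at every operator above it. That is one reason accurate statistics matter for plan quality.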

And Many, Many More
• SQL Standard Auth is the default authorization (actually works)
• CLI mode for beeline (WIP to replace and deprecate CLI in Hive 2.*)
• Codahale-based metrics (also in 1.3)
• HS2 Web UI
• Stability improvements and bug fixes for ACID (almost production ready now)
• Native vectorized mapjoin, vectorized reducesink, improved vectorized GBY, etc.
• Improvements to Parquet performance (PPD, memory manager, etc.)
• ORC schema evolution (beta)
• Improvements to windowing functions, refactoring ORC before split, SIMD optimizations, new LIMIT syntax, parallel compilation in HS2, improvements to Tez session management, many more

Hive 2.0 Incompatibilities
• Java 7 & 8 supported, 6 no longer supported
• Requires Hadoop 2.x, Hadoop 1.x no longer supported
• MapReduce deprecated, Tez or Spark recommended instead
– At some future date MR will be removed
• Some configuration defaults changed, e.g.
– bucketing enforced by default
– metadata schema no longer created if it is missing
– SQL Standard authorization used by default
• We plan to remove Hive CLI in the future and replace it with beeline CLI
– Why? It makes it easier for users to deploy secure clusters where all access is via ODBC or JDBC, and it is cleaner to maintain one code path
– Does not require a separate HiveServer2; HS2 can run embedded in beeline
Thank You

Editor's Notes

  • #10:
    – 10 compute nodes, with 512GB RAM per node, running HDP 2.3
    – Scale 10K (10TB), interactive queries
    – Single query runs: via Hive CLI
    – Concurrency runs: via HiveServer2 and jmeter
    – Hive1: Hive 1.2 + Tez 0.7
    – Pre-warm and container reuse enabled
    – LLAP: close to the 2.0 Hive branch, Tez close to the current master branch
    – Caching enabled (as of November 2015)
  • #14:
    1. DPP: Implemented in two sequential jobs. The first one processes the pruning part, saving the dynamic values on HDFS. The second job uses these values to filter out unwanted partitions. Not fully tested yet.
    2. Spark RDD persistence is used to store the temporary results from repeated subqueries to avoid re-computation. This is similar to a materialized view and happens automatically. This is especially useful for cases of self-join, self-union, and CTE.
    3. Vectorized map-join, optimized hashtable for mapjoin. These are very similar to Tez.
    4. Use the parallel order by provided by Spark to do global sorting without limiting to one reducer. Internally, however, Spark does the sampling.
    5. Wait for a few seconds after SparkContext is created before submitting the job, to make sure that enough executors are launched. SparkContext allows a job to be submitted right away, even if the executors are still starting up. Parallelism at the reducer is partially determined by the number of executors available at the time the job is submitted. This is useful for short-lived sessions, such as those launched by Oozie.
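
Note 4 under #14 mentions that a global sort can use more than one reducer because the keys are sampled first. The general technique behind that, range partitioning from a sample, can be sketched in plain Python as below. This is an illustration of the idea only, not Spark's implementation; `parallel_order_by` and its parameters are hypothetical names.

```python
import bisect
import random
from typing import List


def parallel_order_by(values: List[int], reducers: int,
                      sample_size: int = 100, seed: int = 0) -> List[List[int]]:
    """Split values into `reducers` sorted, non-overlapping ranges."""
    if not values:
        return [[] for _ in range(reducers)]
    rng = random.Random(seed)
    # Sample the keys to learn roughly how they are distributed.
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    # Pick reducers - 1 boundaries at evenly spaced positions in the sample.
    bounds = [sample[len(sample) * i // reducers] for i in range(1, reducers)]
    buckets: List[List[int]] = [[] for _ in range(reducers)]
    for value in values:
        # Every value in bucket i is <= every value in bucket i + 1.
        buckets[bisect.bisect_right(bounds, value)].append(value)
    # Each bucket can now be sorted independently (in parallel on a cluster).
    return [sorted(bucket) for bucket in buckets]


data = random.Random(42).sample(range(10_000), 1_000)
partitions = parallel_order_by(data, reducers=4)

# Concatenating the partitions in order yields the globally sorted result.
globally_sorted = [value for part in partitions for value in part]
print(globally_sorted == sorted(data))
```

Because the ranges do not overlap, no final merge through a single reducer is needed. The quality of the sample only affects how evenly the work is spread, not the correctness of the result.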