Putting the Sting in Hive
Alan F. Gates
@alanfgates
Stinger Overview
• An initiative, not a project or product
• Includes changes to Hive and a new project Tez
• Two main goals
– Improve Hive performance 100x over Hive 0.10
– Extend Hive SQL to include features needed for analytics
• Hive will support:
– BI tools connecting to Hadoop
– Analysts performing ad-hoc, interactive queries
– Still excellent at the large batch jobs it is used for today
© 2013 Hortonworks
Stinger Mileposts
Stinger Phase 1 (released in Hive 0.11)
• Base Optimizations
• SQL Analytics
• ORCFile Format
Stinger Phase 2 (current work)
• YARN Resource Management
• Hive on Apache Tez
• Query Service
• Vectorized Operators
Stinger Phase 3 (roadmap)
• Buffer Cache
• Cost Based Optimizer
Goals
1. Improve existing tools & preserve investments
2. Enable Hive to support interactive workloads
Hive Performance Gains in 0.11
• Enable star joins by improving Hive’s map join (aka broadcast join)
– Where possible do in single map only task
– When not possible push larger tables to separate tasks
• Collapse adjacent jobs where possible
– Hive has lots of M->MR type plans, collapse these to MR
– Collapse adjacent jobs on sufficiently similar keys when feasible
– join followed by group
– join followed by order
– group followed by order
• Improvements in sort merge bucket (SMB) joins
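The map join (broadcast join) idea in the list above can be sketched as follows. This is a minimal illustration in Python, not Hive's implementation: the table contents and the names `build_hash_table`, `map_join`, `fact_rows`, and `dim_rows` are hypothetical. The small (dimension) table is loaded into an in-memory hash table and the large (fact) table is streamed past it, so the join needs no shuffle or reduce phase.

```python
from collections import defaultdict

def build_hash_table(small_rows, key):
    """Index the small table by join key (this is the part that gets broadcast)."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    return table

def map_join(large_rows, small_rows, large_key, small_key):
    """Inner join done entirely on the 'map side': stream the large table
    and probe the in-memory hash table built from the small table."""
    lookup = build_hash_table(small_rows, small_key)
    for row in large_rows:
        for match in lookup.get(row[large_key], []):
            # Merge the two rows; small-table columns win on name clashes.
            yield {**row, **match}

# Hypothetical star-schema data: one fact table, one small dimension table.
fact_rows = [
    {"item_id": 1, "qty": 3},
    {"item_id": 2, "qty": 5},
    {"item_id": 3, "qty": 1},  # no matching dimension row, dropped by the inner join
]
dim_rows = [
    {"id": 1, "name": "widget"},
    {"id": 2, "name": "gadget"},
]

joined = list(map_join(fact_rows, dim_rows, "item_id", "id"))
```

In a star join the same pattern repeats once per dimension table, which is why it can run in a single map-only task as long as every dimension table fits in memory.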
Before
After
Improvements in SMB Joins
• TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)
[Bar chart, Query 82: Text 3257.692; RCFile 2862.669; Partitioned RCFile 255.641; Partitioned RCFile + Optimizations 71.114]
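As a worked example of reading this chart, the relative speedups can be computed from its four data labels. The sketch below assumes the labels map to the legend entries in the order they appear (Text, RCFile, Partitioned RCFile, Partitioned RCFile + Optimizations). The slide does not state the unit, which does not matter for ratios. The variable names are illustrative.

```python
# Data labels from the Query 82 chart, assumed to be in legend order.
timings = {
    "Text": 3257.692,
    "RCFile": 2862.669,
    "Partitioned RCFile": 255.641,
    "Partitioned RCFile + Optimizations": 71.114,
}

baseline = timings["Text"]

# Speedup of each configuration relative to the Text baseline.
speedups = {name: baseline / value for name, value in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

Under that assumption, partitioning accounts for most of the gain (roughly 13x over Text), and the join optimizations bring the total to roughly 46x.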
New Technologies in Hive
• All covered in depth in other talks
– See Owen’s, Eric’s, and Jitendra’s talk ORC File & Vectorization at 4:25 today
• Tez – A new execution engine for relational tools such as Hive
– No need to use MapReduce, instead provides general DAG execution
– Data moved between tasks via socket, disk, or HDFS based on performance / restartability trade-off
– Provides standing service to greatly reduce query start time
• ORCFile – A rewrite of RCFile
– Columnar
– Tightly integrated with Hive’s type model, including support for nested types
– Much better compression
– Supports projection and filter push down
• Vectorization – Rewriting operators to take advantage of modern processors
– Based on work done in MonetDB
– Rewrite operators to radically reduce number of function calls, branch prediction misses, and cache misses
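The vectorization bullet can be illustrated with a small sketch. This is not Hive's operator code: it is a Python analogy with hypothetical names (`keep_row`, `filter_row_at_a_time`, `filter_batch`, `filter_vectorized`, `BATCH_SIZE`), contrasting one function call per row with one call per batch of column values. In Python the interpreter overhead dominates, so the sketch shows the shape of the technique rather than the CPU-level benefit.

```python
BATCH_SIZE = 1024  # hypothetical batch size, chosen only for illustration

def keep_row(price, threshold):
    """Per-row predicate: called once for every row in the row-at-a-time path."""
    return price > threshold

def filter_row_at_a_time(prices, threshold):
    """Row-at-a-time operator: one function call and one branch per row."""
    return [p for p in prices if keep_row(p, threshold)]

def filter_batch(batch, threshold):
    """Batch operator: one call handles a whole column batch in a tight loop."""
    return [p for p in batch if p > threshold]

def filter_vectorized(prices, threshold):
    """Split the column into batches and apply the batch operator to each."""
    selected = []
    for start in range(0, len(prices), BATCH_SIZE):
        selected.extend(filter_batch(prices[start:start + BATCH_SIZE], threshold))
    return selected

# A single hypothetical column of 5000 values.
prices = [float(i % 100) for i in range(5000)]
row_result = filter_row_at_a_time(prices, 50.0)
vec_result = filter_vectorized(prices, 50.0)
```

Both paths return the same rows. The batch path makes 5 operator calls for 5000 values instead of 5000 predicate calls, which is the reduction in function calls the bullet refers to.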
Standard Queries
[Bar chart: Query 27 (star join) and Query 82 (fact table join), each at scale 200 and scale 1000. Series: Hive 0.10, RC File; Hive 0.11 CP, RC File; Hive 0.11 CP, ORC File. Data labels as extracted: 260, 165, 38, 77, 142, 296, 38, 42, 67, 80.]
Performance Trajectory
Query 27 Speedup
• Hive 10 Text: 1X
• Hive 10 RC: 2X
• Hive 11 RC: 12X
• Hive 11 ORC: 11X
• Hive 11 CP ORC, Tez…: 21X
Query 82 Speedup
• Hive 10 Text: 1X
• Hive 10 RC: 14X
• Hive 11 RC: 44X
• Hive 11 ORC: 57X
• Hive 11 CP ORC, Tez: 78X
Query 12 – Demonstrating MRR
[Bar chart: Query 12 - MRR Optimization. Y-axis: elapsed time (seconds). Categories: RC File Scale 200; ORC File Scale 200; RC File Scale 1000; ORC File Scale 1000. Series: Traditional Map-Reduce; Tez Map Reduce Reduce. Data labels as extracted: 55, 54, 75, 65, 35, 34, 55, 46.]
Hive Performance Up Next
• Push down start up time - even for queries that spend less than a second running on the cluster, there is ~15 seconds of start up time
– Tez service will remove Hadoop startup issues
– Need to reduce time for the metadata access
– Need intelligent file caching so that hot tables can be kept in memory
• Keep working on the optimizer
– YSmart work from Ohio State University
– Start using statistics to make intelligent decisions about how many mappers and reducers to spawn – maybe in Hive, maybe in Tez
– Start using statistics to choose between competing plan options
• Buffer Cache
– Coordinate with HDFS team to determine caching strategy
Extending Hive SQL in 0.11
• DECIMAL data type – for fixed precision calculation (e.g. currency)
• OVER clause
– PARTITION BY, ORDER BY, ROWS BETWEEN/FOLLOWING/PRECEDING
– Works with existing aggregate functions
– New analytic and window functions added
– ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, FIRST_VALUE, LAST_VALUE, NTILE, CUME_DIST, PERCENT_RANK
SELECT salesperson, AVG(salesprice) OVER
  (PARTITION BY region ORDER BY date
   ROWS BETWEEN 10 PRECEDING AND 10 FOLLOWING)
FROM sales;
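To make the semantics of the frame clause concrete, here is a minimal sketch in Python of what `AVG(...) OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN n PRECEDING AND n FOLLOWING)` computes. It is an illustration under stated assumptions, not Hive's implementation: the function name `windowed_avg`, the sample rows, the output column `avg_price`, and the small frame size of 1 are hypothetical.

```python
from collections import defaultdict

def windowed_avg(rows, partition_key, order_key, value_key, preceding, following):
    """For each row, average value_key over the rows of the same partition that
    fall within [i - preceding, i + following] in order_key order."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    results = []
    for part_rows in partitions.values():
        ordered = sorted(part_rows, key=lambda r: r[order_key])
        for i, row in enumerate(ordered):
            # The frame is clipped at the partition boundaries.
            lo = max(0, i - preceding)
            hi = min(len(ordered), i + following + 1)
            frame = [r[value_key] for r in ordered[lo:hi]]
            results.append({**row, "avg_price": sum(frame) / len(frame)})
    return results

# Hypothetical sales rows; column names follow the query above.
sales = [
    {"salesperson": "ann", "region": "east", "date": "2013-01-01", "salesprice": 10.0},
    {"salesperson": "bob", "region": "east", "date": "2013-01-02", "salesprice": 20.0},
    {"salesperson": "cat", "region": "east", "date": "2013-01-03", "salesprice": 60.0},
    {"salesperson": "dan", "region": "west", "date": "2013-01-01", "salesprice": 5.0},
]

averaged = windowed_avg(sales, "region", "date", "salesprice", preceding=1, following=1)
by_person = {r["salesperson"]: r["avg_price"] for r in averaged}
```

Every input row produces exactly one output row, unlike GROUP BY, and rows near the start or end of a partition simply average over a shorter frame.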
Extending Hive SQL Post 0.11
• Subqueries in WHERE
– Non-correlated first
– [NOT] IN first, then extend to (in)equalities and EXISTS
• Datatype conformance – Hive has Java type model, add support for SQL types:
– DATE
– CHAR() and VARCHAR()
– add precision and scale to decimal and float
– aliases for standard SQL types (BLOB = binary, CLOB = string, integer = int, real/number = decimal)
• Security
– Add security checks to views, indices, functions, etc.
– Secure GRANT and REVOKE
Questions

Stinger hadoop summit june 2013


Editor's Notes

  • #12 (Performance Trajectory): Speedup (y-axis) as a ratio to Hive 10 Text. Bigger is better.
  • #13 (Query 12 – Demonstrating MRR): Time (y-axis) in seconds. Smaller is better.