Stinger Hadoop Summit June 2013

Putting the Sting in Hive
Alan F. Gates
@alanfgates

Stinger Overview
•An initiative, not a project or product
•Includes changes to Hive and a new project Tez
•Two main goals
–Improve Hive performance 100x over Hive 0.10
–Extend Hive SQL to include features needed for analytics
•Hive will support:
–BI tools connecting to Hadoop
–Analysts performing ad-hoc, interactive queries
–Still excellent at the large batch jobs it is used for today
© 2013 Hortonworks

Stinger Mileposts
Goals:
1. Improve existing tools & preserve investments
2. Enable Hive to support interactive workloads
Stinger Phase 1 (released in Hive 0.11)
•Base Optimizations
•SQL Analytics
•ORCFile Format
Stinger Phase 2 (current work)
•YARN Resource Management
•Hive on Apache Tez
•Query Service
•Vectorized Operators
Stinger Phase 3 (roadmap)
•Buffer Cache
•Cost Based Optimizer

Hive Performance Gains in 0.11
• Enable star joins by improving Hive’s map join (aka broadcast join)
–Where possible, do the join in a single map-only task
–When not possible, push the larger tables to separate tasks
• Collapse adjacent jobs where possible
–Hive has lots of M->MR type plans; collapse these to MR
–Collapse adjacent jobs on sufficiently similar keys when feasible:
  join followed by group, join followed by order, group followed by order
• Improvements in sort merge bucket (SMB) joins
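The map (broadcast) join named above can be sketched in a few lines. This is a minimal illustration of the general technique, not Hive's implementation: the small table is loaded into an in-memory hash table and the large table is streamed past it, so no shuffle or reduce step is needed. The function name `map_join` and the sample tables are hypothetical.

```python
from collections import defaultdict

def map_join(fact_rows, dim_rows, fact_key, dim_key):
    """Join a large table against a small one using an in-memory hash table."""
    # Build phase: load the small (dimension) table into a hash table.
    # In a real map join this table would be broadcast to every map task.
    table = defaultdict(list)
    for d in dim_rows:
        table[d[dim_key]].append(d)
    # Probe phase: stream the large (fact) table; no shuffle or reduce step.
    for f in fact_rows:
        for d in table.get(f[fact_key], []):
            yield {**d, **f}

# Hypothetical sample data
sales = [{"item_id": 1, "qty": 3}, {"item_id": 2, "qty": 5}, {"item_id": 9, "qty": 1}]
items = [{"item_id": 1, "name": "widget"}, {"item_id": 2, "name": "gadget"}]
joined = list(map_join(sales, items, "item_id", "item_id"))
```

A star join repeats the probe step once per dimension table, which is why keeping it inside a single map-only task pays off.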

Before

After

Improvements in SMB Joins
• TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)
[Figure: bar chart, Query 82]
Text: 3257.692
RCFile: 2862.669
Partitioned RCFile: 255.641
Partitioned RCFile + Optimizations: 71.114
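For readers unfamiliar with SMB joins, a rough sketch of the idea, assuming both tables are bucketed the same way on the join key and each bucket is already sorted. Matching buckets can then be merged in one pass with no hash table and no sort at query time. The names `merge_join` and `smb_join` and the sample buckets are hypothetical; this is not Hive's code.

```python
def merge_join(left, right):
    """Merge-join two lists of (key, value) pairs already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for r in right[j:j_end]:
                    out.append((lk, left[i][1], r[1]))
                i += 1
            j = j_end
    return out

def smb_join(left_buckets, right_buckets):
    """Join bucket n of one table only with bucket n of the other."""
    rows = []
    for lb, rb in zip(left_buckets, right_buckets):
        rows.extend(merge_join(lb, rb))
    return rows

# Hypothetical data, bucketed by key % 2 and sorted within each bucket
left_buckets = [[(2, "a"), (4, "b")], [(1, "c"), (3, "d"), (3, "e")]]
right_buckets = [[(4, "x")], [(3, "y"), (5, "z")]]
result = smb_join(left_buckets, right_buckets)
```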

New Technologies in Hive
• All covered in depth in other talks
– See Owen’s, Eric’s, and Jitendra’s talk ORC File & Vectorization at 4:25 today
• Tez – A new execution engine for relational tools such as Hive
– No need to use MapReduce, instead provides general DAG execution
– Data moved between tasks via socket, disk, or HDFS based on performance / restartability trade-off
– Provides standing service to greatly reduce query start time
• ORCFile – A rewrite of RCFile
– Columnar
– Tightly integrated with Hive’s type model, including support for nested types
– Much better compression
– Supports projection and filter push down
• Vectorization – Rewriting operators to take advantage of modern
processors
– Based on work done in MonetDB
– Rewrite operators to radically reduce number of function calls, branch prediction
misses, and cache misses
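The vectorization point above can be illustrated with a toy comparison. This sketch only shows the reduction in per-row function calls; a real vectorized operator works on primitive column arrays in tight loops, which is where the branch-prediction and cache benefits come from. The batch size of 1024 and all names here are assumptions for illustration.

```python
BATCH_SIZE = 1024  # hypothetical batch size

def passes_filter(row, threshold):
    return row["price"] > threshold

def row_at_a_time(rows, threshold):
    """Row-oriented operator: one function call per row."""
    calls = 0
    total = 0
    for row in rows:
        calls += 1
        if passes_filter(row, threshold):
            total += row["price"]
    return total, calls

def sum_batch(prices, threshold):
    # One call handles a whole batch of values from a single column.
    return sum(p for p in prices if p > threshold)

def vectorized(price_column, threshold, batch_size=BATCH_SIZE):
    """Batch-oriented operator: one function call per batch of column values."""
    calls = 0
    total = 0
    for start in range(0, len(price_column), batch_size):
        calls += 1
        total += sum_batch(price_column[start:start + batch_size], threshold)
    return total, calls

prices = list(range(5000))
rows = [{"price": p} for p in prices]
row_total, row_calls = row_at_a_time(rows, 100)
vec_total, vec_calls = vectorized(prices, 100)
```

Both operators compute the same sum; the row-oriented one makes 5000 operator calls, the batched one makes 5.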

Standard Queries
[Figure: bar chart of Query 27 (star join) and Query 82 (fact table join) at scale 200 and scale 1000]
Series: Hive 0.10, RC File; Hive 0.11 CP, RC File; Hive 0.11 CP, ORC File
Bar labels as extracted: 260, 165, 38, 77, 142, 296, 38, 42, 67, 80

Performance Trajectory
Configuration            Query 27 speedup   Query 82 speedup
Hive 10, Text            1X                 1X
Hive 10, RC              2X                 14X
Hive 11, RC              12X                44X
Hive 11, ORC             11X                57X
Hive 11 CP, ORC, Tez     21X                78X

Query 12 – Demonstrating MRR
[Figure: Query 12 - MRR Optimization, elapsed time (seconds)]
                       Traditional Map-Reduce   Tez Map Reduce Reduce
RC File, Scale 200     55                       35
ORC File, Scale 200    54                       34
RC File, Scale 1000    75                       55
ORC File, Scale 1000   65                       46
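A rough sketch of what the Map-Reduce-Reduce (MRR) shape buys, using a group-by followed by an order-by as a stand-in query (this is not the actual plan for Query 12). With plain MapReduce the second reduce needs its own job, whose map stage only re-reads the first job's output; in a DAG the two reduces can be chained directly. All stage names and data are hypothetical.

```python
from collections import defaultdict

def map_stage(rows):
    # Map: emit (key, value) pairs.
    return [(r["category"], r["amount"]) for r in rows]

def reduce_aggregate(pairs):
    # Reduce 1: GROUP BY category, SUM(amount).
    totals = defaultdict(int)
    for key, amount in pairs:
        totals[key] += amount
    return list(totals.items())

def identity_map(pairs):
    # Map stage of the second MR job: only re-reads the first job's output.
    return list(pairs)

def reduce_sort(pairs):
    # Reduce 2: ORDER BY total DESC.
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)

TRADITIONAL = [map_stage, reduce_aggregate, identity_map, reduce_sort]  # two MR jobs
MRR = [map_stage, reduce_aggregate, reduce_sort]                        # one M-R-R DAG

def run(stages, rows):
    data = rows
    for stage in stages:
        data = stage(data)
    return data

rows = [
    {"category": "books", "amount": 5},
    {"category": "music", "amount": 9},
    {"category": "books", "amount": 7},
]
traditional_result = run(TRADITIONAL, rows)
mrr_result = run(MRR, rows)
```

Both plans return the same rows; the MRR plan has one stage fewer and, on a cluster, skips the job launch and intermediate write between the two jobs.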

Hive Performance Up Next
• Push down startup time: even for queries that spend less than a second running on the cluster, there are ~15 seconds of startup time
– Tez service will remove Hadoop startup issues
– Need to reduce time for the metadata access
– Need intelligent file caching so that hot tables can be kept in memory
• Keep working on the optimizer
– YSmart work from Ohio State University
– Start using statistics to make intelligent decisions about how many mappers and
reducers to spawn – maybe in Hive, maybe in Tez
– Start using statistics to choose between competing plan options
• Buffer Cache
– Coordinate with HDFS team to determine caching strategy

Extending Hive SQL in 0.11
• DECIMAL data type – for fixed precision calculation (e.g. currency)
• OVER clause
– PARTITION BY, ORDER BY, ROWS
BETWEEN/FOLLOWING/PRECEDING
– Works with existing aggregate functions
– New analytic and window functions added
– ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, FIRST_VALUE, LAST_VALUE, NTILE, CUME_DIST, PERCENT_RANK
SELECT salesperson, AVG(salesprice) OVER
(PARTITION BY region ORDER BY date
ROWS BETWEEN 10 PRECEDING AND 10 FOLLOWING)
FROM sales;
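What the windowed AVG computes can be spelled out procedurally. The sketch below mimics the semantics of the query above on in-memory rows: partition, sort within each partition, then average over a frame of rows around the current one. It illustrates the semantics only, not how Hive executes the query. The helper name `moving_avg` and the sample rows are hypothetical, and the frame is narrowed to 1 PRECEDING / 1 FOLLOWING to keep the example small.

```python
from collections import defaultdict

def moving_avg(rows, select_key, partition_key, order_key, value_key,
               preceding, following):
    """AVG(value) OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN n PRECEDING AND m FOLLOWING)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    out = []
    for part in partitions.values():
        part.sort(key=lambda r: r[order_key])
        for i, row in enumerate(part):
            # The frame is clipped at the partition boundaries.
            lo = max(0, i - preceding)
            hi = min(len(part), i + following + 1)
            frame = [r[value_key] for r in part[lo:hi]]
            out.append((row[select_key], sum(frame) / len(frame)))
    return out

# Hypothetical sample rows
sales = [
    {"salesperson": "ann", "region": "east", "date": "2013-06-01", "salesprice": 10.0},
    {"salesperson": "bob", "region": "east", "date": "2013-06-02", "salesprice": 20.0},
    {"salesperson": "cat", "region": "east", "date": "2013-06-03", "salesprice": 60.0},
    {"salesperson": "dan", "region": "west", "date": "2013-06-01", "salesprice": 5.0},
]
averages = moving_avg(sales, "salesperson", "region", "date", "salesprice",
                      preceding=1, following=1)
```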

Extending Hive SQL Post 0.11
• Subqueries in WHERE
– Non-correlated first
– [NOT] IN first, then extend to (in)equalities and EXISTS
• Datatype conformance – Hive has Java type model, add support for
SQL types:
– DATE
– CHAR() and VARCHAR()
– add precision and scale to decimal and float
– aliases for standard SQL types (BLOB = binary, CLOB = string, integer =
int, real/number = decimal)
• Security
– Add security checks to views, indices, functions, etc.
– Secure GRANT and REVOKE
