Apache Hive and Stinger:
SQL in Hadoop
Arun Murthy (@acmurthy)
Alan Gates (@alanfgates)
Owen O’Malley (@owen_omalley)
@hortonworks

© Hortonworks Inc. 2013.
YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS

with Predictable Performance and Quality of Service
Applications run natively IN Hadoop:
•  BATCH (MapReduce)
•  INTERACTIVE (Tez)
•  ONLINE (HBase)
•  STREAMING (Storm, S4, …)
•  GRAPH (Giraph)
•  IN-MEMORY (Spark)
•  HPC MPI (OpenMPI)
•  OTHER (Search, Weave, …)

All of these run on YARN (Cluster Resource Management) over HDFS2 (Redundant, Reliable Storage).
Hadoop Beyond Batch with YARN
A shift from the old to the new…
Single Use System (Batch Apps)  →  Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)

HADOOP 1: MapReduce (batch; also responsible for cluster resource management & data processing) on HDFS (redundant, reliable storage)
HADOOP 2: MapReduce (batch), Tez (interactive) and others (varied) on YARN (operating system: cluster resource management) on HDFS2 (redundant, reliable storage)
Apache Tez (“Speed”)
•  Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
– Lower latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
Task with pluggable Input, Processor and Output

Input  →  Processor  →  Output   (one Task)

Tez Task = <Input, Processor, Output>
A YARN ApplicationMaster runs a DAG of Tez Tasks.
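To make the engine swap concrete: once Hive-on-Tez is available (it landed after Hive 0.12), switching a session from MapReduce to Tez is a single setting. A minimal sketch, assuming a cluster with the Tez libraries installed and reusing the deck's example tables a and b:

-- Sketch: run the same HiveQL on Tez instead of MapReduce
SET hive.execution.engine=tez;   -- default is mr; requires Tez on the cluster
SELECT a.x, AVG(b.y) AS avg_y
FROM a JOIN b ON (a.id = b.id)
GROUP BY a.x;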
Tez: Building blocks for scalable data processing
Classical ‘Map’:  HDFS Input  →  Map Processor  →  Sorted Output
Classical ‘Reduce’:  Shuffle Input  →  Reduce Processor  →  HDFS Output
Intermediate ‘Reduce’ for Map-Reduce-Reduce:  Shuffle Input  →  Reduce Processor  →  Sorted Output
Hive-on-MR vs. Hive-on-Tez
Tez avoids unneeded writes to HDFS

SELECT a.x, AVG(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a.x
UNION
SELECT x, AVG(y) AS avg
FROM c GROUP BY x
ORDER BY avg;

[Figure: the two physical plans side by side. Hive – MR runs a chain of separate MapReduce jobs (scans of a, b and c; JOIN (a, c); JOIN (a, b); GROUP BY a.state with COUNT(*) and AVERAGE(c.price)), with each job writing its intermediate result to HDFS before the next one starts. Hive – Tez runs the same joins and aggregations as a single DAG, passing intermediate data directly between tasks with no intermediate HDFS writes.]
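To see which of these plans a statement actually gets, Hive's EXPLAIN prints the stage graph before anything runs. A minimal sketch against the same example tables (no data is read):

-- Sketch: inspect the plan (a chain of MapReduce stages vs. a single Tez DAG)
EXPLAIN
SELECT a.x, AVG(b.y) AS avg_y
FROM a JOIN b ON (a.id = b.id)
GROUP BY a.x;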
Tez Sessions
… because Map/Reduce query startup is expensive
• Tez Sessions
– Hot containers ready for immediate use
– Removes task and job launch overhead (~5s – 30s)

• Hive
– Session launch/shutdown in background (seamless, user not aware)
– Submits query plan directly to Tez Session

Native Hadoop service, not ad-hoc
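In HiveServer2 this session pool is driven by configuration. The keys below are the ones that appeared as this work landed and are normally placed in hive-site.xml before the server starts; they are shown here only as an illustration, and exact names and defaults vary by release:

-- Sketch: keep a pool of warm Tez sessions ready for incoming queries
SET hive.server2.tez.default.queues=default;
SET hive.server2.tez.sessions.per.default.queue=2;
SET hive.server2.tez.initialize.default.sessions=true;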
Tez Delivers Interactive Query - Out of the Box!
Feature | Description | Benefit
•  Tez Session | Overcomes MapReduce job-launch latency by pre-launching the Tez AppMaster | Latency
•  Tez Container Pre-Launch | Overcomes MapReduce latency by pre-launching hot containers ready to serve queries | Latency
•  Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting; reduces latency and eliminates difficult split-size tuning. Out-of-the-box performance! | Latency
•  Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics | Throughput
•  Tez In-Memory Cache | Hot data kept in RAM for fast access | Latency
•  Complex DAGs | Tez Broadcast Edge and the Map-Reduce-Reduce pattern improve query scale and throughput | Throughput
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative
A broad, community-based effort to drive the next generation of HIVE

Stinger Project (announced February 2013)

Hive 0.11, May 2013:
•  Base Optimizations
•  SQL Analytic Functions
•  ORCFile, Modern File Format

Hive 0.12, October 2013:
•  VARCHAR, DATE Types
•  ORCFile predicate pushdown
•  Advanced Optimizations
•  Performance Boosts via YARN

Coming Soon:
•  Hive on Apache Tez
•  Query Service
•  Buffer Cache
•  Cost Based Optimizer (Optiq)
•  Vectorized Processing

Goals:
Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop
Hive 0.12
Release Theme: Speed, Scale and SQL

Specific Features
•  10x faster query launch when using a large number (500+) of partitions
•  ORCFile predicate pushdown speeds queries
•  Evaluate LIMIT on the map side
•  Parallel ORDER BY
•  New query optimizer
•  Introduces VARCHAR and DATE datatypes
•  GROUP BY on structs or unions

Included Components: Apache Hive 0.12
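A small sketch of a few of these 0.12 additions together; the table and column names are hypothetical and used only for illustration:

-- Sketch: VARCHAR and DATE columns plus GROUP BY on a struct (Hive 0.12)
CREATE TABLE orders (
  customer   VARCHAR(64),
  order_date DATE,
  item       STRUCT<sku:STRING, qty:INT>
) STORED AS ORC;

SELECT item, COUNT(*)
FROM orders
GROUP BY item;   -- grouping on a struct column, new in 0.12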
SPEED: Increasing Hive Performance
Interactive Query Times across ALL use cases
•  Simple and advanced queries in seconds
•  Integrates seamlessly with existing tools
•  Currently a >100x improvement in just nine months
Performance Improvements included in Hive 0.12
–  Base & advanced query optimization
–  Startup time improvement
–  Join optimizations

Stinger Phase 3: Interactive Query In Hadoop
Query 27: Pricing Analytics using Star Schema Join
•  Hive 0.10: 1400s
•  Hive 0.11 (Phase 1): 65s
•  Trunk (Phase 3): 7.2s  (190x improvement)

Query 82: Inventory Analytics Joining 2 Large Fact Tables
•  Hive 0.10: 3200s
•  Hive 0.11 (Phase 1): 39s
•  Trunk (Phase 3): 14.9s  (200x improvement)

All results at Scale Factor 200 (approximately 200 GB of data).
Speed: Delivering Interactive Query
Query Time in Seconds

TPC-DS Query 52: Star Schema Join
•  Hive 0.12: 41.1s
•  Trunk (Phase 3): 4.2s

TPC-DS Query 55: Star Schema Join
•  Hive 0.12: 39.8s
•  Trunk (Phase 3): 4.1s

Test Cluster:
•  200 GB Data (Impala: Parquet; Hive: ORCFile)
•  20 Nodes, 24 GB RAM each, 6x disk each
Speed: Delivering Interactive Query
Query Time in Seconds

TPC-DS Query 28: Vectorization
•  Hive 0.12: 31s
•  Trunk (Phase 3): 9.8s

TPC-DS Query 12: Complex join (M-R-R pattern)
•  Hive 0.12: 22s
•  Trunk (Phase 3): 6.7s

Test Cluster:
•  200 GB Data (Impala: Parquet; Hive: ORCFile)
•  20 Nodes, 24 GB RAM each, 6x disk each
AMPLab Big Data Benchmark
AMPLab Query 1: Simple Filter Query
Query Time in Seconds (lower is better)

•  AMPLab Query 1a: Hive 0.10 (5 node EC2): 63s; Trunk (Phase 3): 1.6s
•  AMPLab Query 1b: Hive 0.10 (5 node EC2): 63s; Trunk (Phase 3): 2.3s
•  AMPLab Query 1c: Hive 0.10 (5 node EC2): 45s; Trunk (Phase 3): 9.4s

Stinger Phase 3 Cluster Configuration:
•  AMPLab Data Set (~135 GB Data)
•  20 Nodes, 24 GB RAM each, 6x disk each
AMPLab Big Data Benchmark
AMPLab Query 2: Group By IP Block and Aggregate
Query Time in Seconds (lower is better)

•  AMPLab Query 2a: Hive 0.10 (5 node EC2): 552s; Trunk (Phase 3): 104.3s
•  AMPLab Query 2b: Hive 0.10 (5 node EC2): 466s; Trunk (Phase 3): 118.3s
•  AMPLab Query 2c: Hive 0.10 (5 node EC2): 490s; Trunk (Phase 3): 172.7s

Stinger Phase 3 Cluster Configuration:
•  AMPLab Data Set (~135 GB Data)
•  20 Nodes, 24 GB RAM each, 6x disk each
AMPLab Big Data Benchmark
AMPLab Query 3: Correlate Page Rankings and Revenues Across Time
Query Time in Seconds (lower is better)

•  AMPLab Query 3a: Hive 0.10 (5 node EC2): 490s; Trunk (Phase 3): 40s
•  AMPLab Query 3b: Hive 0.10 (5 node EC2): 466s; Trunk (Phase 3): 145s

Stinger Phase 3 Cluster Configuration:
•  AMPLab Data Set (~135 GB Data)
•  20 Nodes, 24 GB RAM each, 6x disk each
How Stinger Phase 3 Delivers Interactive Query

Feature | Description | Benefit
•  Tez Integration | Tez is a significantly better engine than MapReduce | Latency
•  Vectorized Query | Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time | Throughput
•  Query Planner | Uses the extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of the input (beyond partition pruning) | Latency
•  Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics, including histograms etc. | Latency
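Both the planner and the Optiq-based optimizer lean on statistics gathered into the Metastore. A minimal sketch of collecting them, with a hypothetical table and column names (partitioned tables need an explicit PARTITION clause in these releases):

-- Sketch: gather the statistics the optimizer uses
ANALYZE TABLE store_sales COMPUTE STATISTICS;
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS ss_item_sk, ss_sales_price;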
SQL: Enhancing SQL Semantics
SQL Compliance
Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop.

Hive SQL Datatypes:
INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY, MAP, STRUCT, UNION, DATE, VARCHAR, CHAR

Hive SQL Semantics:
SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; windowing functions (OVER, RANK, etc.); custom Java UDFs; standard aggregation (SUM, AVG, etc.); advanced UDFs (ngram, XPath, URL); sub-queries in WHERE and HAVING; expanded JOIN syntax; SQL-compliant security (GRANT, etc.); INSERT/UPDATE/DELETE (ACID)

(The original slide color-codes each item as Available, new in Hive 0.12, or Roadmap.)
ORC File Format
• Columnar format for complex data types
• Built into Hive from 0.11
• Support for Pig and MapReduce via HCat
• Two levels of compression
– Lightweight type-specific and generic

• Built-in indexes (see the sketch after this list)
– Every 10,000 rows with position information
– Min, Max, Sum, Count of each column
– Supports seek to row number

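A minimal sketch of moving a table into ORC and setting the compression codec and index stride; the table names are hypothetical and the property values shown are simply the documented defaults:

-- Sketch: store a table as ORC (built into Hive from 0.11)
CREATE TABLE page_views_orc
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"         = "ZLIB",    -- generic codec layered over type-specific encoding
  "orc.row.index.stride" = "10000"    -- index entry every 10,000 rows
)
AS SELECT * FROM page_views;          -- page_views is a hypothetical source table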
SCALE: Interactive Query at Petabyte Scale
Sustained Query Times: Apache Hive 0.12 provides sustained, acceptable query times even at petabyte scale.
Smaller Footprint: Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster:
•  Larger block sizes
•  Columnar format arranges columns adjacent within the file for compression & fast access

File Size Comparison Across Encoding Methods (Dataset: TPC-DS Scale 500):
•  Encoded with Text: 585 GB (original size)
•  Encoded with RCFile: 505 GB (14% smaller)
•  Encoded with Parquet (Impala): 221 GB (62% smaller)
•  Encoded with ORCFile (Hive 0.12): 131 GB (78% smaller)
ORC File Format
• Hive 0.12
– Predicate Push Down (see the sketch after this list)
– Improved run length encoding
– Adaptive string dictionaries
– Padding stripes to HDFS block boundaries

• Trunk
– Stripe-based Input Splits
– Input Split elimination
– Vectorized Reader
– Customized Pig Load and Store functions

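Predicate pushdown into ORC is a session-level switch; a minimal sketch, reusing the hypothetical ORC table from above:

-- Sketch: let ORC skip stripes and row groups using its min/max indexes
SET hive.optimize.index.filter=true;
SELECT COUNT(*)
FROM page_views_orc
WHERE view_date = '2013-10-01';   -- row groups whose min/max exclude this value are skipped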
Vectorized Query Execution
• Designed for Modern Processor Architectures
– Avoid branching in the inner loop.
– Make the most use of L1 and L2 cache.

• How It Works
– Process records in batches of 1,000 rows (see the sketch after this list)
– Generate code from templates to minimize branching.

• What It Gives
– 30x improvement in rows processed per second.
– Initial prototype: 100M rows/sec on laptop

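Vectorized execution is likewise a session-level switch once the feature is present (it landed just after 0.12). A minimal sketch, again using the hypothetical ORC table and a hypothetical numeric column:

-- Sketch: process 1,000-row batches instead of one row at a time
SET hive.vectorized.execution.enabled=true;   -- early releases require ORC input
SELECT COUNT(*), SUM(page_size)
FROM page_views_orc;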
HDFS Buffer Cache
• Use memory mapped buffers for zero copy
– Avoid overhead of going through DataNode
– Can mlock the block files into RAM

• ORC Reader enhanced for zero-copy reads (see the sketch after this list)
– New compression interfaces in Hadoop

• Vectorization specific reader
– Read 1000 rows at a time
– Read into Hive’s internal representation

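On the Hive side the zero-copy ORC read path is also a flag; the name below is the one introduced as this work landed in later releases and is shown here as an assumption of that timeframe. The mlock behaviour additionally depends on HDFS centralized caching being configured on the cluster:

-- Sketch: zero-copy ORC reads through the new Hadoop read interfaces (assumed flag name)
SET hive.exec.orc.zerocopy=true;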
Next Steps
• Blog
http://hortonworks.com/blog/delivering-on-stinger-a-phase-3-progress-update/

• Stinger Initiative
http://hortonworks.com/labs/stinger/

• Stinger Beta: HDP-2.1 Beta, December, 2013

Thank You!

@acmurthy
@alanfgates
@owen_omalley
@hortonworks
© Hortonworks Inc. 2013. Confidential and Proprietary.

Editor's Notes

  • #14: query 52: star schema join followed by group/order (on different keys) with a selective filter; query 55: same
  • #15: query 28: join of 4 sub-queries; query 12: star schema join over a range of dates
  • #16: query 1: SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
  • #17: SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)
  • #18: SELECT sourceIP, totalRevenue, avgPageRank FROM (SELECT sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X') GROUP BY UV.sourceIP) ORDER BY totalRevenue DESC LIMIT 1
  • #20: With Hive and Stinger we are focused on enabling the SQL ecosystem and to do that we’ve put Hive on a clear roadmap to SQL compliance.That includes adding critical datatypes like character and date types as well as implementing common SQL semantics seen in most databases.