Indexed Hive

A quick demonstration of Hive performance acceleration
using indexes
By:
Prafulla Tekawade
Nikhil Deshpande

www.persistentsys.com

Summary

• This presentation describes a performance experiment that uses indexes in Hive to accelerate query execution.
• The slides cover:
• Indexes
• A specific set of GROUP BY queries
• The rewrite technique
• The performance experiment and results

© 2010 Persistent Systems Ltd www.persistentsys.com

Hive usage

• HDFS spreads and scatters the data across different locations (data nodes).
• Data is dumped and loaded into HDFS ‘as is’.
• Only one view of the data: the original data structure and layout.
• Data is typically append-only.
• Processing times are dominated by full data scan times.
Can data access times be better?


Hive usage

What can be done to speed up queries?
Cut down the data I/O. Less data means faster processing.

Different ways to get performance
• Columnar storage
• Data partitioning
• Indexing (different view of same data)
• …


Hive Indexing

• Provides a key-based view of the data
• Key data is duplicated
• Storage layout favors search and lookup performance
• Provides better data access for certain operations
• A cheaper alternative to full data scans!
How cheap?
An order of magnitude better in certain cases!


What does the index look like?

An index is a table with 3 columns:
hive> describe default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__;
OK
l_shipdate string (key)
_bucketname string (reference to values)
_offsets array<string> (reference to values)

Data in the index looks like:
hive> select * from default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__ limit 2;
OK
1992-01-08 hdfs://hadoop1:54310/user/…/lineitem.tbl ["662368"]
1992-01-16 hdfs://hadoop1:54310/user/…/lineitem.tbl ["143623","390763","637910"]
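
The three-column layout above can be mimicked in a few lines of Python. This is a minimal sketch, not Hive's index builder: the `build_index` helper, the sample rows and the file name `lineitem.tbl` are hypothetical, and only the (key, file, offsets) shape is taken from the slide.

```python
from collections import defaultdict

def build_index(rows, bucketname):
    """Build index rows shaped like (key, _bucketname, _offsets).

    rows: iterable of (byte_offset, key) pairs read from a single file.
    """
    offsets_by_key = defaultdict(list)
    for offset, key in rows:
        # Offsets are kept as strings to mirror the array<string> column.
        offsets_by_key[key].append(str(offset))
    # One index row per distinct key, emitted in key order for readability.
    return [(key, bucketname, offsets_by_key[key]) for key in sorted(offsets_by_key)]

# Hypothetical base-table rows: (byte offset in the file, l_shipdate)
rows = [(0, "1992-01-16"), (120, "1992-01-08"), (245, "1992-01-16")]
index = build_index(rows, "lineitem.tbl")
for entry in index:
    print(entry)
```

Each key appears once, and the length of its offsets list equals the number of base-table rows carrying that key, which is the property the later rewrites rely on.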


Hive index in HQL

• SELECT (mapping, projection, association, given key,
fetch value)
• WHERE (filters on keys)
• GROUP BY (grouping on keys)
• JOIN (join key as index key)

Indexes have high potential for accelerating a wide range of queries.


Hive Index
• Index as Reference
• Index as Data

This demonstration uses the Index as Data technique to show an order-of-magnitude performance gain!
• Uses a Query Rewrite technique to transform queries on the base table into queries on the index table.
• Limited applicability currently (e.g. the demo is based on GROUP BY), but the technique itself has wide potential.
• Also a very quick way to demonstrate the importance of indexes for performance (no deep optimizer/execution engine modifications).


Indexes and Query Rewrites

Demo targeting:
• GROUP BY, aggregation
• Index as Data
• Group By Key = Index Key
• Query rewritten to use indexes, but still a valid query
(nothing special in it!)


Query Rewrites: simple gb

SELECT DISTINCT l_shipdate
FROM lineitem;

SELECT l_shipdate
FROM __lineitem_shipdate_idx__;


Query Rewrites: simple agg

SELECT l_shipdate, COUNT(1)
FROM lineitem
GROUP BY l_shipdate;

SELECT l_shipdate, size(`_offsets`)
FROM __lineitem_shipdate_idx__;
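
Why the two queries return the same counts can be checked with a toy model. The sketch below is illustrative only: the sample dates and the dictionary-based index are hypothetical, and it assumes the index is current and holds exactly one row per key.

```python
from collections import Counter

# Hypothetical base table: one l_shipdate value per row.
lineitem = ["1992-01-08", "1992-01-16", "1992-01-16", "1992-01-16"]

# Hypothetical index over the same rows: key -> list of row offsets.
index = {"1992-01-08": ["0"], "1992-01-16": ["1", "2", "3"]}

# Original query: scan every row, then GROUP BY l_shipdate with COUNT(1).
base_result = dict(Counter(lineitem))

# Rewritten query: one pass over the index, size(_offsets) per key.
index_result = {key: len(offsets) for key, offsets in index.items()}

print(base_result == index_result)
```

The rewritten form touches one entry per distinct key instead of one per base-table row, which is where the I/O saving comes from.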


Query Rewrites: gb + where

SELECT l_shipdate, COUNT(1)
FROM lineitem
WHERE YEAR(l_shipdate) >= 1992
AND YEAR(l_shipdate) <= 1996
GROUP BY l_shipdate;

SELECT l_shipdate, size(`_offsets`)
FROM __lineitem_shipdate_idx__
WHERE YEAR(l_shipdate) >= 1992
AND YEAR(l_shipdate) <= 1996;


Query Rewrites: gb on func(key)

SELECT YEAR(l_shipdate) AS Year,
COUNT(1) AS Total
FROM lineitem
GROUP BY YEAR(l_shipdate);

SELECT Year, SUM(cnt) AS Total
FROM (SELECT YEAR(l_shipdate) AS Year,
size(`_offsets`) AS cnt
FROM __lineitem_shipdate_idx__) AS t
GROUP BY Year;
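
The two-level shape of this rewrite (per-key sizes in the inner query, a sum per year in the outer one) can be sketched as follows. The index contents and the `year` helper are hypothetical stand-ins for the index table and Hive's YEAR() function.

```python
from collections import defaultdict

# Hypothetical index: key -> list of row offsets.
index = {
    "1992-01-08": ["0"],
    "1992-01-16": ["1", "2"],
    "1993-03-02": ["3"],
}

def year(date_string):
    """Stand-in for YEAR(): take the leading yyyy of a yyyy-mm-dd string."""
    return int(date_string[:4])

# Inner query: (YEAR(l_shipdate), size(_offsets)) for every index row.
per_key = [(year(key), len(offsets)) for key, offsets in index.items()]

# Outer query: SUM(cnt) ... GROUP BY Year.
totals = defaultdict(int)
for shipment_year, cnt in per_key:
    totals[shipment_year] += cnt

print(dict(totals))
```

A GROUP BY is still needed here because several index keys map to the same year; it simply runs over the much smaller index.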


Histogram Query

SELECT YEAR(l_shipdate) AS Year,
MONTH(l_shipdate) AS Month,
COUNT(1) AS Monthly_shipments
FROM lineitem
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);

SELECT YEAR(l_shipdate) AS Year,
MONTH(l_shipdate) AS Month,
SUM(sz) AS Monthly_shipments
FROM (SELECT l_shipdate, SIZE(`_offsets`) AS sz
FROM __lineitem_shipdate_idx__) AS t
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);

Year on Year Query

SELECT y1.Month AS Month, y1.shipments AS Y1_shipments, y2.shipments AS Y2_shipments,
(y2.shipments - y1.shipments) / y1.shipments AS Delta
FROM (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
COUNT(1) AS Shipments
FROM lineitem
WHERE YEAR(l_shipdate) = 1997
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
JOIN (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
COUNT(1) AS Shipments
FROM lineitem
WHERE YEAR(l_shipdate) = …
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
ON y1.Month = y2.Month;


Year on Year Query

SELECT y1.Month AS Month, y1.shipments AS y1_shipments,
y2.shipments AS y2_shipments,
(y2.shipments - y1.shipments) / y1.shipments AS delta
FROM (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
SUM(sz) AS shipments
FROM (SELECT l_shipdate, size(`_offsets`) AS sz
FROM __lineitem_shipdate_idx__) AS t1
WHERE YEAR(l_shipdate) = 1997
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
JOIN (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
SUM(sz) AS shipments
FROM (SELECT l_shipdate, size(`_offsets`) AS sz
FROM __lineitem_shipdate_idx__) AS t2
WHERE YEAR(l_shipdate) = …
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
ON y1.Month = y2.Month;


Performance tests

Hardware and software configuration:
• 2 server-class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM)
• 2-node Hadoop cluster (0.20.2), untuned and unoptimized; data neither partitioned nor clustered; Hive tables stored in row-store format; HDFS replication factor: 2
• Hive development branch (~0.5)
• Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE: 4GB RAM)
• Queries on TPC-H data (the lineitem table is about 70% of TPC-H data size, e.g. for TPC-H 30GB data: 21GB lineitem, ~180 million tuples)


Perf gain for Histogram Query

(Graphs not to scale)

Time (sec)   1M       1G        10G       30G
q1_noidx     24.161   76.79     506.005   1551.555
q1_idx       21.268   27.292    35.502    86.133


Perf gain for Year on Year Query

(Graphs not to scale)

Time (sec)   1M       1G        10G       30G
q1_noidx     73.66    130.587   764.619   2146.423
q1_idx       69.393   75.493    92.867    190.619
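
The speedup implied by the two timing tables can be computed directly. The snippet below only divides the reported numbers; the variable and function names are illustrative.

```python
# Timings in seconds, copied from the Histogram and Year on Year tables.
sizes = ["1M", "1G", "10G", "30G"]
histogram = {
    "noidx": [24.161, 76.79, 506.005, 1551.555],
    "idx": [21.268, 27.292, 35.502, 86.133],
}
year_on_year = {
    "noidx": [73.66, 130.587, 764.619, 2146.423],
    "idx": [69.393, 75.493, 92.867, 190.619],
}

def speedups(timings):
    """Ratio of non-indexed to indexed run time for each data size."""
    return [no_idx / idx for no_idx, idx in zip(timings["noidx"], timings["idx"])]

for name, timings in [("histogram", histogram), ("year on year", year_on_year)]:
    for size, ratio in zip(sizes, speedups(timings)):
        print(f"{name:>12} {size:>4}: {ratio:.1f}x")
```

At 1M the two variants are nearly identical, while at 30G the indexed histogram query is roughly 18 times faster and the indexed year-on-year query roughly 11 times faster, which matches the order-of-magnitude claim made earlier in the deck.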


Why does the index perform better?

Reducing data increases I/O efficiency
• If you need only X, separate X from the rest
• Less data to process, better memory footprint, better locality of reference…

Exploiting storage layout optimization
• “Right tool for the job”, e.g. two ways to do GROUP BY: sort + agg, or hash & agg
• Sort step already done in index!
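
The two GROUP BY strategies named above can be contrasted in a short sketch. It illustrates the general techniques only and says nothing about how Hive's operators are implemented; the key list is hypothetical.

```python
from collections import Counter
from itertools import groupby

# Hypothetical keys, already in sorted order as they would be in a sorted index.
keys_sorted = ["1992-01-08", "1992-01-16", "1992-01-16", "1993-03-02"]

# Hash aggregation: accepts any input order, but keeps one counter
# per distinct key in memory for the whole pass.
hash_counts = dict(Counter(keys_sorted))

# Sort-based aggregation: requires sorted input, then streams through it
# holding only the current group at any time.
stream_counts = {key: sum(1 for _ in group) for key, group in groupby(keys_sorted)}

print(hash_counts == stream_counts)
```

When the input is already sorted, the streaming variant skips the sort entirely, which is the saving the last bullet refers to.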

Parallelization

• Process the index data in same
manner as base table, distribute the
processing across nodes
• Scalable!


Near future

More rewrites
Partitioning index data per key.
Run-time operators for index usage (lookup, join, filter etc., since rewrites are only a partial solution).
Optimizer support for index operators.
Cost-based optimizer to choose between index and non-index plans.
…


Index Design

[Diagram: layered architecture, top to bottom]
• Hive DDL Compiler, Index Builder, Hive Query Compiler, Query Rewrite Engine
• Hive DDL Engine, Hive Query Engine
• Hadoop MR
• HDFS


Hive Compiler

[Diagram: compiler components, with output going to Hadoop MR]
• Parser / AST Generator
• Semantic Analyzer
• Query Rewrite Engine
• Optimizer / Operator Plan Generator
• Execution Plan Generator


Query Rewrite Engine

[Diagram: a Query Tree enters the Rule Engine and leaves as a Rewritten Query Tree. The Rule Engine draws on a Rewrite Rules Repository; each Rewrite Rule pairs a Rewrite Trigger Condition with a Rewrite Action.]
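
The rule structure in this diagram (a trigger condition paired with an action, applied by a rule engine) can be sketched in Python. Everything here is hypothetical: the dictionary-based query representation, the rule name and the helper functions are illustrative stand-ins, not the classes used in the Hive patch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Query = Dict[str, Optional[str]]  # toy stand-in for a query tree

@dataclass
class RewriteRule:
    name: str
    trigger: Callable[[Query], bool]   # rewrite trigger condition
    action: Callable[[Query], Query]   # rewrite action

def group_by_is_index_key(query: Query) -> bool:
    """Fire only for COUNT(1) grouped on the indexed column."""
    key = query.get("group_by")
    return (
        key is not None
        and key == query.get("index_key")
        and query.get("agg") == "COUNT(1)"
    )

def scan_index_instead(query: Query) -> Query:
    """Return a new query that reads the index table instead of the base table."""
    rewritten = dict(query)  # leave the input untouched
    rewritten["table"] = query["index_table"]
    rewritten["agg"] = "size(`_offsets`)"
    rewritten["group_by"] = None
    return rewritten

def apply_rules(query: Query, rules: List[RewriteRule]) -> Query:
    """Apply each rule whose trigger condition holds, in repository order."""
    for rule in rules:
        if rule.trigger(query):
            query = rule.action(query)
    return query

rules = [RewriteRule("gb_to_idx", group_by_is_index_key, scan_index_instead)]
query = {
    "table": "lineitem",
    "group_by": "l_shipdate",
    "agg": "COUNT(1)",
    "index_key": "l_shipdate",
    "index_table": "__lineitem_shipdate_idx__",
}
rewritten = apply_rules(query, rules)
print(rewritten["table"], rewritten["agg"])
```

Keeping the trigger separate from the action means a query that does not match any rule passes through unchanged, so the rewritten output is always still a valid query.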


Learning Hive

• The Hive compiler is not driven by ‘Syntax Directed Translation’.
• It is tree-visitor based, with data structures kept separate from compiler logic.
• The tree is immutable (harder to change, harder to rewrite).
• Query semantic information is kept apart from the query lexical/parse tree, in separate data structures that are loosely bound in a Query Block data structure, which is itself only loosely bound to the parse tree. There is no larger data flow graph off which everything is hung, which makes queries very difficult to rewrite.
• The optimizer is not yet mature.
• It doesn’t handle many ‘obvious’ opportunities (e.g. sort group by for cases other than base table scans).
• It is rule-based, not cost-based; no stats are collected.
• Query tuning is a harder job (it requires special knowledge of the optimizer guts: what works and what doesn’t).
• Setting up a development environment is tedious (the build system relies heavily on an internet connection, which is troublesome behind restrictive firewalls).
• The community is very active: dependent JIRAs are a fast-moving target, and development has to keep up with them actively (e.g. when branching, refresh frequently from trunk).


How to get it?
• Needs a working Hadoop cluster (tested with 0.20.2)
• For Hive with indexing support:
• The Hive Index DDL patch (JIRA 417) is now part of Hive trunk
https://issues.apache.org/jira/browse/HIVE-417
• Get the Hive branch with the Index Query Rewrite patch applied from GitHub (a fork/branch of the Hive development tree, a snapshot of the Hive + Index DDL source tree; not the latest, but a single place to get everything)
http://github.com/prafullat/hive
• Refer to the Hive documentation for build instructions
http://wiki.apache.org/hadoop/Hive/GettingStarted#Downloading_and_building
• See the ql/src/test/queries/client/positive/ql_rewrite_gbtoidx.q test.


Thank You!
prafulla_tekawade at persistent dot co dot in
nikhil_deshpande at persistent dot co dot in

