Fast SQL on Hadoop, Really?

1 © Hortonworks Inc. 2011–2018. All rights reserved
Fast SQL for Big Data
Apache Hive and Apache Druid
Alan Gates
Hortonworks Co-founder, Apache Hive PMC member
@alanfgates

7000 analysts, 80ms average latency, 1PB data.
250k BI queries per hour
On demand deep reporting in the cloud over
100TB in minutes.

• Ran all 99 TPCDS queries
• Total query runtime have improved multifold in each release!
Benchmark journey
TPCDS 10TB scale on 10 node cluster
HDP 2.5
Hive1
HDP 2.5
LLAP
HDP 2.6
LLAP
25x 3x 2x
HDP 3.0
LLAP
2016 20182017
ACID
tables

Hive LLAP v Presto v Spark SQL
• TPC-DS, scale 3TB, all 99 queries, not run by Hortonworks (nor at our request)
• Total time to run (seconds):
LLAP: 5,517 Presto: 12,948 Spark SQL: 26,247
• LLAP faster than Presto on 83/97 queries, Spark SQL on 92/96 queries
• More details:
• Hive 3.1 LLAP, Presto 0.208e, Spark 2.3.1
• 19 worker nodes, 84G each
• Source mr3.postech.ac.kr/blog/2018/10/30/performance-evaluation-0.4/

Hive LLAP - MPP Performance at Hadoop Scale
Deep
Storage
Hadoop Cluster
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
Query
Coordinators
Coord-
inator
Coord-
inator
Coord-
inator
HiveServer2
(Query
Endpoint)
ODBC /
JDBC
SQL
Queries In-Memory Cache
(Shared Across All Users)
HDFS and
Compatible
S3 WASB Isilon

Aggressive Caching

Caching in LLAP
• Fine grained (by row group and column) and compact (dictionary encoding, RLE)
• Important in environment with PBs of data but common queries only touch 100s of GB
• Prioritized – indexes cached with higher priority
• Off heap to avoid GC
• Supports spill to SSD – important in the cloud
• Uses LRFU replacement algorithm to avoid large scans purging the cache

Query result cache
Returns results directly from storage (e.g.
HDFS) without actually executing the query
If the same query has run before and the
underlying data has not changed
Important for dashboards, reports etc.
where repetitive queries are common
Uses transactions to determine when
underlying data has changed
Without
cache
With
cache

Metastore Cache
• With query execution time being < 1 sec, compilation time starts to dominate
• Metadata retrieval is often significant part of compilation time. Most of it is in RDBMS
queries.
• Cloud RDBMS As a Service is often slower, and frequent queries leads to throttling.
• Metadata cache speeds compilation time by around 50% with on prem MySQL.
Significantly more improvement with cloud RDBMS.
• Cache is consistent in single metastore setup, eventually consistent with HA setup.
Consistent HA setup support is in the works.

New in Hive 3:
Materialized Views

Possible workflow
1. Create materialized view using Hive tables
• Stored by Hive or Druid
2. User or dashboard sends queries to Hive
• Hive rewrites queries using available materialized views
• Execute rewritten query
Dashboards, BI tools
CREATE MATERIALIZED VIEW `ssb_mv`
STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler'
ENABLE REWRITE
AS
<query>;
DBA, recommendation system
①
②
Data
Queries

Materialized view-based rewriting example
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT <dims>,
lo_revenue,
lo_extendedprice * lo_discount AS d_price,
lo_revenue - lo_supplycost
FROM
customer, dates, lineorder, part, supplier
WHERE
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and lo_custkey = c_custkey;
• Query
SELECT sum(lo_extendedprice * lo_discount)
FROM
lineorder, dates
WHERE
lo_orderdate = d_datekey
and d_year = 2013
and lo_discount between 1 and 3;
• Materialized view-based rewriting
SELECT SUM(d_price)
FROM mv
WHERE
d_year = 2013
and lo_discount between 1 and 3;
supplier
part
dates
customerlineorder
d_year lo_discount <dims> d_price
2013 2 ... 7.55
2014 4 ... 432.60
2013 2 ... 34.45
2012 2 ... 2.05
… … ... …
mv contents
sum
42.0
…
Query results

Materialized view - Maintenance
• Partial table rewrites are supported
• Typical: Denormalize last month of data only
• Rewrite engine will produce union of latest and historical data
• Updates to base tables
• Invalidates views, but
• Can choose to allow stale views (max staleness) for performance
• Can partial match views and compute delta after updates
• Incremental updates
• Common classes of views allow for incremental updates
• Others need full refresh

Optimizer Improvements

SELECT * FROM
( SELECT AVG(ss_list_price) B1_LP,
COUNT(ss_list_price) B1_CNT ,COUNT(DISTINCT
ss_list_price) B1_CNTD
FROM store_sales
WHERE ss_quantity BETWEEN 0 AND 5 AND
(ss_list_price BETWEEN 11 and 11+10 OR
ss_coupon_amt BETWEEN 460 and 460+1000 OR
ss_wholesale_cost BETWEEN 14 and 14+20)) B1,
( SELECT AVG(ss_list_price) B2_LP,
COUNT(ss_list_price) B2_CNT ,COUNT(DISTINCT
ss_list_price) B2_CNTD
FROM store_sales
WHERE ss_quantity BETWEEN 6 AND 10 AND
(ss_list_price BETWEEN 91 and 91+10 OR
ss_coupon_amt BETWEEN 1430 and 1430+1000 OR
ss_wholesale_cost BETWEEN 32 and 32+20)) B2,
. . .
LIMIT 100;
TPCDS SQL query 28 joins 6 instances of store_sales table
Shared scan - 4x improvement!
RS RS RS RS RS
Scan
store_sales
Combined OR’ed B1-B6 Filters
B1 Filter B2 Filter B3 Filter B4 Filter B5 Filter
Join

• Dramatically improves performance of very selective joins
• Builds a bloom filter from one side of join and filters rows from other side
• Skips scan and further evaluation of rows that would not qualify the join
Dynamic Semijoin Reduction - 7x improvement for q72
SELECT …
FROM sales JOIN time ON
sales.time_id = time.time_id
WHERE time.year = 2014 AND
time.quarter IN ('Q1', 'Q2’)
Reduced scan on sales

Statistics (not new)
• Statistics collection can be set to automatic or manual
• Used extensively in join selection
• Without statistics much of the optimizer will not be used

⬢ Solution
● Query fails because of stats estimation error
● Runtime sends observed statistics back to
coordinator
● Statistics overrides are created at session, server
or global level
● Query is replanned and resubmitted
Optimizer is learning from planning mistakes
⬢ Symptoms
● Memory exhaustion due to under
provisioning
● Excessive runtime (future)
● Excessive spilling (future)

Apache Druid

Druid capabilities
• Streaming ingestion capability
• Data Freshness – analyze events as they occur
• Fast response time (ideally < 1sec query time)
• Arbitrary slicing and dicing
• Multi-tenancy – 1000s of concurrent users
• Scalability and Availability
• Rich real-time visualization with Superset
Apache Druid is a distributed, real-time, column-oriented
datastore designed to quickly ingest and index large amounts
of data and make it available for real-time query.

Druid: Fast Facts
Most Events per Day
30 Billion Events / Day
(Metamarkets)
Most Computed Metrics
1 Billion Metrics / Min
(Jolata)
Largest Cluster
200 Nodes
(Metamarkets)
Largest Hourly Ingestion
2TB per Hour
(Netflix)

Hive and Druid, Better Together
Technology Strengths Issues
Hive SQL 2011, JDBC/ODBC
Fast scans
ACID
Not optimized for slice and dice and drill down (OLAP
cubing) operations
Druid Dimensional aggregates support OLAP cubes
Timeseries queries
Realtime ingestion of streaming data
Lacks SQL interface
No joins
Problem: You don't want two systems to manage and load data into
Solution: For data that fits best in Druid, load it in Druid and access it with Hive
• Hive supports push down of queries to Druid, optimizer knows what to push and what to run in Hive
• Enables SQL and JDBC/ODBC access to data in Druid
• Enables join of historical and realtime data
• Enables Hive support of slice & dice, drill down for OLAP cubing
• Can also create materialized views in Hive and store them in Druid

Druid Connector
Realtime Node
Realtime Node
Realtime Node
Broker HiveServer2
Instantly analyze kafka data with milliseconds latency

Druid Connector - Joins between Hive and realtime data in Druid
Bloom filter pushdown greatly reduces data transfer
Send promotional email to all customers from CA who purchased more than $1000 worth of merchandise today.
create external table sales(`__time` timestamp, quantity int, sales_price double,customer_id bigint, item_id int, store_id int)
stored by 'org.apache.hadoop.hive.druid.DruidStorageHandler'
tblproperties ( "kafka.bootstrap.servers" = "localhost:9092", "kafka.topic" = "sales-topic",
"druid.kafka.ingestion.maxRowsInMemory" = "5");
create table customers (customer_id bigint, first_name string, last_name string, email string, state string);
select email from customers join sales using customer_id where to_date(sales.__time) = date ‘2018-09-06’
and quantity * sales_price > 1000 and customers.state = ‘CA’;

Tips for Optimizing Hive

Making Your Queries Blaze in Hive 3
• Use a columnar format
• We recommend ORC; ORC or Parquet much better for DW queries than row oriented formats
• Use the right tool for the right job, all in Hive
• LLAP for BI queries
• Tez for ETL/batch
• Druid for ROLAP and realtime ingestion
• Do not use MapReduce as your Hive engine, it is very slow
• Keep statistics current on your data
• Define materialized views for common joins and aggregations
• Turn on ACID – it enables query cache and materialized view partial rewrites

SOLUTIONS: Heuristic recommendation engine
Fully self-serviced query and storage optimization

Questions?

Fast SQL on Hadoop, Really?

More Related Content

What's hot (20)

Similar to Fast SQL on Hadoop, Really? (20)

More from DataWorks Summit (20)

Recently uploaded (20)

Fast SQL on Hadoop, Really?