New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Analytic Database

© Cloudera, Inc. All rights reserved.
Greg Rahn | @gregrahn | Director of Product Management
Apache Impala: Recent advancements, performance benchmarks, and pro-tips

Agenda
• Improvements in recent Impala versions
• Latest performance comparisons
• Performance recommendations and considerations

Meet Impala

Apache Impala: Open Source & Open Standard
1. Millions of downloads since GA in May 2013
2. Majority adoption across Cloudera customers
3. Certification across key application partners (partner logos omitted)
4. De facto standard with multi-vendor support (vendor logos omitted), and others

The Leader in Interactive SQL Analytics for Hadoop
Impala delivers the best of both worlds
✔ Multi-User Performance & Usability
• 10x vs. alternatives with latest benchmarks
• Cost-based optimization allows for more users and tools to run a broader range of queries
✔ Compatibility
• Provides both ANSI SQL and vendor-specific extensions
• Compatibility with the leading BI partners
✔ Flexibility
• Supports the common native Hadoop file formats
• Parquet provides best-of-breed columnar performance across Hadoop frameworks
✔ Native Integration
• Unified with Hadoop resource management, metadata, security, and management

Key Benefits
An analytic database designed for Hadoop
High-Performance BI and SQL Analytics
Flexibility for Data and Use Case Variety
Cost-effective Scale for Today and Tomorrow
Go Beyond SQL with an Open Architecture

Advancements in recent Impala versions

Impala keeps getting faster
• Enables faster and more complex queries
• 5.4x speedup in standard benchmarks over the last 14 months
• Track record of constantly improving performance
[Chart: Impala Relative Performance on TPC-H, Nested TPC-H, and TPC-DS for Impala 2.3, 2.5, 2.6, 2.7, and 2.8, with a 100% baseline for each benchmark and data labels of 214%, 226%, and 544%]

Major themes across releases
Performance
v2.8
• Intra-node parallelism + codegen for TIMESTAMP = 8x faster compute stats
• Constant folding predicate expressions = 3x speedup
• Optimizations for long IN list predicates = 7.5x speedup
• Optimizations for metadata loading for tables w/ large # of blocks = 9x speedup
v2.6
• 3x faster performance on secure clusters (beneficial for large data extractions)
Real-time Update Support
v2.8
• GA of Impala/Kudu integration
• “Fast analytics on fast data”
• INSERT, UPDATE, DELETE
Cloud-Native Capabilities
v2.6
• Support reads/writes on S3
• Enable fast, flexible ETL and BI analytics across all data in any environment
• Performance Metrics: Analytics and BI on Amazon S3 with Apache Impala

Latest Impala performance comparisons
April 2017
vs. analytic database (Greenplum) & other SQL-on-Hadoop engines (Hive LLAP, Spark SQL, Presto)

Testing approach
• Start with TPC-DS 10TB scale factor
• Use TPC-DS v2.4 query templates, unmodified apart from permitted minor changes:
  • Casting syntax
  • CONCAT() vs. ||
  • Aliases for in-line views or columns
• Determine common set of queries
• Eliminate unsupported language and excessively long executions
• Single-user test
• Multi-user tests
• Leverage best practices for all systems

Environment details
Software:
• Impala 2.8 from CDH 5.10
• Greenplum Database 4.3.9.1
• Spark 2.1
• Presto 0.160
• Hive 2.1 with LLAP from HDP v2.5
Hardware:
• 7 worker nodes, each with
  • CPU: 2 x E5-2698 v4 @ 2.20GHz
  • Storage: 8 x 2TB HDDs
  • Memory: 256GB RAM
Workload:
• Data: TPC-DS 10TB & 1TB stored in
  • Parquet for Impala & Spark SQL
  • ORCFile for LLAP & Presto
  • Columnar storage for Greenplum
• Partitioned fact tables
Queries:
• 77 of the 99 TPC-DS queries using only permitted minor query modifications*
• 22 queries not used that required non-minor modifications (valid alternates not used either)

Impala outperforms analytical database
• Impala outperforms on both single and multi-user tests at 10TB
• Impala lead expands with concurrency
• Other SQL on Hadoop engines failed at 10TB
Single-user Test – 10TB
Metric           Impala   Greenplum   Impala Times Better
Total seconds    11,898   21,093      1.77x
Geometric mean   33       92          2.79x
Multi-user Test – 10TB
Streams   Impala QpH   Greenplum QpH   Impala Times Better
2         41           20              2.05x
4         75           20              3.75x
8         133          16              8.31x
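
The "Impala Times Better" column can be reproduced from the other two columns. The following is a minimal sketch of that arithmetic, assuming throughput ratios are Impala QpH divided by Greenplum QpH and elapsed-time ratios are Greenplum seconds divided by Impala seconds; the function names are illustrative and not part of any benchmark kit.

```python
# Figures copied from the 10TB single-user and multi-user tables above.
multi_user_qph = {
    2: (41, 20),    # streams: (Impala QpH, Greenplum QpH)
    4: (75, 20),
    8: (133, 16),
}

def times_better_throughput(impala_qph, other_qph):
    """Higher QpH is better, so divide Impala by the other system."""
    return impala_qph / other_qph

def times_better_elapsed(impala_seconds, other_seconds):
    """Lower elapsed time is better, so divide the other system by Impala."""
    return other_seconds / impala_seconds

throughput_ratios = {
    streams: round(times_better_throughput(impala, greenplum), 2)
    for streams, (impala, greenplum) in multi_user_qph.items()
}
total_seconds_ratio = round(times_better_elapsed(11_898, 21_093), 2)
geomean_ratio = round(times_better_elapsed(33, 92), 2)

print(throughput_ratios)      # {2: 2.05, 4: 3.75, 8: 8.31}
print(total_seconds_ratio)    # 1.77
print(geomean_ratio)          # 2.79
```

The ratio grows from 2.05x at 2 streams to 8.31x at 8 streams, which is the "lead expands with concurrency" point on the slide.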

TPC-DS 1TB: Single-user
• The analytical db cohort (Impala / GP) leads SQL on Hadoop
• Impala outperforms
  • Presto by 8.3x
  • Hive w/ LLAP by 4x
  • Spark SQL by 2.8x
TPC-DS 1TB Single-user Relative Performance (Lower is Better)
Metric    Impala   Greenplum   Spark SQL   Hive with LLAP   Presto
Total     100%     97%         283%        405%             826%
Geomean   100%     250%        367%        450%             1317%
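
The table reports both total time and geometric mean, and the two can rank systems differently (Greenplum is at 97% on total but 250% on geomean). A small sketch of why, using hypothetical per-query runtimes that are not benchmark data: one long-running query dominates the total but has limited effect on the geometric mean.

```python
import math

def total_seconds(runtimes):
    """Sum of per-query runtimes; dominated by the slowest queries."""
    return sum(runtimes)

def geometric_mean(runtimes):
    """exp(mean(log(x))): every query contributes proportionally."""
    return math.exp(sum(math.log(t) for t in runtimes) / len(runtimes))

# Hypothetical per-query runtimes in seconds (illustrative only).
engine_a = [2.0, 4.0, 8.0, 400.0]     # fast on most queries, one outlier
engine_b = [10.0, 20.0, 40.0, 100.0]  # uniformly slower, no outlier

for name, runs in (("engine_a", engine_a), ("engine_b", engine_b)):
    print(name, total_seconds(runs), round(geometric_mean(runs), 1))
```

Here engine_a has the larger total (414 s vs. 170 s) but the smaller geometric mean, which is why the deck shows both metrics side by side.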

TPC-DS 1TB: Multi-user
• At 1TB the analytic db cohort (Impala / GP) expands lead with concurrency
• Presto failed to complete
• With 16 streams Impala outperforms
  • Spark by ~22x
  • Hive by ~20x
  • Greenplum by ~3x
Multi-user TPC-DS 1TB Queries/Hour (Higher is Better)
Engine           4 streams   8 streams   16 streams
Impala           495         865         1,315
Greenplum        381         417         462
Spark SQL        76          70          61
Hive with LLAP   58          62          66
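
Two quantities can be read off the queries-per-hour figures: how much each engine's throughput changes going from 4 to 16 streams, and Impala's lead at 16 streams. A minimal sketch using only the numbers above; the helper names are illustrative.

```python
# Queries per hour from the multi-user TPC-DS 1TB results above.
qph = {
    "Impala":         {4: 495, 8: 865, 16: 1315},
    "Greenplum":      {4: 381, 8: 417, 16: 462},
    "Spark SQL":      {4: 76,  8: 70,  16: 61},
    "Hive with LLAP": {4: 58,  8: 62,  16: 66},
}

def scaling_factor(series, low=4, high=16):
    """Throughput at `high` streams relative to `low` streams."""
    return series[high] / series[low]

def impala_lead(engine, streams=16):
    """Impala QpH divided by the given engine's QpH at a stream count."""
    return qph["Impala"][streams] / qph[engine][streams]

for engine, series in qph.items():
    print(f"{engine:15s} 4->16 streams: {scaling_factor(series):.2f}x, "
          f"Impala lead at 16 streams: {impala_lead(engine):.1f}x")
```

A scaling factor below 1.0 (as for Spark SQL here) means total throughput dropped as streams were added.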

Impala leads analytic database performance
• Impala leads in performance against the traditional analytic database, including over 8x better performance for high concurrency workloads
• The difference is even greater compared to other SQL-on-Hadoop engines, with Impala nearly 22x faster for multi-user workloads
• Other SQL-on-Hadoop engines also required a simplified, smaller-scale benchmark (with Hive still requiring modifications and Presto unable to complete multi-user tests)
• Impala delivers analytic database performance as well as:
  • Flexibility
  • Cost-effective scale
  • Open architecture

Impala performance optimizations and tips
Data types, partitioning, and runtime filters

Data type selection
Data type selection affects performance:
• Computation: numerical types allow direct computation; string types require conversion
• On-disk storage size: numerical types are more compact
• More compact types also require less network traffic
General guidelines:
• Choose numerical types over character types for numerical data
• Use smallest data type that will accommodate the largest possible value
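
The "smallest data type" guideline can be sketched as a lookup over the signed ranges of Impala's integer types (TINYINT, SMALLINT, INT, BIGINT at 1, 2, 4, and 8 bytes). The helper below is a hypothetical illustration, not an Impala utility.

```python
# (type name, width in bytes) for Impala's signed integer types.
INTEGER_TYPES = [
    ("TINYINT", 1),
    ("SMALLINT", 2),
    ("INT", 4),
    ("BIGINT", 8),
]

def smallest_integer_type(min_value, max_value):
    """Return the narrowest integer type whose signed range covers the domain."""
    for name, width in INTEGER_TYPES:
        bits = width * 8
        low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("value domain does not fit in a 64-bit integer")

print(smallest_integer_type(1, 12))              # month number -> TINYINT
print(smallest_integer_type(1900, 2100))         # calendar year -> SMALLINT
print(smallest_integer_type(1, 18_000_000_000))  # large key domain -> BIGINT
```

The inputs should be the largest values the column could ever hold, not just the values currently loaded.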

Data type selection
Picking the incorrect data type can result in:
• Increase in on-disk storage by 40%
• 80% slower scans
• 80% slower aggregations
• 150% slower joins
• Increase in runtime memory utilization
Data set: L_ORDERKEY column from TPC-H 3TB
Values domain: 1 - 18,000,000,000
Number of distinct values: 4,500,000,000

Data type selection
Numeric Data Type Performance (relative to Bigint = 100%)
Data type   Size   Scan   Group by   Join
Bigint      100%   100%   100%       100%
Double      107%   100%   106%       164%
Decimal     100%   124%   104%       159%
String      144%   178%   179%       242%
Varchar     144%   180%   177%       252%

Partitioning design considerations
• Date-based partitioning is the most common
  • WHERE event_dt BETWEEN 20000101 AND 20000201
  • WHERE event_year = 2000 AND event_month = 1
• Partition keys can be column groups / multi-level, e.g.
  • By date, by hour
  • By date, by region

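Partitioning examples
Partitioned by year and month, filtered directly on the partition columns:
CREATE TABLE sales (...)
PARTITIONED BY (year INT, month INT);
SELECT ... FROM sales
WHERE year >= 2012
AND month IN (1, 2, 3);
Partitioned by a date key, filtered through a join to the date dimension:
CREATE TABLE sales (...)
PARTITIONED BY (date_key INT);
SELECT ...
FROM sales s
JOIN date_dim d USING (date_key)
WHERE d.year >= 2012
AND d.month IN (1, 2, 3);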
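
When the predicate is on the partition columns themselves, as in the year/month example, the set of partitions to read can be decided from partition metadata alone. A rough sketch of that idea, assuming a hypothetical table with five years of monthly partitions; the `prune` helper is illustrative and is not how Impala's planner is implemented.

```python
from itertools import product

# Hypothetical partition list for a table partitioned by (year, month):
# 5 years x 12 months.
partitions = list(product(range(2010, 2015), range(1, 13)))

def prune(partition_keys, predicate):
    """Keep only partitions whose key values satisfy the predicate."""
    return [keys for keys in partition_keys if predicate(*keys)]

# Equivalent of: WHERE year >= 2012 AND month IN (1, 2, 3)
kept = prune(partitions, lambda year, month: year >= 2012 and month in (1, 2, 3))

print(len(partitions), "partitions total,", len(kept), "to scan")
```

In the date_key example the filter is on date_dim columns, so the qualifying date_key values are only known once the dimension table has been read.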

Partition wisely
Too few partitions (granularity is too large):
• Data elimination is not effective
• Increases minimum unit of work
Too many partitions (granularity is too small):
• Many small data files hurt large queries; scans less efficient; limits parallelism
• Large number of files can cause metadata bloat and create bottlenecks on HDFS NameNode, Hive Metastore, Impala catalog service
General guidelines:
• Regularly compact tables to keep the number of files per partition under control and improve scan and compression efficiency
• Keep number of partitions under 20K (not a hard limit, mileage will vary)
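
A quick way to sanity-check a partitioning scheme against the 20K guideline is to estimate the partition count before creating the table. The sketch below assumes a hypothetical five-year retention window and treats 20,000 as the soft guideline from the slide; the region count is made up for illustration.

```python
from datetime import date

PARTITION_GUIDELINE = 20_000  # soft guideline from the slide, not a hard limit

def estimate_partitions(start, end, values_per_day=1):
    """Days in [start, end] times the number of sub-partitions per day."""
    days = (end - start).days + 1
    return days * values_per_day

# Hypothetical retention window: five calendar years.
start, end = date(2012, 1, 1), date(2016, 12, 31)

schemes = {
    "by date": 1,
    "by date, by hour": 24,
    "by date, by region (50 hypothetical regions)": 50,
}
for name, per_day in schemes.items():
    count = estimate_partitions(start, end, per_day)
    status = "ok" if count <= PARTITION_GUIDELINE else "over guideline"
    print(f"{name}: {count} partitions ({status})")
```

Under these assumptions daily partitioning stays well under the guideline, while adding an hourly or regional level multiplies the count past it.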

Dynamic partition pruning with runtime filters
Business question:
How much was sold in June?
CREATE TABLE store_sales (...)
PARTITIONED BY (ss_sold_date_sk INT);
SELECT
  d_year,
  sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON
  (d_date_sk = ss_sold_date_sk)
WHERE d_moy = 6
GROUP BY d_year;
Query plan: scan STORE_SALES (28 billion rows) and DATE_DIM (6,000 rows) -> Broadcast Join #1 (1.3 billion rows) -> Aggregate
The query planner does not know which values of d_date_sk will be returned, and therefore which fact table partitions need to be scanned (or eliminated).
But there’s clearly an opportunity to save some work: why send all 28 billion rows to the join?
Runtime filters construct the partition pruning predicate at runtime.

Step 1: Planner tells Join #1 to produce a Bloom filter for the qualifying distinct values of d_date_sk.
Bloom filter: a compact, probabilistic representation of a data set; essentially a sophisticated bitmap.
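
To make the "sophisticated bitmap" description concrete, here is a minimal Bloom filter sketch. It is a textbook-style illustration, not Impala's implementation: the bit-array size, the number of hash functions, the use of SHA-256, and the sample key values are all assumptions chosen for readability.

```python
import hashlib

class BloomFilter:
    """Fixed-size bitmap where each value sets `num_hashes` bit positions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit array

    def _positions(self, value):
        # Derive several bit positions per value by salting the hash input.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means definitely absent; True means present or a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

# Hypothetical d_date_sk values that qualified on the build side.
qualifying_keys = [2451697, 2451698, 2452062]
bloom = BloomFilter()
for key in qualifying_keys:
    bloom.add(key)

print(all(bloom.might_contain(k) for k in qualifying_keys))  # True
```

The property that matters for pruning is that there are no false negatives: a partition whose key was added is never skipped, while an occasional false positive only costs an unnecessary scan.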

Step 2: Join reads all rows from the build side (right input) and populates a Bloom filter containing all distinct values of d_date_sk.

Step 3: Query coordinator sends the filter to the store_sales scan before the scan starts.

Step 4: Scan eliminates all partitions that don’t have a match in the Bloom filter. Only 150 out of the 1824 partitions are read from store_sales.
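
The build, send, and prune steps can be sketched end to end on toy data. The tables below are hypothetical miniatures, and a plain Python set stands in for the Bloom filter (so this sketch has no false positives, unlike a real Bloom filter); the function names are illustrative.

```python
# Hypothetical miniature date_dim rows: (d_date_sk, d_year, d_moy).
date_dim = [
    (1, 2000, 5), (2, 2000, 6), (3, 2000, 6),
    (4, 2001, 6), (5, 2001, 7),
]

# Hypothetical store_sales partitions: ss_sold_date_sk -> row count.
store_sales_partitions = {1: 100, 2: 120, 3: 90, 4: 110, 5: 130}

def build_runtime_filter(dim_rows, month):
    """Build side: collect d_date_sk values that pass WHERE d_moy = month."""
    return {date_sk for date_sk, _year, moy in dim_rows if moy == month}

def prune_partitions(partitions, runtime_filter):
    """Scan side: skip partitions whose key has no match in the filter."""
    return {key: rows for key, rows in partitions.items() if key in runtime_filter}

runtime_filter = build_runtime_filter(date_dim, month=6)
scanned = prune_partitions(store_sales_partitions, runtime_filter)

print(sorted(scanned))                  # [2, 3, 4]
print(sum(scanned.values()), "of",
      sum(store_sales_partitions.values()), "rows scanned")  # 320 of 550
```

The same shape applies at benchmark scale: the filter is small because it is built from the dimension table, and the savings come from the fact table partitions that are never read.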

Step 5: Rows coming out of the scan are reduced from 28 billion to 1.3 billion, so the STORE_SALES input to Join #1 shrinks to 1.3 billion rows.

In summary
• Impala offers
  • Flexibility
  • Cost-effective scale
  • Open architecture
  • Leading performance
• Recent improvements/enhancements:
  • Performance
  • Real-time and cloud capabilities
• Be sure to check out the Impala Cookbook (SlideShare) for more performance pro-tips

Thank you
Downloads:
https://cloudera.com/downloads
Interested in contributing?
http://impala.io
