1© Cloudera, Inc. All rights reserved.
Greg Rahn | @gregrahn | Director of Product Management
Apache Impala: Recent advancements, performance benchmarks, and pro-tips
Agenda
• Improvements in recent Impala versions
• Latest performance comparisons
• Performance recommendations and considerations
Meet Impala
Apache Impala: Open Source & Open Standard
1. Millions of downloads since GA in May 2013
2. Majority adoption across Cloudera customers
3. Certification across key application partners [partner logos]
4. De facto standard with multi-vendor support [vendor logos] and others
The Leader in Interactive SQL Analytics for Hadoop
Impala delivers the best of both worlds
Multi-User Performance & Usability ✔
• 10x vs. alternatives with latest benchmarks
• Cost-based optimization allows for more users and tools to run a broader range of queries
Compatibility ✔
• Provides both ANSI SQL and vendor-specific extensions
• Compatibility with the leading BI partners
Flexibility ✔
• Supports the common native Hadoop file formats
• Parquet provides best-of-breed columnar performance across Hadoop frameworks
Native Integration ✔
• Unified with Hadoop resource management, metadata, security, and management
Key Benefits
An analytic database designed for Hadoop
High-Performance BI and SQL Analytics
Flexibility for Data and Use Case Variety
Cost-effective Scale for Today and Tomorrow
Go Beyond SQL with an Open Architecture
Advancements in recent Impala versions
Impala keeps getting faster
• Enables faster and more complex queries
• 5.4x speedup in standard benchmarks over the last 14 months
• Track record of constantly improving performance
[Chart: Impala Relative Performance on TPC-H, Nested TPC-H, and TPC-DS for Impala 2.3, 2.5, 2.6, 2.7, and 2.8. Baseline bars are 100%; the labeled results are 214%, 226%, and 544%.]
Major themes across releases
Performance
v2.8
• Intra-node parallelism + codegen for TIMESTAMP = 8x faster compute stats
• Constant folding of predicate expressions = 3x speedup
• Optimizations for long IN-list predicates = 7.5x speedup
• Optimizations for metadata loading for tables with a large number of blocks = 9x speedup
v2.6
• 3x faster performance on secure clusters (beneficial for large data extractions)
Real-time Update Support
v2.8
• GA of Impala/Kudu integration
• "Fast analytics on fast data"
• INSERT, UPDATE, DELETE
Cloud-Native Capabilities
v2.6
• Support for reads/writes on S3
• Enable fast, flexible ETL and BI analytics across all data in any environment
• Performance Metrics: Analytics and BI on Amazon S3 with Apache Impala
Latest Impala performance comparisons
April 2017
vs. analytic database (Greenplum) & other SQL-on-Hadoop (Hive LLAP, Spark SQL, Presto)
Testing approach
• Start with TPC-DS 10TB scale factor
• Use unmodified TPC-DS v2.4 query templates, allowing only minor syntax adjustments:
  • Casting syntax
  • CONCAT() vs. ||
  • Aliases for in-line views or columns
• Determine common set of queries
  • Eliminate unsupported language and excessively long executions
• Single-user test
• Multi-user tests
• Leverage best practices for all systems
Environment details
Software:
• Impala 2.8 from CDH 5.10
• Greenplum Database 4.3.9.1
• Spark 2.1
• Presto 0.160
• Hive 2.1 with LLAP from HDP v2.5
Hardware:
• 7 worker nodes, each with:
  • CPU: 2 x E5-2698 v4 @ 2.20GHz
  • Storage: 8 x 2TB HDDs
  • Memory: 256GB RAM
Workload:
• Data: TPC-DS 10TB & 1TB stored in:
  • Parquet for Impala & Spark SQL
  • ORCFile for LLAP & Presto
  • Columnar storage for Greenplum
• Partitioned fact tables
Queries:
• 77 of the 99 TPC-DS queries, using only permitted minor query modifications*
• 22 queries that required non-minor modifications were not used (valid alternates were not used either)
Impala outperforms analytical database
• Impala outperforms on both single and multi-user tests at 10TB
• Impala lead expands with concurrency
• Other SQL-on-Hadoop engines failed at 10TB
Single-user Test – 10TB
Metric           Impala   Greenplum   Impala Times Better
Total seconds    11,898   21,093      1.77x
Geometric mean   33       92          2.79x
Multi-user Test – 10TB
Streams   Impala QpH   Greenplum QpH   Impala Times Better
2         41           20              2.05x
4         75           20              3.75x
8         133          16              8.31x
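The "Impala Times Better" figures follow directly from the raw numbers in the 10TB tables. The sketch below recomputes them; the helper name `times_better` is illustrative and not part of the benchmark kit. For elapsed time (lower is better) the ratio is Greenplum over Impala, and for queries per hour (higher is better) it is Impala over Greenplum.

```python
def times_better(impala: float, other: float, higher_is_better: bool) -> float:
    """Return how many times better Impala's figure is than the other system's."""
    ratio = impala / other if higher_is_better else other / impala
    return round(ratio, 2)

# Single-user test at 10TB: elapsed seconds, so lower is better.
# Each entry is (Impala, Greenplum), copied from the table.
single_user = {
    "Total seconds": (11_898, 21_093),
    "Geometric mean": (33, 92),
}

# Multi-user test at 10TB: queries per hour (QpH), so higher is better.
# Keys are the number of concurrent streams.
multi_user = {2: (41, 20), 4: (75, 20), 8: (133, 16)}

single_ratios = {
    metric: times_better(impala, greenplum, higher_is_better=False)
    for metric, (impala, greenplum) in single_user.items()
}
multi_ratios = {
    streams: times_better(impala, greenplum, higher_is_better=True)
    for streams, (impala, greenplum) in multi_user.items()
}

print(single_ratios)
print(multi_ratios)
```

The multi-user ratios grow with the stream count because Impala's QpH keeps rising while Greenplum's stays flat or drops, which is what "lead expands with concurrency" refers to.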
TPC-DS 1TB: Single-user
• The analytical db cohort (Impala / GP) leads SQL on Hadoop
• Impala outperforms:
  • Presto by 8.3x
  • Hive w/ LLAP by 4x
  • Spark SQL by 2.8x
TPC-DS 1TB Single-user, Relative Performance (Lower is Better)
          Impala   Greenplum   Spark SQL   Hive with LLAP   Presto
Total     100%     97%         283%        405%             826%
Geomean   100%     250%        367%        450%             1317%
TPC-DS 1TB: Multi-user
• At 1TB the analytic db cohort (Impala / GP) expands lead with concurrency
• Presto failed to complete
• With 16 streams Impala outperforms:
  • Spark by ~22x
  • Hive by ~20x
  • Greenplum by ~3x
Multi-user TPC-DS 1TB Queries/Hour (Higher is Better)
                 4 streams   8 streams   16 streams
Impala           495         865         1,315
Greenplum        381         417         462
Spark SQL        76          70          61
Hive with LLAP   58          62          66
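The approximate multipliers quoted for 16 streams can be checked against the queries-per-hour figures from the 1TB multi-user chart. The sketch below does that, and also computes how each engine's throughput changes between 4 and 16 streams; the variable names are illustrative.

```python
# Queries per hour from the 1TB multi-user chart, keyed by concurrent streams.
qph = {
    "Impala": {4: 495, 8: 865, 16: 1315},
    "Greenplum": {4: 381, 8: 417, 16: 462},
    "Spark SQL": {4: 76, 8: 70, 16: 61},
    "Hive with LLAP": {4: 58, 8: 62, 16: 66},
}

# Impala's throughput advantage over each other engine at 16 streams.
lead_at_16 = {
    engine: round(qph["Impala"][16] / by_streams[16], 1)
    for engine, by_streams in qph.items()
    if engine != "Impala"
}

# Throughput at 16 streams divided by throughput at 4 streams, per engine.
# A value above 1 means the engine gets more work done as concurrency rises.
scaling_4_to_16 = {
    engine: round(by_streams[16] / by_streams[4], 2)
    for engine, by_streams in qph.items()
}

print(lead_at_16)
print(scaling_4_to_16)
```

The second dictionary is one way to read "expands lead with concurrency": Impala's throughput rises by well over 2x from 4 to 16 streams, while Spark SQL's falls below its 4-stream level.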
Impala leads analytic database performance
• Impala leads in performance against the traditional analytic database, including over 8x better performance for high-concurrency workloads
• The difference is even greater compared to other SQL-on-Hadoop engines, with Impala nearly 22x faster for multi-user workloads
• Other SQL-on-Hadoop engines also required a simplified, smaller-scale benchmark (with Hive still requiring modifications and Presto unable to complete the multi-user tests)
• Impala delivers analytic database performance as well as:
  • Flexibility
  • Cost-effective scale
  • Open architecture
Impala performance optimizations and tips
Data types, partitioning, and runtime filters
Data type selection
Data type selection affects performance:
• Computation: numerical types allow direct computation, string types require conversion
• On-disk storage size: numerical types are more compact
• More compact types also require less network traffic
General guidelines:
• Choose numerical types over character types for numerical data
• Use the smallest data type that will accommodate the largest possible value
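The "smallest type that fits" guideline can be sketched as a range check. The example below uses the signed integer types Impala provides (TINYINT, SMALLINT, INT, BIGINT at 1, 2, 4, and 8 bytes); the function names are illustrative. The widths are the fixed in-memory sizes, and actual on-disk size also depends on file-format encoding and compression.

```python
# Signed integer types and their byte widths.
INTEGER_TYPES = [
    ("TINYINT", 1),
    ("SMALLINT", 2),
    ("INT", 4),
    ("BIGINT", 8),
]

def smallest_integer_type(min_value: int, max_value: int) -> str:
    """Return the narrowest signed integer type that holds the whole value range."""
    for name, width in INTEGER_TYPES:
        low = -(2 ** (8 * width - 1))
        high = 2 ** (8 * width - 1) - 1
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit in a 64-bit integer")

def digits_as_text(value: int) -> int:
    """Bytes needed for the digits alone if the number is stored as a string."""
    return len(str(value))

month_type = smallest_integer_type(1, 12)
order_key_type = smallest_integer_type(1, 18_000_000_000)
order_key_text_bytes = digits_as_text(18_000_000_000)

print(month_type, order_key_type, order_key_text_bytes)
```

A key that ranges up to 18 billion exceeds the 32-bit limit, so it needs BIGINT (8 bytes), while the same value written as text takes 11 bytes for the digits alone and must be converted before any arithmetic or comparison on its numeric value.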
Picking the incorrect data type can result in:
• On-disk storage larger by about 40%
• Scans about 80% slower
• Aggregations about 80% slower
• Joins about 150% slower
• Higher runtime memory utilization
Data set: L_ORDERKEY column from TPC-H 3TB
Value domain: 1 to 18,000,000,000
Number of distinct values: 4,500,000,000
Numeric Data Type Performance (relative to Bigint = 100%)
          Size   Scan   Group by   Join
Bigint    100%   100%   100%       100%
Double    107%   100%   106%       164%
Decimal   100%   124%   104%       159%
String    144%   178%   179%       242%
Varchar   144%   180%   177%       252%
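The headline penalties for a poorly chosen type (about 40% more storage, about 80% slower scans and aggregations, about 150% slower joins) can be derived from the chart values. The sketch below assumes the chart's numbers are listed series by series in legend order (Bigint, Double, Decimal, String, Varchar) across the four categories; that grouping is inferred from the flattened chart, not stated in the source.

```python
SERIES = ["Bigint", "Double", "Decimal", "String", "Varchar"]
METRICS = ["Size", "Scan", "Group by", "Join"]

# Relative cost in percent, one row per series, Bigint as the 100% baseline.
# The row-to-series assignment is an assumption (see the note above).
VALUES = [
    [100, 100, 100, 100],
    [107, 100, 106, 164],
    [100, 124, 104, 159],
    [144, 178, 179, 242],
    [144, 180, 177, 252],
]

relative = {series: dict(zip(METRICS, row)) for series, row in zip(SERIES, VALUES)}

def overhead_vs_bigint(data_type: str, metric: str) -> int:
    """Percentage points of extra cost compared with the Bigint baseline."""
    return relative[data_type][metric] - relative["Bigint"][metric]

# Worst overhead observed for each metric across all data types.
worst_overhead = {
    metric: max(overhead_vs_bigint(series, metric) for series in SERIES)
    for metric in METRICS
}

print(worst_overhead)
```

Under that assumption the worst cases all come from the character types, which matches the guideline to prefer numerical types for numerical data.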
Partitioning design considerations
• Date-based partitioning is the most common, e.g.:
  • WHERE event_dt BETWEEN 20000101 AND 20000201
  • WHERE event_year = 2000 AND event_month = 1
• Partition keys can be column groups / multi-level, e.g.:
  • By date, by hour
  • By date, by region
Partitioning examples
-- Partition key columns filtered directly in the WHERE clause
CREATE TABLE sales (...)
PARTITIONED BY (year INT, month INT);
SELECT ... FROM sales
WHERE year >= 2012
  AND month IN (1, 2, 3);
-- Single surrogate-key partition column, filtered through a join to the date dimension
CREATE TABLE sales (...)
PARTITIONED BY (date_key INT);
SELECT ...
FROM sales s
JOIN date_dim d USING (date_key)
WHERE d.year >= 2012
  AND d.month IN (1, 2, 3);
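When the predicate is written directly on the partition key columns, as in the first query, the planner can decide which partitions to read before execution starts. The sketch below models that decision in a few lines; the 2010 to 2014 partition layout and the `prune` helper are hypothetical and only illustrate the idea.

```python
from itertools import product

# Hypothetical layout: one partition per (year, month) for 2010 through 2014.
partitions = list(product(range(2010, 2015), range(1, 13)))

def prune(partition_keys, predicate):
    """Keep only the partitions whose key values satisfy the predicate."""
    return [key for key in partition_keys if predicate(*key)]

# Mirrors: WHERE year >= 2012 AND month IN (1, 2, 3)
to_scan = prune(partitions, lambda year, month: year >= 2012 and month in (1, 2, 3))

print(len(partitions), "partitions in total,", len(to_scan), "to scan")
print(to_scan)
```

The second query cannot be pruned this way at planning time, because the filter is on date_dim columns and the matching date_key values are only known once the dimension table has been read.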
Partition wisely
Too few partitions (granularity is too coarse):
• Data elimination is not effective
• Increases the minimum unit of work
Too many partitions (granularity is too fine):
• Many small data files hurt large queries: scans are less efficient and parallelism is limited
• A large number of files can cause metadata bloat and create bottlenecks on the HDFS NameNode, Hive Metastore, and Impala catalog service
General guidelines:
• Regularly compact tables to keep the number of files per partition under control and improve scan and compression efficiency
• Keep the number of partitions under 20K (not a hard limit; your mileage will vary)
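A quick estimate of partition count and average partition size helps when choosing between candidate partitioning schemes. The sketch below compares three hypothetical schemes for a 10TB table covering five years; the table size, the region count of 50, and the helper names are assumptions made for illustration.

```python
PARTITION_GUIDELINE = 20_000  # the "under 20K" rule of thumb, not a hard limit

def partition_count(*cardinalities: int) -> int:
    """Number of partitions for multi-level keys: the product of key cardinalities."""
    total = 1
    for cardinality in cardinalities:
        total *= cardinality
    return total

def avg_partition_size_mb(table_size_gb: float, partitions: int) -> float:
    """Average data volume per partition, assuming an even spread."""
    return table_size_gb * 1024 / partitions

TABLE_SIZE_GB = 10 * 1024   # assumed 10TB table
DAYS = 5 * 365              # assumed five years of daily data

schemes = {
    "by day": partition_count(DAYS),
    "by day, by hour": partition_count(DAYS, 24),
    "by day, by region": partition_count(DAYS, 50),
}

for name, count in schemes.items():
    size = avg_partition_size_mb(TABLE_SIZE_GB, count)
    verdict = "within guideline" if count <= PARTITION_GUIDELINE else "over guideline"
    print(f"{name}: {count} partitions, ~{size:.0f} MB each ({verdict})")
```

In this example only the daily scheme stays under the guideline; the two multi-level schemes multiply the partition count and shrink each partition, which is the "many small files" situation described above.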
Dynamic partition pruning with runtime filters
Business question: How much was sold in June?
CREATE TABLE store_sales (...)
PARTITIONED BY (ss_sold_date_sk INT);
SELECT
  d_year,
  sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON
  (d_date_sk = ss_sold_date_sk)
WHERE d_moy = 6
GROUP BY d_year;
[Query plan diagram: STORE_SALES (28 billion rows) and DATE_DIM (6,000 rows, broadcast) feed Join #1, which sends 1.3 billion rows to Aggregate]
The query planner does not know which values of d_date_sk will be returned, and therefore which fact table partitions need to be scanned (or eliminated).
But there is clearly an opportunity to save some work: why bother sending all 28 billion rows to the join?
Runtime filters construct the partition pruning predicate at runtime.
Step 1: Planner tells Join #1 to produce a Bloom filter for the qualifying distinct values of d_date_sk.
Bloom filter: a compact, probabilistic representation of a data set; essentially a sophisticated bitmap.
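To make "compact, probabilistic representation" concrete, here is a minimal textbook Bloom filter. It is a sketch for illustration only and not Impala's implementation; the class name, the default sizes, and the use of SHA-256 as the hash function are all assumptions. The key property is that a lookup can return a false positive but never a false negative, so a filter built from the join keys can safely discard data that has no chance of matching.

```python
import hashlib

class BloomFilter:
    """Textbook Bloom filter: a bit array plus several hash positions per value."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions by hashing the value with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )

# Build the filter from a set of qualifying join keys (made-up values).
qualifying_keys = range(2451000, 2451030)
bloom = BloomFilter()
for key in qualifying_keys:
    bloom.add(key)

print(all(bloom.might_contain(key) for key in qualifying_keys))
print(len(bloom.bits), "bytes used for", len(qualifying_keys), "keys")
```

The filter's size is fixed up front regardless of how many keys are added, which is what keeps it cheap to ship between nodes; the trade-off is that a fuller filter produces more false positives.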
Step 2: Join reads all rows from the build side (the right input of the join, here the filtered date_dim rows) and populates the Bloom filter with all distinct values of d_date_sk.
Step 3: Query coordinator sends the filter to the store_sales scan before the scan starts.
Step 4: Scan eliminates all partitions that don't have a match in the Bloom filter. Only 150 of the 1,824 partitions are read from store_sales.
Step 5: The number of rows coming out of the scan is reduced from 28 billion to 1.3 billion.
[Query plan diagram, updated: the STORE_SALES scan now emits 1.3 billion rows instead of 28 billion; DATE_DIM (6,000 rows) is still broadcast to Join #1, which sends 1.3 billion rows to Aggregate]
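The five steps can be traced end to end with a small simulation. Everything about the data layout here is assumed for illustration: the date dimension spans 1990 through 2009 with the surrogate key equal to days since 1990-01-01, and the fact table has one partition per day for 1,824 consecutive days starting 1998-01-02. A plain Python set stands in for the Bloom filter, so unlike the real filter it has no false positives.

```python
from datetime import date, timedelta

DIM_START = date(1990, 1, 1)  # assumed first date in the date dimension

def date_of(date_sk: int) -> date:
    """Map a surrogate key back to its calendar date."""
    return DIM_START + timedelta(days=date_sk)

def sk_of(day: date) -> int:
    """Surrogate key of a calendar date (days since DIM_START)."""
    return (day - DIM_START).days

# Assumed date dimension: one key per day from 1990-01-01 to 2009-12-31.
date_dim_keys = range(sk_of(date(2010, 1, 1)))

# Assumed fact table: one partition per day for 1,824 days from 1998-01-02.
first_partition = sk_of(date(1998, 1, 2))
fact_partition_keys = range(first_partition, first_partition + 1824)

# Steps 1-2: the join's build side applies d_moy = 6 and collects the
# qualifying d_date_sk values into the runtime filter.
runtime_filter = {sk for sk in date_dim_keys if date_of(sk).month == 6}

# Steps 3-4: the filter reaches the scan before it starts, and the scan skips
# every partition whose key is not in the filter.
partitions_to_scan = [sk for sk in fact_partition_keys if sk in runtime_filter]

# Step 5: only rows from the surviving partitions flow on to the join.
print(len(fact_partition_keys), "partitions in total")
print(len(partitions_to_scan), "partitions scanned")
```

With this assumed layout the fact table covers five complete Junes of 30 days each, so 150 of the 1,824 daily partitions survive, the same proportion as in Step 4.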
In summary
• Impala offers:
  • Flexibility
  • Cost-effective scale
  • Open architecture
  • Leading performance
• Recent improvements/enhancements:
  • Performance
  • Real-time and cloud capabilities
• Be sure to check out the Impala Cookbook (SlideShare) for more performance pro-tips
Thank you
Downloads: https://cloudera.com/downloads
Interested in contributing? http://impala.io

New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Analytic Database
