Scott Leberknight
Cloudera's
History lesson...
Google Map/Reduce paper (2004)
Cutting & Cafarella create Hadoop (2005)
Facebook creates Hive (2007)*
Google Dremel paper (2010)
Apache Drill proposal (August 2012)
Cloudera announces Impala (October 2012)
Hortonworks' Stinger (February 2013)
* Hive => "SQL on Hadoop"
Write SQL queries
Translate into Map/Reduce job(s)
Convenient & easy
High-latency (batch processing)
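As a rough illustration of the "translate into Map/Reduce job(s)" step, a GROUP BY with count(*) fits one map phase and one reduce phase. The sketch below simulates that in plain Python. The sample rows and function names are hypothetical, and real Hive builds its jobs through its own planner, not like this.

```python
from collections import defaultdict

# Hypothetical rows standing in for customer_demographics; values are made up.
rows = [
    {"cd_demo_sk": 1, "cd_credit_rating": "Good"},
    {"cd_demo_sk": 2, "cd_credit_rating": "High Risk"},
    {"cd_demo_sk": 3, "cd_credit_rating": "Good"},
    {"cd_demo_sk": 4, "cd_credit_rating": "Unknown"},
]

def map_phase(row):
    """Emit one (key, 1) pair per row; the key is the GROUP BY column."""
    yield row["cd_credit_rating"], 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the ones for each key, which yields count(*) per group."""
    return key, sum(values)

def run_job(input_rows):
    pairs = [pair for row in input_rows for pair in map_phase(row)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Equivalent in spirit to:
#   select cd_credit_rating, count(*) from customer_demographics
#   group by cd_credit_rating;
counts = run_job(rows)
```

Every such job pays for scheduling and for writing intermediate data between phases, which is where the batch-style latency comes from.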
What is Impala?
In-memory, distributed SQL
query engine (no Map/Reduce)
Native code (C++)
Distributed
(on HDFS data nodes)
Why Impala?
Interactive data analysis
Low-latency response
(roughly 4-100x faster than Hive)
Deploy on existing Hadoop clusters
Why Impala? (cont'd)
Data stored in HDFS avoids...
...duplicate storage
...data transformation
...moving data
Why Impala? (cont'd)
SPEED!
Overview
impalad daemon runs on HDFS nodes
Queries run on "relevant" nodes
Supports common HDFS file formats
statestored, uses Hive metastore
(for database metadata)
Overview (cont'd)
Does not use Map/Reduce
Not fault tolerant!
(query fails if any part of it fails on any node)
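A minimal sketch of what "not fault tolerant" means for a distributed query, assuming a coordinator that collects one result per node. All names here (FragmentError, run_fragment, run_query) are hypothetical, and the nodes run sequentially for simplicity; this is not Impala's code.

```python
class FragmentError(Exception):
    """Raised when one node's part of a query fails (hypothetical name)."""

def run_fragment(node, fragment):
    """Run one node's share of the work; a fragment is just a callable here."""
    try:
        return fragment()
    except Exception as exc:
        raise FragmentError(f"fragment failed on {node}") from exc

def run_query(fragments):
    """Run every fragment; the first failure aborts the whole query.

    There is no retry and no rescheduling on another node.
    """
    results = []
    for node, fragment in fragments.items():
        results.append(run_fragment(node, fragment))  # any error propagates
    return results

def failing_fragment():
    raise IOError("disk error")

ok = run_query({"node1": lambda: [1, 2], "node2": lambda: [3]})

try:
    run_query({"node1": lambda: [1, 2], "node2": failing_fragment})
    outcome = "succeeded"
except FragmentError as err:
    outcome = str(err)
```

Map/Reduce would re-run a failed task on another node; here the client simply has to resubmit the query.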
Submit queries via Hue/Beeswax
Thrift API, CLI, ODBC, JDBC (future)
SQL Support
SELECT
Projection
UNION
INSERT OVERWRITE
INSERT INTO
ORDER BY
(w/ LIMIT)
Aggregation
Subqueries
(uncorrelated)
JOIN (equi-join only,
subject to memory
limitations)
(subset of Hive QL)
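One common way to run an equi-join in memory is a hash join, and the sketch below assumes that strategy to show why equality conditions and memory limits go together. The table contents are made up; the column names are borrowed from the benchmark queries.

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join: hash the build side fully in memory, then stream the probe side."""
    table = {}
    for row in build_rows:                 # the whole build side must fit in RAM
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:                 # the probe side is streamed row by row
        for match in table.get(row[probe_key], []):
            yield {**row, **match}

# Hypothetical sample data.
addresses = [
    {"ca_address_sk": 10, "ca_city": "Reston"},
    {"ca_address_sk": 11, "ca_city": "Herndon"},
]
customers = [
    {"c_first_name": "Ann", "c_current_addr_sk": 10},
    {"c_first_name": "Bob", "c_current_addr_sk": 11},
    {"c_first_name": "Cy", "c_current_addr_sk": 99},   # no matching address
]

joined = list(hash_join(addresses, customers, "ca_address_sk", "c_current_addr_sk"))
```

The dictionary lookup only works for equality, and the dictionary itself is the part that is "subject to memory limitations".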
HBase Queries
Maps HBase tables via Hive metastore mapping
HBase scan translations:
Row key predicates => start/stop row
Non-row key predicates => SingleColumnValueFilter
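The scan translations could look roughly like the sketch below. It is an illustration only: plan_scan, the predicate tuples, and the "info:state" column are hypothetical, only three row-key operators are handled, and the filter is represented as a plain tuple instead of a real HBase SingleColumnValueFilter object.

```python
def plan_scan(predicates, row_key="key"):
    """Split predicates into a row range and a list of column filters.

    Each predicate is a (column, operator, value) tuple.
    """
    scan = {"start_row": None, "stop_row": None, "filters": []}
    for column, op, value in predicates:
        if column == row_key and op == "=":
            scan["start_row"] = value
            scan["stop_row"] = value + "\x00"   # HBase stop rows are exclusive
        elif column == row_key and op == ">=":
            scan["start_row"] = value
        elif column == row_key and op == "<":
            scan["stop_row"] = value
        elif column == row_key:
            raise ValueError(f"row key operator not handled in this sketch: {op}")
        else:
            # Stand-in for a SingleColumnValueFilter on family:qualifier.
            scan["filters"].append(("SingleColumnValueFilter", column, op, value))
    return scan

scan = plan_scan([
    ("key", ">=", "cust_1000"),
    ("key", "<", "cust_2000"),
    ("info:state", "=", "VA"),
])
```

Row-key predicates narrow the range of rows the scan touches at all, while the remaining predicates are applied as filters to the rows inside that range.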
(Very) Unscientific Benchmarks
9 queries, run in Impala Demo VM
Hardware:
MacBook Pro Retina, mid 2012
Intel i7 2.6GHz quad-core processor
16GB RAM,
4GB for VM (VMware 5)
No other load on system during queries
Pseudo-cluster + Impala daemons
Benchmarks (cont'd)
Impala vs. Hive performance
(from simple projection queries to multiple joins, aggregation, multiple predicates, and order by)
"TPC-DS" sample dataset
(http://www.tpc.org/tpcds/)
Query "A"
select
   c.c_first_name,
   c.c_last_name
from customer c
limit 50;
Query "B"
select
   c.c_first_name,
   c.c_last_name,
   ca.ca_city,
   ca.ca_county,
   ca.ca_state
from customer c
   join customer_address ca
       on c.c_current_addr_sk = ca.ca_address_sk
limit 50;
Query "C"
select
   c.c_first_name,
   c.c_last_name,
   ca.ca_city,
   ca.ca_county,
   ca.ca_state
from customer c
   join customer_address ca
       on c.c_current_addr_sk = ca.ca_address_sk
where lower(c.c_last_name) like 'smi%'
limit 50;
Query "D"
select distinct cd_credit_rating
from customer_demographics;
Query "E"
select
   cd_credit_rating,
   count(*)
from customer_demographics
group by cd_credit_rating;
Query "F"
select
   c.c_first_name,
   c.c_last_name,
   ca.ca_city,
   ca.ca_county,
   ca.ca_state,
   cd.cd_marital_status,
   cd.cd_education_status
from customer c
   join customer_address ca
       on c.c_current_addr_sk = ca.ca_address_sk
   join customer_demographics cd
       on c.c_current_cdemo_sk = cd.cd_demo_sk
where
   lower(c.c_last_name) like 'smi%' and
   cd.cd_credit_rating in ('Unknown', 'High Risk')
limit 50;
Query "G"
select
   count(c.c_customer_sk)
from customer c
   join customer_address ca
       on c.c_current_addr_sk = ca.ca_address_sk
   join customer_demographics cd
       on c.c_current_cdemo_sk = cd.cd_demo_sk
where
   ca.ca_zip in ('20191', '20194') and
   cd.cd_credit_rating in ('Unknown', 'High Risk');
Query "H"
select
   c.c_first_name,
   c.c_last_name,
   ca.ca_city,
   ca.ca_county,
   ca.ca_state,
   cd.cd_marital_status,
   cd.cd_education_status
from customer c
   join customer_address ca
       on c.c_current_addr_sk = ca.ca_address_sk
   join customer_demographics cd
       on c.c_current_cdemo_sk = cd.cd_demo_sk
where
   ca.ca_zip in ('20191', '20194') and
   cd.cd_credit_rating in ('Unknown', 'High Risk')
limit 100;
Query "TPC-DS"
select
  i_item_id,
  s_state,
  avg(ss_quantity) agg1,
  avg(ss_list_price) agg2,
  avg(ss_coupon_amt) agg3,
  avg(ss_sales_price) agg4
from store_sales
join date_dim
   on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
join item
   on (store_sales.ss_item_sk = item.i_item_sk)
join customer_demographics
   on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
join store
   on (store_sales.ss_store_sk = store.s_store_sk)
where
  cd_gender = 'M' and
  cd_marital_status = 'S' and
  cd_education_status = 'College' and
  d_year = 2002 and
  s_state in ('TN','SD', 'SD', 'SD', 'SD', 'SD')
group by
  i_item_id,
  s_state
order by
  i_item_id,
  s_state
limit 100;
Query    Hive (sec)   # M/R jobs   Impala (sec)   x Hive perf.
A        12.4         1            0.21           59
B        30.9         1            0.37           84
C        29.6         1            0.33           91
D        22.8         1            0.60           38
E        22.5         1            0.52           44
F        66.4         2            1.56           43
G        83.0         3            1.33           62
H        66.1         2            1.50           44
TPC-DS   248.3        6            3.05           82
(remember, unscientific...)
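The "x Hive perf." column reads as Hive time divided by Impala time. The short check below recomputes it from the times in the table above. Values recomputed from the rounded times land within about two units of the listed ones; the gap presumably comes from rounding of the displayed times, which is an assumption and not something the slides state.

```python
# (Hive seconds, Impala seconds) copied from the results table.
timings = {
    "A": (12.4, 0.21),
    "B": (30.9, 0.37),
    "C": (29.6, 0.33),
    "D": (22.8, 0.60),
    "E": (22.5, 0.52),
    "F": (66.4, 1.56),
    "G": (83.0, 1.33),
    "H": (66.1, 1.50),
    "TPC-DS": (248.3, 3.05),
}

# "x Hive perf." values as listed in the results table.
listed = {"A": 59, "B": 84, "C": 91, "D": 38, "E": 44,
          "F": 43, "G": 62, "H": 44, "TPC-DS": 82}

def speedup(hive_sec, impala_sec):
    """How many times faster Impala ran than Hive for the same query."""
    return hive_sec / impala_sec

speedups = {q: round(speedup(h, i), 1) for q, (h, i) in timings.items()}

for query, value in speedups.items():
    print(f"{query:7} recomputed {value:5.1f}x   listed {listed[query]}x")
```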
Cloudera Impala
Architecture
Two daemons
impalad
statestored
impalad on each HDFS data node
statestored - metadata
Thrift APIs
impalad
Query execution
Query coordination
Query planning
[Diagram: an impalad (Query Planner, Query Coordinator, Query Executor) running alongside the HDFS DataNode and HBase RegionServer]
Queries performed in-memory
Intermediate data never hits disk!
Data streamed to clients
Execution engine:
C++
runtime code generation
intrinsics for optimization
statestored
Cluster membership
Metadata handling
(scheduled for GA release)
Not a SPOF
(single point of failure)
Metadata
Shares Hive metastore
Daemons cache metadata
Push to cluster via statestored
(scheduled for GA release)
Create tables in Hive
(then REFRESH impalad)
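A minimal sketch of why a refresh is needed after creating a table in Hive, assuming each daemon keeps a local copy of the metastore's table list. MetadataCache and FakeMetastore are hypothetical stand-ins, not Impala or Hive classes.

```python
class FakeMetastore:
    """Stand-in for the shared Hive metastore (hypothetical)."""
    def __init__(self):
        self.tables = ["customer"]

    def list_tables(self):
        return list(self.tables)

    def create_table(self, name):
        self.tables.append(name)


class MetadataCache:
    """Caches table names from a metastore until refresh() is called."""
    def __init__(self, metastore):
        self._metastore = metastore
        self._tables = set(metastore.list_tables())

    def has_table(self, name):
        return name in self._tables        # served from cache, no metastore call

    def refresh(self):
        self._tables = set(self._metastore.list_tables())


metastore = FakeMetastore()
daemon_cache = MetadataCache(metastore)

metastore.create_table("customer_address")   # table created on the Hive side
seen_before = daemon_cache.has_table("customer_address")
daemon_cache.refresh()                       # analogous to issuing REFRESH
seen_after = daemon_cache.has_table("customer_address")
```

The cache keeps planning fast because no metastore round trip is needed per query, at the cost of stale metadata until something triggers a reload.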
Next up - how queries work...
[Diagram: Client, Statestore, and Hive Metastore above three impalad nodes, each with a Query Planner, Query Coordinator, and Query Executor next to an HDFS DataNode and HBase RegionServer. Labels: SQL query; table metadata; table metadata (cached)]
Read directly from disk
Short-circuit reads
Bypass HDFS DataNode
(avoids overhead of HDFS API)
[Diagram: impalad (Query Planner, Query Coordinator, Query Executor) reading directly from disk on the local filesystem, next to the HBase RegionServer and HDFS DataNode]
Current Limitations
(as of beta version 0.6)
No join order optimization
No custom file formats or SerDes or UDFs
Limit required when using ORDER BY
Joins limited by memory of single node
(at GA, aggregate memory of cluster)
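One plausible reason for requiring LIMIT with ORDER BY (an assumption; the slides do not give the reason) is that a top-N can be computed in bounded memory, while a full sort of a large result cannot. The sketch below shows that idea with hypothetical function names and made-up data.

```python
import heapq

def top_n(rows, key, n):
    """Return the n smallest rows by key; nsmallest keeps a heap of about n items."""
    return heapq.nsmallest(n, rows, key=key)

def distributed_top_n(partitions, key, n):
    """Each node keeps its local top n; the coordinator merges the partial results."""
    partials = [top_n(partition, key, n) for partition in partitions]
    merged = (row for partial in partials for row in partial)
    return top_n(merged, key, n)

# Three hypothetical nodes, each holding part of the data.
partitions = [[5, 1, 9], [3, 7], [2, 8]]
result = distributed_top_n(partitions, key=lambda value: value, n=3)
```

With a limit of n, each node ships at most n rows to the coordinator, so memory use depends on n and the number of nodes, not on the table size.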
Current Limitations (cont'd)
(as of beta version 0.6)
No advanced data structures
(arrays, maps, JSON, etc.)
No DDL (do in Hive)
Limited file formats (text, SequenceFile
w/ Snappy/gzip compression)
Future - GA & beyond...
Structure types (structs,
arrays, maps, JSON, etc.)
DDL support
Additional file formats &
compression support
Columnar format
(Parquet?)
"Performance"
Metadata
(via statestore)
JDBC
Join optimization
(e.g. cost-based)
UDFs
Comparing...
Comparing Impala to Dremel
"Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google."
- http://research.google.com/pubs/pub36632.html
Comparing Impala to Dremel
Impala = Dremel features circa 2010 + join
support, assuming columnar data format
(but, Google doesn't stand still...)
Dremel is production, mature
Basis for Google's BigQuery
Comparing Impala to Hive
Hive uses Map/Reduce -> high latency
Impala is an in-memory, low-latency query engine
Sacrifices fault tolerance for
performance
Comparing Impala to Others
Stinger
Improve Hive performance (e.g. optimize execution plan)
Support for analytics (e.g. OVER clause, window functions)
Tez framework to optimize execution
Columnar file format
Apache Drill
Based on Dremel
In very early stages...
Review
In-memory, distributed
SQL query engine
Integrates into
existing HDFS
Not Map/Reduce
Focus is on
performance
(native code)
References
Google Dremel - http://research.google.com/pubs/pub36632.html
Apache Drill - http://incubator.apache.org/drill/
TPC-DS dataset - http://www.tpc.org/tpcds/
Stinger Initiative - http://hortonworks.com/blog/100x-faster-hive/
Cloudera Impala resources
http://university.cloudera.com/onlineresources/introductionimpala.html
Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real
http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
Photo Attributions
Impala - http://www.flickr.com/photos/gerardstolk/5897570970/
Measuring tape - http://www.morguefile.com/archive/display/24850
Bridge frame - http://www.morguefile.com/archive/display/9699
Balance - http://www.morguefile.com/archive/display/93433
* All others are iStockPhoto (paid)
My Info
scott dot leberknight at nearinfinity dot com
twitter.com/sleberknight www.sleberknight.com/blog
www.nearinfinity.com/blogs/scott_leberknight/all/
scott dot leberknight at gmail dot com
