© Cloudera, Inc. All rights reserved.
Choosing the Right Tool for the Right Job
Overview of Cloudera’s SQL-on-Hadoop Technologies
§ The information in this document is proprietary to Cloudera. No part of this document may be reproduced, copied or transmitted in any form for any purpose without the express prior written permission of Cloudera.
§ This document is a preliminary version and not subject to your license agreement or any other agreement with Cloudera. This document contains only intended strategies, developments and functionalities of Cloudera products and is not intended to be binding upon Cloudera to any particular course of business, product strategy and/or development. Please note that this document is subject to change and may be changed by Cloudera at any time without notice.
§ Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not warrant the accuracy or completeness of the information, text, graphics, links or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose or non-infringement.
§ Cloudera shall have no liability for damages of any kind including without limitation direct, special, indirect or consequential damages that may result from the use of these materials. The limitation shall not apply in cases of gross negligence.
Cloudera is Built for Production Success
Hadoop delivers:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera delivers:
• Leading Performance
• Enterprise Security
• Data Management
• Simple Administration
Platform diagram:
• Workloads: Process, Discover, Model, Serve
• Platform layers: Unlimited Storage; Security and Administration
• Deployment Flexibility: On-Premises, Appliances, Engineered Systems, Public Cloud, Private Cloud, Hybrid Cloud
A modern data platform plus what the enterprise requires.
One Platform, Many Workloads
Batch, Interactive, and Real-Time.
Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
Platform diagram:
• Process: Ingest (Sqoop, Flume, Kafka); Transform (MapReduce, Hive, Pig, Spark)
• Discover: Analytic Database (Impala); Search (Solr)
• Model: Machine Learning (SAS, R, Spark, Mahout)
• Serve: NoSQL Database (HBase); Streaming (Spark Streaming)
• Unlimited Storage: HDFS, HBase
• Security and Administration: YARN, Cloudera Manager, Cloudera Navigator
Choosing the Right SQL Engine
Know Your Audience, Know Your Use Case
• Batch Processing: Hive
• BI and SQL Analytics: Impala
• Procedural Development: Spark SQL
Hive
Batch Processing
User:
• SQL-based ETL developers
Designed for:
• Handful of concurrent, very long-running batch jobs
Strengths:
• Custom file formats
• Very long-running ETL, data preparation, or batch processing
• Massive ETL sorts with joins
• Existing Hive jobs
Impala
BI and Analytics
User:
• Data Analysts
• BI Users
Designed for:
• Interactive SQL for a large number of BI users and analysts
Strengths:
• Multi-user scale
• Interactive latency
• Compatibility (BI tools, ANSI SQL, and vendor-specific SQL)
• Usability
SparkSQL
Machine Learning Applications
User:
• Data Engineers
• Data Scientists
Designed for:
• Ease of development for Spark developers
• Handful of concurrent Spark jobs
Strengths:
• Ease of embedding SQL into Java or Scala applications
• SQL for common functionality in developer flow (e.g., aggregations, filters, samples)
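The audience and use-case guidance on the Hive, Impala, and SparkSQL slides can be condensed into a small lookup. This is only an illustrative sketch: `WORKLOAD_TO_ENGINE` and `suggest_engine` are hypothetical names, the category keys are invented for the example, and the mapping restates the slides rather than any Cloudera tool or API.

```python
# Hypothetical lookup restating the slide guidance; not a Cloudera API.
WORKLOAD_TO_ENGINE = {
    "batch_processing": "Hive",             # long-running SQL-based ETL jobs
    "bi_and_sql_analytics": "Impala",       # interactive SQL for many BI users
    "procedural_development": "Spark SQL",  # SQL embedded in Spark applications
}


def suggest_engine(workload: str) -> str:
    """Return the engine the slides associate with a workload category."""
    try:
        return WORKLOAD_TO_ENGINE[workload]
    except KeyError:
        raise ValueError(f"unknown workload category: {workload!r}") from None


if __name__ == "__main__":
    for workload in WORKLOAD_TO_ENGINE:
        print(f"{workload}: {suggest_engine(workload)}")
```

A real decision would also weigh concurrency, latency targets, and existing jobs, as the "Designed for" and "Strengths" bullets indicate.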
SQL-on-Hadoop Benchmark
Impala, Presto, Stinger, SparkSQL
Versions:
• Impala 1.4.0
• Presto 0.74
• Stinger Phase 3 (Final) => Hive 0.13.0
• SparkSQL 1.1
Benchmark Details:
• Based on industry standards (TPC)
• Repeatable (https://github.com/cloudera/impala-tpcds-kit)
• Methodical testing with multiple runs on same hardware
• Help competing software do well
    • SQL-92 join style for engines without CBO
    • JVM tuning for Presto
    • Run on optimal file formats for each
Full Details: http://blog.cloudera.com/blog/2014/09/new-benchmarks-for-sql-on-hadoop-impala-1-4-widens-the-performance-gap/
Impala Multi-User Performance Over 10x Faster with Just 10 Users
Chart: Single User vs 10 User Response Time, in seconds, with how many times faster Impala is (lower bars = better)
• Impala: single user 5, 10 users 11
• Spark SQL: single user 25 (Impala 5.0x faster), 10 users 120 (10.6x)
• Presto: single user 37 (7.4x), 10 users 302 (27.4x)
• Hive-on-Tez: single user 77 (15.4x), 10 users 202 (18.3x)
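The "times faster" figures on this slide are ratios of each engine's response time to Impala's. A minimal sketch of that arithmetic follows, assuming the pairing of bar labels to engines implied by the chart order; the dictionary and function names are illustrative. Because the bar labels are rounded to whole seconds, the recomputed ratios can differ from the printed ones by a few tenths.

```python
# Response times in seconds, read off the bar labels on this slide.
# The engine-to-value pairing is inferred from the chart order (an assumption).
RESPONSE_TIME_S = {
    "Impala": {"single_user": 5, "ten_users": 11},
    "Spark SQL": {"single_user": 25, "ten_users": 120},
    "Presto": {"single_user": 37, "ten_users": 302},
    "Hive-on-Tez": {"single_user": 77, "ten_users": 202},
}


def times_slower_than_impala(engine: str, scenario: str) -> float:
    """Ratio of an engine's response time to Impala's for one scenario."""
    return RESPONSE_TIME_S[engine][scenario] / RESPONSE_TIME_S["Impala"][scenario]


if __name__ == "__main__":
    for engine in ("Spark SQL", "Presto", "Hive-on-Tez"):
        single = times_slower_than_impala(engine, "single_user")
        multi = times_slower_than_impala(engine, "ten_users")
        print(f"{engine}: {single:.1f}x single user, {multi:.1f}x with 10 users")
```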
Impala Enables Over 8.7x Throughput
More Work Done in Less Time
Chart: Query Throughput, in queries per hour, with how many times more throughput Impala delivers (higher bars = better)
• Impala: 2333
• Spark SQL: 266 (Impala 8.7x more)
• Presto: 106 (22.0x)
• Hive-on-Tez: 175 (13.3x)
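The throughput multiples are Impala's queries per hour divided by each other engine's. A short sketch of that calculation, using the values printed on the chart; the names are illustrative, and the engine-to-value pairing follows the chart's axis order (an assumption).

```python
# Queries per hour, read off the chart on this slide.
QUERIES_PER_HOUR = {
    "Impala": 2333,
    "Spark SQL": 266,
    "Presto": 106,
    "Hive-on-Tez": 175,
}


def impala_throughput_advantage(engine: str) -> float:
    """How many times more queries per hour Impala completed than `engine`."""
    return QUERIES_PER_HOUR["Impala"] / QUERIES_PER_HOUR[engine]


if __name__ == "__main__":
    for engine in ("Spark SQL", "Presto", "Hive-on-Tez"):
        print(f"Impala vs {engine}: {impala_throughput_advantage(engine):.2f}x")
```

The smallest of the three ratios, against Spark SQL, is the one behind the "Over 8.7x" headline.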
Performance Benchmark Takeaways
• Impala unlocks BI usage directly on Hadoop
    • Meets BI low-latency and multi-user requirements
    • Advantage expands from 5x for single-user to >10x with just 10 users
• Hive is designed (and still great) for batch processing
    • Most Impala customers use Hive for data preparation
    • Hive is the most commonly used ETL framework
• Spark SQL enables easier Spark application development
    • Enables mixed procedural Spark (Java/Scala) and SQL job development
• Mid-term trends will further favor Impala’s design approach for latency and concurrency
    • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
    • CPU efficiency will increase in importance
    • Native code enables easy optimizations for CPU instruction sets
    • Intel joint roadmap supports these opportunities
IBM Research Validation
• VLDB academic paper compares Impala and Hive (both MR and Tez) for SQL-on-Hadoop
    • http://www.vldb.org/pvldb/vol7/p1295-floratou.pdf
• Impala is significantly more efficient than Hive/Tez or Hive/MR
    • Impala’s lead is due to CPU efficiency, I/O manager, and overall architecture that resembles a shared-nothing parallel database
• Parquet is more efficient than ORC
• Additional Notes:
    • Impala 1.4 and higher is significantly faster on selective joins than the Impala 1.2.2 used in the paper
    • Impala 2.0 has disk-based joins and aggregations
    • The paper compares single-user runs only; Impala’s advantage would be even larger with multiple users
“Impala’s database-like architecture provides significant performance gains, compared to Hive’s MapReduce or Tez-based runtime”
“The Parquet format skips data more efficiently than ORC, which tends to pre-fetch unnecessary data, especially when a table contains a large number of columns”
Major new SQL features in Cloudera 5.5
• Impala
    • Reliability (particularly with concurrency and scale)
    • Nested types
    • Column-level security
    • Additional functions
• Hive
    • Quality
    • S3 support
    • CM monitoring
    • Navigator lineage
• SparkSQL (with DataFrames)
    • Now supported in CDH 5.5 (recommend HiveContext)
    • Thriftserver and JDBC not ready for support
• Navigator Optimizer (beta)
    • Helps assess and offload workloads onto Hadoop
Kudu Fills a Critical Gap: Fast Analytics on Fast Changing Data
Diagram (axes: Pace of Analysis, Pace of Data):
• HDFS: Fast Scans, Analytics and Processing of Stored Data; Arbitrary Storage (Active Archive)
• HBase: Fast On-Line Updates & Data Serving
• Kudu: Fast Analytics (on fast-changing or frequently-updated data)
• Axis labels: Unchanging, Append-Only, Fast Changing / Frequent Updates, Real-Time
• Analytic Gap: Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS
Kudu fills the Gap
Current Security Architecture: Inconsistency = Limited Access
Challenge: Hadoop access engines respect policies differently, forcing reliance on lowest common denominator file- or table-based policies, or restricted access. Policy management only solves part of the problem.
Diagram (current architecture): Sentry (policy definition); Impala (column-level); Spark (table-level); labels Policy A, Policy B. Some engines support more granular restrictions than others.
RecordService: Unified Access Control Enforcement
Solution: RecordService is a new high-performance security layer that centrally enforces access control policy. Complementing Apache Sentry, which provides unified policy definition, it delivers unified row- and column-based security, and dynamic data masking, to every Hadoop access path.
Diagram (with RecordService): Sentry (policy definition); RecordService (policy enforcement); Impala, Spark, ...; Unified, Granular Policy Enforcement.
Benefits:
● Security: Fine-grained permissions and enforcement across Hadoop, building on Sentry.
● Interoperability: Developers don’t need to be aware of on-disk formats; transparently swap components.
Thank	You

Cloudera Showcase: SQL-on-Hadoop