Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights

Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights
Justin Erickson | Product Manager
Marcel Kornacker | Software Engineer
Ravikumar Visweswara | Software Engineer
October 2012

Why Data Scientists Love Hadoop

• Massive volumes of data

• Data preparation & analytics in 1 environment
• Highly flexible environment for creating & testing machine learning models

• 10% of the cost per TB under management

Hadoop Use Cases Moving to Real-Time

• Already query Hadoop using Hive
• Already load data into CDH every 90 mins or less
• Already use HBase for real-time data access

Source: Cloudera customer survey August 2012

But Hadoop Isn’t Fast Enough

• Need faster queries on Hadoop data
• Move data from Hadoop to RDBMS for interactive SQL
• See value today in consolidating to a single platform

Source: Cloudera customer survey August 2012

Beyond Batch – The Next Stage for Hadoop
HADOOP TODAY IS TOO SLOW
MapReduce is batch
Simple queries can take minutes / tens of minutes

CURRENT DATA MANAGEMENT IS TOO COMPLEX
Optimized for rigid schemas & special-purpose applications
Redundant data storage & processes
Very expensive systems: $20K-150K / TB

Cloudera Enterprise RTQ
Real-Time Query for Data Stored in Hadoop
Powered by Cloudera Impala.
Supports Hive SQL
4-30X faster than Hive over MapReduce
Supports multiple storage engines & file formats
Uses existing drivers, integrates with existing metastore, works with leading BI tools
Flexible, cost-effective, no lock-in

Deploy & operate with Cloudera Manager
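As a rough worked example of the 4-30X claim above, this Python sketch converts a Hive-over-MapReduce runtime into the runtime range that speedup would imply. The function name and the 10-minute baseline are hypothetical, chosen only to illustrate the arithmetic. They are not measured results.

```python
def implied_runtime_range(hive_seconds, min_speedup=4.0, max_speedup=30.0):
    """Return (best_case, worst_case) runtimes in seconds implied by a speedup range."""
    if hive_seconds < 0 or min_speedup <= 0 or max_speedup < min_speedup:
        raise ValueError("invalid inputs")
    return hive_seconds / max_speedup, hive_seconds / min_speedup


# Hypothetical baseline: a query that takes 10 minutes in Hive over MapReduce.
best, worst = implied_runtime_range(10 * 60)
print(f"{best:.0f}s to {worst:.0f}s")  # 20s to 150s
```

Under these assumptions, a ten-minute batch query would land somewhere between 20 seconds and two and a half minutes.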

Cloudera Now Powered by Impala
[Diagram: BEFORE IMPALA vs. WITH IMPALA; layers labeled USER INTERFACE, BATCH PROCESSING, REAL-TIME ACCESS]

• Unified Storage:
  Supports HDFS and HBase
  Flexible file formats
• Unified Metastore
• Unified Security
• Unified Client Interfaces:
  ODBC, SQL syntax, Hue Beeswax
• With Impala:
  Real-time SQL queries
  Native distributed query engine
  Optimized for low latency
• Provides:
  Answers as fast as you can ask
  The ability for everyone to ask questions of all data
  Big data storage and analytics together

Cloudera Impala Details
[Architecture diagram, shown as a sequence of builds]
Common Hive SQL and interface: SQL App, ODBC
Unified metadata and scheduler: Hive Metastore, YARN, HDFS NN, State Store
Fully MPP, distributed: each node runs a Query Planner, Query Coordinator, and Query Exec Engine alongside an HDFS DN and HBase
Query flow shown across the builds: SQL Request, Local Direct Reads, In Memory Transfers, SQL Results
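The query flow in this diagram (SQL request, local reads on each node, in-memory transfer of partial results, SQL results) can be pictured with a toy scatter-gather model in Python. This is a minimal sketch under stated assumptions, not Impala's implementation: the node names, the sample data, and the functions `plan`, `execute_fragment`, and `coordinate` are all hypothetical, and the SQL string is carried along for illustration only and never parsed.

```python
from collections import Counter

# Hypothetical cluster: each node holds its own local slice of a clicks table.
LOCAL_DATA = {
    "node1": [("home", 3), ("search", 5)],
    "node2": [("search", 2), ("booking", 1)],
    "node3": [("home", 4), ("booking", 6)],
}


def plan(query):
    """Toy planner: one scan-and-aggregate fragment per node that holds data."""
    return [{"node": node, "query": query} for node in sorted(LOCAL_DATA)]


def execute_fragment(fragment):
    """Toy exec engine: read only the node's local rows and pre-aggregate them."""
    partial = Counter()
    for page, clicks in LOCAL_DATA[fragment["node"]]:
        partial[page] += clicks
    return partial


def coordinate(query):
    """Toy coordinator: fan fragments out, then merge the partial results."""
    total = Counter()
    for fragment in plan(query):
        # Merging the returned Counter stands in for an in-memory transfer.
        total.update(execute_fragment(fragment))
    return dict(total)


result = coordinate("SELECT page, SUM(clicks) FROM clicks GROUP BY page")
print(sorted(result.items()))
```

The point of the sketch is the shape of the work: every node aggregates what it stores, and only the small partial results are combined at the coordinator.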

Advantages of Our Approach
• No high-latency MapReduce batch processing
• Local processing avoids network bottlenecks
• No costly data format conversion overhead
• All data immediately query-able
• Single machine pool to scale
• All machines available to both Impala and MapReduce
• Single, open, and unified metadata and scheduler
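To make the "local processing avoids network bottlenecks" point concrete, the following Python sketch estimates how many bytes cross the network for one aggregate query under two models. It is a deliberately simplified illustration: the function `bytes_over_network`, the four-node cluster, and all sizes are hypothetical, and the remote model is assumed to pull every scanned byte to a separate query tier.

```python
def bytes_over_network(node_scan_bytes, partial_result_bytes, local_processing):
    """Estimate bytes crossing the network for one aggregate query.

    node_scan_bytes: bytes scanned on each data node
    partial_result_bytes: size of each node's pre-aggregated result
    """
    if local_processing:
        # Each node scans its own disks; only small partial results travel.
        return partial_result_bytes * len(node_scan_bytes)
    # Assumed remote query tier: every scanned byte is pulled across the network.
    return sum(node_scan_bytes)


GB = 10**9
MB = 10**6
scans = [50 * GB] * 4  # hypothetical: four nodes, 50 GB scanned on each

remote = bytes_over_network(scans, 10 * MB, local_processing=False)
local = bytes_over_network(scans, 10 * MB, local_processing=True)
print(remote // GB, "GB vs", local // MB, "MB")  # 200 GB vs 40 MB
```

With these made-up figures the remote model moves 200 GB while the local model moves 40 MB; real ratios depend on the query and the data.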

[Diagram: three alternative approaches, labeled MapReduce, Remote Query, and Side Storage. Component labels: Hive, MR, Query Node, Query Engine, NN, DN, HDFS]

Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop
• Get answers as fast as you can ask questions
• Interactive analytics directly on source data
• No jumping between data silos
• Reduce duplicate storage with EDW
• Reduce data movement for interactive analysis
• Leverage existing tools and employee skills
• Ask questions of all your data
• No information loss from aggregation or conforming to relational schemas for analysis

• Single metadata store from origination through analysis
• No need to hunt through multiple data silos

Cloudera powers real-time data hub
The Challenge:
• Needs to understand 2 years of clickstream data for greater insight
• Legacy system cannot scale for data processing and analytics
So Expedia can optimize data-driven search results for end users and maximize its Google AdWords spend.

The Solution:
• Cloudera Enterprise – 4 petabytes
• A single scalable platform for big data: archive, ETL & analytics with real-time BI
• Running Impala
