Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights
Data Science on Hadoop:
How Cloudera Impala Unlocks New
Productivity and Insights
Justin Erickson | Product Manager
Marcel Kornacker | Software Engineer
Ravikumar Visweswara | Software Engineer
October 2012
Why Data Scientists Love Hadoop

  •   Massive volumes of data
  •   Data preparation & analytics in 1 environment
  •   Highly flexible environment for creating & testing machine learning models
  •   10% the cost/TB under management
Hadoop Use Cases Moving to Real-Time

[Survey chart with three columns, labeled:]
  •   Already query Hadoop using Hive
  •   Already load data into CDH every 90 mins or less
  •   Already use HBase for real-time data access

Source: Cloudera customer survey August 2012
But Hadoop Isn’t Fast Enough

[Survey chart with three columns, labeled:]
  •   Need faster queries on Hadoop data
  •   Move data from Hadoop to RDBMS for interactive SQL
  •   See value today in consolidating to a single platform

Source: Cloudera customer survey August 2012
Beyond Batch – The Next Stage for Hadoop

HADOOP TODAY IS TOO SLOW
  •   MapReduce is batch
  •   Simple queries can take minutes or tens of minutes

CURRENT DATA MANAGEMENT IS TOO COMPLEX
  •   Optimized for rigid schemas & special purpose applications
  •   Redundant data storage & processes
  •   Very expensive systems: $20K-150K / TB
Cloudera Enterprise RTQ
Real-Time Query for Data Stored in Hadoop
Powered by Cloudera Impala.
  •   Supports Hive SQL
  •   4-30X faster than Hive over MapReduce
  •   Supports multiple storage engines & file formats
  •   Uses existing drivers, integrates with existing metastore, works with leading BI tools
  •   Flexible, cost-effective, no lock-in
  •   Deploy & operate with Cloudera Manager
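
As a rough illustration of what the 4-30X figure would mean for queries that take "minutes / tens of minutes" today, the arithmetic can be sketched in Python. The function name and the 10-minute baseline are illustrative assumptions, not figures from the deck, and the sketch assumes the speedup range applies uniformly, which real workloads will not.

```python
def impala_runtime_range(hive_seconds, min_speedup=4.0, max_speedup=30.0):
    """Return (best_case, worst_case) runtimes in seconds for a query that
    takes `hive_seconds` under Hive over MapReduce.

    Assumes the slide's 4-30X range applies uniformly; actual speedups
    depend on the query and the data.
    """
    if hive_seconds < 0:
        raise ValueError("runtime must be non-negative")
    return hive_seconds / max_speedup, hive_seconds / min_speedup


# Hypothetical 10-minute Hive query (600 seconds).
best, worst = impala_runtime_range(600)
print(f"{best:.0f}s to {worst:.0f}s")  # prints "20s to 150s"
```

Under these assumptions, a ten-minute batch query would land somewhere between 20 seconds and two and a half minutes.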
Cloudera Now Powered by Impala
[Diagram: the platform stack BEFORE IMPALA and WITH IMPALA. Labels: USER INTERFACE, BATCH PROCESSING, REAL-TIME ACCESS.]

  •   Unified Storage:
        Supports HDFS and HBase
        Flexible file formats
  •   Unified Metastore
  •   Unified Security
  •   Unified Client Interfaces:
        ODBC, SQL syntax, Hue Beeswax

  •   With Impala:
        Real-time SQL queries
        Native distributed query engine
        Optimized for low latency
  •   Provides:
        Answers as fast as you can ask
        A way for everyone to ask questions of all data
        Big data storage and analytics together
Cloudera Impala Details

[Architecture diagram, built up over six slides. A SQL App connects over ODBC. Shared services: Hive Metastore, YARN, HDFS NN, State Store. Three identical nodes each run a Query Planner, Query Coordinator, and Query Exec Engine alongside an HDFS DN and HBase.]

The build slides add these callouts in order:
  •   Common Hive SQL and interface (SQL Request from the SQL App over ODBC)
  •   Unified metadata and scheduler (Hive Metastore, YARN, HDFS NN, State Store)
  •   Fully MPP Distributed (across the Query Planner / Query Coordinator / Query Exec Engine nodes)
  •   Local Direct Reads (at the HDFS DN and HBase layer of each node)
  •   In Memory Transfers, then SQL Results returned to the SQL App
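
The flow these slides describe (one node plans and coordinates, every node executes its fragment against the data it stores, and partial results are merged in memory) can be sketched in Python. This is a toy, single-process simulation under stated assumptions: the node names, rows, and function names are hypothetical, the fragments run one after another rather than in parallel, and nothing here is Impala's actual code or API. It mimics a query such as a per-country sum of clicks.

```python
from collections import Counter

# Hypothetical data layout: each node stores its own slice of a clickstream
# table as (country, clicks) rows. Node names and rows are made up.
NODE_DATA = {
    "node1": [("US", 3), ("DE", 1)],
    "node2": [("US", 2), ("FR", 4)],
    "node3": [("DE", 5), ("FR", 1)],
}


def execute_fragment(rows):
    """Runs on each node: aggregate only the rows stored locally."""
    partial = Counter()
    for country, clicks in rows:
        partial[country] += clicks
    return partial


def coordinate(node_data):
    """Runs on the node that received the query: hand the fragment to
    every node, then merge the partial results in memory."""
    total = Counter()
    for rows in node_data.values():
        total.update(execute_fragment(rows))  # Counter.update adds counts
    return dict(total)


result = coordinate(NODE_DATA)
print(sorted(result.items()))  # [('DE', 6), ('FR', 5), ('US', 5)]
```

The point of the design is that only the small partial aggregates cross the network; the raw rows are read where they are stored.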
Advantages of Our Approach
•   No high-latency MapReduce batch processing
•   Local processing avoids network bottlenecks
•   No costly data format conversion overhead
•   All data immediately query-able
•   Single machine pool to scale
•   All machines available to both Impala and MapReduce
•   Single, open, and unified metadata and scheduler

[Diagram contrasting three alternative approaches, labeled MapReduce (Query Node, Hive, MR, DN, HDFS), Remote Query (three Query Nodes, NN, three DNs), and Side Storage (Query Engine, MR, DN).]
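
To illustrate why "local processing avoids network bottlenecks", here is a minimal Python sketch of locality-aware scan assignment: each data block is given to a query node that already holds a replica, and a remote read happens only when no such node exists. The function, block IDs, and node names are hypothetical, and this greedy rule is an illustration of the idea, not Impala's scheduler.

```python
def assign_scans(block_replicas, query_nodes):
    """Assign each block to a query node holding a local replica, picking
    the least-loaded candidate; fall back to any query node (a remote
    read) when no replica is local."""
    load = {node: 0 for node in query_nodes}
    assignment = {}
    for block, replicas in sorted(block_replicas.items()):
        local = [n for n in replicas if n in load]
        candidates = local or list(query_nodes)
        # Break ties on node name so the result is deterministic.
        chosen = min(candidates, key=lambda n: (load[n], n))
        assignment[block] = (chosen, chosen in replicas)
        load[chosen] += 1
    return assignment


# Hypothetical cluster: block -> nodes holding a replica.
blocks = {
    "blk_1": ["dn1", "dn2"],
    "blk_2": ["dn2", "dn3"],
    "blk_3": ["dn1", "dn3"],
    "blk_4": ["dn9"],  # replica lives on a node that runs no query engine
}
plan = assign_scans(blocks, ["dn1", "dn2", "dn3"])
for block, (node, is_local) in plan.items():
    print(block, node, "local" if is_local else "remote")
```

When the query engine runs on every data node, as in the single machine pool the slide describes, the remote fallback should rarely be needed.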
Cloudera Impala Demo
Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop
•   Get answers as fast as you can ask questions
•   Interactive analytics directly on source data
•   No jumping between data silos
•   Reduce duplicate storage with EDW
•   Reduce data movement for interactive analysis
•   Leverage existing tools and employee skills
•   Ask questions of all your data
•   No information loss from aggregation or conforming to relational schemas for analysis

•   Single metadata store from origination through analysis
•   No need to hunt through multiple data silos
Cloudera powers real-time data hub
The Challenge:
•   Needs to understand 2 years of clickstream data for greater insight
•   Legacy system cannot scale for data processing and analytics

So Expedia can optimize end-user, data-driven search results and maximize Google AdWords spend.

The Solution:
•   Cloudera Enterprise – 4 petabytes
•   One single scalable platform for big data for archive, ETL & analytics with real-time BI
•   Running Impala
Validated Beta Partners

Editor's Notes

  • #19: Expedia’s use case for Impala: As the world’s leading online travel provider, Expedia’s business requires a fine-tuned website that understands what its visitors want and can deliver results to partner hotels, airlines and other travel vendors. Expedia has historically used traditional relational data warehouses to capture and analyze the clickstream data generated to, from and within its website, but saw the value in capturing greater volumes of historical, detailed data with Hadoop. The goal: to better understand the keyword conversions driving traffic to the site in order to optimize Google AdWords spend.

    Today, Expedia uses Hadoop to power its full data lifecycle: data is collected from online activity, loaded into Hadoop, scored and analyzed, and that data generates scoring engines that shape the recommendations, search results and sort orders on Expedia.com.

    Most recently, Expedia has kicked off a project using HBase and Impala for real-time BI that will power its Market Manager, an interactive application used by merchants such as hotels to see how Expedia is performing vs. competitors. For example, if a hotel notices it isn’t getting many bookings through Expedia around Christmastime, it can drill into the application to find out why: are its prices too high, or is it running low on inventory for certain dates? With this solution, Expedia can glean these insights and proactively reach out to merchants with recommendations on how they might drive greater bookings.

    Impala will allow Expedia’s business users to access Hadoop in a more interactive, ad hoc, speed-of-thought manner. Latency will be cut in half, and Impala provides an extensible solution that will scale with the growth of the business.