November 2011
How Apache Hadoop is Revolutionizing
Business Intelligence and Data Analytics
Dr. Amr Awadallah | Founder, CTO, VP of Engineering
aaa@cloudera.com, twitter: @awadallah
Business Intelligence Before Adopting Apache Hadoop


Data stack, top to bottom:
• BI Reports + Interactive Apps
• RDBMS (processed data)
• ETL Compute Grid
• Storage Only Grid (original raw data) – Mostly Append
• Collection
• Instrumentation

Callouts:
• Can’t Explore Original High Fidelity Raw Data
• Moving Data To Compute Doesn’t Scale
• Archiving = Premature Data Death


                            ©2011 Cloudera, Inc. All Rights Reserved.
Business Intelligence After Adopting Apache Hadoop

Data stack, top to bottom:
• BI Reports + Interactive Apps
• RDBMS
• Hadoop: Storage + Compute Grid (ETL and Aggregations, Complex Data Processing)
• Collection
• Instrumentation

Callouts:
• Data Exploration & Advanced Analytics
• Keep Data Alive Forever


So What is Apache Hadoop?
• A scalable fault-tolerant distributed system for data storage
  and processing (open source under the Apache license).

• Core Hadoop has two main components:
     – Hadoop Distributed File System: self-healing high-bandwidth
       clustered storage.
     – MapReduce: fault-tolerant distributed processing.

• Key business values:
     –   Flexible – Store any data, Run any analysis (Mine First, Govern Later).
     –   Scalable – Start at 1TB/3-nodes then grow to petabytes/1000s of nodes.
     –   Affordable – Cost per TB at a fraction of traditional options.
     –   Open Source – No Lock-In, Rich Ecosystem, Large developer community.
     –   Broadly adopted – A large and active ecosystem, Proven to run at scale.
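The MapReduce component above can be pictured with a minimal in-memory sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and everything runs in one process, whereas the real framework distributes each phase across nodes and handles failures.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `run_job(["store any data", "run any analysis"])` counts each word across both lines; the same three-phase shape applies when the input is terabytes spread over a cluster.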



The Main Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
•   Schema must be created before data is loaded
•   Explicit load operation has to take place which transforms data to DB internal structure
•   New columns must be added explicitly before data for such columns can be loaded into the database
Benefits:
•   Read is Fast
•   Standards/Governance

Schema-on-Read (Hadoop):
•   Data is simply copied to the file store, no transformation is needed
•   A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns
•   New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
Benefits:
•   Load is Fast
•   Flexibility/Agility
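A small sketch of the schema-on-read idea, under stated assumptions: the raw store is a plain Python list of JSON lines, and `serde_v1`, `serde_v2` and `read_table` are hypothetical names standing in for a real SerDe. It only illustrates how a field that was stored all along appears retroactively once the deserializer learns to parse it.

```python
import json

# Raw events are kept exactly as they arrived; nothing is transformed on load.
RAW_STORE = [
    '{"user": "u1", "url": "/home"}',
    '{"user": "u2", "url": "/cart", "referrer": "ad-42"}',
]


def serde_v1(line):
    """Hypothetical deserializer: extracts only the columns the first schema knew."""
    event = json.loads(line)
    return {"user": event.get("user"), "url": event.get("url")}


def serde_v2(line):
    """Updated deserializer: also parses a field that was already being stored."""
    row = serde_v1(line)
    row["referrer"] = json.loads(line).get("referrer")
    return row


def read_table(store, serde):
    """Apply the schema at read time, one stored line at a time."""
    return [serde(line) for line in store]
```

Reading `RAW_STORE` with `serde_v1` yields two columns per row; reading the same untouched data with `serde_v2` adds `referrer`, with `None` for rows that never carried it. No reload of the stored data is involved.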


What is Complex Data Processing?
1. Java MapReduce: Most flexibility and performance, but tedious
   development cycle (the “assembly language” of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in
   any programming language of your choice, but slightly lower
   performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java.
4. Pig Latin: A high-level language out of Yahoo, suitable for batch
   data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore
   mapping files to their schemas and associated SerDes.
6. Oozie: An XML (hPDL) workflow engine that enables creating a
   workflow of jobs composed of any of the above.
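To make the Streaming option concrete, here is a sketch of a mapper and reducer written against the Streaming contract: each stage reads text lines and writes tab-separated key/value lines, and the reducer receives its input sorted by key. The log format (`user url` per line) and the function names are assumptions for illustration only.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'user<TAB>1' per log line; assumes a hypothetical 'user url' format."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"


def reducer(sorted_lines):
    """Sum counts per key; relies on the input arriving sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def main(stage, stdin=sys.stdin, stdout=sys.stdout):
    """Run one stage as a filter over stdin/stdout, the way a Streaming task would."""
    step = mapper if stage == "map" else reducer
    for out in step(stdin):
        stdout.write(out + "\n")
```

Locally, the pipeline can be imitated by sorting the mapper output before handing it to the reducer, for example `list(reducer(sorted(mapper(log_lines))))`. On a cluster the framework performs that sort between the two stages.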


What This Means For You: Agility

Up Front Design vs. Just in Time




What This Means For You: Innovation

Data Committee vs. Data Scientist




What This Means For You: Consolidation

Silos vs. Sharing




What This Means For You: Extract Value from Latent Data


Archive to Tape vs. Keep Data Alive




What This Means For You: Ability to Grow Fluidly




What This Means For You: Data Beats Algorithm


Smarter Algos vs. More Data




Where Does Hadoop Fit in the Enterprise Data Stack?


Users and their tools:
• Data Scientists: IDEs (Development Tools)
• Analysts: BI, Analytics (Business Intelligence Tools)
• Business Users: Enterprise Reporting (Business Intelligence Tools)
• System Operators: Cloudera Mgmt Suite
• Data Architects: ETL Tools
• Customers: Web Application

Data sources: Logs, Files, Web Data, Relational Databases

Connected systems: Enterprise Data Warehouse, Low-Latency Serving Systems


Use The Right Tool For The Right Job
Relational Databases, use when:
•   Interactive OLAP Analytics (<1sec)
•   Multistep ACID Transactions
•   100% SQL Compliance

Hadoop, use when:
•   Structured or Not (Flexibility)
•   Scalability of Storage/Compute
•   Complex Data Processing


Two Core Use Cases Common Across Many Industries


Industry          Use Case: ADVANCED ANALYTICS    Use Case: DATA PROCESSING
Web               Social Network Analysis         Clickstream Sessionization
Media             Content Optimization            Clickstream Sessionization
Telco             Network Analytics               Mediation
Retail            Loyalty & Promotions            Data Factory
Financial         Fraud Analysis                  Trade Reconciliation
Federal           Entity Analysis                 SIGINT
Bioinformatics    Sequencing Analysis             Genome Mapping
Manufacturing     Product Quality                 Mfg Process Tracking



CDH: Cloudera’s Distribution Including Apache Hadoop

The #1 commercial and non-commercial Apache Hadoop distribution.
Components:
•     File System Mount: FUSE-DFS
•     UI Framework/SDK: HUE
•     Data Mining: APACHE MAHOUT
•     Workflow: APACHE OOZIE
•     Scheduling: APACHE OOZIE
•     Metadata: APACHE HIVE
•     Languages / Compilers: APACHE PIG, APACHE HIVE
•     Data Integration: APACHE FLUME, APACHE SQOOP
•     Fast Read/Write Access: APACHE HBASE
•     Coordination: APACHE ZOOKEEPER

•     Open Source – 100% Apache licensed, 100% Open Source, 100% Free, No Forks.
•     Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA.
•     Proven at Scale – Deployed at hundreds of enterprises across many industries.
•     Integrated – All required component versions & dependencies are managed for you.
•     Industry Standard – Existing RDBMS, ETL and BI systems work best with it.
•     Many Form Factors – Public Cloud, Private Cloud, RHEL, Ubuntu, 32/64bit, etc.


CDH Integrates with Existing IT Infrastructure

BI/Analytics, ETL, Databases, Cloud/OS, Hardware

Cloudera’s Distribution including Apache Hadoop



What is Cloudera Enterprise?
Cloudera Enterprise makes open source Apache Hadoop enterprise-easy:
• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency & Control of Hadoop
• Leverage the Experience of Our Experts

CLOUDERA ENTERPRISE COMPONENTS
• Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
• Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs

3 of the top 5 telecommunications, mobile services, defense & intelligence,
banking, media and retail organizations depend on Cloudera

EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production



SCM Express: Simplifies Installation and Configuration

Service & Configuration Manager (SCM) Express takes the complexity out
of deploying and configuring CDH.

• Provision a complete Hadoop stack in minutes
• Centrally manage system services through a user-friendly interface
• Manages services for up to 50 nodes
• FREE to download

KEY FEATURES
1. Automated, wizard-based installation of the complete Hadoop stack
2. Central, real-time dashboard for configuration management
3. Ability to configure the cluster while it’s running
4. Incorporates comprehensive validation and error checking
5. Automates the expansion of services to new nodes when they come online
What I Would Like You To Remember:
• The Key Benefits of the Apache Hadoop Data Platform:
   – Agility/Flexibility (Enables Exploration/Innovation).
   – Complex Data Processing (Any Language, Any Problem).
   – Scalability of Storage/Compute (Freedom to Grow).
   – Economical Active Archive (Keep All Your Data Alive).

• Cloudera Enterprise enables:
   – Lower the Cost of Management and Administration.
   – Simplify and Accelerate Hadoop Deployment.
   – Increase the Transparency & Control of Hadoop.
   – Firm SLAs on Issue Resolution.

Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanced Data Analytics at Yahoo - Amr Awadallah, Cloudera
