How Apache Hadoop is Revolutionizing
Business Intelligence and Data Analytics

Strata Conference, Sept 22nd 2011, New York, NY

Dr. Amr Awadallah, Founder, CTO, VP of Engineering
aaa@cloudera.com, twitter: @awadallah
Business Intelligence Before Adopting Apache Hadoop

Data flow (bottom to top):
    Instrumentation
    Collection
    Storage Only Grid (original raw data), Mostly Append
    ETL Compute Grid
    RDBMS (processed data)
    BI Reports + Interactive Apps

Problems:
•   Can’t Explore Original High Fidelity Raw Data
•   Moving Data To Compute Doesn’t Scale
•   Archiving = Premature Data Death

                    Copyright © 2011, Cloudera, Inc. All Rights Reserved.             2
Business Intelligence After Adopting Apache Hadoop
Data flow (bottom to top):
    Instrumentation
    Collection
    Hadoop: Storage + Compute Grid (Mostly Append, Keep Data Alive For Ever)
        ETL and Aggregations -> RDBMS -> BI Reports + Interactive Apps
        Complex Data Processing -> Data Exploration & Advanced Analytics

So What is Apache Hadoop?
• A scalable fault-tolerant distributed system for data storage and
  processing (open source under the Apache license)

• Core Hadoop has two main components:
    • Hadoop Distributed File System: self-healing high-bandwidth clustered storage
    • MapReduce: fault-tolerant distributed processing


• Key business values:
    •   Flexible – Store any data, Run any analysis (Mine First, Govern Later)
    •   Scalable – Start at 1TB/3-nodes then grow to petabytes/thousands of nodes
    •   Affordable – Cost per TB at a fraction of traditional options
    •   Open Source – No Lock-In, Rich Ecosystem, Large developer community
    •   Broadly adopted – A large and active ecosystem, Proven to run at scale

The Main Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):
•   Schema must be created before data is loaded
•   Explicit load operation has to take place which transforms data to database internal structure
•   New columns must be added explicitly before data for such columns can be loaded into the database
•   Read is Fast
•   Benefits: Standards/Governance

Schema-on-Read (Hadoop):
•   Data is simply copied to the file store, no special transformation is needed
•   A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns
•   New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them
•   Load is Fast
•   Benefits: Flexibility/Agility
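
To make the schema-on-read side of this comparison concrete, here is a minimal Python sketch. It is not Hive's SerDe interface: the sample records, the field names, and the functions read_v1, read_v2 and scan are all hypothetical. Raw lines are stored untouched and a reader function plays the SerDe's role at read time, so swapping in an updated reader makes a newly parsed column appear for data that was already stored.

```python
import json

# Raw events are kept exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"user": "alice", "url": "/home"}',
    '{"user": "bob", "url": "/cart", "referrer": "ad-17"}',
]


def read_v1(line):
    """First reader: extracts only the columns known at the time."""
    record = json.loads(line)
    return {"user": record["user"], "url": record["url"]}


def read_v2(line):
    """Updated reader: also extracts 'referrer' from the same stored lines."""
    record = json.loads(line)
    return {
        "user": record["user"],
        "url": record["url"],
        "referrer": record.get("referrer"),  # None where it was never logged
    }


def scan(lines, reader):
    """Apply the reader at read time; the stored lines are never rewritten."""
    return [reader(line) for line in lines]


rows_v1 = scan(RAW_LINES, read_v1)
rows_v2 = scan(RAW_LINES, read_v2)
```

Nothing was reloaded between the two scans. Only the reader changed, which is the trade the slide describes: loading is cheap, and the parsing cost is paid on every read instead.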

What is Complex Data Processing?
1. Java MapReduce: Gives the most flexibility and performance,
   but potentially long development cycle (the “assembly
   language” of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in
   any programming language of your choice, but slightly lower
   performance and less flexibility.
3. Pig: A high-level language out of Yahoo, suitable for batch data
   flow workloads.
4. Hive: A SQL interpreter out of Facebook, also includes a meta-
   store mapping files to their schemas and associated SerDe.
5. Oozie: A workflow server engine, driven by XML (hPDL) definitions, that
   enables creating a workflow of jobs composed of any of the above.
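
Option 2 above can be sketched in Python. In Hadoop Streaming the mapper and reducer are separate programs that read lines on standard input and write tab-separated key/value lines on standard output. The sketch below keeps the same line-oriented contract but runs both steps in one process, with an in-memory sort standing in for the framework's shuffle. The word-count task, the sample input, and the names mapper, reducer and run_local are illustrative assumptions, not taken from the talk.

```python
from itertools import groupby


def key_of(pair):
    """Return the key part of a 'key<TAB>value' line."""
    return pair.split("\t", 1)[0]


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_pairs):
    """Sum the counts for each key; input must already be sorted by key."""
    for word, group in groupby(sorted_pairs, key=key_of):
        yield word, sum(int(pair.split("\t", 1)[1]) for pair in group)


def run_local(lines):
    """Local stand-in for the cluster: map, sort by key, then reduce."""
    shuffled = sorted(mapper(lines), key=key_of)
    return dict(reducer(shuffled))


counts = run_local(["Hadoop stores data", "Hadoop processes data"])
```

Because the two steps only exchange text lines, each could be moved into its own script, in any language, without changing the logic.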

What This Means For You: Agility

Up Front Design                                                Just in Time




What This Means For You: Innovation

   Data Committee                                              Data Scientist




What This Means For You: Consolidation

        Silos                                                           Sharing




What This Means For You: Extract Value from Latent Data

  Archive to Tape                                         Keep Data Alive




What This Means For You: Ability to Grow Fluidly
Benefit #2: Scalability




What This Means For You: Data Beats Algorithm

  Smarter Algos                                            More Data




Where Does Hadoop Fit in the Enterprise Data Stack?
[Diagram: Hadoop in the enterprise data stack]
Users: System Operators, Data Architects, Data Scientists, Analysts, Business Users, Customers
Tools: Cloudera Mgmt Suite, ETL Tools, Development Tools (IDEs), Business Intelligence Tools (BI, Analytics, Enterprise Reporting)
Systems: Enterprise Data Warehouse, Low-Latency Serving Systems, Web Application
Data sources: Logs, Files, Web Data, Relational Databases

Use The Right Tool For The Right Job

Relational Databases, use when:
•   Interactive OLAP Analytics (<1sec)
•   Multistep ACID Transactions
•   100% SQL Compliance

Hadoop, use when:
•   Structured or Not (Agility)
•   Scalability of Storage/Compute
•   Complex Data Processing
Two Core Use Cases Common Across Many Industries

Use Case: ADVANCED ANALYTICS                    Use Case: DATA PROCESSING
Application                  Industry           Application
Social Network Analysis      Web                Clickstream Sessionization
Content Optimization         Media              Clickstream Sessionization
Network Analytics            Telco              Mediation
Loyalty & Promotions         Retail             Data Factory
Fraud Analysis               Financial          Trade Reconciliation
Entity Analysis              Federal            SIGINT
Sequencing Analysis          Bioinformatics     Genome Mapping
Product Quality              Manufacturing      Mfg Process Tracking



CDH: Cloudera’s Distribution Including Apache Hadoop
Components:
•   UI Framework: HUE
•   SDK: HUE SDK
•   Workflow: OOZIE
•   Scheduling: OOZIE
•   Metadata: HIVE
•   Languages / Compilers: PIG, HIVE
•   Data Integration: FLUME, SQOOP, ODBC
•   Fast Read/Write Access: HBASE
•   Coordination: ZOOKEEPER




•   Open Source – 100% Apache licensed, 100% Open Source, 100% Free.
•   Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA
•   Integrated – All required component versions & dependencies are managed for you
•   Industry Standard – Existing RDBMS, ETL and BI systems work best with it
•   Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32/64bit, etc

SCM Express: Simplifies Installation and Configuration

    Service & Configuration Manager
    (SCM) Express takes the complexity out of
    deploying and configuring CDH.

•   Provision a complete Hadoop stack in minutes
•   Centrally manage system services through a user-friendly interface
•   Manages services for up to 50 nodes
•   FREE to download


KEY FEATURES
1. Automated, wizard-based installation of the complete Hadoop stack
2. Central, real-time dashboard for configuration management
3. Ability to configure the cluster while it’s running
4. Incorporates comprehensive validation and error checking
5. Automates the expansion of services to new nodes when they come online
What is Cloudera Enterprise?

Cloudera Enterprise makes open source Apache Hadoop enterprise-easy

•   Simplify and Accelerate Hadoop Deployment
•   Reduce Adoption Costs and Risks
•   Lower the Cost of Administration
•   Increase the Transparency & Control of Hadoop
•   Leverage the Experience of Our Experts

CLOUDERA ENTERPRISE COMPONENTS
•   Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
•   Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs

3 of the top 5 telecommunications, mobile services, defense & intelligence,
banking, media and retail organizations depend on Cloudera Enterprise

EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production



Hadoop World 2011

    The largest gathering of Hadoop practitioners, developers,
    business executives, industry luminaries and innovative
    companies in the Hadoop ecosystem.

November 8-9, Sheraton New York Hotel & Towers, NYC
Learn more and register at www.hadoopworld.com
$50 discount for Strata attendees

•    1400 attendees, 25+ sponsors
•    60 sessions across 5 tracks for:
      – Business Decision Makers
      – Enterprise Architects
      – IT Operators
      – Data Scientists
      – Developers
•    Cloudera Training and Certification (November 7, 10, 11)



What I Would Like You To Remember:
• The Key Benefits of the Apache Hadoop Data Platform:
   • Agility/Flexibility (Enables Innovation/Exploration).
   • Complex Data Processing (Any Language, Any Problem).
   • Scalability of Storage/Compute (Freedom to Grow).
   • Economical Active Archive (Keep All Your Data Alive).

• Cloudera Enterprise provides:
   •   Lower Cost of Management and Administration.
   •   Simpler and Faster Hadoop Deployment.
   •   Greater Transparency & Control of Hadoop.
   •   Firm SLAs on Issue Resolution.
Contact Information:



          Amr Awadallah
        aaa@cloudera.com
           650-644-3921
   http://twitter.com/awadallah




Appendix



Hadoop Timeline

Timeline axis: 2002 to 2009. Events in approximate order:
•   Doug Cutting & Mike Cafarella started working on Nutch
•   Google publishes GFS & MapReduce papers
•   Doug Cutting adds DFS & MapReduce support to Nutch
•   Yahoo! hires Cutting, Hadoop spins out of Nutch
•   NY Times converts 4TB of image archives over 100 EC2s
•   Fastest sort of a TB, 3.5mins over 910 nodes
•   Facebook launches Hive: SQL Support for Hadoop
•   Cloudera Founded
•   Fastest sort of a TB, 62secs over 1,460 nodes
•   Sorted a PB in 16.25hours over 3,658 nodes
•   Doug Cutting joins Cloudera
•   Hadoop Summit 2009, 750 attendees


Cloudera’s Track Record
• Customers: Multiple customers with >1,000 Hadoop nodes under management
• Supporting dozens of diverse production use cases including ones that are revenue critical
  with tight SLAs

• Community: years of demonstrated leadership in the Apache Hadoop ecosystem.
  Cloudera employees are:
    • The largest contributor to the Hadoop ecosystem in patches
    • Founders of 70% of the projects in the Apache Hadoop ecosystem including Apache
      Hadoop itself
    • The first to build & integrate what is now the reference Hadoop stack

• Industry: Multiple years of experience providing Hadoop solutions across industries:
    • 2 of the top 5 payments companies run Cloudera
    • 3 of the top 5 commercial banks run Cloudera
    • 2 of the top 4 online travel companies run Cloudera


Cloudera Enterprise Management Suite

Utility: Activity Monitor
    It Helps You…   • Consolidate all user activities into a real-time view
                    • Diagnose user performance
                    • Track activity metrics
    So You Can…     • Improve performance
                    • Improve conformance to SLAs
                    • Improve QOS
    It’s Like…      • MySQL Enterprise Monitor
                    • Quest Foglight for Oracle / SQL Server

Utility: Service & Configuration Manager
    It Helps You…   • Manage system services
                    • Automate changes
                    • Validate settings
                    • 1-click security
    So You Can…     • Lower cost of administration
                    • Improve uptime
    It’s Like…      • Red Hat Satellite Server
                    • Microsoft System Center
                    • Oracle Enterprise Manager

Utility: Resource Manager
    It Helps You…   • Report on the usage of scarce resources
                    • Plan for capacity expansion
    So You Can…     • Improve quality of service
                    • Extend the life of the cluster
    It’s Like…      • VMware vCenter

Utility: Authorization Manager
    It Helps You…   • Centralize management of all users, groups and privileges
                    • Manage permissions via delegated administration
    So You Can…     • Lower the costs of administration
                    • Improve compliance
    It’s Like…      • Teradata security administration




CDH Integrates with Existing IT Infrastructure

   BI/Analytics   ETL                   Databases                 Cloud/OS      Hardware





Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

  • 1. How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics Strata Conference, Sept 22nd 2011, New York, NY Dr. Amr Awadallah, Founder, CTO, VP of Engineering aaa@cloudera.com, twitter: @awadallah
  • 2. Business Intelligence Before Adopting Apache Hadoop BI Reports + Interactive Apps Can’t Explore Original High Fidelity Raw Data RDBMS (processed data) ETL Compute Grid Moving Data To Compute Doesn’t Scale Storage Only Grid (original raw data) Archiving = Mostly Append Premature Collection Data Death Instrumentation Copyright © 2011, Cloudera, Inc. All Rights Reserved. 2
  • 3. Business Intelligence After Adopting Apache Hadoop Data Exploration & BI Reports + Interactive Apps Advanced Analytics RDBMS ETL and Aggregations Complex Data Processing Hadoop: Storage + Compute Grid Mostly Append Keep Data Alive For Ever Collection Instrumentation Copyright © 2011, Cloudera, Inc. All Rights Reserved. 3
  • 4. So What is Apache Hadoop? • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license) • Core Hadoop has two main components: • Hadoop Distributed File System: self-healing high-bandwidth clustered storage • MapReduce: fault-tolerant distributed processing • Key business values: • Flexible – Store any data, Run any analysis (Mine First, Govern Later) • Scalable – Start at 1TB/3-nodes then grow to petabytes/thousands of nodes • Affordable – Cost per TB at a fraction of traditional options • Open Source – No Lock-In, Rich Ecosystem, Large developer community • Broadly adopted – A large and active ecosystem, Proven to run at scale Copyright © 2011, Cloudera, Inc. All Rights Reserved. 4
  • 5. The Main Benefit: Agility/Flexibility
      • Schema-on-Write (RDBMS):
          • Schema must be created before data is loaded
          • Explicit load operation has to take place which transforms data to database internal structure
          • New columns must be added explicitly before data for such columns can be loaded into the database
          • Read is Fast
          • Benefits: Standards/Governance
      • Schema-on-Read (Hadoop):
          • Data is simply copied to the file store, no special transformation is needed
          • A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns
          • New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them
          • Load is Fast
          • Benefits: Flexibility/Agility
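The schema-on-read idea on slide 5 can be illustrated with a small sketch. This is plain Python, not Hive's SerDe API: the `read_with_schema` function, the JSON record layout, and the field names are all hypothetical stand-ins chosen for illustration. The raw lines are stored untouched, and the "schema" is only the list of columns the reader asks for at read time.

```python
import json

# Raw events are kept exactly as they arrived; "loading" is just a copy.
# (Hypothetical sample data; the field names are assumptions.)
RAW_LINES = [
    '{"user": "alice", "url": "/home", "ms": 120}',
    '{"user": "bob", "url": "/cart", "ms": 340, "referrer": "ad-17"}',
    '{"user": "alice", "url": "/checkout", "ms": 95, "referrer": "email"}',
]


def read_with_schema(lines, columns):
    """Apply a schema at read time: parse each raw line, keep only `columns`.

    This plays the role of a deserializer in the sketch; it is not a real SerDe.
    """
    for line in lines:
        record = json.loads(line)
        # Fields absent from a record come back as None instead of failing.
        yield tuple(record.get(col) for col in columns)


# First version of the reader knows three columns.
rows_v1 = list(read_with_schema(RAW_LINES, ["user", "url", "ms"]))

# Later the reader is updated to also parse "referrer". Nothing is reloaded,
# yet the column shows up for every stored record that already carried it.
rows_v2 = list(read_with_schema(RAW_LINES, ["user", "url", "ms", "referrer"]))

for row in rows_v2:
    print(row)
```

The point of the sketch is the second call: changing only the reader makes the extra column visible retroactively, which is the behavior the slide attributes to updating the SerDe.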
  • 6. What is Complex Data Processing?
      1. Java MapReduce: Gives the most flexibility and performance, but potentially long development cycle (the “assembly language” of Hadoop).
      2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility.
      3. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.
      4. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDe.
      5. Oozie: A PDL XML workflow server engine that enables creating a workflow of jobs composed of any of the above.
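To make the map, shuffle/sort, and reduce steps behind items 1 and 2 on slide 6 concrete, here is a minimal word-count sketch that runs locally in plain Python. It does not use Hadoop: `mapper`, `reducer`, and `run_local` are hypothetical names, and `sorted()` stands in for the shuffle/sort that the framework would perform between the two phases. In an actual Streaming job the mapper and reducer would be separate programs reading lines from standard input and writing key/value lines to standard output.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts of each run of identical keys.

    Assumes the pairs arrive sorted by key, as they would after the shuffle.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Run map, then a local sort standing in for the shuffle, then reduce."""
    return dict(reducer(sorted(mapper(lines))))


counts = run_local(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

Because the reducer only ever sees one key's values at a time, the same two functions could in principle be spread over many machines, which is the property that lets the framework scale the job out.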
  • 7. What This Means For You: Agility (Up Front Design vs. Just in Time)
  • 8. What This Means For You: Innovation (Data Committee vs. Data Scientist)
  • 9. What This Means For You: Consolidation (Silos vs. Sharing)
  • 10. What This Means For You: Extract Value from Latent Data (Archive to Tape vs. Keep Data Alive)
  • 11. What This Means For You: Ability to Grow Fluidly (Benefit #2: Scalability)
  • 12. What This Means For You: Data Beats Algorithm (Smarter Algos vs. More Data)
  • 13. Where Does Hadoop Fit in the Enterprise Data Stack?
      [Diagram. Labels recoverable from the slide:]
      • People: Data Scientists; Analysts; Business Users; System Operators; Data Architects; Customers
      • Tools and systems: IDEs (Development Tools); BI, Analytics (Business Intelligence Tools); Enterprise Reporting; Cloudera Mgmt Suite; ETL Tools; Enterprise Data Warehouse; Low-Latency Serving Systems; Web Application
      • Data sources: Logs; Files; Web Data; Relational Databases
  • 14. Use The Right Tool For The Right Job
      • Relational Databases: Use when:
          • Interactive OLAP Analytics (<1sec)
          • Multistep ACID Transactions
          • 100% SQL Compliance
      • Hadoop: Use when:
          • Structured or Not (Agility)
          • Scalability of Storage/Compute
          • Complex Data Processing
  • 15. Two Core Use Cases Common Across Many Industries
      Each row: Industry: Advanced Analytics application / Data Processing application
      • Web: Social Network Analysis / Clickstream Sessionization
      • Media: Content Optimization / Clickstream Sessionization
      • Telco: Network Analytics / Mediation
      • Retail: Loyalty & Promotions / Data Factory
      • Financial: Fraud Analysis / Trade Reconciliation
      • Federal: Entity Analysis / SIGINT
      • Bioinformatics: Sequencing Analysis / Genome Mapping
      • Manufacturing: Product Quality / Mfg Process Tracking
  • 16. CDH: Cloudera’s Distribution Including Apache Hadoop
      • Components by role:
          • UI Framework: HUE
          • SDK: HUE SDK
          • Workflow: OOZIE
          • Scheduling: OOZIE
          • Metadata: HIVE
          • Languages / Compilers: PIG, HIVE
          • Data Integration: FLUME, SQOOP, ODBC
          • Fast Read/Write Access: HBASE
          • Coordination: ZOOKEEPER
      • Open Source – 100% Apache licensed, 100% Open Source, 100% Free.
      • Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA
      • Integrated – All required component versions & dependencies are managed for you
      • Industry Standard – Existing RDBMS, ETL and BI systems work best with it
      • Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32/64bit, etc
  • 17. SCM Express: Simplifies Installation and Configuration
      Service & Configuration Manager (SCM) Express takes the complexity out of deploying and configuring CDH.
      • Provision a complete Hadoop stack in minutes
      • Centrally manage system services through a user-friendly interface
      • Manages services for up to 50 nodes
      • FREE to download
      KEY FEATURES
      1. Automated, wizard-based installation of the complete Hadoop stack
      2. Central, real-time dashboard for configuration management
      3. Ability to configure the cluster while it’s running
      4. Incorporates comprehensive validation and error checking
      5. Automates the expansion of services to new nodes when they come online
  • 18. What is Cloudera Enterprise?
      Cloudera Enterprise makes open source Apache Hadoop enterprise-easy
      • Simplify and Accelerate Hadoop Deployment
      • Reduce Adoption Costs and Risks
      • Lower the Cost of Administration
      • Increase the Transparency & Control of Hadoop
      • Leverage the Experience of Our Experts
      CLOUDERA ENTERPRISE COMPONENTS
      • Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
      • Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
      3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera Enterprise
      • EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
      • EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production
  • 19. Hadoop World 2011
      The largest gathering of Hadoop practitioners, developers, business executives, industry luminaries and innovative companies in the Hadoop ecosystem.
      • November 8-9, Sheraton New York Hotel & Towers, NYC
      • 1400 attendees, 25+ sponsors
      • 60 sessions across 5 tracks for:
          – Business Decision Makers
          – Enterprise Architects
          – IT Operators
          – Data Scientists
          – Developers
      • Cloudera Training and Certification (November 7, 10, 11)
      • Learn more and register at www.hadoopworld.com
      • $50 discount for Strata attendees
  • 20. What I Would Like You To Remember:
      • The Key Benefits of the Apache Hadoop Data Platform:
          • Agility/Flexibility (Enables Innovation/Exploration).
          • Complex Data Processing (Any Language, Any Problem).
          • Scalability of Storage/Compute (Freedom to Grow).
          • Economical Active Archive (Keep All Your Data Alive).
      • Cloudera Enterprise enables:
          • Lower the Cost of Management and Administration.
          • Simplify and Accelerate Hadoop Deployment.
          • Increase the Transparency & Control of Hadoop.
          • Firm SLAs on Issue Resolution.
  • 21. Contact Information: Amr Awadallah, aaa@cloudera.com, 650-644-3921, http://twitter.com/awadallah
  • 23. Appendix
  • 24. Hadoop Timeline
      [Timeline graphic with an axis running from 2002 to 2009. Events shown:]
      • Doug Cutting & Mike Cafarella started working on Nutch
      • Google publishes GFS & MapReduce papers
      • Doug Cutting adds DFS & MapReduce support to Nutch
      • Yahoo! hires Cutting, Hadoop spins out of Nutch
      • NY Times converts 4TB of image archives over 100 EC2s
      • Fastest sort of a TB, 3.5mins over 910 nodes
      • Facebook launches Hive: SQL Support for Hadoop
      • Cloudera Founded
      • Fastest sort of a TB, 62secs over 1,460 nodes
      • Sorted a PB in 16.25hours over 3,658 nodes
      • Doug Cutting joins Cloudera
      • Hadoop Summit 2009, 750 attendees
  • 25. Cloudera’s Track Record
      • Customers: Multiple customers with >1,000 Hadoop nodes under management
          • Supporting dozens of diverse production use cases including ones that are revenue critical with tight SLAs
      • Community: years of demonstrated leadership in the Apache Hadoop ecosystem. Cloudera employees are:
          • The largest contributor to the Hadoop ecosystem in patches
          • Founders of 70% of the projects in the Apache Hadoop ecosystem including Apache Hadoop itself
          • The first to build & integrate what is now the reference Hadoop stack
      • Industry: Multiple years of experience providing Hadoop solutions across industries:
          • 2 of the top 5 payments companies run Cloudera
          • 3 of the top 5 commercial banks run Cloudera
          • 2 of the top 4 online travel companies run Cloudera
  • 26. Cloudera Enterprise Management Suite
      • Activity Monitor
          • It Helps You…: Consolidate all user activities into a real-time view; Diagnose user performance; Track activity metrics
          • So You Can…: Improve performance; Improve conformance to SLAs; Improve QOS
          • It’s Like…: MySQL Enterprise Monitor; Quest Foglight for Oracle / SQL Server
      • Service & Configuration Manager
          • It Helps You…: Manage system services; Automate changes; Validate settings; 1-click security
          • So You Can…: Lower cost of administration; Improve uptime
          • It’s Like…: Red Hat Satellite Server; Microsoft System Center; Oracle Enterprise Manager
      • Resource Manager
          • It Helps You…: Report on the usage of scarce resources; Plan for capacity expansion
          • So You Can…: Improve quality of service; Extend the life of the cluster
          • It’s Like…: VMware vCenter
      • Authorization Manager
          • It Helps You…: Centralize management of all users, groups and privileges; Manage permissions via delegated administration
          • So You Can…: Lower the costs of administration; Improve compliance
          • It’s Like…: Teradata security administration
  • 27. CDH Integrates with Existing IT Infrastructure: BI/Analytics, ETL, Databases, Cloud/OS, Hardware