Apache Hadoop & the Cloud


Jim Walker
Dir. Product Marketing, Hortonworks
Twitter @jaymce

July 10, 2012




© Hortonworks Inc. 2012
1941


               2012
Big data market segments

Hardware
•  Storage
•  Servers
•  Networking

Software Distributions (cloud)
•  OSS Apache Hadoop
•  Enterprise Distributions
•  Non-Hadoop big data frameworks

ETL & Mgmt (cloud)
•  Distributed file stores
•  NoSQL databases
•  Data integration
•  Data quality & governance

Analytics (cloud)
•  Analytic application development platforms
•  Advanced analytics applications

Applications (cloud)
•  Data visualization tools
•  Business intelligence applications

Services
•  Consulting
•  Training
•  Tech support
•  Software maintenance
•  Hardware maintenance
•  Hosting

Next Generation Data Warehouse
•  MPP columnar data warehouse appliances
•  In-memory analytics engines
•  Fast data loading

Analytics started with basic purchase history…




Megabytes    ERP    Purchase detail; Purchase record; Payment record

Horizontal axis: Increasing Data Variety and Complexity

Source: Created in conjunction with Teradata, Inc.

then we added customer information…




Gigabytes    CRM    Segmentation; Customer Touches; Support Contacts; Offer details
Megabytes    ERP    Purchase detail; Purchase record; Payment record

Horizontal axis: Increasing Data Variety and Complexity

Source: Created in conjunction with Teradata, Inc.

and the web started to impact…




Terabytes    WEB    Web logs; A/B testing; Behavioral Targeting; Dynamic Pricing; Search Marketing; Affiliate Networks; Dynamic Funnels; Offer history
Gigabytes    CRM    Segmentation; Customer Touches; Support Contacts; Offer details
Megabytes    ERP    Purchase detail; Purchase record; Payment record

Horizontal axis: Increasing Data Variety and Complexity

Source: Created in conjunction with Teradata, Inc.

Big data changes the game

Transactions + Interactions + Observations = BIG DATA

Petabytes    BIG DATA    Mobile Web; Sentiment; User Click Stream; SMS/MMS; Speech to Text; Social Interactions & Feeds; Spatial & GPS Coordinates; Sensors / RFID / Devices; Business Data Feeds; External Demographics; User Generated Content; HD Video, Audio, Images; Product/Service Logs
Terabytes    WEB         Web logs; A/B testing; Behavioral Targeting; Dynamic Pricing; Search Marketing; Affiliate Networks; Dynamic Funnels; Offer history
Gigabytes    CRM         Segmentation; Customer Touches; Support Contacts; Offer details
Megabytes    ERP         Purchase detail; Purchase record; Payment record

Horizontal axis: Increasing Data Variety and Complexity

Source: Created in conjunction with Teradata, Inc.

Next-gen data architecture drivers


Business Drivers
•     Enable new business models & drive faster growth (20%+)
•     Find insights for competitive advantage & optimal returns

Technical Drivers
•     Data continues to grow exponentially
•     Data is increasingly everywhere and in many formats
•     Legacy solutions unfit for new requirements and growth

Financial Drivers
•     Cost of data systems, as % of IT spend, continues to grow
•     Cost advantages of commodity hardware & open source

Apache Hadoop
                          Open Source Data Management Software



                          One of the best examples of open source
                          driving innovation and creating a market
                           •  Foundation for big data solutions
                           •  Enables a rational economics model
                           •  Powers data-driven business
                           •  Commodity hardware
                           •  Loosely coupled, ship early/ship often
                           •  Consists of many specialized sub-projects

Apache Hadoop & Cloud Make Sense

•  Broader access to Hadoop for end users, IT professionals, and developers
•  Easy installation and configuration and simplified programming
•  Enterprise-ready distribution with greater security, performance, ease of management and options for Hybrid IT usage
•  Integrate with everything via RESTful API
•  Spin up a cluster on demand
•  Ease management

5 Reasons for Hadoop in the Cloud


People say "should you run Hadoop in the cloud?"

I say "it depends".

http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html

5 Reasons for Hadoop in the Cloud


1   If your data is stored in a cloud, local analysis may make more sense… "work near the data"
2   For periodic processing (nightly, etc.) it might make sense to just rent.
3   No upfront capital expense; fund from success
4   Easier to expand a cluster; no need to buy, just find
5   Eliminate networking concerns

http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
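
Reasons 2 and 3 are at heart an arithmetic comparison between renting capacity for the hours a periodic job runs and buying a cluster up front. A rough sketch of that comparison follows. The function names and every figure passed to them are hypothetical placeholders, not real prices or anything stated in the talk.

```python
def monthly_rental_cost(nodes: int, hours_per_run: float,
                        runs_per_month: int, price_per_node_hour: float) -> float:
    """Cost of renting nodes only for the hours a periodic job actually runs."""
    return nodes * hours_per_run * runs_per_month * price_per_node_hour


def break_even_months(purchase_cost: float, monthly_owned_opex: float,
                      monthly_rent: float) -> float:
    """Months after which an owned cluster becomes cheaper than renting.

    Returns infinity when renting never costs more per month than simply
    operating the owned cluster, so buying never pays back.
    """
    monthly_saving = monthly_rent - monthly_owned_opex
    if monthly_saving <= 0:
        return float("inf")
    return purchase_cost / monthly_saving


# Hypothetical example: a nightly 2-hour job on 10 rented nodes.
rent = monthly_rental_cost(nodes=10, hours_per_run=2, runs_per_month=30,
                           price_per_node_hour=0.5)
months = break_even_months(purchase_cost=30000, monthly_owned_opex=100,
                           monthly_rent=rent)
```

With these made-up inputs the rented cluster is busy only a small fraction of each day, which is the situation where "just rent" tends to win. A cluster that runs around the clock shifts the balance the other way.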

What is Apache Hadoop?

1 PROCESSING – Map/Reduce
                              •    Splits a task across processors “near”
                                   the data & assembles results
                              •    2004 Google white paper
                                   MapReduce: Simplified Data Processing on Large Clusters

                              •    Base of much new tech
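
The split, process and assemble idea can be illustrated with a word count, the customary MapReduce example. This is a single-process sketch in plain Python, not Hadoop's Java API. The function names are invented for illustration, and each inner list stands in for an input split that a real cluster would process on a node near the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: assemble the final count for one word."""
    return key, sum(values)


def word_count(splits: List[List[str]]) -> Dict[str, int]:
    """Run map over every split, then shuffle and reduce the results."""
    mapped = [pair for split in splits for line in split
              for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count([["big data", "big cluster"], ["data data"]])
```

On a real cluster the map calls run in parallel on different machines and the shuffle moves data over the network. The sketch keeps only the data flow.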




2 STORAGE – Hadoop Distributed File System
                              •    Distributed across “nodes”
                              •    Natively redundant
                              •    Name node tracks locations
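
A toy model can make the three storage bullets concrete: files are cut into blocks, each block is copied to several nodes, and a name node remembers where every copy lives. The class below is an illustrative assumption, not HDFS code. The class name, the tiny block size and the round-robin placement are all invented, and real HDFS placement is rack-aware.

```python
from typing import Dict, FrozenSet, List, Tuple


class ToyNameNode:
    """Tracks which data nodes hold each block of each file (toy model)."""

    def __init__(self, data_nodes: List[str], block_size: int = 4,
                 replication: int = 3) -> None:
        if replication > len(data_nodes):
            raise ValueError("replication cannot exceed the number of data nodes")
        self.data_nodes = list(data_nodes)
        self.block_size = block_size
        self.replication = replication
        # path -> list of blocks, each block being the nodes that hold a copy
        self.block_locations: Dict[str, List[List[str]]] = {}
        # node -> {(path, block index): bytes}; stands in for local disks
        self.storage: Dict[str, Dict[Tuple[str, int], bytes]] = {
            node: {} for node in data_nodes
        }
        self._next = 0

    def write(self, path: str, data: bytes) -> None:
        """Split data into blocks and store each block on several nodes."""
        locations = []
        for index, offset in enumerate(range(0, len(data), self.block_size)):
            block = data[offset:offset + self.block_size]
            # Simplified round-robin placement (an assumption of this sketch).
            nodes = [self.data_nodes[(self._next + r) % len(self.data_nodes)]
                     for r in range(self.replication)]
            self._next += 1
            for node in nodes:
                self.storage[node][(path, index)] = block
            locations.append(nodes)
        self.block_locations[path] = locations

    def read(self, path: str, failed: FrozenSet[str] = frozenset()) -> bytes:
        """Reassemble a file, skipping replicas on failed nodes."""
        chunks = []
        for index, nodes in enumerate(self.block_locations[path]):
            alive = [node for node in nodes if node not in failed]
            if not alive:
                raise IOError(f"all replicas of block {index} are unavailable")
            chunks.append(self.storage[alive[0]][(path, index)])
        return b"".join(chunks)


name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
name_node.write("/logs/a.txt", b"hello hadoop")
recovered = name_node.read("/logs/a.txt", failed=frozenset({"dn1", "dn2"}))
```

Because every block has three copies on four nodes, the read still succeeds with two nodes down, which is what "natively redundant" buys.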



Apache Hadoop related projects

3    Hive
4    HBase
5    HCatalog
6    Pig
7    Oozie
8    Ambari
9    Sqoop
10   ZooKeeper

3    Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop
(originally by Facebook) for providing data summarization, ad-hoc query,
and analysis of large datasets. It provides a mechanism to project
structure onto this data and query the data using a SQL-like language
called HiveQL (HQL).

Apache Hadoop related projects

4    HBase
HBase is a non-relational database. It is column-oriented and provides
fault-tolerant storage and quick access to large quantities of sparse
data. It also adds row-level transactional capabilities to Hadoop,
allowing users to conduct updates, inserts and deletes.
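
To show what "sparse" and "updates, inserts and deletes" mean in practice, here is a minimal in-memory sketch of a sparse, column-oriented table. It is not the HBase client API. The class and method names are invented for illustration, and the "family:qualifier" column naming is only borrowed as a convention.

```python
from typing import Dict, Optional


class SparseTable:
    """Stores only the cells that exist; absent columns cost nothing."""

    def __init__(self) -> None:
        # row key -> {column name: value}
        self.rows: Dict[str, Dict[str, str]] = {}

    def put(self, row_key: str, column: str, value: str) -> None:
        """Insert a new cell or update an existing one."""
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key: str, column: str) -> Optional[str]:
        """Return the cell value, or None if the cell was never written."""
        return self.rows.get(row_key, {}).get(column)

    def delete(self, row_key: str, column: str) -> None:
        """Remove one cell; drop the row once it has no cells left."""
        cells = self.rows.get(row_key)
        if cells is not None:
            cells.pop(column, None)
            if not cells:
                del self.rows[row_key]

    def cell_count(self) -> int:
        """Number of cells actually stored across all rows."""
        return sum(len(cells) for cells in self.rows.values())


table = SparseTable()
table.put("user1", "info:email", "a@example.com")
table.put("user1", "info:email", "b@example.com")   # update in place
table.put("user2", "stats:visits", "7")             # different columns per row
table.delete("user2", "stats:visits")
```

Each row carries only the columns it actually uses, so a table with millions of possible columns stays small when most cells are empty.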

Apache Hadoop related projects

5    HCatalog
HCatalog is a metadata management service for Apache Hadoop. It opens up
the platform and allows interoperability across data processing tools
such as Pig, MapReduce and Hive. It also provides a table abstraction so
that users need not be concerned with where or how their data is stored.

Aster SQL-H interfaces with HCatalog

Apache Hadoop related projects

6    Pig
Apache Pig allows you to write complex MapReduce transformations using a
simple scripting language. Pig Latin (the language) defines a set of
transformations on a data set such as aggregate, join and sort, among
others. Pig Latin is sometimes extended using UDFs (user-defined
functions), which the user can write in Java and then call directly from
the language.
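
The three transformations named above, plus a user-defined function, can be sketched in plain Python to show the kind of data flow a Pig script expresses. This is an analogy, not Pig Latin and not Pig's API. Every function name and the sample records are hypothetical, and the UDF here is a Python function standing in for one that would be written in Java.

```python
from collections import defaultdict
from typing import Any, Dict, List

Record = Dict[str, Any]


def group_and_sum(records: List[Record], key: str, value: str) -> List[Record]:
    """Aggregate: total the `value` field for each distinct `key`."""
    totals: Dict[Any, float] = defaultdict(float)
    for record in records:
        totals[record[key]] += record[value]
    return [{key: k, "total": total} for k, total in totals.items()]


def join(left: List[Record], right: List[Record], key: str) -> List[Record]:
    """Join: inner join of two record lists on a shared key."""
    index: Dict[Any, List[Record]] = defaultdict(list)
    for record in right:
        index[record[key]].append(record)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def order_by(records: List[Record], key: str,
             descending: bool = False) -> List[Record]:
    """Sort: order records by one field."""
    return sorted(records, key=lambda record: record[key], reverse=descending)


def normalize_name(name: str) -> str:
    """Plays the role of a UDF: custom logic applied to a field."""
    return name.strip().title()


orders = [{"customer_id": 1, "amount": 20.0},
          {"customer_id": 2, "amount": 5.0},
          {"customer_id": 1, "amount": 10.0}]
customers = [{"customer_id": 1, "name": " alice "},
             {"customer_id": 2, "name": "BOB"}]

report = order_by(join(group_and_sum(orders, "customer_id", "amount"),
                       customers, "customer_id"),
                  "total", descending=True)
names = [normalize_name(row["name"]) for row in report]
```

In Pig each of these steps is a single statement, and the engine compiles the whole chain into MapReduce jobs rather than running it in one process.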

Apache Hadoop related projects

7    Oozie
Oozie coordinates jobs written in multiple languages such as MapReduce,
Pig and Hive. It is a workflow system that links these jobs and allows
specification of order and dependencies between them.
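
Ordering jobs by their dependencies is a topological sort, which the sketch below shows with Python's standard graphlib module (Python 3.9 or later). It does not use Oozie or its workflow definition format. The job names and the run_workflow helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return jobs in an order that respects every dependency.

    `dependencies` maps each job to the set of jobs that must finish first.
    Raises graphlib.CycleError if the jobs depend on each other in a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


def run_workflow(dependencies: Dict[str, Set[str]],
                 actions: Dict[str, Callable[[], str]]) -> List[str]:
    """Run each job's action once all of its prerequisites have run."""
    return [actions[job]() for job in execution_order(dependencies)]


# Hypothetical workflow mixing the job types the slide mentions.
workflow = {
    "load_logs": set(),
    "clean_with_pig": {"load_logs"},
    "aggregate_with_hive": {"clean_with_pig"},
    "export_report": {"aggregate_with_hive"},
}
actions = {job: (lambda job=job: f"ran {job}") for job in workflow}
log = run_workflow(workflow, actions)
```

A workflow system adds scheduling, retries and failure handling on top of this ordering. The sketch covers the ordering step only.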

Apache Hadoop related projects

8    Ambari
Apache Ambari operationalizes Hadoop. It provides a mechanism to monitor
and manage a cluster. It also provisions nodes.

Ambari is a monitoring, administration and lifecycle management project
for Apache Hadoop clusters

Apache Hadoop related projects

9    Sqoop
Sqoop is a set of tools for transferring bulk data between Hadoop and
non-Hadoop data stores such as traditional relational databases and data
warehouses.

Apache Hadoop related projects

10   ZooKeeper
ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing
group services.

Hadoop in Action
Sources feeding the Big Data Refinery:
•  Website Interactions (Web Logs)
•  Order Data (DB)
•  Customer Data (DB)

1   Web Log files via WebHDFS APIs
2   Customer & Order data via Talend & HCatalog for schema
3   Pre-processes, refines, and joins data via Talend, Pig, & HCatalog
4   Interfaces with HCatalog to analyze website visits by the type of end results

Hortonworks Vision & Role

                                We believe that by the end of 2015,
                                more than half the world's data will be
                                processed by Apache Hadoop.



  1       Be diligent stewards of the open source core

  2       Be tireless innovators beyond the core

  3       Provide robust data platform services & open APIs

  4       Enable the ecosystem at each layer of the stack

  5       Make the platform enterprise-ready & easy to use


Balancing Innovation & Stability
Adoption curve (vertical axis: relative % of customers; horizontal axis: time)

1   Innovators, technology enthusiasts
2   Early adopters, visionaries
    The CHASM
3   Early majority, pragmatists
4   Late majority, conservatives
5   Laggards, skeptics

Before the chasm: customers want technology & performance
After the chasm: customers want solutions & convenience

Source: Geoffrey Moore - Crossing the Chasm

Enabling Hadoop as Enterprise Big Data Platform



Hortonworks Data Platform

Left of diagram:
•  Applications
•  Business Tools
•  Development Tools
•  Open APIs and access
•  Data Movement & Integration
•  Data Management Systems
•  Systems Management

Right of diagram:
•  Installation & Configuration
•  Administration
•  Monitoring
•  High Availability
•  Replication
•  Multi-tenancy, ..

DEVELOPER
Data Platform Services & Open APIs
Metadata, Indexing, Search, Security, Management, Data Extract & Load, APIs

Hortonworks Data Platform


                             The ONLY 100% open source data
                             platform for Hadoop

                    •  Tightly aligned with core Apache code line
                    •  All code committed back to open source
                    •  Most complete Apache Hadoop platform
                    •  Comprehensive management and monitoring
                    •  Intuitive graphical data integration tools
                    •  Centralized metadata services for easy data sharing



Hortonworks Data Platform

Delivers enterprise grade functionality on a proven Apache Hadoop
distribution to ease management, simplify use and ease integration into
the enterprise

•  Simplify deployment to get started quickly and easily
•  Monitor, manage any size cluster with familiar console and tools
•  Only platform to include data integration services to interact with any data source
•  Metadata services open the platform for integration with existing applications
•  Dependable high availability architecture

The only 100% open source data platform for Apache Hadoop

Apache Distribution Stack

Built on Hadoop 1.0 (a.k.a. 0.20.205)
 •  Proven at large scale enterprise implementations
 •  Most stable and reliable version of Hadoop to date
 •  First Apache line supporting security, HBase, WebHDFS
 •  Driven by core committers and architects at Hortonworks

Includes necessary components already integrated and tested together

Most stable versions of all components are chosen

Hortonworks Distribution
   Component    Version
   Core         1.0.3
   HCatalog     0.4.0
   Pig          0.9.2
   Hive         0.9.0+
   HBase        0.92.1+
   Sqoop        0.9.0+
   Oozie        3.1.3
   Zookeeper    3.3.4
   Ambari       beta
   Talend       5.1.1

Tested, Hardened & Proven Distribution Reduces Risk

Management & Monitoring Svcs

Hortonworks Management Center
   – View the health of cluster operations,
     server utilization and performance levels
   – Customizable dashboards
   – APIs for integration into 3rd party
     monitoring tools
   – 100% open source management &
     monitoring, powered by Apache Ambari,
     Puppet, Nagios and Ganglia
   – Simple wizard-based installation,
     configuration & provisioning of any size
     Hadoop cluster

Optimize performance for your Hadoop cluster
Simplify Installation and provisioning

Data Integration Services

•  Intuitive graphical data
   integration tools for HDFS,
   Hive, HBase, HCatalog and Pig

•  Oozie scheduling allows you to
   manage and stage jobs

•  Connectors for any database,
   business application or system

•  Integrated HCatalog storage

 Bridge the gap between
 legacy data & Hadoop

 Simplify and speed development

Which is best for the cloud?



                              vs.




Metadata Services
Apache HCatalog provides flexible metadata
services across tools and external access
 •  Consistency of metadata and data models across tools
    (MapReduce, Pig, HBase and Hive)
 •  Accessibility: share data as tables in and out of HDFS
 •  Availability: enables flexible, thin-client access via REST API




Before HCatalog:
   •  Raw Hadoop data
   •  Inconsistent, unknown
   •  Tool specific access

HCatalog provides:
   •  Table access
   •  Aligned metadata
   •  REST API

Shared table and schema management opens the platform

Services Integration

Provides RESTful API as “front door” for Hadoop

•    Opens the door to languages other than Java
•    Thin clients via web services vs. fat-clients in gateway
•    Insulation from interface changes release to release

Diagram layers, top to bottom:
     Existing & New Applications
     WebHDFS, HCatalog RESTful Web Services
     MapReduce, Pig, Hive (over HCatalog)
     HDFS, HBase, External Store

Opens Hadoop to integration with existing and new applications
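
Because the "front door" is plain HTTP, any language that can build a URL can talk to it. As a hedged illustration, the helper below assembles WebHDFS request URLs using the documented /webhdfs/v1/<path>?op=<OPERATION> layout. The helper itself, the host name and the file paths are hypothetical, and port 50070 is assumed as the NameNode HTTP port of a Hadoop 1.x cluster. No request is sent.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, path: str, op: str, port: int = 50070,
                **params: str) -> str:
    """Build a WebHDFS REST URL for one file system operation.

    `path` must be an absolute HDFS path. Extra query parameters, such as
    user.name, are passed through as keyword arguments.
    """
    if not path.startswith("/"):
        raise ValueError("path must be absolute, for example /logs/web")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host and paths, for illustration only.
open_url = webhdfs_url("namenode.example.com", "/logs/web/2012-07-10.log", "OPEN")
list_url = webhdfs_url("namenode.example.com", "/logs/web", "LISTSTATUS",
                       **{"user.name": "jim"})
```

A thin client would issue an HTTP GET against these URLs with whatever HTTP library it already has, with no Hadoop JARs on the client side.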


Use cases: optimize outcomes at scale
Media                 optimize    Content
Intelligence          optimize    Detection
Investment            optimize    Algorithms
Advertising           optimize    Performance
Fraud                 optimize    Prevention
Regulation            optimize    Compliance
Retail / Wholesale    optimize    Inventory turns
Manufacturing         optimize    Supply chains
Healthcare            optimize    Patient outcomes
Education             optimize    Learning outcomes
Government            optimize    Citizen services

Source: Geoffrey Moore. Hadoop Summit 2012 keynote presentation.

Connecting Transactions + Interactions + Observations
Data sources:
•  Audio, Video, Images
•  Docs, Text, XML
•  Web Logs, Clicks
•  Social, Graph, Feeds
•  Sensors, Devices, RFID
•  Spatial, GPS
•  Events, Other

Diagram components:
•  Big Data Refinery
•  Business Transactions & Interactions (Web, Mobile, CRM, ERP, SCM, …)
•  Data Discovery & Investigative Analytics
•  Business Intelligence & Analytics (Dashboards, Reports, Visualization, …)

1   Classic ETL processing
2   Interactive data exploration
3   Store, aggregate, and transform multi-structured data to unlock value
4   Share refined data & runtime models
5   Retain runtime models and historical data for ongoing refinement & analysis
6   Retain historical data to unlock additional value

THANK YOU

Jim Walker
jim@hortonworks.com
@jaymce

1   Get Hortonworks Data Platform
    hortonworks.com/download

2   Use the getting started guide
    hortonworks.com/get-started

3   Learn more… get support
    hortonworks.com/training
    hortonworks.com/support

