Enabling Highly Available, Elastic, Multi-tenancy
Hadoop on Demand

Richard McDougall,
VMware, Inc
@richardmcdougll




                                             © 2009 VMware Inc. All rights reserved
Cloud: Big Shifts in Simplification and Optimization


1. Reduce the Complexity to simplify operations and maintenance
2. Dramatically Lower Costs to redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business

A Holistic View of a Big Data System:


§  Real-Time Streams
§  Real-Time Processing (S4, Storm)
§  ETL
§  Real-Time Structured Database (HBase, GemFire, Cassandra)
§  Big SQL (Greenplum, AsterData, etc.)
§  Batch Processing
§  Analytics
§  Unstructured Data (HDFS)

Common Infrastructure for Big Data


Cluster Sprawling
Single purpose clusters for various business applications lead to cluster sprawl.

Cluster Consolidation
§  Simplify
     •  Single Hardware Infrastructure
     •  Unified operations
§  Optimize
     •  Shared Resources = higher utilization
     •  Elastic resources = faster on-demand access

[Diagram: separate Hadoop, HBase, and MPP DB clusters (left) consolidated onto a shared virtualization platform (right)]

Enterprise Challenges with Using Hadoop

§  Deployment
     •  Slow to provision
     •  Complex to keep running and tune
§  Single Points of Failure
     •  NameNode and JobTracker are single points of failure
     •  No HA for Hadoop framework components (Hive, HCatalog, etc.)
§  Low Utilization
     •  Dedicated Hadoop clusters run at low CPU utilization
     •  No easy way to share resources between Hadoop and non-Hadoop workloads
     •  Noisy neighbors, lack of resource containment
§  Need Multi-tenant Isolation, Resource Management, etc.
     •  Noisy neighbors: no performance or security isolation between different tenants/users
     •  Lack of configuration isolation: can't run multiple versions on the same cluster

I.     Market Overview & Insights
II.    Virtualization + Hadoop
III.  Distribution & OSS Contribution




Hadoop Runs Well on Virtualization

[Chart: "Comparable performance to physical". Y-axis: ratio to native (0 to 1.2); series: 1 VM, 2 VMs]

Source: http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf

Use Local Disk where it’s Needed




                     SAN Storage          NAS Filers         Local Storage
  Cost               $2 - $10/Gigabyte    $1 - $5/Gigabyte   $0.05/Gigabyte

  $1M gets:
    Capacity         0.5 Petabytes        1 Petabyte         20 Petabytes
    IOPS             200,000              400,000            10,000,000
    Bandwidth        1 Gbyte/sec          2 Gbytes/sec       800 Gbytes/sec

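The capacity figures follow directly from the per-gigabyte prices. A minimal sketch of that arithmetic, assuming the low end of each quoted price range and decimal units (1 PB = 1,000,000 GB); the names are illustrative only:

```python
# Capacity that $1M buys at each tier's lowest quoted cost per gigabyte.
# Assumptions: low end of each price range, 1 PB = 1,000,000 GB.
BUDGET_USD = 1_000_000
GB_PER_PB = 1_000_000

COST_PER_GB = {
    "SAN Storage": 2.00,
    "NAS Filers": 1.00,
    "Local Storage": 0.05,
}


def petabytes_for_budget(budget_usd, cost_per_gb):
    """Return how many petabytes a budget buys at a given $/GB."""
    return budget_usd / cost_per_gb / GB_PER_PB


capacity_pb = {
    tier: petabytes_for_budget(BUDGET_USD, cost)
    for tier, cost in COST_PER_GB.items()
}

for tier, pb in capacity_pb.items():
    print(f"{tier}: {pb:g} PB")
```

At the high end of the SAN and NAS ranges the same budget buys proportionally less, which widens the gap to local disk further.
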
Extend Virtual Storage Architecture to Include Local Disk

 §  Shared Storage: SAN or NAS
         •  Easy to provision
         •  Automated cluster rebalancing
 §  Hybrid Storage
         •  SAN for boot images, VMs, other workloads
         •  Local disk for Hadoop & HDFS
         •  Scalable Bandwidth, Lower Cost/GB

[Diagram: six hosts, each running Hadoop VMs alongside other VMs]

Why Virtualize Hadoop?



§  Simple to Operate
     •  Rapid deployment
     •  Unified operations across enterprise
     •  Easy Clone of Cluster
§  Highly Available
     •  No more single point of failure
     •  One click to set up
     •  High availability for MR Jobs
§  Elastic Scaling
     •  Shrink and expand cluster on demand
     •  Resource Guarantee
     •  Independent scaling of Compute and data

Deploy a Hadoop Cluster in under 30 Minutes

Step 1: Deploy Serengeti virtual appliance on vSphere.
     •  Deploy vHelperOVF to vSphere

Step 2: A few simple commands to stand up Hadoop Cluster.
     •  Select compute, memory, storage and network
     •  Select configuration template
     •  Automate deployment
     •  Done

A Tour Through Serengeti


$ ssh serengeti@serengeti-vm

$ serengeti

serengeti>




A Tour Through Serengeti


serengeti> cluster create --name myElephant

serengeti> cluster list --name myElephant

name: myElephant, distro: cdh, status:RUNNING
  NAME    ROLES                                 INSTANCE   CPU MEM(MB) TYPE
  ---------------------------------------------------------------------------
  master [hadoop_NameNode, hadoop_jobtracker] 1            2   7500     LOCAL   50

name: myElephant, distro: cdh, status:RUNNING
  NAME    ROLES                                 INSTANCE   CPU MEM(MB) TYPE
  ---------------------------------------------------------------------------
  master [hive, hadoop_client, pig]             1          1   3700     LOCAL   50

     NAME                HOST                              IP
     -----------------------------------------------------------------
     myElephant-client0 rmc-elephant-009.eng.vmware.com    10.0.20.184




A Tour Through Serengeti


$ ssh rmc@rmc-elephant-009.eng.vmware.com

$ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data

…
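
TeraGen writes fixed-size 100-byte rows, so the row count on the command line determines the dataset size. A small sketch of that arithmetic; the replication factor of 3 is an assumption (the HDFS default), not something the slide states:

```python
# Estimate the dataset size produced by the teragen command above.
ROW_BYTES = 100        # TeraGen writes fixed 100-byte rows
ROWS = 1_000_000_000   # row count passed on the command line
REPLICATION = 3        # assumption: HDFS default replication factor


def teragen_bytes(rows, row_bytes=ROW_BYTES):
    """Logical size in bytes of a TeraGen dataset with the given row count."""
    return rows * row_bytes


logical_bytes = teragen_bytes(ROWS)
raw_bytes = logical_bytes * REPLICATION

print(f"logical: {logical_bytes / 1e9:.0f} GB, raw on disk: {raw_bytes / 1e9:.0f} GB")
```
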




Serengeti Spec File
{
    "distro": "apache",                  Choice of Distro
    "nodeGroups": [
          {
             "name": "master",
             "roles": [
                "hadoop_NameNode",
                "hadoop_jobtracker"
             ],
             "instanceNum": 1,
             "instanceType": "MEDIUM",
             "ha": true                  HA Option
          },
          {
             "name": "worker",
             "roles": [
                "hadoop_datanode", "hadoop_tasktracker"
             ],
             "instanceNum": 5,
             "instanceType": "SMALL",
             "storage": {                Choice of Shared Storage or Local Disk
                "type": "LOCAL",
                "sizeGB": 10
             }
          }
    ]
}

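Because the spec is plain JSON, it can be checked before it is handed to Serengeti. The sketch below is illustrative only: it embeds a cleaned-up copy of the spec (callouts removed), assumes the node groups sit under a `nodeGroups` key, and uses a hypothetical `summarize` helper that is not part of Serengeti.

```python
import json

# Cleaned-up copy of the slide's spec; the "nodeGroups" wrapper is an assumption.
SPEC_TEXT = """
{
  "distro": "apache",
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "instanceType": "MEDIUM",
      "ha": true
    },
    {
      "name": "worker",
      "roles": ["hadoop_datanode", "hadoop_tasktracker"],
      "instanceNum": 5,
      "instanceType": "SMALL",
      "storage": {"type": "LOCAL", "sizeGB": 10}
    }
  ]
}
"""


def summarize(spec):
    """Hypothetical helper: total up what the spec asks for."""
    groups = spec["nodeGroups"]
    return {
        "distro": spec["distro"],
        "total_instances": sum(g["instanceNum"] for g in groups),
        "ha_groups": [g["name"] for g in groups if g.get("ha", False)],
        # Local disk requested across all instances of LOCAL-storage groups.
        "local_storage_gb": sum(
            g["instanceNum"] * g["storage"]["sizeGB"]
            for g in groups
            if g.get("storage", {}).get("type") == "LOCAL"
        ),
    }


summary = summarize(json.loads(SPEC_TEXT))
print(summary)
```

Parsing with `json.loads` also catches the kind of damage slides tend to introduce, such as curly quotes or trailing commas.
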
Configuring Distros


{
    "name" : "cdh",
    "version" : "3u3",
    "packages" : [
        {
            "roles" : ["hadoop_NameNode", "hadoop_jobtracker",
                       "hadoop_tasktracker", "hadoop_datanode",
                       "hadoop_client"],
            "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
        },
        {
            "roles" : ["hive"],
            "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
        },
        {
            "roles" : ["pig"],
            "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
        }
    ]
}

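The distro entry maps roles to the tarball that provides them. A minimal sketch of that lookup, using the entry shown above; `tarball_for_role` is a hypothetical helper for illustration, not a Serengeti function:

```python
import json

# The distro entry from the slide, as valid JSON.
MANIFEST_ENTRY = json.loads("""
{
  "name": "cdh",
  "version": "3u3",
  "packages": [
    {
      "roles": ["hadoop_NameNode", "hadoop_jobtracker",
                "hadoop_tasktracker", "hadoop_datanode",
                "hadoop_client"],
      "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
    {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"}
  ]
}
""")


def tarball_for_role(entry, role):
    """Return the tarball path that provides a role, or None if no package lists it."""
    for package in entry["packages"]:
        if role in package["roles"]:
            return package["tarball"]
    return None


print(tarball_for_role(MANIFEST_ENTRY, "hive"))
```
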



Serengeti Demo


     •  Deploy Serengeti vApp on vSphere
     •  Deploy a Hadoop cluster in 10 Minutes
     •  Run MapReduce
     •  Scale out the Hadoop cluster
     •  Create a Customized Hadoop cluster
     •  Use Your Favorite Hadoop Distribution

High Availability for the Hadoop Stack



[Diagram: components of the Hadoop stack]

§  ETL Tools, BI Reporting, RDBMS
§  Pig (Data Flow), Hive (SQL), HCatalog
§  ZooKeeper (Coordination)
§  Hive MetaDB, HCatalog MDB
§  MapReduce (Job Scheduling/Execution System), JobTracker
§  HBase (Key-Value store)
§  HDFS (Hadoop Distributed File System), NameNode
§  Management Server

Live Machine Migration Reduces Planned Downtime


Description:
Enables the live migration of virtual
machines from one host to another
with continuous service availability.

Benefits:
•     Revolutionary technology that is the
      basis for automated virtual machine
      movement
•     Meets service level and performance
      goals




vSphere High Availability (HA) - protection against unplanned downtime




     Overview
      •  Protection against host and VM failures
      •  Automatic failure detection (host, guest OS)
      •  Automatic virtual machine restart in minutes, on any available host in cluster
      •  OS and application-independent, does not require complex configuration
       changes

vSphere Fault Tolerance provides continuous protection

     Overview
      •  Single identical VMs running in lockstep on separate hosts
      •  Zero downtime, zero data loss failover for all virtual machines in case of hardware failures
      •  Integrated with VMware HA/DRS
      •  No complex clustering or specialized hardware required
      •  Single common mechanism for all applications and operating systems

[Diagram: VMs on two VMware ESX hosts, protected by HA and FT]

Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters

One click to HA

§  Easy to set up: one click is all you need

Example HA Failover for Hadoop



[Diagram: a Serengeti Server and a Namenode VM; vSphere HA brings the Namenode back up on another host. Four worker nodes each run TaskTracker, HDFS Datanode, Hive, and hBase.]

vSphere HA and Optionally FT

§  vSphere HA
 •  Is application-aware: automatically restarts the NameNode (NN) if its heartbeat goes away
 •  Is easy to configure
 •  Has no performance overhead
§  vSphere FT
 •  Adds the bonus of no pause time when hardware fails
 •  Is limited to one vCPU
 •  Measured overhead on the NN is about 2%; extrapolating from current measurements,
     this is good for a cluster of ~300 hosts
§  HDFS 2 HA
 •  Only covers the NameNode: what about the other 5+ master services?
 •  Is not available in Apache Hadoop 0.20
 •  Is not as battle-tested as vSphere HA
 •  Is more complex to install and manage

Elastic Scaling and Multi-tenancy of Hadoop on vSphere



1. Hadoop in VM (current Hadoop: combined storage/compute in one VM)
     -  Single Tenant
     -  Fixed Resources
2. Separate Compute and Data (compute VM and storage VM)
     -  Single Tenant
     -  Elastic Compute
3. Multiple Clusters (compute VMs for tenants T1 and T2 over a storage VM)
     -  Multiple Tenants
     -  Elastic Compute

“Time Share”

[Diagram: three hosts running VMware vSphere with vHelper; each host runs Hadoop VMs alongside other VMs, with HDFS on every host]

While existing apps run during the day to support business
operations, Hadoop batch jobs kick off at night to conduct
deep analysis of data.

Virtualization delivers VM level Multi-tenancy

§  Performance isolation
     •  No more noisy neighbors: resource containers achieve guaranteed SLAs for different tenants/users/jobs
§  Configuration isolation
     •  Support multiple Hadoop environments on the same physical clusters
          •  Multiple Linux versions
          •  Multiple Hadoop versions
§  Security isolation
     •  Higher level of security
     •  Compute VM can only access data VM through Access Control List

[Diagram: tenants "Coke" and "Pepsi". Runtime layer: virtual Hadoop clusters and a virtual Hadoop queue. Data layer: data containers on HDFS, spread across six hosts.]

Open Source of Serengeti, Spring Hadoop, Hadoop Extensions


         Commercial Vendors             Community Projects




•  Support major distributions and multiple projects
•  Contribute Hadoop Virtualization Extensions (HVE) to the open source
   community

Hadoop Virtualization Extensions: Topology Awareness




Virtual Topologies




Proposed Topology Changes

                            HADOOP-8468 (Umbrella JIRA)
                            HADOOP-8469
                            HDFS-3495
                            MAPREDUCE-4310
                            HDFS-3498
                            MAPREDUCE-4309
                            HADOOP-8470
                            HADOOP-8472




Spring for Apache Hadoop

§  Announced initial formation of the Spring Data OSS project in 2010
 •  Enables Spring-powered applications to use new data access technologies
 •  Data project technologies around MongoDB, Neo4j, Riak, Redis, JDBC Extensions, JPA,
     REST, and Blob

§  Announcing additional contributions on GitHub:
 •  Integration with Cascading library
 •  HBase support
 •  Hadoop security support
 •  More examples
 •  Administrative application, RESTful API to upload Hadoop jobs to schedule for
     batch execution, query status, etc.
 •  WebHDFS support

Big Data on Virtualized Infrastructure
More Related Content

PPTX
Best Practices for Virtualizing Hadoop
PDF
Apache Hadoop on Virtual Machines
ODP
Farming hadoop in_the_cloud
PPTX
Hello OpenStack, Meet Hadoop
PDF
Hadoop Operations for Production Systems (Strata NYC)
PDF
Introduction to GlusterFS Webinar - September 2011
PDF
Improving Hadoop Cluster Performance via Linux Configuration
PPTX
Oracle big data appliance and solutions
Best Practices for Virtualizing Hadoop
Apache Hadoop on Virtual Machines
Farming hadoop in_the_cloud
Hello OpenStack, Meet Hadoop
Hadoop Operations for Production Systems (Strata NYC)
Introduction to GlusterFS Webinar - September 2011
Improving Hadoop Cluster Performance via Linux Configuration
Oracle big data appliance and solutions

What's hot (20)

PPTX
Hadoop and WANdisco: The Future of Big Data
PDF
Hadoop Operations at LinkedIn
PPTX
Hadoop Operations
PPTX
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
PPT
Hadoop Performance at LinkedIn
PPTX
Geo-based content processing using hbase
PDF
Gluster Webinar: Introduction to GlusterFS
PDF
Future of cloud storage
PDF
Power BI with Essbase in the Oracle Cloud
PDF
Realtime Analytics with Hadoop and HBase
PDF
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
PDF
How to Increase Performance of Your Hadoop Cluster
PPTX
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
PDF
Hadoop and OpenStack
PDF
Intro to GlusterFS Webinar - August 2011
PPTX
Gluster Blog 11.15.2010
PPTX
Moving from C#/.NET to Hadoop/MongoDB
PPTX
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, Informatica
PPTX
IaaS for DBAs in Azure
PPTX
Optimizing your Infrastrucure and Operating System for Hadoop
Hadoop and WANdisco: The Future of Big Data
Hadoop Operations at LinkedIn
Hadoop Operations
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Hadoop Performance at LinkedIn
Geo-based content processing using hbase
Gluster Webinar: Introduction to GlusterFS
Future of cloud storage
Power BI with Essbase in the Oracle Cloud
Realtime Analytics with Hadoop and HBase
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
How to Increase Performance of Your Hadoop Cluster
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Hadoop and OpenStack
Intro to GlusterFS Webinar - August 2011
Gluster Blog 11.15.2010
Moving from C#/.NET to Hadoop/MongoDB
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, Informatica
IaaS for DBAs in Azure
Optimizing your Infrastrucure and Operating System for Hadoop
Ad

Viewers also liked (20)

PDF
Best Practices for Virtualizing Apache Hadoop
PPTX
1. beyond mission critical virtualizing big data and hadoop
PDF
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
PPTX
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
PDF
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
PDF
Data Virtualization Primer - Introduction
DOCX
Cloud Computing And Virtualization
PPTX
Crash Course in Cloud Computing
PDF
Soyez Big Data ready avec Isilon
 
PPTX
7. emc isilon hdfs enterprise storage for hadoop
PDF
EMC Hadoop Starter Kit
 
PPTX
Emerging Big Data & Analytics Trends with Hadoop
PPTX
Cloud Computing & Big Data
PPTX
Big Data and Cloud Computing
PDF
Cloud Computing and Big Data
PDF
Hadoop on VMware
PPTX
EMC config Hadoop
PPTX
Hadoop on Virtual Machines
PPTX
Gartner IT Symposium 2014 - VMware Cloud Services
PPTX
Introduction to Cloud computing and Big Data-Hadoop
Best Practices for Virtualizing Apache Hadoop
1. beyond mission critical virtualizing big data and hadoop
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
Data Virtualization Primer - Introduction
Cloud Computing And Virtualization
Crash Course in Cloud Computing
Soyez Big Data ready avec Isilon
 
7. emc isilon hdfs enterprise storage for hadoop
EMC Hadoop Starter Kit
 
Emerging Big Data & Analytics Trends with Hadoop
Cloud Computing & Big Data
Big Data and Cloud Computing
Cloud Computing and Big Data
Hadoop on VMware
EMC config Hadoop
Hadoop on Virtual Machines
Gartner IT Symposium 2014 - VMware Cloud Services
Introduction to Cloud computing and Big Data-Hadoop
Ad

Similar to Big data on virtualized infrastucture (20)

PPTX
Hadoop World 2011: Hadoop as a Service in Cloud
PDF
Big Data/Hadoop Infrastructure Considerations
PDF
App cap2956v2-121001194956-phpapp01 (1)
PDF
App Cap2956v2 121001194956 Phpapp01 (1)
PDF
Inside the Hadoop Machine @ VMworld
PDF
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
PDF
Architecting the Future of Big Data & Search - Eric Baldeschwieler
PDF
Hadoop - Now, Next and Beyond
PPTX
Improvements in Failover Clustering in Windows Server 2012
PPT
10 Minute Overview of Apache CloudStack
PPT
CloudStack Intro NYC
PDF
Scalable Object Storage with Apache CloudStack and Apache Hadoop
PPTX
16 August 2012 - SWUG - Hyper-V in Windows 2012
PDF
Cosbench apac
PDF
Savanna: Hadoop on OpenStack
PDF
Vsphere4 100325065654-phpapp01
PPTX
VMUG ISRAEL November 2012, EMC session by Itzik Reich
PDF
Cosbench apac
PDF
An Introduction to Azure IaaS
Hadoop World 2011: Hadoop as a Service in Cloud
Big Data/Hadoop Infrastructure Considerations
App cap2956v2-121001194956-phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
Inside the Hadoop Machine @ VMworld
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Hadoop - Now, Next and Beyond
Improvements in Failover Clustering in Windows Server 2012
10 Minute Overview of Apache CloudStack
CloudStack Intro NYC
Scalable Object Storage with Apache CloudStack and Apache Hadoop
16 August 2012 - SWUG - Hyper-V in Windows 2012
Cosbench apac
Savanna: Hadoop on OpenStack
Vsphere4 100325065654-phpapp01
VMUG ISRAEL November 2012, EMC session by Itzik Reich
Cosbench apac
An Introduction to Azure IaaS


Big data on virtualized infrastructure

  • 1. Enabling Highly Available, Elastic, Multi-tenancy Hadoop on Demand Richard McDougall, VMware, Inc @richardmcdougll © 2009 VMware Inc. All rights reserved
  • 2. Cloud: Big Shifts in Simplification and Optimization
      1. Reduce the Complexity: to simplify operations and maintenance
      2. Dramatically Lower Costs: to redirect investment into value-add opportunities
      3. Enable Flexible, Agile IT Service Delivery: to meet and anticipate the needs of the business
  • 3. A Holistic View of a Big Data System:
      [Diagram components]
        •  Real Time Streams
        •  Real-Time Processing (s4, storm)
        •  ETL
        •  Real Time Structured Database (hBase, Gemfire, Cassandra)
        •  Big SQL (Greenplum, AsterData, Etc…)
        •  Batch Processing
        •  Analytics
        •  Unstructured Data (HDFS)
  • 4. Common Infrastructure for Big Data
      §  Cluster Sprawling
        •  Single purpose clusters for various business applications lead to cluster sprawl.
      §  Cluster Consolidation
        •  Simplify: Single Hardware Infrastructure; Unified operations
        •  Optimize: Shared Resources = higher utilization; Elastic resources = faster on-demand access
      [Diagram labels: Hadoop, HBase, MPP DB, Virtualization Platform]
  • 5. Enterprise Challenges with Using Hadoop
      §  Deployment
        •  Slow to provision
        •  Complex to keep running/tune
      §  Single Points of Failure
        •  Single point of failure with Name Node and Job tracker
        •  No HA for Hadoop Framework Components (Hive, HCatalog, etc.)
      §  Low Utilization
        •  Dedicated clusters to run Hadoop with low CPU utilization
        •  No easy way to share resources between Hadoop and non-Hadoop workloads
        •  Noisy neighbor, lack of resource containment
      §  Need Multi-tenant Isolation, Resource Management, etc.
        •  Noisy neighbor: no performance or security isolation between different tenants/users
        •  Lack of configuration isolation: can't run multiple versions on the cluster
  • 6. I. Market Overview & Insights; II. Virtualization + Hadoop; III. Distribution & OSS Contribution
  • 7. Hadoop Runs Well on Virtualization
      Comparable performance to physical
      [Chart: y-axis "Ratio to Native", 0 to 1.2; series: 1 VM, 2 VMs]
      Source: http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf
  • 8. Use Local Disk where it’s Needed
                      SAN Storage          NAS Filers          Local Storage
      Cost            $2 - $10/Gigabyte    $1 - $5/Gigabyte    $0.05/Gigabyte
      $1M gets:
        Capacity      0.5 Petabytes        1 Petabyte          20 Petabytes
        IOPS          200,000              400,000             10,000,000
        Bandwidth     1 Gbyte/sec          2 Gbyte/sec         800 Gbytes/sec
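The capacity figures on this slide can be checked with simple arithmetic. The sketch below assumes decimal units (1 Petabyte = 1,000,000 Gigabytes) and assumes the SAN and NAS figures use the low end of each quoted price range; neither assumption is stated on the slide. The function name is hypothetical.

```python
# Capacity that a fixed budget buys at a given cost per gigabyte.
# Prices are the figures quoted on the slide; decimal units are an assumption.
GB_PER_PB = 1_000_000

def petabytes_for_budget(budget_usd: float, usd_per_gb: float) -> float:
    """Return the capacity in petabytes that budget_usd buys at usd_per_gb."""
    return budget_usd / usd_per_gb / GB_PER_PB

BUDGET = 1_000_000
capacity = {
    "SAN": petabytes_for_budget(BUDGET, 2.00),    # low end of $2 - $10/GB (assumed)
    "NAS": petabytes_for_budget(BUDGET, 1.00),    # low end of $1 - $5/GB (assumed)
    "Local": petabytes_for_budget(BUDGET, 0.05),  # single quoted price
}

for tier, pb in capacity.items():
    print(f"{tier}: {pb:g} PB")
```

Under those assumptions the results line up with the slide's 0.5, 1, and 20 Petabytes, which suggests the SAN and NAS columns represent the best case for shared storage.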
  • 9. Extend Virtual Storage Architecture to Include Local Disk
      §  Shared Storage: SAN or NAS
        •  Easy to provision
        •  Automated cluster rebalancing
      §  Hybrid Storage
        •  SAN for boot images, VMs, other workloads
        •  Local disk for Hadoop & HDFS
        •  Scalable Bandwidth, Lower Cost/GB
      [Diagram: hosts running "Other VM" and "Hadoop" VMs]
  • 10. Why Virtualize Hadoop?
      Simple to Operate
        §  Rapid deployment
        §  Unified operations across enterprise
        §  Easy Clone of Cluster
      Highly Available
        §  No more single point of failure
        §  One click to setup
        §  High availability for MR Jobs
      Elastic Scaling
        §  Shrink and expand cluster on demand
        §  Resource Guarantee
        §  Independent scaling of Compute and data
  • 11. Deploy a Hadoop Cluster in under 30 Minutes
      Step 1: Deploy Serengeti virtual appliance on vSphere.
        [Diagram label: Deploy vHelperOVF to vSphere]
      Step 2: A few simple commands to stand up Hadoop Cluster.
        [Diagram flow: Select Compute, memory, storage and network; Select configuration template; Automate deployment; Done]
  • 12. A Tour Through Serengeti
      $ ssh serengeti@serengeti-vm
      $ serengeti
      serengeti>
  • 13. A Tour Through Serengeti
      serengeti> cluster create --name myElephant
      serengeti> cluster list --name myElephant
      name: myElephant, distro: cdh, status:RUNNING
      NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
      ---------------------------------------------------------------------------
      master  [hadoop_NameNode, hadoop_jobtracker]  1         2    7500     LOCAL 50
      name: myElephant, distro: cdh, status:RUNNING
      NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
      ---------------------------------------------------------------------------
      master  [hive, hadoop_client, pig]            1         1    3700     LOCAL 50
      NAME                HOST                             IP
      -----------------------------------------------------------------
      myElephant-client0  rmc-elephant-009.eng.vmware.com  10.0.20.184
  • 14. A Tour Through Serengeti
      $ ssh rmc@rmc-elephant-009.eng.vmware.com
      $ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
      …
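For a sense of scale, the first teragen argument is a row count rather than a byte count. TeraGen writes fixed-size 100-byte rows, so the data volume requested above can be estimated as follows (the helper name is hypothetical, and HDFS replication is not included in the estimate):

```python
# TeraGen produces fixed-size 100-byte rows; its first argument is the row count.
ROW_BYTES = 100

def teragen_bytes(rows: int) -> int:
    """Return the number of bytes TeraGen writes for the given row count (before HDFS replication)."""
    return rows * ROW_BYTES

size = teragen_bytes(1_000_000_000)
print(f"{size / 10**9:.0f} GB")  # decimal gigabytes
```

With the row count shown on the slide this comes to 100 GB of generated data before replication.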
  • 15. Serengeti Spec File
      [
        "distro":"apache",                  <- Choice of Distro
        {
          "name": "master",
          "roles": [
            "hadoop_NameNode",
            "hadoop_jobtracker"
          ],
          "instanceNum": 1,
          "instanceType": "MEDIUM",
          "ha":true,                        <- HA Option
        },
        {
          "name": "worker",
          "roles": [
            "hadoop_datanode",
            "hadoop_tasktracker"
          ],
          "instanceNum": 5,
          "instanceType": "SMALL",
          "storage": {                      <- Choice of Shared Storage or Local Disk
            "type": "LOCAL",
            "sizeGB": 10
          }
        },
      ]
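The fragment on this slide is annotated for presentation and is not well-formed JSON as shown. As a minimal sketch, the same information can be written as a valid JSON document and sanity-checked. The top-level wrapper keys `distro` and `groups` and the helper `total_instances` are hypothetical packaging chosen for illustration; they are not taken from Serengeti's documentation. The per-group field names are the ones on the slide.

```python
import json

# Slide content restated as a well-formed structure.
# "groups" is a hypothetical wrapper key, not a documented Serengeti field.
spec = {
    "distro": "apache",
    "groups": [
        {
            "name": "master",
            "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
            "instanceNum": 1,
            "instanceType": "MEDIUM",
            "ha": True,
        },
        {
            "name": "worker",
            "roles": ["hadoop_datanode", "hadoop_tasktracker"],
            "instanceNum": 5,
            "instanceType": "SMALL",
            "storage": {"type": "LOCAL", "sizeGB": 10},
        },
    ],
}

def total_instances(spec: dict) -> int:
    """Sum the instance counts across all node groups."""
    return sum(group["instanceNum"] for group in spec["groups"])

text = json.dumps(spec, indent=2)
assert json.loads(text) == spec  # round-trips as valid JSON
print(total_instances(spec), "VMs requested")
```

Round-tripping through `json.dumps` and `json.loads` is a cheap way to catch the trailing commas and typographic quotes that slide-formatted fragments tend to pick up.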
  • 16. Configuring Distros
      {
        "name" : "cdh",
        "version" : "3u3",
        "packages" : [
          {
            "roles" : ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
            "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
          },
          {
            "roles" : ["hive"],
            "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
          },
          {
            "roles" : ["pig"],
            "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
          }
        ]
      },
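The distro entry above maps groups of roles to the tarball that provides them. A short sketch of how such an entry could be read into a role-to-tarball lookup follows; the function name `tarball_by_role` is hypothetical, and the manifest text is the slide's entry with the trailing comma removed so that it parses on its own.

```python
import json

# The distro entry from the slide, minus the trailing comma after the closing brace.
MANIFEST = """
{
  "name": "cdh",
  "version": "3u3",
  "packages": [
    {"roles": ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker",
               "hadoop_datanode", "hadoop_client"],
     "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"},
    {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
    {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"}
  ]
}
"""

def tarball_by_role(manifest: dict) -> dict:
    """Flatten the packages list into a {role: tarball path} mapping."""
    return {
        role: package["tarball"]
        for package in manifest["packages"]
        for role in package["roles"]
    }

lookup = tarball_by_role(json.loads(MANIFEST))
print(lookup["hive"])
```

All five core Hadoop roles resolve to the same tarball, while Hive and Pig each have their own, which is why the entry groups roles per package instead of listing one tarball per role.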
  • 17. Serengeti Demo
      •  Deploy Serengeti vApp on vSphere
      •  Deploy a Hadoop cluster in 10 Minutes
      •  Run MapReduce
      •  Scale out the Hadoop cluster
      •  Create a Customized Hadoop cluster
      •  Use Your Favorite Hadoop Distribution
  • 18. Why Virtualize Hadoop?
      Simple to Operate
        §  Rapid deployment
        §  Unified operations across enterprise
        §  Easy Clone of Cluster
      Highly Available
        §  No more single point of failure
        §  One click to setup
        §  High availability for MR Jobs
      Elastic Scaling
        §  Shrink and expand cluster on demand
        §  Resource Guarantee
        §  Independent scaling of Compute and data
  • 19. High Availability for the Hadoop Stack
      [Diagram components]
        •  ETL Tools, BI Reporting, RDBMS
        •  Pig (Data Flow), Hive (SQL), HCatalog
        •  ZooKeeper (Coordination)
        •  Hive MetaDB, HCatalog MDB, Management Server
        •  MapReduce (Job Scheduling/Execution System), Jobtracker
        •  HBase (Key-Value store)
        •  HDFS (Hadoop Distributed File System), Namenode, Server
  • 20. Live Machine Migration Reduces Planned Downtime
      Description: Enables the live migration of virtual machines from one host to another with continuous service availability.
      Benefits:
        •  Revolutionary technology that is the basis for automated virtual machine movement
        •  Meets service level and performance goals
  • 21. vSphere High Availability (HA) - protection against unplanned downtime
      Overview
        •  Protection against host and VM failures
        •  Automatic failure detection (host, guest OS)
        •  Automatic virtual machine restart in minutes, on any available host in cluster
        •  OS and application-independent, does not require complex configuration changes
  • 22. vSphere Fault Tolerance provides continuous protection
      Overview
        •  Single identical VMs running in lockstep on separate hosts
        •  Zero downtime, zero data loss failover for all virtual machines in case of hardware failures
        •  Integrated with VMware HA/DRS
        •  No complex clustering or specialized hardware required
        •  Single common mechanism for all applications and operating systems
      Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters
      [Diagram: App/OS virtual machines protected by HA and FT on two VMware ESX hosts]
  • 23. One click to HA
      §  Easy to set up: one click is all you need
  • 24. Example HA Failover for Hadoop
      [Diagram labels: Serengeti; Namenode; vSphere HA; Namenode; Server; four worker columns, each labeled TaskTracker, HDFS Datanode, Hive, hBase]
  • 25. vSphere HA and Optionally FT
      §  vSphere HA
        •  Is application-aware: will auto-restart NN if heartbeat goes away
        •  Is easy to configure
        •  Has no performance overhead
      §  vSphere FT
        •  Has the added bonus of no pause-time when there is hardware failure
        •  Has a one-vCPU maximum
        •  Perf. measurements: has a 2% perf overhead to NN. Current extrapolated measurement shows this is good for a cluster of ~300 hosts.
      §  HDFS 2 HA
        •  Only covers Namenode: what about the other 5+ master services?
        •  Not available in Apache Hadoop 0.20
        •  Not as battle-tested as vSphere HA
        •  Is more complex to install and manage
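One practical consequence of the one-vCPU limit quoted above is that a master VM sized with two CPUs, like the master group in the cluster listing on slide 13, could not use FT and would rely on HA restart instead. The rule can be sketched as below; the function and constant names are hypothetical illustrations of the slide's stated limit, not a vSphere API.

```python
# Limit quoted on the slide for vSphere FT at the time of the talk.
FT_MAX_VCPUS = 1

def protection_for(vcpus: int) -> str:
    """Hypothetical helper: pick FT when the VM fits the vCPU limit, else fall back to HA."""
    if vcpus < 1:
        raise ValueError("a VM needs at least one vCPU")
    return "FT" if vcpus <= FT_MAX_VCPUS else "HA"

for name, vcpus in [("single-vCPU master", 1), ("two-vCPU master", 2)]:
    print(f"{name}: {protection_for(vcpus)}")
```

The trade-off follows the slide: FT avoids any pause on hardware failure at a small overhead, while HA restarts the VM and places no limit on its size.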
  • 26. High Availability for the Hadoop Stack
      [Diagram components]
        •  ETL Tools, BI Reporting, RDBMS
        •  Pig (Data Flow), Hive (SQL), HCatalog
        •  ZooKeeper (Coordination)
        •  Hive MetaDB, HCatalog MDB, Management Server
        •  MapReduce (Job Scheduling/Execution System), Jobtracker
        •  HBase (Key-Value store)
        •  HDFS (Hadoop Distributed File System), Namenode, Server
  • 27. Why Virtualize Hadoop?
      Simple to Operate
        §  Rapid deployment
        §  Unified operations across enterprise
        §  Easy Clone of Cluster
      Highly Available
        §  No more single point of failure
        §  One click to setup
        §  High availability for MR Jobs
      Elastic Scaling
        §  Shrink and expand cluster on demand
        §  Resource Guarantee
        §  Independent scaling of Compute and data
  • 28. Elastic Scaling and Multi-tenancy of Hadoop on vSphere
      Current Hadoop: Combined Storage/Compute
      1. Hadoop in VM
        - Single Tenant
        - Fixed Resources
      2. Separate Compute and Data
        - Single Tenant
        - Elastic Compute
      3. Multiple Clusters
        - Multiple Tenants
        - Elastic Compute
      [Diagram labels: VM, Compute, Storage, T1, T2]
  • 29. “Time Share”
      While existing apps run during the day to support business operations, Hadoop batch jobs kick off at night to conduct deep analysis of data.
      [Diagram: "Other VM" and "Hadoop" VMs plus vHelper on VMware vSphere, across three hosts, each with HDFS]
  • 30. Virtualization delivers VM level Multi-tenancy
      §  Performance isolation
        •  No more noisy neighbors: Resource container to achieve guaranteed SLA for different tenants/users/jobs
      §  Configuration isolation
        •  Support multiple Hadoop environments on the same physical clusters
        •  Multiple Linux versions
        •  Multiple Hadoop versions
      §  Security isolation
        •  Higher level of security
        •  Compute VM can only access data VM through Access Control List
      [Diagram: tenants Coke and Pepsi; Runtime Layer of virtual Hadoop clusters and a Hadoop queue; Data Layer of data containers on HDFS; six hosts]
  • 31. I. Market Overview & Insights; II. Virtualization + Hadoop; III. Distribution & OSS Contribution
  • 32. Open Source of Serengeti, Spring Hadoop, Hadoop Extensions
      [Logo groups: Commercial Vendors; Community Projects]
        •  Support major distributions and multiple projects
        •  Contribute Hadoop Virtualization Extension (HVE) to Open Source Community
  • 33. Hadoop Virtualization Extensions: Topology Awareness
  • 35. Proposed Topology Changes
      HADOOP-8468 (Umbrella JIRA), HADOOP-8469, HDFS-3495, MAPREDUCE-4310, HDFS-3498, MAPREDUCE-4309, HADOOP-8470, HADOOP-8472
  • 36. Spring for Apache Hadoop
      §  Announced initial formation of Spring Data OSS project in 2010
        •  Enables Spring-powered applications to use new data access technologies
        •  Data project technologies around MongoDB, Neo4J, Riak, Redis, JDBC Extensions, JPA, REST, and Blob
      §  Announcing additional contributions on GitHub:
        •  Integration with Cascading library
        •  HBase support
        •  Hadoop security support
        •  More examples
        •  Administrative application, RESTful API to upload Hadoop jobs to schedule for batch execution, query status, etc.
        •  Web HDFS support
  • 37. Big Data on Virtualized Infrastructure Enabling Highly Available, Elastic, Multi-tenancy Hadoop on Demand Richard McDougall, VMware, Inc @richardmcdougll © 2009 VMware Inc. All rights reserved