Big data on virtualized infrastructure

Enabling Highly Available, Elastic, Multi-tenant Hadoop on Demand

Richard McDougall,
VMware, Inc
@richardmcdougll

© 2009 VMware Inc. All rights reserved

Cloud: Big Shifts in Simplification and Optimization

1. Reduce the Complexity to simplify operations and maintenance
2. Dramatically Lower Costs to redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business

A Holistic View of a Big Data System:

[Diagram: components of a big data system]
•  Real Time Streams
•  Real-Time Processing (s4, storm)
•  Analytics
•  ETL
•  Real Time Structured Database (hBase, Gemfire, Cassandra)
•  Big SQL (Greenplum, AsterData, Etc…)
•  Batch Processing
•  Unstructured Data (HDFS)

Common Infrastructure for Big Data

Cluster Sprawling
Single purpose clusters for various business applications lead to cluster sprawl.

Cluster Consolidation
[Diagram: MPP DB, HBase, and Hadoop clusters consolidated onto a Virtualization Platform]

§  Simplify
•  Single Hardware Infrastructure
•  Unified operations
§  Optimize
•  Shared Resources = higher utilization
•  Elastic resources = faster on-demand access

Enterprise Challenges with Using Hadoop

§  Deployment
•  Slow to provision
•  Complex to keep running/tune
§  Single Points of Failure
•  Single point of failure with Name Node and Job tracker
•  No HA for Hadoop Framework Components (Hive, HCatalog, etc.)
§  Low Utilization
•  Dedicated clusters to run Hadoop with low CPU utilization
•  No easy way to share resources between Hadoop and non-Hadoop workloads
•  Noisy neighbor, lack of resource containment
§  Need Multi-tenant Isolation, Resource Management, etc.
•  Noisy Neighbor - no performance or security isolation between different tenants/users
•  Lack of configuration isolation - can’t run multiple versions on the cluster

I.  Market Overview & Insights
II.  Virtualization + Hadoop
III.  Distribution & OSS Contribution


Hadoop Runs Well on Virtualization

Comparable performance to physical
[Chart: ratio to native performance (axis from 0 to 1.2) for 1 VM and 2 VMs]

Source: http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf

Use Local Disk where it’s Needed

                 SAN Storage         NAS Filers          Local Storage
Cost             $2 - $10/Gigabyte   $1 - $5/Gigabyte    $0.05/Gigabyte
$1M gets:
  Capacity       0.5 Petabytes       1 Petabyte          20 Petabytes
  IOPS           200,000             400,000             10,000,000
  Bandwidth      1 Gbyte/sec         2 Gbytes/sec        800 Gbytes/sec
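
The capacity figures above follow from dividing the budget by the price per gigabyte. A minimal sketch of that arithmetic, assuming the low end of each quoted price range and decimal units (1 Petabyte = 1,000,000 Gigabytes); the function and variable names are illustrative only:

```python
# Capacity bought by a fixed budget at a given price per gigabyte.
# Prices are the low end of each range quoted on the slide (an assumption).
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 1,000,000 GB


def petabytes_for_budget(budget_usd: float, usd_per_gb: float) -> float:
    """Return how many petabytes the budget buys at the given $/GB."""
    return budget_usd / usd_per_gb / GB_PER_PB


BUDGET_USD = 1_000_000  # $1M
PRICE_USD_PER_GB = {
    "SAN Storage": 2.00,
    "NAS Filers": 1.00,
    "Local Storage": 0.05,
}

capacity_pb = {
    tier: petabytes_for_budget(BUDGET_USD, price)
    for tier, price in PRICE_USD_PER_GB.items()
}

for tier, pb in capacity_pb.items():
    print(f"{tier}: {pb:g} PB")
```

At the high end of the SAN and NAS ranges the same budget buys proportionally less, so the gap to local disk widens further.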


Extend Virtual Storage Architecture to Include Local Disk

§  Shared Storage: SAN or NAS
•  Easy to provision
•  Automated cluster rebalancing
§  Hybrid Storage
•  SAN for boot images, VMs, other workloads
•  Local disk for Hadoop & HDFS
•  Scalable Bandwidth, Lower Cost/GB

[Diagram: Hadoop VMs running alongside other VMs across the hosts]

Why Virtualize Hadoop?

Simple to Operate
§  Rapid deployment
§  Unified operations across enterprise
§  Easy Clone of Cluster

Highly Available
§  No more single point of failure
§  One click to setup
§  High availability for MR Jobs

Elastic Scaling
§  Shrink and expand cluster on demand
§  Resource Guarantee
§  Independent scaling of Compute and data

Deploy a Hadoop Cluster in under 30 Minutes

Step 1: Deploy Serengeti virtual appliance on vSphere.
•  Deploy vHelper OVF to vSphere

Step 2: A few simple commands to stand up Hadoop Cluster.
•  Select compute, memory, storage and network
•  Select configuration template
•  Automate deployment
•  Done

A Tour Through Serengeti

$ ssh serengeti@serengeti-vm

$ serengeti

serengeti>



serengeti> cluster create --name myElephant

serengeti> cluster list --name myElephant

name: myElephant, distro: cdh, status:RUNNING
NAME ROLES INSTANCE CPU MEM(MB) TYPE
---------------------------------------------------------------------------
master [hadoop_NameNode, hadoop_jobtracker] 1 2 7500 LOCAL 50

name: myElephant, distro: cdh, status:RUNNING
NAME ROLES INSTANCE CPU MEM(MB) TYPE
---------------------------------------------------------------------------
master [hive, hadoop_client, pig] 1 1 3700 LOCAL 50

NAME HOST IP
-----------------------------------------------------------------
myElephant-client0 rmc-elephant-009.eng.vmware.com 10.0.20.184



$ ssh rmc@rmc-elephant-009.eng.vmware.com

$ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data

…
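
For a sense of scale: TeraGen writes fixed-size 100-byte rows, so the row count in the command above determines the size of the generated data set. A small sketch of the arithmetic (the helper name is illustrative):

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_bytes(rows: int) -> int:
    """Return the number of bytes TeraGen generates for the given row count."""
    return rows * ROW_BYTES


rows = 1_000_000_000  # row count passed to teragen in the command above
total_bytes = teragen_bytes(rows)
print(f"{total_bytes / 10**9:.0f} GB before HDFS replication")
```

The stored footprint in HDFS is larger than this by the replication factor configured for the cluster.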


Serengeti Spec File
[
  "distro": "apache",
  {
    "name": "master",
    "roles": [
      "hadoop_NameNode",
      "hadoop_jobtracker"
    ],
    "instanceNum": 1,
    "instanceType": "MEDIUM",
    "ha": true
  },
  {
    "name": "worker",
    "roles": [
      "hadoop_datanode", "hadoop_tasktracker"
    ],
    "instanceNum": 5,
    "instanceType": "SMALL",
    "storage": {
      "type": "LOCAL",
      "sizeGB": 10
    }
  }
]

Callouts:
•  "distro": Choice of Distro
•  "ha": HA Option
•  "storage": Choice of Shared Storage or Local Disk
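
To see what such a spec asks for, the node groups can be loaded and totaled. This is only a sketch: it places the two node groups from the slide in a plain JSON array (a hypothetical wrapper chosen so that a standard JSON parser accepts them), and it assumes `sizeGB` is per instance. The field names are taken from the slide; `summarize` is an illustrative helper, not part of Serengeti.

```python
import json

# The two node groups from the slide, wrapped in a plain JSON array
# (hypothetical wrapper; the field names are the ones shown on the slide).
SPEC_JSON = """
[
  {"name": "master",
   "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
   "instanceNum": 1, "instanceType": "MEDIUM", "ha": true},
  {"name": "worker",
   "roles": ["hadoop_datanode", "hadoop_tasktracker"],
   "instanceNum": 5, "instanceType": "SMALL",
   "storage": {"type": "LOCAL", "sizeGB": 10}}
]
"""


def summarize(groups):
    """Total the instances, local disk, and HA-protected groups in a spec."""
    local_gb = sum(
        g["instanceNum"] * g["storage"]["sizeGB"]  # assumes sizeGB is per instance
        for g in groups
        if g.get("storage", {}).get("type") == "LOCAL"
    )
    return {
        "instances": sum(g["instanceNum"] for g in groups),
        "local_gb": local_gb,
        "ha_groups": [g["name"] for g in groups if g.get("ha")],
    }


summary = summarize(json.loads(SPEC_JSON))
print(summary)
```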


Configuring Distros

{
"name" : "cdh",
"version" : "3u3",
"packages" : [
{
"roles" : ["hadoop_NameNode", "hadoop_jobtracker",
"hadoop_tasktracker", "hadoop_datanode",
"hadoop_client"],
"tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
},
{
"roles" : ["hive"],
"tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
},
{
"roles" : ["pig"],
"tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
}
]
},
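
The manifest maps each role to the tarball that provides it. A minimal sketch of that lookup, using the entry shown above; `tarball_for_role` is a hypothetical helper for illustration and not a Serengeti API:

```python
import json

# The distro entry from the slide, as a standalone JSON object.
DISTRO = json.loads("""
{
  "name": "cdh",
  "version": "3u3",
  "packages": [
    {"roles": ["hadoop_NameNode", "hadoop_jobtracker",
               "hadoop_tasktracker", "hadoop_datanode",
               "hadoop_client"],
     "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"},
    {"roles": ["hive"],
     "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
    {"roles": ["pig"],
     "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"}
  ]
}
""")


def tarball_for_role(distro, role):
    """Return the tarball of the first package that lists the given role."""
    for package in distro["packages"]:
        if role in package["roles"]:
            return package["tarball"]
    raise KeyError(f"distro {distro['name']!r} has no package for role {role!r}")


for role in ("hadoop_datanode", "hive", "pig"):
    print(role, "->", tarball_for_role(DISTRO, role))
```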


Serengeti Demo

Deploy Serengeti vApp on vSphere

Deploy a Hadoop cluster in 10 Minutes

Run MapReduce
Scale out the Hadoop cluster

Create a Customized Hadoop cluster

Use Your Favorite Hadoop Distribution


High Availability for the Hadoop Stack

[Diagram: components of the Hadoop stack]
•  ETL Tools, BI Reporting, RDBMS
•  Pig (Data Flow), Hive (SQL), HCatalog
•  ZooKeeper (Coordination)
•  Hive MetaDB, HCatalog MDB
•  Management Server
•  MapReduce (Job Scheduling/Execution System): Jobtracker
•  HBase (Key-Value store)
•  HDFS (Hadoop Distributed File System): Namenode
•  Server

Live Machine Migration Reduces Planned Downtime

Description:
Enables the live migration of virtual machines from one host to another with continuous service availability.

Benefits:
•  Revolutionary technology that is the basis for automated virtual machine movement
•  Meets service level and performance goals

vSphere High Availability (HA) - protection against unplanned downtime

Overview
•  Protection against host and VM failures
•  Automatic failure detection (host, guest OS)
•  Automatic virtual machine restart in minutes, on any available host in cluster
•  OS and application-independent, does not require complex configuration changes

vSphere Fault Tolerance provides continuous protection

Overview

•  Single identical VMs running in lockstep on separate hosts
•  Zero downtime, zero data loss failover for all virtual machines in case of hardware failures
•  Integrated with VMware HA/DRS
•  No complex clustering or specialized hardware required
•  Single common mechanism for all applications and operating systems

[Diagram: App/OS virtual machines protected by HA and FT across two VMware ESX hosts]

Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters

One click to HA

§  Easy to set up: one click is all you need

Example HA Failover for Hadoop

[Diagram: Serengeti and vSphere HA with two Namenode Server instances, above four worker nodes that each run TaskTracker, HDFS Datanode, Hive, and hBase]

vSphere HA and Optionally FT

§  vSphere HA
•  Is application-aware: will auto-restart NN if heartbeat goes away
•  Is easy to configure
•  Has no performance overhead
§  vSphere FT
•  Has the added bonus of no pause-time when there is hardware failure
•  Is limited to one vCPU
•  Perf. measurements: adds about 2% overhead to the NN. Extrapolating from current measurements, this is good for a cluster of ~300 hosts.
§  HDFS 2 HA
•  Only covers Namenode – what about the other 5+ master services?
•  Not available in Apache Hadoop 0.20
•  Not as battle-tested as vSphere HA
•  Is more complex to install and manage
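
The "auto-restart NN if heartbeat goes away" behavior can be pictured as a timeout on the last heartbeat received. The sketch below illustrates that idea only; it is not how vSphere HA is implemented, and the class name, method names, and 30-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Flags a monitored service for restart when its heartbeats stop."""

    timeout_s: float          # hypothetical threshold, not a vSphere default
    last_beat_s: float = 0.0  # time of the most recent heartbeat

    def beat(self, now_s: float) -> None:
        """Record a heartbeat received at time now_s."""
        self.last_beat_s = now_s

    def needs_restart(self, now_s: float) -> bool:
        """Return True once no heartbeat has arrived within the timeout."""
        return now_s - self.last_beat_s > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.beat(now_s=100.0)
print(monitor.needs_restart(now_s=120.0))  # heartbeat still fresh
print(monitor.needs_restart(now_s=140.0))  # heartbeat overdue
```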


Elastic Scaling and Multi-tenancy of Hadoop on vSphere

Current Hadoop: Combined Storage/Compute

1. Hadoop in VM
-  Single Tenant
-  Fixed Resources
2. Separate Compute and Data
-  Single Tenant
-  Elastic Compute
3. Multiple Clusters
-  Multiple Tenants
-  Elastic Compute

[Diagram: VM layouts for each stage, with compute VMs (T1, T2) and storage VMs]

“Time Share”

[Diagram: other VMs and Hadoop VMs, with vHelper, running on VMware vSphere across three hosts, each with HDFS]

While existing apps run during the day to support business operations, Hadoop batch jobs kick off at night to conduct deep analysis of data.
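
One way to picture the time-share idea is to compute, hour by hour, how many Hadoop compute VMs fit into the vCPUs that business applications leave idle. A minimal sketch with made-up numbers; the host capacity, demand profile, and VM size are all hypothetical:

```python
def hadoop_vm_capacity(total_vcpus, business_vcpus_by_hour, vcpus_per_hadoop_vm):
    """Return, for each hour, how many Hadoop compute VMs fit in the spare vCPUs."""
    return [
        max(0, total_vcpus - used) // vcpus_per_hadoop_vm
        for used in business_vcpus_by_hour
    ]


# Hypothetical day: business apps use 80 of 96 vCPUs from 08:00 to 19:59
# and 16 vCPUs during the remaining hours.
business = [80 if 8 <= hour < 20 else 16 for hour in range(24)]

plan = hadoop_vm_capacity(
    total_vcpus=96,
    business_vcpus_by_hour=business,
    vcpus_per_hadoop_vm=4,
)

print("02:00 ->", plan[2], "Hadoop VMs")
print("12:00 ->", plan[12], "Hadoop VMs")
```

In this example the cluster can run five times as many Hadoop compute VMs at night as at midday, which is the elasticity the slide describes.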

Virtualization delivers VM level Multi-tenancy

§  Performance isolation
•  No more noisy neighbors – Resource container to achieve guaranteed SLA for different tenants/users/jobs

§  Configuration isolation
•  Support multiple Hadoop environments on the same physical clusters
•  Multiple Linux versions
•  Multiple Hadoop versions

§  Security isolation
•  Higher level of security
•  Compute VM can only access data VM through Access Control List

[Diagram: tenants Coke and Pepsi; a Runtime Layer of Virtual Hadoop clusters and a Hadoop Queue; a Data Layer of HDFS Data Containers; all running across six hosts]
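
The security-isolation point can be illustrated with a toy access check in which each data VM keeps a list of the compute VMs allowed to reach it. This is a sketch of the concept only, not the mechanism used by vSphere or Serengeti; the class and VM names are hypothetical, with the tenant names borrowed from the slide.

```python
class DataVM:
    """A data (HDFS) VM that serves only the compute VMs named in its ACL."""

    def __init__(self, name, allowed_compute_vms):
        self.name = name
        self.acl = frozenset(allowed_compute_vms)

    def read(self, compute_vm, path):
        """Return a handle to the path if the caller is on the ACL."""
        if compute_vm not in self.acl:
            raise PermissionError(f"{compute_vm} may not access {self.name}")
        return f"{self.name}:{path}"


coke_data = DataVM("coke-data", allowed_compute_vms={"coke-compute-0", "coke-compute-1"})

print(coke_data.read("coke-compute-0", "/sales/2012"))

try:
    coke_data.read("pepsi-compute-0", "/sales/2012")
except PermissionError as err:
    print("denied:", err)
```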


I.  Market Overview & Insights
II.  Virtualization + Hadoop
III.  Distribution & OSS Contribution


Open Source of Serengeti, Spring Hadoop, Hadoop Extensions

Commercial Vendors Community Projects

•  Support major distributions and multiple projects
•  Contribute Hadoop Virtualization Extension (HVE) to Open Source Community

Hadoop Virtualization Extensions: Topology Awareness


Proposed Topology Changes

HADOOP-8468 (Umbrella JIRA)
HADOOP-8469
HDFS-3495
MAPREDUCE-4310
HDFS-3498
MAPREDUCE-4309
HADOOP-8470
HADOOP-8472
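
As a rough illustration of what topology awareness means under virtualization, the sketch below assumes the proposal adds a "node group" level (VMs that share one physical host) between rack and host, and measures distance as the number of hops to the closest common ancestor from both sides. The path layout and names are hypothetical, and the function is a simplified stand-in rather than Hadoop's own code.

```python
def distance(path_a: str, path_b: str) -> int:
    """Hops between two leaves: steps up to the closest common ancestor, then down."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    if len(a) != len(b):
        raise ValueError("both paths must have the same depth")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


# Hypothetical layout: /rack/nodegroup/vm, where a node group is one physical host.
vm1 = "/rack1/host1/vm1"
vm2 = "/rack1/host1/vm2"  # same physical host as vm1
vm3 = "/rack1/host2/vm3"  # same rack, different host
vm4 = "/rack2/host3/vm4"  # different rack

for other in (vm1, vm2, vm3, vm4):
    print(other, distance(vm1, other))
```

With the extra level, two VMs on the same physical host are closer to each other than two VMs on different hosts in the same rack. A scheduler can use that to prefer host-local reads, and a placement policy can use it to avoid putting every replica of a block on VMs that share one host.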


Spring for Apache Hadoop

§  Announced initial formation of Spring Data OSS project in 2010
•  Enables Spring-powered applications to use new data access technologies
•  Data project technologies around MongoDB, Neo4J, Riak, Redis, JDBC Extensions, JPA, REST, and Blob

§  Announcing additional contributions on GitHub:
•  Integration with Cascading library
•  HBase support
•  Hadoop security support
•  More examples
•  Administrative application, RESTful API to upload Hadoop jobs to schedule for batch execution, query status, etc.
•  WebHDFS support
