4. vSphere Big Data Extensions Hadoop

© 2009 VMware Inc. All rights reserved
vSphere Big Data Extensions:
Hadoop Reference Architecture and Performance Best Practices
Xinhui Li (李欣慧)
Senior Engineer, Big Data R&D
VMware China R&D Center

Agenda
▪ Recommended Deployment Topology
▪ Plan Your Cluster

Standard Deployment Configuration on Single Worker
[Diagram: a virtualization host backed by shared storage (SAN/NAS) and local disks. Hadoop Virtual Node 2 runs a DataNode and a TaskTracker. The VM has an OS image VMDK plus multiple data VMDKs, each formatted as Ext4, with mapred.local.dir labeled alongside the Ext4 volumes.]

Standard Deployment Configuration on Single Worker
[Diagram: a virtualization host backed by local disks only. Hadoop Virtual Node 2 runs a DataNode and a TaskTracker. The VM has an OS image VMDK plus multiple data VMDKs, each formatted as Ext4, with mapred.local.dir labeled alongside the Ext4 volumes.]

Standard Deployment Configuration
[Diagram: a virtualization host backed by shared storage (SAN/NAS) and local disks. Hadoop Virtual Node 1 and Hadoop Virtual Node 2 each run a DataNode and a TaskTracker. Each VM has an OS image VMDK plus data VMDKs formatted as Ext4, with mapred.local.dir labeled alongside the Ext4 volumes.]

Standard Deployment Configuration
[Diagram: a virtualization host backed by local disks only. Hadoop Virtual Node 1 and Hadoop Virtual Node 2 each run a DataNode and a TaskTracker. Each VM has an OS image VMDK plus Ext4 volumes, with mapred.local.dir labeled alongside the Ext4 volumes.]

Standard Deployment Configuration for D/C Separation
[Diagram: a virtualization host backed by shared storage (SAN/NAS) and local disks. Hadoop Virtual Node 1 runs only a TaskTracker; Hadoop Virtual Node 2 runs only a DataNode. Each VM has an OS image VMDK, and the diagram shows a long row of data VMDKs formatted as Ext4.]

Data Path for Combined vs. Data/Compute Separation
[Diagram: two virtualization hosts side by side, each with Hadoop Virtual Node 1 and Hadoop Virtual Node 2 (TaskTracker) connected through a virtual switch, comparing the data path of the combined layout with that of data/compute separation.]
▪ Serengeti provides local-storage-based temp space for D/C separation.
• Each compute VM needs its own temp space
• The required temp space differs from one application to another
• This can result in wasted space

Recommended Topology of Data/Compute Separation
[Diagram: a virtualization host backed by shared storage (SAN/NAS) and local disks. Hadoop Virtual Node 1 runs a TaskTracker with an Ext4-formatted VMDK; Hadoop Virtual Node 2 runs a DataNode with multiple Ext4 volumes. Each VM has an OS image VMDK.]

Data Path for Local TT Storage vs. NFS Temp
[Diagram: two virtualization hosts, each with Hadoop Virtual Node 1 and Hadoop Virtual Node 2 connected through a virtual switch, comparing the data path for local TaskTracker storage with that for NFS temp space.]
▪ Serengeti provides NFS-based temp space for D/C separation
• Improves local storage space utilization
• Trade-off between bandwidth efficiency and the overhead of NFS

Consolidated Storage on Single DN VM
[Diagram: a virtualization host backed by shared storage (SAN/NAS) and local disks. Hadoop Virtual Node 1 runs a TaskTracker; Hadoop Virtual Node 2 runs a DataNode with multiple Ext4 volumes, each paired with a directory ("dir"). Each VM has an OS image VMDK. The diagram also shows an NFS client and an NFS server linking the two VMs.]

Recommended Topology of Computing Only Cluster
[Diagram: a virtualization host with shared storage (SAN/NAS). Hadoop Virtual Node 1 runs a TaskTracker and Hadoop Virtual Node 2 runs a DataNode. Each VM has an OS image VMDK, and the data VMDKs are formatted as Ext4.]

Plan Your Cluster
▪ Start with a small cluster and grow it as required
• Start with just four or six nodes
• Increase computation, data, and memory capacity as required
• Available HDFS space = (DFS Remaining value * 95%) / dfs.replication value
▪ Choose the right hardware: master node
• The NameNode and JobTracker often run on the same machine in smaller clusters
• Consider HA/FT settings
• Keep the NameNode and JobTracker on hosts separate from the slave nodes
• Use dual power supplies
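
The HDFS capacity rule of thumb from this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and the 60 TB example cluster are hypothetical, while the 95% factor and the division by the replication factor come from the slide.

```python
TB = 1024 ** 4  # bytes per tebibyte


def available_hdfs_space(dfs_remaining_bytes, replication, usable_fraction=0.95):
    """Estimate how much new data fits into HDFS.

    dfs_remaining_bytes: the "DFS Remaining" value reported for the cluster.
    replication:         the value of dfs.replication.
    usable_fraction:     the 95% factor from the slide.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return dfs_remaining_bytes * usable_fraction / replication


# Hypothetical cluster: 60 TB reported as DFS Remaining, dfs.replication = 3.
estimate = available_hdfs_space(60 * TB, replication=3)
print(f"Room for about {estimate / TB:.1f} TB of new data")
```

Because every block is stored dfs.replication times, raw free space overstates what is usable, which is why the estimate divides by the replication factor.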

Plan Your Cluster
▪ Choose the right hardware: slave node
• At least 2 quad-core CPUs, with HT enabled
• RAM
  • Allow for 6% virtualization overhead
  • 4-8 GB of memory per core is recommended
• Storage
  • At least 8 disks per host; 12 disks per host may be ideal for absolute performance but probably not for price-performance
  • 1-1.5 disks per core is recommended
  • JBOD with 7,200 RPM SATA disks is fine
  • A good practical maximum is 24 TB or 36 TB per slave node. More than that results in massive network traffic if a node dies and block re-replication must take place.
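
As a rough illustration, the per-core ratios above can be turned into a small sizing helper. The function and constant names are hypothetical, and treating the 6% virtualization overhead as memory subtracted from the host total is an assumption made for this sketch rather than something the slide states.

```python
import math

VIRTUALIZATION_OVERHEAD = 0.06   # 6% overhead for virtualization (from the slide)
RAM_GB_PER_CORE = (4, 8)         # recommended 4-8 GB of memory per core
DISKS_PER_CORE = (1.0, 1.5)      # recommended 1-1.5 disks per core
MIN_DISKS_PER_HOST = 8           # at least 8 disks per host
MIN_CORES = 8                    # at least 2 quad-core CPUs


def size_slave_node(cores):
    """Return the recommended RAM and disk ranges for a slave host."""
    if cores < MIN_CORES:
        raise ValueError("the slide recommends at least 2 quad-core CPUs")
    ram_low, ram_high = (cores * gb for gb in RAM_GB_PER_CORE)
    disks_low, disks_high = (
        max(MIN_DISKS_PER_HOST, math.ceil(cores * ratio)) for ratio in DISKS_PER_CORE
    )
    usable = 1 - VIRTUALIZATION_OVERHEAD  # assumption: overhead comes out of host RAM
    return {
        "ram_gb": (ram_low, ram_high),
        "ram_gb_for_vms": (ram_low * usable, ram_high * usable),
        "disks": (disks_low, disks_high),
    }


print(size_slave_node(8))
```

For an 8-core host the disk range works out to 8 to 12, which lines up with the "at least 8, possibly 12" guidance on the slide.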

Plan Your Cluster
▪ Networking
• Use dedicated switches for your Hadoop cluster, with nodes connected to a top-of-rack switch
• Connect nodes at a minimum of 1 Gb/s; consider 10 Gb/s for clusters with large volumes of intermediate data
• Interconnect racks via core switches
• Connect core switches to top-of-rack switches with dual 10 Gb/s links
• Use redundant top-of-rack and core switches
• Separate the management network from the VM network
• Adopt vDS and dvPort groups that span hosts, to keep configuration consistent for VMs and for the virtual ports used for vMotion and network storage
• Leave the management port out of your vDS

Networking Configurations: Four 1G NICs
[Diagram: a virtualization host with four 1 Gb/s uplinks (vmnic0 to vmnic3) connected to pSwitch 1 and pSwitch 2. Virtual Switch 0 carries MGMT (192.168.1.100), VMKERNEL (192.168.2.100), VMOTION (192.168.3.100), and FT (192.168.4.100). Virtual Switch 1 carries the Hadoop cluster VM port group.]
▪ Hadoop VM traffic goes through vSwitch1 (vmnic2 and vmnic3, both active)
▪ On vSwitch0, MGMT and VMkernel traffic goes through vmnic0 (active, with vmnic1 on standby)
▪ vMotion and FT traffic goes through vmnic1 (active, with vmnic0 on standby)
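
The four-NIC teaming layout described on this slide can be written down as plain data and sanity-checked. The sketch below does not call any vSphere API; the dictionary layout and the helper names are hypothetical, and only the vSwitch and vmnic assignments are taken from the slide.

```python
from collections import Counter

# Traffic type -> virtual switch and its active/standby uplinks (from the slide).
TEAMING = {
    "Hadoop VM": {"vswitch": "vSwitch1", "active": ["vmnic2", "vmnic3"], "standby": []},
    "MGMT":      {"vswitch": "vSwitch0", "active": ["vmnic0"], "standby": ["vmnic1"]},
    "VMkernel":  {"vswitch": "vSwitch0", "active": ["vmnic0"], "standby": ["vmnic1"]},
    "vMotion":   {"vswitch": "vSwitch0", "active": ["vmnic1"], "standby": ["vmnic0"]},
    "FT":        {"vswitch": "vSwitch0", "active": ["vmnic1"], "standby": ["vmnic0"]},
}


def traffic_without_failover(teaming):
    """List traffic types that would lose connectivity if a single NIC failed."""
    return [
        name for name, team in teaming.items()
        if len(set(team["active"]) | set(team["standby"])) < 2
    ]


def active_load(teaming):
    """Count how many traffic types are active on each uplink."""
    return Counter(nic for team in teaming.values() for nic in team["active"])


print("No failover for:", traffic_without_failover(TEAMING))
print("Active traffic types per NIC:", dict(active_load(TEAMING)))
```

Laid out this way, it is easy to see that every traffic type has a second uplink to fall back on, and that the vSwitch0 traffic is split across vmnic0 and vmnic1 in normal operation.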

Networking Configurations: 10G for Hadoop VMs
[Diagram: a virtualization host with uplinks to pSwitch 1, pSwitch 2, and pSwitch 3 (two 1 Gb/s links and one 10 GbE link). Virtual Switch 0 carries MGMT (192.168.1.100), VMKERNEL (192.168.2.100), VMOTION (192.168.3.100), and FT (192.168.4.100). Virtual Switch 1 carries the Hadoop cluster VM port group.]
▪ Hadoop VM traffic goes through vSwitch1 (vmnic3)
▪ 10G for Hadoop cluster VMs
• More performance benefits
• If needed, keep redundancy with another vmnic/pSwitch pair
▪ Keep redundancy for the management network

vSphere Configurations
▪ Configure the NTP service on hosts to ensure that the time on all nodes is synchronized
▪ Virtual disk settings
• One datastore per physical disk
• Warm-up is needed on the provisioned cluster
▪ The NUMA scheduler is important for virtualized Hadoop performance
• Poor configuration can result in a 12% performance degradation (1)
• Data VMs should preferably be distributed across NUMA nodes
▪ Provision the right VM size
• Reserve 6% of memory for vSphere usage
• Avoid over-commitment
• Enable NUMA and keep the VM size within a NUMA node
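
The VM sizing guidance above can be checked with a short calculation. This is a minimal sketch under stated assumptions: cores and memory are taken to be split evenly across NUMA nodes, the 6% vSphere reservation is applied to total host memory, and the function names and the 16-core, 128 GB, 2-node host are hypothetical.

```python
VSPHERE_MEMORY_RESERVE = 0.06  # reserve 6% of memory for vSphere usage (from the slide)


def numa_node_capacity(host_cores, host_ram_gb, numa_nodes):
    """Return (cores, RAM in GB) available to VMs within a single NUMA node.

    Assumes resources are divided evenly across NUMA nodes.
    """
    cores_per_node = host_cores // numa_nodes
    ram_per_node = host_ram_gb * (1 - VSPHERE_MEMORY_RESERVE) / numa_nodes
    return cores_per_node, ram_per_node


def fits_in_numa_node(vm_vcpus, vm_ram_gb, host_cores, host_ram_gb, numa_nodes):
    """True if a VM of this size stays within one NUMA node."""
    cores, ram = numa_node_capacity(host_cores, host_ram_gb, numa_nodes)
    return vm_vcpus <= cores and vm_ram_gb <= ram


# Hypothetical host: 16 cores, 128 GB RAM, 2 NUMA nodes.
print(numa_node_capacity(16, 128, 2))
print(fits_in_numa_node(8, 60, 16, 128, 2))   # stays within one node
print(fits_in_numa_node(8, 64, 16, 128, 2))   # exceeds the node's memory after the reserve
```

A VM sized at exactly half the host's memory does not fit once the reservation is accounted for, which is the kind of case the "keep the VM size within a NUMA node" rule is meant to catch.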

For Existing Devices
▪ Roughly fit existing resource capacity to Hadoop
• CPU : RAM : throughput = 4 × 1333 MHz : 32 GB : 800M/s
▪ Use powerful machines to run the master node/compute node
▪ Use high-throughput machines for the slave node/data node
