APOLLO GROUP




Hadoop Operations: Starting Out Small
So Your Cluster Isn't Yahoo-sized (yet)
Michael Arnold
Principal Systems Engineer
14 June 2012
Agenda

  Who
  What (Definitions)
  Decisions for Now
  Decisions for Later
  Lessons Learned




APOLLO GROUP             © 2012 Apollo Group        2
APOLLO GROUP




  Who




Who is Apollo?

        Apollo Group is a leading provider of higher
          education programs for working adults.




Who is Michael Arnold?

  Systems Administrator
  Automation geek
  13 years in IT
  I deal with:
      –Server hardware specification/configuration
      –Server firmware
      –Server operating system
      –Hadoop application health
      –Monitoring all the above


APOLLO GROUP




  What
  Definitions




Definitions

  Q: What is a tiny/small/medium/large cluster?
  A:
      –Tiny:          1-9 nodes
      –Small:         10-99 nodes
      –Medium:        100-999 nodes
      –Large:         1000+ nodes
      –Yahoo-sized:   ~4000 nodes




Definitions

  Q: What is a “headnode”?
  A: A server that runs one or more of the following
   Hadoop processes:
      –NameNode
      –JobTracker
      –Secondary NameNode
      –ZooKeeper
      –HBase Master




APOLLO GROUP




  What decisions should you
  make now and which can
  you postpone for later?
  Decisions for Now



Which Hadoop distribution?

  Amazon
  Apache
  Cloudera
  Greenplum
  Hortonworks
  IBM
  MapR
  Platform Computing



Should you virtualize?

  Can be OK for small clusters BUT
      –virtualization adds overhead
      –can cause performance degradation
      –cannot take advantage of Hadoop rack locality
  Virtualization can be good for:
      –functional testing of M/R job or workflow changes
      –evaluation of Hadoop upgrades




What sort of hardware should you be considering?

  Inexpensive
  Not “enterprisey” hardware
     –No RAID*
     –No Redundant power*
  Low power consumption
  No optical drives
     –get systems that can boot off the network



                                              * except in headnodes

Plan for capacity expansion

  Start at the bottom and
   work your way up
  Leave room in your
   cabinets for more
   machines




Plan for capacity expansion (cont.)

  Deploy your initial
   cluster in two cabinets
     –One headnode, one
      switch, and several
      (five) datanodes per
      cabinet




Plan for capacity expansion (cont.)

  Install a second cluster
   in the empty space in
   the upper half of the
   cabinet




APOLLO GROUP




  What decisions should you
  make now and which can
  you postpone for later?
  Decisions for Later



What size cluster?

  Depends upon your:
  Budget
  Data size
  Workload characteristics
  SLA




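A back-of-envelope calculation can turn those inputs into a first node count. All of the figures below (data size, replication factor, headroom, per-node disk) are illustrative assumptions, not recommendations:

```shell
# Rough cluster sizing (every number here is a made-up example):
# 20 TB of raw data, 3x HDFS replication, 25% headroom for temp/shuffle
# space, and datanodes with 12 TB of usable disk each.
raw_tb=20
replication=3
headroom_pct=25
per_node_tb=12

needed_tb=$(( raw_tb * replication * (100 + headroom_pct) / 100 ))
# Round up to whole nodes: (needed + per_node - 1) / per_node
nodes=$(( (needed_tb + per_node_tb - 1) / per_node_tb ))
echo "need ${needed_tb} TB of raw disk -> ${nodes} datanodes"
```

Re-run it with your own numbers; the point is that budget, data size, and SLA all feed a simple calculation before any hardware is ordered.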
What size cluster? (cont.)

  Are your MapReduce jobs:
  compute-intensive?
  reading lots of data?

  http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/




Should you implement rack awareness?


        If more than one switch in the cluster:

                           YES




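Rack awareness is wired up by pointing topology.script.file.name (the Hadoop 1.x-era property in core-site.xml) at a script that receives IPs/hostnames as arguments and prints one rack path per argument. A minimal sketch, with a made-up subnet-to-rack mapping:

```shell
#!/bin/sh
# Minimal rack-topology script. The subnet-to-rack mapping below is an
# invented example; replace it with your own cabling reality.
rack_for() {
  case "$1" in
    10.1.1.*) echo /dc1/rack1 ;;
    10.1.2.*) echo /dc1/rack2 ;;
    *)        echo /default-rack ;;
  esac
}
# Hadoop may pass several hosts at once; answer each on its own line.
for host in "$@"; do
  rack_for "$host"
done
```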
Should you use automation?

       If not in the beginning, then as soon as
                        possible.

  Boot disks will fail.
  Automated OS and application installs:
      –Save time
      –Reduce errors
          •Cobbler/Spacewalk/Foreman/xCAT/etc
          •Puppet/Chef/CFEngine/shell scripts/etc

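As one illustration of what "automation" means in practice, here is a minimal Puppet-style sketch of a datanode role. The package and service names follow CDH3-era conventions; the file path and source are hypothetical:

```puppet
# Sketch of a datanode role (illustrative, not a real manifest).
class hadoop::datanode {
  package { 'hadoop-0.20-datanode':
    ensure => installed,
  }
  file { '/etc/hadoop/conf/hdfs-site.xml':
    ensure  => file,
    source  => 'puppet:///modules/hadoop/hdfs-site.xml',
    require => Package['hadoop-0.20-datanode'],
  }
  service { 'hadoop-0.20-datanode':
    ensure    => running,
    enable    => true,
    subscribe => File['/etc/hadoop/conf/hdfs-site.xml'],
  }
}
```

The payoff is that a replaced boot disk means "reinstall and run the agent," not an afternoon of hand configuration.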
APOLLO GROUP




  Lessons Learned




Keep It Simple

            Don't add redundancy and features
         (server/network) that will make things more
                 complicated and expensive.

               Hadoop has built-in redundancies.

                     Don't overlook them.




Automate the Hardware

  Twelve hours of manual work in the datacenter is
   not fun.
  Make sure all server firmware is configured
   identically.
      –HP SmartStart Scripting Toolkit
      –Dell OpenManage Deployment Toolkit
      –IBM ServerGuide Scripting Toolkit




Rolling upgrades are possible

               (Just not of the Hadoop software.)

   Datanodes can be decommissioned, patched, and
       added back into the cluster without service
                      downtime.




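In Hadoop 1.x-era releases, decommissioning is driven by an excludes file named in hdfs-site.xml; a sketch (the file path is an example):

```xml
<!-- hdfs-site.xml: point the NameNode at an excludes file.
     List hosts to decommission there, then run:
       hadoop dfsadmin -refreshNodes -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```

The rolling cycle is: add the host to the excludes file, refresh, wait for its blocks to re-replicate, patch and reboot the node, remove it from the file, and refresh again.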
The smallest thing can have a big impact on the cluster


  Bad NIC/switchport can cause cluster slowness.

  Slow disks can cause intermittent job slowdowns.




HDFS blocks are weird

  On ext3/ext4:
      –Small blocks are not padded out to the HDFS
       block size; they occupy only the actual size
       of the data.
      –Each HDFS block is actually two files on the
       datanode's filesystem:
          •The actual data and
          •A metadata/checksum file

 # ls -l blk_1058778885645824207*
 -rw-r--r-- 1 hdfs hdfs 35094 May 14 01:26 blk_1058778885645824207
 -rw-r--r-- 1 hdfs hdfs   283 May 14 01:26 blk_1058778885645824207_19155994.meta



Do not prematurely optimize

  Be careful tuning your datanode filesystems.
      • mkfs -t ext4 -T largefile4 ... (probably bad)
      • mkfs -t ext4 -i 131072 -m 0 ... (better)

 /etc/mke2fs.conf
 [fs_types]
  hadoop = {
         features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
         inode_ratio = 131072
         blocksize = -1
         reserved_ratio = 0
         default_mntopts = acl,user_xattr
  }

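To see why the largefile4 preset is probably bad here: it budgets one inode per 4 MiB of disk, while the profile above budgets one per 128 KiB. Since every HDFS block costs two files (data plus .meta), inode demand is far higher than "large file" presets assume. A quick worked example for a hypothetical 2 TiB data disk:

```shell
# Inode budgets for a 2 TiB disk (disk size is an illustrative choice).
disk_bytes=$(( 2 * 1024 * 1024 * 1024 * 1024 ))
largefile4_inodes=$(( disk_bytes / 4194304 ))   # one inode per 4 MiB
tuned_inodes=$(( disk_bytes / 131072 ))         # one inode per 128 KiB
echo "largefile4: ${largefile4_inodes} inodes"
echo "tuned:      ${tuned_inodes} inodes"
# Half a million inodes sounds like a lot, but at two inodes per HDFS
# block a workload with many small blocks can exhaust them long before
# the disk itself is full.
```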
Use DNS-friendly names for services

       hdfs://hdfs.delta.hadoop.apollogrp.edu:8020/
         mapred.delta.hadoop.apollogrp.edu:8021
      http://oozie.delta.hadoop.apollogrp.edu:11000/
      hiveserver.delta.hadoop.apollogrp.edu:10000



   Yes, the names are long, but I bet you can figure out how to
                    connect to Bravo Cluster.




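One way to implement this is a layer of DNS CNAMEs, so each service name tracks whichever headnode currently runs it. A BIND-style sketch; the target hostnames are invented:

```text
; Service names as CNAMEs to the current headnodes (hosts are examples).
hdfs.delta.hadoop.apollogrp.edu.       IN CNAME headnode1.delta.hadoop.apollogrp.edu.
mapred.delta.hadoop.apollogrp.edu.     IN CNAME headnode1.delta.hadoop.apollogrp.edu.
oozie.delta.hadoop.apollogrp.edu.      IN CNAME headnode2.delta.hadoop.apollogrp.edu.
hiveserver.delta.hadoop.apollogrp.edu. IN CNAME headnode2.delta.hadoop.apollogrp.edu.
```

Moving a service to new hardware then means updating one DNS record, not reconfiguring every client.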
Use a parallel, remote execution tool

  pdsh/Cluster SSH/mussh/etc

                 SSH in a for loop is so 2010

  FUNC/MCollective




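The idea is simple fan-out: run the same command on every node concurrently instead of serially. Real usage would be something like pdsh -w node[01-04] uptime; below is a runnable stand-in using xargs -P, where echo takes the place of ssh and the hostnames are made up:

```shell
# pdsh-style fan-out, sketched with xargs -P so it runs anywhere.
hosts='node01 node02 node03 node04'   # invented hostnames
# -P 4 runs four commands at once; echo stands in for ssh.
printf '%s\n' $hosts | xargs -P 4 -n 1 -I{} echo "would run: ssh {} uptime"
```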
Make your log directories as large as you can.

  20-100GB /var/log
      –Implement log purging cronjobs or your log
       directories will fill up.


  Beware: M/R jobs can fill up /tmp as well.




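A purge job can be as small as a find with a retention window. A sketch, where the path, pattern, and 14-day retention are examples, and touch -d (GNU-specific) is used only to fabricate an old file for the demo:

```shell
# Delete rotated logs older than a retention window; run daily from cron,
# e.g.:  30 3 * * * purge_old_logs /var/log/hadoop 14   (illustrative)
purge_old_logs() {
  dir=$1; days=$2
  find "$dir" -type f -name '*.log*' -mtime +"$days" -delete
}

# Demonstrate against a scratch directory so this is safe to run:
tmp=$(mktemp -d)
touch "$tmp/new.log"
touch -d '30 days ago' "$tmp/old.log"   # GNU touch, for the demo only
purge_old_logs "$tmp" 14
ls "$tmp"
```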
Insist on IPMI 2.0 for out-of-band management of server hardware.

  Serial Over LAN is awesome when booting a
   system.
  Standardized hardware/temperature monitoring.
  Simple remote power control.




Spanning-tree is the devil

  Enable portfast on your server switch ports or the
   BMCs may never get a DHCP lease.




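On Cisco-style switches the fix looks like the following (the interface range is illustrative). portfast skips the roughly 30-second listening/learning delay on edge ports so DHCP/PXE requests from BMCs and NICs are answered in time, and bpduguard shuts the port if someone accidentally plugs a switch into it:

```text
! Cisco IOS-style sketch; adjust the interface range to your cabling.
interface range GigabitEthernet0/1 - 24
 spanning-tree portfast
 spanning-tree bpduguard enable
```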
Apollo has re-built its cluster four times.

               You may end up doing so as well.




Apollo Timeline

  First build
  Cloudera Professional Services helped install CDH
  Four nodes
  Manually built OS via USB CD-ROM.
  CDH2




Apollo Timeline

  Second build
  Cobbler
  All software deployment was via kickstart; very little
   was in puppet. Config files were deployed via wget.
  CDH2




Apollo Timeline

  Third build
  OS filesystem partitioning needed to change.
  Most software deployment still via kickstart.
  CDH3b2




Apollo Timeline

  Fourth build
  HDFS filesystem inodes needed to be increased.
  Full puppet automation.
  Added redundant/hotswap enterprise hardware for
   headnodes.
  CDH3u1




Cluster failures at Apollo

  Hardware
      –disk failures (40+)
      –disk cabling (6)
      –RAM (2)
      –switch port (1)
  Software
      –Cluster
          •NFS (NN -> 2NN metadata)
      –Job
          •TT java heap
          •Running out of /tmp or /var/log/hadoop
          •Running out of HDFS space

Know your workload

  You can spend all the time in the world trying to get
   the best CPU/RAM/HDD/switch/cabinet
   configuration, but you are running on pure luck
   until you understand your cluster's workload.




APOLLO GROUP




  Questions?




