70a monitoring & troubleshooting

Monitoring and Troubleshooting
7/6/2012

© 2012 MapR Technologies Troubleshooting 1

Monitoring & Troubleshooting
Agenda
• Cluster Monitoring Tools
• Troubleshooting MapReduce Jobs
• Troubleshooting Scenarios
• Working with MapR Support
• Things to Avoid


Monitoring & Troubleshooting
Objectives
At the end of this module you will be able to:
• Identify the tools you can use to monitor your cluster
• Explain how MapR central logging can help you monitor MapReduce jobs
• Describe several common troubleshooting scenarios and how to resolve
issues based on these scenarios
• List the tools you can use to work with MapR Support


Cluster Monitoring Tools


Monitoring Tools

 Built-In Tools
– MapR Control System
– MapR Metrics

 3rd Party Tools
– Nagios
– Ganglia


MapR Control System

 MapR Control System
– Dashboard with cluster overview
• Node health
• MapR-FS and available disks
• Resource utilization
– bandwidth
– disk space
– CPU
• MapReduce job status
• Alarms



MapR Metrics

 MapR Metrics
– View performance information about Hadoop jobs
• Predict cluster usage
• Measure which jobs consume resources
• Troubleshoot failures & performance issues
– Metrics provided on
• Cumulative CPU/memory usage
• # of running/failed tasks/attempts
• Speed of input, output, and shuffle
• Duration of task attempts
• Data read, written, or shuffled
• Memory in use
• Number of records skipped/spilled



3rd Party Tools

 Nagios
– Configuration script generator
 Ganglia
– CLDB collects and reports the cluster metrics
– MapRGangliaContext
– Only need gmond on CLDB node


MapR Service Logs

 /opt/mapr/logs
 For example:
– CLDB
– Warden
– FileServer (mfs)
– NFS
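
A quick way to triage the service logs is to count error lines per file. The sketch below assumes the files end in .log and that severities appear as the literal words ERROR or FATAL. Both are assumptions about the log format, and count_log_errors is a hypothetical helper name.

```shell
# Count lines containing ERROR or FATAL in each *.log file of a log directory
# and print "count path" pairs, highest count first.
# Assumptions: files end in .log; severities appear as the words ERROR or FATAL.
count_log_errors() (
    log_dir="${1:-/opt/mapr/logs}"
    for f in "$log_dir"/*.log; do
        [ -f "$f" ] || continue
        # grep -c prints 0 and exits non-zero when nothing matches
        n=$(grep -c -E 'ERROR|FATAL' "$f" || true)
        printf '%s %s\n' "$n" "$f"
    done | sort -rn
)
```

Calling count_log_errors with no argument scans /opt/mapr/logs. Pass another directory to scan a copy of the logs.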


Troubleshooting MapReduce Jobs


Central Logging

 MapR 2.0 introduces central logging
– Log files written to “local” volume on MapR-FS
• replication factor = 1
– I/O confined to node
– /var/mapr/local/<host>/logs/mapred/userlogs
– Configurable via JobTracker variable
• mapr.localvolumes.path


Central Logging

 New CLI for MapReduce logs
maprcli job linklogs -jobid <jobPattern> -todir <maprfsDir> [-jobconf <pathToJobXml>]
– Creates a job-centric view of all logs on all involved TaskTracker nodes
– Creates the following structure under <maprfsDir> for each <jobid>
matching <jobPattern>
• <jobid>/hosts/<host>/
– symbolic links to log directories of tasks executed for <jobid> on <host>
• <jobid>/mappers/
– symbolic links to log directories of all map task attempts for <jobid> across the
cluster
• <jobid>/reducers/
– symbolic links to log directories of all reduce task attempts for <jobid> across the
cluster
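
Once the links exist, the tree can be summarized per job. The sketch below only counts entries in the hosts/, mappers/ and reducers/ directories described above. It assumes the job directory is reachable through a local path (for example over an NFS mount), and summarize_linked_logs is a hypothetical helper name.

```shell
# Summarize the directory tree created by `maprcli job linklogs` for one job.
# $1 is <maprfsDir>/<jobid> as seen through a local path (assumption: NFS mount).
summarize_linked_logs() (
    job_dir="$1"
    hosts=$(ls "$job_dir/hosts" 2>/dev/null | wc -l | tr -d ' ')
    maps=$(ls "$job_dir/mappers" 2>/dev/null | wc -l | tr -d ' ')
    reduces=$(ls "$job_dir/reducers" 2>/dev/null | wc -l | tr -d ' ')
    printf 'hosts=%s map_attempts=%s reduce_attempts=%s\n' "$hosts" "$maps" "$reduces"
)
```

A job with many more map attempts than expected map tasks usually means attempts were retried, so the mappers/ links are a good place to start reading.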


Troubleshooting Scenarios


Troubleshooting Scenarios

 Slow nodes
 Out of memory
 Out of disk space
 Time skew
 No ZooKeeper quorum
 Contention for resources
 Requirements not met


Identifying Slow Nodes

 Before installation:
– Use dd to benchmark read/write speed
• dd bs=4M if=/dev/zero of=/dev/sd<x>

– Compare performance across nodes to test network throughput:
• dd bs=4M if=/dev/zero | sudo ssh root@node 'dd bs=4M of=/dev/null'

 After installation:
– Look at task starting and completion times
– Look in system logs for memory or CPU problems
– Look at the performance of writes to the local volume
(where intermediate data goes)
 Slow disks are identified based on a threshold in mfs.conf
– The real cause may be a slow NIC
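
Comparing task durations by host is one way to spot a slow node after installation. The sketch below assumes the task times have already been extracted into "host duration_seconds" lines. That input format, the 1.5x default threshold, and the name flag_slow_hosts are all illustrative assumptions.

```shell
# Flag hosts whose mean task duration exceeds FACTOR times the mean of all tasks.
# Input (stdin): one "host duration_seconds" pair per line (hypothetical format).
# Output: "host mean_duration" for each flagged host.
flag_slow_hosts() (
    factor="${1:-1.5}"
    awk -v factor="$factor" '
        { sum[$1] += $2; cnt[$1]++; total += $2; n++ }
        END {
            if (n == 0) exit
            overall = total / n
            for (h in sum) {
                mean = sum[h] / cnt[h]
                if (mean > factor * overall) printf "%s %.1f\n", h, mean
            }
        }' | sort
)
```

A flagged host is only a starting point. Its system logs and the write performance of its local volume still need checking, since the cause may be disk, memory, CPU or network.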


Out of Memory

 Make sure there is enough swap space
 See if a memory-intensive job is running
 Use ulimit to make sure there are no limits on the number of file
descriptors, resource usage, and the number of processes
 Garbage collection can result in out-of-memory errors
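
The swap and ulimit checks can be scripted on Linux. The sketch below reads SwapTotal from /proc/meminfo and prints the soft limits of the current shell. The function names are hypothetical, and the ulimit option letters are the ones bash uses.

```shell
# Print total swap in kB from a meminfo-format file (default: /proc/meminfo).
swap_total_kb() (
    awk '$1 == "SwapTotal:" { print $2 }' "${1:-/proc/meminfo}"
)

# Print the current shell's soft limits (bash option letters).
show_limits() {
    printf 'open files: %s\n' "$(ulimit -n)"
    printf 'max user processes: %s\n' "$(ulimit -u)"
}
```

Run show_limits as the user that starts the MapR services, because limits are per user and per session.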


Out of Disk Space

 MapR logs go to /opt/mapr/logs
– If this partition is too small, space can run out
– Set up a cron job to clean out old logs
– Move to a larger partition
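
A cleanup job along these lines could be run from cron. This is a minimal sketch that assumes a 30-day retention window and a find that supports -delete (GNU and BSD find do). The function name clean_old_logs is hypothetical.

```shell
# Delete regular files not modified for more than DAYS days under a log directory,
# printing each path as it is removed.
# Assumption: 30 days is an acceptable retention window for your site.
clean_old_logs() (
    log_dir="${1:-/opt/mapr/logs}"
    days="${2:-30}"
    find "$log_dir" -type f -mtime +"$days" -print -delete
)
```

An equivalent crontab entry, with an arbitrary 03:00 schedule, would be: 0 3 * * * find /opt/mapr/logs -type f -mtime +30 -delete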


Time Skew

 NTP is your friend
 Maximum allowed clock difference between nodes is 20 seconds
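
The 20-second limit can be checked by comparing epoch timestamps from two nodes. The sketch below is only the comparison. The function name within_skew is hypothetical, and how the remote timestamp is collected is left to the caller.

```shell
# Succeed if two epoch timestamps differ by at most MAX seconds (default 20).
within_skew() (
    a="$1"; b="$2"; max="${3:-20}"
    diff=$((a - b))
    [ "$diff" -lt 0 ] && diff=$((0 - diff))
    [ "$diff" -le "$max" ]
)
```

For example, within_skew "$(date +%s)" "$(ssh node2 date +%s)" compares the local clock with a hypothetical host node2. The ssh round trip adds a little error, so treat results close to the limit with care.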


No ZooKeeper Quorum

 Not enough ZooKeepers running
 configure.sh run improperly
– Different ZooKeeper or CLDB nodes specified
 Network problem
– Hostname resolution
– Physical connection down
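
A ZooKeeper ensemble needs a strict majority of its servers running to form a quorum. The arithmetic is sketched below. The helper names quorum_size and has_quorum are hypothetical.

```shell
# Smallest number of running servers that is a strict majority of TOTAL.
quorum_size() (
    echo $(( $1 / 2 + 1 ))
)

# Succeed if RUNNING servers out of TOTAL still form a quorum.
has_quorum() (
    running="$1"; total="$2"
    [ "$running" -ge "$(quorum_size "$total")" ]
)
```

With three ZooKeeper nodes the cluster tolerates one failure, and with five it tolerates two. An even count adds no extra tolerance, which is why odd ensemble sizes are the usual choice.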


Contention for Resources

 Make sure there’s no limit on file descriptors, processes
 Make sure the service layout follows good guidelines
– Don’t run ZooKeeper with CLDB or JobTracker
– Fewer task slots when running TaskTracker with CLDB or ZooKeeper
– Avoid running the active JobTracker on the primary CLDB node

 Don’t run other random things on cluster nodes
 Don’t mix distributions


Requirements Not Met

 Use Sun Java JDK
 Same users/groups with same UID/GID numbers on all nodes
 Proper licensing
 Host resolution between all nodes
– DNS or /etc/hosts
 Keyless ssh between all nodes for the root user
 All necessary ports open
– Watch out for iptables and selinux
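
The UID/GID requirement can be verified once the IDs have been gathered from every node. The sketch below assumes input lines of the form "host uid gid" for a single user. That format and the name ids_consistent are illustrative.

```shell
# Read "host uid gid" lines on stdin; succeed only if every host reports
# the same uid and gid.
ids_consistent() (
    distinct=$(awk '{ print $2 ":" $3 }' | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -le 1 ]
)
```

With keyless ssh in place, the input could be gathered with a loop such as: for h in node1 node2; do echo "$h $(ssh "$h" id -u mapr) $(ssh "$h" id -g mapr)"; done | ids_consistent. The host names and the mapr user here are placeholders.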


Working with MapR Support


Working with MapR Support

 mapr-support-collect and mapr-support-dump
 fsck and gfsck


Things to Avoid


Things to Avoid

 Remove ZooKeeper data manually
 Format disks (unless you are sure)
 Run configure.sh incorrectly
 Use dd on an installed node
 Modify configuration files
– Without a good reason
– Inconsistently


Questions

