Monitoring and Troubleshooting
  7/6/2012

© 2012 MapR Technologies   Troubleshooting 1
Monitoring & Troubleshooting
   Agenda
   • Cluster Monitoring Tools
   • Troubleshooting MapReduce Jobs
   • Troubleshooting Scenarios
   • Working with MapR Support
   • Things to Avoid




Monitoring & Troubleshooting
   Objectives
   At the end of this module you will be able to:
   • Identify the tools you can use to monitor your cluster
   • Explain how MapR central logging can help you monitor MapReduce jobs
   • Describe several common troubleshooting scenarios and how to resolve
     issues based on these scenarios
   • List the tools you can use to work with MapR Support




Cluster Monitoring Tools




Monitoring Tools

         Built-In Tools
          – MapR Control System
          – MapR Metrics

         3rd Party Tools
          – Nagios
          – Ganglia




MapR Control System

         MapR Control System
          –   Dashboard with cluster overview
              • Node health
              • MapR-FS and available disks
              • Resource utilization
                  –   bandwidth
                  –   disk space
                  –   CPU
              • MapReduce job status
              • Alarms




MapR Metrics

         MapR Metrics
          –   View performance information about Hadoop jobs
              • Predict cluster usage
              • Measure which jobs consume resources
              • Troubleshoot failures & performance issues
          –   Metrics provided on
              •   Cumulative CPU/memory usage
              •   # of running/failed tasks/attempts
              •   Speed of input, output, and shuffle
              •   Duration of task attempts
              •   Data read, written, or shuffled
              •   Memory in use
              •   Number of records skipped/spilled

3rd Party Tools

          Nagios
           –   Configuration script generator
          Ganglia
           –   The CLDB reports the metrics
           –   MapRGangliaContext
           –   gmond is needed only on the CLDB node




MapR Service Logs

          /opt/mapr/logs
          For example:
           – CLDB
           – Warden
           – FileServer (mfs)
           – NFS
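
A first pass over the service logs can be scripted. The following is a minimal sketch, not a MapR tool: it assumes the logs are plain-text files ending in .log and that problems are marked with the keywords ERROR or FATAL. The function name scan_service_logs is hypothetical.

```shell
# Count ERROR and FATAL lines in each *.log file of a log directory.
# The default directory comes from the slide; the *.log pattern and the
# ERROR/FATAL keywords are assumptions about how the services log.
scan_service_logs() {
  local log_dir="${1:-/opt/mapr/logs}"
  local f count
  for f in "$log_dir"/*.log; do
    [ -f "$f" ] || continue
    count=$(grep -c -E 'ERROR|FATAL' "$f")
    printf '%s %s\n' "$(basename "$f")" "$count"
  done
}

# Usage:
#   scan_service_logs /opt/mapr/logs
```

Each output line is a file name followed by its count, so a file with a high count is a reasonable place to start reading.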




Troubleshooting MapReduce Jobs



Central Logging

          MapR 2.0 introduces central logging
           –   Log files written to “local” volume on MapR-FS
               •   replication factor = 1
                   –   I/O confined to node
           – /var/mapr/local/<host>/logs/mapred/userlogs
           – Configurable via JobTracker variable
               •   mapr.localvolumes.path




Central Logging

          New CLI for MapReduce logs
               maprcli job linklogs -jobid <jobPattern> -todir <maprfsDir> [ -jobconf <pathToJobXml> ]
           – Creates a job-centric view of all logs on all involved TaskTracker nodes
           – Creates the following structure under <maprfsDir> for each <jobid>
             matching <jobPattern>
               •   <jobid>/hosts/<host>/
                   –   symbolic links to log directories of tasks executed for <jobid> on <host>
               •   <jobid>/mappers/
                   –   symbolic links to log directories of all map task attempts for <jobid> across the
                       cluster
               •   <jobid>/reducers/
                   –   symbolic links to log directories of all reduce task attempts for <jobid> across the
                       cluster
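
The linklogs command above could be wrapped as follows. This is a sketch under stated assumptions: it uses only the -jobid and -todir options shown on the slide, the function name link_job_logs is hypothetical, and the job ID and target directory in the usage comment are made-up examples.

```shell
# Link all logs for jobs matching a pattern into one MapR-FS directory.
# Only the -jobid and -todir options shown on the slide are used.
link_job_logs() {
  local job_pattern="$1" target_dir="$2"
  if [ -z "$job_pattern" ] || [ -z "$target_dir" ]; then
    echo "usage: link_job_logs <jobPattern> <maprfsDir>" >&2
    return 2
  fi
  maprcli job linklogs -jobid "$job_pattern" -todir "$target_dir"
}

# Usage (hypothetical job ID and directory):
#   link_job_logs job_201207061200_0001 /myvolume/joblogs
#   hadoop fs -ls /myvolume/joblogs/job_201207061200_0001/mappers
```

Listing the mappers or reducers directory afterwards shows one link per task attempt, which makes it easier to find the attempt that failed.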


Troubleshooting Scenarios

          Slow nodes
          Out of memory
          Out of disk space
          Time skew
          No ZooKeeper quorum
          Contention for resources
          Requirements not met




Identifying Slow Nodes

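          Before installation:
           –   Use dd to benchmark raw disk read/write speed
               •   Write: dd bs=4M count=<n> if=/dev/zero of=/dev/sd<x>   (destroys data on the disk)
               •   Read:  dd bs=4M count=<n> if=/dev/sd<x> of=/dev/null

           –   Compare performance across nodes to test network throughput:
               •   dd bs=4M count=<n> if=/dev/zero | sudo ssh root@node 'dd bs=4M of=/dev/null'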

          After installation:
           – Look at task starting and completion times
           – Look in system logs for memory or CPU problems
           – Look at the performance of writes to the local volume
             (where intermediate data goes)
          Slow disks identified based on a threshold in mfs.conf
           –   May really be slow NIC
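
After installation, raw-device dd is no longer safe, but write speed can still be sampled against a scratch file on a mounted filesystem. The sketch below assumes GNU dd on Linux; the function name bench_write and the path in the usage comment are hypothetical.

```shell
# Write size_mib MiB of zeros to a scratch FILE (never a raw device) and
# print dd's summary line, which on GNU dd includes elapsed time and rate.
# conv=fsync makes dd flush the data to disk before it exits, so the page
# cache does not hide a slow disk.
bench_write() {
  local target_file="$1" size_mib="${2:-256}"
  dd if=/dev/zero of="$target_file" bs=1M count="$size_mib" conv=fsync 2>&1 | tail -n 1
}

# Usage (hypothetical path; delete the scratch file afterwards):
#   bench_write /data1/ddtest.bin 1024
#   rm -f /data1/ddtest.bin
```

Running the same call on every node and comparing the reported rates is one way to spot an outlier.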


Out of Memory

          Make sure there is enough swap space
          See if a memory-intensive job is running
          Use ulimit to make sure there are no limits on the number of file
           descriptors, resource usage, and the number of processes
          Garbage collection can result in out-of-memory errors
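
The swap and ulimit checks can be gathered in one place. This is a sketch assuming Linux (/proc/meminfo) and bash (ulimit -u is not available in every shell); the function names are hypothetical. Limits are per user, so run it as the user that owns the services.

```shell
# Print SwapTotal in kB from a /proc/meminfo-style file (Linux only).
swap_total_kb() {
  awk '$1 == "SwapTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

# Report the limits the slide mentions for the current user and shell.
# Assumes bash: "ulimit -u" is a bash option for the max process count.
report_limits() {
  echo "open files (ulimit -n): $(ulimit -n)"
  echo "max processes (ulimit -u): $(ulimit -u)"
  echo "swap total kB: $(swap_total_kb)"
}

# Usage:
#   report_limits
```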




Out of Disk Space

          MapR logs go to /opt/mapr/logs
           – If this partition is too small, space can run out
           – Set up a cron job to clean out old logs
           – Move to a larger partition
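
A log cleanup job might look like the sketch below. It assumes GNU or BSD find (for -delete); the function name purge_old_logs and the 14-day default are hypothetical choices, not MapR recommendations. By default it only lists candidates, and it deletes only when asked.

```shell
# List files older than max_age_days under log_dir; delete them only when
# the third argument is "--delete". The 14-day default is a hypothetical
# retention period.
purge_old_logs() {
  local log_dir="${1:-/opt/mapr/logs}" max_age_days="${2:-14}" mode="${3:-}"
  if [ "$mode" = "--delete" ]; then
    find "$log_dir" -type f -mtime +"$max_age_days" -print -delete
  else
    find "$log_dir" -type f -mtime +"$max_age_days" -print
  fi
}

# Usage: preview first, then delete.
#   purge_old_logs /opt/mapr/logs 14
#   purge_old_logs /opt/mapr/logs 14 --delete
# Equivalent crontab entry (runs daily at 03:00):
#   0 3 * * * find /opt/mapr/logs -type f -mtime +14 -delete
```

Files that a service is still writing have a recent modification time, so an age-based filter leaves them alone.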




Time Skew

          NTP is your friend
          A 20-second differential between nodes is the maximum allowed
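
The 20-second limit can be checked with simple arithmetic on epoch timestamps. This is a rough sketch: the function name within_skew and the hostname node1 are hypothetical, and the ssh round trip adds to the measured difference, so it only catches gross skew.

```shell
# Succeed when two epoch timestamps differ by at most max_skew seconds.
# The default of 20 is the limit stated on the slide.
within_skew() {
  local max_skew="${3:-20}"
  local diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  [ "$diff" -le "$max_skew" ]
}

# Usage (node1 is a hypothetical hostname):
#   within_skew "$(date +%s)" "$(ssh node1 date +%s)" || echo "node1: clock skew too large"
```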




No ZooKeeper Quorum

          Not enough ZooKeepers running
          configure.sh run improperly
           –   Different ZooKeeper or CLDB nodes specified
          Network problem
           –   Hostname resolution
           –   Physical connection down
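
"Not enough ZooKeepers running" has a precise meaning: a ZooKeeper ensemble needs a strict majority of its configured servers. The arithmetic is sketched below; the function names are hypothetical.

```shell
# A ZooKeeper ensemble of n servers needs a strict majority: floor(n/2) + 1.
zk_quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeed when the number of running servers reaches the quorum size.
zk_has_quorum() {
  local running="$1" total="$2"
  [ "$running" -ge "$(zk_quorum_size "$total")" ]
}

# Usage:
#   zk_has_quorum 2 3 && echo "quorum held"
```

With three configured servers the quorum size is two, so one server can be down; with four it is three, which is why an even-sized ensemble tolerates no more failures than the next smaller odd size.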




Contention for Resources

          Make sure there’s no limit on file descriptors, processes
          Make sure the service layout follows good guidelines
           – Don’t run ZooKeeper with CLDB or JobTracker
           – Fewer task slots when running TaskTracker with CLDB or ZooKeeper
           – Avoid running the active JobTracker on the primary CLDB node

          Don’t run unrelated software on cluster nodes
          Don’t mix distributions




Requirements Not Met

          Use Sun Java JDK
          Same users/groups with same UID/GID numbers on all nodes
          Proper licensing
          Host resolution between all nodes
           –   DNS or /etc/hosts
          Keyless ssh between all nodes for the root user
          All necessary ports open
           –   Watch out for iptables and selinux
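
The UID requirement can be verified by collecting one "host uid" pair per node and comparing them. This is a sketch: the function name uids_consistent is hypothetical, and the user name mapr and the hostnames in the usage comment are assumptions about the cluster.

```shell
# Read "host uid" pairs on stdin; succeed only if every host reports the
# same UID. The comparison is textual, on the second field of each line.
uids_consistent() {
  awk 'NR == 1 { first = $2 } $2 != first { bad = 1 } END { exit bad }'
}

# Usage (hypothetical hostnames and user name; relies on keyless ssh):
#   for h in node1 node2 node3; do
#     echo "$h $(ssh "$h" id -u mapr)"
#   done | uids_consistent || echo "UID mismatch for user mapr"
```

The same pattern works for group IDs by substituting id -g for id -u.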




Working with MapR Support

          mapr-support-collect and mapr-support-dump
          fsck and gfsck




Things to Avoid

          Remove ZooKeeper data manually
          Format disks (unless you are sure)
          Run configure.sh incorrectly
          Use dd on an installed node
          Modify configuration files
           – Without a good reason
           – Inconsistently




Questions




