Debugging Apache Hadoop YARN Cluster in Production
Jian He, Junping Du and Xuan Gong
Hortonworks YARN Team
06/30/2016
Who We Are
 Junping Du
– Apache Hadoop Committer and PMC Member
– Dev Lead in Hortonworks YARN team
 Xuan Gong
– Apache Hadoop Committer and PMC Member
– Software Engineer
 Jian He
– Apache Hadoop Committer and PMC Member
– Staff Software Engineer
Today’s Agenda
 YARN in a Nutshell
 Troubleshooting Process and Tools
 Case Study
 Enhanced YARN Log Tool Demo
 Summary and Future
Agenda
YARN in a Nutshell
YARN Architecture
 ResourceManager
 NodeManager
 ApplicationMaster
 Other daemons:
– Application History/Timeline Server
– Job History Server (for MR only)
– Proxy Server
– Etc.
RM and NM in a nutshell
Agenda
YARN in a Nutshell
Troubleshooting Process and Tools
“Troubles” that start a troubleshooting effort on a YARN cluster
 Applications failed
 Applications hang or run slowly
 YARN configuration doesn’t work
 YARN APIs (CLI, web service, etc.) don’t work
 YARN daemons crashed (OOM issues, etc.)
 YARN daemon logs contain errors/warnings
 YARN cluster monitoring tools (like Ambari) raise alerts
Problem type distribution (chart): Configuration, Executing Jobs, Cluster Administration, Installation, Application Development, Performance
Process: Phenomenon -> Root Cause -> Solution
 Phenomenon:
– Application failed
 Root cause:
– Container launch failures
• Classpath issue
• Resource localization failures
– Too many attempt failures
• Network connection issue
• NM disk issues
• AM failure caused by node restart
– Application logic issue
• Container failed with OOM, etc.
– Security issue
• Token related issues
 Solution:
– Infrastructure/hardware issue
• Replace disks
• Fix network
– Mis-configuration
• Fix configuration
• Enhance documentation
– Setup issue
• Fix setup
• Restart services
– Application issue
• Update application
• Workaround
– A YARN bug
• Report/fix it in the Apache community!
Iceberg of troubleshooting – Case Study
 "java.lang.RuntimeException: java.io.FileNotFoundException: /etc/hadoop/2.3.4.0-3485/0/core-site.xml (Too many open files in system)"
 That was actually due to too many open TCP connections
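To confirm that file-descriptor exhaustion is driven by TCP sockets rather than ordinary files, a few standard Linux commands are enough. A minimal sketch (the jps-based PID lookup is an assumption about the environment; some commands need to run as the NM user or root):
# Find the NodeManager PID (assumes a JDK with jps on the PATH)
NM_PID=$(jps | awk '/NodeManager/ {print $1}')
# Count the open file descriptors held by the NM
ls /proc/$NM_PID/fd | wc -l
# Break the descriptors down by type: how many are TCP sockets?
lsof -p $NM_PID | awk '{print $5}' | sort | uniq -c | sort -rn
# Group the NM's TCP connections by remote endpoint to see where they go
ss -tnp | grep "pid=$NM_PID" | awk '{print $5}' | sort | uniq -c | sort -rn | head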
Iceberg of troubleshooting – Dig Deeply
 Most connections are from the local NM to DataNodes
– LogAggregationService
– ResourceLocalizationService
 We found the root cause was a thread leak in the NM LogAggregationService:
– YARN-4697: NM aggregation thread pool is not bound by limits
– YARN-4325: Purge app state from NM state-store should cover more LOG_HANDLING cases
– YARN-4984: LogAggregationService shouldn't swallow exception in handling createAppDir(), which causes a thread leak
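A thread dump makes this kind of leak visible. A sketch for spotting it (the grep pattern is an assumption; exact thread and package names vary by version, but the log-aggregation classes show up in stack frames):
NM_PID=$(jps | awk '/NodeManager/ {print $1}')
# Total live threads (each thread header line in the dump starts with a quote)
jstack $NM_PID | grep -c '^"'
# Threads/frames that touch the log aggregation code
jstack $NM_PID | grep -ci 'logaggregation'
# Take several dumps a few minutes apart; a steadily growing count suggests a leak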
Lessons Learned for Troubleshooting on a Production Cluster
 What do we mean by a “production” cluster?
– Cannot afford to stop/restart the cluster for troubleshooting
– Most operations on the cluster are “read only”
– Often in a fenced network, so debugging is done remotely with a local cluster admin
 Lessons learned:
1. Gather as much related info (screenshots, log files, jstack, memory heap dump, etc.) as you can
2. Work closely with the end user to gain an understanding of the issue and symptoms
3. Build a knowledge base so new cases can be compared with previous ones
4. If possible, reproduce the issue on a test/backup cluster – much easier to troubleshoot and verify
5. Version your configuration!
Handy Tools for YARN Troubleshooting
 Log
 UI
 Historic info
– JobHistoryServer (for MR only)
– Application Timeline Service (v1, v1.5, v2.0)
 Monitoring tools, like Ambari
 Runtime info (see the sketch after this list)
– Memory dump
– jstack
– System metrics
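A sketch of capturing that runtime info from a live daemon with standard JDK tools (the output paths are arbitrary):
RM_PID=$(jps | awk '/ResourceManager/ {print $1}')
# Thread stacks, including lock ownership information
jstack -l $RM_PID > /tmp/rm-jstack-$(date +%s).txt
# Heap dump of live objects (note: can pause the JVM briefly on large heaps)
jmap -dump:live,format=b,file=/tmp/rm-heap.hprof $RM_PID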
Log
 Log CLI
– yarn logs -applicationId <application ID> [OPTIONS]
– More on this later
 Enable debug logging
– When daemons are NOT running
• Add a log level setting such as export YARN_ROOT_LOGGER="DEBUG,console" to yarn-env.sh
• Start the daemons
– When daemons are running
• Dynamically change the log level via the daemon's logLevel UI/CLI
• CLI:
– yarn daemonlog [-getlevel <host:httpPort> <classname>]
– yarn daemonlog [-setlevel <host:httpPort> <classname> <level>]
– For the YARN client side
• Same approach as when daemons are not running
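For example, to raise the CapacityScheduler's log level on a running RM without a restart (a sketch; the host name is a placeholder and 8088 is the default RM web port):
yarn daemonlog -getlevel rm-host.example.com:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
yarn daemonlog -setlevel rm-host.example.com:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler DEBUG
# Remember to set it back to INFO afterwards; DEBUG scheduler logs grow very fast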
Runtime Log Level settings in YARN UI
 RM: http://<rm_addr>:8088/logLevel
 NM: http://<nm_addr>:8042/logLevel
 ATS: http://<ats_addr>:8188/logLevel
UI (Ambari and YARN)
Job History Server
Memory dump analysis
Hadoop metrics
 RPC metrics
– RpcQueueTimeAvgTime
– ReceivedBytes
– …
 JVM metrics
– MemHeapUsedM
– ThreadsBlocked
– …
 Documentation:
– http://s.apache.org/UwSu
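These metrics are also exposed over each daemon's /jmx HTTP endpoint, which is handy when no monitoring system is wired up. A sketch against the RM (the host name is a placeholder):
# JVM metrics (includes MemHeapUsedM and ThreadsBlocked)
curl -s 'http://rm-host.example.com:8088/jmx?qry=Hadoop:service=ResourceManager,name=JvmMetrics'
# RPC metrics such as RpcQueueTimeAvgTime live under per-port RPC beans
curl -s 'http://rm-host.example.com:8088/jmx' | grep -i RpcQueueTimeAvgTime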
YARN top
 A top-like command-line view of application stats and queue stats
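It ships with the YARN client, so it runs from any gateway node; a minimal invocation (flags vary slightly across versions):
yarn top
# Refreshes in place, showing per-application container, memory and vcore usage
# plus queue-level stats; Ctrl-C to exit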
Agenda
YARN in a Nutshell
Troubleshooting Process and Tools
Case Study
Why is my job hung?
 A job can be stuck in one of three states (a quick CLI check follows this list):
– NEW_SAVING: waiting for the app to be persisted in the state-store
• Usually a connection error with the state-store (ZooKeeper, etc.)
– ACCEPTED: waiting for the ApplicationMaster container to be allocated
• Often a low max-AM-resource-percentage config
– RUNNING: waiting for containers to be allocated
• Are there resources available for the app?
• Otherwise it is an application-land issue, e.g. stuck on a socket read/write
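The current state and any diagnostics are visible from the CLI as well (the application ID is a placeholder):
yarn application -status application_1467090861129_0001
# Look at the State and Diagnostics fields; an app parked in ACCEPTED
# usually points at AM-resource-percentage or user limits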
Case Study
 Friday evening: customer experiences cluster outages.
 A large number of jobs are getting stuck.
 There are resources available in the cluster.
 Restarting the ResourceManager resolves the issue temporarily.
 But after several hours, the cluster goes back to the bad state.
Case Study
 Are there any resources available in the queue?
Case Study
 Are there any resources available for the app?
– Sometimes, even if the cluster has resources, a user may still not be able to run their applications because they hit the user limit.
– The user limit controls how many resources a single user can use.
– Check user-limit info on the scheduler UI.
– Check the application headroom on the application UI (a CLI sketch follows).
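Queue capacity and limits can also be pulled from the CLI; a sketch (the queue name is a placeholder):
yarn queue -status default
# Prints the queue state plus configured, current and maximum capacity; compare
# current against maximum capacity to see whether the queue itself has headroom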
Case study
 Not a problem of resource contention.
 Used the yarn logs command to get the hung application's logs.
– Found the app waiting for containers to be allocated.
 Problem: the cluster has free resources, but the app is not able to use them.
Case study
 Maybe a scheduling issue.
 Analyze the scheduler log. (The most difficult part.)
– Users are not very familiar with the scheduler log.
– The RM log is huge; text searching through it is hard.
– It gets worse with debug logging enabled.
 Dump the scheduling log into a separate file (a sketch follows).
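Short of a dedicated log appender, even a grep pass makes the scheduler activity tractable. A minimal sketch (the log path and logger name are assumptions about the install):
# Pull only CapacityScheduler entries out of the RM log for offline analysis
grep 'scheduler.capacity' /var/log/hadoop-yarn/yarn-*-resourcemanager-*.log > /tmp/scheduler.log
# Then scope further, e.g. to one application
grep 'application_1467090861129_0001' /tmp/scheduler.log | less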
Case study
 The scheduler log shows several apps being skipped for scheduling.
 Pick one of the applications and go to the application attempt UI.
 Check the resource requests table there; notice that billions of containers are being requested by the application.
Case study
 Killed the misbehaving jobs; the cluster went back to normal.
 Found the user who submitted those jobs and asked them to stop.
 Big achievement so far: the cluster is unblocked.
 Offline debugging then uncovered a product bug.
 Surprisingly, the scheduler measured memory size with an int.
 The misbehaving app asked for so many resources that the total overflowed: a few million containers at a few GB each already exceeds Integer.MAX_VALUE (2,147,483,647) MB.
 YARN-4844: replace int with long for the resource memory API.
What we learned
 Rebooting a service can solve many problems. :)
– Thanks to work-preserving RM and NM recovery (YARN-556 & YARN-1336).
 Denial of service: poorly written workloads or accidental configuration can cause component outages.
– Carefully code against DoS scenarios.
– Example: a user RPC method (getQueueInfo) holds the scheduler lock.
 UI enhancements
– Small change, big impact.
– Example: the resource requests table on the application page was very useful in this case.
 Alerts
– Alert users when an application asks for too many containers.
Case study 2
 10% of the jobs are failing every day.
 After a re-run, jobs sometimes finish successfully.
 No resource contention when jobs are running.
 Logs contain a lot of mysterious connection errors (unable to read call parameters).
Case study 2
 Initial attempt:
– Dug into the code to see under what conditions this exception could be thrown.
– Could not figure it out.
Case study 2
 Requested more failed application logs.
 Identified a pattern across these applications.
 Finally, we realized all the apps failed on a certain set of nodes.
 Asked the customer to exclude those nodes; jobs ran fine after that.
 The customer checked /var/log/messages and found disk issues on those nodes.
Lesson: when dealing with mysterious connection failures or hangs, try to find a correlation between failed apps and nodes (a node-health sketch follows).
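Node health is worth checking whenever this pattern shows up. A quick sketch with the node CLI (the node ID is a placeholder):
# List every node with its state and health report; look for UNHEALTHY nodes
yarn node -list -all
# Drill into one suspect node
yarn node -status node1.example.com:45454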
Agenda
YARN in a Nutshell
Troubleshooting Process and Tools
Case Study
Enhanced YARN Log Tool Demo
Enhanced YARN Log CLI (YARN-4904)
 Useful log CLIs
– Get container logs for running apps
• yarn logs -applicationId ${appId}
– Get a specific container's logs
• yarn logs -applicationId ${appId} -containerId ${containerId}
– Get AM container logs
• yarn logs -applicationId ${appId} -am 1
– Get a specific log file
• yarn logs -applicationId ${appId} -logFiles syslog
• Supports Java regular expressions
– Get the log file's first 'n' bytes (positive size) or last 'n' bytes (negative size)
• yarn logs -applicationId ${appId} -size 100
– Dump the application/container logs to a local dir
• yarn logs -applicationId ${appId} -out ${local_dir}
– List application/container log information
• yarn logs -applicationId ${appId} -show_application_log_info
• yarn logs -applicationId ${appId} -containerId ${containerId} -show_container_log_info
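These options compose; for example, to tail just the AM's stderr of a running app (the application ID is a placeholder; the negative size reads from the end of the file):
yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr -size -4096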
Agenda
YARN in a Nutshell
Troubleshooting Process and Tools
Case Study
Enhanced YARN Log Tool Demo
Summary and Future
Summary and Future
 Summary
– Methodology and tools for troubleshooting YARN
– Case studies
– Enhanced YARN Log CLI
• YARN-4904
 Future enhancements
– ATS (Application Timeline Service) v2
• YARN-2928
• #hs16sj “How YARN Timeline Service v.2 Unlocks 360-Degree Platform Insights at Scale”
– New ResourceManager UI
• YARN-3368
New RM UI (YARN-3368)
Thank You
Backup Slides
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -am 1
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -containerId container_1467090861129_0001_01_000002
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr -size -1000
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -out ${localDir}
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -show_application_log_info
yarn logs -applicationId application_1467090861129_0001 -show_container_log_info