How to overcome mysterious problems caused by large and multi-tenant hadoop cluster at Rakuten
Oct/27/2016
Tomomichi Hirano
EC Core Technology Department, Rakuten Inc.
tomomichi.hirano@rakuten.com
Who I am
 Tomomichi Hirano (平野 智巌)
• Joined Rakuten in 2013
• Hadoop administrator
• Monitoring, tuning and improving the Hadoop cluster
• Verifying and enabling new Hadoop-related components
• Troubleshooting all problems
• Regular operations such as adding servers and users, disk replacement, etc.
 Previous team
• Server provisioning, networking and HW-related work.
Today’s Agenda
 Quick introduction
• About our clusters
• Hadoop use cases at Rakuten
 Mysterious problems
• Never ending jobs
• DataNode freezing
• NameNode freezing
• High load after restarting NameNode
• Lessons learned
 Server provisioning and management
• Background for Big Data systems
• Provisioning and management
1 Quick introduction
About our clusters
 Production cluster
• # of slaves : around 200
• HDFS capacity : around 8PB
• # of jobs per day : 30,000 - 50,000
• # of active Hadoop user accounts : around 40
• Types of jobs : MR, Hive, Tez, Spark, Pig, Sqoop, HBase, Slider, etc.
 Other clusters
• Another production cluster for Business Continuity (BC).
• Some clusters for staging and development.
About our clusters

Provisioning Engine
• MAAS for OS provisioning
• Chef for configuring

System Management
• Shinken and PagerDuty for alerting and incident management
• Splunk for reporting
• Ganglia and Grafana for graphing

Security
• Kerberos for cluster security
Hadoop use cases at Rakuten

Feedback loop : Input → Analysis → Output

Input : shop data, purchase data, item data, user behavior, user membership
Output : item search, reports for shops, search quality, search suggest, recommendation, page design, advertisement, event planning, site design, KPI management, marketing and sales
2 Mysterious problems
Mystery 1 : Never ending jobs
Some jobs were very slow to submit or never ended, with a lot of preemption.
Never ending jobs
 Recognized
• Users began to complain: “Hadoop is very slow!!!”
• Actually, a lot of jobs were very slow to submit and/or never ended.
The jobs’ diagnostics showed: “Container preempted by scheduler”
Never ending jobs
 What is “Capacity Scheduler Preemption”?
• Jobs in a high-priority queue kill jobs in low-priority queues.
 Who kills whom?
• There were already too many jobs and queues.
• It was hard to understand what was happening at all.
• So, we decided to build our own monitoring system.
Never ending jobs
 Original monitoring with Grafana / Graphite
Data flow : NameNode and ResourceManager → (REST API, polled by collectd exec-plugin scripts with jq; collectd graphite-plugin) → Graphite for hadoop and Graphite for infra (carbon-cache) → Grafana

curl -s "${RM}:8088/ws/v1/cluster/apps?state=RUNNING"
curl -s "${RM}:8088/ws/v1/cluster/apps?finishedTimeBegin=`date -d '10 minutes ago' +%s%3N`"
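As a concrete illustration, a minimal collectd exec-plugin collector might look like the sketch below. This is an assumption of how such a script could be wired up, not our actual script; the RM hostname is a placeholder, and the PUTVAL lines feed collectd, which forwards them to Graphite.

#!/bin/bash
# Sketch of a collectd exec-plugin collector (hypothetical, for illustration).
# COLLECTD_HOSTNAME and COLLECTD_INTERVAL are set by collectd's exec plugin.
RM="resourcemanager.example.com"            # assumption: your ResourceManager host
HOST="${COLLECTD_HOSTNAME:-$(hostname)}"
INTERVAL="${COLLECTD_INTERVAL:-60}"
while sleep "$INTERVAL"; do
  running=$(curl -s "${RM}:8088/ws/v1/cluster/apps?state=RUNNING"  | jq '.apps.app | length')
  pending=$(curl -s "${RM}:8088/ws/v1/cluster/apps?state=ACCEPTED" | jq '.apps.app | length')
  echo "PUTVAL \"${HOST}/yarn/gauge-running_jobs\" interval=${INTERVAL} N:${running:-0}"
  echo "PUTVAL \"${HOST}/yarn/gauge-pending_jobs\" interval=${INTERVAL} N:${pending:-0}"
done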
Never ending jobs
 Graphs for YARN cluster
< Memory usage of YARN cluster >
< Running and Pending jobs >
Yellow : # of pending jobs
Green : # of running jobs
Pending jobs due to lack of memory.
Never ending jobs
 Graphs to analyze per user
< Running jobs per user >
< Pending jobs per user >
< Memory usage per user >
“Our cluster is not slow, your jobs are too much!”
Never ending jobs
 Never ending jobs with a lot of preemption
< YARN memory usage >
Too much preemption, maybe jobs killing each other.
< Number of preemptions per user >
Never ending jobs
 Tuning preemption, but how long should tasks be allowed to run?
• Investigated the elapsed time of each task and analyzed it with Excel.
• 4.5 million tasks per day!
curl -s "http://${JH}:19888/ws/v1/history/mapreduce/jobs/${job_id}/tasks"
99% of tasks finished within 5 min.
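To put the JobHistory query above to work, the per-task elapsed times can be pulled out with jq for offline analysis; a sketch, assuming the standard MapReduce history REST response shape and the ${JH}/${job_id} placeholders as above:

curl -s "http://${JH}:19888/ws/v1/history/mapreduce/jobs/${job_id}/tasks" \
  | jq -r '.tasks.task[] | [.id, .type, (.elapsedTime/1000|floor)] | @csv' \
  > task_elapsed_seconds.csv   # task id, MAP/REDUCE, elapsed seconds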
Never ending jobs
 Our solution : Cooperation with users
• On the cluster side, set the preemption kill timeout to 10 min (a sketch of the likely setting follows this list).
• On the user side, we gave guidance like below.

Please try to design your jobs so tasks finish in less than 5 minutes normally,
which leaves healthy room up to 10 minutes to avoid getting killed in all cases.

• Still some preemption, but far less!
• Yes, now the cluster is under control.
• We can see “who kills whom” and why now.
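The exact property name was shown only as an image in the original slide; our best guess at the mapping (an assumption, not confirmed by the deck) is the Capacity Scheduler kill-wait, configured in milliseconds:

# Hypothetical mapping of the "10 min" above to a concrete property:
#   yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill = 600000
# Check what is currently deployed (config path is an assumption):
grep -B1 -A2 "max_wait_before_kill" /etc/hadoop/conf/yarn-site.xml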
Mystery 2 : DataNode freezing
DataNodes seemed to freeze for several minutes and sometimes went into “dead” status.
DataNode freezing
 Recognized
• Last contact values of some DataNodes were very high.
• Normally, less than 5 sec.
• But sometimes 1 min, in the worst case 10 mins, and the node went into “dead” status.
• But it recovered without any operation.
Last contact of each DataNode
* Last contact of a DataNode is the elapsed time since its last successful health check with the NameNode.
DataNode freezing
 Investigated the DataNode log
• No log output while this issue was happening.
• The DataNode seemed to be frozen.
 Tried restarting the DataNode and rebooting the OS
• Restarting the DataNode did not help at all.
• Rebooting the OS cleared the issue for a while, but it happened again.
 Observation
• Not a memory leak. An OS-related issue?
• Had to figure out on which nodes and when this issue happened.
DataNode freezing
 Added a graph to monitor the last contact value.
curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"
< Last contact of each DataNode >
Figured out this issue happened only on newly added DataNodes.
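The last contact values behind this graph come from the LiveNodes attribute of the NameNodeInfo JMX bean, which is itself a JSON-encoded string; a sketch of extracting them (an assumption of how such a collector could be written, not our exact script):

curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo" \
  | jq -r '.beans[0].LiveNodes' \
  | jq -r 'to_entries[] | "\(.key) lastContact=\(.value.lastContact)"'   # one line per DataNode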
DataNode freezing
 Analyzed with some other graphs
• Graphs for OS iowait and HDFS usage.
• The trigger of this issue seemed to be high load caused by HDFS writes.
DataNode freezing
 Many tries, but still no help
• Increasing the DataNode heap, increasing the handler count, upgrading the OS, etc.
 Then, took thread dumps and analyzed them
• To figure out whether it was actually freezing or not.
• To figure out whether a misbehaving thread was blocking other threads.
${java home}/bin/jcmd ${pid of target JVM} Thread.print
${java home}/bin/jstack ${pid of target JVM}
Note : these need to be executed as the process owner account.
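In practice it helps to take several dumps a short interval apart, so threads that stay blocked across dumps stand out; a sketch (the pgrep pattern is an assumption, based on the -Dproc_datanode flag set by hadoop-daemon.sh):

PID=$(pgrep -f proc_datanode | head -1)     # find the DataNode JVM
for i in 1 2 3; do                          # run as the process owner
  jstack "$PID" > "dn_threads_$(date +%H%M%S).txt"
  sleep 10
done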
DataNode freezing
 Thread dump analysis
• “heartbeating”, “DataXceiver” and “PacketResponder” were blocked by a thread named “Thread-41”.
"DataNode: XXX heartbeating to ${NAMENODE}:8020" daemon prio=10 tid=0x0000000002156000 nid=0xf26 waiting for monitor entry
[0x00007f9dd315a000] java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.deleteBlock(BlockPoolSliceScanner.java:305)
- waiting to lock <0x00000006fc309158> (a org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.deleteBlocks(BlockPoolSliceScanner.java:330)
...
"Thread-41" daemon prio=10 tid=0x00007f9dec7bf800 nid=0x1097 runnable [0x00007f9dd1c87000] ...
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.addBlockInfo(BlockPoolSliceScanner.java:237)
- locked <0x00000006fc309158> (a org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.assignInitialVerificationTimes(BlockPoolSliceScanner.java:602)
- locked <0x00000006fc309158> (a org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scanBlockPoolSlice(BlockPoolSliceScanner.java:645)
...
Blocked (first trace) / Blocking (second trace, “Thread-41”)
DataNode freezing
 What is “Thread-41”?
• It seems to be doing something with a Java “TreeMap”.
...
"Thread-41" daemon prio=10 tid=0x00007f9dec7bf800 nid=0x1097 runnable [0x00007f9dd1c87000]
java.lang.Thread.State: RUNNABLE
at java.util.TreeMap.put(TreeMap.java:2019)
at java.util.TreeSet.add(TreeSet.java:255)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.addBlockInfo(BlockPoolSliceScanner.java:243)
...
"Thread-41" daemon prio=10 tid=0x00007f9dec7bf800 nid=0x1097 runnable [0x00007f9dd1c87000]
java.lang.Thread.State: RUNNABLE
at java.util.TreeMap.remove(TreeMap.java:2382)
at java.util.TreeSet.remove(TreeSet.java:276)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.delBlockInfo(BlockPoolSliceScanner.java:253)
...
DataNode freezing
 Source code reading
• org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner
Scans the block files under a block pool and verifies that the files are not corrupt.
This keeps track of blocks and their last verification times.
private static final long DEFAULT_SCAN_PERIOD_HOURS = 21*24L; // three weeks
• 3 weeks???
• The DataNode scans every block at least once every 3 weeks by default (see the check below).
• So, it must be something related to the “Datanode Block Scanner”!
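That 3-week default corresponds to a configurable scan period; dfs.datanode.scan.period.hours is the property that overrides this constant (504 hours = 3 weeks), and a quick way to check the value in effect is:

hdfs getconf -confKey dfs.datanode.scan.period.hours   # if unset, the hard-coded 3-week default applies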
DataNode freezing
 Found a workaround!
• Found some strange behavior in how the block map (which decides which blocks will be scanned) was created.
• Also found two files used by the “Datanode Block Scanner” in the local FS.

dncp_block_verification.log.curr
dncp_block_verification.log.prev

 So, we tried deleting them and restarting the DN
• A kind of re-initialization of the “Datanode Block Scanner”.
• This issue never happened again after that!
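A sketch of that workaround as shell steps (the data-dir layout is an assumption; these state files live under each configured dfs.datanode.data.dir, inside the block-pool directory):

find /data*/hadoop/hdfs -name 'dncp_block_verification.log.*' -print    # locate first
# stop the DataNode, then:
find /data*/hadoop/hdfs -name 'dncp_block_verification.log.*' -delete
# restart the DataNode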
DataNode freezing
 Don’t worry!
• This issue should already be fixed as of Hadoop 2.7.0.
https://issues.apache.org/jira/browse/HDFS-7430
Rewrite the BlockScanner to use O(1) memory and use multiple threads
 Lessons learned
• Thread dump and source code reading for deep analysis.
• Especially in cases where we can’t get any clues from logs.
Mystery 3 : NameNode freezing
The NameNode seemed to freeze for several minutes, repeatedly, at an interval.
NameNode freezing
 Recognized
• One day, we recognized strange behavior with the ResourceManager.
< Memory usage of YARN cluster >
< Running and pending jobs in last 10 min >
• It seemed the ResourceManager couldn’t accept new jobs.
• But running jobs were OK.
NameNode freezing
 Added some graphs for the NameNodes. It must be the HDFS checkpoint.
< lastTxId, checkpointTxId >
< if_octets.tx, if_octets.rx >
< RpcQueueTimeAveTime, RpcProcessingTimeAveTime >
< CallQueueLength >
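The RPC-side graphs here can be fed from the NameNode’s RPC JMX bean; a sketch (attribute names are taken from typical 2.x JMX output and may differ slightly across versions):

curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020" \
  | jq '.beans[0] | {RpcQueueTimeAvgTime, RpcProcessingTimeAvgTime, CallQueueLength}'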
NameNode freezing
 Kept monitoring continuously.
• Then we could catch the difference before and after a NameNode failover.
• A checkpoint on the standby NameNode should not affect the active NameNode.
• But it actually did!
• The white line marks a fail-over from the second NameNode (nn2) to the first (nn1).
• It happened only when the second NameNode was active.
NameNode freezing
 HDFS-7858
• Improve HA Namenode Failover detection on the client
• Fix Version/s : 2.8.0, 3.0.0-alpha1
 HDFS-6763
• Initialize file system-wide quota once on transitioning to active
• Fix Version/s : 2.8.0, 3.0.0-alpha1
 Workaround for now
• Our current workaround is just keeping the first NameNode active (a sketch follows below).
• So, we strongly want these fixes backported to an available HDP version!
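A sketch of that workaround with the stock HA admin tool (nn1/nn2 are the HA service IDs from hdfs-site.xml; yours may differ):

hdfs haadmin -getServiceState nn1       # confirm which NameNode is active
hdfs haadmin -failover nn2 nn1          # if nn2 is active, fail back to nn1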
Mystery 4 : High load after restarting the NameNode
The NameNode went into an unstable state due to this unknown high load.
High load after restarting NameNode
 Symptom
• We hit unknown high load after restarting the NameNode, several times.
• It would suddenly disappear after several hours or a few days.
• But the last time, it never went away...
• While this high load existed, the NameNode was in a very unstable state.
• When it happened on the standby NameNode, we couldn’t fail over (fail back).
• A very serious problem for us!
High load after restarting NameNode
 Added graphs for RPC queue activities in the NameNode
• Unknown high load between checkpoints.
< lastTxId (Green) >
< checkpointTxId (Yellow) >
< Waiting Time (Yellow) >
< QueueLength >
Good case Bad case
< Processing Time (Red) >
High load after restarting NameNode
 Multiple graphs
• The NameNode seemed to be receiving a significant amount of data from someone.
• Journal nodes? No...
• DataNodes? Hard to know...
• But it must be related to the high load!
< Receive data size (Blue) >
High load after restarting NameNode
 DataNode log analysis
• Three kinds of 60000 msec timeouts were continuously being logged.
2016-09-28 05:33:35,384 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.net.SocketTimeoutException: Call From XXXX to bhdXXXX:8020 failed on socket timeout exception:
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/XXXX remote=XXX]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:616)
60 sec timeout log
Three methods in the class “BPServiceActor” were failing repeatedly:
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:523)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reportReceivedDeletedBlocks(BPServiceActor.java:312)
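A quick way to confirm which call sites are failing is to count them in the DataNode logs; a sketch (the log path is an assumption for a typical layout):

grep -o 'BPServiceActor\.\(sendHeartBeat\|blockReport\|reportReceivedDeletedBlocks\)' \
  /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log | sort | uniq -c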
High load after restarting NameNode
 Thread dump analysis on NameNode side
• Almost all server handlers were waiting for one lock.
"IPC Server handler 45 on 8020" daemon prio=10 tid=0x00007fbed169f800 nid=0x26b2 waiting on condition [0x00007f9e05be1000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007fa30ad4d890> (a java.util.concurrent.locks.ReentrantLock$FairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
....
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(...)
"IPC Server handler 45 on 8020" daemon prio=10 tid=0x00007fbed169f800 nid=0x26b2 waiting on condition [0x00007f9e05be1000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007fa30ad4d890> (a java.util.concurrent.locks.ReentrantLock$FairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
....
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReceivedAndDeleted(...)
High load after restarting NameNode
 Source code reading
• org.apache.hadoop.hdfs.server.datanode.BPServiceActor
• It seems to handle communication with the NameNode.
private void offerService() throws Exception {
LOG.info("For namenode " + nnAddr + " using"
+ " DELETEREPORT_INTERVAL of " + dnConf.deleteReportInterval + " msec "
+ " BLOCKREPORT_INTERVAL of " + dnConf.blockReportInterval + "msec"
+ " CACHEREPORT_INTERVAL of " + dnConf.cacheReportInterval + "msec"
+ " Initial delay: " + dnConf.initialBlockReportDelay + "msec"
+ "; heartBeatInterval=" + dnConf.heartBeatInterval);
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode bhdpXXXX:8020
using DELETEREPORT_INTERVAL of 500000 msec
BLOCKREPORT_INTERVAL of 21600000msec <= 6 hours
CACHEREPORT_INTERVAL of 10000msec
Initial delay: 0msec;
heartBeatInterval=5000 <= 5 sec
Source code
Actual DN’s log
High load after restarting NameNode
 DataNode log analysis again
• Failed to send block reports repeatedly, at short intervals.
...
2016-10-03 03:40:47,141 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report ...
2016-10-03 03:44:19,759 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report ...
2016-10-03 03:47:43,464 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report ...
...
High load after restarting NameNode
 What was the high load?
• Almost all DataNodes failed to send their full block reports and kept retrying at intervals of a few minutes.
• Yes, it was a “Block Report Storm” from all of the DataNodes.
• While this storm existed, the full block reports never succeeded.
 Then, how to stop the storm?
• We had to reduce the concurrency of these requests somehow.
High load after restarting NameNode
 Tries and errors
• Manual arbitration with iptables
• Worked well, but a little bit tricky.
• And some DataNodes sometimes lost their heartbeat with the active NameNode.
(diagram: iptables lets only a subset of DataNodes reach the active and standby NameNodes at a time)
• Restart the NameNode with different slaves files
• Worked several times, but unfortunately we once took the whole cluster down.
• So, you MUST NOT do this operation!!!
• The safest way (see the sketch after this list)
• A NameNode in its startup phase discards non-initial block reports.
• So, increase dfs.namenode.safemode.extension and wait.
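A sketch of that safest option in concrete terms (the 30-minute value is an illustrative assumption, not our actual setting; the property is in milliseconds):

# hdfs-site.xml : dfs.namenode.safemode.extension = 1800000   (30 min, assumption)
hdfs getconf -confKey dfs.namenode.safemode.extension   # value currently in effect
hdfs dfsadmin -safemode get                             # watch safemode during the restart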
Lessons learned from mysteries
 Monitor, Monitor, Monitor!!!
• A graphing tool is a MUST for a large, multi-tenant cluster.
• Investigating and monitoring with multiple graphs helps greatly.
 Cooperation with users
• For some issues, we have to solve cluster problems together with the users.
 Thread dump and source code reading for deep analysis
• Very important in cases where we can’t get any clues from logs.
• Thread dumps are especially helpful for freezing or locking issues.
 Query examples for NameNode and ResourceManager
Just as a reference
Contents Queries
HDFS cluster curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"
NameNode JVM info curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=JvmMetrics"
NameNode and DataNode curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"
NameNode state curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"
NameNode RPC curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020"
NameNode CMS curl -s "${NN}:50070/jmx?qry=java.lang:type=GarbageCollector,name=ConcurrentMarkSweep"
NameNode Heap curl -s "${NN}:50070/jmx?qry=java.lang:type=Memory"
List of a HDFS directory * curl -s --negotiate -u : "${NN}:50070/webhdfs/v1/${HDFS_PATH}?&op=LISTSTATUS"
Usage of a HDFS directory * curl -s --negotiate -u : "${NN}:50070/webhdfs/v1/${HDFS_PATH}?&op=GETCONTENTSUMMARY"
jobs finished in last 10 min curl -s "${RM}:8088/ws/v1/cluster/apps?finishedTimeBegin=`date -d '10 minutes ago' +%s%3N`"
running jobs curl -s "${RM}:8088/ws/v1/cluster/apps?state=RUNNING"
accepted jobs curl -s "${RM}:8088/ws/v1/cluster/apps?state=ACCEPTED"
ResourceManager status curl -s "${RM}:8088/ws/v1/cluster/info"
YARN cluster curl -s "${RM}:8088/ws/v1/cluster/metrics" | jq "."
NodeManager curl -s "${RM}:8088/ws/v1/cluster/nodes" | jq "."
* kinit required in secured cluster.
3 Server provisioning and management
Background for Big Data systems
 Virtualization vs Bare Metal

                         Bare Metal          Virtualization (Cloud)
Management (Operation) : Quite complicated   Easy
Performance            : Best performance    Always a bottleneck
Solutions              : Many legacy ways    AWS, OpenStack ...

 What’s your choice?
• Big Data, and especially Hadoop, needs more resources.
• Bare Metal is the best way to maximize HW power.
Background for Big Data systems
 Server capacity is the most important thing for Big Data
• Cheaper HW: we don’t care about warranties, cheaper parts, furthermore NO REDUNDANCY.
• What we want is just more and more servers.
But that may scare you... Aren’t we afraid of trouble? No, we aren’t.
Here, fully automated OS provisioning is what works for Big Data.
 Bare metal only, but it’s much like cloud
• Full automation of OS installation.
• Full stack management with Chef.
• Everything should be there when you click.
Automation Provisioning and Operation
• Request a new server from the dashboard: organization, role/recipe, host name, custom data for Rakuten.
• The Provisioning Engine (API + workers) controls MAAS for scratch OS installation (power control, DNS API, PowerDNS) and Chef for everything after it.
• All operation by Chef: configuration for your app, app deploy, and monitoring (Shinken, Graphite), driven by the recipes you built for your application (recipes for applications by DevOps engineering).
• Result: fully automated operation, from installation through management and monitoring.
Provisioning, Just 3 Steps

1st Step
• Choose server
2nd Step
• Choose action
• Install
• Destroy
3rd Step
• Hostname
• OS distribution/version
• Tenant and environment
• Recipes of your application
Finally, click and get it.
“Hey, I want a new server” — “Just do it”
Provisioning Process

Request via GUI/API → the Provisioning Core dispatches worker tasks in sequence, approximately 30 min in total:
1. InstallOS (MAAS) : basic install, DNS entry
2. SetupOS (Chef, default infra role) : OS configuration, default infra monitoring
3. SetupApp (Chef, app role) : OS/app configuration and app monitoring configuration, managed from the app’s recipes
→ Finish
Full Stack Management
 Management not only of Infra but also of Hadoop

Tool / role : layer — criteria
• Chef (App XX role) : App Monitoring — designed by application
• Chef (App XX role) : App Deployment — designed by application; custom packages by application
• Chef (organization) : Custom Configuration — custom OS configuration
• Chef (Infra Base role) : Infra Monitoring — default OS monitoring
• Chef (Infra Base role) : OS Configuration — default configuration on OS, basic packages
• MAAS : OS Installation — simple image, disk partitioning / RAID configuration
• Provisioning Core : Inventory Data — detailed H/W spec, custom information for BDD
4 Most important thing at the last
We are hiring!
 Now Rakuten really focuses on utilizing Rakuten’s rich data, so Hadoop will become more and more important.
 Current Hadoop admin team
• Leader (double-post)
• 3 members (2 full-time and 1 double-post)
 So, we need 2 or 3 more engineers for our team!
• Just mail me; I can help you with your application!
http://global.rakuten.com/corp/careers/
tomomichi.hirano@rakuten.com