How to overcome mysterious problems caused by large and multi-tenant hadoop cluster at Rakuten
Oct/27/2016
Tomomichi Hirano
EC Core Technology Department, Rakuten Inc.
tomomichi.hirano@rakuten.com
Who I am
 Tomomichi Hirano (平野 智巌)
• Joined Rakuten in 2013
• Hadoop administrator
• Monitoring, tuning and improving the Hadoop cluster
• Verifying and enabling new Hadoop-related components
• Troubleshooting all problems
• Regular operations such as adding servers and users, disk replacement, etc.
 Previous team
• Server provisioning, networking and HW-related work.
Today’s Agenda
 Quick introduction
• About our clusters
• Hadoop use cases at Rakuten
 Mysterious problems
• Never ending jobs
• DataNode freezing
• NameNode freezing
• High load after restarting NameNode
• Lessons learned
 Server provisioning and management
• Background for Big Data systems
• Provisioning and management
1 Quick introduction
About our clusters
 Production cluster
• # of slaves : around 200
• HDFS capacity : around 8PB
• # of jobs per day : 30,000 - 50,000
• # of active Hadoop user accounts : around 40
• Types of jobs : MR, Hive, Tez, Spark, Pig, Sqoop, HBase, Slider, etc.
 Other clusters
• Another production cluster for Business Continuity (BC).
• Some clusters for staging and development.
About our clusters

Provisioning Engine
• MAAS for OS provisioning
• Chef for configuring

System Management
• Shinken and PagerDuty for alerting and incident management
• Splunk for reporting
• Ganglia and Grafana for graphing

Security
• Kerberos for cluster security
Hadoop use cases at Rakuten

Feedback loop : Input → Analysis → Output

Input : shop data, purchase data, item data, user behavior, user membership
Output : item search, reports for shops, search quality, search suggest, recommendation, page design, advertisement, event planning, site design, KPI management, marketing and sales
2 Mysterious problems
Mystery 1 : Never ending jobs
Some jobs were very slow to submit or never ended, with a lot of preemption.
Never ending jobs
 Recognized
• Users began to complain: “Hadoop is very slow!!!”
• Actually, a lot of jobs were very slow to submit and/or never ended.
The jobs’ diagnostics showed: “Container preempted by scheduler”
Never ending jobs
 What is “Capacity Scheduler Preemption”?
• Jobs in a high-priority queue kill jobs in low-priority queues.
 Who kills whom?
• There were already too many jobs and queues.
• It was hard to understand what was happening at all.
• So, we decided to build our own monitoring system.
Never ending jobs
 Original monitoring with Grafana / Graphite
Data flow : NameNode and ResourceManager → (REST API, polled by collectd exec-plugin scripts with jq; collectd graphite-plugin) → Graphite for hadoop and Graphite for infra (carbon-cache) → Grafana

curl -s "${RM}:8088/ws/v1/cluster/apps?state=RUNNING"
curl -s "${RM}:8088/ws/v1/cluster/apps?finishedTimeBegin=`date -d '10 minutes ago' +%s%3N`"
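As a concrete illustration, a minimal collectd exec-plugin collector might look like the sketch below. This is an assumption of how such a script could be wired up, not our actual script; the RM hostname is a placeholder, and the PUTVAL lines feed collectd, which forwards them to Graphite.

#!/bin/bash
# Sketch of a collectd exec-plugin collector (hypothetical, for illustration).
# COLLECTD_HOSTNAME and COLLECTD_INTERVAL are set by collectd's exec plugin.
RM="resourcemanager.example.com"            # assumption: your ResourceManager host
HOST="${COLLECTD_HOSTNAME:-$(hostname)}"
INTERVAL="${COLLECTD_INTERVAL:-60}"
while sleep "$INTERVAL"; do
  running=$(curl -s "${RM}:8088/ws/v1/cluster/apps?state=RUNNING"  | jq '.apps.app | length')
  pending=$(curl -s "${RM}:8088/ws/v1/cluster/apps?state=ACCEPTED" | jq '.apps.app | length')
  echo "PUTVAL \"${HOST}/yarn/gauge-running_jobs\" interval=${INTERVAL} N:${running:-0}"
  echo "PUTVAL \"${HOST}/yarn/gauge-pending_jobs\" interval=${INTERVAL} N:${pending:-0}"
done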
Never ending jobs
 Graphs for YARN cluster
< Memory usage of YARN cluster >
< Running and Pending jobs >
Yellow : # of pending jobs
Green : # of running jobs
Pending jobs due to lack of memory.
Never ending jobs
 Graphs to analyze per user
< Running jobs per user >
< Pending jobs per user >
< Memory usage per user >
“Our cluster is not slow, your jobs are too much!”
Never ending jobs
 Never ending jobs with a lot of preemption
< YARN memory usage >
Too much preemption, maybe jobs killing each other.
< Number of preemptions per user >
Never ending jobs
 Tuning preemption, but how long should tasks be allowed to run?
• Investigated the elapsed time of each task and analyzed it with Excel.
• 4.5 million tasks per day!
curl -s "http://${JH}:19888/ws/v1/history/mapreduce/jobs/${job_id}/tasks"
99% of tasks finished within 5 min.
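To put the JobHistory query above to work, the per-task elapsed times can be pulled out with jq for offline analysis; a sketch, assuming the standard MapReduce history REST response shape and the ${JH}/${job_id} placeholders as above:

curl -s "http://${JH}:19888/ws/v1/history/mapreduce/jobs/${job_id}/tasks" \
  | jq -r '.tasks.task[] | [.id, .type, (.elapsedTime/1000|floor)] | @csv' \
  > task_elapsed_seconds.csv   # task id, MAP/REDUCE, elapsed seconds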
Never ending jobs
 Our solution : Cooperation with users
• On the cluster side, set the preemption kill timeout to 10 min (a sketch of the likely setting follows this list).
• On the user side, we gave guidance like below.

Please try to design your jobs so tasks finish in less than 5 minutes normally,
which leaves healthy room up to 10 minutes to avoid getting killed in all cases.

• Still some preemption, but far less!
• Yes, now the cluster is under control.
• We can see “who kills whom” and why now.
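The exact property name was shown only as an image in the original slide; our best guess at the mapping (an assumption, not confirmed by the deck) is the Capacity Scheduler kill-wait, configured in milliseconds:

# Hypothetical mapping of the "10 min" above to a concrete property:
#   yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill = 600000
# Check what is currently deployed (config path is an assumption):
grep -B1 -A2 "max_wait_before_kill" /etc/hadoop/conf/yarn-site.xml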
Mystery 2 : DataNode freezing
DataNodes seemed to freeze for several minutes and sometimes went into “dead” status.
DataNode freezing
 Recognized
• Last contact values of some DataNodes were very high.
• Normally, less than 5 sec.
• But sometimes 1 min, in the worst case 10 mins, and the node went into “dead” status.
• But it recovered without any operation.
Last contact of each DataNode
* Last contact of a DataNode is the elapsed time since its last successful health check with the NameNode.
DataNode freezing
 Investigated the DataNode log
• No log output while this issue was happening.
• The DataNode seemed to be frozen.
 Tried restarting the DataNode and rebooting the OS
• Restarting the DataNode did not help at all.
• Rebooting the OS cleared the issue for a while, but it happened again.
 Observation
• Not a memory leak. An OS-related issue?
• Had to figure out on which nodes and when this issue happened.
DataNode freezing
 Added a graph to monitor the last contact value.
curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"
< Last contact of each DataNode >
Figured out this issue happened only on newly added DataNodes.
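The last contact values behind this graph come from the LiveNodes attribute of the NameNodeInfo JMX bean, which is itself a JSON-encoded string; a sketch of extracting them (an assumption of how such a collector could be written, not our exact script):

curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo" \
  | jq -r '.beans[0].LiveNodes' \
  | jq -r 'to_entries[] | "\(.key) lastContact=\(.value.lastContact)"'   # one line per DataNode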
DataNode freezing
 Analyzed with some other graphs
• Graphs for OS iowait and HDFS usage.
• The trigger of this issue seemed to be high load caused by HDFS writes.
DataNode freezing
 Many tries, but still no help
• Increasing the DataNode heap, increasing the handler count, upgrading the OS, etc.
 Then, took thread dumps and analyzed them
• To figure out whether it was actually freezing or not.
• To figure out whether a misbehaving thread was blocking other threads.
${java home}/bin/jcmd ${pid of target JVM} Thread.print
${java home}/bin/jstack ${pid of target JVM}
Note : these need to be executed as the process owner account.
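In practice it helps to take several dumps a short interval apart, so threads that stay blocked across dumps stand out; a sketch (the pgrep pattern is an assumption, based on the -Dproc_datanode flag set by hadoop-daemon.sh):

PID=$(pgrep -f proc_datanode | head -1)     # find the DataNode JVM
for i in 1 2 3; do                          # run as the process owner
  jstack "$PID" > "dn_threads_$(date +%H%M%S).txt"
  sleep 10
done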
DataNode freezing
 Thread dump analysis
• “heartbeating”, “DataXceiver” and “PacketResponder” were blocked by a thread named “Thread-41”.
"DataNode: XXX heartbeating to ${NAMENODE}:8020" daemon prio=10 tid=0x0000000002156000 nid=0xf26 waiting for monitor entry
[0x00007f9dd315a000] java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.deleteBlock(BlockPoolSliceScanner.java:305)
- waiting to lock <0x00000006fc309158> (a org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.deleteBlocks(BlockPoolSliceScanner.java:330)
...
"Thread-41" daemon prio=10 tid=0x00007f9dec7bf800 nid=0x1097 runnable [0x00007f9dd1c87000] ...
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.addBlockInfo(BlockPoolSliceScanner.java:237)
- locked <0x00000006fc309158> (a org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.assignInitialVerificationTimes(BlockPoolSliceScanner.java:602)
- locked <0x00000006fc309158> (a org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scanBlockPoolSlice(BlockPoolSliceScanner.java:645)
...
Blocked (first trace) / Blocking (second trace, “Thread-41”)
DataNode freezing
 What is “Thread-41”?
• It seems to be doing something with a Java “TreeMap”.
...
"Thread-41" daemon prio=10 tid=0x00007f9dec7bf800 nid=0x1097 runnable [0x00007f9dd1c87000]
java.lang.Thread.State: RUNNABLE
at java.util.TreeMap.put(TreeMap.java:2019)
at java.util.TreeSet.add(TreeSet.java:255)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.addBlockInfo(BlockPoolSliceScanner.java:243)
...
"Thread-41" daemon prio=10 tid=0x00007f9dec7bf800 nid=0x1097 runnable [0x00007f9dd1c87000]
java.lang.Thread.State: RUNNABLE
at java.util.TreeMap.remove(TreeMap.java:2382)
at java.util.TreeSet.remove(TreeSet.java:276)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.delBlockInfo(BlockPoolSliceScanner.java:253)
...
DataNode freezing
 Source code reading
• org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner
Scans the block files under a block pool and verifies that the files are not corrupt.
This keeps track of blocks and their last verification times.
private static final long DEFAULT_SCAN_PERIOD_HOURS = 21*24L; // three weeks
• 3 weeks???
• The DataNode scans every block at least once every 3 weeks by default (see the check below).
• So, it must be something related to the “Datanode Block Scanner”!
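That 3-week default corresponds to a configurable scan period; dfs.datanode.scan.period.hours is the property that overrides this constant (504 hours = 3 weeks), and a quick way to check the value in effect is:

hdfs getconf -confKey dfs.datanode.scan.period.hours   # if unset, the hard-coded 3-week default applies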
DataNode freezing
 Found a workaround!
• Found some strange behavior in how the block map (which decides which blocks will be scanned) was created.
• Also found two files used by the “Datanode Block Scanner” in the local FS.

dncp_block_verification.log.curr
dncp_block_verification.log.prev

 So, we tried deleting them and restarting the DN
• A kind of re-initialization of the “Datanode Block Scanner”.
• This issue never happened again after that!
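A sketch of that workaround as shell steps (the data-dir layout is an assumption; these state files live under each configured dfs.datanode.data.dir, inside the block-pool directory):

find /data*/hadoop/hdfs -name 'dncp_block_verification.log.*' -print    # locate first
# stop the DataNode, then:
find /data*/hadoop/hdfs -name 'dncp_block_verification.log.*' -delete
# restart the DataNode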
DataNode freezing
 Don’t worry!
• This issue should already be fixed as of Hadoop 2.7.0.
https://issues.apache.org/jira/browse/HDFS-7430
Rewrite the BlockScanner to use O(1) memory and use multiple threads
 Lessons learned
• Thread dump and source code reading for deep analysis.
• Especially in cases where we can’t get any clues from logs.
Mystery 3 : NameNode freezing
The NameNode seemed to freeze for several minutes, repeatedly, at an interval.
NameNode freezing
 Recognized
• One day, we recognized strange behavior with the ResourceManager.
< Memory usage of YARN cluster >
< Running and pending jobs in last 10 min >
• It seemed the ResourceManager couldn’t accept new jobs.
• But running jobs were OK.
NameNode freezing
 Added some graphs for the NameNodes. It must be the HDFS checkpoint.
< lastTxId, checkpointTxId >
< if_octets.tx, if_octets.rx >
< RpcQueueTimeAveTime, RpcProcessingTimeAveTime >
< CallQueueLength >
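The RPC-side graphs here can be fed from the NameNode’s RPC JMX bean; a sketch (attribute names are taken from typical 2.x JMX output and may differ slightly across versions):

curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020" \
  | jq '.beans[0] | {RpcQueueTimeAvgTime, RpcProcessingTimeAvgTime, CallQueueLength}'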
NameNode freezing
 Kept monitoring continuously.
• Then we could catch the difference before and after a NameNode failover.
• A checkpoint on the standby NameNode should not affect the active NameNode.
• But it actually did!
• The white line marks a fail-over from the second NameNode (nn2) to the first (nn1).
• It happened only when the second NameNode was active.
NameNode freezing
 HDFS-7858
• Improve HA Namenode Failover detection on the client
• Fix Version/s : 2.8.0, 3.0.0-alpha1
 HDFS-6763
• Initialize file system-wide quota once on transitioning to active
• Fix Version/s : 2.8.0, 3.0.0-alpha1
 Workaround for now
• Our current workaround is just keeping the first NameNode active (a sketch follows below).
• So, we strongly want these fixes backported to an available HDP version!
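A sketch of that workaround with the stock HA admin tool (nn1/nn2 are the HA service IDs from hdfs-site.xml; yours may differ):

hdfs haadmin -getServiceState nn1       # confirm which NameNode is active
hdfs haadmin -failover nn2 nn1          # if nn2 is active, fail back to nn1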
Mystery 4 : High load after restarting the NameNode
The NameNode went into an unstable state due to this unknown high load.
High load after restarting NameNode
 Symptom
• We hit unknown high load after restarting the NameNode, several times.
• It would suddenly disappear after several hours or a few days.
• But the last time, it never went away...
• While this high load existed, the NameNode was in a very unstable state.
• When it happened on the standby NameNode, we couldn’t fail over (fail back).
• A very serious problem for us!
High load after restarting NameNode
 Added graphs for RPC queue activities in the NameNode
• Unknown high load between checkpoints.
< lastTxId (Green) >
< checkpointTxId (Yellow) >
< Waiting Time (Yellow) >
< QueueLength >
Good case Bad case
< Processing Time (Red) >
High load after restarting NameNode
 Multiple graphs
• The NameNode seemed to be receiving a significant amount of data from someone.
• Journal nodes? No...
• DataNodes? Hard to know...
• But it must be related to the high load!
< Receive data size (Blue) >
High load after restarting NameNode
 DataNode log analysis
• Three kinds of 60000 msec timeouts were continuously being logged.
2016-09-28 05:33:35,384 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.net.SocketTimeoutException: Call From XXXX to bhdXXXX:8020 failed on socket timeout exception:
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/XXXX remote=XXX]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:616)
60 sec timeout log
Three methods in the class “BPServiceActor” were failing repeatedly:
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:523)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reportReceivedDeletedBlocks(BPServiceActor.java:312)
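A quick way to confirm which call sites are failing is to count them in the DataNode logs; a sketch (the log path is an assumption for a typical layout):

grep -o 'BPServiceActor\.\(sendHeartBeat\|blockReport\|reportReceivedDeletedBlocks\)' \
  /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log | sort | uniq -c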
High load after restarting NameNode
 Thread dump analysis on NameNode side
• Almost all server handlers were waiting for one lock.
"IPC Server handler 45 on 8020" daemon prio=10 tid=0x00007fbed169f800 nid=0x26b2 waiting on condition [0x00007f9e05be1000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007fa30ad4d890> (a java.util.concurrent.locks.ReentrantLock$FairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
....
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(...)
"IPC Server handler 45 on 8020" daemon prio=10 tid=0x00007fbed169f800 nid=0x26b2 waiting on condition [0x00007f9e05be1000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007fa30ad4d890> (a java.util.concurrent.locks.ReentrantLock$FairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
....
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReceivedAndDeleted(...)
High load after restarting NameNode
 Source code reading
• org.apache.hadoop.hdfs.server.datanode.BPServiceActor
• It seems to handle communication with the NameNode.
private void offerService() throws Exception {
LOG.info("For namenode " + nnAddr + " using"
+ " DELETEREPORT_INTERVAL of " + dnConf.deleteReportInterval + " msec "
+ " BLOCKREPORT_INTERVAL of " + dnConf.blockReportInterval + "msec"
+ " CACHEREPORT_INTERVAL of " + dnConf.cacheReportInterval + "msec"
+ " Initial delay: " + dnConf.initialBlockReportDelay + "msec"
+ "; heartBeatInterval=" + dnConf.heartBeatInterval);
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode bhdpXXXX:8020
using DELETEREPORT_INTERVAL of 500000 msec
BLOCKREPORT_INTERVAL of 21600000msec <= 6 hours
CACHEREPORT_INTERVAL of 10000msec
Initial delay: 0msec;
heartBeatInterval=5000 <= 5 sec
Source code
Actual DN’s log
High load after restarting NameNode
 DataNode log analysis again
• Failed to send block reports repeatedly, at short intervals.
...
2016-10-03 03:40:47,141 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report ...
2016-10-03 03:44:19,759 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report ...
2016-10-03 03:47:43,464 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report ...
...
High load after restarting NameNode
 What was the high load?
• Almost all DataNodes failed to send their full block reports and kept retrying at intervals of a few minutes.
• Yes, it was a “Block Report Storm” from all of the DataNodes.
• While this storm existed, the full block reports never succeeded.
 Then, how to stop the storm?
• We had to reduce the concurrency of these requests somehow.
High load after restarting NameNode
 Tries and errors
• Manual arbitration with iptables
• Worked well, but a little bit tricky.
• And some DataNodes sometimes lost their heartbeat with the active NameNode.
(diagram: iptables lets only a subset of DataNodes reach the active and standby NameNodes at a time)
• Restart the NameNode with different slaves files
• Worked several times, but unfortunately we once took the whole cluster down.
• So, you MUST NOT do this operation!!!
• The safest way (see the sketch after this list)
• A NameNode in its startup phase discards non-initial block reports.
• So, increase dfs.namenode.safemode.extension and wait.
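A sketch of that safest option in concrete terms (the 30-minute value is an illustrative assumption, not our actual setting; the property is in milliseconds):

# hdfs-site.xml : dfs.namenode.safemode.extension = 1800000   (30 min, assumption)
hdfs getconf -confKey dfs.namenode.safemode.extension   # value currently in effect
hdfs dfsadmin -safemode get                             # watch safemode during the restart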
Lessons learned from mysteries
 Monitor, Monitor, Monitor!!!
• A graphing tool is a MUST for a large, multi-tenant cluster.
• Investigating and monitoring with multiple graphs helps greatly.
 Cooperation with users
• For some issues, we have to solve cluster problems together with the users.
 Thread dump and source code reading for deep analysis
• Very important in cases where we can’t get any clues from logs.
• Thread dumps are especially helpful for freezing or locking issues.
 Query examples for NameNode and ResourceManager
Just as a reference
Contents Queries
HDFS cluster curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"
NameNode JVM info curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=JvmMetrics"
NameNode and DataNode curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"
NameNode state curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"
NameNode RPC curl -s "${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020"
NameNode CMS curl -s "${NN}:50070/jmx?qry=java.lang:type=GarbageCollector,name=ConcurrentMarkSweep"
NameNode Heap curl -s "${NN}:50070/jmx?qry=java.lang:type=Memory"
List of a HDFS directory * curl -s --negotiate -u : "${NN}:50070/webhdfs/v1/${HDFS_PATH}?&op=LISTSTATUS"
Usage of a HDFS directory * curl -s --negotiate -u : "${NN}:50070/webhdfs/v1/${HDFS_PATH}?&op=GETCONTENTSUMMARY"
jobs finished in last 10 min curl -s "${RM}:8088/ws/v1/cluster/apps?finishedTimeBegin=`date -d '10 minutes ago' +%s%3N`"
running jobs curl -s "${RM}:8088/ws/v1/cluster/apps?state=RUNNING"
accepted jobs curl -s "${RM}:8088/ws/v1/cluster/apps?state=ACCEPTED"
ResourceManager status curl -s "${RM}:8088/ws/v1/cluster/info"
YARN cluster curl -s "${RM}:8088/ws/v1/cluster/metrics" | jq "."
NodeManager curl -s "${RM}:8088/ws/v1/cluster/nodes" | jq "."
* kinit required in secured cluster.
3 Server provisioning and management
Background for Big Data systems
 Virtualization vs Bare Metal

                         Bare Metal          Virtualization (Cloud)
Management (Operation) : Quite complicated   Easy
Performance            : Best performance    Always a bottleneck
Solutions              : Many legacy ways    AWS, OpenStack ...

 What’s your choice?
• Big Data, and especially Hadoop, needs more resources.
• Bare Metal is the best way to maximize HW power.
Background for Big Data systems
 Server capacity is the most important thing for Big Data
• Cheaper HW: we don’t care about warranties, cheaper parts, furthermore NO REDUNDANCY.
• What we want is just more and more servers.
But that may scare you... Aren’t we afraid of trouble? No, we aren’t.
Here, fully automated OS provisioning is what works for Big Data.
 Bare metal only, but it’s much like cloud
• Full automation of OS installation.
• Full stack management with Chef.
• Everything should be there when you click.
Automation Provisioning and Operation
• Request a new server from the dashboard: organization, role/recipe, host name, custom data for Rakuten.
• The Provisioning Engine (API + workers) controls MAAS for scratch OS installation (power control, DNS API, PowerDNS) and Chef for everything after it.
• All operation by Chef: configuration for your app, app deploy, and monitoring (Shinken, Graphite), driven by the recipes you built for your application (recipes for applications by DevOps engineering).
• Result: fully automated operation, from installation through management and monitoring.
Provisioning, Just 3 Steps

1st Step
• Choose server
2nd Step
• Choose action
• Install
• Destroy
3rd Step
• Hostname
• OS distribution/version
• Tenant and environment
• Recipes of your application
Finally, click and get it.
“Hey, I want a new server” — “Just do it”
Provisioning Process

Request via GUI/API → the Provisioning Core dispatches worker tasks in sequence, approximately 30 min in total:
1. InstallOS (MAAS) : basic install, DNS entry
2. SetupOS (Chef, default infra role) : OS configuration, default infra monitoring
3. SetupApp (Chef, app role) : OS/app configuration and app monitoring configuration, managed from the app’s recipes
→ Finish
Full Stack Management
 Management not only of Infra but also of Hadoop

Tool / role : layer — criteria
• Chef (App XX role) : App Monitoring — designed by application
• Chef (App XX role) : App Deployment — designed by application; custom packages by application
• Chef (organization) : Custom Configuration — custom OS configuration
• Chef (Infra Base role) : Infra Monitoring — default OS monitoring
• Chef (Infra Base role) : OS Configuration — default configuration on OS, basic packages
• MAAS : OS Installation — simple image, disk partitioning / RAID configuration
• Provisioning Core : Inventory Data — detailed H/W spec, custom information for BDD
4 Most important thing at the last
We are hiring!
 Now Rakuten really focuses on utilizing Rakuten’s rich data, so Hadoop will become more and more important.
 Current Hadoop admin team
• Leader (double-post)
• 3 members (2 full-time and 1 double-post)
 So, we need 2 or 3 more engineers for our team!
• Just mail me; I can help you with your application!
http://global.rakuten.com/corp/careers/
tomomichi.hirano@rakuten.com