Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM

Copyright (C) 2020 Yahoo Japan Corporation. All Rights Reserved.
2020/09/17
Akira Ajisaka
Upgrading HDFS to 3.3.0 and deploying RBF in production
LINE Developer Meetup #68 – Big Data Platform

Self introduction
• Akira Ajisaka (鯵坂明, Twitter: @ajis_ka)
• Apache Hadoop PMC member (2016~)
• Yahoo! JAPAN (2018~)
Outdoor bouldering for the first time in Mitake

Agenda
• Why and how we upgraded the largest HDFS cluster to 3.3.0
• Hadoop clusters in Yahoo! JAPAN
• Short intro of RBF and why we chose it
• How to upgrade
• How to split namespace
• What we considered and experimented with
• Many troubles and lessons learned from them

Why and how we upgraded the cluster?

Yahoo! JAPAN's largest HDFS cluster
• 100PB actually used
• 500+ DataNodes
• 240M files + directories
• 290M blocks
• 400GB NameNode Java heap
• HDP 2.6.x + patches
(as of Dec. 2019)
Reference: https://www.slideshare.net/techblogyahoo/hadoop-yjtc19-in-shibuya-b2-yjtc

Major existing problems
• The namespace is too large
• NameNode does not scale infinitely due to heavy GC
• The Hadoop version is too old
• HDP 2.6 is based on Apache Hadoop 2.7.3
• 2.7.3 was released 4 years ago
• We upgraded to HDFS 3.3.0 and use RBF to split the namespace

RBF (Router-based Federation)
[Diagram: DFSRouter presents a single namespace (/ with top/, shp/, and auc/) to clients. Behind it are three NameNodes, each serving its own namespace, and a StateStore kept in ZooKeeper.]
Note: Kerberos authentication is supported in Hadoop 3.3.0

How to enable RBF w/o clients' config changes
Before: NameNode @ host1 (port 8020)
After: DFSRouter @ host1 (port 8020) with a ZooKeeper StateStore, in front of NameNode @ host1 (port 8021), NameNode @ host2, and NameNode @ host3
Note: We could not do a rolling upgrade of the cluster because the NN RPC port changed

How to split namespaces
• Calculated # of files/directories/blocks from fsimage
• Calculated # of RPCs from audit logs
• RPCs are classified into two groups (update/read)
• We had to check audit logs to ensure that there is no rename operation between namespaces
• RBF does not support it for now
• Xiaomi has developed HDFS Federation Rename (HFR)
• https://issues.apache.org/jira/browse/HDFS-15087 (work in progress)
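The audit-log checks above can be sketched as follows. This is a minimal illustration, not the tooling used for the actual migration. It assumes audit log lines carry tab-separated key=value fields including cmd=, src=, and dst=. The mount points are taken from the RBF diagram, and the set of "update" commands is illustrative and not exhaustive.

```python
import re
from collections import Counter

# Illustrative mount points from the RBF diagram; adjust to the real mount table.
MOUNT_POINTS = ["/top", "/shp", "/auc"]

# Illustrative, not exhaustive: commands treated as namespace updates.
UPDATE_CMDS = {"create", "append", "delete", "rename", "mkdirs",
               "setPermission", "setOwner", "setReplication", "setTimes"}

# Assumed format: fields are separated by tabs and look like key=value.
FIELD_RE = re.compile(r"(?:^|\t)(cmd|src|dst)=([^\t\n]*)")


def parse_audit_line(line):
    """Extract the cmd, src, and dst fields from one audit log line."""
    return dict(FIELD_RE.findall(line))


def namespace_of(path, mount_points=MOUNT_POINTS):
    """Return the longest mount point containing path, or "/" if none does."""
    best = "/"
    for mp in mount_points:
        if (path == mp or path.startswith(mp + "/")) and len(mp) > len(best):
            best = mp
    return best


def analyze(lines):
    """Count update/read RPCs per namespace and collect cross-namespace renames."""
    rpc_counts = Counter()
    cross_renames = []
    for line in lines:
        fields = parse_audit_line(line)
        # Drop any suffix after the command name, e.g. "rename (options=...)".
        cmd = fields.get("cmd", "").split(" ")[0]
        src = fields.get("src")
        if not cmd or not src:
            continue
        kind = "update" if cmd in UPDATE_CMDS else "read"
        src_ns = namespace_of(src)
        rpc_counts[(src_ns, kind)] += 1
        dst = fields.get("dst", "null")
        if cmd == "rename" and dst != "null" and namespace_of(dst) != src_ns:
            cross_renames.append((src, dst))
    return rpc_counts, cross_renames
```

Any entry in the returned list of cross-namespace renames marks a workload that would break after the split, so the candidate mount points would need to be reconsidered before the upgrade.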

Split DataNodes or not?
• Split: DataNodes are dedicated to each namespace
• No-split: DNs register with all the NameNodes
We chose to split DNs because it is simple

Split DataNodes – Pros and Cons
Pros
• Simple
• Easy to troubleshoot and operate
• No limitation of the # of namespaces
• East-west traffic can be controlled easily
Cons
• Need to calculate how many DNs are required for each namespace
• Possible unbalanced resource usage among namespaces
• HFR uses hard links for rename, which assumes non-split DNs

Check HDFS client-server compatibility
• We upgraded HDFS only
• Old (HDP 2.6) clients still exist, so we had to check the compatibility
• We read the ".proto" files to verify it
• In addition, we upgraded HDFS in the development cluster for end users
• Wrote a blog post: https://techblog.yahoo.co.jp/entry/20191206786320/ (Japanese and English)

Load-balancing DFSRouters
• If a client is configured as follows, the client always connects to host1
<property name="dfs.nameservices" value="ns"/>
<property name="dfs.ha.namenodes.ns" value="dr1,dr2"/>
<property name="dfs.namenode.rpc-address.ns.dr1" value="host1:8020"/>
<property name="dfs.namenode.rpc-address.ns.dr2" value="host2:8020"/>
• To avoid this problem, set "dfs.client.failover.random.order" to true
• This feature is available since Hadoop 2.9.0 and is not available in the old clients, so we patched them internally
• The default value is true in Hadoop 3.4.0+ (HDFS-15350)
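The effect of the random-order setting can be illustrated with a small simulation. This is a simplified model of client behavior, not the actual Hadoop failover proxy provider code. It assumes that a client without random order walks the configured list from the start, and that a client with random order shuffles the list once. The function names are hypothetical.

```python
import random
from collections import Counter

# The two DFSRouter addresses from the configuration above.
ROUTERS = ["host1:8020", "host2:8020"]


def first_router(routers, random_order, rng):
    """Return the router a client tries first under the simplified model."""
    candidates = list(routers)
    if random_order:
        # Models dfs.client.failover.random.order=true: shuffle once per client.
        rng.shuffle(candidates)
    return candidates[0]


def simulate(num_clients, random_order, seed=0):
    """Count how many clients land on each router first."""
    rng = random.Random(seed)
    return Counter(
        first_router(ROUTERS, random_order, rng) for _ in range(num_clients)
    )
```

With random_order=False every simulated client lands on host1:8020, whereas with random_order=True the clients spread roughly evenly over both routers.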

Try Java 11
• Hadoop 3.3.0 supports Java 11 as runtime
• Upgrade to Java 11 to improve GC performance
• We contributed many patches to support Java 11 in the Apache Hadoop community
• https://www.slideshare.net/techblogyahoo/java11-apache-hadoop-146834504 (Japanese)

Upgrade ZooKeeper to 3.5.x
• Error log w/ Hadoop 3.3.0 and ZK 3.4.x:
(snip)
Caused by: org.apache.zookeeper.KeeperException$UnimplementedException: KeeperErrorCode = Unimplemented for /zkdtsm-router/ZKDTSMRoot/ZKDTSMSeqNumRoot
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)
    at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
    at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
(snip)
• Hadoop 3.3.0 upgraded the Curator version, which depends on ZooKeeper 3.5.x (HADOOP-16579)
• We did a rolling upgrade of the ZK cluster before upgrading HDFS
• The upgrade succeeded without any major problems

Planned schedule
• 2019.9 Upgraded to trunk in the dev cluster
• 2020.3 Apache Hadoop 3.3.0 released
• 2020.3 Upgraded to 3.3.0 in the staging cluster
• 2020.5 Upgraded to 3.3.0 in production

Actual schedule
• 2019.9 Upgraded to trunk in the dev cluster (with 1 retry)
• 2020.7 Apache Hadoop 3.3.0 released
• 2020.8 Upgraded to 3.3.0 in the staging cluster (with 2 retries)
• 2020.8 Upgraded to 3.3.0 in production (no retry! but faced many troubles...)
• The upgrade was completed remotely

DistCp is slower than expected
• We used DistCp to move recent data between namespaces after the upgrade, but it did not finish by the deadline
• Directory listing of src/dst is serial
• Increasing Map tasks does not help
• DistCp always fails if (# of Map tasks) > 200 and the dynamic option is true
• It fails with a configuration error
• To make matters worse, it fails after the directory listing, which takes a very long time
• DistCp does not work well for very large directories
• Recommend splitting the job
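One way to split the job is to run one DistCp per child directory instead of one job over the whole tree. The sketch below only builds the command lines. The function name, the nameservice URIs in the usage note, and the choice of splitting by immediate child directory are assumptions for illustration. The 200-map guard mirrors the failure described above, not a documented DistCp limit.

```python
def split_distcp_jobs(src_root, dst_root, child_dirs, max_maps=200, dynamic=False):
    """Build one `hadoop distcp` command line per child directory.

    src_root and dst_root are URIs of the parent directories, and
    child_dirs lists the subdirectories to copy, one job each.
    """
    if dynamic and max_maps > 200:
        # The talk reports that DistCp always failed in this combination.
        raise ValueError("use at most 200 map tasks with the dynamic strategy")
    commands = []
    for child in child_dirs:
        src = f"{src_root.rstrip('/')}/{child}"
        dst = f"{dst_root.rstrip('/')}/{child}"
        cmd = ["hadoop", "distcp", "-m", str(max_maps)]
        if dynamic:
            cmd += ["-strategy", "dynamic"]
        commands.append(cmd + [src, dst])
    return commands
```

For example, split_distcp_jobs("hdfs://ns-old/data", "hdfs://ns-new/data", ["2020-08-01", "2020-08-02"]) yields two independent jobs. Each job then lists a smaller directory tree, and the jobs can be run in parallel or retried individually.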

DN traffic reached the NW bandwidth limit
• We faced many job failures just after the upgrade
• When splitting DNs, we considered only the data size, but that is not sufficient
• Read/write requests must be considered as well
[Graph: DN out traffic in a subcluster, reaching 25Gbps]
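A rough sizing rule that accounts for both constraints takes the larger of the DN counts required by capacity and by network bandwidth. The sketch below is an assumption for illustration, not the method used in the talk. The function name, the per-DN capacity and bandwidth parameters, and the headroom factor are all hypothetical.

```python
import math


def required_datanodes(data_tb, peak_out_gbps, dn_capacity_tb,
                       dn_bandwidth_gbps, headroom=0.7):
    """Estimate the number of DNs a namespace needs.

    headroom is the fraction of each DN's capacity and bandwidth
    that we are willing to use at peak.
    """
    by_capacity = math.ceil(data_tb / (dn_capacity_tb * headroom))
    by_bandwidth = math.ceil(peak_out_gbps / (dn_bandwidth_gbps * headroom))
    # The tighter of the two constraints decides the DN count.
    return max(by_capacity, by_bandwidth)
```

A namespace holding little data but serving hot reads can need more DNs than its size alone suggests, which is the situation described above.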

DFSRouter slowdown
• DFSRouter slowed down drastically when the active NameNode was restarted
• We wrote a patch, and it was fixed in HDFS-15555
[Graph: DFSRouter average RPC queue time (scale: 30 sec), annotated with "Restarted active NameNode" and "Finished loading fsimage"]

HttpFS incompatibilities
• The implementation of the web server is different
• Hadoop 2.x: Tomcat 6.x
• Hadoop 3.x: Jetty 9.x
• The behavior is very different
• Jetty supports HTTP/1.1 (chunked encoding)
• Default idle timeout is different
• Tomcat: 60 seconds
• Jetty: Set by "hadoop.http.idle_timeout.ms" (default 1 second)
• Response flow (when the server returns 401) is different
• Response body itself is different
• and more...
• Need to test very carefully if you are using HttpFS

Lessons learned
• We changed many configurations at once, which should be avoided as much as possible
• For example, we changed the block placement policy to rack fault-tolerant, and under-replicated blocks rose to 300M+ after the upgrade
• Troubleshooting becomes more difficult
• The HttpFS upgrade, like the ZooKeeper upgrade, can also be separated from this upgrade
• Imagine what could happen in production and test as much of it as possible in advance
• Consider the difference between dev/staging and prod
• There is a limit to what one person can imagine. Ask many colleagues!

HDFS Future works
• Router-based Federation
• Rebalance DNs/namespaces between subclusters well
• Considering multiple subclusters, non-split DNs (or even a hybrid), HFR, and so on
• Erasure Coding in production
• We are internally backporting the EC feature to the old HDFS client, and the work is mostly finished
• Try new low-pause-time GC algorithms
• ZGC, Shenandoah

We are hiring!
https://about.yahoo.co.jp/hr/job-info/role/1247/ (Japanese)
