Copyright (C) 2020 Yahoo Japan Corporation. All Rights Reserved.
2020/09/17
Akira Ajisaka
Upgrading HDFS to 3.3.0 and deploying RBF in production
LINE Developer Meetup #68 – Big Data Platform
Self introduction
• Akira Ajisaka (鯵坂 明, Twitter: @ajis_ka)
• Apache Hadoop PMC member (2016~)
• Yahoo! JAPAN (2018~)
Outdoor bouldering for the first time in Mitake
Agenda
• Why and how we upgraded the largest HDFS cluster to 3.3.0
• Hadoop clusters in Yahoo! JAPAN
• Short intro to RBF and why we chose it
• How to upgrade
• How to split the namespace
• What we considered and experimented with
• Many troubles and lessons learned from them
Why and how did we upgrade the cluster?
Yahoo! JAPAN's largest HDFS cluster
• 100 PB actually used
• 500+ DataNodes
• 240M files + directories
• 290M blocks
• 400 GB NameNode Java heap
• HDP 2.6.x + patches
(as of Dec. 2019)
Reference: https://www.slideshare.net/techblogyahoo/hadoop-yjtc19-in-shibuya-b2-yjtc
Major existing problems
• The namespace is too large
• NameNode cannot scale indefinitely because of heavy GC
• The Hadoop version is too old
• HDP 2.6 is based on Apache Hadoop 2.7.3
• 2.7.3 was released 4 years ago
• We upgraded to HDFS 3.3.0 and use RBF to split the namespace
RBF (Router-based Federation)
[Diagram: a DFSRouter, with its StateStore in ZooKeeper, sits in front of three NameNodes, each serving one namespace of the directory tree (/, top/, shp/, auc/)]
Note: Kerberos authentication is supported in Hadoop 3.3.0
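How a router could map a client path to a namespace can be sketched as a longest-prefix match against a mount table. This is only an illustration, not the DFSRouter implementation: the nameservice names (ns-top, ns-shp, ns-auc) are invented, and the one-to-one mapping of top/, shp/, and auc/ to the three namespaces is an assumption read from the diagram.

```python
# Hypothetical mount table: mount point -> nameservice ID.
# The directory names come from the diagram; the nameservice IDs are invented.
MOUNT_TABLE = {
    "/top": "ns-top",
    "/shp": "ns-shp",
    "/auc": "ns-auc",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Return (nameservice, mount_point) for the longest matching mount point."""
    best = None
    for mount in mount_table:
        prefix = mount.rstrip("/")
        # "/" covers every path; other mounts must match on a path boundary,
        # so "/topology" is not treated as being under "/top".
        if mount == "/" or path == prefix or path.startswith(prefix + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    return mount_table[best], best


if __name__ == "__main__":
    print(resolve("/shp/logs/2020-09-17"))
    print(resolve("/top"))
```

Because the client only ever sees the router's view of the tree, the three NameNodes can each hold a smaller namespace while paths stay unchanged for users.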
How to enable RBF w/o clients' config changes
Before: NameNode @ host1 (port 8020)
After: DFSRouter @ host1 (port 8020), with the StateStore in ZooKeeper, in front of NameNode @ host1 (port 8021), NameNode @ host2, and NameNode @ host3
Note: We could not do a rolling upgrade of the cluster because the NN RPC port changed
How to split namespaces
• Calculated the # of files/directories/blocks from the fsimage
• Calculated the # of RPCs from audit logs
• RPCs are classified into two groups (update/read)
• We had to check audit logs to ensure that there are no rename operations between namespaces
• RBF does not support them for now
• Xiaomi has developed HDFS Federation Rename (HFR)
• https://issues.apache.org/jira/browse/HDFS-15087 (work in progress)
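The audit-log checks described above could be sketched roughly as follows. This is not the tooling we used: it assumes the usual tab-separated key=value audit format (cmd=, src=, dst=), assumes /top, /shp, and /auc as the planned split points, and uses an incomplete, illustrative set of read commands. Paths containing whitespace are not handled.

```python
import re
from collections import Counter

# Assumed split points and an assumed (incomplete) set of read-only commands;
# every other command is counted as an update.
MOUNT_POINTS = ("/top", "/shp", "/auc")
READ_CMDS = {"open", "listStatus", "getfileinfo", "contentSummary"}

FIELD = re.compile(r"(\w+)=(\S+)")


def parse(line):
    """Extract key=value fields from one audit log line."""
    return dict(FIELD.findall(line))


def mount_of(path):
    """Return the split point that contains path, or None."""
    for mount in MOUNT_POINTS:
        if path == mount or path.startswith(mount + "/"):
            return mount
    return None


def analyze(lines):
    """Count read/update RPCs per split point and collect cross-namespace renames."""
    rpc_counts = Counter()   # (mount, "read" | "update") -> count
    cross_renames = []       # (src, dst) pairs that would cross namespaces
    for line in lines:
        fields = parse(line)
        cmd, src = fields.get("cmd"), fields.get("src")
        if cmd is None or src is None:
            continue
        kind = "read" if cmd in READ_CMDS else "update"
        rpc_counts[(mount_of(src), kind)] += 1
        if cmd == "rename":
            dst = fields.get("dst", "")
            if mount_of(src) != mount_of(dst):
                cross_renames.append((src, dst))
    return rpc_counts, cross_renames


# Made-up sample lines in the assumed format.
PREFIX = "2020-08-01 12:00:00,000 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:KERBEROS)\tip=/10.0.0.1\t"
SAMPLE = [
    PREFIX + "cmd=open\tsrc=/top/a\tdst=null\tperm=null\tproto=rpc",
    PREFIX + "cmd=rename\tsrc=/top/a\tdst=/top/b\tperm=alice:hadoop:rw-r--r--\tproto=rpc",
    PREFIX + "cmd=rename\tsrc=/top/c\tdst=/shp/c\tperm=alice:hadoop:rw-r--r--\tproto=rpc",
]

if __name__ == "__main__":
    counts, renames = analyze(SAMPLE)
    print(counts)
    print(renames)
```

Any pair reported in the second result is a rename that would fail once the two directories live in different namespaces, so the split points need to be adjusted or the job changed before the upgrade.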
Split DataNodes or not?
Option 1: Split DataNodes for each namespace
Option 2 (no-split): DNs register with all the NameNodes
We chose splitting DNs because it is simple
Split DataNodes – Pros and Cons
Pros
• Simple
• Easy to troubleshoot and operate
• No limit on the # of namespaces
• East-west traffic can be controlled easily
Cons
• Need to calculate how many DNs are required for each namespace
• Possible unbalanced resource usage among namespaces
• HFR uses hard links for rename, which assumes non-split DNs
Check HDFS client-server compatibility
• We upgraded HDFS only
• Old (HDP 2.6) clients still exist, so we had to check compatibility
• We read the ".proto" files to verify it
• In addition, we upgraded HDFS in the development cluster for end users
• Wrote a blog post: https://techblog.yahoo.co.jp/entry/20191206786320/ (Japanese and English)
Load-balancing DFSRouters
• If a client is configured as follows, the client always connects to host1
<property name="dfs.nameservices" value="ns"/>
<property name="dfs.ha.namenodes.ns" value="dr1,dr2"/>
<property name="dfs.namenode.rpc-address.ns.dr1" value="host1:8020"/>
<property name="dfs.namenode.rpc-address.ns.dr2" value="host2:8020"/>
• To avoid this problem, set "dfs.client.failover.random.order" to true
• This feature is available since Hadoop 2.9.0 but not in the old clients, so we patched them internally
• The default value is true in Hadoop 3.4.0+ (HDFS-15350)
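Why the ordering matters can be shown with a small simulation. It is a simplified model, not the HDFS client code: it assumes each new client tries the configured addresses in list order unless the order is shuffled first, which is the behavior "dfs.client.failover.random.order" is meant to enable. The function names are made up for this sketch.

```python
import random
from collections import Counter

# dr1 and dr2 from the client configuration on this slide.
ROUTERS = ["host1:8020", "host2:8020"]


def first_target(routers, random_order, rng):
    """Return the address a newly started client tries first."""
    candidates = list(routers)
    if random_order:
        # Models dfs.client.failover.random.order=true.
        rng.shuffle(candidates)
    return candidates[0]


def simulate(n_clients, random_order, seed=0):
    """Count how many clients land on each DFSRouter."""
    rng = random.Random(seed)
    return Counter(
        first_target(ROUTERS, random_order, rng) for _ in range(n_clients)
    )


if __name__ == "__main__":
    print("fixed order :", simulate(1000, random_order=False))
    print("random order:", simulate(1000, random_order=True))
```

With the fixed order every client lands on host1, so the second DFSRouter only receives traffic after a failover; with shuffling the load splits roughly evenly.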
Try Java 11
• Hadoop 3.3.0 supports Java 11 as a runtime
• Upgrade to Java 11 to improve GC performance
• We contributed many patches to the Apache Hadoop community to support Java 11
• https://www.slideshare.net/techblogyahoo/java11-apache-hadoop-146834504 (Japanese)
Upgrade ZooKeeper to 3.5.x
• Error log w/ Hadoop 3.3.0 and ZK 3.4.x:
(snip)
Caused by: org.apache.zookeeper.KeeperException$UnimplementedException: KeeperErrorCode = Unimplemented for /zkdtsm-router/ZKDTSMRoot/ZKDTSMSeqNumRoot
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)
    at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
    at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
(snip)
• Hadoop 3.3.0 upgraded the Curator version, which depends on ZooKeeper 3.5.x (HADOOP-16579)
• We rolling-upgraded the ZK cluster before upgrading HDFS
• The upgrade succeeded without any major problems
Planned schedule
• 2019.9 Upgraded to trunk in the dev cluster
• 2020.3 Apache Hadoop 3.3.0 released
• 2020.3 Upgraded to 3.3.0 in the staging cluster
• 2020.5 Upgraded to 3.3.0 in production
Actual schedule
• 2019.9 Upgraded to trunk in the dev cluster (with 1 retry)
• 2020.7 Apache Hadoop 3.3.0 released
• 2020.8 Upgraded to 3.3.0 in the staging cluster (with 2 retries)
• 2020.8 Upgraded to 3.3.0 in production (no retry! but faced many troubles...)
• The upgrade was completed remotely
Many troubles
DistCp is slower than expected
• We used DistCp to move recent data between namespaces after the upgrade, but it didn't finish by the deadline
• Directory listing of src/dst is serial
• Increasing the # of Map tasks does not help
• DistCp always fails if (# of Map tasks) > 200 and the dynamic option is true
• It fails with a configuration error
• To make matters worse, it fails after the directory listing, which takes a very long time
• DistCp does not work well for very large directories
• We recommend splitting the job
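One way to split a large copy is to run one DistCp job per subdirectory, so that each job lists a smaller tree and the jobs can run side by side. The sketch below only builds the command lines and does not run them. The URIs, the subdirectory names, and the helper name split_distcp are hypothetical; "-update" and "-m" are standard DistCp options.

```python
def split_distcp(src_root, dst_root, subdirs, maps_per_job=100):
    """Build one `hadoop distcp` command per subdirectory.

    Each command copies the contents of src_root/<sub> into dst_root/<sub>.
    """
    commands = []
    for sub in subdirs:
        commands.append([
            "hadoop", "distcp",
            "-update",                # copy only missing or changed files
            "-m", str(maps_per_job),  # upper bound on simultaneous map tasks
            f"{src_root}/{sub}",
            f"{dst_root}/{sub}",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical source/destination namespaces and daily partitions.
    jobs = split_distcp(
        "hdfs://ns-old/data/logs",
        "hdfs://ns-new/data/logs",
        ["2020-08-01", "2020-08-02", "2020-08-03"],
        maps_per_job=100,
    )
    for cmd in jobs:
        print(" ".join(cmd))
```

Keeping the map count of each job at or below 200 also stays clear of the failure with the dynamic option described above.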
DN traffic reached the NW bandwidth limit
• We faced many job failures just after the upgrade
• When splitting DNs, we considered only the data size, but that was not sufficient
• Read/write requests must be considered as well
[Graph: DN out traffic in a subcluster, labeled 25Gbps]
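Sizing a subcluster by both data size and traffic could look like the following minimal sketch. Every number and parameter name here (tb_per_dn, nic_gbps, headroom, the example workloads) is hypothetical and chosen only to show the calculation; it is not the sizing formula we used.

```python
import math


def datanodes_needed(data_tb, peak_out_gbps, tb_per_dn, nic_gbps, headroom=0.7):
    """Size a subcluster by the stricter of storage and network demand.

    headroom is the fraction of disk capacity and NIC bandwidth we are
    willing to plan for (an assumed safety margin).
    """
    by_capacity = math.ceil(data_tb / (tb_per_dn * headroom))
    by_network = math.ceil(peak_out_gbps / (nic_gbps * headroom))
    return max(by_capacity, by_network)


if __name__ == "__main__":
    # Same data size, very different read traffic.
    cold = datanodes_needed(data_tb=10000, peak_out_gbps=100, tb_per_dn=200, nic_gbps=25)
    hot = datanodes_needed(data_tb=10000, peak_out_gbps=2000, tb_per_dn=200, nic_gbps=25)
    print("cold namespace:", cold, "DNs")
    print("hot namespace :", hot, "DNs")
```

Two namespaces holding the same amount of data can need very different numbers of DataNodes once read traffic is taken into account, which is the gap that sizing by data size alone misses.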
DFSRouter slowdown
• DFSRouter slowed down drastically when the active NameNode was restarted
• We wrote a patch; fixed in HDFS-15555
[Graph: DFSRouter average RPC queue time, labeled 30 sec, with annotations "Restarted active NameNode" and "Finished loading fsimage"]
HttpFS incompatibilities
• The implementation of the web server is different
• Hadoop 2.x: Tomcat 6.x
• Hadoop 3.x: Jetty 9.x
• The behavior is very different
• Jetty supports HTTP/1.1 (chunked encoding)
• Default idle timeout is different
• Tomcat: 60 seconds
• Jetty: Set by "hadoop.http.idle_timeout.ms" (default 1 second)
• Response flow (when the server returns 401) is different
• Response body itself is different
• and more...
• Test very carefully if you are using HttpFS
Lessons learned
• We changed many configurations at once, which should be avoided as much as possible
• For example, we changed the block placement policy to rack fault-tolerant, and the # of under-replicated blocks grew to 300M+ after the upgrade
• Troubleshooting becomes more difficult
• The HttpFS upgrade, as well as the ZooKeeper upgrade, can be separated from this upgrade
• Imagine what will happen in production and test as much as possible in advance
• Consider the differences between dev/staging and prod
• There is a limit to what one person can imagine. Ask many colleagues!
HDFS Future works
• Router-based Federation
• Rebalance DNs/namespaces between subclusters well
• Considering multiple subclusters, non-split DNs (or even a hybrid), HFR, and so on
• Erasure Coding in production
• Internally backporting the EC feature to the old HDFS client; the work is mostly finished
• Try new low-pause-time GC algorithms
• ZGC, Shenandoah
We are hiring!
https://about.yahoo.co.jp/hr/job-info/role/1247/ (Japanese)
