Scaling Hadoop at LinkedIn
Konstantin V Shvachko
Sr. Staff Software Engineer
Zhe Zhang
Engineering Manager
Erik Krogen
Senior Software Engineer
Continuous Scalability as a Service
LINKEDIN RUNS ITS BIG DATA ANALYTICS ON HADOOP
1
[Growth cycle: More Members → More Social Activity → More Data & Analytics]
Hadoop Infrastructure
ECOSYSTEM OF TOOLS FOR LINKEDIN ANALYTICS
• Hadoop Core
• Storage: HDFS
• Compute: YARN and MapReduce
• Azkaban – workflow scheduler
• Dali – data abstraction and access layer for Hadoop
• ETL: Gobblin
• Dr. Elephant – artificially intelligent, polite, but uncompromising bot
• Presto – distributed SQL engine for interactive analytics
• TensorFlow
2
Growth Spiral
THE INFRASTRUCTURE IS UNDER CONSTANT GROWTH PRESSURE
[Growth cycle: More Data → More Compute → More Nodes]
3
Cluster Growth 2015 - 2017
THE INFRASTRUCTURE IS UNDER CONSTANT GROWTH PRESSURE
4
[Chart: Space Used (PB), Objects (Millions), and Tasks (Millions/Day), Dec 2015 – Dec 2017]
HDFS Cluster
STANDARD HDFS ARCHITECTURE
• HDFS metadata is decoupled from data
• NameNode keeps the directory tree in RAM
• Thousands of DataNodes store data blocks
• HDFS clients request metadata from the active NameNode and stream data to/from DataNodes (see the client sketch below)
[Diagram: Active NameNode, Standby NameNode, and JournalNodes manage the namespace; DataNodes store the blocks]
5
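As a concrete illustration of the bullets above, here is a minimal client sketch using the standard org.apache.hadoop.fs.FileSystem API (the NameNode address and file path are hypothetical): the metadata call is answered by the active NameNode from its in-memory namespace, while the opened stream pulls bytes from DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000"); // hypothetical address
    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/data/events/part-00000");            // hypothetical path

      // Metadata request: served by the active NameNode from RAM.
      FileStatus status = fs.getFileStatus(file);
      System.out.println("File length: " + status.getLen());

      // Data request: open() returns a stream that reads block data from DataNodes.
      try (BufferedReader in =
          new BufferedReader(new InputStreamReader(fs.open(file)))) {
        System.out.println("First line: " + in.readLine());
      }
    }
  }
}
```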
Cluster Heterogeneity & Balancing
Homogeneous Hardware
ELUSIVE STARTING POINT OF A CLUSTER
7
Heterogeneous Hardware
PERIODIC CLUSTER EXPANSION ADDS A VARIETY OF HARDWARE
8
Balancer
MAINTAINS EVEN DISTRIBUTION OF DATA AMONG DATANODES
• DataNodes should be filled uniformly by % used space
• Locality principle: more data => more tasks => more data generated
• Balancer iterates until the balancing target is achieved
• Each iteration moves some blocks from overutilized to underutilized nodes
• Highly multithreaded. Spawns many dispatcher threads. In a thread:
• getBlocks() returns a list of blocks of total size S to move out of sourceDN
• Choose targetDN
• Schedule transfers from sourceDN to targetDN
9
Balancer Optimizations
BALANCER STARTUP CAUSES JOB TIMEOUT
• Problem 1. At the start of each Balancer iteration, all dispatcher threads hit the NameNode at once
• The RPC CallQueue grows => user jobs time out
• Solution. Disperse Balancer calls at startup over a 10-second period, restricting Balancer-to-NameNode RPC calls to 20 per second (see the sketch below)
• Problem 2. Inefficient block iterator in getBlocks()
• The NameNode iterates blocks from a randomly selected startBlock index
• It scanned all blocks from the beginning instead of jumping to the startBlock position
• Fix gives a 4x reduction in getBlocks() execution time, from 40ms down to 9ms
• Result: Balancer overhead on NameNode performance is negligible
HDFS-11384
HDFS-11634
10
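The dispersal fix for Problem 1 can be pictured with a small standalone sketch. This is not the HDFS-11384 patch itself; it only shows how giving each dispatcher thread's first NameNode call a random delay within a 10-second window bounds the initial call rate (with an illustrative 200 threads, roughly 20 calls per second on average).

```java
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DispersedStartup {
  public static void main(String[] args) throws InterruptedException {
    int dispatcherThreads = 200;   // illustrative count of Balancer dispatcher threads
    long windowMillis = 10_000;    // spread the initial calls over 10 seconds
    Random random = new Random();
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(16);

    for (int i = 0; i < dispatcherThreads; i++) {
      final int id = i;
      final long delay = (long) (random.nextDouble() * windowMillis);
      // In the real Balancer this would be the thread's first getBlocks() RPC.
      pool.schedule(
          () -> System.out.printf("t=%dms: thread %d issues its first call%n", delay, id),
          delay, TimeUnit.MILLISECONDS);
    }
    pool.shutdown();
    pool.awaitTermination(windowMillis + 1_000, TimeUnit.MILLISECONDS);
  }
}
```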
Block Report Processing
POP QUIZ
What do Hadoop clusters and particle
colliders have in common?
Hadoop vs. the Large Hadron Collider
12
POP QUIZ
What do Hadoop clusters and particle
colliders have in common?
Hadoop vs. the Large Hadron Collider
Improbable events happen all the time! 13
Optimization of Block Report Processing
DATANODES REPORT BLOCKS TO THE NAMENODE VIA BLOCK REPORTS
• DataNodes send periodic block reports to the NameNode (every 6 hours)
• A block report lists all the block replicas on the DataNode
• Each entry contains the block id, generation stamp, and length of a replica (illustrative shape sketched below)
• Found a rare race condition in processing block reports on the NameNode
• Race with repeated reports from the same node
• The error recovers itself on the next successful report after six hours
• HDFS-10301, fixed the bug and simplified the processing logic
• Block reports are expensive as they hold the global lock for a long time
• Designed segmented block report proposal HDFS-11313
14
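To make the report contents concrete, the list can be pictured as below; ReplicaEntry is a hypothetical record used for illustration only, not Hadoop's actual classes or wire format.

```java
import java.util.List;

public class BlockReportShape {
  // One entry per replica: block id, generation stamp, and length, as listed above.
  record ReplicaEntry(long blockId, long generationStamp, long numBytes) {}

  public static void main(String[] args) {
    List<ReplicaEntry> report = List.of(
        new ReplicaEntry(1073741825L, 1001L, 134_217_728L), // an illustrative full block
        new ReplicaEntry(1073741826L, 1001L, 4_096L));      // a small, partially filled block
    System.out.println("Block report with " + report.size() + " replicas: " + report);
  }
}
```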
Cluster Versioning
Upgrade to Hadoop 2.7
COMMUNITY RELEASE PROCESS
• Motivation: chose 2.7 as the most stable branch compared to 2.8, 2.9, 3.0
• The team contributed the majority of commits to 2.7 since 2.7.3
• Release process
• Worked with the community to lead the release
• Apache Hadoop release v 2.7.4 (Aug 2017)
• Testing, performance benchmarks
• Community testing: automated and manual testing tools
• Apache BigTop integration and testing
• Performance testing with Dynamometer
• Benchmarks: TestDFSIO, Slive, GridMix
16
Upgrade to Hadoop 2.7
INTEGRATION INTO LINKEDIN ENVIRONMENT
• Comprehensive testing plan
• Rolling upgrade
• Balancer
• OrgQueue
• Per-component testing
• Pig, Hive, Spark, Presto
• Azkaban, Dr. Elephant, Gobblin
• Production Jobs
17
• What went wrong?
• DDoS of InGraphs (LinkedIn's metrics system) with new metrics
• Newly introduced metrics were turned ON by default
• Large scale was required to trigger the issue
Satellite Cluster Project
Small File Problem
“Keeping the mice away from the elephants”
19
Small File Problem
• A “small file” is smaller than one block
• Each file requires at least two namespace objects: a block and an inode
• Small files bloat the memory usage of the NameNode and lead to numerous RPC calls to the NameNode
• The block-to-inode ratio is steadily decreasing; it now stands at 1.11
• 90% of our files are small! (rough memory arithmetic is sketched below)
20
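For a rough sense of scale, the arithmetic below assumes a commonly cited rule of thumb of about 150 bytes of NameNode heap per namespace object and an illustrative count of 100 million small files; neither figure comes from this slide (the appendix mentions copying >100M log files).

```java
public class SmallFileCost {
  public static void main(String[] args) {
    long files = 100_000_000L;   // illustrative: ~100M small files
    long objectsPerFile = 2;     // at least one inode + one block per small file
    long bytesPerObject = 150L;  // rough rule-of-thumb heap cost, not a measured value

    long objects = files * objectsPerFile;
    long heapBytes = objects * bytesPerObject;
    System.out.printf("%,d small files -> %,d namespace objects -> ~%.0f GB of NameNode heap%n",
        files, objects, heapBytes / 1e9);
  }
}
```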
Satellite Cluster Project
IDENTIFY THE MICE: SYSTEM LOGS
• Realization: many of these small files
were logs (YARN, MR, Spark…)
• Average size of log files: 100KB!
• Only accessed by framework/system
daemons
• NodeManager
• MapReduce / Spark AppMaster
• MapReduce / Spark History Server
• Dr. Elephant
[Charts: logs account for only 0.07% of data volume but 43.5% of namespace objects]
21
Satellite Cluster Project
GIVE THE MICE THEIR OWN CLUSTER
• Two separate clusters, one with ~100x more nodes than the other
• Operational challenge: how to
bootstrap? > 100M files to copy
• Write new logs to new cluster
• Custom FileSystem presents combined
read view of both clusters’ files
• Log retention policy eventually deletes
all files on old cluster
* Additional details in appendix
22
Satellite Cluster Project
ARCHITECTURE
[Diagram: log files flow from the Primary Cluster to the Satellite Cluster]
• Bulk data transfer stays internal to the primary cluster
• Same namespace capacity (one active NameNode), but much cheaper due to smaller data capacity (fewer DataNodes)
23
Testing: Dynamometer
Dynamometer
• Realistic performance
benchmark & stress test for HDFS
• Open sourced on LinkedIn
GitHub, hope to contribute to
Apache
• Evaluate scalability limits
• Provide confidence before new
feature/config deployment
25
Dynamometer
SIMULATED HDFS CLUSTER RUNS ON YARN
• Real NameNode, fake DataNodes run on ~5% of the hardware
• Replay real traces from production cluster audit logs
[Diagram: the Dynamometer driver launches the real NameNode and many simulated DataNodes in containers on a host YARN cluster, while a workload MapReduce job replays audit-log traces through simulated clients against the NameNode]
26
Namespace Locking
Nonfair Locking
• NameNode uses a global read-write lock which
supports two modes:
• Fair: locks are acquired in FIFO order (HDFS default)
• Nonfair: locks can be acquired out of order
• Nonfair locking disadvantages writes but benefits reads via increased read parallelism (see the sketch below)
• NameNode operations are > 95% reads
28
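The fair/nonfair trade-off maps directly onto java.util.concurrent's ReentrantReadWriteLock, sketched below. This illustrates the two locking modes only; it is not the NameNode's own lock wrapper.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class NamespaceLockSketch {
  // fair = true  : FIFO ordering (the HDFS default described above)
  // fair = false : readers may barge ahead of queued writers, increasing read parallelism
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(/* fair = */ false);

  public void readOp(Runnable op) {
    lock.readLock().lock();      // many readers can hold the lock concurrently
    try { op.run(); } finally { lock.readLock().unlock(); }
  }

  public void writeOp(Runnable op) {
    lock.writeLock().lock();     // exclusive; may wait longer under nonfair mode
    try { op.run(); } finally { lock.writeLock().unlock(); }
  }
}
```

Since more than 95% of NameNode operations are reads, letting readers overlap more aggressively is usually a net win, at the cost of occasionally longer write waits.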
Dynamometer
EXAMPLE: PREDICTION OF FAIR VS NONFAIR NAMENODE LOCKING PERFORMANCE
[Chart: request wait time vs. requests per second under fair and nonfair locking, measured both in Dynamometer and in production]
Dynamometer predictions closely mirror observations post-deployment
29
Nonfair Locking
IN ACTION
[Chart: NameNode RPCQueueTimeAvgTime over time, annotated at the point where nonfair locking was deployed]
30
Optimal Journaling
Optimal Journaling Device
BENCHMARK METHODOLOGY
• Persistent state of NameNode
• Latest checkpoint FSImage. Periodic 8 hour intervals
• Journal EditsLog – latest updates to the namespace
• Journal IO workload (a standalone probe of this pattern is sketched below)
• Sequential writes of 1KB chunks, no reads
• Flush and sync to disk after every write
• NNThroughputBenchmark tuned for efficient use of CPU
• Ensure throughput is bottlenecked mostly by IOs while NameNode is journaling
• Operations: mkdir, create, rename
32
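The journal I/O pattern above (sequential 1 KB appends with a flush and sync after every write) can be reproduced outside of Hadoop with a small standalone probe, which is handy for comparing candidate journal devices; the file path and write count below are arbitrary.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class JournalDeviceProbe {
  public static void main(String[] args) throws IOException {
    Path edits = Path.of(args.length > 0 ? args[0] : "/tmp/edits-probe.bin"); // point at the device under test
    ByteBuffer record = ByteBuffer.allocate(1024); // 1 KB chunk; contents are irrelevant
    int writes = 10_000;

    long start = System.nanoTime();
    try (FileChannel ch = FileChannel.open(edits,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
      for (int i = 0; i < writes; i++) {
        record.rewind();
        ch.write(record);   // sequential append
        ch.force(false);    // flush and sync to disk after every write
      }
    }
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%d synced 1KB writes in %.2f s (%.0f ops/sec)%n", writes, secs, writes / secs);
  }
}
```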
Optimal Journaling Device
HARDWARE RAID CONTROLLER WITH MEMORY CACHE
33
[Chart: journaling throughput in ops/sec for mkdirs, create, and rename across SATA-HD, SAS-HD, SSD, SW-RAID, and HW-RAID]
Optimal Journaling Device
SUMMARY
• SATA vs SAS vs SSD vs software RAID vs hardware RAID
• SAS is 15-25% better than SATA
• SSD is on par with SATA
• SW-RAID doesn’t improve performance compared to a single SATA drive
• HW-RAID provides 2x performance gain vs SATA drives
34
Open Source Community
Open Source Community
KEEPING INTERNAL BRANCH IN SYNC WITH UPSTREAM
• The commit rule:
• Backport to all upstream branches before internal
• Ensures future upgrades will have all historical changes
• Release management: 2.7.4, 2.7.5, 2.7.6
• Open sourcing of OrgQueue with Hortonworks
• 2.9+ GPU support with Microsoft
• StandbyNode reads with Uber & PayPal
36
Next Steps
What’s Next?
2X GROWTH IN 2018 IS IMMINENT
• Stage I. Consistent reads from standby
• Optimize for reads: 95% of all operations
• Consistent reading is a coordination problem
• Stage II. Eliminate NameNode’s global lock
• Implement namespace as a KV-store
• Stage III. Partitioned namespace
• Linear scaling to accommodate increased RPC load
HDFS-12943
HDFS-10419
38
Consistent Reads from Standby Node
ARCHITECTURE
[Diagram: clients send read/write traffic to the Active NameNode and read-only traffic to the Standby NameNodes; edits propagate through JournalNodes; DataNodes store the blocks]
• Stale Read Problem
• Standby Node syncs edits from Active
NameNode via Quorum Journal Manager
• Standby state is always behind the Active
• Consistent Reads Requirements
• Read your own writes (conceptual sketch below)
• Third-party communication
39
HDFS-12943
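A conceptual sketch of the "read your own writes" requirement, using hypothetical classes rather than the HDFS-12943 implementation: the client remembers the last transaction ID it observed from the Active, and the Standby delays a read until its applied edits have reached at least that ID.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ReadYourWritesSketch {
  static class StandbyState {
    final AtomicLong lastAppliedTxId = new AtomicLong();

    /** Hold a read (here: spin briefly) until edits up to requiredTxId are applied. */
    void awaitCaughtUp(long requiredTxId) throws InterruptedException {
      while (lastAppliedTxId.get() < requiredTxId) {
        Thread.sleep(1); // a real implementation would wait/notify with a timeout
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    StandbyState standby = new StandbyState();
    long clientLastSeenTxId = 42; // returned to the client by its last write on the Active

    // Edit tailing on the Standby, simulated by another thread.
    new Thread(() -> {
      for (long tx = 1; tx <= 100; tx++) standby.lastAppliedTxId.set(tx);
    }).start();

    standby.awaitCaughtUp(clientLastSeenTxId); // the read waits until the state is fresh enough
    System.out.println("Safe to serve reads at txid >= " + clientLastSeenTxId);
  }
}
```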
Thank You!
Konstantin V Shvachko – Sr. Staff Software Engineer
Zhe Zhang – Engineering Manager
Erik Krogen – Senior Software Engineer
40
Appendix
Satellite Cluster Project
IMPLEMENTATION
mapreduce.job.hdfs-servers = hdfs://li-satellite-02.grid.linkedin.com:9000
mapreduce.jobhistory.intermediate-done-dir = hdfs://li-satellite-02.grid.linkedin.com:9000/system/mr-history/intermediate
mapreduce.jobhistory.done-dir = hdfs://li-satellite-02.grid.linkedin.com:9000/system/mr-history/finished
yarn.nodemanager.remote-app-log-dir = hdfs://li-satellite-02.grid.linkedin.com:9000/system/app-logs
spark.eventLog.dir = hdfs://li-satellite-02.grid.linkedin.com:9000/system/spark-history
• Two clusters li-satellite-01 (thousands of nodes), li-satellite-02 (32 nodes)
• FailoverFileSystem – transparent view of the /system directory during migration
• Access li-satellite-02 first; if the file is not there, go to li-satellite-01. listStatus() merges results from both (see the sketch below)
• Configuration change for Hadoop-owned services/frameworks:
• NodeManager, MapReduce / Spark AppMaster & History Server, Azkaban, Dr. Elephant
42
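A hedged sketch of the FailoverFileSystem behavior described above, written as plain composition over two org.apache.hadoop.fs.FileSystem instances rather than LinkedIn's actual class: reads try the satellite cluster first and fall back to the primary, and listStatus() merges results from both.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FailoverReadView {
  private final FileSystem satellite; // e.g. li-satellite-02
  private final FileSystem primary;   // e.g. li-satellite-01

  public FailoverReadView(FileSystem satellite, FileSystem primary) {
    this.satellite = satellite;
    this.primary = primary;
  }

  /** Open on the satellite cluster first; fall back to the primary if the file is absent. */
  public FSDataInputStream open(Path p) throws IOException {
    try {
      return satellite.open(p);
    } catch (FileNotFoundException e) {
      return primary.open(p);
    }
  }

  /** Merge directory listings from both clusters (satellite entries first). */
  public List<FileStatus> listStatus(Path p) throws IOException {
    List<FileStatus> merged = new ArrayList<>();
    try {
      for (FileStatus s : satellite.listStatus(p)) merged.add(s);
    } catch (FileNotFoundException ignored) { /* path may exist only on the primary */ }
    try {
      for (FileStatus s : primary.listStatus(p)) merged.add(s);
    } catch (FileNotFoundException ignored) { /* path may exist only on the satellite */ }
    return merged;
  }
}
```

Because new history and log files were written only to the satellite cluster, the fallback path naturally drained as the log retention period expired on the primary.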
Satellite Cluster Project
CHALLENGES
• Copy >100 million existing files (< 100TB) from li-satellite-01 to li-satellite-02
• A bulk copy would take > 12 hours, and it saturated the NameNode before copying a single byte
• Solution: FailoverFileSystem created new history files on li-satellite-02
• Removed /system from li-satellite-01 after log retention period passed
• Very large block reports
• 32 DataNodes, each holding 9 million block replicas (vs 200K on a normal cluster)
• Reports take very long to process on the NameNode, causing long lock hold times
• Solution: Virtual partitioning of each drive into 10 storages
• With 6 drives, block report split into 60 storages per DataNode
• 150K blocks per storage – good report size
43