Apache Hadoop YARN:
State of the union
Wangda Tan, Billie Rinaldi
@ Hortonworks
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Speaker intro
 Wangda Tan: Apache Hadoop PMC member, mostly focused on GPU / deep learning on
YARN; has worked on scheduler features such as node labels, preemption, etc.
 Billie Rinaldi: Apache Hadoop committer, PMC member of various other Apache top-level
and incubating projects, currently focusing on long-running services
and Docker containers on Apache Hadoop YARN
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
 Introduction
 Past
 State of the Union
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Multi-colored YARN
 Multi-colored YARN
– Apps
– Long running services
 It’s all about data!
 Layers that enable applications and
higher order frameworks that interact
with data
https://www.flickr.com/photos/happyskrappy/15699919424
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Categories of recent initiatives: containerization / containers, GPUs / FPGAs,
more powerful scheduling, much faster scheduling, scale, SLAs, usability,
and service workloads.
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hadoop Compute Platform – Today and Tomorrow
Platform services: storage, resource management, service discovery, cluster management,
monitoring / alerts, security, governance.
Frameworks on top: MR, Tez, Spark, Hive / Pig, LLAP, Flink, REEF.
Long-running services: Kafka, Storm, HBase, Solr (e.g. as part of an IoT assembly).
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Past: A quick history
Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
A brief Timeline: Pre GA
• Sub-project of Apache Hadoop
• Alphas and betas
– In production at several large sites for MapReduce already by that time
June–July 2010 · August 2011 · May 2012 · August 2013
Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
A brief Timeline: GA Releases 1/3
2.2 (15 October 2013): 1st GA, MR binary compatibility, YARN API cleanup, testing!
2.3 (24 February 2014): 1st post-GA release, bug fixes, alpha features
2.4 (07 April 2014): RM fail-over, CS preemption, Timeline Service V1, writable REST APIs
2.5 (11 August 2014): Timeline Service V1 security
2.6 (18 November 2014): rolling upgrades, Docker, node labels
2.7 (21 April 2015): move to JDK 7+, pluggable YARN authentication
The most essential requirements for enterprise usage
Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
A brief Timeline: GA Releases 2/3
2.7.2 (25 January 2016), 2.7.3 (25 August 2016), 2.6.5 (18 October 2016), 2.7.4 (04 August 2017):
maintenance releases
2.8.0 (22 March 2017): application priority, reservations, node label improvements
2.8.1 (08 June 2017), 2.8.2 (03 Oct 2017), 2.8.3 (12 Dec 2017): follow-up stabilization releases
Enterprise consumption; stabilization needed
Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
A brief Timeline: GA Releases 3/3
3.0.0-alpha1–4 (Sep 2016 – Aug 2017), 3.0.0-beta1 (03 Oct 2017), 3.0.0 GA (13 Dec 2017), 3.0.1 (25 March 2018)
2.9.0 (17 Nov 2017): YARN Federation, opportunistic containers, resource types, new YARN UI, Timeline Service V2
3.1.0 (06 April 2018): GPU / FPGA support, native services, placement constraints
More requirements keep coming (compute-intensive workloads, larger clusters, long-running services)
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hadoop 2.8/2.9
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Application priorities – YARN-1963
• Allocate resources to important apps first.
• Applies within a leaf queue.
• FIFO policy: App 1, App 2, App 3, App 4 are served in arrival order.
• FIFO policy with priorities: apps are served from higher priority to lower priority,
regardless of arrival order (see the configuration sketch below).
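A rough, hedged sketch of how this is typically wired up with the Capacity Scheduler: the property names and CLI flags below follow the upstream documentation for Hadoop 2.8+, but the queue name and application id are made up, so verify against your release.

```bash
# yarn-site.xml:          yarn.cluster.max-application-priority = 10
# capacity-scheduler.xml: yarn.scheduler.capacity.root.myqueue.default-application-priority = 3

# Raise the priority of a running application from the CLI:
yarn application -appId application_1523456789012_0042 -updatePriority 8
```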
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Queue priorities – YARN-5864
 For interactive / SLA-sensitive workloads
 Today
– Resources go to the least-satisfied queue first
 With priorities
– Resources go to the highest-priority queue first (for important workloads); see the sketch below
Example under root:
– Queue A: 20% configured capacity but 5% used capacity → usage = 5/20 = 25%
– Queue B: 80% configured capacity but 8% used capacity → usage = 8/80 = 10%
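For reference, a hedged capacity-scheduler.xml fragment showing how queue priorities might be assigned; the property name follows the Capacity Scheduler documentation for Hadoop 2.9+/3.x, and the queue names are illustrative.

```xml
<!-- Higher value = higher priority; default is 0. The SLA-sensitive queue
     is served before less-satisfied but lower-priority queues. -->
<property>
  <name>yarn.scheduler.capacity.root.interactive.priority</name>
  <value>10</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.priority</name>
  <value>0</value>
</property>
```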
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Reservations – YARN-1051
• “Run my workload tomorrow at 6AM”
• Persistence of the plans with RM failover: YARN-2573
Reservation-based Scheduling: If You’re Late Don’t Blame Us! – Curino et al., 2015
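A sketch of the reservation flow through the ResourceManager REST API; the paths and JSON fields follow the ReservationSystem documentation, but host, queue, ids, and timestamps are illustrative only.

```bash
# 1) Ask the RM for a new reservation id.
curl -X POST http://rm-host:8088/ws/v1/cluster/reservation/new-reservation

# 2) Submit a definition against that id: e.g. 100 containers of <2 GB, 1 vcore>
#    for one hour, starting tomorrow at 6AM (epoch milliseconds).
curl -X POST -H "Content-Type: application/json" \
  http://rm-host:8088/ws/v1/cluster/reservation/submit -d '{
    "queue": "dedicated",
    "reservation-id": "reservation_1523456789012_0001",
    "reservation-definition": {
      "reservation-name": "morning-report",
      "arrival": 1524024000000,
      "deadline": 1524031200000,
      "reservation-requests": {
        "reservation-request-interpreter": 0,
        "reservation-request": [{
          "capability": { "memory": 2048, "vCores": 1 },
          "num-containers": 100,
          "duration": 3600000
        }]
      }
    }
  }'
```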
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hadoop 3.0/3.1
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Looking at the Scale!
 Many sites run clusters with a large number of nodes
– Yahoo!, Twitter, LinkedIn, Microsoft, Alibaba, etc.
 Previously, the largest clusters were 6K–8K nodes
 Now: 40K nodes (federated), 20K nodes (single cluster)
 Roadmap: to 100K and beyond
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Moving towards Global & Fast Scheduling
 Problems
– The one-node-at-a-time allocation cycle can lead to suboptimal decisions
– Several coarse-grained locks
 Recent efforts improved this to
– Look at several nodes at a time
– Fine-grained locks
– Multiple allocator threads
– The YARN scheduler can allocate 3K+ containers per second ≈ 10 million allocations / hour!
– 10x throughput gains with recently added enhancements
– Much better placement decisions
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Resource profiles and custom resource types
 Past
– Supported only memory and CPU
 Now
– A generalized resource vector
– Custom resource types!
 Easier resource requests using profiles (see the configuration sketch below)
NodeManager resource vector: memory, CPU, GPU, FPGA
Profiles (example): Small = 2 GB memory, 4 cores, 0 GPUs; Medium = 4 GB, 8 cores, 0 GPUs;
Large = 16 GB, 16 cores, 4 GPUs
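A hedged sketch of the server-side configuration: custom resource types are declared in resource-types.xml (names per the Hadoop 3.x resource model documentation; verify for your release).

```xml
<!-- resource-types.xml: extend the resource vector beyond memory and CPU -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu,yarn.io/fpga</value>
  </property>
</configuration>
```

Profiles such as small/medium/large can then be defined in a resource-profiles.json mapping each profile name to values like memory-mb, vcores, and yarn.io/gpu, assuming the feature is enabled via yarn.resourcemanager.resource-profiles.enabled.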
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
GPU support on YARN
 Why is isolation needed?
– Multiple processes sharing a single GPU will be:
• Serialized.
• Prone to OOM.
 GPU isolation on YARN:
– Granularity is per GPU device.
– Cgroups / Docker are used to enforce the isolation.
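A hedged configuration sketch, following the Hadoop 3.1 GPU-on-YARN documentation: the NodeManager enables the GPU resource plugin, with LinuxContainerExecutor plus cgroups (or the Docker runtime) assumed for enforcement.

```xml
<!-- yarn-site.xml on the NodeManagers -->
<property>
  <name>yarn.nodemanager.resource-plugins</name>
  <value>yarn.io/gpu</value>
</property>
```

Containers can then request GPUs alongside memory and vcores; for example, the distributed shell demo accepts something like `-container_resources memory-mb=3072,vcores=1,yarn.io/gpu=2` (flag per the upstream docs; verify on your release).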
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
FPGA on YARN!
 FPGA isolation on YARN:
– Granularity is per FPGA device.
– Cgroups are used to enforce the isolation.
 Currently, only the Intel OpenCL SDK for FPGA is supported, but the implementation is
extensible to other FPGA SDKs.
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Better placement strategies (YARN-6592)
 Affinity and anti-affinity
– Affinity example: place Storm containers close to HBase.
– Anti-affinity example: spread HBase RegionServers across different nodes.
A placement-spec sketch follows.
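As a concrete, hedged illustration, the distributed shell demo in Hadoop 3.1 accepts a placement specification; the syntax below follows the upstream placement-constraints documentation, with component names used as allocation tags and everything else illustrative.

```bash
# Illustrative: run 3 "zk" containers with anti-affinity to each other (at most
# one per node) and 5 "hbase" containers with rack affinity to "zk".
yarn org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar hadoop-yarn-applications-distributedshell-*.jar \
  -shell_command sleep -shell_args 600 \
  -placement_spec "zk=3,NOTIN,NODE,zk:hbase=5,IN,RACK,zk"
```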
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
YARN Federation!
 Enables applications to scale to 100s of thousands of nodes
 Federation divides a large (10–100K node) cluster into smaller units called sub-clusters
 Federation negotiates with the sub-clusters’ RMs and provides resources to the application
 Applications can schedule tasks on any node
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Packaging
 Containers
– Lightweight mechanism for packaging and resource isolation
– Popularized and made accessible by Docker
– Can replace VMs in some cases
– Or more accurately, VMs got used in places where they didn’t
need to be
 Native integration ++ in YARN
– Support for “Container Runtimes” in LCE: YARN-3611
– Process runtime
– Docker runtime
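A hedged sketch of what opting into the Docker runtime looks like in practice: the LCE must allow it on the NodeManager side, and each container opts in via environment variables (names per the upstream Docker-on-YARN documentation; image and command are illustrative).

```bash
# NodeManager side (yarn-site.xml):
#   yarn.nodemanager.runtime.linux.allowed-runtimes = default,docker

# Application side: run a distributed shell container inside a Docker image.
yarn org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar hadoop-yarn-applications-distributedshell-*.jar \
  -shell_command "cat /etc/os-release" \
  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
  -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos:7
```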
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Services support
 Application & Services upgrades
– “Do an upgrade of my Spark / HBase apps with minimal impact to end-users”
– YARN-4726
 Simplified discovery of services via DNS mechanisms: YARN-4757
– regionserver-0.hbase-app-3.hadoop.yarn.site
 Placement policies
 Container restart
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Simplified APIs for service definitions
 Applications need simple APIs
 Need to be deployable “easily”
 A simple REST API layer fronting YARN
– YARN-4793 Simplified API layer for services and beyond
 Spawn and manage services (see the spec sketch below)
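To make this concrete, a minimal service spec in the style of the YARN native services API; field names follow the upstream “sleeper” example shipped with Hadoop 3.1, and the values are illustrative.

```json
{
  "name": "sleeper-service",
  "version": "1.0.0",
  "components": [
    {
      "name": "sleeper",
      "number_of_containers": 2,
      "launch_command": "sleep 900000",
      "resource": {
        "cpus": 1,
        "memory": "256"
      }
    }
  ]
}
```

Saved as sleeper.json, such a spec can be launched and flexed with commands along the lines of `yarn app -launch my-sleeper sleeper.json` and `yarn app -flex my-sleeper -component sleeper 3`, or POSTed to the REST API service fronting the RM.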
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Services Framework
 Platform is only as good as the tools
 A native YARN services framework
– YARN-4692
– [Umbrella] Native YARN framework layer for services and
beyond
 Assembly: Supporting a DAG of apps:
– SLIDER-875
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
User experience
 API-based queue management, decentralized (YARN-5734)
 Improved log management (YARN-4904)
 Live application logs
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
User experience
New web UI
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
User experience
New web UI
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Timeline Service
 Application History
– “Where did my containers run?”
– “Why is my application slow?”
– “Is it really slow?”
– “Why is my application failing?”
– “What happened with my application?
Succeeded?”
 Cluster History
– Run analytics on historical apps!
– “User with most resource utilization”
– “Largest application run”
– “Why is my cluster slow?”
– “Why is my cluster down?”
– “What happened in my clusters?”
 Collect and use past data
– To schedule “my application” better
– To do better capacity planning
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Timeline Service 2.0
• Next generation
– Today’s solution helped us understand the space
– Limited scalability and availability
• “Analyzing Hadoop Clusters is becoming a big-data problem”
– Don’t want to throw away the Hadoop application metadata
– Large scale
– Enable near real-time analysis: “Find me the user who is hammering the FileSystem with rogue applications. Now.”
• Timeline data stored in HBase and accessible to queries
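For example, once the timeline reader daemon is running, flow and per-application data can be pulled over REST; the paths follow the ATSv2 documentation, while host, port, cluster, and application id are illustrative.

```bash
# List recent flows for a cluster, then drill into one application's containers.
curl "http://timeline-reader-host:8188/ws/v2/timeline/clusters/my-cluster/flows"
curl "http://timeline-reader-host:8188/ws/v2/timeline/apps/application_1523456789012_0042/entities/YARN_CONTAINER"
```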
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hadoop 3.2 and beyond
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Node Attributes (YARN-3409)
• Node partition vs. node attribute
• Partition:
• A node belongs to exactly one partition
• ACLs
• Shares between queues
• Preemption enforced
• Attribute:
• Used for container placement
• No ACLs / shares on attributes
• First-come, first-served
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Container overcommit (YARN-1011)
 Each node has some allocated but unutilized capacity
 Use that capacity to run opportunistic tasks
 Preempt such tasks when needed
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Auto-spawning of system services
(YARN-8048)
• System services are services required by
YARN that need to be started during
bootstrap.
• For example, YARN ATSv2 needs HBase, so
HBase is a system service of YARN.
• Only admins can configure them
• Started along with the ResourceManager
• Place spec files under the
yarn.service.system-service.dir FS path (layout sketch below)
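A hedged sketch of the expected layout: per YARN-8048, the directory contains sync/ and async/ launch modes, each with per-user subdirectories holding Yarnfile specs that the RM picks up at start. Paths and file names below are illustrative.

```bash
# yarn-site.xml:  yarn.service.system-service.dir = /services/system
hdfs dfs -ls -R /services/system
#   /services/system/sync/yarn-ats/timeline-hbase.yarnfile    # launched synchronously at RM start
#   /services/system/async/admin/cleanup-helper.yarnfile      # launched asynchronously
```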
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Lessons learned running a container cloud
on YARN
https://dataworkssummit.com/berlin-2018/session/lessons-learned-running-a-container-cloud-on-
yarn/
4PM, Room I, Wed April 18th
-- Related Session --
Billie Rinaldi
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Deep learning on YARN: running
distributed Tensorflow, etc. on Hadoop
clusters
https://dataworkssummit.com/berlin-2018/session/deep-learning-on-yarn-running-distributed-
tensorflow-mxnet-caffe-xgboost-on-hadoop-clusters/
2PM, Room II, Wed April 18th
-- Related Session --
Wangda Tan
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
BoF’s: Apache Hadoop – YARN, HDFS
https://dataworkssummit.com/berlin-2018/bofs/#apache-hadoop-8211-yarn-hdfs
Thursday April 19th
-- Related Session --
Editor's Notes
  • #4: For new people, 10%
  • #5: Application centric
  • #6: Categories of recent initiatives
  • #10: Many users are requesting the most necessary features, which is why we released so fast. Many of these features are necessary to run a YARN cluster, such as RM fail-over / HA, etc.
  • #11: 2.6/2.7/2.8 are the versions most production clusters are using, so much feature development remained in the background and much community effort focused on stabilizing features.
  • #12: Again, the most essential YARN features no longer meet users’ requirements; that’s why we released 3 minor releases in 6 months, including features like YARN Federation (larger clusters), GPU / FPGA, etc.
  • #22: Even though TensorFlow provides options to use less GPU memory than the whole device, we cannot enforce this externally.
  • #34: High-level talk on ATSv2; it is a scalable solution compared to v1.5.