Elastic resource scheduling for Netflix's scalable container cloud
Sharma Podila, Andrew Spyker, Tomasz Bak
Feb 7th 2017
Topics
● Motivations for containers on AWS EC2
● Scheduling using Apache Mesos
● Fenzo deep dive
● Future plans
Containers add to our VM infrastructure
Already, in VMs, we have a microservice-driven, cloud-native, CI/CD DevOps-enabled, resilient, elastically scalable environment.
Containers Provide Innovation Velocity
● Iterative local development, deploy when ready
● Manage app and dependencies easily and completely
● Simpler way to express resources, let system manage
Sampling of container usage (service and batch)
● Media Encoding
● Digital Watermarking
● NodeJS UI Services
● Operations and General
● Stream Processing
● Reporting
Sampling of realized container benefits
● Media Encoding: encoding research development time
○ VM platform to container platform: 1 month vs. 1 week
● Continuous Integration Testing
○ Build all Netflix codebases in hours
○ Saves hundreds of development hours of debugging
● Netflix API Re-architecture using NodeJS
○ Focus returns to app development
○ Provided reliable smaller instances
○ Simplifies, speeds test and deployment
Scheduling use cases

Reactive stream processing: Mantis
[Diagram: events flow from the Zuul and API clusters into Mantis stream processing jobs such as anomaly detection]
A cloud native service:
● Configurable message delivery guarantees
● Heterogeneous workloads
○ Real-time dashboarding, alerting
○ Anomaly detection, metric generation
○ Interactive exploration of streaming data
Current Mantis usage
● At peak: 2,300 m3.2xlarge EC2 instances
● Peak of 900 concurrent jobs
● Peak of 5,200 concurrent containers
○ Trough of 4,000 containers
○ Job sizes range from 1 to 500 containers
● Mix of perpetual and interactive exploratory jobs
● Peak of 13 million events / sec
Container deployment: Titus
[Diagram: Titus Job Control in an EC2 VPC; VMs host app and batch containers on the cloud platform (metrics, IPC, health), integrating with Eureka, Edda, and Atlas & Insight]
Current Titus usage
#Containers (tasks) for the week of 11/7 in one of the regions
● Peak of ~1,800 instances
○ Mix of m4.4xl, r3.8xl, p2.8xl
○ ~800 instances at trough
● Mix of batch, stream processing, and some microservices
Core architectural components
[Diagram: layered stack with AWS EC2 at the bottom, Apache Mesos above it, and the Titus/Mantis framework (Fenzo plus batch and service job managers) on top]
Fenzo at https://github.com/Netflix/Fenzo
Apache Mesos at http://mesos.apache.org/
Scheduling using Apache Mesos
Mesos Architecture
[Diagram: Mesos architecture with masters, agents, and framework schedulers]
Motivation for a new Mesos scheduler
● Cloud native (cluster autoscaling)
● Customizable task placement optimizations
○ Mix of service, batch, and stream topologies
What does a Mesos scheduler do?
● API for users to interact
● Mesos interaction via the driver (see the sketch below)
● Compute resource assignments for tasks
○ NetflixOSS Fenzo: https://github.com/Netflix/Fenzo
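Concretely, on the Mesos side the driver delivers resource offers to the framework's callback. Below is a minimal sketch of that intake path, assuming Fenzo's VMLeaseObject adapter for Mesos offers; the OfferIntake class and its queue are illustrative, not Titus code.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.mesos.Protos;

import com.netflix.fenzo.VirtualMachineLease;
import com.netflix.fenzo.plugins.VMLeaseObject;

// Offer-intake half of a Mesos scheduler: offers arriving via the driver's
// resourceOffers() callback are wrapped as Fenzo leases and queued for the
// scheduling loop to consume.
public class OfferIntake {
    private final BlockingQueue<VirtualMachineLease> leaseQueue =
            new LinkedBlockingQueue<>();

    // Call this from org.apache.mesos.Scheduler#resourceOffers(driver, offers).
    public void onOffers(List<Protos.Offer> offers) {
        for (Protos.Offer offer : offers) {
            leaseQueue.offer(new VMLeaseObject(offer)); // Fenzo's Mesos offer adapter
        }
    }

    public BlockingQueue<VirtualMachineLease> leases() {
        return leaseQueue;
    }
}
```

The scheduling loop then drains this queue and passes the accumulated leases to Fenzo together with the pending tasks.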
Fenzo deep dive
Scheduling optimizations
Real world trade-offs: speed (first fit assignment) vs. accuracy (optimal assignment)
[Diagram: balancing the urgency of pending tasks against the fitness of assignments]

Scheduling problem
N tasks to assign from M possible agents
Scheduling optimizations: resource assignments serve multiple stakeholders

DC/Cloud operator:
● Bin packing
○ By resource usage
○ By job types
● Ease deployment of new agent AMIs
● Ease server maintenance and upgrades

Application owner:
● Task locality, anti-locality (noisy neighbors?, etc.)
● Resource affinity
● Task balancing across racks/AZs/hosts

Cost:
● Save cloud footprint costs
● Right instance types
● Save power, cooling costs
● Does everything need to run right away?

Security:
● Security aspects of multi-tenant applications on a host

Net effect: proceed quickly in the generally right direction, adapting to changes.
Fenzo goals
● Extensible
● Cloud native
● Ease of experimentation
● Scheduling decisions visibility
Fenzo scheduling strategy
For each (ordered) task:
    On each available host:
        Validate hard constraints
        Score fitness and soft constraints
    Until score is good enough, and a minimum #hosts evaluated
    Pick the host with the highest score
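In code, this strategy is exercised through Fenzo's TaskScheduler. A minimal sketch of building one and running an iteration, with option names following Fenzo's public builder API; the decline action, fitness calculator choice, and 0.9 threshold are illustrative:

```java
import java.util.List;

import com.netflix.fenzo.SchedulingResult;
import com.netflix.fenzo.TaskRequest;
import com.netflix.fenzo.TaskScheduler;
import com.netflix.fenzo.VirtualMachineLease;
import com.netflix.fenzo.plugins.BinPackingFitnessCalculators;

public class SchedulingLoop {
    private final TaskScheduler scheduler = new TaskScheduler.Builder()
            .withLeaseOfferExpirySecs(10)                // hold unused offers only briefly
            .withLeaseRejectAction(lease ->
                    System.out.println("declining offer on " + lease.hostname()))
            .withFitnessCalculator(BinPackingFitnessCalculators.cpuMemBinPacker)
            .withFitnessGoodEnoughFunction(f -> f > 0.9) // stop early on a good score
            .build();

    // One iteration: assign pending tasks against newly received leases.
    public SchedulingResult scheduleOnce(List<? extends TaskRequest> pendingTasks,
                                         List<VirtualMachineLease> newLeases) {
        return scheduler.scheduleOnce(pendingTasks, newLeases);
    }
}
```

The withFitnessGoodEnoughFunction hook is what lets Fenzo stop evaluating hosts once one scores well enough, trading placement optimality for speed.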
Experimentation with Fenzo
● Abstractions of tasks and servers (VMs)
● Create various strategies with custom fitness functions and constraints
○ For example, dynamic task anti-locality
● “Good enough” can be dynamic
○ Based on pending task set size, task type, etc.
● Ordering of servers for allocation based on task type
Experimentation with Fenzo
[Charts: task runtime bin packing sample results and resource bin packing sample results]
Fitness functions vs. constraints
● Fitness: site policies
○ Bin packing for utilization, reduce fragmentation
○ Segregate hosts by task types, e.g., service vs. batch
● Constraints: user preferences
○ Resource affinity
○ Task locality
○ Balance tasks across racks or availability zones
Fitness evaluation
● Degree of fitness, score of 0.0 - 1.0
● Composable
○ Multiple weighted fitness functions
● Extensible
○ Combine existing ones with custom plugins
CPU bin packing fitness function
fitness = usedCPUs / totalCPUs
Fitness for Host1-Host5: 0.25, 0.5, 0.75, 1.0, 0.0
The host with the highest score, Host4 (fitness 1.0), is selected ✔
Current fitness evaluator in Titus
Combines resource request bin packing with task type bin packing:
resBinpack = (cpuFit + memFit + networkFit) / 3.0
taskTypePack = numSameType / totTasks
fitness = resBinpack * 0.4 + taskTypePack * 0.6
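As a sketch, such a combined evaluator could be written as a Fenzo fitness plugin like the one below. The resource accounting is simplified, and treating TaskRequest.taskGroupName() as the "task type" is an assumption made for illustration:

```java
import com.netflix.fenzo.TaskAssignmentResult;
import com.netflix.fenzo.TaskRequest;
import com.netflix.fenzo.TaskTrackerState;
import com.netflix.fenzo.VMTaskFitnessCalculator;
import com.netflix.fenzo.VirtualMachineCurrentState;
import com.netflix.fenzo.VirtualMachineLease;

public class CombinedBinPackingFitness implements VMTaskFitnessCalculator {
    @Override
    public String getName() {
        return "combinedBinPacker";
    }

    @Override
    public double calculateFitness(TaskRequest request,
                                   VirtualMachineCurrentState vm,
                                   TaskTrackerState trackerState) {
        double usedCpu = 0, usedMem = 0, usedNet = 0;
        int sameType = 0, totTasks = 0;
        // Account for tasks already running plus tasks assigned earlier in this round.
        for (TaskRequest t : vm.getRunningTasks()) {
            usedCpu += t.getCPUs(); usedMem += t.getMemory(); usedNet += t.getNetworkMbps();
            totTasks++;
            if (t.taskGroupName().equals(request.taskGroupName())) sameType++;
        }
        for (TaskAssignmentResult r : vm.getTasksCurrentlyAssigned()) {
            TaskRequest t = r.getRequest();
            usedCpu += t.getCPUs(); usedMem += t.getMemory(); usedNet += t.getNetworkMbps();
            totTasks++;
            if (t.taskGroupName().equals(request.taskGroupName())) sameType++;
        }
        // Host totals = already used + still available (hard constraints have
        // already ensured the request fits, so denominators are non-zero here).
        VirtualMachineLease avail = vm.getCurrAvailableResources();
        double cpuFit = (usedCpu + request.getCPUs()) / (usedCpu + avail.cpuCores());
        double memFit = (usedMem + request.getMemory()) / (usedMem + avail.memoryMB());
        double networkFit = (usedNet + request.getNetworkMbps()) / (usedNet + avail.networkMbps());
        double resBinpack = (cpuFit + memFit + networkFit) / 3.0;
        double taskTypePack = (sameType + 1.0) / (totTasks + 1.0); // include this task
        return resBinpack * 0.4 + taskTypePack * 0.6;
    }
}
```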
Fenzo constraints
● Common constraints built-in
○ Host attribute value
○ Host with unique attribute value
○ Balance across hosts’ unique attribute value
● Can be used as “soft” or “hard” constraint
○ Soft evaluates to 0.0 - 1.0
○ Hard evaluates to true/false
● Additional custom plugins
○ Global constraint to send only GPU-requiring tasks to GPU hosts
○ Global constraint to limit EC2 instance types to certain tasks
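For example, a built-in constraint can be applied in its hard form or wrapped as a soft one. A minimal sketch assuming Fenzo's plugin classes; the co-task lookup and the availabilityZone attribute name are hypothetical placeholders:

```java
import java.util.Collections;
import java.util.Set;

import com.netflix.fenzo.AsSoftConstraint;
import com.netflix.fenzo.ConstraintEvaluator;
import com.netflix.fenzo.VMTaskFitnessCalculator;
import com.netflix.fenzo.functions.Func1;
import com.netflix.fenzo.plugins.UniqueHostAttrConstraint;

public class ZoneSpreadConstraints {
    // Hypothetical: look up the ids of a task's sibling tasks in the same job.
    private static final Func1<String, Set<String>> CO_TASKS_OF_JOB =
            taskId -> Collections.emptySet();

    // Hard form: reject a host whose zone attribute value already hosts a sibling task.
    public static ConstraintEvaluator hardZoneSpread() {
        return new UniqueHostAttrConstraint(CO_TASKS_OF_JOB, "availabilityZone");
    }

    // Soft form: the same rule contributes a 0.0 - 1.0 score instead of a veto.
    public static VMTaskFitnessCalculator softZoneSpread() {
        return AsSoftConstraint.get(hardZoneSpread());
    }
}
```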
Fenzo supported resources
● CPU
● Memory
● Disk
● Ports
● Network bandwidth
● Scalar (used for GPU)
● Security groups and IP per container
Why is a task failing to launch?
Fenzo cluster autoscaling
[Diagram: tasks spread thinly across Hosts 1-4 vs. packed onto Hosts 1-2, leaving Hosts 3-4 idle and eligible for termination]
● Threshold based
● Shortfall analysis based
Autoscaling multiple agent clusters
Grouping agents by instance type (e.g., m4.4xlarge and r3.8xlarge clusters) lets Titus autoscale each cluster independently, with its own min, desired, and max size.
Threshold based autoscaling
● Set up rules per agent attribute value
● Sample:
[Chart: #idle hosts over time; dropping below min triggers a scale up, exceeding max triggers a scale down]

Cluster Name    Min Idle    Max Idle    Cooldown Secs
MemosyClstr     2           5           360
ComputeClstr    5           10          300
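As a sketch, the ComputeClstr row above could be expressed as a Fenzo autoscale rule; the method set follows Fenzo's AutoScaleRule interface as documented, and the "too small" thresholds are illustrative:

```java
import com.netflix.fenzo.AutoScaleRule;
import com.netflix.fenzo.VirtualMachineLease;

public class ComputeClusterRule implements AutoScaleRule {
    @Override
    public String getRuleName() {
        return "ComputeClstr"; // matches the agent attribute value for this cluster
    }

    @Override
    public int getMinIdleHostsToKeep() {
        return 5; // fewer idle hosts than this triggers a scale up
    }

    @Override
    public int getMaxIdleHostsToKeep() {
        return 10; // more idle hosts than this triggers a scale down
    }

    @Override
    public long getCoolDownSecs() {
        return 300; // wait between successive scaling actions
    }

    @Override
    public boolean idleMachineTooSmall(VirtualMachineLease lease) {
        // Don't count hosts too fragmented to fit a typical task as "idle".
        return lease.cpuCores() < 1.0 || lease.memoryMB() < 1024.0;
    }
}
```

Rules like this are registered on the TaskScheduler builder and matched against an agent attribute value, with the resulting scale up/down actions delivered to an autoscaler callback.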
Shortfall analysis based scale up
● Rule-based scale up has a cool down period
○ What if there’s a surge of incoming requests?
● Pending requests trigger shortfall analysis
○ Scale up happens regardless of cool down period
○ Remembers which tasks have already been covered
● Shortcoming: scale can be too aggressive for short periods of time
Capacity guarantees
● Guarantee capacity for timely job starts
○ Mesos supports quotas, but they are inadequate at this time
● Generally, optimize throughput for batch jobs and start latency for service jobs
● Categorize by expected behavior
○ For example, some service style jobs may be less important
● Critical versus Flex (flexible) scheduling requirements
Capacity guarantees
[Diagram: two ways to realize Critical and Flex tiers: separate Quotas per tier vs. Priorities, a resource allocation order placing Critical ahead of Flex]
Capacity guarantees: hybrid view
[Diagram: resource allocation order with Critical tier apps AppC1 … AppCN ahead of Flex tier apps AppF1 … AppFN]
Capacity guarantees: hybrid view
Tier Capacity = SUM(App1-cap + App2-cap + … + AppN-cap) + BUFFER
BUFFER:
● Accommodates some new or ad hoc jobs with no guarantees
● Red-black pushes of apps temporarily double app capacity
Capacity guarantees: hybrid view
● Fenzo supports multi-tiered task queues
● Can have an arbitrary number of tiers (e.g., Tier 0, Tier 1)
● Per-tier DRF across multiple queues (see the sketch below)
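DRF (Dominant Resource Fairness) picks the next queue by its dominant share: the largest fraction of any single resource the queue consumes. A small self-contained illustration of that ordering (plain Java, not Fenzo's queue API; totals and usage numbers are made up):

```java
import java.util.Comparator;
import java.util.List;

// Illustration of Dominant Resource Fairness ordering: the next allocation
// goes to the queue whose dominant share (max over resources of used/total)
// is currently smallest.
public class DrfOrdering {
    static final double TOTAL_CPUS = 1000, TOTAL_MEM_GB = 4000;

    record QueueUsage(String name, double cpus, double memGb) {
        double dominantShare() {
            return Math.max(cpus / TOTAL_CPUS, memGb / TOTAL_MEM_GB);
        }
    }

    public static void main(String[] args) {
        List<QueueUsage> queues = List.of(
                new QueueUsage("teamA", 300, 400),  // dominant share 0.30 (CPU)
                new QueueUsage("teamB", 100, 800)); // dominant share 0.20 (memory)
        queues.stream()
                .sorted(Comparator.comparingDouble(QueueUsage::dominantShare))
                .forEach(q -> System.out.printf("%s -> %.2f%n", q.name(), q.dominantShare()));
    }
}
```

Here teamB's dominant share (0.20, memory) is lower than teamA's (0.30, CPU), so teamB's queue is served first.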
Sizing clusters for capacity guarantees
Tier 0: used capacity plus idle capacity; autoscaled between the cluster min size (the guaranteed capacity) and the cluster max size
Tier 1: used capacity with idle size kept near zero; autoscaled between the cluster desired size and the cluster max size
Netflix container execution values
● Consistent cloud infrastructure with VMs
○ Virtualize and deeply re-use AWS features
● User and operator tooling common to VMs
○ IPC and service discovery, telemetry and monitoring
○ Spinnaker integration for CI/CD
● Unique Features
○ Deep Amazon and Netflix infrastructure integration
○ VPC IP per container
○ Advanced security (sec groups, IAM Roles)
Elastic Network Interfaces (ENI)
[Diagram: an AWS EC2 instance with ENI0, ENI1, and ENI2, each carrying four IPs (IP0 … IP11)]
● Each EC2 instance in VPC has 2 or more ENIs
● Each ENI can have 2 or more IPs
● Security Groups are set on the ENI
Network bandwidth isolation
● Each container gets an IP on one of the ENIs
● Linux tc policies are applied on the virtual Ethernet device, for both incoming and outgoing traffic
● Bandwidth is limited to the requested value; no bursting into unused bandwidth
GPU Enablement
Personalization and recommendations:
● Deep learning with neural nets / mini-batch
● Makes model training infrastructure self-service
Executor takes the scheduler's resource definition:
● Maps p2.8xl GPUs using nvidia-docker-plugin
● Mounts drivers and devices into the container
Ongoing and future scheduling work
● Fine grain capacity guarantees
○ DRF adapted to elastic clusters
○ Preemptions to improve resource usage efficiency
○ Hierarchical sharing policies via h-DRF
○ Leveraging the “internal spot market”, aka the trough
● Onboarding new applications
○ Scale continues to grow
Elastic resource scheduling for Netflix's scalable container cloud
Sharma Podila, Andrew Spyker, Tomasz Bak
@podila @aspyker @tomaszbak1974
Questions?