Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud
Coburn Watson
Manager, Cloud Performance, Netflix
Surge '13

Netflix, Inc.
• World's leading internet television network
• ~38 million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US Internet traffic at night
• Recent Notables
• Increased Originals catalog
• Large open source contribution
• OpenConnect (homegrown CDN)

About Me
• Manage Cloud Performance Engineering Team
• Sub-team of Cloud Solutions Organization
• Focus on performance since 2000
• Large-scale billing applications, eCommerce, datacenter mgmt., etc.
• Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
• Passion for tackling performance at cloud-scale
• Looking for great performance engineers
• cwatson@netflix.com

Freedom and Responsibility
• Culture deck: a great read
• Good performers: 2x, Top performers: 10x
• What engineers dislike
• cumbersome processes
• deployment inefficiency
• restricted access
• restricted technical freedom
• lack of trust
• If removed…maximize:
• Engineering velocity
• Engineer satisfaction

Maximizing: Engineering Velocity

How
• Implementation freedom
• SCM, libraries, language
• That said, platform benefits exist
• Deployment freedom
• Service team owns
• push schedule, functionality, performance
• operational activities (being paged)
• On-demand cloud capacity
• Thousands of instances at the push of a button

Rapid Deployment?
Impossible..
3-6 Months?

Rapid (Cloud) Deployment
3-5 Minutes

BaseAMI
• Supply the foundation
• Monitoring, Java, Apache, Tomcat, etc.
• Open source project: Aminator

Pushing Code: Red-Black
• Gracefully roll code in, or out, of production
• Asgard is our AWS configuration mgmt. tool
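
The red-black idea can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical in-memory model of two auto scaling groups; `Asg`, `red_black_push`, and `is_healthy` are made-up names and not Asgard's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Asg:
    """Hypothetical stand-in for an auto scaling group behind a load balancer."""
    name: str
    taking_traffic: bool = False


def red_black_push(old: Asg, new: Asg, is_healthy: Callable[[Asg], bool]) -> Asg:
    """Shift traffic to `new`; fall back to `old` if the health check fails.

    The old group is only taken out of traffic, never destroyed here, so
    rolling back is just a matter of re-enabling it.
    """
    new.taking_traffic = True       # bring the new code into rotation
    old.taking_traffic = False      # stop routing to the old code, keep it warm
    if is_healthy(new):
        return new
    old.taking_traffic = True       # roll back: re-enable the old group
    new.taking_traffic = False
    return old
```

Keeping the old group running but idle is what makes rolling code out as cheap as rolling it in.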

Not all Roses
Compounded risks with increased velocity
Risks: Decreased Reliability, Performance, and Scalability

Goal: CI (Continuous Improvement)

Fear (Revere) the Monkeys
• Simulate
• Latency
• Errors
• Initiate
• Instance Termination
• Availability Zone Failure
• Identify
• Configuration Drift
… in Test and Production
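
Latency and error simulation can be pictured as a wrapper around a service call. The sketch below is only an illustration of that idea, not how the Netflix monkeys are implemented; `chaos` and its parameters are hypothetical.

```python
import random
import time
from functools import wraps


def chaos(latency_s=0.0, error_rate=0.0, rng=random.random, sleep=time.sleep):
    """Wrap a call so that it sometimes fails or responds slowly.

    All parameter names are illustrative. `rng` and `sleep` are injectable
    so the behaviour can be exercised deterministically in tests.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < error_rate:
                raise RuntimeError("injected failure")
            if latency_s > 0:
                sleep(latency_s)    # simulated slow dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

A caller that copes with `@chaos(latency_s=0.5, error_rate=0.1)` on a dependency is more likely to cope when the real dependency degrades.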

Tracking Change: Chronos
• Aggregate Significant Events *
• Current Sources:
• Pushes (Asgard)
• Production Change Requests (JIRA)
• AWS Notifications
• Dynamic Property Changes
• ASG Scaling Events
• Implementation
• Simple REST-service; customized adapters
* - “can disrupt production service”
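
The shape of such an event aggregator can be sketched as follows. This is a guess at the general pattern (one adapter per source, normalised into a common record) and not Chronos itself; `Event`, `EventLog`, `from_asgard_push`, and the raw field names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    source: str         # e.g. "asgard", "jira", "aws"
    service: str
    summary: str
    timestamp: datetime


def from_asgard_push(raw: Dict[str, str]) -> Event:
    """Hypothetical adapter; the keys expected in `raw` are made up."""
    return Event(
        source="asgard",
        service=raw["app"],
        summary=f"push {raw['asg']}",
        timestamp=datetime.fromisoformat(raw["time"]),
    )


class EventLog:
    """In-memory store standing in for the REST service's backing storage."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def record(self, event: Event) -> None:
        self._events.append(event)

    def between(self, start: datetime, end: datetime) -> List[Event]:
        """Events in [start, end], oldest first, to line up against an incident."""
        hits = [e for e in self._events if start <= e.timestamp <= end]
        return sorted(hits, key=lambda e: e.timestamp)
```

The useful query is the time-window one: given when a problem started, list everything that changed shortly before.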

Automated Canary Analysis
• Identify regression between new and existing code
• Point ACA to baseline (prod) and canary ASG
• Typically analyze an hour's worth of time series data
• Compare ratio of averages between canary and baseline
• Evaluate range and noise; determine quality of signal
• Bucket: Hot, Cold, Noisy, or OK
• Multiple classifiers available
• Multiple metric collections (e.g. hand-picked by service, general)
• Rollup
• Constrained: along metric dimensions
• Final: Score the canary
• Implementation: R-based analysis
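
The ratio-of-averages step and the final score can be sketched as below. The slides say the real analysis is R-based; this Python version is only an illustration, and the thresholds (`tolerance`, `max_noise`), the noise test, and the scoring rule are assumptions rather than the actual classifiers.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence, Tuple


def classify(canary: Sequence[float], baseline: Sequence[float],
             tolerance: float = 0.2, max_noise: float = 0.5) -> str:
    """Bucket one metric as HOT, COLD, NOISY, or OK.

    `tolerance` and `max_noise` are made-up thresholds for illustration.
    """
    base_avg = mean(baseline)
    if base_avg == 0:
        return "NOISY"                      # ratio undefined; no usable signal
    # Coefficient of variation of the baseline as a crude signal-quality check.
    if pstdev(baseline) / abs(base_avg) > max_noise:
        return "NOISY"
    ratio = mean(canary) / base_avg         # ratio of averages
    if ratio > 1 + tolerance:
        return "HOT"
    if ratio < 1 - tolerance:
        return "COLD"
    return "OK"


def score(metrics: Dict[str, Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Final rollup (assumed rule): percentage of metrics classified OK."""
    labels = [classify(canary, baseline) for canary, baseline in metrics.values()]
    return 100.0 * labels.count("OK") / len(labels)
```

A constrained rollup would apply the same counting to a subset of metrics sharing a dimension before the final score is computed.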

ACA: in Action
[Figure: metrics bucketed as HOT, COLD, NOISY, or OK, with the constrained rollup (dashed) and the final rollup]

Hystrix: Defend Your App
• Protection from downstream service failures
• Failures may be functional (service unavailable) or performance-related
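
Hystrix is a Java library built around circuit breakers and fallbacks. The sketch below shows only the circuit-breaker pattern in Python and does not reflect Hystrix's API; `CircuitBreaker` and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()           # open: skip the dependency entirely
            self._opened_at = None          # cool-down elapsed: allow a trial call
            self._failures = self.max_failures - 1  # one more failure reopens
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        return result
```

This covers the functional case only. Handling the performance case would also need a timeout around `primary`, so that a slow dependency is counted as a failure.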

Maximizing: Scalability and Performance

Dynamic Scaling
EC2 footprint autoscales by 2500-3500 instances per day
• Total footprint is on the order of tens of thousands of EC2 instances
• A larger ASG ranges from 200 to 900 m2.4xlarge instances daily
Why:
• Improved scalability during unexpected workloads
• Absorb variance in service performance profile
• Reactive chain of dependencies
• Creates "reserved instance troughs" for batch activity

Dynamic Scaling, cont.
Example covers 3 services
• 2 edge (A,B), 1 mid-tier (C)
• C has more upstream services than simply A and B
Multiple Autoscaling Policies
• (A) System Load Average
• (B,C) Request-Rate based
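
The arithmetic behind a request-rate-based policy can be sketched as below. This is a simplification for illustration and not how AWS Auto Scaling policies are expressed; `desired_instances` and the per-instance target are hypothetical.

```python
import math


def desired_instances(request_rate: float, target_rps_per_instance: float,
                      min_size: int, max_size: int) -> int:
    """Instances needed to keep each one at or below its target request rate.

    The result is clamped to the group's configured minimum and maximum.
    """
    needed = math.ceil(request_rate / target_rps_per_instance)
    return max(min_size, min(max_size, needed))
```

For example, with an assumed target of 100 requests per second per instance and group bounds of 200 and 900, a load of 45,000 requests per second works out to 450 instances.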

• Response time variability is greatest during scaling events
• Average response time is primarily between 75 and 150 msec

• Instance counts 3x, Aggregate requests 4.5x (not shown)
• Average CPU utilization per instance: ~25-55%

Cassandra Performance
Study performed:
• 24 node C* SSD-based cluster (hi1.4xlarge)
• mid-tier service load application
• Targeting 2x production rates
• Increase read ops from 30k to 70k in ~3 minutes
• Increase write ops from 750 to 1500 in ~3 minutes
Results:
• 95th pctl response time increase: ~17 msec to 45 msec
• 99th pctl response time increase: ~35 msec to 80 msec

EVCache (memcached) Scalability
Response times consistent during 4x increase in load *
* Due to upstream code change

Cloud-scale Load Testing
• Ad-Hoc or CI-based load test model
• (CI) Run-over-run comparison; email on rule violation
1. Jenkins initiates job
2. JMeter instances apply load
3. Results written to S3
4. Instance metrics published to Atlas
5. Raw data fetched and processed
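
The run-over-run comparison in the CI model could look roughly like the sketch below. The rule shown (flag any metric that grows by more than a fixed fraction) is an assumption, as are the names `rule_violations` and `max_increase`; the slides do not describe the actual rules.

```python
from typing import Dict, List


def rule_violations(previous: Dict[str, float], current: Dict[str, float],
                    max_increase: float = 0.10) -> List[str]:
    """Return metrics that regressed by more than `max_increase` run over run."""
    violations = []
    for name, prev in previous.items():
        cur = current.get(name)
        if cur is None or prev <= 0:
            continue                        # nothing comparable for this metric
        change = (cur - prev) / prev
        if change > max_increase:
            violations.append(f"{name}: {prev:g} -> {cur:g} (+{change:.0%})")
    return violations
```

A non-empty result is the point at which a job like this would send the rule-violation email.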

Conclusions
• Continually accelerate engineering velocity
• Evolve architecture and processes to mitigate risks
• Stateless micro-service architectures win!
• Remove barriers for engineers
• Last option should be to reduce rate of change
• Exercise failure and “thundering herd” scenarios
• Cloud native scaling and resiliency are key factors
• Leverage pre-existing OSS PaaS when possible

Netflix Open Source
Our Open Source Software simplifies mgmt at scale
Great projects, stunning colleagues: jobs.netflix.com

Q&A
• cwatson@netflix.com
• Netflix Tech Blog: http://techblog.netflix.com
