Netflix
Performance Meetup
Global Client Performance
Fast Metrics
3G in Kazakhstan
● Global Internet:
  ○ faster (better networking)
  ○ slower (broader reach, congestion)
● Don't wait for it; measure it and deal with it
● A working app > a feature-rich app
Making the Internet fast is slow.
We need to know what the Internet looks like: not averages, but the full distribution.
Logging Anti-Patterns
● Sampling
  ○ Missed data
  ○ Rare events
  ○ Problems aren't spread evenly across the population
● Averages
  ○ Can't see the distribution
  ○ Outliers heavily distort (∞, 0, negatives, errors)
Instead, use the client as a map-reducer and send up aggregated data, less often.
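As a sketch of that pattern (hypothetical bucket scheme, not Netflix's actual client code), a client can fold each sample into log-spaced histogram buckets and flush only the counts upstream, on a slow cadence:

import math
from collections import Counter

class LatencyHistogram:
    """Log-spaced buckets; each bucket is ~25% wider than the previous one."""
    BASE = 1.25

    def __init__(self):
        self.counts = Counter()

    def record(self, millis):
        # clamp 0/negatives/errors instead of letting them distort the data
        self.counts[int(math.log(max(millis, 1), self.BASE))] += 1

    def flush(self):
        # return the aggregate and reset; ship this upstream infrequently
        out, self.counts = dict(self.counts), Counter()
        return out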
Sizing up the Internet.
Infinite (free) compute power!
● Calculate the inverse empirical cumulative distribution function (IECDF) by math to get the median, 95th percentile, etc.
> library(HistogramTools)
> iecdf <- HistToEcdf(histogram, method='linear', inverse=TRUE)
> iecdf(0.5)
[1] 0.7975309 # median
> iecdf(0.95)
[1] 4.65 # 95th percentile
○ ...or just use R, which is free and already knows how to do it
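The same math in Python, as a minimal sketch (toy bucket boundaries and counts, not production data): linearly interpolate within the bucket where the cumulative count crosses the requested quantile.

from itertools import accumulate

def inverse_ecdf(boundaries, counts):
    # boundaries has len(counts)+1 edges; counts are per-bucket totals
    total = sum(counts)
    cum = [0] + list(accumulate(counts))
    def f(q):
        target = q * total
        for i, c in enumerate(counts):
            if c and cum[i + 1] >= target:
                frac = (target - cum[i]) / c
                return boundaries[i] + frac * (boundaries[i + 1] - boundaries[i])
        return boundaries[-1]
    return f

iecdf = inverse_ecdf([0, 1, 2, 4, 8], [10, 40, 30, 20])
print(iecdf(0.5), iecdf(0.95))  # median, 95th percentile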
Data > Opinions.
Better than debating opinions.
Architecture is hard. Make it cheap to experiment where your users really are.
"There's no way that the
client makes that many
requests."
"No one really minds the
spinner."
"Why should we spend
time on that instead of
COOLFEATURE?"
"We live in a
50ms world!"
We built Daedalus
[Charts: DNS time distributions - fast vs. slow, US vs. elsewhere]
Interpret the data
● Visual → numerical: we need the IECDF for percentiles
  ○ ƒ(0.50) = 50th percentile (median)
  ○ ƒ(0.95) = 95th percentile
● Cluster similar experiences to get the pretty colors (k-means, hierarchical, etc.)
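A hedged sketch of that clustering step, assuming scikit-learn and an illustrative [p50, p95] signature per country/network cell:

import numpy as np
from sklearn.cluster import KMeans

# one row per cell (e.g. country x network type): [p50, p95] from the IECDF, in seconds
features = np.array([
    [0.8, 4.6],
    [0.3, 0.9],
    [2.5, 9.8],
    [0.4, 1.1],
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)  # cells with the same label share an experience (and a color)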
Practical Teleportation.
Make a Reality Lab.
● Go there!
● Abstract analysis is hard
● Feeling reality is much simpler than looking at graphs. Build!
Don't guess.
Developing a model based on production data, without missing the distribution of samples (network, render, responsiveness), will lead to better software.
Global reach doesn't need to be scary.
@gcirino42 http://blogofsomeguy.com
Icarus
Martin Spier
@spiermar
Performance Engineering @ Netflix
Problem & Motivation
● Real-user performance monitoring solution
● More insight into the App performance (as perceived by real users)
● Too many variables to trust synthetic tests and labs
● Prioritize work around App performance
● Track App improvement progress over time
● Detect issues, internal and external
Device Diversity
● Netflix runs on all sorts of devices
● Smart TVs, Gaming Consoles, Mobile Phones, Cable TV boxes, ...
● Consistently evaluate performance
What are we monitoring?
● User Actions (or things users do in the App)
● App Startup
● User Navigation
● Playing a Title
● Internal App metrics
What are we measuring?
● When does the timer start and stop?
● Time-to-Interactive (TTI)
  ○ Interactive, even if some items were not fully loaded and rendered
● Time-to-Render (TTR)
  ○ Everything above the fold (visible without scrolling) is rendered
● Play Delay
● Meaningful for what we are monitoring
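A minimal sketch of the timer bookkeeping behind TTR and TTI (event names are illustrative, not the production schema):

import time

class UserActionTimer:
    def __init__(self):
        self.start = time.monotonic()  # timer starts when the user action begins
        self.marks = {}

    def mark(self, event):
        # record only the first occurrence of each milestone
        self.marks.setdefault(event, time.monotonic() - self.start)

timer = UserActionTimer()
# ... render the view, attach input handlers ...
timer.mark("above_fold_rendered")  # TTR: visible content is painted
timer.mark("interactive")          # TTI: responds to input, even if some items are still loading
print(timer.marks)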
High-dimensional Data
● Complex device categorization
● Geo regions, subregions, countries
● Highly granular network classifications
● High volume of A/B tests
● Different facets of the same user action
  ○ Cold, suspended and backgrounded App startups
  ○ Target view/page on App startup
Data Sketches
● Data structures that compactly approximate a much larger data set
● Preserve essential features!
● Significantly smaller!
● Faster to operate on!
t-Digest
● t-Digest data structure
● Rank-based statistics (such as quantiles)
● Parallel friendly (can be merged!)
● Very fast!
● Really accurate!
https://github.com/tdunning/t-digest
[Diagram: aggregation pipeline + t-Digest sketches]
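A hedged example of that mergeability, using the community Python tdigest package rather than the Java library linked above (assumed API; details may differ):

from tdigest import TDigest

# each client or partition builds its own digest...
a, b = TDigest(), TDigest()
a.batch_update([120, 95, 310, 80])  # e.g. startup times in ms
b.batch_update([400, 220, 150])

# ...and digests merge, so rank statistics survive map-reduce style aggregation
merged = a + b
print(merged.percentile(50), merged.percentile(95))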
iOS Median Comparison, Break by Country
iOS Median Comparison, Break by Country + iPhone 6S Plus
CDFs by UI Version
Warm Startup Rate
A/B Cell Comparison
Anomaly Detection
Going Forward
● Resource utilization metrics
● Device profiling
  ○ Instrumenting client code
● Explore other visualizations
  ○ Frequency heat maps
● Connection between perceived performance, acquisition and retention
@spiermar
Netflix
Autoscaling for experts
Vadim
Savings!
● Mid-tier stateless services are ~2/3 of the total footprint
● Savings: ~30% of the mid-tier footprint (roughly 30K instances)
  ○ Higher savings if we break it down by region
  ○ Even higher savings on services that scale well
Why we autoscale - philosophical reasons
Why we autoscale - pragmatic reasons
● Encoding
● Precompute
● Failover
● Red/black pushes
● Curing cancer**
● And more...
** Hack-day project
Should you autoscale?
Benefits
● On-demand capacity: direct $$ savings
● RI capacity: re-purposing spare capacity
However, for each server group, beware of:
● Uneven distribution of traffic
● Sticky traffic
● Bursty traffic
● Small ASG sizes (<10)
Autoscaling impacts availability - true or false?
False* (* if done correctly)
Under-provisioning, however, can impact availability
● Autoscaling is not a problem
● The real problem is not knowing the performance characteristics of the service
AWS autoscaling mechanics
Aggregated metric feed → CloudWatch alarm → notification → ASG scaling policy
Tunables
● Metric
● Threshold
● # of eval periods
● Scaling amount
● Warmup time
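As a sketch of how those tunables map onto the AWS APIs (boto3, with placeholder names and values):

import boto3

asg = boto3.client("autoscaling")
cw = boto3.client("cloudwatch")

policy = asg.put_scaling_policy(
    AutoScalingGroupName="myservice-v042",            # placeholder ASG
    PolicyName="scale-up-on-throughput",
    PolicyType="StepScaling",
    AdjustmentType="PercentChangeInCapacity",
    EstimatedInstanceWarmup=300,                      # warmup time
    StepAdjustments=[                                 # scaling amounts
        {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 20,
         "ScalingAdjustment": 10},
        {"MetricIntervalLowerBound": 20, "ScalingAdjustment": 30},
    ],
)

cw.put_metric_alarm(
    AlarmName="myservice-high-throughput",
    Namespace="AWS/ApplicationELB",                   # metric (placeholder choice)
    MetricName="RequestCountPerTarget",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,                              # # of eval periods
    Threshold=1000.0,                                 # threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],               # alarm -> scaling policy
)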
What metric to scale on? Throughput vs. resource utilization
Pros
● Tracks a direct measure of work
● Linear scaling
● Predictable
● Requires less adjustment over time
Cons
● Thresholds tend to drift over time
● Prone to changes in request mixture
● Less predictable
● More oscillation / jitter
Autoscaling on multiple metrics
Proceed with caution
● Harder to reason about scaling behavior
● Different metrics might contradict each other, causing oscillation
Typical Netflix configuration:
● Scale-up policy on throughput
● Scale-down policy on throughput
● Emergency scale-up policy on CPU, aka “the hammer rule”
Well-behaved autoscaling
Common mistakes - “no rush” scaling
Problem: scaling amounts too small, cooldown too long
Effect: scaling lags behind the traffic flow; not enough capacity at peak, capacity wasted in the trough
Remedy: increase scaling amounts, migrate to step policies
Common mistakes - twitchy scaling
Problem: scale-up policy is too aggressive
Effect: unnecessary capacity churn
Remedy: reduce the scale-up amount, increase the # of eval periods
Common mistakes - should I stay or should I go
Problem: scale-up and scale-down thresholds are too close to each other
Effect: constant capacity oscillation
Remedy: move the scale-up and scale-down thresholds farther apart
AWS target tracking - your best bet!
● Think of it as a step policy with auto-steps
● You can also think of it as a thermostat
● Accounts for the rate of change in the monitored metric
● Pick a metric, set the target value and warmup time - that's it!
[Charts: step policy vs. target-tracking scaling behavior]
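A hedged boto3 sketch of the target-tracking equivalent; you pick the metric, target value and warmup, and AWS computes the steps (placeholder values):

import boto3

boto3.client("autoscaling").put_scaling_policy(
    AutoScalingGroupName="myservice-v042",  # placeholder ASG
    PolicyName="target-tracking-cpu",
    PolicyType="TargetTrackingScaling",
    EstimatedInstanceWarmup=300,            # warmup time
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,                # the thermostat setpoint
    },
)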
Netflix
PMCs on the Cloud
Brendan
90% CPU utilization:
[Chart: as commonly pictured - 90% busy, 10% waiting ("idle")]
Reality:
[Chart: much of the "busy" time is actually waiting ("stalled"), not retiring instructions]
# perf stat -a -- sleep 10
Performance counter stats for 'system wide':
80018.188438 task-clock (msec) # 8.000 CPUs utilized (100.00%)
7,562 context-switches # 0.095 K/sec (100.00%)
1,157 cpu-migrations # 0.014 K/sec (100.00%)
109,734 page-faults # 0.001 M/sec
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
10.001715965 seconds time elapsed
Performance Monitoring Counters (PMCs) in most clouds
# perf stat -a -- sleep 10
Performance counter stats for 'system wide':
641320.173626 task-clock (msec) # 64.122 CPUs utilized [100.00%]
1,047,222 context-switches # 0.002 M/sec [100.00%]
83,420 cpu-migrations # 0.130 K/sec [100.00%]
38,905 page-faults # 0.061 K/sec
655,419,788,755 cycles # 1.022 GHz [75.02%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
536,830,399,277 instructions # 0.82 insns per cycle [75.02%]
97,103,651,128 branches # 151.412 M/sec [75.02%]
1,230,478,597 branch-misses # 1.27% of all branches [74.99%]
10.001622154 seconds time elapsed
AWS EC2 m4.16xl
Interpreting IPC & Actionable Items
IPC: Instructions Per Cycle (the inverse of CPI)
● IPC < 1.0: likely memory-stalled
  ○ Improve data usage and layout for better CPU caching and memory locality.
  ○ Choose larger CPU caches, faster memory buses and interconnects.
● IPC > 1.0: likely instruction bound
  ○ Reduce code execution, eliminate unnecessary work, cache operations, improve algorithm order. Can analyze using CPU flame graphs.
  ○ Faster CPUs.
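A quick check of that arithmetic against the m4.16xl output above:

instructions = 536_830_399_277
cycles = 655_419_788_755
print(f"IPC = {instructions / cycles:.2f}")  # 0.82 < 1.0: likely memory-stalled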
Intel Architectural PMCs
Event Name                  Umask  Event Select  Example Event Mask Mnemonic
UnHalted Core Cycles        00H    3CH           CPU_CLK_UNHALTED.THREAD_P
Instruction Retired         00H    C0H           INST_RETIRED.ANY_P
UnHalted Reference Cycles   01H    3CH           CPU_CLK_THREAD_UNHALTED.REF_XCLK
LLC Reference               4FH    2EH           LONGEST_LAT_CACHE.REFERENCE
LLC Misses                  41H    2EH           LONGEST_LAT_CACHE.MISS
Branch Instruction Retired  00H    C4H           BR_INST_RETIRED.ALL_BRANCHES
Branch Misses Retired       00H    C5H           BR_MISP_RETIRED.ALL_BRANCHES
Now available in AWS EC2 on full dedicated hosts (e.g., m4.16xl, ...)
# pmcarch 1
CYCLES INSTRUCTIONS IPC BR_RETIRED BR_MISPRED BMR% LLCREF LLCMISS LLC%
90755342002 64236243785 0.71 11760496978 174052359 1.48 1542464817 360223840 76.65
75815614312 59253317973 0.78 10665897008 158100874 1.48 1361315177 286800304 78.93
65164313496 53307631673 0.82 9538082731 137444723 1.44 1272163733 268851404 78.87
90820303023 70649824946 0.78 12672090735 181324730 1.43 1685112288 343977678 79.59
76341787799 50830491037 0.67 10542795714 143936677 1.37 1204703117 279162683 76.83
[...]
tiptop - [root]
Tasks: 96 total, 3 displayed screen 0: default
PID [ %CPU] %SYS P Mcycle Minstr IPC %MISS %BMIS %BUS COMMAND
3897 35.3 28.5 4 274.06 178.23 0.65 0.06 0.00 0.0 java
1319+ 5.5 2.6 6 87.32 125.55 1.44 0.34 0.26 0.0 nm-applet
900 0.9 0.0 6 25.91 55.55 2.14 0.12 0.21 0.0 dbus-daemo
https://github.com/brendangregg/pmc-cloud-tools
Netflix
Performance Meetup