Netflix
Performance Meetup
Global Client Performance
Fast Metrics
3G in Kazakhstan
● Global Internet:
  ○ faster (better networking)
  ○ slower (broader reach, congestion)
● Don't wait for it; measure it and deal with it
● A working app > a feature-rich app
Making the Internet fast is slow.
We need to know what the Internet looks like: not averages, but the full distribution.
Logging Anti-Patterns
● Sampling
  ○ Missed data
  ○ Rare events
  ○ Problems aren't spread evenly across the population
● Averages
  ○ Can't see the distribution
  ○ Outliers heavily distort (∞, 0, negatives, errors)
Instead, use the client as a map-reducer and send up aggregated data, less often.
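As a sketch of that pattern (hypothetical bucket scheme, not Netflix's actual client code), a client can fold each sample into log-spaced histogram buckets and flush only the counts upstream, on a slow cadence:

import math
from collections import Counter

class LatencyHistogram:
    """Log-spaced buckets; each bucket is ~25% wider than the previous one."""
    BASE = 1.25

    def __init__(self):
        self.counts = Counter()

    def record(self, millis):
        # clamp 0/negatives/errors instead of letting them distort the data
        self.counts[int(math.log(max(millis, 1), self.BASE))] += 1

    def flush(self):
        # return the aggregate and reset; ship this upstream infrequently
        out, self.counts = dict(self.counts), Counter()
        return out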
Sizing up the Internet.
Infinite (free) compute power!
● Calculate the inverse empirical cumulative distribution function (IECDF) by math to get the median, 95th percentile, etc.
> library(HistogramTools)
> iecdf <- HistToEcdf(histogram, method='linear', inverse=TRUE)
> iecdf(0.5)
[1] 0.7975309 # median
> iecdf(0.95)
[1] 4.65 # 95th percentile
○ ...or just use R, which is free and already knows how to do it
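The same math in Python, as a minimal sketch (toy bucket boundaries and counts, not production data): linearly interpolate within the bucket where the cumulative count crosses the requested quantile.

from itertools import accumulate

def inverse_ecdf(boundaries, counts):
    # boundaries has len(counts)+1 edges; counts are per-bucket totals
    total = sum(counts)
    cum = [0] + list(accumulate(counts))
    def f(q):
        target = q * total
        for i, c in enumerate(counts):
            if c and cum[i + 1] >= target:
                frac = (target - cum[i]) / c
                return boundaries[i] + frac * (boundaries[i + 1] - boundaries[i])
        return boundaries[-1]
    return f

iecdf = inverse_ecdf([0, 1, 2, 4, 8], [10, 40, 30, 20])
print(iecdf(0.5), iecdf(0.95))  # median, 95th percentile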
Data > Opinions.
Better than debating opinions.
Architecture is hard. Make it cheap to experiment where your users really are.
"There's no way that the
client makes that many
requests."
"No one really minds the
spinner."
"Why should we spend
time on that instead of
COOLFEATURE?"
"We live in a
50ms world!"
We built Daedalus
[Charts: DNS time distributions - fast vs. slow, US vs. elsewhere]
Interpret the data
● Visual → numerical: we need the IECDF for percentiles
  ○ ƒ(0.50) = 50th percentile (median)
  ○ ƒ(0.95) = 95th percentile
● Cluster similar experiences to get the pretty colors (k-means, hierarchical, etc.)
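A hedged sketch of that clustering step, assuming scikit-learn and an illustrative [p50, p95] signature per country/network cell:

import numpy as np
from sklearn.cluster import KMeans

# one row per cell (e.g. country x network type): [p50, p95] from the IECDF, in seconds
features = np.array([
    [0.8, 4.6],
    [0.3, 0.9],
    [2.5, 9.8],
    [0.4, 1.1],
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)  # cells with the same label share an experience (and a color)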
Practical Teleportation.
Make a Reality Lab.
● Go there!
● Abstract analysis is hard
● Feeling reality is much simpler than looking at graphs. Build!
Don't guess.
Developing a model based on production data, without missing the distribution of samples (network, render, responsiveness), will lead to better software.
Global reach doesn't need to be scary.
@gcirino42 http://blogofsomeguy.com
Icarus
Martin Spier
@spiermar
Performance Engineering @ Netflix
Problem & Motivation
● Real-user performance monitoring solution
● More insight into the App performance (as perceived by real users)
● Too many variables to trust synthetic tests and labs
● Prioritize work around App performance
● Track App improvement progress over time
● Detect issues, internal and external
Device Diversity
● Netflix runs on all sorts of devices
● Smart TVs, Gaming Consoles, Mobile Phones, Cable TV boxes, ...
● Consistently evaluate performance
What are we monitoring?
● User Actions (or things users do in the App)
● App Startup
● User Navigation
● Playing a Title
● Internal App metrics
What are we measuring?
● When does the timer start and stop?
● Time-to-Interactive (TTI)
  ○ Interactive, even if some items were not fully loaded and rendered
● Time-to-Render (TTR)
  ○ Everything above the fold (visible without scrolling) is rendered
● Play Delay
● Meaningful for what we are monitoring
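A minimal sketch of the timer bookkeeping behind TTR and TTI (event names are illustrative, not the production schema):

import time

class UserActionTimer:
    def __init__(self):
        self.start = time.monotonic()  # timer starts when the user action begins
        self.marks = {}

    def mark(self, event):
        # record only the first occurrence of each milestone
        self.marks.setdefault(event, time.monotonic() - self.start)

timer = UserActionTimer()
# ... render the view, attach input handlers ...
timer.mark("above_fold_rendered")  # TTR: visible content is painted
timer.mark("interactive")          # TTI: responds to input, even if some items are still loading
print(timer.marks)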
High-dimensional Data
● Complex device categorization
● Geo regions, subregions, countries
● Highly granular network classifications
● High volume of A/B tests
● Different facets of the same user action
  ○ Cold, suspended and backgrounded App startups
  ○ Target view/page on App startup
Data Sketches
● Data structures that compactly approximate a much larger data set
● Preserve essential features!
● Significantly smaller!
● Faster to operate on!
t-Digest
● t-Digest data structure
● Rank-based statistics (such as quantiles)
● Parallel friendly (can be merged!)
● Very fast!
● Really accurate!
https://github.com/tdunning/t-digest
[Diagram: aggregation pipeline + t-Digest sketches]
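A hedged example of that mergeability, using the community Python tdigest package rather than the Java library linked above (assumed API; details may differ):

from tdigest import TDigest

# each client or partition builds its own digest...
a, b = TDigest(), TDigest()
a.batch_update([120, 95, 310, 80])  # e.g. startup times in ms
b.batch_update([400, 220, 150])

# ...and digests merge, so rank statistics survive map-reduce style aggregation
merged = a + b
print(merged.percentile(50), merged.percentile(95))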
iOS Median Comparison, Break by Country
iOS Median Comparison, Break by Country + iPhone 6S Plus
CDFs by UI Version
Warm Startup Rate
A/B Cell Comparison
Anomaly Detection
Going Forward
● Resource utilization metrics
● Device profiling
  ○ Instrumenting client code
● Explore other visualizations
  ○ Frequency heat maps
● Connection between perceived performance, acquisition and retention
@spiermar
Netflix
Autoscaling for experts
Vadim
Savings!
● Mid-tier stateless services are ~2/3 of the total footprint
● Savings: ~30% of the mid-tier footprint (roughly 30K instances)
  ○ Higher savings if we break it down by region
  ○ Even higher savings on services that scale well
Why we autoscale - philosophical reasons
Why we autoscale - pragmatic reasons
● Encoding
● Precompute
● Failover
● Red/black pushes
● Curing cancer**
● And more...
** Hack-day project
Should you autoscale?
Benefits
● On-demand capacity: direct $$ savings
● RI capacity: re-purposing spare capacity
However, for each server group, beware of:
● Uneven distribution of traffic
● Sticky traffic
● Bursty traffic
● Small ASG sizes (<10)
Autoscaling impacts availability - true or false?
False* (* if done correctly)
Under-provisioning, however, can impact availability
● Autoscaling is not a problem
● The real problem is not knowing the performance characteristics of the service
AWS autoscaling mechanics
Aggregated metric feed → CloudWatch alarm → notification → ASG scaling policy
Tunables
● Metric
● Threshold
● # of eval periods
● Scaling amount
● Warmup time
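As a sketch of how those tunables map onto the AWS APIs (boto3, with placeholder names and values):

import boto3

asg = boto3.client("autoscaling")
cw = boto3.client("cloudwatch")

policy = asg.put_scaling_policy(
    AutoScalingGroupName="myservice-v042",            # placeholder ASG
    PolicyName="scale-up-on-throughput",
    PolicyType="StepScaling",
    AdjustmentType="PercentChangeInCapacity",
    EstimatedInstanceWarmup=300,                      # warmup time
    StepAdjustments=[                                 # scaling amounts
        {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 20,
         "ScalingAdjustment": 10},
        {"MetricIntervalLowerBound": 20, "ScalingAdjustment": 30},
    ],
)

cw.put_metric_alarm(
    AlarmName="myservice-high-throughput",
    Namespace="AWS/ApplicationELB",                   # metric (placeholder choice)
    MetricName="RequestCountPerTarget",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,                              # # of eval periods
    Threshold=1000.0,                                 # threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],               # alarm -> scaling policy
)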
What metric to scale on? Throughput vs. resource utilization
Pros
● Tracks a direct measure of work
● Linear scaling
● Predictable
● Requires less adjustment over time
Cons
● Thresholds tend to drift over time
● Prone to changes in request mixture
● Less predictable
● More oscillation / jitter
Autoscaling on multiple metrics
Proceed with caution
● Harder to reason about scaling behavior
● Different metrics might contradict each other, causing oscillation
Typical Netflix configuration:
● Scale-up policy on throughput
● Scale-down policy on throughput
● Emergency scale-up policy on CPU, aka “the hammer rule”
Well-behaved autoscaling
Common mistakes - “no rush” scaling
Problem: scaling amounts too small, cooldown too long
Effect: scaling lags behind the traffic flow; not enough capacity at peak, capacity wasted in the trough
Remedy: increase scaling amounts, migrate to step policies
Common mistakes - twitchy scaling
Problem: scale-up policy is too aggressive
Effect: unnecessary capacity churn
Remedy: reduce the scale-up amount, increase the # of eval periods
Common mistakes - should I stay or should I go
Problem: scale-up and scale-down thresholds are too close to each other
Effect: constant capacity oscillation
Remedy: move the scale-up and scale-down thresholds farther apart
AWS target tracking - your best bet!
● Think of it as a step policy with auto-steps
● You can also think of it as a thermostat
● Accounts for the rate of change in the monitored metric
● Pick a metric, set the target value and warmup time - that's it!
[Charts: step policy vs. target-tracking scaling behavior]
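A hedged boto3 sketch of the target-tracking equivalent; you pick the metric, target value and warmup, and AWS computes the steps (placeholder values):

import boto3

boto3.client("autoscaling").put_scaling_policy(
    AutoScalingGroupName="myservice-v042",  # placeholder ASG
    PolicyName="target-tracking-cpu",
    PolicyType="TargetTrackingScaling",
    EstimatedInstanceWarmup=300,            # warmup time
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,                # the thermostat setpoint
    },
)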
Netflix
PMCs on the Cloud
Brendan
90% CPU utilization:
[Chart: as commonly pictured - 90% busy, 10% waiting ("idle")]
Reality:
[Chart: much of the "busy" time is actually waiting ("stalled"), not retiring instructions]
# perf stat -a -- sleep 10
Performance counter stats for 'system wide':
80018.188438 task-clock (msec) # 8.000 CPUs utilized (100.00%)
7,562 context-switches # 0.095 K/sec (100.00%)
1,157 cpu-migrations # 0.014 K/sec (100.00%)
109,734 page-faults # 0.001 M/sec
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
10.001715965 seconds time elapsed
Performance Monitoring Counters (PMCs) in most clouds
# perf stat -a -- sleep 10
Performance counter stats for 'system wide':
641320.173626 task-clock (msec) # 64.122 CPUs utilized [100.00%]
1,047,222 context-switches # 0.002 M/sec [100.00%]
83,420 cpu-migrations # 0.130 K/sec [100.00%]
38,905 page-faults # 0.061 K/sec
655,419,788,755 cycles # 1.022 GHz [75.02%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
536,830,399,277 instructions # 0.82 insns per cycle [75.02%]
97,103,651,128 branches # 151.412 M/sec [75.02%]
1,230,478,597 branch-misses # 1.27% of all branches [74.99%]
10.001622154 seconds time elapsed
AWS EC2 m4.16xl
Interpreting IPC & Actionable Items
IPC: Instructions Per Cycle (the inverse of CPI)
● IPC < 1.0: likely memory-stalled
  ○ Improve data usage and layout for better CPU caching and memory locality.
  ○ Choose larger CPU caches, faster memory buses and interconnects.
● IPC > 1.0: likely instruction bound
  ○ Reduce code execution, eliminate unnecessary work, cache operations, improve algorithm order. Can analyze using CPU flame graphs.
  ○ Faster CPUs.
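A quick check of that arithmetic against the m4.16xl output above:

instructions = 536_830_399_277
cycles = 655_419_788_755
print(f"IPC = {instructions / cycles:.2f}")  # 0.82 < 1.0: likely memory-stalled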
Intel Architectural PMCs
Event Name                  Umask  Event Select  Example Event Mask Mnemonic
UnHalted Core Cycles        00H    3CH           CPU_CLK_UNHALTED.THREAD_P
Instruction Retired         00H    C0H           INST_RETIRED.ANY_P
UnHalted Reference Cycles   01H    3CH           CPU_CLK_THREAD_UNHALTED.REF_XCLK
LLC Reference               4FH    2EH           LONGEST_LAT_CACHE.REFERENCE
LLC Misses                  41H    2EH           LONGEST_LAT_CACHE.MISS
Branch Instruction Retired  00H    C4H           BR_INST_RETIRED.ALL_BRANCHES
Branch Misses Retired       00H    C5H           BR_MISP_RETIRED.ALL_BRANCHES
Now available in AWS EC2 on full dedicated hosts (e.g., m4.16xl, ...)
# pmcarch 1
CYCLES INSTRUCTIONS IPC BR_RETIRED BR_MISPRED BMR% LLCREF LLCMISS LLC%
90755342002 64236243785 0.71 11760496978 174052359 1.48 1542464817 360223840 76.65
75815614312 59253317973 0.78 10665897008 158100874 1.48 1361315177 286800304 78.93
65164313496 53307631673 0.82 9538082731 137444723 1.44 1272163733 268851404 78.87
90820303023 70649824946 0.78 12672090735 181324730 1.43 1685112288 343977678 79.59
76341787799 50830491037 0.67 10542795714 143936677 1.37 1204703117 279162683 76.83
[...]
tiptop - [root]
Tasks: 96 total, 3 displayed screen 0: default
PID [ %CPU] %SYS P Mcycle Minstr IPC %MISS %BMIS %BUS COMMAND
3897 35.3 28.5 4 274.06 178.23 0.65 0.06 0.00 0.0 java
1319+ 5.5 2.6 6 87.32 125.55 1.44 0.34 0.26 0.0 nm-applet
900 0.9 0.0 6 25.91 55.55 2.14 0.12 0.21 0.0 dbus-daemo
https://github.com/brendangregg/pmc-cloud-tools
Netflix
Performance Meetup