ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Finding (and Fixing!) Performance Anomalies
in Large Scale Distributed Systems
Victor Marmol
vmarmol@google.com

Today
App
? ? ?

Containers Infrastructure
Manage containers @ Google
Everything runs in a container
2B+ containers started per week
Images by Connie Zhou

You may Know Some of our OSS Work
Let Me Contain That For You

What about at Google?

Borg

What is Borg?
Large-scale cluster management at Google with Borg

Borglet
Google’s node agent
Borglet = init + Docker + a few other things
Primary goals
➔ Talk to master
➔ Manage tasks
➔ Manage resources (containers)

How do we get to task performance management?
Dremel: Interactive Analysis of Web-Scale Datasets

Task Performance Analysis (TPA)
Our system for container-based black-box application performance analysis
Containers are the main enabler
Manage, monitor, and improve application performance
Today’s Talk
➔ How does it work
➔ User stories: stories from the front-lines!

How does it work?

Overall Flow
Collection → Aggregation → Baselines → SLOs → Enforcement

Low-Level Performance Metrics
Key: collect lots of container-based low-level metrics from the kernel
Custom kernel patches to give us even more stats and metrics
Sources
➔ cgroups
➔ /proc
➔ perf_events
➔ misc (e.g.: netlink, ioctls, etc)
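As a rough illustration of pulling container-level stats from one of these sources, the sketch below parses the flat "key value" format that upstream Linux cgroup v2 uses for cpu.stat. The slides do not say which files or cgroup version are read, and the custom kernel patches are not represented; the sample text and the helper names are assumptions for illustration only.

```python
# Sample contents in the upstream cgroup v2 cpu.stat format (values are made up).
SAMPLE_CPU_STAT = """\
usage_usec 1250000
user_usec 900000
system_usec 350000
nr_periods 100
nr_throttled 4
throttled_usec 80000
"""


def parse_flat_keyed(text):
    """Parse 'key value' lines into a dict of ints, skipping malformed lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats):
    """Fraction of enforcement periods in which the container was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


if __name__ == "__main__":
    stats = parse_flat_keyed(SAMPLE_CPU_STAT)
    print(stats["usage_usec"], throttled_fraction(stats))
```

On a real machine the text would come from reading the container's cpu.stat file under the cgroup filesystem rather than from a constant.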

Low-Level Performance Metrics
Histograms are our favorite: number, breakdown, and tail of operations
➔ CPU latencies
➔ Memory reclaim, page faults, re-faults
➔ I/O wait time and service time
Metrics collected every 1s - 10s
➔ 1s: Used for on-machine control loops
➔ 10s: Exported for off-machine analysis
Collection is very low-overhead
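To make the "number, breakdown, and tail" point concrete, here is a minimal bucketed histogram sketch. The bucket boundaries, class name, and method names are hypothetical; the slides do not describe the actual histogram layout.

```python
import bisect


class LatencyHistogram:
    """Fixed-bucket histogram: cheap to update, small to export."""

    def __init__(self, bounds_us):
        self.bounds = sorted(bounds_us)
        # One bucket per upper bound, plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)

    def record(self, value_us):
        # Bucket i holds values <= bounds[i] and > bounds[i - 1].
        self.counts[bisect.bisect_left(self.bounds, value_us)] += 1

    def total(self):
        return sum(self.counts)

    def percentile_upper_bound(self, pct):
        """Upper bound of the bucket holding the pct-th percentile."""
        total = self.total()
        if total == 0:
            raise ValueError("empty histogram")
        target = pct * total / 100
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")


if __name__ == "__main__":
    hist = LatencyHistogram([10, 100, 1000, 10000])
    for _ in range(98):
        hist.record(5)
    hist.record(500)
    hist.record(20000)
    print(hist.total(), hist.percentile_upper_bound(50), hist.percentile_upper_bound(99))
```

The median says almost nothing here, while the 99th percentile and the overflow bucket expose the slow operations, which is why the tail matters.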

Cluster-Wide Aggregation
Cluster service that collects all metrics and exports them to Dremel
Push data for all tasks on all machines, keep them for a while
Single-handedly our most valuable resource
➔ SQL is very expressive and flexible
➔ Ability to query all that data in seconds: priceless
Best news: You can use it too! Google BigQuery
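The kind of query this enables can be sketched with an in-memory SQLite database standing in for Dremel or BigQuery. The table name task_metrics, its columns, and the sample rows are all hypothetical; the real schema is not shown in the slides.

```python
import sqlite3

# Hypothetical rows: (cell, job, task_index, wakeup_latency_us).
ROWS = [
    ("cell-a", "websearch", 0, 120),
    ("cell-a", "websearch", 1, 130),
    ("cell-a", "websearch", 2, 900),
    ("cell-b", "websearch", 0, 110),
    ("cell-b", "websearch", 1, 115),
]


def latency_by_cell_and_job(rows):
    """Group per-task samples by cell and job, worst maximum first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_metrics ("
        "cell TEXT, job TEXT, task_index INTEGER, wakeup_latency_us INTEGER)"
    )
    conn.executemany("INSERT INTO task_metrics VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT cell, job, COUNT(*) AS tasks,
               AVG(wakeup_latency_us) AS avg_us,
               MAX(wakeup_latency_us) AS max_us
        FROM task_metrics
        GROUP BY cell, job
        ORDER BY max_us DESC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latency_by_cell_and_job(ROWS):
        print(row)
```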

Performance Baselines
Cluster-level service: slice & dice data
➔ Types of tasks
➔ Distributions across replicas
➔ Per compute cluster (Borg cell)
➔ Historical trends
Gives us insights into performance trends and helps us develop performance baselines
Performance baseline: performance we can achieve given different parameters
➔ CPU: How quickly can we schedule you on the CPU
➔ Disk I/O: What disk I/O latency can we achieve
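One way to picture a baseline is as a high percentile of an observed metric per slice of the fleet. The sketch below assumes a nearest-rank 95th percentile grouped by cell and task type; the slides do not state which statistic or grouping is actually used, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def baselines(samples, pct=95):
    """samples: iterable of (cell, task_type, latency_us) tuples."""
    groups = defaultdict(list)
    for cell, task_type, latency_us in samples:
        groups[(cell, task_type)].append(latency_us)
    return {key: nearest_rank(vals, pct) for key, vals in groups.items()}


if __name__ == "__main__":
    samples = [("cell-a", "serving", v) for v in range(1, 21)]
    samples += [("cell-b", "batch", 50), ("cell-b", "batch", 70)]
    print(baselines(samples))
```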

Baselines → SLOs
From baselines we provide performance SLOs: a promise to the user
You promise to do X
➔ CPU: Use at most as much CPU as you asked for
➔ Disk I/O: Issue fewer than X I/Os per second
We promise to give you Y performance
➔ CPU: You will get scheduled on a CPU within Yms of requesting it
➔ Disk I/O: You will get I/O wait time of at most Yms
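The two-sided nature of the promise can be sketched as a small check: the infrastructure's side only applies while the user keeps theirs. The class names, field names, and result strings below are hypothetical and cover only the CPU case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSlo:
    cpu_request_cores: float      # X: what the user asked for
    max_wakeup_latency_ms: float  # Y: what the infrastructure promises


@dataclass(frozen=True)
class CpuObservation:
    cpu_usage_cores: float
    wakeup_latency_ms: float


def evaluate(slo, obs):
    """Classify one observation against the two-sided contract."""
    if obs.cpu_usage_cores > slo.cpu_request_cores:
        # The user broke their side, so the latency promise does not apply.
        return "user_over_request"
    if obs.wakeup_latency_ms > slo.max_wakeup_latency_ms:
        return "slo_violated"
    return "slo_met"


if __name__ == "__main__":
    slo = CpuSlo(cpu_request_cores=2.0, max_wakeup_latency_ms=5.0)
    print(evaluate(slo, CpuObservation(1.5, 3.0)))
    print(evaluate(slo, CpuObservation(1.5, 9.0)))
    print(evaluate(slo, CpuObservation(2.5, 9.0)))
```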

Enacting SLOs
Monitor SLOs closely and aggressively ensure they are met
Per-node
➔ Give more resources or better quality resources
➔ Throttle bad actors (antagonists)
Cluster-wide
➔ Ask for help!
➔ Move task to a different machine
➔ Move antagonist to a different machine
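The options above can be read as an escalation ladder: try per-node fixes first, then ask the cluster for help. The ordering and the inputs in this sketch are assumptions; the slides list the available actions but not the policy for choosing between them.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    GIVE_MORE_RESOURCES = "give_more_resources"
    THROTTLE_ANTAGONIST = "throttle_antagonist"
    MOVE_ANTAGONIST = "move_antagonist"
    MOVE_TASK = "move_task"


def choose_action(slo_met, node_has_spare, antagonist_found, already_throttled):
    """Pick the least disruptive action that has not been tried yet."""
    if slo_met:
        return Action.NONE
    # Per-node options first.
    if node_has_spare:
        return Action.GIVE_MORE_RESOURCES
    if antagonist_found and not already_throttled:
        return Action.THROTTLE_ANTAGONIST
    # Cluster-wide options once the node cannot fix it alone.
    if antagonist_found:
        return Action.MOVE_ANTAGONIST
    return Action.MOVE_TASK


if __name__ == "__main__":
    print(choose_action(False, False, True, False))
```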

Metrics
➔ CPU
➔ NUMA
➔ Disk I/O

CPU
Low-level metrics
➔ Wakeup latency: time between wanting to run and running
➔ Round-robin latency: how well you share CPU within your app
➔ Load: how much work you wanted to do
➔ Time per state: how much time you spent in each state (e.g. sleep, wait, run, queue)
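A rough proxy for wakeup latency is available on stock Linux through /proc/<pid>/schedstat, whose first three fields are time on CPU (ns), time waiting on a run queue (ns), and the number of timeslices run. This is not the metric described in the talk, which relies on custom kernel patches and per-container histograms; the sample values and function names below are made up for illustration.

```python
def parse_schedstat(text):
    """Parse the three leading fields of /proc/<pid>/schedstat."""
    run_ns, wait_ns, timeslices = (int(field) for field in text.split()[:3])
    return {"run_ns": run_ns, "runqueue_wait_ns": wait_ns, "timeslices": timeslices}


def avg_runqueue_wait_us(before, after):
    """Average run-queue wait per timeslice between two snapshots, in microseconds."""
    slices = after["timeslices"] - before["timeslices"]
    if slices <= 0:
        return 0.0
    waited_ns = after["runqueue_wait_ns"] - before["runqueue_wait_ns"]
    return waited_ns / slices / 1000


if __name__ == "__main__":
    before = parse_schedstat("1000000 200000 10")
    after = parse_schedstat("5000000 1200000 60")
    print(avg_runqueue_wait_us(before, after))
```

An average hides the tail, so a production collector would feed individual waits into a histogram instead of keeping a single number.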

CPU
SLOs
➔ Wakeup latency when well-behaved
➔ CPU usage rate when well-behaved

NUMA
Low-level metrics
➔ CPU locality: how much of your CPU (and usage) was in local vs remote nodes
➔ Memory locality: how much of your memory (and accesses) was in local vs remote nodes
➔ NUMA score: resource-product of both above (0.0 - 1.0)
SLOs
➔ NUMA score of 0.85 or above given certain job shapes
The NUMA Experience
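The slides do not define "resource-product". One plausible reading, sketched here purely as an assumption, is the plain product of the CPU and memory locality fractions, which naturally stays in the 0.0 - 1.0 range. The function names and inputs are hypothetical.

```python
NUMA_SLO_THRESHOLD = 0.85  # from the slide: score of 0.85 or above


def locality(local, remote):
    """Fraction of a resource that was local; 1.0 if nothing was used."""
    total = local + remote
    return local / total if total else 1.0


def numa_score(cpu_local, cpu_remote, mem_local, mem_remote):
    """Assumed definition: product of CPU locality and memory locality."""
    return locality(cpu_local, cpu_remote) * locality(mem_local, mem_remote)


def meets_numa_slo(score):
    return score >= NUMA_SLO_THRESHOLD


if __name__ == "__main__":
    score = numa_score(cpu_local=90, cpu_remote=10, mem_local=95, mem_remote=5)
    print(round(score, 3), meets_numa_slo(score))
```

Under this reading, poor locality on either resource drags the whole score down, so a task needs both to be good to meet the SLO.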

Disk I/O
Low-level metrics
➔ Service time latency: time it took the kernel to service a request to disk
➔ Wait time latency: time it took the kernel to queue and service a request to disk
➔ Queued: how much work you wanted to do
➔ Usage: how much work you actually did
SLOs
➔ Small amount of disk time when well-behaved
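Since wait time covers queueing plus servicing, and service time covers servicing alone, the difference between the two gives the time a request sat in the queue. The helper names are hypothetical, and reading a high queue share as contention from other tasks is an interpretation that the slides do not spell out.

```python
def queue_time_ms(wait_ms, service_ms):
    """Time spent queued before the device started servicing the request."""
    if service_ms > wait_ms:
        raise ValueError("service time cannot exceed wait time")
    return wait_ms - service_ms


def queue_share(wait_ms, service_ms):
    """Fraction of total wait time that was spent queued."""
    if wait_ms == 0:
        return 0.0
    return queue_time_ms(wait_ms, service_ms) / wait_ms


if __name__ == "__main__":
    print(queue_time_ms(12.0, 3.0), queue_share(12.0, 3.0))
```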

User Stories

Performance Regression
User: VM environment
User Problem: … silence ...
SLO not met: CPU
Signal: CPU queue other
Root cause: Subtle, but expensive, new periodic operation
Make it better: Give the application more debug information

Performance Variation #1
User: Flight search
User Problem: QPS variation on some tasks
SLO not met: NUMA
Signal: CPU and memory locality
Root cause: Bad NUMA allocation by infrastructure
Make it better: Improve NUMA allocation

Performance Variation #2
User: Web search
User Problem: Latency variation on some tasks
SLO not met: CPI variation
Signal: CPI from perf_events
Root cause: Bad actors co-scheduled on the machine
Make it better: Throttle or move these bad actors
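CPI is cycles per instruction, computed from two hardware counters that perf_events can report. A simple way to spot affected tasks is to compare each task's CPI against the rest of its job, sketched below. The two-standard-deviation threshold and the function names are illustrative assumptions, not the values used in production.

```python
import statistics


def cpi(cycles, instructions):
    """Cycles per instruction for one sampling window."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def cpi_outliers(cpi_by_task, num_stddevs=2.0):
    """Tasks whose CPI exceeds the job mean by more than num_stddevs deviations."""
    values = list(cpi_by_task.values())
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    stddev = statistics.pstdev(values)
    threshold = mean + num_stddevs * stddev
    return sorted(task for task, value in cpi_by_task.items() if value > threshold)


if __name__ == "__main__":
    job = {f"task-{i}": 1.0 for i in range(9)}
    job["task-9"] = 3.0
    print(cpi_outliers(job))
```

Comparing within one job works because replicas run the same binary on similar work, so a task with much higher CPI is more likely suffering interference than doing something different.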

Performance Degradation Under Load
User: Borglet
User Problem: Stuckness under heavy load
SLO not met: Disk access
Signal: Disk I/O wait time latencies
Root cause: Heavy disk operations blocking other operations
Make it better: Move disk operations away from latency sensitive operations

Future Work
➔ Signals for more resources (e.g.: memory)
➔ Using the right signals
➔ Better reporting and fleet-wide view to catch regressions across various components
Helping apps more
➔ Where are the problems?
➔ Suggest how to fix problems we can’t fix ourselves

Takeaways
➔ Containers are the main enabler: common language for performance signals
➔ More data ⇒ better decisions
➔ Slicing and dicing of data is priceless for finding patterns and baselines
➔ On by default performance monitoring: low overhead and high value
➔ Performance SLOs give power to the application and make infrastructure cheaper

You can do this too!

Questions?
Victor Marmol
vmarmol@google.com

Join our Microservices Customer Roundtable
● Friday 8am - 1pm @ Google's Toronto office
● Hear real-life experiences of two companies using GKE
● Share war stories with your peers
● Learn about future plans for microservice management from Google
● Help shape our roadmap
g.co/microservicesroundtable
† Must be able to sign digital NDA

