Confidential + Proprietary
Finding (and Fixing!) Performance Anomalies
in Large Scale Distributed Systems
Victor Marmol
vmarmol@google.com
Today
[Figure: the app as an unknown, labeled "? ? ?"]
Containers Infrastructure
Manage containers @ Google
Everything runs in a container
2B+ containers started per week
Images by Connie Zhou
You May Know Some of Our OSS Work
Let Me Contain That For You
What about at Google?
Borg
What is Borg?
Large-scale cluster management at Google with Borg
Borglet
Google’s node agent
Borglet = init + Docker + a few other things
Primary goals
➔ Talk to master
➔ Manage tasks
➔ Manage resources (containers)
How do we get to task performance management?
Dremel: Interactive Analysis of Web-Scale Datasets
Task Performance Analysis (TPA)
Our system for container-based black-box application performance analysis
Containers are the main enabler
Manage, monitor, and improve application performance
Today’s Talk
➔ How does it work
➔ User stories: stories from the front-lines!
How does it work?
Overall Flow
Collection → Aggregation → Baselines → SLOs → Enforcement
Low-Level Performance Metrics
Key: collect lots of container-based low-level metrics from the kernel
Custom kernel patches to give us even more stats and metrics
Sources
➔ cgroups
➔ /proc
➔ perf_events
➔ misc (e.g., netlink, ioctls)
[Figure: the container around the app exports low-level performance metrics and telemetry]
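As a rough illustration of container-level collection from the kernel, the sketch below parses the `cpu.stat` file that cgroup v2 exposes per cgroup. This is an assumption for illustration only: the talk does not say which cgroup version or files are read, Google's kernels carry custom patches, and the directory name in `read_container_cpu` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def read_container_cpu(cgroup_dir: str) -> Dict[str, int]:
    # cgroup_dir is a hypothetical path such as /sys/fs/cgroup/my-container
    return parse_cpu_stat((Path(cgroup_dir) / "cpu.stat").read_text())


# Sample contents in the cgroup v2 cpu.stat format (values are made up).
SAMPLE = """usage_usec 1500000
user_usec 1000000
system_usec 500000
nr_periods 100
nr_throttled 4
throttled_usec 20000
"""

stats = parse_cpu_stat(SAMPLE)
# Fraction of enforcement periods in which the container was throttled.
throttled_fraction = stats["nr_throttled"] / stats["nr_periods"]
print(stats["usage_usec"], throttled_fraction)
```

Because the stats are per cgroup, the same parser works for any container without knowing anything about the app inside it.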
Low-Level Performance Metrics
Histograms are our favorite: number, breakdown, and tail of operations
➔ CPU latencies
➔ Memory reclaim, page faults, re-faults
➔ I/O wait time and service time
Metrics collected every 1s - 10s
➔ 1s: Used for on-machine control loops
➔ 10s: Exported for off-machine analysis
Collection is very low-overhead
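To show why histograms capture the number, breakdown, and tail of operations cheaply, here is a minimal fixed-bucket latency histogram. The bucket bounds, class name, and method names are assumptions for illustration, not the format used in the talk.

```python
import bisect


class LatencyHistogram:
    """Fixed-bucket histogram: constant memory, cheap to record and export."""

    def __init__(self, bounds_ms):
        self.bounds = sorted(bounds_ms)              # inclusive upper bound per bucket
        self.counts = [0] * (len(self.bounds) + 1)   # last bucket is overflow

    def record(self, latency_ms):
        self.counts[bisect.bisect_left(self.bounds, latency_ms)] += 1

    def total(self):
        return sum(self.counts)

    def percentile_upper_bound(self, pct):
        """Upper bound of the bucket containing the pct-th percentile."""
        total = self.total()
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            # Integer comparison avoids floating-point rounding on the rank.
            if seen * 100 >= pct * total:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")


hist = LatencyHistogram([1, 2, 4, 8, 16])
for _ in range(90):
    hist.record(0.5)   # fast operations
for _ in range(9):
    hist.record(3.0)   # slower operations
hist.record(20.0)      # one tail outlier lands in the overflow bucket

p50 = hist.percentile_upper_bound(50)
p99 = hist.percentile_upper_bound(99)
print(hist.total(), p50, p99)
```

An average over these 100 samples would hide the single 20 ms operation; the bucket counts keep it visible.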
Cluster-Wide Aggregation
Cluster service that collects all metrics and exports them to Dremel
Push data for all tasks on all machines, keep them for a while
Single-handedly our most valuable resource
➔ SQL is very expressive and flexible
➔ Ability to query all that data in seconds: priceless
Best news: You can use it too! Google BigQuery
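The value of SQL over per-task metrics can be sketched with an in-memory SQLite database standing in for Dremel or BigQuery. The table name, column names, and sample rows below are hypothetical; only the idea of slicing all tasks' data with a short query comes from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_metrics ("
    "cell TEXT, job TEXT, task INTEGER, wakeup_latency_ms REAL)"
)

# Hypothetical exported samples: one row per task per collection interval.
rows = [
    ("cell-a", "search", 0, 1.0),
    ("cell-a", "search", 1, 3.0),
    ("cell-b", "search", 0, 2.0),
    ("cell-b", "search", 1, 10.0),
]
conn.executemany("INSERT INTO task_metrics VALUES (?, ?, ?, ?)", rows)

# Slice by cell and job to compare the same job across clusters.
summary = conn.execute(
    """
    SELECT cell, job, COUNT(*), AVG(wakeup_latency_ms), MAX(wakeup_latency_ms)
    FROM task_metrics
    GROUP BY cell, job
    ORDER BY cell, job
    """
).fetchall()

for row in summary:
    print(row)
```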
Performance Baselines
Cluster-level service: slice & dice data
➔ Types of tasks
➔ Distributions across replicas
➔ Per compute cluster (Borg cell)
➔ Historical trends
Gives us insights into performance trends and helps us develop performance baselines
Performance baseline: performance we can achieve given different parameters
➔ CPU: How quickly can we schedule you on the CPU
➔ Disk I/O: What disk I/O latency can we achieve
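One way to picture a baseline is a high percentile of an observed metric, computed per slice of the fleet. The sketch below groups samples by (cell, task type) and takes a nearest-rank percentile; the choice of the 95th percentile and all names are assumptions, since the talk does not say how baselines are computed.

```python
import math
from collections import defaultdict


def nearest_rank(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def build_baselines(samples, pct=95):
    """samples: iterable of (cell, task_type, latency_ms) tuples."""
    groups = defaultdict(list)
    for cell, task_type, latency_ms in samples:
        groups[(cell, task_type)].append(latency_ms)
    return {key: nearest_rank(vals, pct) for key, vals in groups.items()}


# Hypothetical wakeup-latency samples in milliseconds.
samples = [("cell-a", "batch", float(ms)) for ms in range(1, 21)]
samples += [("cell-b", "batch", 5.0), ("cell-b", "batch", 7.0)]

baselines = build_baselines(samples)
print(baselines)
```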
Baselines → SLOs
From baselines we provide performance SLOs: promise to the user
You promise to do X
➔ CPU: Use at most as much CPU as you asked for
➔ Disk I/O: Issue less than X I/Os per second
We promise to give you Y performance
➔ CPU: You will get scheduled on a CPU within Yms of requesting it
➔ Disk I/O: You will get I/O wait time of at most Yms
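The two-sided promise above can be sketched as a conditional check: the infrastructure's side (Y) only applies while the task keeps its own side (X). The field names, the use of a 99th-percentile latency, and the three result labels are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class CpuSlo:
    cpu_limit_cores: float        # X: the CPU the task asked for
    max_wakeup_latency_ms: float  # Y: the wakeup latency the infrastructure promises


def evaluate(slo: CpuSlo, cpu_used_cores: float, p99_wakeup_latency_ms: float) -> str:
    """Return 'not_applicable', 'met', or 'violated' for one observation window."""
    if cpu_used_cores > slo.cpu_limit_cores:
        # The task broke its side of the promise, so the SLO does not apply.
        return "not_applicable"
    if p99_wakeup_latency_ms <= slo.max_wakeup_latency_ms:
        return "met"
    return "violated"


slo = CpuSlo(cpu_limit_cores=2.0, max_wakeup_latency_ms=5.0)
print(evaluate(slo, cpu_used_cores=1.5, p99_wakeup_latency_ms=3.0))
print(evaluate(slo, cpu_used_cores=1.5, p99_wakeup_latency_ms=9.0))
print(evaluate(slo, cpu_used_cores=2.5, p99_wakeup_latency_ms=9.0))
```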
Enacting SLOs
Monitor SLOs closely and aggressively ensure they are met
Per-node
➔ Give more resources or better quality resources
➔ Throttle bad actors (antagonists)
Cluster-wide
➔ Ask for help!
➔ Move task to a different machine
➔ Move antagonist to a different machine
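The per-node and cluster-wide options can be read as an escalation ladder. The sketch below tries local remedies first and escalates to moving work only after repeated local attempts; that ordering, the attempt limit, and the action names are assumptions, not the policy described in the talk.

```python
from typing import Optional, Tuple


def choose_action(
    slo_violated: bool,
    antagonist: Optional[str],
    local_attempts: int,
    max_local_attempts: int = 3,
) -> Tuple[str, Optional[str]]:
    """Pick an enforcement action for one task on one machine."""
    if not slo_violated:
        return ("none", None)
    if local_attempts < max_local_attempts:
        # Per-node remedies: throttle a known antagonist, else add resources.
        if antagonist is not None:
            return ("throttle", antagonist)
        return ("grant_more_resources", None)
    # Cluster-wide remedies: ask the scheduler to move something.
    if antagonist is not None:
        return ("move_antagonist", antagonist)
    return ("move_task", None)


print(choose_action(True, "batch-job-7", local_attempts=0))
print(choose_action(True, "batch-job-7", local_attempts=3))
print(choose_action(True, None, local_attempts=3))
```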
Metrics
➔ CPU
➔ NUMA
➔ Disk I/O
CPU
Low-level metrics
➔ Wakeup latency: time between wanting to run and running
➔ Round-robin latency: how well you share CPU within your app
➔ Load: how much work you wanted to do
➔ Time per state: how much time you spent in each state (e.g., sleep, wait, run, queue)
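A stock Linux kernel offers a rough proxy for wakeup latency: `/proc/<pid>/schedstat` reports on-CPU time, run-queue wait time (both in nanoseconds), and the number of timeslices run. The sketch below derives a mean queue delay between two samples. It is an approximation under stated assumptions: it yields a mean rather than the latency distribution the talk favors, it covers one thread rather than a whole container, and the sample values are made up.

```python
def parse_schedstat(text):
    """Return (run_ns, wait_ns, timeslices) from a schedstat line."""
    run_ns, wait_ns, timeslices = (int(field) for field in text.split())
    return run_ns, wait_ns, timeslices


def mean_queue_delay_ms(before, after):
    """Mean time spent waiting on the run queue per timeslice, in ms."""
    _, wait0, slices0 = parse_schedstat(before)
    _, wait1, slices1 = parse_schedstat(after)
    ran = slices1 - slices0
    if ran <= 0:
        return 0.0  # the thread never ran in this interval
    return (wait1 - wait0) / ran / 1e6


# Two hypothetical samples taken some seconds apart.
before = "1000000000 200000000 100"
after = "1500000000 260000000 150"

delay_ms = mean_queue_delay_ms(before, after)
print(delay_ms)
```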
CPU
SLOs
➔ Wakeup latency when well-behaved
➔ CPU usage rate when well-behaved
NUMA
Low-level metrics
➔ CPU locality: how much of your CPU (and usage) was in local vs remote nodes
➔ Memory locality: how much of your memory (and accesses) was in local vs remote nodes
➔ NUMA score: resource-product of both above (0.0 - 1.0)
SLOs
➔ NUMA score of 0.85 or above given certain job shapes
The NUMA Experience
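The slide does not define "resource-product" precisely. One plausible reading, assumed here, is the product of the CPU-local fraction and the memory-local fraction, which stays in the 0.0 to 1.0 range. The function names and input counts are hypothetical.

```python
def locality(local, remote):
    """Fraction of a resource that was served from the local NUMA node."""
    total = local + remote
    return local / total if total else 1.0


def numa_score(cpu_local, cpu_remote, mem_local, mem_remote):
    # Assumed interpretation: product of CPU locality and memory locality.
    return locality(cpu_local, cpu_remote) * locality(mem_local, mem_remote)


NUMA_SLO = 0.85  # threshold quoted on the slide for certain job shapes

# Hypothetical task: 90% of CPU time and 95% of memory accesses were local.
score = numa_score(cpu_local=90, cpu_remote=10, mem_local=95, mem_remote=5)
meets_slo = score >= NUMA_SLO
print(round(score, 3), meets_slo)
```

Under this reading a task must be strong on both dimensions: 90% CPU locality paired with 90% memory locality gives 0.81, which would miss the 0.85 threshold.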
Disk I/O
Low-level metrics
➔ Service time latency: time it took kernel to service request to disk
➔ Wait time latency: time it took kernel to queue and service request to disk
➔ Queued: how much work you wanted to do
➔ Usage: how much work you actually did
SLOs
➔ Small amount of disk time when well-behaved
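The difference between service time and wait time is easiest to see on a single request: wait time covers queueing plus service, so the gap between the two is time spent queued. The timestamps and field names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IoRequest:
    submitted_ms: float   # request entered the kernel queue
    dispatched_ms: float  # request was handed to the disk
    completed_ms: float   # disk finished the request

    @property
    def service_time_ms(self) -> float:
        return self.completed_ms - self.dispatched_ms

    @property
    def wait_time_ms(self) -> float:
        # Queueing plus service, matching the definition on the slide.
        return self.completed_ms - self.submitted_ms

    @property
    def queue_time_ms(self) -> float:
        return self.wait_time_ms - self.service_time_ms


req = IoRequest(submitted_ms=0.0, dispatched_ms=6.0, completed_ms=10.0)
print(req.service_time_ms, req.wait_time_ms, req.queue_time_ms)
```

A request like this one, with a short service time but a long wait time, points at contention in the queue rather than a slow disk.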
User Stories
Performance Regression
User: VM environment
User Problem: … silence ...
SLO not met: CPU
Signal: CPU queue other
Root cause: Subtle, but expensive, new periodic operation
Make it better: Give the application more debug information
Performance Variation #1
User: Flight search
User Problem: QPS variation on some tasks
SLO not met: NUMA
Signal: CPU and memory locality
Root cause: Bad NUMA allocation by infrastructure
Make it better: Improve NUMA allocation
Performance Variation #2
User: Web search
User Problem: Latency variation on some tasks
SLO not met: CPI variation
Signal: CPI from perf_events
Root cause: Bad actors co-scheduled on the machine
Make it better: Throttle or move these bad actors
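CPI (cycles per instruction) is the ratio of two hardware counters available through perf_events. A simple way to spot affected tasks, sketched below, is to flag any task whose CPI sits well above the rest of its job. The two-standard-deviation threshold, the task names, and the counter values are assumptions for illustration.

```python
from statistics import mean, pstdev


def cpi(cycles, instructions):
    """Cycles per instruction from two hardware counter readings."""
    return cycles / instructions


def find_outliers(task_cpi, num_stddev=2.0):
    """Return tasks whose CPI exceeds the job mean by num_stddev deviations."""
    values = list(task_cpi.values())
    mu, sigma = mean(values), pstdev(values)
    threshold = mu + num_stddev * sigma
    return sorted(task for task, value in task_cpi.items() if value > threshold)


# Hypothetical job: nine healthy replicas and one suffering interference.
task_cpi = {f"task-{i}": cpi(1_000_000, 1_000_000) for i in range(9)}
task_cpi["task-9"] = cpi(3_000_000, 1_000_000)

outliers = find_outliers(task_cpi)
print(outliers)
```

Comparing replicas of the same job works because they run the same code, so a large CPI gap suggests the machine rather than the app.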
Performance Degradation Under Load
User: Borglet
User Problem: Stuckness under heavy load
SLO not met: Disk access
Signal: Disk I/O wait time latencies
Root cause: Heavy disk operations blocking other operations
Make it better: Move disk operations away from latency sensitive operations
Future Work
➔ Signals for more resources (e.g.: memory)
➔ Using the right signals
➔ Better reporting and fleet-wide view to catch regressions across various components
Helping apps more
➔ Where are the problems?
➔ Suggest how to fix problems we can’t fix ourselves
Takeaways
➔ Containers are the main enabler: common language for performance signals
➔ More data ⇒ better decisions
➔ Slicing and dicing of data is priceless for finding patterns and baselines
➔ On by default performance monitoring: low overhead and high value
➔ Performance SLOs give power to the application and make infrastructure cheaper
You can do this too!
Questions?
Join our Microservices Customer Roundtable
● Friday 8am - 1pm @ Google's Toronto office
● Hear real-life experiences of two companies using GKE
● Share war stories with your peers
● Learn about future plans for microservice management from Google
● Help shape our roadmap
g.co/microservicesroundtable
† Must be able to sign digital NDA
ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems