Yonit Gruber-Hazani
Monitoring lessons
from Waze SRE team
A little about me - Yonit Gruber-Hazani
Helpdesk
MS admin
Linux Admin
Production Manager [Linux]
Devops Engineer [Linux]
SRE [Linux]
What we will go through:
- About Waze, my team and Waze's technical structure
- Monitoring, alerting and complexity
- The new monitoring direction
- Our best practices (that work for us)
Waze in Numbers
130M Active Monthly Users
500K Maps Editors
80M API Calls Per Day
Outsmarting
traffic
together
Thousands of instances
Hundreds of autoscaling groups
2 PB of Cassandra data on ~2000 Cassandra instances
Waze SRE team
● We build and operate the Waze infrastructure
● We’re part of Google
○ Autonomous
○ Running on top of public clouds
● 21 team members across the globe
Waze Structure
Waze microservices, multi cloud
GCP:
- Java microservices on Compute Engine, App Engine, Container Engine
- Cache data layer: Memcached, Redis
- Database layer: Cassandra, Spanner, Cloud SQL
AWS:
- Java microservices on Containers, EC2, Lambda
- Cache data layer: Memcached, Redis
- Database layer: Cassandra, RDS
Spinnaker
Waze microservices
Proprietary communications protocol
Geographical Sharding
[Diagram: microservice regions, microservice datacenters, countries]
Regions shown: Israel, North America, Asia Pacific, Europe, South America
Production critical services are split into dozens of geographical shards.
● Spreads the load
● Reduces blast radius
Several logical data centers split across 3 regions
[Chart: daily driving trends, Waze US data, 2017, with 8am and 5pm marked]
In the beginning
there was Nagios
Managed monitoring API service
What did we look for?
- Managed monitoring service
- API for metrics collection, dashboard and policy creation
- Support our scale and growing monitoring needs
- Multi cloud support
We chose Stackdriver
How do you deploy monitoring at planet scale?
Baby steps
- Aggregate our proprietary protocol stats from a central location
- Created basic dashboards that show, for each microservice:
  - QPM
  - Latency
  - Failure rate
- We also added metrics from the cloud providers, GCP and AWS, to the dashboards
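As a rough illustration of the three per-microservice numbers on these dashboards (QPM, latency, failure rate), here is a minimal Python sketch. The `Request` record and `basic_stats` function are hypothetical names invented for this example; the talk does not show how the stats are actually aggregated, and mean latency is an assumption (percentiles may well be what is charted).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def basic_stats(requests, window_seconds=60.0):
    """Return QPM, mean latency and failure rate for one window of requests."""
    if not requests:
        return {"qpm": 0.0, "latency_ms": 0.0, "failure_rate": 0.0}
    # Scale the request count to a per-minute rate.
    qpm = len(requests) * 60.0 / window_seconds
    latency = mean(r.latency_ms for r in requests)
    failures = sum(1 for r in requests if not r.ok)
    return {
        "qpm": qpm,
        "latency_ms": latency,
        "failure_rate": failures / len(requests),
    }
```

Calling `basic_stats` once per microservice and per window gives one data point for each of the three dashboard charts.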
Deployment steps
Auto monitoring for each microservice of:
- Memory
- Free disk
- CPU load
Zero conf monitoring
- Data layer
- Caching
- Pubsub
- Java GC
- Apps and configs versions
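One way to picture "zero conf" monitoring is a function that expands a service name into the full set of default monitors, so nobody has to write them by hand. The sketch below is an assumption about the shape of such a generator; the check names mirror the lists above, but `default_monitors` and the dictionary layout are hypothetical and not a Stackdriver API.

```python
# Checks every microservice gets automatically (from the lists above).
BASE_CHECKS = ["memory", "free_disk", "cpu_load"]
ZERO_CONF_CHECKS = [
    "data_layer",
    "caching",
    "pubsub",
    "java_gc",
    "app_and_config_versions",
]


def default_monitors(service_name):
    """Build the default monitor definitions for one microservice."""
    return [
        {"service": service_name, "check": check, "name": f"{service_name}/{check}"}
        for check in BASE_CHECKS + ZERO_CONF_CHECKS
    ]
```

A deployment step could call `default_monitors("routing")` (a made-up service name) and submit each definition to the monitoring service's API.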
Removing the monitoring bottleneck from our team
What about alerting?
- Free disk space
- Max autoscaling groups
- Too many failed instances in group
- CPU overloaded
- Free memory
“What information consumes is rather obvious: it consumes the attention of its recipients” (Herbert A. Simon)
Complexity
What's in our Dashboards
- Server stats
- Client services
- Dependencies
- Data layer
What's this service anyway?
The new monitoring
Error budgets
● SLI - Service Level Indicator
○ Error rate
○ Latency
● SLO - Service Level Objective
○ 95% Login < 300 ms
● User Journey
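To make the SLI/SLO distinction concrete, the following sketch computes a latency SLI (the fraction of requests faster than a threshold) and compares it with the example objective above, 95% of logins under 300 ms. The function names are hypothetical, and treating "< 300 ms" as a strict inequality is an assumption.

```python
def latency_sli(latencies_ms, threshold_ms=300.0):
    """Fraction of requests that completed faster than the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one measurement")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms, target=0.95, threshold_ms=300.0):
    """True when the measured SLI reaches the SLO target."""
    return latency_sli(latencies_ms, threshold_ms) >= target
```

The SLI is the measurement; the SLO is the target it is compared against.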
Services need target SLOs that capture the performance and availability levels that, if barely met, would keep the typical customer happy.
SLO Classroom
The happiness test - Critical User Journey
“meets target SLO” ⇒ “happy customers”
“misses target SLO” ⇒ “sad customers”
SLO in Numbers
30 day error budget:
99.9% = 43.2 min
99.99% = 4.32 min
99.999% = 26 sec
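The figures above follow directly from the window length: a 30 day window has 43,200 minutes, and the budget is the fraction of that window the SLO allows to be "bad". A minimal sketch of the arithmetic, with `error_budget_seconds` as a hypothetical helper name:

```python
def error_budget_seconds(slo_percent, window_days=30):
    """Allowed 'bad' time, in seconds, for an availability SLO over a window."""
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (100.0 - slo_percent) / 100.0


# 99.9%   -> roughly 43.2 minutes
# 99.99%  -> roughly 4.32 minutes
# 99.999% -> roughly 26 seconds
for slo in (99.9, 99.99, 99.999):
    print(f"{slo}% -> {error_budget_seconds(slo):.1f} s")
```

Each extra nine divides the budget by ten, which is why a 99.999% target leaves under half a minute per month.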
Best Practices
Replace alerts with automations
- Increase max for autoscaling groups
- Add disks
- Replace instances with healthy instances
- Remove all single "pet" servers
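The idea of replacing an alert with an automation can be sketched as a dispatch table: each known alert type maps to a remediation, and only alerts without one reach a human. Everything here (the `Group` model, the alert names, the step size) is hypothetical and works on in-memory data only; a real setup would call the cloud provider's APIs instead.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """In-memory stand-in for an autoscaling group (hypothetical)."""
    name: str
    max_size: int
    instances: dict = field(default_factory=dict)  # instance id -> healthy?


def raise_max(group, step=2):
    """Remediation for 'max autoscaling reached': allow more instances."""
    group.max_size += step
    return f"raised max of {group.name} to {group.max_size}"


def replace_unhealthy(group):
    """Remediation for 'too many failed instances': swap in healthy ones."""
    bad = [iid for iid, healthy in group.instances.items() if not healthy]
    for iid in bad:
        del group.instances[iid]
        group.instances[f"{iid}-replacement"] = True
    return f"replaced {len(bad)} instance(s) in {group.name}"


REMEDIATIONS = {
    "max_autoscaling_reached": raise_max,
    "too_many_failed_instances": replace_unhealthy,
}


def handle(alert_type, group):
    """Run the automation for an alert, or return None to escalate to a human."""
    action = REMEDIATIONS.get(alert_type)
    if action is None:
        return None
    return action(group)
```

Keeping the table explicit makes it easy to see which alerts still need a person.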
Blameless Post mortems
REALLY BLAMELESS
What happened?
Why did it happen?
How was it solved?
Did the Monitoring work?
What worked well?
What didn't?
Action Items
POST POSTMORTEM
Keep a bug list of action items after each postmortem, with an owner for each bug
Periodically review
EXISTING MONITORS
Review existing monitors and update thresholds
Remove old deprecated alerts
Verify you are monitoring the updated endpoints
Update monitors on the fly
Playbooks for alerts
Add updated playbooks for each alert
Playbooks contain:
- DEV, SRE and QA owners
- Links to dashboards
- Step-by-step procedures
- Links to system designs
- Relevant data layer dashboards: Cassandra, DB, cache
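A playbook with the contents listed above can be checked mechanically for missing parts before an alert goes live. This is only a sketch under the assumption that playbooks are stored as dictionaries; the field names are invented for the example.

```python
# Fields a playbook is expected to have (names are hypothetical).
REQUIRED_FIELDS = (
    "dev_owner",
    "sre_owner",
    "qa_owner",
    "dashboards",
    "procedure",
    "system_design",
    "data_layer_dashboards",
)


def missing_fields(playbook):
    """Return the required fields that are absent or empty in a playbook."""
    return [name for name in REQUIRED_FIELDS if not playbook.get(name)]
```

An empty result means the playbook covers every item on the list; anything else names what still has to be filled in.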
Clean your signals
Noisy signals cannot be monitored
Choose your battles
Three levels of alert urgency:
1. Wake up an on-call engineer
2. Open a bug
3. Send an email for debugging and root-cause analysis
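The three urgency levels can be expressed as a small classifier. The two inputs used here (whether users are affected right now, and whether follow-up work is needed) are assumed criteria chosen for illustration; the talk names the levels but not the rules for picking one.

```python
from enum import Enum


class Urgency(Enum):
    PAGE = "wake up the on-call"
    BUG = "open a bug"
    EMAIL = "send an email"


def classify(users_affected_now, needs_follow_up):
    """Pick one of the three urgency levels for an alert (assumed rules)."""
    if users_affected_now:
        return Urgency.PAGE
    if needs_follow_up:
        return Urgency.BUG
    return Urgency.EMAIL
```

Forcing every alert through a function like this makes the default outcome an email, so paging has to be justified explicitly.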
THINGS I LEARNED FROM BEING A PARENT
Thank you!
