SignalFx
Microservices and Devs in Charge:
Why Monitoring is an Analytics Problem
Phillip Liu
phillip@signalfx.com
@SignalFx - signalfx.com
Agenda
• My background
• Microservices, a review
• Analytics approach to monitoring
• Code push side effects, an example
• Summary
My Background
Experience
[2013 - ] SignalFx - Founder, CTO, Software Engineer
Microservices; Monitoring using Analytics
[2008 - 2012] Facebook - Software Engineer, Software Architect
Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house
Analytics
[2004 - 2008] Opsware - Chief Architect, Software Engineer
Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk
[2000 - 2004] Loudcloud - Software Engineer
LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool
[1998 - 2000] Marimba - Software Engineer
Client / Server; Monitoring using SNMP, FreshWater Software
[ … ]
Microservices, a Review
A Microservices Definition
Loosely coupled service
oriented architecture with
bounded context.
Adrian Cockcroft
SignalFx’s Microservices
More than 15 internal services.
Spanning hundreds of
instances.
Across 3 AZs.
Have dependencies on
tens of external services.
Monitoring Challenges
• High iteration rate leads to shortened test
cycles
• Integration test combinations are intractable
• Catch problems during rolling deployments
• Identify upstream/downstream side effects
• e.g. backpressure
• Identify brownouts before the customer does
• etc.
Analytics Approach to Monitoring
Measure
Store
Analyze
Detect
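The four stages above can be sketched end to end. This is only an illustrative, in-memory toy and not SignalFx's implementation: the `MetricStore` class, the `request.latency_ms` metric name, and the static threshold are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


class MetricStore:
    """Hypothetical in-memory stand-in for a time-series store."""

    def __init__(self):
        # (metric, host) -> list of (timestamp, value)
        self._series = defaultdict(list)

    def record(self, metric, host, timestamp, value):
        # Measure + Store: append one datapoint to its series.
        self._series[(metric, host)].append((timestamp, value))

    def window(self, metric, start, end):
        # All values for a metric, across hosts, with start <= timestamp < end.
        return [
            value
            for (name, _host), points in self._series.items()
            if name == metric
            for timestamp, value in points
            if start <= timestamp < end
        ]


def analyze(store, metric, start, end):
    # Analyze: reduce raw datapoints to a single number (here, the mean).
    values = store.window(metric, start, end)
    return mean(values) if values else None


def detect(value, threshold):
    # Detect: fire when the analyzed value exceeds a static threshold.
    return value is not None and value > threshold


store = MetricStore()
for t, latency_ms in enumerate([20, 22, 21, 95, 110]):
    store.record("request.latency_ms", "host-1", t, latency_ms)

earlier = analyze(store, "request.latency_ms", 0, 3)
recent = analyze(store, "request.latency_ms", 3, 5)
print(detect(earlier, 50), detect(recent, 50))
```

A real system would replace the static threshold with a computed baseline, which is the point of the code push example later in the deck.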
Examples
Monitoring at SignalFx
• We use SignalFx to monitor SignalFx
• CollectD for OS and Docker metrics on all VMs
• Yammer metrics for all Java app servers
• Custom logger to count exception types
• All metrics are sent to an analytics service
• Each service deploys at its own cadence
• Push to lab, then canary in prod, then rest of tier
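The "custom logger to count exception types" idea can be illustrated with a short sketch. The slides describe Java app servers; the version below is a Python analog using the standard `logging` module, and the handler name and the `metadata_api` logger name are hypothetical.

```python
import logging
from collections import Counter


class ExceptionCountingHandler(logging.Handler):
    """Counts logged exceptions by class name (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # exc_info is a (type, value, traceback) tuple when an exception was logged.
        if record.exc_info and record.exc_info[0] is not None:
            self.counts[record.exc_info[0].__name__] += 1


logger = logging.getLogger("metadata_api")  # hypothetical service name
logger.propagate = False  # keep the example quiet on stderr
handler = ExceptionCountingHandler()
logger.addHandler(handler)

for payload in ["1", "x", "y"]:
    try:
        int(payload)
    except ValueError:
        logger.exception("bad payload")

try:
    {}["missing"]
except KeyError:
    logger.exception("lookup failed")

print(dict(handler.counts))
```

In practice the counts would be reported periodically as one metric per exception type, so they can be summed and compared like any other time series.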
Code Push Side Effects
Push a canary instance. The Metadata API
dashboard shows a healthy tier.
However, the upstream UI dashboard
showed an unusual number of timeouts.
In search of root cause.
Always safe to start by looking at exception counts.
Can’t derive much from all the noise.
Sum the # of exceptions to create a single signal.
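As a rough sketch of this step, per-host and per-type exception counts can be collapsed into one value per timestamp. The tuple layout and the host names below are assumptions made for illustration; the exception type names echo the ones mentioned in this deck.

```python
from collections import defaultdict


def sum_signal(datapoints):
    """Collapse (timestamp, host, exception_type, count) tuples into one
    total per timestamp."""
    totals = defaultdict(int)
    for timestamp, _host, _exception_type, count in datapoints:
        totals[timestamp] += count
    return dict(sorted(totals.items()))


datapoints = [
    (0, "analytics-1", "TimeoutException", 2),
    (0, "ui-1", "IOException", 1),
    (60, "analytics-1", "InvalidObjectException", 40),
    (60, "ui-1", "TimeoutException", 5),
]

total = sum_signal(datapoints)
print(total)
```

Summing throws away the per-type detail on purpose: the goal at this stage is a single line on a chart whose shape can be compared over time.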
Compare sum with time-shifted sum from a day ago.
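One way to picture the time-shift comparison: for each timestamp, look up the value from exactly one day earlier and compute the ratio. This is a minimal sketch over a plain dictionary, assuming aligned timestamps in seconds; the function name and sample values are made up.

```python
DAY = 24 * 60 * 60  # seconds


def compare_with_timeshift(signal, shift=DAY):
    """signal maps timestamp -> value. Returns timestamp -> (now, then, ratio)
    for every point that has a counterpart `shift` seconds earlier."""
    comparison = {}
    for timestamp, now in signal.items():
        then = signal.get(timestamp - shift)
        if then is None:
            continue  # no data a day ago, nothing to compare against
        ratio = now / then if then else float("inf")
        comparison[timestamp] = (now, then, ratio)
    return comparison


signal = {0: 10, 60: 12, DAY: 11, DAY + 60: 48}
comparison = compare_with_timeshift(signal)
print(comparison)
```

Using yesterday's value as the baseline accounts for daily traffic patterns without having to pick a fixed threshold.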
Look at an outlier host - an Analytics
service host.
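Finding the outlier host can be sketched as comparing each host's count against the median across the population. The slides do not say which outlier method was used, so the median-times-factor rule, the `factor` default, and the host names here are assumptions.

```python
from statistics import median


def outlier_hosts(counts_by_host, factor=5.0):
    """Return hosts whose count exceeds `factor` times the population median."""
    typical = median(counts_by_host.values())
    # Guard against a zero median so quiet populations do not flag every host.
    limit = factor * max(typical, 1)
    return sorted(host for host, count in counts_by_host.items() if count > limit)


counts = {"analytics-1": 3, "analytics-2": 212, "metadata-1": 4, "ui-1": 2}
print(outlier_hosts(counts))
```

The median is used rather than the mean because a single noisy host would otherwise drag the baseline up and hide itself.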
java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies
    at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_79]
    …
Looking at the Analytics service's logs revealed
the source of the problem.
• Analytics across multiple microservices reduced the
time to identify the problem. Time from push to
resolution was ~15 min
• Service instrumentation helped narrow down the
root cause
• The discovery allowed us to create a detector that uses
analytics to notify us of similar problems in the future
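A detector of the kind described in the last bullet might look like the sketch below: it fires when the summed exception signal stays well above its day-ago baseline for several consecutive points. The ratio, the `min_points` requirement, and the function name are assumptions for illustration, not the actual SignalFx detector.

```python
def detector(current, baseline, ratio=3.0, min_points=2):
    """Return the index at which the alert fires, or None.

    Fires once `current` has exceeded `ratio` times `baseline` for
    `min_points` consecutive samples."""
    streak = 0
    for index, (now, then) in enumerate(zip(current, baseline)):
        if now > ratio * max(then, 1):
            streak += 1
            if streak >= min_points:
                return index
        else:
            streak = 0  # a single normal sample resets the streak
    return None


baseline = [10, 12, 11, 9, 10]   # same signal, one day earlier
current = [11, 40, 12, 45, 50]   # today, after a code push

fired_at = detector(current, baseline)
print(fired_at)
```

Requiring consecutive points trades a little detection latency for fewer alerts on one-off spikes.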
Other Examples
• A customer started dropping data because they
reverted to an unsupported API
• Compare tsdb write throughput of two different
write strategies
• Create per-service capacity reports
• Identify memory usage patterns across our
Analytics service
• Create a detector for every previously uncaught
error condition, as postmortem output
Summary
• Measure and store as many metrics and events as
possible
• Use data analytics techniques to
• Identify problems
• Chase down root cause
• Create analytics-based detectors to notify you of
recurrence
Thank You!
Phillip Liu
phillip@signalfx.com
WE’RE HIRING
jobs@signalfx.com
@SignalFx - signalfx.com
