SignalFx
Microservices and Devs in Charge:
Why Monitoring is an Analytics Problem
Phillip Liu
phillip@signalfx.com
@SignalFx - signalfx.com
Agenda
• My background
• Microservices, a review
• Analytics approach to monitoring
• Code push side effects, an example
• Summary
My Background
Experience
[2013 - ] SignalFx - Founder, CTO, Software Engineer
Microservices; Monitoring using Analytics
[2008 - 2012] Facebook - Software Engineer, Software Architect
Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house
Analytics
[2004 - 2008] Opsware - Chief Architect, Software Engineer
Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk
[2000 - 2004] Loudcloud - Software Engineer
LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool
[1998 - 2000] Marimba - Software Engineer
Client / Server; Monitoring using SNMP, FreshWater Software
[ … ]
Microservices, a Review
A Microservices Definition
Loosely coupled service
oriented architecture with
bounded context.
Adrian Cockcroft
SignalFx’s Microservices
More than 15 internal services.
Spanning hundreds of
instances.
Across 3 AZs.
Have dependencies on
tens of external services.
Monitoring Challenges
• High iteration rate leads to shortened test
cycles
• Integration test combinations are intractable
• Catch problems during rolling deployments
• Identify upstream/downstream side effects
• e.g. backpressure
• Identify brownouts before the customer does
• etc.
Analytics Approach to Monitoring
Measure
Store
Analyze
Detect
Examples
Monitoring at SignalFx
• We use SignalFx to monitor SignalFx
• CollectD for OS and Docker metrics on all VMs
• Yammer metrics for all Java app servers
• Custom logger to count exception types
• All metrics are sent to an analytics service
• Each service deploys at its own cadence
• Push to lab, then canary in prod, then rest of tier
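As an illustration of the "custom logger to count exception types" idea, here is a minimal Python sketch. It is not the SignalFx implementation (the deck describes Java app servers); the handler class, the logger name, and the sample errors are all hypothetical. It only shows how a log handler can turn logged exceptions into per-type counters that could then be reported as metrics.

```python
import logging
from collections import Counter


class ExceptionCountingHandler(logging.Handler):
    """Logging handler that tallies logged exceptions by class name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # exc_info is populated when the caller uses logger.exception(...)
        # or passes exc_info=True.
        if record.exc_info and record.exc_info[0] is not None:
            self.counts[record.exc_info[0].__name__] += 1


logger = logging.getLogger("metadata-api")  # hypothetical service name
logger.propagate = False                    # keep the demo quiet on stderr
handler = ExceptionCountingHandler()
logger.addHandler(handler)

# Simulate a few failures of different types.
for payload in ("1", "x", "y"):
    try:
        int(payload)
    except ValueError:
        logger.exception("bad payload")

try:
    {}["missing"]
except KeyError:
    logger.exception("missing key")

print(dict(handler.counts))
```

In a real service the counters would be flushed periodically to the metrics pipeline rather than printed.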
Code Push Side Effects
Pushed a canary instance; the Metadata API
dashboard showed a healthy tier.
Code Push Side Effects
However, the upstream UI dashboard
showed an unusual number of timeouts.
Code Push Side Effects
In search of root cause.
Always safe to start by looking at exception counts.
Can’t derive much from all the noise.
Code Push Side Effects
Sum the # of exceptions to create a single signal.
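A rough sketch of this step in Python, assuming per-host exception counts arrive as (host, minute, count) samples. The host names and numbers are made up for illustration; the slides do not show the actual query.

```python
from collections import defaultdict

# (host, minute, exception_count) samples; all values are invented.
samples = [
    ("host-a", 0, 2), ("host-b", 0, 1), ("host-c", 0, 0),
    ("host-a", 1, 3), ("host-b", 1, 1), ("host-c", 1, 9),
]


def sum_across_hosts(samples):
    """Collapse many per-host series into one total per time bucket."""
    total = defaultdict(int)
    for _host, minute, count in samples:
        total[minute] += count
    return dict(total)


total_signal = sum_across_hosts(samples)
print(total_signal)
```

Collapsing the noisy per-host lines into one series makes a tier-wide jump visible at a glance, at the cost of hiding which host caused it.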
Code Push Side Effects
Compare sum with time-shifted sum from a day ago.
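One way to express the day-over-day comparison, sketched in Python. The function name, the minute-resolution buckets, and the sample values are assumptions for illustration only.

```python
DAY = 24 * 60  # one day, in minutes


def timeshift_ratio(series, shift):
    """Ratio of each point to the point `shift` buckets earlier.

    Points with no baseline, or with a zero baseline, are skipped.
    """
    ratios = {}
    for t, value in series.items():
        baseline = series.get(t - shift)
        if baseline:
            ratios[t] = value / baseline
    return ratios


# Summed exception counts keyed by minute; values are invented.
series = {0: 10, 1: 12, DAY: 11, DAY + 1: 48}
ratios = timeshift_ratio(series, DAY)
print(ratios)
```

A ratio near 1 means today looks like yesterday; a ratio of 4 at the same time of day is a strong hint that something changed.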
Code Push Side Effects
Look at an outlier host - an Analytics
service host.
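Finding the outlier host can be approximated by comparing each host against the median of the tier. This Python sketch uses a simple "more than N times the median" rule; the rule, the factor, and the host names are assumptions, not the method shown in the talk.

```python
from statistics import median


def outlier_hosts(counts_by_host, factor=5.0):
    """Return hosts whose exception count exceeds factor x the median."""
    typical = median(counts_by_host.values())
    # Guard against a zero median so quiet tiers still get a usable threshold.
    threshold = factor * max(typical, 1)
    return sorted(h for h, c in counts_by_host.items() if c > threshold)


# Exception counts per host over the same window; values are invented.
counts = {
    "metadata-1": 2,
    "metadata-2": 3,
    "analytics-1": 4,
    "analytics-2": 87,
    "ui-1": 1,
}
print(outlier_hosts(counts))
```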
Code Push Side Effects
java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies
    at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_79]
    …
Looking at the Analytics service's logs revealed
the source of the problem.
Code Push Side Effects
• Analytics across multiple microservices reduced the
time to identify the problem. Push to resolution
took ~15 min
• Service instrumentation helped narrow down the
root cause
• The discovery allowed us to create a detector using
analytics to notify us of similar problems in the future
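A detector built from the steps in this example could look roughly like the Python sketch below: alert when the summed exception count stays well above the same time yesterday for several consecutive points. The thresholds, the function name, and the data are hypothetical, and this is not the SignalFx detector itself.

```python
def detect(current, baseline, ratio=2.0, min_points=3):
    """Return the index at which the alert fires, or None.

    Fires once `current` exceeds `ratio` x `baseline` for `min_points`
    consecutive samples; a single spike resets nothing but itself.
    """
    streak = 0
    for i, (now, before) in enumerate(zip(current, baseline)):
        # Treat a zero baseline as 1 so the comparison stays meaningful.
        if now > ratio * max(before, 1):
            streak += 1
            if streak >= min_points:
                return i
        else:
            streak = 0
    return None


# Summed exception counts per minute; values are invented.
today     = [10, 12, 30, 11, 40, 45, 50, 48]
yesterday = [10, 11, 10, 12, 11, 10, 12, 11]

fired_at = detect(today, yesterday)
print(fired_at)
```

Requiring several consecutive points trades a little detection latency for fewer alerts on one-off spikes, such as the isolated jump at index 2 above.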
Other Examples
• A customer started dropping data because they
reverted to an unsupported API
• Compare tsdb write throughput of two different
write strategies
• Create per-service capacity reports
• Identify memory usage patterns across our
Analytics service
• Create a detector for every previously uncaught
error condition - postmortem output
Summary
• Measure and Store as many metrics and events as
possible
• Use data analytics techniques to
• Identify problems
• Chase down root cause
• Create analytics-based detectors to notify you of
recurrence
Thank You!
Phillip Liu
phillip@signalfx.com
WE’RE HIRING
jobs@signalfx.com
@SignalFx - signalfx.com

