MONITORING MICROSERVICES
APRIL 26, 2017
Rich Schofield
rschofield@signalfx.com
Jamison Clouthier
jamison@signalfx.com
A LITTLE BACKGROUND
SignalFx is an advanced monitoring & alerting system for cloud apps, delivered as a SaaS solution.
FOUNDERS
Karthik Rau, CEO
VMware, VP Products
Loudcloud/Opsware, Products
Phillip Liu, CTO
Facebook, Lead Architect
Loudcloud/Opsware, Chief Architect
SELECT CUSTOMERS | INVESTORS
SIGNALFX HIGH LEVEL ARCHITECTURE
[Diagram: DATAPOINTS handled by MICROSERVICES: INGEST, MESSAGEBUS, TIME SERIES DATABASE, ANALYTICS, METADATA, NOTIFICATION. Rollup resolutions: 1s / 5s, 1m, 1h, …; retention labels: 8d, 32d, 384d; 1s data: 16m (in memory).]
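The diagram lists several rollup resolutions (1s / 5s, 1m, 1h) without saying how rollups are produced. Below is a minimal Python sketch, assuming simple time-bucketed aggregation: each bucket keeps count, sum, min and max, which is enough to merge fine-grained buckets into coarser ones later. The class and function names are hypothetical and this is not SignalFx's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class Rollup:
    """Summary of all datapoints that fall into one time bucket."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "Rollup") -> None:
        # Combining summaries is exact because we kept sum and count, not a mean.
        self.count += other.count
        self.total += other.total
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)

    @property
    def mean(self) -> float:
        return self.total / self.count


def roll_up(points: Iterable[Tuple[int, float]], resolution_s: int) -> Dict[int, Rollup]:
    """Bucket (timestamp_seconds, value) pairs, keyed by bucket start time."""
    buckets: Dict[int, Rollup] = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % resolution_s, Rollup()).add(value)
    return buckets


def coarsen(buckets: Dict[int, Rollup], resolution_s: int) -> Dict[int, Rollup]:
    """Merge existing buckets into coarser ones, e.g. 1m rollups into 1h rollups."""
    coarse: Dict[int, Rollup] = {}
    for start, rollup in buckets.items():
        coarse.setdefault(start - start % resolution_s, Rollup()).merge(rollup)
    return coarse


if __name__ == "__main__":
    raw = [(t, float(t % 60)) for t in range(120)]  # 120 one-second datapoints
    per_minute = roll_up(raw, 60)
    per_hour = coarsen(per_minute, 3600)
    for start, r in sorted(per_minute.items()):
        print("1m", start, r.count, r.mean, r.minimum, r.maximum)
    print("1h", per_hour[0].count, per_hour[0].mean)
```

Storing sum and count rather than a mean is what keeps `coarsen` exact: averaging the means of buckets with different counts would weight sparse buckets incorrectly.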
BENEFITS OF MICROSERVICES ARCHITECTURE
Application modules released independently
• Enables agile development per team
• Canary deployments for testing
• Containers enable rapid, automated deployment & rollback
Scale in/out with service load
• Respond quickly to increased data flow or usage
• Optimize infrastructure costs as load falls
Technology flexibility
• Easier to upgrade components or change platforms
MONITORING A MICROSERVICE APPLICATION
Real-time metrics and analytics
• Detect issues and trends quickly
Apply context: history, related events, service metadata
• Metrics without context are just numbers!
Tag-based monitoring for elastic/ephemeral services
• Adapt automatically to monitor services as they scale in/out
Shared, self-service access across all groups
• App developers, tech leads, support, management
• Avoid different tools for different teams
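To make tag-based monitoring concrete, here is a minimal sketch in which a monitoring rule selects time series by dimension values instead of by host name, so instances added or removed by scaling need no configuration change. The `TimeSeries` type and the `service`/`env` tag names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping


@dataclass
class TimeSeries:
    metric: str
    dimensions: Dict[str, str]
    latest: float


def select(inventory: List[TimeSeries], metric: str,
           filters: Mapping[str, str]) -> List[TimeSeries]:
    """Return every series for `metric` whose dimensions match all `filters`.

    The rule names tags, not hosts, so series from instances added by a
    scale-out match as soon as they report, and series from removed
    instances simply drop out of the inventory.
    """
    return [
        ts for ts in inventory
        if ts.metric == metric
        and all(ts.dimensions.get(k) == v for k, v in filters.items())
    ]


if __name__ == "__main__":
    rule = {"service": "ingest", "env": "prod"}  # hypothetical tag names
    inventory = [
        TimeSeries("cpu.idle", {"host": "i-1", "service": "ingest", "env": "prod"}, 40.0),
        TimeSeries("cpu.idle", {"host": "i-2", "service": "ingest", "env": "prod"}, 35.0),
        TimeSeries("cpu.idle", {"host": "i-9", "service": "query", "env": "prod"}, 80.0),
    ]
    print(len(select(inventory, "cpu.idle", rule)))  # 2
    # Scale out: a new ingest instance starts reporting; the rule is unchanged.
    inventory.append(
        TimeSeries("cpu.idle", {"host": "i-3", "service": "ingest", "env": "prod"}, 50.0))
    print(len(select(inventory, "cpu.idle", rule)))  # 3
```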
ISSUES WITH TRADITIONAL MONITORING
CHALLENGE: Noisy, reactive monitoring
• Too many alerts fire at once for a cluster-wide problem
• Is the machine down because we scaled down the cluster, or because we had a real problem?
• Do we even care if a single node is down?
• Component-specific monitoring configurations require constant maintenance in ephemeral/elastic environments
What matters? Where to start?
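One way to quiet the alert storm described above is to evaluate a single cluster-level condition instead of one check per node. The sketch below assumes the alerting system knows each cluster's desired size (for example from an autoscaler) and fires only when the fraction of reporting nodes drops below a threshold. The function name and the 80% default are illustrative assumptions.

```python
from typing import Optional, Set


def cluster_alert(reporting: Set[str], desired_count: int,
                  min_healthy_fraction: float = 0.8) -> Optional[str]:
    """Return one alert message for the whole cluster, or None.

    `reporting` is the set of node IDs that sent a heartbeat in the last
    interval; `desired_count` is the size the cluster is supposed to be.
    A deliberate scale-down lowers desired_count, so removed nodes do not
    alert, and losing a single node stays under the threshold.
    """
    if desired_count <= 0:
        return None
    healthy_fraction = len(reporting) / desired_count
    if healthy_fraction < min_healthy_fraction:
        return (f"cluster degraded: {len(reporting)}/{desired_count} nodes reporting "
                f"({healthy_fraction:.0%} < {min_healthy_fraction:.0%})")
    return None


if __name__ == "__main__":
    nodes = {f"node-{i}" for i in range(10)}
    print(cluster_alert(nodes - {"node-3"}, desired_count=10))  # one node down: None
    print(cluster_alert({f"node-{i}" for i in range(6)}, desired_count=6))  # scaled down: None
    print(cluster_alert({f"node-{i}" for i in range(5)}, desired_count=10))  # real problem: alert
```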
THE MODERN MONITORING LANDSCAPE
APM: Performance testing, pre-flight. Luxury of time.
METRICS: Streaming metrics, aggregated, in-flight. Real time matters.
LOGS: Black box recorder, post-flight. Luxury of time.
LET’S TALK METRICS
Metric name: cpu.idle
Metric value: 27
Metric type: gauge
Timestamp: 1234567
Dimensions: host = relic47df, datacenter = sjc1, env = prod, …
Dimensions allow you to filter, aggregate, and compare across sources.
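A hypothetical sketch of the datapoint fields above as a Python dataclass, plus a helper that uses one dimension to aggregate across sources (here, mean `cpu.idle` per `datacenter`). The second and third hosts and the `dc2` datacenter in the usage example are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Datapoint:
    metric: str                 # metric name, e.g. "cpu.idle"
    value: float                # metric value
    metric_type: str            # e.g. "gauge"
    timestamp: int
    dimensions: Dict[str, str]  # e.g. host, datacenter, env


def mean_by(points: List[Datapoint], metric: str, group_key: str) -> Dict[str, float]:
    """Average `metric` across sources, grouped by the value of one dimension."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for p in points:
        if p.metric == metric and group_key in p.dimensions:
            groups[p.dimensions[group_key]].append(p.value)
    return {k: mean(v) for k, v in groups.items()}


if __name__ == "__main__":
    points = [
        Datapoint("cpu.idle", 27.0, "gauge", 1234567,
                  {"host": "relic47df", "datacenter": "sjc1", "env": "prod"}),
        Datapoint("cpu.idle", 33.0, "gauge", 1234567,
                  {"host": "host-b", "datacenter": "sjc1", "env": "prod"}),
        Datapoint("cpu.idle", 80.0, "gauge", 1234567,
                  {"host": "host-c", "datacenter": "dc2", "env": "prod"}),
    ]
    print(mean_by(points, "cpu.idle", "datacenter"))  # {'sjc1': 30.0, 'dc2': 80.0}
```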
METRICS IN A TIME SERIES
10:15:02  {"gauge": [{"metric": "cpu.idle", "dimensions": {"host": "hostname123", "datacenter": "snc"}, "value": 249}]}
10:15:03  {"gauge": [{"metric": "cpu.idle", "dimensions": {"host": "hostname123", "datacenter": "snc"}, "value": 230}]}
10:15:04  {"gauge": [{"metric": "cpu.idle", "dimensions": {"host": "hostname123", "datacenter": "snc"}, "value": 202}]}
10:15:05  {"gauge": [{"metric": "cpu.idle", "dimensions": {"host": "hostname123", "datacenter": "snc"}, "value": 284}]}
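The four messages above differ only in `value`; the metric name and dimensions stay fixed, which is what ties them into one time series. A minimal sketch that builds messages in the slide's JSON shape and derives a series key from each one. The helper names are hypothetical, and nothing here sends data to SignalFx or any other service.

```python
import json
from typing import Dict, Tuple


def gauge_message(metric: str, dimensions: Dict[str, str], value: float) -> str:
    """Serialize one gauge datapoint in the shape shown on the slide."""
    body = {"gauge": [{"metric": metric, "dimensions": dimensions, "value": value}]}
    return json.dumps(body)


def series_key(message: str) -> Tuple[str, Tuple[Tuple[str, str], ...]]:
    """Identify a message's time series: metric name plus sorted dimensions."""
    point = json.loads(message)["gauge"][0]
    return point["metric"], tuple(sorted(point["dimensions"].items()))


if __name__ == "__main__":
    dims = {"host": "hostname123", "datacenter": "snc"}
    messages = [gauge_message("cpu.idle", dims, v) for v in (249, 230, 202, 284)]
    for m in messages:
        print(m)
    # All four share one key, so they form a single time series.
    print(len({series_key(m) for m in messages}))
```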
USING TIME SERIES ANALYTICS TO CORRELATE AND IDENTIFY PATTERNS
How well load balanced is this 8-node Kafka cluster?
• Create a signal to represent the cluster's load-balancing effectiveness, computed within seconds
• Compare the signal against historical patterns and alert on anomalous patterns
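The slide does not say how the load-balancing signal is computed. As an assumption, the sketch below uses the coefficient of variation (population standard deviation divided by mean) of per-broker throughput, where 0.0 means perfectly even load, and flags the current value with a simple z-score against its own history. Both choices, the threshold of 3, and the sample numbers are illustrative rather than SignalFx's method.

```python
from statistics import mean, pstdev, stdev
from typing import Sequence


def imbalance(per_broker_rates: Sequence[float]) -> float:
    """Coefficient of variation of per-broker load: 0.0 means perfectly even."""
    avg = mean(per_broker_rates)
    return pstdev(per_broker_rates) / avg if avg else 0.0


def is_anomalous(current: float, history: Sequence[float], z_threshold: float = 3.0) -> bool:
    """True if `current` is more than z_threshold std devs above the historical mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    sd = stdev(history)
    if sd == 0:
        return current > mean(history)  # flat history: any increase is unusual
    return (current - mean(history)) / sd > z_threshold


if __name__ == "__main__":
    history = [0.05, 0.04, 0.06, 0.05, 0.05]  # hypothetical past imbalance values
    even = [100.0] * 8                         # hypothetical throughput, 8 brokers
    skewed = [100.0] * 7 + [300.0]             # one hot broker
    print(imbalance(even), is_anomalous(imbalance(even), history))
    print(round(imbalance(skewed), 3), is_anomalous(imbalance(skewed), history))
```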
GUIDED TRIAGE
[Chart: INFRASTRUCTURE and SERVICE series with a PRODUCT LAUNCH EVENT overlaid. CORRELATE EVENTS TO MONITORING: IS THIS A PROBLEM?]
ANALYSIS
1. PRODUCT LAUNCH (EVENT RECORDED)
2. TRAFFIC SPIKE
3. TRANSIENT?
4. IMPACT ON EXISTING CUSTOMERS?
5. IS IT LEVELING OUT?
6. RAM ISSUE?
7. STORAGE ISSUE?
8. BUS BACKED UP?
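Steps 1 to 5 of the triage above can be partly automated once the launch is recorded as an event. A minimal sketch, assuming a single traffic metric sampled at a fixed interval and an index marking the event: it computes the pre-event baseline, checks for a spike, whether traffic has returned to baseline (transient), and whether the latest samples are stable (leveling out). Steps 6 to 8 would repeat similar comparisons on memory, storage and message-bus metrics. All names and thresholds are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class TriageResult:
    baseline: float
    peak: float
    spiked: bool        # step 2: did traffic jump after the event?
    transient: bool     # step 3: are recent samples back near the baseline?
    leveling_out: bool  # step 5: are recent samples stable?


def triage(series: Sequence[float], event_index: int, spike_ratio: float = 1.5,
           tail: int = 3, tolerance: float = 0.2) -> TriageResult:
    """Compare a metric before and after a recorded event such as a product launch."""
    before, after = series[:event_index], series[event_index:]
    if not before or not after:
        raise ValueError("need data on both sides of the event")
    baseline = mean(before)
    peak = max(after)
    recent = after[-tail:]
    recent_mean = mean(recent)
    return TriageResult(
        baseline=baseline,
        peak=peak,
        spiked=peak >= spike_ratio * baseline,
        transient=abs(recent_mean - baseline) <= tolerance * baseline,
        leveling_out=(max(recent) - min(recent)) <= tolerance * recent_mean,
    )


if __name__ == "__main__":
    # Hypothetical requests/s sampled once a minute; the launch event is at index 4.
    traffic = [5, 5, 5, 5, 18, 16, 12, 10, 10, 10]
    print(triage(traffic, event_index=4))
```

In the usage example the spike has settled at a new, higher level: not transient, but leveling out, which would point the triage toward capacity rather than a fault.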
JOURNEY TO METRICS-BASED MONITORING
PHASE 0: Health checks and logs
PHASE 1: Small internal metrics system
PHASE 2: Build out a scalable, highly available metrics system
PHASE 3: Build out more sophisticated analytics
From individual component checks to proactive management of service-wide performance
DEMO
THANK YOU!
jamison@signalfx.com
rschofield@signalfx.com
SIGN UP FOR A TRIAL AT: signalfx.com

Microservices meetup April 2017
