Monitoring Node.js Microservices on CloudFoundry with
Open Source Tools and a Shoestring Budget
Tony Erwin, aerwin@us.ibm.com
Agenda
• Introduction to Bluemix UI & Architecture
• Importance of Monitoring w/ Microservices
• Overview of Monitoring Architecture
• Using Monitoring Data
• Building Your Own Monitoring System
• Synthetic Measurements
Bluemix UI
• Front-end to IBM’s open cloud Bluemix offering
• Lets users view and manage CF resources, containers,
virtual servers, user accounts, billing/usage, etc.
• Runs on top of Bluemix PaaS Layer (Cloud Foundry)
[Screenshots: Dashboard, Catalog, Resource Details, and more]
Bluemix UI Architecture
• Migrated from a
monolithic to a
microservice
architecture over
the last couple of
years
• Composed of 25+
Node.js apps
deployed to Cloud
Foundry
• See talk from
earlier this week
for more details
– To Kill a Monolith:
Slaying the Demons
of a Monolith with
Node.js
Microservices on
CloudFoundry
[Architecture diagram. Labels: Bluemix UI (Client); Proxy; Home, Catalog, …, Dashboard, Pricing, Orgs/Spaces; Common Monitoring Framework; Session Store; NoSQL DB; Backend APIs (CF, Containers, VMs, BSS, MCCP, etc.); Bluemix PaaS; Cloud Foundry]
Importance of Monitoring
Importance of Monitoring
• Root cause analysis when a problem occurs
– Bluemix UI is the most visible part of the platform and acts as a “canary in the mine shaft”
for the whole platform
– When a critical event or outage occurs, it often starts with reports like:
• “Can’t login to console”
• “Console doesn’t work…”
• “Console is slow…”
– When this happens in the middle of the night, my team is regularly the first to get paged
through PagerDuty
• Being able to quickly find root cause is a matter of self-preservation
– Console behavior is often (but not always!) a symptom of something going on elsewhere
(like CF is having problems, networking is down, etc.)
• Auto-detection of problems
– Ideally, we want to find and fix problems before a user hits them
– Example: Send a PagerDuty alert when the error rate for a given API goes above a threshold
• Tracking against performance and quality targets
– Can’t meet goals for something you can’t measure over time
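
The auto-detection example above (alert when an API's error rate crosses a threshold) could be sketched roughly as follows. This is a minimal illustration, not the Bluemix UI implementation: the class name `ErrorRateMonitor`, the options `windowMs`, `threshold`, and `minRequests`, and the choice to count HTTP 5xx responses as errors are all assumptions.

```javascript
'use strict';

// Hypothetical sketch: names, window size, and threshold are assumptions,
// not details taken from the slides.
class ErrorRateMonitor {
  constructor({ windowMs, threshold, minRequests }) {
    this.windowMs = windowMs;       // how far back to look
    this.threshold = threshold;     // e.g. 0.25 means 25% errors
    this.minRequests = minRequests; // avoid alerting on tiny samples
    this.samples = [];              // { time, isError }
  }

  // Record one finished request; status codes >= 500 count as errors here.
  record(statusCode, time = Date.now()) {
    this.samples.push({ time, isError: statusCode >= 500 });
  }

  // Drop samples older than the window and return the error rate (0..1).
  errorRate(now = Date.now()) {
    const cutoff = now - this.windowMs;
    this.samples = this.samples.filter((s) => s.time >= cutoff);
    if (this.samples.length === 0) return 0;
    const errors = this.samples.filter((s) => s.isError).length;
    return errors / this.samples.length;
  }

  // True when there is enough traffic and the rate exceeds the threshold.
  shouldAlert(now = Date.now()) {
    const rate = this.errorRate(now);
    return this.samples.length >= this.minRequests && rate > this.threshold;
  }
}
```

A caller would invoke `record()` for each completed request and check `shouldAlert()` periodically, sending the page or chat message only when it returns true. The `minRequests` guard is one way to keep a single failed call on a quiet API from waking someone up.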
What to Monitor?
• Metrics we were especially interested in:
– Data for every inbound/outbound request for every microservice
• Response time
• HTTP response code
• Etc.
– Memory usage, CPU usage, uptime, and crashes for every instance of every microservice
– General health of ourselves and dependencies
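
As a rough illustration of capturing response time and HTTP response code per request, the sketch below wraps a plain Node.js `(req, res)` handler. The function name `withRequestMetrics`, the record field names, and the `publish` callback are hypothetical. The only real API relied on is the `'finish'` event that `http.ServerResponse` emits once a response has been sent.

```javascript
'use strict';

// Wrap a Node.js (req, res) handler so every request yields one metric record.
// `publish` is a hypothetical callback (for example, a message-bus publish).
// `now` is injectable so the timing can be tested deterministically.
function withRequestMetrics(handler, publish, now = Date.now) {
  return function monitoredHandler(req, res) {
    const start = now();
    // 'finish' fires on http.ServerResponse once the response has been sent.
    res.once('finish', () => {
      const record = {
        method: req.method,
        url: req.url,
        statusCode: res.statusCode,
        responseTimeMs: now() - start,
      };
      try {
        publish(record);
      } catch (err) {
        // Intentionally ignored: monitoring must never break a request.
      }
    });
    return handler(req, res);
  };
}
```

With the core `http` module this could be used as `http.createServer(withRequestMetrics(handler, publish))`. Swallowing publish errors is a deliberate choice in this sketch so that a monitoring failure cannot affect user traffic.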
Monitoring Architecture
Monitoring Architecture
[Monitoring architecture diagram. Labels: Bluemix UI (Client); Proxy; App 1 … App N; MQTT; Monitor Storage; InfluxDB; Monitor Alerts; PagerDuty, Slack, etc.; Space Scanner; Backend APIs (CF, Containers, VMs, BSS, MCCP, etc.); Cloud Foundry]
Monitoring Components
• Each microservice bound to an MQTT service (which happens to be provided by the IBM Internet of Things
service)
• Each microservice adds middleware (private npm module) that publishes inbound / outbound request data to
MQTT in a “fire and forget” manner
– Also supports a general “publish” function to send arbitrary metrics to MQTT (e.g., overall system health, number of times we
retrieve JSON from Redis cache instead of API, etc.)
• Storage microservice:
– Subscribes to the same queue, does some massaging of the data (such as tagging with URL “category”), and writes to
InfluxDB
• Alerts microservice:
– Subscribes to the same queue, aggregates the inputs over the last X minutes, and sends alerts (like Slack, PagerDuty, etc.)
• Scanner microservice:
– Calls CF APIs every 60 seconds to get memory usage, CPU usage, uptime, and crash data for each app instance
– Publishes the data to MQTT
• Grafana dashboards display data from data series in InfluxDB
• Details app is deployed that can pull data from InfluxDB to complement Grafana:
– Shows details of all of the requests in tabular format
– Provides capabilities to make special queries against the InfluxDB data
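
The "massaging" step in the storage microservice (tagging each request with a URL category before writing to InfluxDB) might look something like the sketch below. The prefix-to-category rules, the measurement name `requests`, and the record field names are invented for illustration. The output follows InfluxDB line protocol (`measurement,tags fields timestamp`), which is one way to write points to InfluxDB; the slides do not say which write path the real service uses.

```javascript
'use strict';

// Hypothetical URL-prefix to category rules; the real rules are not in the slides.
const CATEGORY_RULES = [
  { prefix: '/v2/', category: 'CF' },
  { prefix: '/oauth/', category: 'UAA' },
  { prefix: '/containers/', category: 'Containers' },
];

// Return the category of the first matching rule, or 'other'.
function categorize(url) {
  const rule = CATEGORY_RULES.find((r) => url.startsWith(r.prefix));
  return rule ? rule.category : 'other';
}

// Line protocol requires commas, equals signs, and spaces in tag values
// to be escaped with a backslash.
function escapeTag(value) {
  return String(value).replace(/[,= ]/g, '\\$&');
}

// Convert one request record into a single line-protocol point.
function toLineProtocol(record) {
  const tags = [
    `app=${escapeTag(record.app)}`,
    `category=${escapeTag(categorize(record.url))}`,
  ].join(',');
  const fields = [
    `responseTimeMs=${record.responseTimeMs}`, // float field
    `statusCode=${record.statusCode}i`,        // trailing "i" marks an integer
  ].join(',');
  // Line protocol timestamps default to nanoseconds; append six zeros to ms.
  const timestampNs = `${record.timeMs}000000`;
  return `requests,${tags} ${fields} ${timestampNs}`;
}
```

Storing the category as a tag rather than a field is a design assumption here: tags are indexed in InfluxDB, which suits the kind of per-category breakdown a dashboard would query.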
Using Monitoring Data
Grafana Dashboards
• Grafana
dashboards used
to visualize data
over time for any
microservice
• Data includes:
– Total requests
– Response time
(mean, median,
90th percentile)
– Error rate
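
For reference, the dashboard statistics listed above can be computed from raw request records along these lines. This is an in-memory sketch with assumed field names (`responseTimeMs`, `statusCode`); in the system described here these aggregates would come from queries against the time-series data rather than application code. The percentile uses the nearest-rank method, so the median of an even-sized sample is the lower of the two middle values.

```javascript
'use strict';

// Nearest-rank percentile over an ascending-sorted array of numbers.
function percentile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize request records: total count, mean, median, 90th percentile,
// and error rate (status codes >= 500 are treated as errors, an assumption).
function summarize(requests) {
  const times = requests.map((r) => r.responseTimeMs).sort((a, b) => a - b);
  const total = times.length;
  const mean = total === 0 ? NaN : times.reduce((sum, t) => sum + t, 0) / total;
  const errors = requests.filter((r) => r.statusCode >= 500).length;
  return {
    total,
    mean,
    median: percentile(times, 50),
    p90: percentile(times, 90),
    errorRate: total === 0 ? 0 : errors / total,
  };
}
```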
Identifying a Problem in Grafana
• Like a
cardiologist
reading an
echocardiogram,
we’ve gotten
good at
identifying
anomalies in
these charts
• Data to left
shows a recent
“outage” where
error rates and
response times
spiked for a
period of time
Root Cause Analysis
• We can dive into more detailed data to do root cause analysis
• In chart below, response time is broken down by “category” (e.g., CF, UAA,
Containers, etc.)
• We can see timeouts in a large number of components, indicating a broader
systemic issue
Details View
• Can drill down and get tabular view with aggregated details about the
requests making up a chart
• Can drill down again to see list of individual requests (with timestamps) as well as get more
detailed statistics on individual URLs
Wall of Shame
• Building on the details view from the previous page,
we can build walls of “shame” to help drive
improvements
– Show the 10 slowest API calls made to/from a specific
microservice that have been called at least 1000 times
during the last 24 hours
– Show the top 10 requests with the most error responses
that are invoked at least X times over an arbitrary time
period
– Etc.
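
A "slowest calls" wall of this kind boils down to a group, filter, sort, and limit operation. The sketch below does it in memory over an array of request records; the function name `slowestCalls`, its options, and the record fields are assumptions, and ranking by mean response time is one possible definition of "slowest". The talk describes querying InfluxDB for this, so treat the code as an illustration of the logic only.

```javascript
'use strict';

// Group request records by URL, keep URLs called at least `minCalls` times,
// and return the `limit` slowest by mean response time (slowest first).
function slowestCalls(records, { minCalls, limit }) {
  const byUrl = new Map();
  for (const { url, responseTimeMs } of records) {
    const entry = byUrl.get(url) || { url, calls: 0, totalMs: 0 };
    entry.calls += 1;
    entry.totalMs += responseTimeMs;
    byUrl.set(url, entry);
  }
  return [...byUrl.values()]
    .filter((e) => e.calls >= minCalls)
    .map((e) => ({ url: e.url, calls: e.calls, meanMs: e.totalMs / e.calls }))
    .sort((a, b) => b.meanMs - a.meanMs)
    .slice(0, limit);
}
```

For the first example in the list, the call would be `slowestCalls(last24Hours, { minCalls: 1000, limit: 10 })`, where `last24Hours` is a hypothetical array of records already restricted to the time range. The minimum-calls filter keeps a rarely used but slow endpoint from dominating the wall.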
Memory, CPU Usage, Crashes
• Another important set of data includes memory, CPU usage, and crashes for all instances of
all microservices
• Chart below shows a major CPU usage issue we found in a dev system, so we were able to fix
it before it found its way to production
Building Your Own Monitoring System
Node Application Metrics (appmetrics)
• Had planned on publishing some of my monitoring code,
but in prep for CF Summit learned of the appmetrics
project being driven by some fellow IBMers
• Shares much in common with the middleware I
mentioned earlier that publishes metrics to MQTT, but
goes even deeper to provide additional performance
insights
• Fully open source
– https://github.com/RuntimeTools/appmetrics
• Proves yet again that IBM is a big place :)
Default Capabilities and MQTT
• Sends data to MQTT, meaning you can subscribe to updates
• Provides an Event API which allows:
– custom triggers based on the monitoring data
– publication of custom events
• This would be enough to support other pieces of the Bluemix UI monitoring system (like the
storage service or the alerts service)
App Metrics – Default Capabilities
Data Storage
• Can be configured to store data in:
– Elasticsearch
• https://github.com/RuntimeTools/appmetrics-elk
– StatsD
• https://github.com/RuntimeTools/appmetrics-statsd
• No support for InfluxDB yet, but I’ve suggested
to the team they should add it
Collecting Synthetic Data
Collecting Synthetic Data
• Monitoring discussed so far only
paints a picture of the server side
• It’s also important to get a
perspective from the client
• Continuously run scripts that
leverage Sitespeed.io
(https://www.sitespeed.io/) to load
the major pages of the product
• Collects data such as perf score,
first visual change, speed index,
etc. and stores in Graphite
– Grafana dashboards built to allow us
to visualize the data
– Scripts can run from multiple
geographic locations
The End
Questions?
Tony Erwin
Email: aerwin@us.ibm.com
Twitter: @tonyerwin
See also presentation from earlier this week:
To Kill a Monolith: Slaying the Demons of a Monolith
with Node.js Microservices on CloudFoundry
(http://sched.co/AJmh)

