The Art of Container Monitoring
Derek Chen
2016.9.22
Copyright 2016 Trend Micro Inc.
About me
• DevOps Engineer at Trend Micro
– Agile transformation
– Micro service and cloud service
– Docker integration
– Monitoring system development
• Automate all the things
• Make everything smoother
• Find me at derekhound@gmail.com
Why monitoring?
• We want to know when things go wrong
• We want to know when things aren’t quite right
• We want to know about problems in advance
Microscope? Magnifier? Telescope?
Blackbox vs Whitebox
• Blackbox
– Requires no participation of the monitored system
– Observes external functionality, “what the user sees”
– End to end test
• Whitebox
– Collects data internally provided by the target system
– Has more granular information about the system
– Can provide warning of problems before they occur
Blackbox tests measure…
• Can you ping the server?
• Can you fetch a page from the server?
• Does it have the correct contents?
Whitebox monitoring collects…
• Network statistics (packets/bytes sent/received)
• System load (CPU, memory, disk usage)
• Application statistics (connection count, queries per second)
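The blackbox questions above (can the page be fetched, and does it have the correct contents) can be sketched as a small probe. This is only an illustrative sketch: the function names `blackbox_check` and `http_fetch` are made up for this example, and the fetcher is passed in so the check can be exercised without a real server.

```python
from typing import Callable, Tuple
from urllib.request import urlopen


def http_fetch(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Fetch a URL with the standard library and return (status, body)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def blackbox_check(
    fetch: Callable[[str], Tuple[int, str]], url: str, expected: str
) -> Tuple[bool, str]:
    """Probe a page from the outside, the way a user would see it.

    The monitored system does not participate: we only look at whether
    the page can be fetched and whether it contains the expected text.
    """
    try:
        status, body = fetch(url)
    except OSError as exc:  # connection refused, DNS failure, HTTP error, ...
        return False, f"unreachable: {exc}"
    if status != 200:
        return False, f"unexpected status {status}"
    if expected not in body:
        return False, "page loaded but expected text is missing"
    return True, "ok"
```

In real use the call would look like `blackbox_check(http_fetch, url, "some text")`. A whitebox check, by contrast, would read counters that the service itself exposes.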
Goal
[diagram] Monitoring Framework: the Operating System, Application, and Business Logic emit Metrics, Events, and Logs to a Router, which forwards them to Destinations (Store, Visualize, Alert).
• Cover metrics, events, and logs
• Allow us to easily visualize the state of our environment
• Provide contextual and useful notifications
• Focus on whitebox monitoring instead of blackbox monitoring
[image] https://www.artofmonitoring.com
Why Open Source Software Monitoring?
• The practice of system administration is changing quickly
– DevOps
– Infrastructure as Code
– Visualization
– Cloud Platform
• Commercial solutions can’t keep up!
[image] http://devops.com/2014/05/05/meet-infrastructure-code
Monolithic
• If all-in-one gets the job done, then great
• Good for smaller scale, non-tech-focused companies
Modular
• DevOps requires flexibility and innovation
• Good for tech-driven and ops-focused companies
Metric
We’ll rely most heavily on metrics to help us understand
what’s going on in our environment.
Metric
• Provides a dynamic, real-time picture of the state of your infrastructure
[diagram] Metrics from the Operating System, Application, and Business Logic pass through three stages: Collect, Store, Visualize.
Pull Mode
A central collector periodically requests metrics from each monitored system.
[diagram] Monitoring scrapes metrics from the Service on Server #1, Server #2, and Server #3.
Push Mode
Metrics are periodically sent by each monitored system to a central collector.
[diagram] The Agent on Server #1, Server #2, and Server #3 pushes metrics to Monitoring.
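The difference between the two modes is mostly about who owns the schedule. Below is a minimal in-memory sketch: the class names `PullCollector`, `PushCollector`, and `Agent` are hypothetical, and plain function calls stand in for the network.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]
Sample = Tuple[str, str, float]  # (source, metric name, value)


class PullCollector:
    """Pull mode: the collector knows its targets and asks each one for metrics."""

    def __init__(self) -> None:
        self.targets: Dict[str, Callable[[], Metrics]] = {}
        self.samples: List[Sample] = []

    def register(self, name: str, scrape: Callable[[], Metrics]) -> None:
        # The collector must discover and be able to reach every target.
        self.targets[name] = scrape

    def scrape_all(self) -> None:
        # Called periodically by the collector's own scheduler.
        for name, scrape in self.targets.items():
            for metric, value in scrape().items():
                self.samples.append((name, metric, value))


class PushCollector:
    """Push mode: the collector only listens; it needs no list of targets."""

    def __init__(self) -> None:
        self.samples: List[Sample] = []

    def receive(self, source: str, metrics: Metrics) -> None:
        for metric, value in metrics.items():
            self.samples.append((source, metric, value))


class Agent:
    """Runs next to the monitored service and decides when to send."""

    def __init__(self, name: str, collector: PushCollector,
                 read: Callable[[], Metrics]) -> None:
        self.name = name
        self.collector = collector
        self.read = read

    def report(self) -> None:
        self.collector.receive(self.name, self.read())
```

In the pull sketch a new server is invisible until it is registered. In the push sketch it shows up on its first `report()` call.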
Collectd
The system statistics collection daemon
Host Machine OS Metric
[diagram] Collectd runs on the K8S Master and both K8S Minions of the Kubernetes Cluster, and on the Database and Elasticsearch hosts; the Monitor host runs InfluxDB and Grafana.
cAdvisor
Collects, aggregates, processes, and exports information
about running containers.
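Container OS Metric
[diagram] cAdvisor runs on the K8S Master and both K8S Minions of the Kubernetes Cluster; Heapster and the API Server are also shown inside the cluster. Outside it are the Database, Elasticsearch, and the Monitor host running InfluxDB and Grafana.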
http://127.0.0.1:4194
http://127.0.0.1:4194/metrics
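The `/metrics` URL above serves plain text in the Prometheus exposition format, one sample per line. Below is a minimal parsing sketch, assuming only the common `name{label="value"} number` shape. The sample text is made up for illustration and is not captured cAdvisor output, and `parse_metrics` is a hypothetical helper, not part of cAdvisor.

```python
import re
from typing import Dict, List, Tuple

# metric_name, optional {label="value",...}, value, optional integer timestamp
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_metrics(text: str) -> List[Sample]:
    """Turn exposition-format text into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:  # ignore anything this simple sketch cannot read
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative input only; label values here are invented.
EXAMPLE = """\
# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{id="/",name="web"} 12.5
container_memory_usage_bytes{id="/",name="web"} 1.048576e+06
"""
```

Calling `parse_metrics(EXAMPLE)` yields two samples, each with its labels as a dictionary, which is the shape a collector needs before writing to a time series store.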
Telegraf
Collects time series data from a variety of sources
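Application Metric – Pull Mode
[diagram] A single Telegraf inside the Kubernetes Cluster collects from Nginx, Postgres, ES, and Redis running on the K8S Minions. Also shown are the K8S Master (API Server), the Database, Elasticsearch, and the Monitor host running InfluxDB and Grafana.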
Application Metric – Push Mode
[diagram] Each application on the K8S Minions (Nginx, Postgres, ES, Redis) is paired with its own Telegraf, grouped into Pods. Also shown are the K8S Master (API Server), the Database, Elasticsearch, and the Monitor host running InfluxDB and Grafana.
Telegraf
• An agent written in Go
• Easy to contribute to, or to develop your own plugins
• Supported input plugins
– System: cpu, mem, net, disk, processes, etc.
– Application: docker, nginx, postgresql, redis, etc.
– Third-party API: AWS CloudWatch
• Supported output plugins
– influxdb, graphite, cloudwatch, datadog, etc.
Pull Mode
• Collector discovers services periodically
• Collector needs to talk to target services
– Overlay network, e.g. Flannel, Weave, etc.
– Forward proxy
[diagram] Monitoring scrapes metrics from the Service on Server #1, Server #2, and Server #3.
Push Mode
• Agent sends metrics as soon as it starts up
• Suitable for short-lived containers or dynamic environments
[diagram] The Agent on Server #1, Server #2, and Server #3 pushes metrics to Monitoring.
InfluxDB
Time series database stores all the metrics
• Similar scope to Graphite
• Written in Go
• No external dependencies
• Ready for billions of rows
• Several client libraries
• SQL-style queries
InfluxDB Schema
• Measurements (e.g. cpu, mem, disk, net, nginx)
• Timestamp (nanosecond precision)
• Tags (e.g. app=atom-apigw namespace=alpha)
• Fields (e.g. accepts=30925, active=1, handled=30925)
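The four schema parts above map directly onto InfluxDB's line protocol, the text format used for writing points: `measurement,tags fields timestamp`. Below is a minimal sketch of building one such line from the slide's example values. The helper name `to_line_protocol` and the timestamp are assumptions made for illustration, and the escaping covers only the common cases.

```python
from typing import Mapping, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each of the given special characters."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # trailing "i" marks an integer field
    if isinstance(value, float):
        return repr(value)
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_line_protocol(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Build one line: measurement,tag=value,... field=value,... timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted so the output is deterministic
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


line = to_line_protocol(
    "nginx",
    {"app": "atom-apigw", "namespace": "alpha"},
    {"accepts": 30925, "active": 1, "handled": 30925},
    1474502400000000000,  # arbitrary nanosecond timestamp for the example
)
# line == "nginx,app=atom-apigw,namespace=alpha accepts=30925i,active=1i,handled=30925i 1474502400000000000"
```

Tags are indexed strings that queries filter and group by, while fields hold the measured values. That is why `app` and `namespace` go in the tag set and the counters go in the field set.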
Grafana
Querying and visualizing time series and metrics
Finally! What we can see!
Less talk more demo…
Event
We’ll generally use events to let us know about changes
and occurrences in our environment.
Icinga2
A monitoring system which checks the availability of
your resources and notifies users of outages
• Originally forked from Nagios in 2009
• Independent version, Icinga2, since 2014
• Monitors everything
• Gathers status
• Collects performance data
/usr/lib/nagios/plugins/plugin_name
Any program which returns
• 0 – OK
• 1 – WARNING
• 2 – CRITICAL
Message to STDOUT
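The plugin contract above is small enough to sketch in a few lines: print one status message to STDOUT and return 0, 1, or 2 as the exit code. The plugin convention also defines 3 for UNKNOWN, which the slide leaves out. The thresholds and the names `evaluate_load` and `main` below are illustrative assumptions, not a shipped Icinga2 plugin.

```python
import os
from typing import Tuple

OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_load(load1: float, warn: float, crit: float) -> Tuple[int, str]:
    """Map a 1-minute load average to an exit code and a status message."""
    if load1 >= crit:
        return CRITICAL, f"CRITICAL - load1 is {load1:.2f}"
    if load1 >= warn:
        return WARNING, f"WARNING - load1 is {load1:.2f}"
    return OK, f"OK - load1 is {load1:.2f}"


def main() -> int:
    # os.getloadavg() is available on Unix-like systems only.
    load1 = os.getloadavg()[0]
    code, message = evaluate_load(load1, warn=4.0, crit=8.0)
    print(message)  # the message goes to STDOUT
    return code     # the exit code carries the state


# A real plugin script would finish with: sys.exit(main())
```

Because the contract is just an exit code plus a line of text, a check can be written in any language and dropped into the plugin directory.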
Event Monitoring
[diagram] Icinga2 joins InfluxDB and Grafana on the Monitor host. The monitored side shows the Kubernetes Cluster (K8S Master with API Server, K8S Minions running Nginx, Postgres, ES, and Redis), the Database, and Elasticsearch.
Host
Pod
Server Monitoring
• About changes and occurrences in our environment
– Is cpu load too high?
– Is memory running low?
– Is docker engine still alive?
External Monitoring
• End to end test from user’s perspective
• Can you connect to SSH on port 22?
• Can you browse the web page?
• Can you call the RESTful API successfully?
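The first external check above, connecting to a port, needs nothing beyond a TCP connection attempt. A minimal sketch follows; the function name `tcp_port_open` and the default timeout are assumptions for illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, name resolution failed
        return False
```

For the SSH case the call would be `tcp_port_open(host, 22)`. This only shows that something accepts connections on the port. Checking that the service behind it works needs a protocol-level probe, such as the web page and REST API checks listed above.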
Dashboard
• Provides a clear view of the current environment
Alerting
• Notifies using email, Slack, pager, etc.
• Notification escalation
Others…
• Distributed monitoring
• Reloads config without interrupting checks
Log
Logs are a subset of events. They’re often most useful
for fault diagnosis and investigation
EFK
• The EFK stack (Elasticsearch, Fluentd, Kibana) is a mature logging solution
• Fluentd: ships all of your logs to where they should go
• Elasticsearch: the main component, storing your data with high availability
• Kibana: visualizes your data so you can learn more about its value
Summary
Monitoring Service Stack
• Collect: Collectd, Telegraf, Fluentd, cAdvisor
• Store: InfluxDB, Elasticsearch
• Visualize: Grafana, Kibana
• Alert: Icinga2
Last but Not Least
• Data Flow
– Blackbox vs Whitebox
– Pull Mode vs Push Mode
• Data Content
– Operating System, Application, Business Logic
• Data Type
– Metrics, Events, Logs
[diagram] Monitoring Framework: the Operating System, Application, and Business Logic emit Metrics, Events, and Logs to a Router, which forwards them to Destinations (Store, Visualize, Alert).
[image] https://www.artofmonitoring.com
THANKS!
Any Questions?

Editor's Notes

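  • #2: Hello everyone, I'm very happy to be here today to share the art of container monitoring. Before we start, I'd like to see whether you have run into the same pains I have. Because AWS is so convenient, we are used to moving our machines onto AWS. But when a machine has a problem and you open CloudWatch, you find that none of these metrics are the ones you want to see. Have you had this pain? I have. So I put custom metrics into CloudWatch, only to find that CloudWatch charges for everything: adding a metric costs money, setting an alarm costs money, and changing monitoring from every five minutes to every minute costs money too. Have you had this pain? I have. When I ran containers on Kubernetes, I found I didn't know how to calculate the real-time total of current connections. Have you had this pain? I have. If you share these pains, I hope that after today's talk everyone will be able to tailor a monitoring system of their own.
  • #3: First, let me introduce myself. My name is Derek, and I currently work at Trend Micro in a DevOps role, mainly bringing CI/CD into the product teams. I have a strong interest in microservices and cloud services. Recently I have been working on integrating Kubernetes into our production environment and building a container monitoring system. I like to automate everything as much as possible so that everything becomes simpler and smoother. If you have questions for me, my email is here.
  • #4: In the container world, depending on the size of the system, there may be hundreds or even thousands of containers running. Should we use a microscope to observe the state of every single container, a magnifier to observe the overall state of a microservice, or a telescope to watch the whole system from a distance? It depends on whether that level of detail helps us find and fix problems easily.
  • #8: Because the environment we work in keeps changing. DevOps, which has been promoted in recent years, requires teams to do agile development and to integrate CI/CD tools. Lately there is a resounding slogan, 'Infrastructure as Code', which aims to automate all deployment work. We need to visualize our data. We have started adopting cloud platforms heavily, for example AWS, Google Cloud, and Azure. The environment keeps changing, but commercial solutions are usually not that flexible and struggle to meet our constantly changing needs.
  • #9: The picture at the upper right is a Google Nexus phone, a typical all-in-one solution. The picture at the lower right is the modular phone developed by Google, where you can swap in different modules according to your needs. When you choose open source software, which would you pick: an all-in-one solution or a modular solution? If an all-in-one solution happens to fit your needs, you are really lucky; Prometheus is representative of this type. Most of the time, though, we are not that lucky, and we have to put several pieces of software together to build our own monitoring system.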
  • #41: http://www.slideshare.net/icinga/why-favour-icinga-over-nagios-rootconf-2015?qid=13b9dfe7-0eee-45cc-b16a-fcdaa2b54b22&v=&b=&from_search=23