Amazon CloudWatch - Observability and Monitoring

AWS CloudWatch
Observability and Monitoring
Rick Hwang
rick_kyhwang@hotmail.com
2017/12/28

http://www.cwb.gov.tw/V7/observe/satellite/Sat_T.htm?type=1

https://env.healthinfo.tw/air/

CloudWatch
Overview, Event-Driven, Automation
AI / ML

Agenda
● CloudWatch Metric
● CloudWatch Dashboard
● CloudWatch Alarm
● CloudWatch Event / Rules
● CloudWatch Logs

Related AWS Services
● SNS: Simple Notification Service
● SES: Simple Email Service
● SQS: Simple Queue Service
● Lambda: Serverless
● Auto Scaling
● CloudTrail

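Questions
● How do we know the state of the system?
● Where do the system's metrics come from?
● Which layers of the system need to be known? Who needs to know? How do they find out?
● What do we do once we know? How? Proactively or reactively?
● What do 監 and 控, the two characters of 監控 (monitoring), each mean?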

How Amazon CloudWatch Works
CloudWatch Basic Concepts

[Diagram: how logs flow through CloudWatch]
EC2 Instances → Log Shipper (reads /var/log/app/*.log) → Logs → Log Groups → Log Stream A, Log Stream B, Log Stream C ... Log Stream N
Logs → Filters → Metrics → Dashboard, Alarms
Alarms → Push → SNS Topics → Lambda
Logs → Export → S3
Logs → Streaming → Amazon ES, Lambda
Filter pattern:
[ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...]
Sample log lines:
2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0
2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0
2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0
2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0
2017-06-11T08:47:01 app1 NGX 53 0 53 0 0 0
2017-06-11T08:47:01 app2 NGX 53 0 53 0 0 0
2017-06-11T08:48:01 app1 NGX 59 0 59 0 0 0
2017-06-11T08:48:01 app2 NGX 52 0 51 0 0 0
2017-06-11T08:49:01 app1 NGX 48 0 48 0 0 0
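
To make the filter step concrete, here is a minimal local sketch of what a space-delimited filter such as [ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...] does to the sample log lines. It only imitates the idea in plain Python and is not how CloudWatch implements filters; the helper names match_line and extract_metric are hypothetical.

```python
# Field names come from the filter pattern on the slide. The trailing "..."
# fields are not named there, so they are collected into "rest".
FIELDS = ["ts", "hostname", "scope", "tcp_all", "tcp_time_wait", "tcp_established"]


def match_line(line, scope="NGX"):
    """Return a dict of named fields if the line matches scope=<scope>, else None."""
    parts = line.split()
    if len(parts) < len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    if record["scope"] != scope:
        return None
    for key in ("tcp_all", "tcp_time_wait", "tcp_established"):
        record[key] = int(record[key])
    record["rest"] = parts[len(FIELDS):]
    return record


def extract_metric(lines, field="tcp_established"):
    """Collect (timestamp, hostname, value) samples for one numeric field."""
    samples = []
    for line in lines:
        record = match_line(line)
        if record is not None:
            samples.append((record["ts"], record["hostname"], record[field]))
    return samples


SAMPLE = [
    "2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0",
    "2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0",
    "2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0",
]
samples = extract_metric(SAMPLE)
```

Each matching line becomes one sample of the chosen field, which is the raw material for a metric.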

Source: AWS Summit 2016: Big Data Architectural Patterns and Best Practices

Key Points
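● Produce structured, meaningful logs
○ Structured: CSV, JSON
○ Meaningful: data that can be aggregated → sum, max, min, average, count ...
○ Can be queried with SQL
● Think about what you will need to know once the system is live, and where that data will come from
● Avoid ETL (Extract, Transform, Load) as much as possible
○ Expensive and wasteful
○ Maintenance cost
○ Communication cost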
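
As an illustration of "structured, meaningful" logs, the sketch below writes one JSON object per line with Python's standard logging module and then aggregates the numeric fields directly, with no ETL step. The field names (amount, latency_ms), the event name and the logger name are made up for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Extra numeric fields are what make the log aggregatable later.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    logger = logging.getLogger("app.structured")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("order_paid", extra={"fields": {"amount": 1200, "latency_ms": 87}})
log.info("order_paid", extra={"fields": {"amount": 300, "latency_ms": 41}})

# Because every line is valid JSON, statistics need no custom parsing.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
total_amount = sum(r["amount"] for r in records)
max_latency = max(r["latency_ms"] for r in records)
```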

CloudWatch Metrics
Every metric has a different story behind it

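Source: http://booklook.morningstar.com.tw/pdf/0139022.pdf
The indicators in a health checkup report all come from countless rounds of clinical experience (testing)
and scientific experiments (measurement, observation).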

Metric - CPU Utilization
UTC

CloudWatch Metric
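● Period: the length of time covered by each sample
○ EC2 defaults to 5m (free) and can be changed to 1m (billed separately)
○ ELB defaults to 1m
○ Custom metrics support high resolution: 1s
● Statistics: how samples are aggregated; each metric has its own default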
○ Sum
○ Average
○ Max
○ Min
○ Sample Count
● Unit: the unit of measure
○ Percent
○ Count
○ Bytes
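
The relationship between Period and Statistics can be sketched locally: raw samples are grouped into fixed-length periods, and each period is summarized as Sum, Average, Maximum, Minimum and Sample Count. This is only an illustration with made-up CPU samples, not CloudWatch's implementation.

```python
from collections import defaultdict


def aggregate(samples, period_seconds):
    """samples: iterable of (epoch_seconds, value). Returns {period_start: stats}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        start = ts - (ts % period_seconds)  # align to the start of the period
        buckets[start].append(value)
    result = {}
    for start in sorted(buckets):
        values = buckets[start]
        result[start] = {
            "Sum": sum(values),
            "Average": sum(values) / len(values),
            "Maximum": max(values),
            "Minimum": min(values),
            "SampleCount": len(values),
        }
    return result


# Hypothetical CPU samples, one every 60 seconds, in Percent.
cpu = [(0, 20.0), (60, 40.0), (120, 90.0), (180, 30.0), (240, 20.0), (300, 50.0)]
five_minute = aggregate(cpu, 300)
one_minute = aggregate(cpu, 60)
```

With a 5-minute period the first bucket averages 40%, which hides the 90% spike that the Maximum statistic or a 1-minute period still shows.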

Statistics - Long Tail
Wikipedia: Long tail
Amazon CloudWatch Update – Percentile Statistics and New Dashboard Widgets
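
A small worked example shows why percentiles matter for long-tailed data. It uses the nearest-rank definition of a percentile, which is one common definition and not necessarily the exact method CloudWatch uses; the latency values are invented.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 97 fast requests and 3 very slow ones.
latencies = [20] * 97 + [2000] * 3
average = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

Here the average is 79.4 ms, a value no request actually experienced, while p50 is 20 ms and p99 is 2000 ms, which exposes the slow tail.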

Metric Types
● Metrics Provided by AWS
● Custom Metric
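○ Upload sample data (JSON) through the AWS CLI / SDK → hard to get right, error-prone
○ Ship logs to CloudWatch Logs with awslogs or the CloudWatch Agent (new), then define a Filter to generate the Metric
■ The pipeline is longer, but it is not hard to build
■ This is the recommended approach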
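
For reference, the first option (uploading samples through the SDK) looks roughly like the sketch below, using boto3's put_metric_data. The namespace, metric name and the Hostname dimension are assumptions made for the example. The publish function needs boto3 and AWS credentials, so it is defined but not called.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit, hostname, high_resolution=False):
    """Build one entry of the MetricData list accepted by PutMetricData."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "Hostname", "Value": hostname}],  # assumed dimension
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high resolution (1 second), 60 = standard resolution.
        "StorageResolution": 1 if high_resolution else 60,
    }


def publish(namespace, data):
    """Send the data points to CloudWatch (not called in this sketch)."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    client = boto3.client("cloudwatch")
    client.put_metric_data(Namespace=namespace, MetricData=data)


datum = build_metric_datum("TcpEstablished", 47, "Count", "app1")
# publish("Custom/NGX", [datum])  # hypothetical namespace; requires credentials
```

Every caller has to get the timestamp, unit and dimensions right on each upload, which is part of why the slide calls this path error-prone.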

EC2 Metrics
Each metric reflects a different phenomenon
Amazon EC2 Metrics and Dimensions

Metric: Description
CPUUtilization: The percentage of allocated EC2 compute units that are currently in use on the instance. This metric identifies the processing
power required to run an application upon a selected instance.
To use the percentiles statistic, you must enable detailed monitoring.
Depending on the instance type, tools in your operating system can show a lower percentage than CloudWatch when the
instance is not allocated a full processor core.
Units: Percent
DiskReadOps: Completed read operations from all instance store volumes available to the instance in a specified period of time.
To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number
of seconds in that period.
Units: Count
DiskWriteOps: Completed write operations to all instance store volumes available to the instance in a specified period of time.
To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number
of seconds in that period.
Units: Count

Metric: Description
DiskReadBytes: Bytes read from all instance store volumes available to the instance.
This metric is used to determine the volume of the data the application reads from the hard disk of the instance. This can be used to
determine the speed of the application.
The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this
number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes
DiskWriteBytes: Bytes written to all instance store volumes available to the instance.
This metric is used to determine the volume of the data the application writes onto the hard disk of the instance. This can be used to
determine the speed of the application.
The number reported is the number of bytes written during the period. If you are using basic (five-minute) monitoring, you can divide this
number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes

Metric: Description
NetworkIn: The number of bytes received on all network interfaces by the instance. This metric identifies the volume of incoming network traffic to a
single instance.
The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide
this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes
NetworkOut: The number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from
a single instance.
The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this
number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes
NetworkPacketsIn: The number of packets received on all network interfaces by the instance. This metric identifies the volume of incoming traffic in terms of
the number of packets on a single instance. This metric is available for basic monitoring only.
Units: Count
Statistics: Minimum, Maximum, Average
NetworkPacketsOut: The number of packets sent out on all network interfaces by the instance. This metric identifies the volume of outgoing traffic in terms of
the number of packets on a single instance. This metric is available for basic monitoring only.
Units: Count
Statistics: Minimum, Maximum, Average
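
The "divide by 300" and "divide by 60" rules in the descriptions above are the same calculation: a per-period total divided by the period length in seconds. A minimal sketch with made-up totals:

```python
BASIC_PERIOD = 300     # basic (five-minute) monitoring, in seconds
DETAILED_PERIOD = 60   # detailed (one-minute) monitoring, in seconds


def per_second(total, period_seconds):
    """Convert a per-period total (bytes or operations) to a per-second rate."""
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    return total / period_seconds


# Hypothetical datapoints.
network_in_bytes = 6_000_000   # NetworkIn Sum over one basic period
disk_read_ops = 1_200          # DiskReadOps Sum over one detailed period

bytes_per_second = per_second(network_in_bytes, BASIC_PERIOD)
iops = per_second(disk_read_ops, DETAILED_PERIOD)
```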

EC2 Metrics
● Default period = 5 min (free)
○ Detailed Monitoring: period = 1 min ($$)
● Memory and disk usage are not provided; they have to be collected another way
○ CloudWatch Agent (released 2017-12)
○ telegraf, collectd, cacti, nagios ...

ELB Metrics
Load balancing
Elastic Load Balancing Metrics and Dimensions

Metric: Description
Latency: [HTTP listener] The total time elapsed, in seconds, from the time the load balancer sent the request to a registered instance until the
instance started to send the response headers.
[TCP listener] The total time elapsed, in seconds, for the load balancer to successfully establish a connection to a registered
instance.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Average. Use Maximum to determine whether some requests are taking substantially longer
than the average. Note that Minimum is typically not useful.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1
instance in us-west-2a have a higher latency. The average for us-west-2a has a higher value than the average for us-west-2b.
RequestCount: The number of requests completed or connections made during the specified interval (1 or 5 minutes).
[HTTP listener] The number of requests received and routed, including HTTP error responses from the registered instances.
[TCP listener] The number of connections made to the registered instances.
Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average all return 1.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that 100 requests are
sent to the load balancer. There are 60 requests sent to us-west-2a, with each instance receiving 30 requests, and 40 requests sent
to us-west-2b, with each instance receiving 20 requests. With the AvailabilityZone dimension, there is a sum of 60 requests in
us-west-2a and 40 requests in us-west-2b. With the LoadBalancerName dimension, there is a sum of 100 requests.
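
The RequestCount example can be checked with a few lines of arithmetic: the same per-instance counts give different sums depending on which dimension they are grouped by. The instance IDs below are placeholders.

```python
from collections import Counter

# (instance, availability zone, requests), following the RequestCount example.
requests = [
    ("i-a1", "us-west-2a", 30),
    ("i-a2", "us-west-2a", 30),
    ("i-b1", "us-west-2b", 20),
    ("i-b2", "us-west-2b", 20),
]


def sum_by_zone(rows):
    """Sum of requests per AvailabilityZone dimension value."""
    totals = Counter()
    for _instance, zone, count in rows:
        totals[zone] += count
    return dict(totals)


by_zone = sum_by_zone(requests)
# Grouping by LoadBalancerName covers every zone, so it is the overall sum.
load_balancer_total = sum(count for _instance, _zone, count in requests)
```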

Metric: Description
HealthyHostCount: The number of healthy instances registered with your load balancer. A newly registered instance is considered healthy after it passes
the first health check. If cross-zone load balancing is enabled, the number of healthy instances for the LoadBalancerName dimension
is calculated across all Availability Zones. Otherwise, it is calculated per Availability Zone.
Reporting criteria: There are registered instances
Statistics: The most useful statistics are Average and Maximum. These statistics are determined by the load balancer nodes. Note
that some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is
healthy.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, us-west-2a has 1 unhealthy
instance, and us-west-2b has no unhealthy instances. With the AvailabilityZone dimension, there is an average of 1 healthy and 1
unhealthy instance in us-west-2a, and an average of 2 healthy and 0 unhealthy instances in us-west-2b.
UnHealthyHostCount: The number of unhealthy instances registered with your load balancer. An instance is considered unhealthy after it exceeds the
unhealthy threshold configured for health checks. An unhealthy instance is considered healthy again after it meets the healthy
threshold configured for health checks.
Reporting criteria: There are registered instances
Statistics: The most useful statistics are Average and Minimum. These statistics are determined by the load balancer nodes. Note that
some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is
healthy.
Example: See HealthyHostCount.

Metric: Description
HTTPCode_Backend_2XX, HTTPCode_Backend_5XX: [HTTP listener] The number of HTTP response codes generated by registered instances. This count does not include any response
codes generated by the load balancer.
Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1
instance in us-west-2a result in HTTP 500 responses. The sum for us-west-2a includes these error responses, while the sum for
us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a.
HTTPCode_ELB_4XX: [HTTP listener] The number of HTTP 4XX client error codes generated by the load balancer. Client errors are generated when a
request is malformed or incomplete.
Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that client requests include a malformed
request URL. As a result, client errors would likely increase in all Availability Zones. The sum for the load balancer is the sum of the
values for the Availability Zones.
HTTPCode_ELB_5XX: [HTTP listener] The number of HTTP 5XX server error codes generated by the load balancer. This count does not include any
response codes generated by the registered instances. The metric is reported if there are no healthy instances registered to the load
balancer, or if the request rate exceeds the capacity of the instances (spillover) or the load balancer.
Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are
experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a
fills and clients receive a 503 error. If us-west-2b continues to respond normally, the sum for the load balancer equals the sum for
us-west-2a.

Metric: Description
BackendConnectionErrors: The number of connections that were not successfully established between the load balancer and the registered instances. Because
the load balancer retries the connection when there are errors, this count can exceed the request rate. Note that this count also
includes any connection errors related to health checks.
Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are
not typically useful. However, the difference between the minimum and maximum (or peak to average or average to trough) might be
useful to determine whether a load balancer node is an outlier.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that attempts to connect
to 1 instance in us-west-2a result in back-end connection errors. The sum for us-west-2a includes these connection errors, while the
sum for us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a.

Metric: Description
SpilloverCount: The total number of requests that were rejected because the surge queue is full.
[HTTP listener] The load balancer returns an HTTP 503 error code.
[TCP listener] The load balancer closes the connection.
Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are
not typically useful.
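Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are
experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer node in us-west-2a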
fills, resulting in spillover. If us-west-2b continues to respond normally, the sum for the load balancer will be the same as the sum for
us-west-2a.
SurgeQueueLength: The total number of requests that are pending routing. The load balancer queues a request if it is unable to establish a connection
with a healthy instance in order to route the request. The maximum size of the queue is 1,024. Additional requests are rejected when
the queue is full. For more information, see SpilloverCount.
Reporting criteria: There is a nonzero value.
Statistics: The most useful statistic is Maximum, because it represents the peak of queued requests. The Average statistic can be
useful in combination with Minimum and Maximum to determine the range of queued requests. Note that Sum is not useful.
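Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are
experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a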
fills, with clients likely experiencing increased response times. If this continues, the load balancer will likely have spillovers (see the
SpilloverCount metric). If us-west-2b continues to respond normally, the max for the load balancer will be the same as the max for
us-west-2a.

See: Amazon CloudWatch Metrics and Dimensions Reference
There are too many to list here ...

Metrics you need to understand
● EC2
● EBS
● ELB: CLB, ALB, NLB
○ Classic Load Balancing
○ Application Load Balancing
○ Network Load Balancing

Behind every metric
there is a story to tell.

Question and Think:
Where do the EC2 / ELB metrics come from?

CloudWatch Dashboard
Raise your vantage point and see the whole picture

CloudWatch Dashboard
● widget: line, stacked, number, text (markdown)
● auto refresh
● local timezone
○ EC2 metric is UTC
● time range
● Horizontal annotation
● Right / Left Y axis
● full screen (dark / light mode)

Tips
● Dashboards can be imported / exported as JSON
● Dashboards can be updated automatically through the API
● $3.00 per dashboard per month (ap-northeast-1)
● Time zone
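
Because a dashboard is just JSON, it can be generated in code and pushed through the API. The sketch below builds a dashboard body with one line widget (including a horizontal annotation) and one text widget, then shows where boto3's put_dashboard would be called. The instance ID, dashboard name and threshold are placeholders, and the upload function is defined but not called because it needs credentials.

```python
import json


def build_dashboard_body(instance_id, region, threshold):
    """Return a dashboard body (JSON string) with a metric widget and a text widget."""
    metric_widget = {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "EC2 CPU Utilization",
            "view": "timeSeries",
            "stacked": False,
            "region": region,
            "stat": "Average",
            "period": 300,
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
            "annotations": {"horizontal": [{"label": "Threshold", "value": threshold}]},
        },
    }
    text_widget = {
        "type": "text",
        "x": 12, "y": 0, "width": 6, "height": 6,
        "properties": {"markdown": "## Notes\nEC2 metric timestamps are in UTC."},
    }
    return json.dumps({"widgets": [metric_widget, text_widget]})


def upload(name, body):
    """Create or update the dashboard (not called in this sketch)."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    boto3.client("cloudwatch").put_dashboard(DashboardName=name, DashboardBody=body)


body = build_dashboard_body("i-0123456789abcdef0", "ap-northeast-1", 80)
# upload("ops-overview", body)  # hypothetical dashboard name; requires credentials
```

Keeping the body in version control is one way to treat the dashboard as code.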

Demo: CloudWatch Dashboard
Widgets, X/Y Axis, Annotation

CloudWatch Alarm
Event-driven, Feedback

CloudWatch Alarm
● An action triggered once a threshold is reached, for example:
○ Within five minutes
○ CPU >= 80%
○ Five times
● Action types
○ EC2 actions: reboot, stop, terminate. Usually used together with EC2 System Status.
○ SNS to:
■ SES
■ SQS
■ Lambda
■ HTTP Request

CloudWatch Alarm - Status
● ALARM: the metric is over the threshold
● INSUFFICIENT_DATA: there is not enough data to determine the state
● OK: the metric is within the threshold
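
The three states, together with the earlier threshold example (CPU >= 80%, five times within five minutes), can be sketched as a small state function. This is a deliberately simplified model for illustration and not CloudWatch's actual evaluation logic, which has more options for handling missing data.

```python
def evaluate_alarm(datapoints, threshold=80.0, evaluation_periods=5):
    """Return the alarm state for the most recent evaluation window.

    datapoints: one value per period, most recent last; None marks a
    period with no data.
    """
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods or any(v is None for v in window):
        return "INSUFFICIENT_DATA"
    if all(v >= threshold for v in window):
        return "ALARM"
    return "OK"


# Hypothetical one-minute CPU datapoints.
sustained_load = [85.0, 90.0, 81.0, 80.0, 95.0]
brief_dip = [85.0, 90.0, 70.0, 80.0, 95.0]
just_started = [85.0, 90.0]

states = [evaluate_alarm(s) for s in (sustained_load, brief_dip, just_started)]
```

Requiring all five periods to breach the threshold keeps a single spike from firing the alarm.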

Event-driven → Feedback → Automation
Source: 『自動化XXX』的陷阱 (The Pitfalls of "Automating XXX")
CW Alarm

CloudWatch Events
Rules, Cron, Scheduler

CloudWatch Events
● Event Source
○ Event Pattern: DynamoDB, EC2, Auto Scaling, RDS ... too many to list
○ Schedule
● Targets
○ Up to 5 targets per rule (fixed limit)
○ Type: Lambda, EC2, Stream, ECS, SSM, Step Functions, Pipeline, SNS, SQS ... too many to list
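Common scenarios
● EC2 preventive automation:
○ A machine that should not be shut down gets shut down: restart it automatically
○ A machine has a hardware failure: restart it automatically
○ Any action that follows a state change
● After an S3 action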
○ Action: PutObject
○ Trigger: Lambda, Put Message to SQS
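
The first scenario (restart a machine that should not have been stopped) might look like the sketch below: an event pattern that matches EC2 state-change events, and a Lambda handler that starts the instance again. The set of protected instance IDs is hypothetical, and the handler accepts an injected client so it can be exercised locally with a fake instead of real AWS calls.

```python
# Event pattern for the rule: EC2 instances entering the "stopped" state.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

# Hypothetical list of instances that must always be running.
PROTECTED_INSTANCES = {"i-0123456789abcdef0"}


def handler(event, context, ec2_client=None):
    """Start a protected instance again when it reports the stopped state."""
    detail = event.get("detail", {})
    instance_id = detail.get("instance-id")
    if detail.get("state") != "stopped" or instance_id not in PROTECTED_INSTANCES:
        return {"action": "ignored", "instance": instance_id}
    if ec2_client is None:
        import boto3  # only needed when running inside Lambda

        ec2_client = boto3.client("ec2")
    ec2_client.start_instances(InstanceIds=[instance_id])
    return {"action": "started", "instance": instance_id}


class FakeEc2:
    """Stand-in for the EC2 client so the sketch runs without AWS access."""

    def __init__(self):
        self.started = []

    def start_instances(self, InstanceIds):
        self.started.extend(InstanceIds)


fake = FakeEc2()
stopped_event = {"detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"}}
other_event = {"detail": {"instance-id": "i-0fedcba9876543210", "state": "stopped"}}
results = [handler(stopped_event, None, fake), handler(other_event, None, fake)]
```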

CloudWatch Logs
Filter, Custom Metric, Log Shipper

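CloudWatch Logs (CWL)
● Prerequisite: the EC2 instance needs the awslogs driver or the CloudWatch agent installed
○ On an ECS instance it is just an option to select
● Ship logs to CWL in real time
○ Logs can be queried directly in CWL (usable, if basic)
○ No need to worry about storage filling up or about maintaining it
○ Log rotation (retention) can be configured
● Create Custom Metrics through Filters
○ Can build Dashboards
○ Can build Alarms → Event-driven
■ To Lambda, Slack
■ ETL
■ Automation ... endless possibilities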
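
Creating a Custom Metric from a Filter can also be done through the API. The sketch below assembles the parameters for boto3's put_metric_filter, reusing the space-delimited pattern from the log pipeline slide. The log group name, filter name and metric namespace are assumptions, and the create function is defined but not called because it needs credentials.

```python
FILTER_PATTERN = "[ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...]"


def build_metric_filter(log_group, field):
    """Parameters for PutMetricFilter that publish one log field as a metric."""
    return {
        "logGroupName": log_group,
        "filterName": f"ngx-{field}",              # hypothetical naming scheme
        "filterPattern": FILTER_PATTERN,
        "metricTransformations": [
            {
                "metricName": field,
                "metricNamespace": "Custom/NGX",   # hypothetical namespace
                # "$name" publishes the value of that field from each matching line.
                "metricValue": f"${field}",
            }
        ],
    }


def create(params):
    """Create or update the metric filter (not called in this sketch)."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    boto3.client("logs").put_metric_filter(**params)


params = build_metric_filter("/app/ngx", "tcp_established")
# create(params)  # requires credentials and an existing log group
```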

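Metric
● Data obtained by sampling the target under measurement
○ Data per unit of time, for example every millisecond, every second, every minute
● The higher the sampling frequency, the more precise the data
● Audio quality (sample rate per second)
○ CD quality: 44.1 kHz
○ Studio recording: 192 kHz
● Video resolution
○ HD
○ Full HD
○ 4K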

Everything covered above can be done `as Code`

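Questions
● How do we know the state of the system?
○ Observe, Measure
● Where do the system's metrics come from?
○ Metrics are found by analyzing logs after systematic testing (System Test)
● Which layers of the system need to be known? Who needs to know?
○ Business, Application, OS/Hardware, Network
● What do we do once we know? How? Proactively or reactively?
● What do 監 and 控, the two characters of 監控 (monitoring), each mean?
○ 監: Watch
○ 控: Control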

What is Monitoring (監控)?

監控 (Monitoring)
● 監: Watch, Monitor, Observe, Measure → Dashboard
● 控: Control, Command, Handle, Manage → Console

Dashboard
Star Trek (USS Enterprise)

Console
Concert mixer

[Diagram, built up over several slides: the watch-and-control loop around Target Services / Systems]
● Target Services / Systems
● Watchers, Controllers
● Push or Pull Data (Observability, Measure)
● Dashboard => Show Something
○ Health Status
○ Sum of Biz TX
○ Sys Resources
○ …
● Events (Conditions / Thresholds)
● Console => Do Something
○ Reset or Clean Cache
○ On / Off Functions
○ Notification
○ ...
● Commands
● Feedback (Adjust Conditions / Thresholds by ML)
● Labels: 監 (watch), then 監控 (watch and control)
Observability vs Monitoring
● Measure
● Observe
● Weather bureau
○ Observability
○ Measurement
● Government
○ Monitoring
○ Alert
○ Action
○ Feedback

http://www.cwb.gov.tw/V7/observe/satellite/Sat_T.htm?type=1

Measure → Sample from Log
Observe → Metric
Feedback → Analyze, Condition, Alarm
Control → Automation (get the work done lying down)

If you cannot measure, you cannot observe.
If you cannot observe, there is no feedback.
Without feedback, you cannot control.

Logs matter.
Without structured logs or data,
you pay heavily in ETL cost and time.

Event-driven → Feedback → Automation
Source: 『自動化XXX』的陷阱 (The Pitfalls of "Automating XXX")
CW Alarm

Why CloudWatch
● Serverless Monitoring System
● Event-driven
● Programmable and automatable
● Realtime and Backup
● Monitoring Monitoring System at Netflix - 2017/05/22
● CloudWatch meets the need for "Basic Monitoring"

Source: Microservice Prerequisites

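Why not choose another monitoring tool?
● We do not want to build and babysit our own servers
● However good a monitoring system is, it is still only a cost
● A monitoring system is not Big Data
● The architecture of some solutions does not take HA into account, e.g. Prometheus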

Alarm System using Serverless
System architecture: CloudWatch + SNS + Lambda + Slack
[Diagram components: EC2, Auto Scaling, CloudWatch Alarms, CloudWatch Event (time-based), Health-Checker, SNS Topic (Info, Warning), SNS Topic (Urgent SMS), SNS-Adapter, Slack-Notifier]
[Recipients: Operators, Developers, Testers]
● Urgent: SMS, Slack
● Warning: Slack w/ tag
● Info: Slack w/o tag
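
The three severity levels amount to a small routing table, which is the kind of logic a notifier function could hold. The sketch below is an assumption-laden illustration: the function name, the "@channel" tag text, and the choice to tag Urgent messages in Slack are not specified on the slide.

```python
SLACK_TAG = "@channel"  # hypothetical mention used to tag people in Slack


def route(severity, message):
    """Return the list of (channel, text) deliveries for one alarm message."""
    level = severity.lower()
    if level == "urgent":
        # The slide says SMS and Slack; tagging in Slack here is an assumption.
        return [("sms", message), ("slack", f"{SLACK_TAG} {message}")]
    if level == "warning":
        return [("slack", f"{SLACK_TAG} {message}")]   # Slack with tag
    if level == "info":
        return [("slack", message)]                     # Slack without tag
    raise ValueError(f"unknown severity: {severity}")


deliveries = {
    level: route(level, "CPU >= 80% on app1")
    for level in ("Urgent", "Warning", "Info")
}
```

Keeping the table in one place makes it easy to review who gets interrupted and by which channel.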

CloudWatch Reporter - System Architecture
[Diagram components: Target Services, CW Metrics, CloudWatch Reporter / Alarmer, CloudWatch Event (time-based), Metric Configs (Namespace, Stats), Info / Alert Channels]
[People: Operators (on duty), Operators, Developers (On Call), Developers (development)]
[Edge labels: Loading, Read, Schedule, maintain, PR, Feature Request]

Best Practice
● Make the most of cloud SaaS, such as AWS CloudWatch or GCP Stackdriver
● Design the deployment and setup process to be configurable
● Design logs in a structured format (CSV or JSON)
● Use a Big Data solution for log queries, such as AWS Athena or GCP BigQuery
● Ship logs through a shipper (awslogs, statsd, collectd, fluentd, telegraf ...) to both
○ S3, as a backup that satisfies audit requirements
○ CloudWatch, for debugging and monitoring
● Massive log streaming data needs a queue to help
○ AWS Kinesis
○ GCP Pub/Sub

Reference - User Guide
● CloudWatch User Guide
● CloudWatch Events User Guide
● CloudWatch Logs User Guide

Reference - YouTube, Blog
● AWS re:Invent 2015: Log, Monitor and Analyze your IT with Amazon CloudWatch (DVO315)
● Amazon CloudWatch Update – Percentile Statistics and New Dashboard Widgets
● New – High-Resolution Custom Metrics and Alarms for Amazon CloudWatch
● 淺談系統監控與 CloudWatch 的應用 (A Brief Look at System Monitoring and Applying CloudWatch) - AWS User Group Taiwan
● Study Notes - CloudWatch
● SRE CH6 Monitoring Distributed Systems (監控分散式系統)
● 高品質微服務 (High-Quality Microservices) - CH6 監控 (Monitoring)
