AWS CloudWatch
Observability and Monitoring
Rick Hwang
rick_kyhwang@hotmail.com
2017/12/28
http://www.cwb.gov.tw/V7/observe/satellite/Sat_T.htm?type=1
https://env.healthinfo.tw/air/
CloudWatch
Overview, Event-Driven, Automation
AI / ML
Agenda
● CloudWatch Metric
● CloudWatch Dashboard
● CloudWatch Alarm
● CloudWatch Event / Rules
● CloudWatch Logs
Related AWS Services
● SNS: Simple Notification Service
● SES: Simple Email Service
● SQS: Simple Queue Service
● Lambda: Serverless
● Auto Scaling
● CloudTrail
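Questions
● How do we know the state of the system?
● Where do the system's metrics come from?
● Which layers of the system need to be known? Who needs to know? How do they find out?
● Once we know, what do we do, and how? Proactively or reactively?
● What do "monitor" (監) and "control" (控) each mean?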
How Amazon CloudWatch Works
CloudWatch Basic Concepts
EC2 Instances
Log Shipper
Logs
Log Groups
Log Stream A
Log Stream B
Log Stream C
Log Stream N
Alarms
Filters
[ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...]
/var/log/app/*.log
2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0
2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0
2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0
2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0
2017-06-11T08:47:01 app1 NGX 53 0 53 0 0 0
2017-06-11T08:47:01 app2 NGX 53 0 53 0 0 0
2017-06-11T08:48:01 app1 NGX 59 0 59 0 0 0
2017-06-11T08:48:01 app2 NGX 52 0 51 0 0 0
2017-06-11T08:49:01 app1 NGX 48 0 48 0 0 0
Dashboard
Metrics
S3
Amazon ES
Lambda
SNS Topics
Export
Streaming
Push
Lambda
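The filter pattern and sample log lines in the diagram above can be read as a space-delimited record with named fields. The following is a minimal local sketch of that idea, not CloudWatch's own filter engine: it names the first six columns after the pattern, keeps only `scope=NGX` lines, and sums `tcp_established` per timestamp. The function names are hypothetical, and the trailing columns stay unnamed because the slide elides them with `...`.

```python
from collections import defaultdict

# Field names copied from the filter pattern on the slide. The remaining
# columns are elided there ("..."), so they are left unnamed here.
FIELDS = ["ts", "hostname", "scope", "tcp_all", "tcp_time_wait", "tcp_established"]

SAMPLE_LOG = """\
2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0
2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0
2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0
2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0
"""


def parse_line(line):
    """Return the named fields of one line, or None if it does not match."""
    parts = line.split()
    if len(parts) < len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    if record["scope"] != "NGX":  # mirrors the scope=NGX condition
        return None
    for key in ("tcp_all", "tcp_time_wait", "tcp_established"):
        record[key] = int(record[key])
    return record


def established_by_minute(log_text):
    """Sum tcp_established across hosts for each timestamp."""
    totals = defaultdict(int)
    for line in log_text.splitlines():
        record = parse_line(line)
        if record is not None:
            totals[record["ts"]] += record["tcp_established"]
    return dict(totals)


print(established_by_minute(SAMPLE_LOG))
```

Summing per timestamp is only one possible aggregation; the same parsed records could feed a maximum or an average instead.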
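Source: AWS Summit 2016: Big Data Architectural Patterns and Best Practices
Key Points
● Produce structured, meaningful logs
○ Structured: csv, json
○ Meaningful: data that can be aggregated → sum, max, min, average, count …
○ Can be queried with SQL
● Think about what you will need to know once the system is live, and where that information will come from
● Avoid ETL (Extract, Transform, Load) wherever possible
○ It is expensive and wasteful
○ Maintenance cost
○ Communication cost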
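As an illustration of the "structured and meaningful" point, here is a small sketch that writes one JSON record per event and then aggregates a numeric field without any transform step. The field names (`latency_ms`, `status`) and both function names are assumptions made for the example, not something the slides define.

```python
import json


def make_log_line(ts, hostname, latency_ms, status):
    """Serialize one event as a single-line JSON record."""
    record = {"ts": ts, "hostname": hostname, "latency_ms": latency_ms, "status": status}
    return json.dumps(record, sort_keys=True)


def summarize(lines, field):
    """Compute count, sum, min, max and average of one numeric field."""
    values = [json.loads(line)[field] for line in lines]
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "average": sum(values) / len(values),
    }


lines = [
    make_log_line("2017-06-11T08:45:01", "app1", 120, 200),
    make_log_line("2017-06-11T08:45:02", "app2", 80, 200),
    make_log_line("2017-06-11T08:45:03", "app1", 400, 500),
]
summary = summarize(lines, "latency_ms")
print(summary)
```

Because every line is already a typed record, the aggregation needs no parsing rules beyond `json.loads`, which is the cost the slide is trying to avoid.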
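CloudWatch Metrics
Every metric has a different story behind it
Source: http://booklook.morningstar.com.tw/pdf/0139022.pdf
The indicators in a health check report all come from countless clinical experience (testing) and scientific experiments (measurement, observation).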
Metric - CPU Utilization
UTC
CloudWatch Metric
● Period: the time interval covered by each sample
○ EC2 defaults to 5m (free); can be changed to 1m (billed separately)
○ ELB defaults to 1m
○ Custom metrics support high resolution: 1s
● Statistics: how samples are aggregated; each metric has its own default
○ Sum
○ Average
○ Max
○ Min
○ Sample Count
● Unit
○ Percent
○ Count
○ Bytes
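To make Period and Statistics concrete, the sketch below groups raw samples into fixed periods and computes the five basic statistics for each one. It is a local illustration with made-up CPU readings, assuming samples are `(epoch_seconds, value)` pairs; it does not call CloudWatch.

```python
from collections import defaultdict


def aggregate(samples, period_seconds):
    """Group (epoch_seconds, value) samples by period and compute statistics."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of the period it falls in.
        buckets[ts - ts % period_seconds].append(value)
    result = {}
    for start in sorted(buckets):
        values = buckets[start]
        result[start] = {
            "SampleCount": len(values),
            "Sum": sum(values),
            "Average": sum(values) / len(values),
            "Minimum": min(values),
            "Maximum": max(values),
        }
    return result


# Hypothetical CPU readings (percent), one per minute for ten minutes.
readings = [10, 20, 90, 20, 10, 30, 30, 30, 30, 30]
samples = [(60 * i, value) for i, value in enumerate(readings)]

five_min = aggregate(samples, 300)
one_min = aggregate(samples, 60)
print(five_min[0]["Average"], five_min[0]["Maximum"])
print(one_min[120]["Average"])
```

With a 5-minute period the spike to 90 is averaged down to 30, while the 1-minute period shows it directly, which is why the choice of period and statistic matters.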
Statistics - Long Tail
Wikipedia: Long tail
Amazon CloudWatch Update – Percentile Statistics and New Dashboard Widgets
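A short worked example of why percentiles matter for long-tailed data. This sketch uses the nearest-rank definition of a percentile, which is an assumption chosen for simplicity and may not match how CloudWatch computes its percentile statistics; the latency values are invented.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need at least one value and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies: 95 fast requests and 5 very slow ones.
latencies_ms = [20] * 95 + [2000] * 5

average = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(average, p50, p99)
```

The average lands between the two groups and describes neither, while p50 and p99 show the typical request and the tail separately.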
Metric Types
● Metrics Provided by AWS
● Custom Metric
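○ Upload sampled data (json) through the AWS CLI / SDK → awkward to do and error-prone
○ Ship logs to CloudWatch Logs through awslogs or the CloudWatch Agent (new), then define a Filter to generate the Metric
■ The pipeline is longer, but it is not hard to build
■ This is the recommended approach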
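For the first option, the sketch below only builds the request that the SDK call would take, so it runs without AWS credentials. The keys follow the `PutMetricData` request shape as I understand it; the namespace, metric name and dimension are hypothetical. Assuming `boto3` is installed and configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_data(**request)`.

```python
from datetime import datetime, timezone


def build_put_metric_data_request(hostname, tcp_established, timestamp):
    """Build keyword arguments for a PutMetricData call (nothing is sent)."""
    return {
        "Namespace": "Custom/Nginx",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "TcpEstablished",  # hypothetical metric name
                "Dimensions": [{"Name": "Hostname", "Value": hostname}],
                "Timestamp": timestamp,
                "Value": float(tcp_established),
                "Unit": "Count",
                # 60 is standard resolution; 1 would request high resolution.
                "StorageResolution": 60,
            }
        ],
    }


request = build_put_metric_data_request(
    "app1", 47, datetime(2017, 6, 11, 8, 45, 1, tzinfo=timezone.utc)
)
print(request)
```

Every sample needs its own timestamp, unit and dimensions, which gives a sense of why the slide calls this route error-prone compared with a log filter.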
EC2 Metrics
Each metric reflects a different phenomenon
Amazon EC2 Metrics and Dimensions
Metric Description
CPUUtilization The percentage of allocated EC2 compute units that are currently in use on the instance. This metric identifies the processing
power required to run an application upon a selected instance.
To use the percentiles statistic, you must enable detailed monitoring.
Depending on the instance type, tools in your operating system can show a lower percentage than CloudWatch when the
instance is not allocated a full processor core.
Units: Percent
DiskReadOps Completed read operations from all instance store volumes available to the instance in a specified period of time.
To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number
of seconds in that period.
Units: Count
DiskWriteOps Completed write operations to all instance store volumes available to the instance in a specified period of time.
To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number
of seconds in that period.
Units: Count
DiskReadBytes Bytes read from all instance store volumes available to the instance.
This metric is used to determine the volume of the data the application reads from the hard disk of the instance. This can be used to
determine the speed of the application.
The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this
number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes
DiskWriteBytes Bytes written to all instance store volumes available to the instance.
This metric is used to determine the volume of the data the application writes onto the hard disk of the instance. This can be used to
determine the speed of the application.
The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this
number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes
NetworkIn The number of bytes received on all network interfaces by the instance. This metric identifies the volume of incoming network traffic to a
single instance.
The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide
this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes
NetworkOut The number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from
a single instance.
The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this
number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes
NetworkPacketsIn The number of packets received on all network interfaces by the instance. This metric identifies the volume of incoming traffic in terms of
the number of packets on a single instance. This metric is available for basic monitoring only.
Units: Count
Statistics: Minimum, Maximum, Average
NetworkPacketsOut The number of packets sent out on all network interfaces by the instance. This metric identifies the volume of outgoing traffic in terms of
the number of packets on a single instance. This metric is available for basic monitoring only.
Units: Count
Statistics: Minimum, Maximum, Average
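The descriptions above repeat the same conversion: a total reported for one period is divided by the period length (300 seconds for basic monitoring, 60 for detailed) to get a per-second rate. A small sketch with invented totals:

```python
BASIC_PERIOD_SECONDS = 300    # basic (five-minute) monitoring
DETAILED_PERIOD_SECONDS = 60  # detailed (one-minute) monitoring


def per_second(total_in_period, period_seconds):
    """Convert a per-period total (bytes or operations) into a per-second rate."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return total_in_period / period_seconds


# Hypothetical NetworkIn total of 3,000,000 bytes in one basic period.
network_in_rate = per_second(3_000_000, BASIC_PERIOD_SECONDS)

# Hypothetical DiskReadOps total of 1,200 operations in one detailed period.
read_iops = per_second(1_200, DETAILED_PERIOD_SECONDS)

print(network_in_rate, read_iops)
```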
EC2 Metrics
● Default Period = 5 min (free)
○ Detailed Monitoring: period = 1 min ($$)
● Memory and disk usage are not provided; they have to be collected another way
○ CloudWatch Agent (released 2017-12)
○ telegraf, collectd, cacti, nagios …
ELB Metrics
Load balancing
Elastic Load Balancing Metrics and Dimensions
Metric Description
Latency [HTTP listener] The total time elapsed, in seconds, from the time the load balancer sent the request to a registered instance until the
instance started to send the response headers.
[TCP listener] The total time elapsed, in seconds, for the load balancer to successfully establish a connection to a registered
instance.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Average. Use Maximum to determine whether some requests are taking substantially longer
than the average. Note that Minimum is typically not useful.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1
instance in us-west-2a have a higher latency. The average for us-west-2a has a higher value than the average for us-west-2b.
RequestCount The number of requests completed or connections made during the specified interval (1 or 5 minutes).
[HTTP listener] The number of requests received and routed, including HTTP error responses from the registered instances.
[TCP listener] The number of connections made to the registered instances.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average all return 1.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that 100 requests are
sent to the load balancer. There are 60 requests sent to us-west-2a, with each instance receiving 30 requests, and 40 requests sent
to us-west-2b, with each instance receiving 20 requests. With the AvailabilityZone dimension, there is a sum of 60 requests in
us-west-2a and 40 requests in us-west-2b. With the LoadBalancerName dimension, there is a sum of 100 requests.
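The RequestCount example above can be checked with a few lines of arithmetic. This sketch models the four instances from the example and sums requests by the two dimensions mentioned; the instance IDs are placeholders.

```python
from collections import Counter

# (availability zone, instance id, requests) taken from the example:
# 100 requests, 30 per instance in us-west-2a and 20 per instance in us-west-2b.
REQUESTS = [
    ("us-west-2a", "i-a1", 30),
    ("us-west-2a", "i-a2", 30),
    ("us-west-2b", "i-b1", 20),
    ("us-west-2b", "i-b2", 20),
]


def sum_by_availability_zone(rows):
    """Sum of requests per AvailabilityZone dimension value."""
    totals = Counter()
    for zone, _instance, count in rows:
        totals[zone] += count
    return dict(totals)


def sum_for_load_balancer(rows):
    """Sum of requests for the LoadBalancerName dimension (all zones)."""
    return sum(count for _zone, _instance, count in rows)


print(sum_by_availability_zone(REQUESTS))
print(sum_for_load_balancer(REQUESTS))
```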
HealthyHostCount The number of healthy instances registered with your load balancer. A newly registered instance is considered healthy after it passes
the first health check. If cross-zone load balancing is enabled, the number of healthy instances for the LoadBalancerName dimension
is calculated across all Availability Zones. Otherwise, it is calculated per Availability Zone.
Reporting criteria: There are registered instances
Statistics: The most useful statistics are Average and Maximum. These statistics are determined by the load balancer nodes. Note
that some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is
healthy.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, us-west-2a has 1 unhealthy
instance, and us-west-2b has no unhealthy instances. With the AvailabilityZone dimension, there is an average of 1 healthy and 1
unhealthy instance in us-west-2a, and an average of 2 healthy and 0 unhealthy instances in us-west-2b.
UnHealthyHostCount The number of unhealthy instances registered with your load balancer. An instance is considered unhealthy after it exceeds the
unhealthy threshold configured for health checks. An unhealthy instance is considered healthy again after it meets the healthy
threshold configured for health checks.
Reporting criteria: There are registered instances
Statistics: The most useful statistics are Average and Minimum. These statistics are determined by the load balancer nodes. Note that
some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is
healthy.
Example: See HealthyHostCount.
HTTPCode_Backend_2XX,
HTTPCode_Backend_3XX,
HTTPCode_Backend_4XX,
HTTPCode_Backend_5XX
[HTTP listener] The number of HTTP response codes generated by registered instances. This count does not include any response
codes generated by the load balancer.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1
instance in us-west-2a result in HTTP 500 responses. The sum for us-west-2a includes these error responses, while the sum for
us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a.
HTTPCode_ELB_4XX [HTTP listener] The number of HTTP 4XX client error codes generated by the load balancer. Client errors are generated when a
request is malformed or incomplete.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1.
Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that client requests include a malformed
request URL. As a result, client errors would likely increase in all Availability Zones. The sum for the load balancer is the sum of the
values for the Availability Zones.
HTTPCode_ELB_5XX [HTTP listener] The number of HTTP 5XX server error codes generated by the load balancer. This count does not include any
response codes generated by the registered instances. The metric is reported if there are no healthy instances registered to the load
balancer, or if the request rate exceeds the capacity of the instances (spillover) or the load balancer.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1.
Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are
experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a
fills and clients receive a 503 error. If us-west-2b continues to respond normally, the sum for the load balancer equals the sum for
us-west-2a.
BackendConnectionErrors The number of connections that were not successfully established between the load balancer and the registered instances. Because
the load balancer retries the connection when there are errors, this count can exceed the request rate. Note that this count also
includes any connection errors related to health checks.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are
not typically useful. However, the difference between the minimum and maximum (or peak to average or average to trough) might be
useful to determine whether a load balancer node is an outlier.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that attempts to connect
to 1 instance in us-west-2a result in back-end connection errors. The sum for us-west-2a includes these connection errors, while the
sum for us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a.
SpilloverCount The total number of requests that were rejected because the surge queue is full.
[HTTP listener] The load balancer returns an HTTP 503 error code.
[TCP listener] The load balancer closes the connection.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are
not typically useful.
Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are
experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer node in us-west-2a
fills, resulting in spillover. If us-west-2b continues to respond normally, the sum for the load balancer will be the same as the sum for
us-west-2a.
SurgeQueueLength The total number of requests that are pending routing. The load balancer queues a request if it is unable to establish a connection
with a healthy instance in order to route the request. The maximum size of the queue is 1,024. Additional requests are rejected when
the queue is full. For more information, see SpilloverCount.
Reporting criteria: There is a nonzero value.
Statistics: The most useful statistic is Maximum, because it represents the peak of queued requests. The Average statistic can be
useful in combination with Minimum and Maximum to determine the range of queued requests. Note that Sum is not useful.
Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are
experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a
fills, with clients likely experiencing increased response times. If this continues, the load balancer will likely have spillovers (see the
SpilloverCount metric). If us-west-2b continues to respond normally, the max for the load balancer will be the same as the max for
us-west-2a.
See: Amazon CloudWatch Metrics and Dimensions Reference
Too many to list here ...
Metrics You Need to Understand
● EC2
● EBS
● ELB: CLB, ALB, NLB
○ Classic Load Balancer
○ Application Load Balancer
○ Network Load Balancer
Every metric has a story to tell.
Question and Think:
Where do the EC2 / ELB metrics come from?
CloudWatch Dashboard
Raise your vantage point and see the whole picture
Star Trek (USS Enterprise)
Passengers
CloudWatch Dashboard
● widget: line, stacked, number, text (markdown)
● auto refresh
● local timezone
○ EC2 metric is UTC
● time range
● Horizontal annotation
● Right / Left Y axis
● full screen (dark / light mode)
● Dashboards can be imported / exported as json
● Can be updated automatically through the API
● $3.00 per dashboard per month (ap-northeast-1)
● Time zone
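Since a dashboard can be exported as JSON and updated through the API, it can also be generated. The sketch below builds a dashboard body with one line widget (including a horizontal annotation) and one markdown text widget. The layout follows the dashboard body structure as I understand it; the instance ID, title and threshold are placeholders, and nothing is sent to AWS.

```python
import json


def build_dashboard_body(instance_id, region, cpu_threshold):
    """Return a dashboard body as a JSON string (nothing is uploaded)."""
    cpu_widget = {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "CPU Utilization",
            "view": "timeSeries",
            "stacked": False,
            "region": region,
            "period": 300,
            "stat": "Average",
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
            # Horizontal annotation drawn at the alarm threshold.
            "annotations": {"horizontal": [{"label": "Threshold", "value": cpu_threshold}]},
        },
    }
    note_widget = {
        "type": "text",
        "x": 12, "y": 0, "width": 6, "height": 6,
        "properties": {"markdown": "# Notes\nEC2 metrics are reported in UTC."},
    }
    return json.dumps({"widgets": [cpu_widget, note_widget]}, indent=2)


body = build_dashboard_body("i-0123456789abcdef0", "ap-northeast-1", 80)
print(body)
```

Keeping the body in version control and regenerating it from code is one way to get the "as Code" workflow mentioned later in the deck.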
Tips
LetSSL - System Level
LetSSL - Application Level
Demo: CloudWatch Dashboard
Widgets, X/Y Axis, Annotation
CloudWatch Alarm
Event-driven, Feedback
CloudWatch Alarm
● An action triggered once a threshold is reached
○ Within five minutes
○ CPU >= 80%
○ Five times
● Action types
○ EC2 actions: reboot, stop, terminate. Usually combined with the EC2 System Status check.
○ SNS to:
■ SES
■ SQS
■ Lambda
■ HTTP Request
CloudWatch Alarm - Status
● ALARM: over threshold
● INSUFFICIENT_DATA: not enough data to determine the state
● OK
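The threshold rule and the three states can be sketched as a small function. This is a simplified model, not CloudWatch's evaluation logic: it assumes the rule means five consecutive one-minute datapoints at or above 80%, and it ignores the service's missing-data handling options.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Return the alarm state for the most recent datapoints.

    datapoints: values ordered oldest to newest; None marks a missing sample.
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(v is None for v in recent):
        return "INSUFFICIENT_DATA"
    if all(v >= threshold for v in recent):
        return "ALARM"
    return "OK"


print(evaluate_alarm([85, 90, 82, 95, 88], threshold=80, evaluation_periods=5))
print(evaluate_alarm([85, 90, 70, 95, 88], threshold=80, evaluation_periods=5))
print(evaluate_alarm([85, 90], threshold=80, evaluation_periods=5))
```

Requiring every datapoint in the window to breach the threshold is what keeps a single spike from triggering an action.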
Demo: CloudWatch Alarm
Event-driven → Feedback → Automation
Source: The Pitfalls of "Automating XXX"
CW Alarm
CloudWatch Events
Rules, Cron, Scheduler
CloudWatch Events
● Event Source
○ Event Pattern: DynamoDB, EC2, AutoScaling, RDS … (too many to list)
○ Schedule
● Targets
○ Multiple targets, up to 5 per rule (fixed)
○ Type: Lambda, EC2, Stream, ECS, SSM, Step Function, Pipeline, SNS, SQS … (too many to list)
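Common Scenarios
● Preventive automation for EC2:
○ An instance that should not be shut down gets stopped: restart it automatically
○ The underlying hardware fails: restart it automatically
○ Actions on state changes
● After an S3 action
○ Action: PutObject
○ Trigger: Lambda, Put Message to SQS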
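The first scenario (an instance is stopped when it should not be) relies on an event pattern that selects EC2 state-change events. Below is a deliberately simplified matcher that shows how such a pattern selects events. It only handles exact-value lists and nested objects; real event patterns support more operators. The instance ID is a placeholder.

```python
# Pattern for "an EC2 instance entered the stopped state".
STOPPED_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}


def matches(pattern, event):
    """Return True if every field in the pattern matches the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of allowed values.
            return False
    return True


def make_event(state):
    """Build a minimal EC2 state-change event for the example."""
    return {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": "i-0123456789abcdef0", "state": state},
    }


print(matches(STOPPED_PATTERN, make_event("stopped")))
print(matches(STOPPED_PATTERN, make_event("running")))
```

A rule with this pattern would then hand the matched event to a target, such as a Lambda function that starts the instance again.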
Demo: CloudWatch Events
CloudWatch Logs
Filter, Custom Metric, Log Shipper
EC2 Instances
Log Shipper
Logs
Log Groups
Log Stream A
Log Stream B
Log Stream C
Log Stream N
Alarms
Filters
[ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...]
/var/log/app/*.log
2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0
2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0
2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0
2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0
2017-06-11T08:47:01 app1 NGX 53 0 53 0 0 0
2017-06-11T08:47:01 app2 NGX 53 0 53 0 0 0
2017-06-11T08:48:01 app1 NGX 59 0 59 0 0 0
2017-06-11T08:48:01 app2 NGX 52 0 51 0 0 0
2017-06-11T08:49:01 app1 NGX 48 0 48 0 0 0
Dashboard
Metrics
S3
Amazon ES
Lambda
SNS Topics
Export
Streaming
Push
Lambda
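CloudWatch Logs (CWL)
● Prerequisite: the EC2 instance must have the awslogs driver or the CloudWatch agent installed
○ On ECS instances it is just an option to select
● Ship logs to CWL in real time
○ Logs can be queried directly in CWL (serviceable)
○ No need to worry about storage filling up or about maintenance
○ Log Rotation can be configured
● Build Custom Metrics through Filters
○ Can build Dashboards
○ Can build Alarms → Event-driven
■ To Lambda, Slack
■ ETL
■ Automation … endless possibilities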
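Turning a log field into a custom metric comes down to one metric filter definition. The sketch below builds the arguments such a definition would need, reusing the filter pattern from the earlier diagram. The keys follow the `PutMetricFilter` request shape as I understand it; the log group name, filter name and namespace are hypothetical, and no request is sent. Assuming `boto3` is available, the dictionary could be passed as `boto3.client("logs").put_metric_filter(**request)`.

```python
FILTER_PATTERN = "[ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...]"


def pattern_fields(pattern):
    """List the field names declared in a space-delimited filter pattern."""
    names = []
    for part in pattern.strip("[]").split(","):
        name = part.strip().split("=")[0]
        if name and name != "...":
            names.append(name)
    return names


def build_metric_filter_request(log_group, field):
    """Build keyword arguments that publish one pattern field as a metric."""
    if field not in pattern_fields(FILTER_PATTERN):
        raise ValueError(f"{field} is not declared in the filter pattern")
    return {
        "logGroupName": log_group,
        "filterName": f"ngx-{field}",            # hypothetical naming scheme
        "filterPattern": FILTER_PATTERN,
        "metricTransformations": [
            {
                "metricName": field,
                "metricNamespace": "Custom/Nginx",  # hypothetical namespace
                "metricValue": f"${field}",         # value taken from the named field
            }
        ],
    }


request = build_metric_filter_request("/app/nginx-stats", "tcp_established")
print(request)
```

Once the metric exists, the dashboards and alarms described earlier can be built on it like on any other metric.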
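Metric
● Data obtained by sampling the target being measured
○ Data per unit of time, e.g. per millisecond, per second, per minute
● The higher the sampling rate, the more precise the data
● Audio quality (sample rate per second)
○ CD quality: 44.1kHz
○ Studio recording: 192kHz
● Camera resolution
○ HD
○ Full-HD
○ 4k
Demo: CloudWatch Logs
Everything described above can be done `as Code`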
Questions
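● How do we know the state of the system?
○ Observe, Measure
● Where do the system's metrics come from?
○ Metrics are identified by analyzing logs after systematic testing (System Test)
● Which layers of the system need to be known? Who needs to know?
○ Business, Application, OS/Hardware, Network
● Once we know, what do we do, and how? Proactively or reactively?
● What do "monitor" (監) and "control" (控) each mean?
○ 監: Watch
○ 控: Control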
The Fundamental Question
What is Monitoring (監控)?
監 (watch): Watch, Monitor, Observe, Measure → Dashboard
控 (control): Control, Command, Handle, Manage → Console
Dashboard: Star Trek (USS Enterprise)
Console: concert mixer
Diagram (built up over several slides): Target Services / Systems, Watchers, Controllers
監 (watch) / 控 (control)
Push or Pull Data (Observability, Measure)
Events (Conditions / Thresholds)
Commands
Feedback (Adjust Conditions / Thresholds by ML)
Dashboard => Show Something
● Health Status
● Sum of Biz TX
● Sys Resources
● …
Console => Do Something
● Reset or Clean Cache
● On / Off Functions
● Notification
● ...
Observability vs Monitoring
● Measure (量測)
● Observe (觀測)
● Weather bureau
○ Observability
○ Measurement
● Government
○ Monitoring
○ Alert
○ Action
○ Feedback
http://www.cwb.gov.tw/V7/observe/satellite/Sat_T.htm?type=1
Measure → Sample from Log
Observe → Metric
Feedback → Analyze, Condition, Alarm
Control → Automation, hands-off operation
If you cannot measure, you cannot observe.
If you cannot observe, there is no feedback.
Without feedback, there is no control.
Logs matter.
Without structured logs or data, you pay heavily in ETL cost and time.
Event-driven → Feedback → Automation
Source: The Pitfalls of "Automating XXX"
CW Alarm
Why CloudWatch
● Serverless Monitoring System
● Event-driven
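● Programmable and automatable
● Realtime and Backup
● Monitoring Monitoring System at Netflix - 2017/05/22
● CloudWatch satisfies the need for "Basic Monitoring"
Source: Microservice Prerequisites
Why not choose another monitoring tool?
● We do not want to build and babysit our own servers
● However well built, a monitoring system is still only a cost
● A monitoring system is not Big Data
● Some solutions are not architected with HA in mind, e.g. Prometheus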
Alarm System using Serverless
EC2
CloudWatch Alarms
Operators
CloudWatch Event
(time-based)
SNS-Adapter
Slack-Notifier
SNS Topic
Info, Warning
Info
Developers
Health-Checker
Auto Scaling
SNS Topic
Urgent SMS
Warning
System architecture: CloudWatch + SNS + Lambda + Slack
Testers
● Urgent: SMS, Slack
● Warning: Slack w/ tag
● Info: Slack w/o tag
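The three severity rules above amount to a small routing table. Here is a sketch of how a notifier function might decide where a message goes. The function and channel names are hypothetical, and tagging the Slack copy of an Urgent message is an assumption, since the slide only specifies tagging for Warning and Info.

```python
# severity -> list of (channel, tag_people) pairs
ROUTES = {
    # Tagging the Slack copy of an urgent message is an assumption.
    "urgent": [("sms", False), ("slack", True)],
    "warning": [("slack", True)],
    "info": [("slack", False)],
}


def plan_notifications(severity, message):
    """Return the deliveries to make for one alarm message."""
    key = severity.lower()
    if key not in ROUTES:
        raise ValueError(f"unknown severity: {severity}")
    return [
        {"channel": channel, "tagged": tagged, "text": f"[{key.upper()}] {message}"}
        for channel, tagged in ROUTES[key]
    ]


for delivery in plan_notifications("Urgent", "CPU >= 80% on app1"):
    print(delivery)
```

Keeping the table as data makes it easy to review who gets paged for what without reading the delivery code.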
CloudWatch Reporter - System Architecture
CloudWatch
Reporter / Alarmer
CloudWatch Event
(time-based)
Info / Alert
Channels
Operators
(On Duty)
Operators
Developers
(On Call)
Metric Configs
(Namespace, Stats)
Target Services
Loading
maintain
PR
Read
CW Metrics
Schedule
maintain
Developers
development
Feature Request
Best Practice
● Make the most of cloud SaaS offerings such as AWS CloudWatch and GCP Stackdriver
● Design the deployment and setup process to be configurable
● Design logs in a structured format (csv or json)
● Use a Big Data solution for log query needs, such as AWS Athena or GCP BigQuery
● Ship logs through a shipper (awslogs, statsd, collectd, fluentd, telegraf ... ) to both
○ S3, as a backup that satisfies audit requirements
○ CloudWatch, for debugging and monitoring
● High-volume log streaming needs a queue to help
○ AWS Kinesis
○ GCP Pub/Sub
Reference - User Guide
● CloudWatch User Guide
● CloudWatch Events User Guide
● CloudWatch Logs User Guide
Reference - Youtube, Blog
● AWS re:Invent 2015: Log, Monitor and Analyze your IT with Amazon CloudWatch (DVO315)
● Amazon CloudWatch Update – Percentile Statistics and New Dashboard Widgets
● New – High-Resolution Custom Metrics and Alarms for Amazon CloudWatch
● 淺談系統監控與 CloudWatch 的應用 (An Introduction to System Monitoring and Applying CloudWatch) - AWS User Group Taiwan
● Study Notes - CloudWatch
● SRE CH6 Monitoring Distributed Systems
● 高品質微服務 (High-Quality Microservices) - CH6 Monitoring
/* End of Slide */

More Related Content

PDF
VMware Tanzu Introduction- June 11, 2020
PDF
Terraform introduction
PDF
Azure Monitoring Overview
PDF
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
PPTX
Kubernetes day 2 Operations
PPTX
An introduction to DevOps
PDF
Monitoring Kubernetes with Prometheus
PDF
TechnicalTerraformLandingZones121120229238.pdf
VMware Tanzu Introduction- June 11, 2020
Terraform introduction
Azure Monitoring Overview
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
Kubernetes day 2 Operations
An introduction to DevOps
Monitoring Kubernetes with Prometheus
TechnicalTerraformLandingZones121120229238.pdf

What's hot (20)

PPTX
DevOps Overview
PPTX
PPTX
Introduction to Docker - 2017
PDF
VMware Tanzu Introduction
PPTX
Azure migration
PDF
Microservices for Application Modernisation
PPTX
Virtualization Vs. Containers
PDF
Creating AWS infrastructure using Terraform
PPTX
DevOps 101 - an Introduction to DevOps
PPTX
Azure DevOps
PDF
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
PDF
Azure DevOps Tutorial | Developing CI/ CD Pipelines On Azure | Edureka
PPTX
An Intrudction to OpenStack 2017
PPTX
Microsoft Azure - Introduction
PPTX
Envoy and Kafka
PDF
Infrastructure & System Monitoring using Prometheus
PPTX
cloud_foundation_on_vxrail_vcf_pnp_licensing_guide.pptx
PPTX
Terraform Basics
PPTX
Drive business outcomes using Azure Devops
PDF
Microsoft Windows Server 2022 Overview
DevOps Overview
Introduction to Docker - 2017
VMware Tanzu Introduction
Azure migration
Microservices for Application Modernisation
Virtualization Vs. Containers
Creating AWS infrastructure using Terraform
DevOps 101 - an Introduction to DevOps
Azure DevOps
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
Azure DevOps Tutorial | Developing CI/ CD Pipelines On Azure | Edureka
An Intrudction to OpenStack 2017
Microsoft Azure - Introduction
Envoy and Kafka
Infrastructure & System Monitoring using Prometheus
cloud_foundation_on_vxrail_vcf_pnp_licensing_guide.pptx
Terraform Basics
Drive business outcomes using Azure Devops
Microsoft Windows Server 2022 Overview
Ad

Similar to Amazon CloudWatch - Observability and Monitoring (20)

PPTX
SessionBased.pptx
PDF
Cloudwatch: Monitoring your Services with Metrics and Alarms
PDF
Monitoring on Amazon AWS Cloud
PDF
Cloudwatch: Monitoring your AWS services with Metrics and Alarms
KEY
Cloudwatch - The In's and Out's
PPTX
What is AWS Cloud Watch
PDF
Network visibility and control using industry standard sFlow telemetry
PDF
Observability foundations in dynamically evolving architectures
PPTX
AWS Cloud Watch
PPTX
The Art of Container Monitoring
PDF
Multi Layer Monitoring V1
PDF
The present and future of serverless observability
PPTX
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
PDF
Webinar Monitoring in era of cloud computing
PPTX
Cloud Monitoring 101 - The Five Key Elements to Effective Cloud Monitoring
PDF
Path Solutions Network Monitor V4 Glossy
PDF
Performance Monitoring: Understanding Your Scylla Cluster
PDF
IBM SevOne for network and systems monitoring
PDF
Kks sre book_ch10
PPTX
AcademyCloudFoundations_Module_10 (2).pptx
SessionBased.pptx
Cloudwatch: Monitoring your Services with Metrics and Alarms
Monitoring on Amazon AWS Cloud
Cloudwatch: Monitoring your AWS services with Metrics and Alarms
Cloudwatch - The In's and Out's
What is AWS Cloud Watch
Network visibility and control using industry standard sFlow telemetry
Observability foundations in dynamically evolving architectures
AWS Cloud Watch
The Art of Container Monitoring
Multi Layer Monitoring V1
The present and future of serverless observability
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
Webinar Monitoring in era of cloud computing
Cloud Monitoring 101 - The Five Key Elements to Effective Cloud Monitoring
Path Solutions Network Monitor V4 Glossy
Performance Monitoring: Understanding Your Scylla Cluster
IBM SevOne for network and systems monitoring
Kks sre book_ch10
AcademyCloudFoundations_Module_10 (2).pptx
Ad

More from Rick Hwang (20)

PDF
在生命轉彎的地方 - 從軟體開發職涯,探索人生
PDF
20230829 - 探索職涯,複利人生
PDF
2023 08 - SRE 實踐與開發平台指南 - 書友見面會
PDF
20230215 - 凝聚團隊共識的溝通方法 (Effective Team Communication)
PDF
20230618 - 軟體測試實務新書發表會 - 從品質與測試,讓軟體再次偉大
PDF
CH02 API Governance
PDF
Chapter 8. Partial updates and retrievals.pdf
PDF
Ch09 Custom Methods
PDF
AWS Career Exploration Day
PDF
從理想、到現實的距離,開啟品味軟體測試之路 - 台灣軟體工程協會 (20220813)
PDF
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
PDF
導讀持續交付 2.0 - CH02 價值探索環
PDF
2020 AWS Summit - 如何有效管理 AWS 的成本結構與系統架構
PDF
災難演練 @ AWS 實戰分享 (Using AWS for Disaster Recovery)
PDF
Software Development Process v1.5 - 20121214
PDF
第三章 建立良好的人際關係網路
PDF
Wiki in Teamroom - Connected Mind
PDF
導讀持續交付 2.0 - 談當代軟體交付之虛實融合
PDF
Study Notes - Event-Driven Data Management for Microservices
PDF
Study Notes - Using an API Gateway
在生命轉彎的地方 - 從軟體開發職涯,探索人生
20230829 - 探索職涯,複利人生
2023 08 - SRE 實踐與開發平台指南 - 書友見面會
20230215 - 凝聚團隊共識的溝通方法 (Effective Team Communication)
20230618 - 軟體測試實務新書發表會 - 從品質與測試,讓軟體再次偉大
CH02 API Governance
Chapter 8. Partial updates and retrievals.pdf
Ch09 Custom Methods
AWS Career Exploration Day
從理想、到現實的距離,開啟品味軟體測試之路 - 台灣軟體工程協會 (20220813)
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
導讀持續交付 2.0 - CH02 價值探索環
2020 AWS Summit - 如何有效管理 AWS 的成本結構與系統架構
災難演練 @ AWS 實戰分享 (Using AWS for Disaster Recovery)
Software Development Process v1.5 - 20121214
第三章 建立良好的人際關係網路
Wiki in Teamroom - Connected Mind
導讀持續交付 2.0 - 談當代軟體交付之虛實融合
Study Notes - Event-Driven Data Management for Microservices
Study Notes - Using an API Gateway

Recently uploaded (20)

PPT
Drone Technology Electronics components_1
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
Construction Project Organization Group 2.pptx
PPTX
web development for engineering and engineering
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
Lesson 3_Tessellation.pptx finite Mathematics
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
Welding lecture in detail for understanding
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PDF
Digital Logic Computer Design lecture notes
PDF
Arduino robotics embedded978-1-4302-3184-4.pdf
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Drone Technology Electronics components_1
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
CYBER-CRIMES AND SECURITY A guide to understanding
Construction Project Organization Group 2.pptx
web development for engineering and engineering
Embodied AI: Ushering in the Next Era of Intelligent Systems
Lesson 3_Tessellation.pptx finite Mathematics
Internet of Things (IOT) - A guide to understanding
Foundation to blockchain - A guide to Blockchain Tech
Welding lecture in detail for understanding
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Digital Logic Computer Design lecture notes
Arduino robotics embedded978-1-4302-3184-4.pdf
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Lecture Notes Electrical Wiring System Components
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf

Amazon CloudWatch - Observability and Monitoring

  • 1. 1
  • 2. AWS CloudWatch Observability and Monitoring 2 Rick Hwang rick_kyhwang@hotmail.com 2017/12/28
  • 5. 5
  • 7. Agenda ● CloudWatch Metric ● CloudWatch Dashboard ● CloudWatch Alarm ● CloudWatch Event / Rules ● CloudWatch Logs 7
  • 8. ● SNS: Simple Notification Service ● SES: Simple Email Service ● SQS: Simple Queue Service ● Lambda: Serverless ● Auto Scaling ● CloudTrail Related AWS Services 8
  • 9. Questions ● 怎麼知道系統的狀況? ● 系統的指標是怎麼來的? ● 系統有哪一些層級要知道?哪些人要知道?怎麼知道? ● 知道之後做什麼?怎麼做?主動、被動? ● 什麼是監、控? 9
  • 10. How Amazon CloudWatch Works CloudWatch Basic Concepts 10
  • 11. Diagram components: EC2 Instances, Log Shipper, Logs, Log Groups (Log Stream A, B, C, ... N), Filters, Metrics, Dashboard, Alarms, SNS Topics, Lambda, S3, Amazon ES. Flow labels: Export, Streaming, Push.
      Filter pattern: [ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...]
      Log source: /var/log/app/*.log
      Sample log lines:
      2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0
      2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0
      2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0
      2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0
      2017-06-11T08:47:01 app1 NGX 53 0 53 0 0 0
      2017-06-11T08:47:01 app2 NGX 53 0 53 0 0 0
      2017-06-11T08:48:01 app1 NGX 59 0 59 0 0 0
      2017-06-11T08:48:01 app2 NGX 52 0 51 0 0 0
      2017-06-11T08:49:01 app1 NGX 48 0 48 0 0 0
  • 12. Source: AWS Summit 2016: Big Data Architectural Patterns and Best Practices
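  • 13. Key Points ● Produce structured, meaningful logs ○ Structured: CSV, JSON ○ Meaningful: data that can be aggregated → sum, max, min, average, count … ○ Can be queried with SQL ● Think about what you need to know once the system is live, and where that information comes from ● Avoid ETL (Extract, Transform, Load) whenever possible ○ High cost, wasteful ○ Maintenance cost ○ Communication cost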
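A minimal sketch of what a "structured, meaningful" log line could look like, assuming JSON output. The helper name `log_line` is hypothetical; the field names are borrowed from the sample log on slide 11.

```python
import json

def log_line(ts, hostname, scope, **metrics):
    """Serialize one measurement as a single JSON log line (hypothetical helper)."""
    record = {"ts": ts, "hostname": hostname, "scope": scope}
    record.update(metrics)  # numeric fields stay numeric, so they can be summed or averaged
    return json.dumps(record, sort_keys=True)

line = log_line("2017-06-11T08:45:01", "app1", "NGX", tcp_all=47, tcp_established=47)
record = json.loads(line)  # any consumer can parse it back without custom ETL
```

Because every field is named and typed, the same line can feed a metric filter or an SQL-style query without a separate transform step.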
  • 17. Metric - CPU Utilization (UTC)
  • 18. CloudWatch Metric ● Period: the time interval covered by each sample ○ EC2 defaults to 5m (free); can be changed to 1m (billed separately) ○ ELB defaults to 1m ○ Custom metrics support high resolution: 1s ● Statistics: the aggregation method; each metric has its own default ○ Sum ○ Average ○ Max ○ Min ○ Sample Count ● Unit ○ Percent ○ Count ○ Bytes
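To illustrate how Period and Statistics interact, here is a small local sketch that groups raw samples into period buckets and computes the five statistics named above. It does not call CloudWatch; the function `aggregate` and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def aggregate(samples, period_seconds):
    """Group (epoch_seconds, value) samples into period buckets and compute statistics."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % period_seconds].append(value)  # bucket key = start of the period
    return {
        start: {
            "Sum": sum(vals),
            "Average": mean(vals),
            "Maximum": max(vals),
            "Minimum": min(vals),
            "SampleCount": len(vals),
        }
        for start, vals in sorted(buckets.items())
    }

# Hypothetical CPU samples, one per minute: (seconds since start, percent).
cpu_samples = [(0, 10.0), (60, 20.0), (120, 90.0), (180, 20.0), (240, 10.0), (300, 50.0)]
five_minute = aggregate(cpu_samples, 300)
one_minute = aggregate(cpu_samples, 60)
```

With a 5-minute period the first bucket averages to 30%, which hides the 90% spike that the Maximum statistic or a 1-minute period would show.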
  • 19. Statistics - Long Tail. Sources: Wikipedia: Long tail (長尾); Amazon CloudWatch Update – Percentile Statistics and New Dashboard Widgets
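The long-tail point can be shown with a few lines of arithmetic: an average can look healthy while a high percentile exposes the slow requests. This sketch uses the nearest-rank definition of a percentile, which is an assumption and not necessarily how CloudWatch computes it; the latency values are made up.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p% of n
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 95 fast requests and 5 very slow ones.
latencies_ms = [12] * 95 + [900] * 5
average = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

Here the average lands between the two groups and describes neither, while p99 points directly at the slow tail.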
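  • 20. Metric Types ● Metrics Provided by AWS ● Custom Metric ○ Upload sample data (JSON) via the AWS CLI / SDK → hard to get right, error-prone ○ Ship logs to CloudWatch Logs via awslogs or the CloudWatch Agent (new), then define a Filter to generate the Metric ■ The pipeline is longer, but not hard to build ■ This is the recommended approach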
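For the first option (uploading samples through the SDK), a rough sketch with boto3 might look like the following. `put_metric_data` is the real boto3 CloudWatch call; the helper names, the `Host` dimension, the namespace you pass in, and the default values are all assumptions for illustration.

```python
from datetime import datetime, timezone

def build_metric_datum(name, value, unit="Count", host="app1", high_resolution=False):
    """Build one MetricData entry for put_metric_data (dimension name is hypothetical)."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "Host", "Value": host}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high-resolution (1-second) metric, 60 = standard resolution
        "StorageResolution": 1 if high_resolution else 60,
    }

def publish(namespace, data):
    """Send the data points to CloudWatch; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS access
    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)

datum = build_metric_datum("tcp_established", 47)
fast_datum = build_metric_datum("tcp_established", 47, high_resolution=True)
```

Every caller has to get names, units, and timestamps right on each upload, which is one reason the slide prefers the log-plus-filter route.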
  • 22. Metric / Description
      ○ CPUUtilization: The percentage of allocated EC2 compute units that are currently in use on the instance. This metric identifies the processing power required to run an application upon a selected instance. To use the percentiles statistic, you must enable detailed monitoring. Depending on the instance type, tools in your operating system can show a lower percentage than CloudWatch when the instance is not allocated a full processor core. Units: Percent
      ○ DiskReadOps: Completed read operations from all instance store volumes available to the instance in a specified period of time. To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number of seconds in that period. Units: Count
      ○ DiskWriteOps: Completed write operations to all instance store volumes available to the instance in a specified period of time. To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number of seconds in that period. Units: Count
  • 23. Metric / Description
      ○ DiskReadBytes: Bytes read from all instance store volumes available to the instance. This metric is used to determine the volume of the data the application reads from the hard disk of the instance. This can be used to determine the speed of the application. The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes
      ○ DiskWriteBytes: Bytes written to all instance store volumes available to the instance. This metric is used to determine the volume of the data the application writes onto the hard disk of the instance. This can be used to determine the speed of the application. The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes
  • 24. Metric / Description
      ○ NetworkIn: The number of bytes received on all network interfaces by the instance. This metric identifies the volume of incoming network traffic to a single instance. The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes
      ○ NetworkOut: The number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from a single instance. The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes
      ○ NetworkPacketsIn: The number of packets received on all network interfaces by the instance. This metric identifies the volume of incoming traffic in terms of the number of packets on a single instance. This metric is available for basic monitoring only. Units: Count. Statistics: Minimum, Maximum, Average
      ○ NetworkPacketsOut: The number of packets sent out on all network interfaces by the instance. This metric identifies the volume of outgoing traffic in terms of the number of packets on a single instance. This metric is available for basic monitoring only. Units: Count. Statistics: Minimum, Maximum, Average
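The per-second conversion described for the byte and operation metrics above is a single division by the period length. A small sketch, with hypothetical function and constant names:

```python
BASIC_PERIOD_SECONDS = 300      # basic (five-minute) monitoring
DETAILED_PERIOD_SECONDS = 60    # detailed (one-minute) monitoring

def per_second(total_in_period, detailed_monitoring=False):
    """Convert a per-period total (bytes or operations) into a per-second rate."""
    period = DETAILED_PERIOD_SECONDS if detailed_monitoring else BASIC_PERIOD_SECONDS
    return total_in_period / period

# Hypothetical NetworkIn data point of 3,000,000 bytes.
basic_rate = per_second(3_000_000)                               # bytes/second over 5 minutes
detailed_rate = per_second(3_000_000, detailed_monitoring=True)  # bytes/second over 1 minute
```

The same division gives average IOPS when the input is a DiskReadOps or DiskWriteOps total.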
  • 25. EC2 Metrics ● Default Period = 5 min (free) ○ Detailed Monitoring: period = 1 min ($$) ● Memory and disk metrics are not provided; they must be collected another way ○ CloudWatch Agent (released 2017-12) ○ telegraf, collectd, cacti, nagios …
  • 26. ELB Metrics (load balancing). Reference: Elastic Load Balancing Metrics and Dimensions
  • 27. Metric / Description
      ○ Latency: [HTTP listener] The total time elapsed, in seconds, from the time the load balancer sent the request to a registered instance until the instance started to send the response headers. [TCP listener] The total time elapsed, in seconds, for the load balancer to successfully establish a connection to a registered instance. Reporting criteria: There is a nonzero value. Statistics: The most useful statistic is Average. Use Maximum to determine whether some requests are taking substantially longer than the average. Note that Minimum is typically not useful. Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1 instance in us-west-2a have a higher latency. The average for us-west-2a has a higher value than the average for us-west-2b.
      ○ RequestCount: The number of requests completed or connections made during the specified interval (1 or 5 minutes). [HTTP listener] The number of requests received and routed, including HTTP error responses from the registered instances. [TCP listener] The number of connections made to the registered instances. Reporting criteria: There is a nonzero value. Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average all return 1. Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that 100 requests are sent to the load balancer. There are 60 requests sent to us-west-2a, with each instance receiving 30 requests, and 40 requests sent to us-west-2b, with each instance receiving 20 requests. With the AvailabilityZone dimension, there is a sum of 60 requests in us-west-2a and 40 requests in us-west-2b. With the LoadBalancerName dimension, there is a sum of 100 requests.
  • 28. Metric / Description
      ○ HealthyHostCount: The number of healthy instances registered with your load balancer. A newly registered instance is considered healthy after it passes the first health check. If cross-zone load balancing is enabled, the number of healthy instances for the LoadBalancerName dimension is calculated across all Availability Zones. Otherwise, it is calculated per Availability Zone. Reporting criteria: There are registered instances. Statistics: The most useful statistics are Average and Maximum. These statistics are determined by the load balancer nodes. Note that some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is healthy. Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, us-west-2a has 1 unhealthy instance, and us-west-2b has no unhealthy instances. With the AvailabilityZone dimension, there is an average of 1 healthy and 1 unhealthy instance in us-west-2a, and an average of 2 healthy and 0 unhealthy instances in us-west-2b.
      ○ UnHealthyHostCount: The number of unhealthy instances registered with your load balancer. An instance is considered unhealthy after it exceeds the unhealthy threshold configured for health checks. An unhealthy instance is considered healthy again after it meets the healthy threshold configured for health checks. Reporting criteria: There are registered instances. Statistics: The most useful statistics are Average and Minimum. These statistics are determined by the load balancer nodes. Note that some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is healthy. Example: See HealthyHostCount.
  • 29. Metric / Description
      ○ HTTPCode_Backend_2XX, HTTPCode_Backend_3XX, HTTPCode_Backend_4XX, HTTPCode_Backend_5XX: [HTTP listener] The number of HTTP response codes generated by registered instances. This count does not include any response codes generated by the load balancer. Reporting criteria: There is a nonzero value. Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1. Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1 instance in us-west-2a result in HTTP 500 responses. The sum for us-west-2a includes these error responses, while the sum for us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a.
      ○ HTTPCode_ELB_4XX: [HTTP listener] The number of HTTP 4XX client error codes generated by the load balancer. Client errors are generated when a request is malformed or incomplete. Reporting criteria: There is a nonzero value. Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1. Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that client requests include a malformed request URL. As a result, client errors would likely increase in all Availability Zones. The sum for the load balancer is the sum of the values for the Availability Zones.
      ○ HTTPCode_ELB_5XX: [HTTP listener] The number of HTTP 5XX server error codes generated by the load balancer. This count does not include any response codes generated by the registered instances. The metric is reported if there are no healthy instances registered to the load balancer, or if the request rate exceeds the capacity of the instances (spillover) or the load balancer. Reporting criteria: There is a nonzero value. Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1. Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a fills and clients receive a 503 error. If us-west-2b continues to respond normally, the sum for the load balancer equals the sum for us-west-2a.
  • 30. Metric / Description
      ○ BackendConnectionErrors: The number of connections that were not successfully established between the load balancer and the registered instances. Because the load balancer retries the connection when there are errors, this count can exceed the request rate. Note that this count also includes any connection errors related to health checks. Reporting criteria: There is a nonzero value. Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are not typically useful. However, the difference between the minimum and maximum (or peak to average or average to trough) might be useful to determine whether a load balancer node is an outlier. Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that attempts to connect to 1 instance in us-west-2a result in back-end connection errors. The sum for us-west-2a includes these connection errors, while the sum for us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a.
  • 31. Metric / Description
      ○ SpilloverCount: The total number of requests that were rejected because the surge queue is full. [HTTP listener] The load balancer returns an HTTP 503 error code. [TCP listener] The load balancer closes the connection. Reporting criteria: There is a nonzero value. Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are not typically useful. Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer node in us-west-2a fills, resulting in spillover. If us-west-2b continues to respond normally, the sum for the load balancer will be the same as the sum for us-west-2a.
      ○ SurgeQueueLength: The total number of requests that are pending routing. The load balancer queues a request if it is unable to establish a connection with a healthy instance in order to route the request. The maximum size of the queue is 1,024. Additional requests are rejected when the queue is full. For more information, see SpilloverCount. Reporting criteria: There is a nonzero value. Statistics: The most useful statistic is Maximum, because it represents the peak of queued requests. The Average statistic can be useful in combination with Minimum and Maximum to determine the range of queued requests. Note that Sum is not useful. Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a fills, with clients likely experiencing increased response times. If this continues, the load balancer will likely have spillovers (see the SpilloverCount metric). If us-west-2b continues to respond normally, the max for the load balancer will be the same as the max for us-west-2a.
  • 32. Too many to list here ... See: Amazon CloudWatch Metrics and Dimensions Reference
  • 33. Metrics you need to understand ● EC2 ● EBS ● ELB: CLB, ALB, NLB ○ Classic Load Balancer ○ Application Load Balancer ○ Network Load Balancer
  • 35. Question and Think: Where do the EC2 / ELB metrics come from?
  • 41. CloudWatch Dashboard ● widget: line, stacked, number, text (markdown) ● auto refresh ● local timezone ○ EC2 metric is UTC ● time range ● Horizontal annotation ● Right / Left Y axis ● full screen (dark / light mode)
  • 42. Tips ● Dashboards can be imported / exported as JSON ● Can be updated automatically through the API ● $3.00 per dashboard per month (ap-northeast-1) ● Time zone
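Since a dashboard is just JSON, it can be generated in code. The sketch below builds a one-widget dashboard body with a horizontal annotation; the function name, widget title, instance ID, and threshold are hypothetical, and the resulting string is what one would pass as `DashboardBody` to the boto3 `put_dashboard` call.

```python
import json

def dashboard_body(instance_id, region="ap-northeast-1", threshold=80):
    """Return a dashboard body (JSON string) with one CPU line widget."""
    widget = {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "CPU Utilization",
            "view": "timeSeries",
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
            "period": 300,
            "stat": "Average",
            "region": region,
            # Horizontal annotation drawn at the alarm threshold
            "annotations": {"horizontal": [{"label": "threshold", "value": threshold}]},
        },
    }
    return json.dumps({"widgets": [widget]})

body = dashboard_body("i-0123456789abcdef0")  # placeholder instance ID
parsed = json.loads(body)
```

Keeping the body in version control makes dashboards reviewable and reproducible across accounts.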
  • 45. Demo: CloudWatch Dashboard Widgets, X/Y Axis, Annotation
  • 48. CloudWatch Alarm ● An action triggered once a threshold is reached, for example: ○ Within five minutes ○ CPU >= 80% ○ Five times ● Action types ○ EC2 actions: reboot, stop, terminate; usually combined with EC2 System Status checks ○ SNS to: ■ SES ■ SQS ■ Lambda ■ HTTP Request
  • 49. CloudWatch Alarm - Status ● ALARM: over threshold ● INSUFFICIENT_DATA: not enough data to determine the state ● OK
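The threshold condition and the three alarm states can be mimicked locally with a deliberately simplified evaluator. This is not CloudWatch's actual evaluation logic (which also handles missing data and partial breaches); `alarm_state` and its defaults are hypothetical and mirror the "CPU >= 80%, five times" example.

```python
def alarm_state(datapoints, threshold=80.0, evaluation_periods=5):
    """Return a simplified alarm state from the most recent data points."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    # ALARM only when every recent data point breaches the threshold
    return "ALARM" if all(value >= threshold for value in recent) else "OK"

state_breaching = alarm_state([50, 85, 90, 95, 88, 91])
state_recovered = alarm_state([85, 90, 95, 70, 91])
state_too_few = alarm_state([85, 90])
```

Requiring several consecutive breaches keeps a single spike from paging anyone.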
  • 51. Event-driven → Feedback → Automation (CW Alarm). Source: "The Pitfalls of 'Automating XXX'" (『自動化XXX』的陷阱)
  • 54. CloudWatch Event ● Event Source ○ Event Pattern ○ Schedule ● Targets ○ Up to 5 targets per rule (fixed) ○ Type: Lambda, EC2, Stream, ECS, SSM, Step Function, Pipeline, SNS, SQS …
  • 55. CloudWatch Events ● Event Source ○ Event Pattern: DynamoDB, EC2, AutoScaling, RDS … too many to list ○ Schedule ● Targets ○ Up to 5 targets per rule (fixed) ○ Type: Lambda, EC2, Stream, ECS, SSM, Step Function, Pipeline, SNS, SQS … too many to list
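As an illustration of an Event Pattern source, the dictionary below matches EC2 instances entering the stopped state. The `matches` function is a hypothetical, heavily simplified stand-in for the service's matching rules, included only so the pattern can be tried locally; the instance ID is a placeholder.

```python
EC2_STOPPED_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

def matches(pattern, event):
    """Simplified matcher: each pattern key must list the event's value."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True

stopped_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
running_event = {**stopped_event, "detail": {"instance-id": "i-0123456789abcdef0", "state": "running"}}
```

The same pattern, serialized as JSON, is what a rule would carry as its event pattern.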
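  • 56. Common scenarios ● Preventive automation for EC2: ○ An instance that should not be shut down gets shut down: restart it automatically ○ An instance's hardware fails: restart it automatically ○ Actions on state changes ● After an S3 action ○ Action: PutObject ○ Trigger: Lambda, put a message to SQS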
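The first scenario (restart an instance that should not have been shut down) could be handled by a Lambda function targeted by a state-change rule. This is a sketch under assumptions: the protected-instance set, the return values, and the injectable `ec2` argument are hypothetical; `start_instances` is the real boto3 EC2 call.

```python
PROTECTED_INSTANCES = {"i-0123456789abcdef0"}  # hypothetical IDs that must stay running

def handler(event, context=None, ec2=None):
    """Lambda-style handler: start a protected instance again when it reports 'stopped'."""
    detail = event.get("detail", {})
    instance_id = detail.get("instance-id")
    if detail.get("state") != "stopped" or instance_id not in PROTECTED_INSTANCES:
        return {"action": "ignored", "instance": instance_id}
    if ec2 is None:
        import boto3  # only needed when running against AWS
        ec2 = boto3.client("ec2")
    ec2.start_instances(InstanceIds=[instance_id])
    return {"action": "started", "instance": instance_id}
```

Passing the client in as an argument lets the logic be exercised with a stub before it is deployed.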
  • 59. CloudWatch Logs: Filter, Custom Metric, Log Shipper
  • 60. Diagram components: EC2 Instances, Log Shipper, Logs, Log Groups (Log Stream A, B, C, ... N), Filters, Metrics, Dashboard, Alarms, SNS Topics, Lambda, S3, Amazon ES. Flow labels: Export, Streaming, Push.
      Filter pattern: [ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...]
      Log source: /var/log/app/*.log
      Sample log lines:
      2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0
      2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0
      2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0
      2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0
      2017-06-11T08:47:01 app1 NGX 53 0 53 0 0 0
      2017-06-11T08:47:01 app2 NGX 53 0 53 0 0 0
      2017-06-11T08:48:01 app1 NGX 59 0 59 0 0 0
      2017-06-11T08:48:01 app2 NGX 52 0 51 0 0 0
      2017-06-11T08:49:01 app1 NGX 48 0 48 0 0 0
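  • 61. CloudWatch Logs (CWL) ● Prerequisite: the EC2 instance must have the awslogs driver or the CloudWatch agent installed ○ On ECS instances it is just an option to select ● Ship logs to CWL in real time ○ Logs can be queried directly in CWL (usable, if basic) ○ No need to worry about storage filling up or maintenance ○ Log rotation can be configured ● Build Custom Metrics through Filters ○ Can build Dashboards ○ Can build Alarms → Event-driven ■ To Lambda, Slack ■ ETL ■ Automation … endless possibilities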
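To show what a space-delimited filter such as `[ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...]` extracts, here is a local imitation in Python. It only mimics the idea of a metric filter; the function `extract` is hypothetical, and only the six field names given on the slide are used (the elided trailing fields are ignored).

```python
FIELDS = ["ts", "hostname", "scope", "tcp_all", "tcp_time_wait", "tcp_established"]

def extract(line, scope="NGX", field="tcp_established"):
    """Return (hostname, value) for lines whose scope matches, else None."""
    record = dict(zip(FIELDS, line.split()))  # extra trailing columns are dropped by zip
    if record.get("scope") != scope:
        return None
    return record["hostname"], int(record[field])

sample_lines = [
    "2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0",
    "2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0",
    "2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0",
]
datapoints = [extract(line) for line in sample_lines]
```

Each extracted value corresponds to one data point of the custom metric that dashboards and alarms can then use.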
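  • 62. Metric (指標) ● Data obtained by sampling the target under measurement ○ Data per unit of time, e.g. per millisecond, per second, per minute ● The higher the sampling frequency, the more precise the data ● Audio quality (sample rate per second) ○ CD quality: 44.1 kHz ○ Studio recording: 192 kHz ● Video resolution ○ HD ○ Full HD ○ 4K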
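  • 66. Questions ● How do we know the state of the system? ○ Observe (觀測), Measure (量測) ● Where do the system's metrics come from? ○ Metrics are identified by analyzing logs after systematic testing (System Test) ● Which layers of the system need to be known? Who needs to know? ○ Business, Application, OS/Hardware, Network ● Once we know, what do we do and how? Proactively or reactively? ● What do "monitor" and "control" mean? ○ 監 (monitor): Watch ○ 控 (control): Control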
  • 83. Diagram: Target Services / Systems; Watchers; Controllers; Push or Pull Data (Observability, Measure). Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● …
  • 84. Diagram: Target Services / Systems; Watchers; Controllers; Push or Pull Data (Observability, Measure); Events (Conditions / Thresholds). Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● … Console => Do Something ● Reset or Clean Cache ● On / Off Functions ● Notification ● ...
  • 85. Diagram: Target Services / Systems; Watchers; Controllers; Push or Pull Data (Observability, Measure); Events (Conditions / Thresholds); Commands. Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● … Console => Do Something ● Reset or Clean Cache ● On / Off Functions ● Notification ● ...
  • 86. Diagram: Target Services / Systems; Watchers; Controllers; Push or Pull Data (Observability, Measure); Events (Conditions / Thresholds); Commands; Feedback (Adjust Conditions / Thresholds by ML). Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● … Console => Do Something ● Reset or Clean Cache ● On / Off Functions ● Notification ● ...
  • 87. Diagram: Target Services / Systems; Watchers; Controllers; Push or Pull Data (Observability, Measure); Events (Conditions / Thresholds); Commands; Feedback (Adjust Conditions / Thresholds by ML); label: Monitor (監). Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● … Console => Do Something ● Reset or Clean Cache ● On / Off Functions ● Notification ● ...
  • 88. Diagram: Target Services / Systems; Watchers; Controllers; Push or Pull Data (Observability, Measure); Events (Conditions / Thresholds); Commands; Feedback (Adjust Conditions / Thresholds by ML); labels: Monitor (監), Control (控). Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● … Console => Do Something ● Reset or Clean Cache ● On / Off Functions ● Notification ● ...
  • 89. Observability vs Monitoring ● Measure (量測) ● Observe (觀測) ● Weather bureau ○ Observability ○ Measurement ● Government ○ Monitoring ○ Alert ○ Action ○ Feedback
  • 91. Measure (量測) → Sample from Log; Observe (觀測) → Metric; Feedback (回饋) → Analyze, Condition, Alarm; Control (控制) → Automation, so the work gets done while you lie back
  • 93. Logs matter. Without structured logs or data, you pay heavily in ETL cost and time.
  • 94. Event-driven → Feedback → Automation (CW Alarm). Source: "The Pitfalls of 'Automating XXX'" (『自動化XXX』的陷阱)
  • 96. Why CloudWatch ● Serverless Monitoring System ● Event-driven ● Programmable and Automation ● Realtime and Backup ● Monitoring Monitoring System at Netflix - 2017/05/22 ● CloudWatch meets the need for "Basic Monitoring"
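  • 98. Why not choose another monitoring tool? ● We do not want to build and babysit our own servers ● However good a monitoring system is, it is still only a cost ● A monitoring system is not Big Data ● Some solutions' architectures do not account for HA, e.g. Prometheus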
  • 100. Alarm System using Serverless
  • 101. System architecture: CloudWatch + SNS + Lambda + Slack. Diagram components: EC2, Auto Scaling, CloudWatch Alarms, CloudWatch Event (time-based), Health-Checker, SNS Topic (Info, Warning), SNS Topic (Urgent), SNS-Adapter, Slack-Notifier, SMS. Recipients: Operators, Developers, Testers. ● Urgent: SMS, Slack ● Warning: Slack w/ tag ● Info: Slack w/o tag
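A rough sketch of what the Slack-Notifier step could do with an alarm delivered through SNS. The SNS record layout and the alarm fields (`AlarmName`, `NewStateValue`, `NewStateReason`) follow the standard CloudWatch alarm notification; the severity-to-mention mapping, the function name, and the decision to tag urgent messages are assumptions, and actually posting the payload to a Slack webhook is left out.

```python
import json

# Hypothetical mapping of severity to Slack mention: warning carries a tag, info does not.
MENTION = {"urgent": "<!channel> ", "warning": "<!here> ", "info": ""}

def slack_payload(sns_event, severity):
    """Turn an SNS-delivered CloudWatch alarm notification into a Slack message payload."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    text = "{}[{}] {} is {}: {}".format(
        MENTION[severity],
        severity.upper(),
        message["AlarmName"],
        message["NewStateValue"],
        message["NewStateReason"],
    )
    return {"text": text}

# Hypothetical event as a Lambda subscribed to the SNS topic would receive it.
sample_event = {"Records": [{"Sns": {"Message": json.dumps({
    "AlarmName": "cpu-high",
    "NewStateValue": "ALARM",
    "NewStateReason": "Threshold Crossed",
})}}]}
warning_payload = slack_payload(sample_event, "warning")
info_payload = slack_payload(sample_event, "info")
```

Keeping severity routing in one table makes it easy to change who gets tagged without touching the alarm definitions.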
  • 102. CloudWatch Reporter - System Architecture. Diagram labels: CloudWatch Reporter / Alarmer; CloudWatch Event (time-based), Schedule; Info / Alert Channels; Operators (on duty); Operators; Developers (On Call); Metric Configs (Namespace, Stats), maintain, PR; Target Services, Loading; Read CW Metrics; Developers, development, Feature Request
  • 104. Best Practice ● Make good use of cloud SaaS such as AWS CloudWatch and GCP Stackdriver ● Design the deployment and configuration process to be configurable ● Design logs in a structured format (CSV or JSON) ● Use a Big Data solution for log queries, such as AWS Athena or GCP BigQuery ● Ship logs through a shipper (awslogs, statsd, collectd, fluentd, telegraf ...) to both ○ S3, as a backup to satisfy audit requirements ○ CloudWatch, for debugging / monitoring ● High-volume log streaming needs a queue to help ○ AWS Kinesis ○ GCP Pub/Sub
  • 105. Reference - User Guide ● CloudWatch User Guide ● CloudWatch Events User Guide ● CloudWatch Logs User Guide
  • 106. Reference - Youtube, Blog ● AWS re:Invent 2015: Log, Monitor and Analyze your IT with Amazon CloudWatch (DVO315) ● Amazon CloudWatch Update – Percentile Statistics and New Dashboard Widgets ● New – High-Resolution Custom Metrics and Alarms for Amazon CloudWatch ● A Brief Talk on System Monitoring and CloudWatch Applications (淺談系統監控與 CloudWatch 的應用) - AWS User Group Taiwan ● Study Notes - CloudWatch ● SRE CH6 Monitoring Distributed Systems (監控分散式系統) ● High-Quality Microservices - CH6 Monitoring (高品質微服務 - CH6 監控)
  • 107. /* End of Slide */