POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud Applications

NICTA Copyright 2012 From imagination to impact
Dr. Liming Zhu
Liming.Zhu@nicta.com.au
Principal Researcher, NICTA/UNSW
April, 2014 at Berkeley AMPLab

Outline
• Dependable Cloud Operation
• Approach: Process-Oriented Dependability (POD)
– POD-Diagnosis
– Undo/Recovery Planning using AI Planning
– Modeling and Analysis using DTMC
• Connections with AMPLab BDAS

Dependable Cloud Operation: Motivation
• Sporadic operations cause most outages
– Deployment, reconfiguration, (rolling) upgrade, rollback…
• as opposed to normal operations
– DevOps-related: continuous integration/deploy/delivery
• Etsy.com: 25 full deployments per day at 10 commits per deploy
– Other drivers: resource sharing, micro services/partition
migration, backup/recovery, auto-mitigation itself…
• Limited control & visibility during sporadic operation
– Heavy reliance on Cloud APIs
– Limited visibility and exception handling capabilities

Dependable Cloud Operation: Challenges
• Our Context
– Large-scale web/enterprise operation in Cloud
– Distributed data analytics in Cloud (Hadoop/Spark)
• Goal: detect, diagnose and react to errors
occurring during a sporadic cloud operation
• Challenges
1. Anomaly detection during sporadic operations
2. Undo/Recovery planning
3. Modelling and analysis of sporadic operation

Sporadic Operation Example: Rolling Upgrade
• Flowchart steps: Start → Update Auto-Scaling Group (ASG) → Sort Instances → Remove & Deregister Old Instances from ELB → Terminate Old Instances → Wait for ASG to Start New Instances → Register New Instances with ELB → Stop
• Scenario
– 100 servers in the cloud running version 1 software
– Upgrade 10 servers at a time to version 2 software
– No downtime or redundancy cost
– Can take a long time to complete, with errors occurring during the
operation and interference from other operations
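
The batch-wise loop in this example can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the `Instance` class, the `rolling_upgrade` function and the set standing in for ELB registration are hypothetical, and no Asgard or AWS API is called.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    version: int


def rolling_upgrade(instances, batch_size, new_version):
    """Simulate a rolling upgrade; return (new fleet, lowest in-service count, log)."""
    in_service = {i.name for i in instances}   # hypothetical stand-in for ELB registration
    min_in_service = len(in_service)
    log = []
    upgraded = []
    ordered = sorted(instances, key=lambda i: i.name)   # "Sort Instances"
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        for old in batch:
            in_service.discard(old.name)                # remove & deregister
            log.append(f"Remove {old.name}")
            log.append(f"Terminate {old.name}")
        min_in_service = min(min_in_service, len(in_service))
        log.append("Wait")                              # wait for replacements
        for old in batch:
            new = Instance(f"{old.name}-v{new_version}", new_version)
            in_service.add(new.name)                    # register the new instance
            log.append(f"New instance {new.name}")
            upgraded.append(new)
    return upgraded, min_in_service, log
```

With 100 instances and a batch size of 10, the simulated fleet never has fewer than 90 instances in service, which is the capacity trade-off behind upgrading a few servers at a time.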

Challenge 1: Anomaly Detection
• Traditional anomaly-based error detection is
designed for “normal operation”
– Causes significant false positives, or forces all monitoring to be
disabled, during sporadic operation
• Continuous changes to the production systems
– From changes every few months during scheduled downtime to changes every few hours at any time
– Multiple operations at the same time
• Quality of automation scripts + human
– Fully testing the operation (scripts + human) in an uncertain
cloud environment is very difficult

Our Approach: Use Process Context
• Offline: treat an operation as a process
– Process discovered automatically from logs/scripts
• Clustering of log lines and process mining
– Intermediary step outcomes specified as assertions
• Online: use process context
– Process context: process/instance/step ids, expected states…
– Errors detected by examining logs and monitoring data
• Assertion evaluation integrated with monitoring facilities
• Conformance checking against expected processes using logs
– Detected errors are further diagnosed for (root) causes
• Examining a fault tree to locate potential root causes
• Performing more diagnostic tests and on-demand assertions
X. Xu, L. Zhu, et al. "POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications," 44th Annual
IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.

Example: Rolling Upgrade Using Asgard
[Figure: offline/online architecture. Components: Operator, Asgard, Log data, Process Mining Service, Discovered Model. Relations shown: generates, outputs, controls, read by. Example discovered steps: Check AZs, Create Snapshot, Create instance from snapshot, Create AMI from instance, Evaluate AMI.]

Process Mining Service: how it works
• Process Mining: Discovery
1. Collect the logs (using Logstash)
2. Filter the logs
3. Calculate string distance
(Levenshtein distance) between
each pair of log lines
4. Cluster the log lines
5. Look at the dendrogram to
decide on threshold
6. Name & combine clusters
7. Derive regular expressions for the
clusters
8. Classify the log lines using the
regular expressions and cluster
names
9. Import altered log into process
mining tools
10. Apply different process discovery
algorithms
11. If anything requires changes, go
back to the respective steps and
redo from there
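
Steps 3, 4 and 8 of the discovery procedure can be sketched in plain Python. This is a minimal sketch under assumptions, not the service's implementation: the sample log lines, the normalised distance, the 0.2 threshold and the hand-written `PATTERNS` are all hypothetical, and the clustering is a simple single-linkage cut rather than a full dendrogram.

```python
import re
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def cluster_lines(lines, threshold):
    """Single-linkage clustering: lines within `threshold` end up in one cluster."""
    parent = list(range(len(lines)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(lines)), 2):
        longest = max(len(lines[i]), len(lines[j]), 1)
        if levenshtein(lines[i], lines[j]) / longest <= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for idx, line in enumerate(lines):
        clusters.setdefault(find(idx), []).append(line)
    return list(clusters.values())


# Hypothetical cluster names and hand-derived regular expressions (steps 6-7).
PATTERNS = {
    "remove": re.compile(r"^Remove instance \S+ from ELB$"),
    "terminate": re.compile(r"^Terminate instance \S+$"),
}


def classify(line):
    """Label a log line with the first matching cluster name (step 8)."""
    for name, pattern in PATTERNS.items():
        if pattern.match(line):
            return name
    return "unknown"
```

The labelled lines, rather than the raw text, are what would then be imported into a process mining tool.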

POD-Detection: Error Detection
The Error Detection Service has two
methods for detecting errors:
• Assertion Checking
• Conformance Checking

Assertion Checking: how it works
Assertions are triggered as each log line appears:
• Log line: Remove ...
– Assertion: i has been de-registered from ELB
– Assertion: i has been removed from ASG
– Assertion: there is 1 less instance of v1
• Log line: Terminate ...
– Assertion: i successfully terminated
• Log line: Wait ...
– Assertion: next log line should appear within 17m35s (95th percentile)
• Log line: New instance ...
– Assertion: i' successfully launched
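
The idea of tying assertions to log lines can be sketched as follows. This is a minimal sketch under assumptions: the `state` dictionary stands in for what a monitoring facility would report, and the log-line patterns, `RULES` table and function names are hypothetical rather than the service's actual interface.

```python
import re

# Hypothetical snapshot of cloud state, as a monitoring facility might report it.
state = {
    "elb": {"i-2"},          # instances registered with the load balancer
    "asg": {"i-2"},          # instances in the auto-scaling group
    "terminated": {"i-1"},   # instances reported as terminated
}

# Each rule: (log-line pattern, list of (description, predicate)).
RULES = [
    (re.compile(r"^Remove (?P<i>\S+)"), [
        ("de-registered from ELB", lambda s, i: i not in s["elb"]),
        ("removed from ASG", lambda s, i: i not in s["asg"]),
    ]),
    (re.compile(r"^Terminate (?P<i>\S+)"), [
        ("successfully terminated", lambda s, i: i in s["terminated"]),
    ]),
]


def check_line(line, state):
    """Return a message for every assertion that fails for this log line."""
    failures = []
    for pattern, assertions in RULES:
        match = pattern.match(line)
        if not match:
            continue
        instance = match.group("i")
        for description, predicate in assertions:
            if not predicate(state, instance):
                failures.append(f"{instance}: not {description}")
    return failures


def timed_out(wait_started_at, now, limit_seconds=17 * 60 + 35):
    """Timer assertion: the next log line must appear within the limit."""
    return now - wait_started_at > limit_seconds
```

The default limit of 17m35s is the 95th-percentile value from the slide; timestamps are plain seconds here for simplicity.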

Conformance Checking: how it works
Log lines are checked against the expected process as they arrive:
• Remove ...
• Terminate ...
• Wait ...
• Terminate ...??? (unexpected: a second Terminate directly after Wait does not fit the expected process)
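
A simplified version of this check can be sketched by replaying activity names against a table of allowed successors. This is an assumption-laden sketch: the `MODEL` table is a hypothetical hand-written stand-in for a discovered process model, and a directly-follows check is a simpler technique than the replay offered by process mining tools.

```python
# Hypothetical discovered model: for each activity, the activities allowed next.
MODEL = {
    "Start": {"Remove"},
    "Remove": {"Terminate"},
    "Terminate": {"Wait"},
    "Wait": {"New instance"},
    "New instance": {"Remove", "Stop"},
}


def replay(activities, model=MODEL, start="Start"):
    """Replay activities against the model; return (position, activity) deviations."""
    deviations = []
    current = start
    for position, activity in enumerate(activities):
        if activity in model.get(current, set()):
            current = activity                          # conforming step: advance
        else:
            deviations.append((position, activity))     # unfit step: stay in place
    return deviations
```

For the sequence on the slide, `replay(["Remove", "Terminate", "Wait", "Terminate"])` flags the fourth line, because only a new-instance step is allowed after Wait in this model.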

POD-Diagnosis: how it works
• Fault trees are built as a
knowledge base
• Process context used for fault
tree pruning
• On-demand diagnosis tests
to locate the (root) causes
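
Pruning by process context and on-demand testing can be sketched with a small tree walk. This is a minimal sketch under assumptions: the `FaultNode` structure, the step names and the example faults are hypothetical, and real diagnostic tests would query cloud APIs instead of returning constants.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class FaultNode:
    name: str
    steps: Set[str]                              # process steps where this fault can occur
    test: Optional[Callable[[], bool]] = None    # on-demand test; True means fault present
    children: List["FaultNode"] = field(default_factory=list)


def diagnose(node, step):
    """Return the most specific confirmed faults that are relevant to `step`."""
    if step not in node.steps:
        return []                # pruned by process context, test is never run
    if node.test is not None and not node.test():
        return []                # test ran and ruled this fault out
    causes = []
    for child in node.children:
        causes.extend(diagnose(child, step))
    # If no child is confirmed, this node is the most specific cause found.
    return causes or [node.name]
```

Because irrelevant subtrees are skipped before their tests run, the process context reduces the number of diagnostic calls as well as the number of candidate causes.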

Evaluation: POD-Detection/Diagnosis
• Experiments
– Rolling upgrade of 100+ node cluster in AWS
• Fault injection + confounding processes: random kill, scaling-in, …
• Detected errors
– Assertion checking: known errors and global errors
• Examples: key management, launch configuration, images…
– Conformance checking: unknown errors
• Skipped or undone activities
• Time and precision
– Compared with Asgard/Monitoring internal mechanisms
• Detected more errors earlier
– Diagnosis: limited to known causes in the fault tree
• 95th percentile of diagnosis time under 4 s; accuracy between 80% and 100%

Other Related Research
Challenges
1. Anomaly detection during sporadic operations
2. Undo/Recovery planning
3. Modelling and analysis of sporadic operation

Challenge 2: Undo/Recovery Planning
[Figure: recovery options when a certain step between states S1 and S2 ends in an error state Serr. Labels: Reparation, Compensation Undo, Parameterizable Redo, Alternative, Checkpoint-based Undo to previous states (S0, S-i, …).]

Undo/Undoability Approach in a Nutshell
• Goal: undo support for
“indirect control” setting
– Problem 1: some actions are
irreversible, e.g., delete
– Problem 2: undo ≠ copy back
previous state of memory
• Have to call the right actions on the
right resources in the right order
– Problem 3: partly irreversible
operations, e.g. on Amazon WS:
• Stopping a machine disassociates an
elastic IP address (if any), and
releases internal IP / public DNS
• Starting the machine isn't undo:
elastic IP is dangling, internal IP /
public DNS / timestamps are different
• Solution components:
– Replace "do" with "pseudo-do"
– Undo System based on AI Planning
• Outcome: sequence of undo actions
– Undoability Checking:
• Is the operation I'm about to execute
undoable?
• Learn which aspects can be fully undone
for each operation (whole domain)
• If not, can we abstract / change so that
undoability is given?
– Projection (of a domain)
Ingo Weber et al. Supporting undoability in systems operations. In USENIX LISA'13: Large Installation System
Administration Conference, Washington, DC, USA, November 2013.
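
The undoability question (can a plan lead from an operation's post-state back to its pre-state?) can be sketched with a toy state model. This is a sketch under stated assumptions: breadth-first search stands in for a real AI planner, the two-field state and the action functions are hypothetical simplifications of the stop/start example, and leaving internal IP / public DNS out of the state plays the role of a projection.

```python
from collections import deque

# State is a tuple (running, eip_attached). Internal IP and public DNS are
# deliberately not modelled, which mimics projecting the domain.


def stop(state):
    running, _ = state
    return (False, False) if running else None    # stopping also drops the elastic IP


def start(state):
    running, eip = state
    return (True, eip) if not running else None   # starting does not restore the IP


def associate_eip(state):
    running, eip = state
    return (running, True) if running and not eip else None


def plan(initial, goal, actions):
    """Breadth-first search; return a shortest list of action names, or None."""
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for name, action in actions.items():
            nxt = action(state)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None


def undo_plan(pre_state, operation, actions):
    """Plan from the operation's post-state back to its pre-state."""
    post_state = operation(pre_state)
    return plan(post_state, pre_state, actions)
```

With only `start` and `stop` available, stopping a running machine that holds an elastic IP has no undo plan in this model; once `associate_eip` is added, the undo plan becomes start followed by re-association.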

Undoability Checking Approach
[Figure: undoability checking workflow.
• Tool provider defines: full domain model (e.g., AWS)
• Tool user (e.g., sys admin) defines: operation(s) to execute (e.g., script, command); resources and properties required to be undoable
• Generate projection specification; apply projection to generate the projected domain model
• Undoability Checker, per operation: generate pre- and post-states; check undoability per pre-post state pair (for each pair: call AI planner)
• Result, with feedback: undoability (yes/no); list of causes if not undoable]

Challenge 3: Modeling and Analysis
• Approach: Model as stochastic processes
– Discrete-/continuous-time Markov chains (DTMC/CTMC)
• Forward states: net successful operations
• Backward states: failure or deliberate rollback/undo
• A family of g-k chains with different parameters
– g: rolling-upgrade wave granularity; k: number of failures/rollbacks per wave
Daniel Sun & L. Zhu, et al. "Understanding Rolling Upgrade," 33rd International Symposium on Reliable Distributed
Systems (SRDS), 2014 (submitted)
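
The forward/backward structure can be illustrated with a much simpler chain than the g-k family, whose exact definition is not given on the slide. The sketch below assumes a hypothetical DTMC whose states count completed waves: each transition moves one wave forward with probability `p_success`, otherwise one wave back (or stays at zero). It computes the expected number of transitions until all waves are done.

```python
def expected_steps(num_waves, p_success):
    """Expected transitions to complete `num_waves` waves in the toy chain.

    Let d_i be the expected transitions needed to advance from state i to
    state i + 1. Then d_0 = 1 / p and d_i = (1 + q * d_{i-1}) / p, where
    q = 1 - p is the probability of falling back one wave.
    """
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    q = 1 - p_success
    total = 0.0
    d = 0.0   # d for the previous state; 0 before the first state
    for _ in range(num_waves):
        d = (1 + q * d) / p_success
        total += d
    return total
```

With `p_success = 1.0` the result equals the number of waves, and it grows quickly as the rollback probability rises, which is the kind of completion-time effect such a model is meant to expose. For 100 servers upgraded 10 at a time, `num_waves` would be 10.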

Model used for
• Predictions
– e.g. completion time, failure rate impact
• Optimization and Decision Problems
– e.g. when to activate new versions to guarantee 99.99% success

Connection with AMPLab BDAS

Projects Related to BDAS (1/2)
1. Log/Metrics analysis in POD-Diagnosis
– Currently using Spark/MLBase
– Voluminous log/events into Spark Streaming
2. Dependable deployment/operation of BDAS
– POD applied to Hadoop before, maybe BDAS?
3. Multi-level granularity access for data analytics
– Australian Urban Research Infrastructure Network (AURIN)
• Portal to provide transport-related data to international researchers
• Cluster sharing for in-portal pre-processing and analytics
• de-anonymization concerns and different views for the same data
– Evaluating how BDAS can support this

Projects Related to BDAS (2/2)
Redacted
4. Data scientist workflow and local exploration
5. Distributed machine learning

Team Acknowledgement
• Researchers
– Len Bass
– Alan Fekete
– Anna Liu
– Daniel Sun
– Hiroshi Wada
– Ingo Weber
– Sherry Xu
– Liming Zhu
• Engineers
– Adnene Guabtni
– Chao Li
• Students
– Amer Abdalamer
– Ahmed Alqahtani
– Mostafa Farshchi
– Min Fu
– Jin Li
– Matthew Sladescu
– Donna Xu
– DongYao Wu
