Enhancing Failure Propagation Analysis in Cloud Computing Systems - ISSRE 2019 Presentation

Enhancing Failure Propagation Analysis in
Cloud Computing Systems
Domenico Cotroneo, Luigi De Simone, Pietro Liguori,
Roberto Natella, Nematollah Bidokhti
DIETI, Università degli Studi di Napoli Federico II, Italy
Futurewei Technologies, Inc., USA
ISSRE 2019, Berlin, Germany, October 28-31, 2019

Failure propagation in cloud systems
• Faults: storage, network, software, ...
• Failure propagation across components and layers
• Designers need to anticipate how failures can possibly propagate

Analyzing failures with fault injection
Fault Injection
• Network partitions
• Latency
• Node crashes
• ...
Fault-injection experiment: workload, system, monitoring tools (logs, distributed tracing)
Failure propagation analysis is still too cumbersome!
• Large volumes of data
• Noise in the data
(«false anomalies», not actually caused
by the fault!)
• Black-box systems

Spotting differences between normal
and failed executions (OpenStack example)
Normal execution (no faults): REST API, Neutron, Nova, Cinder
Execution with fault injection: REST API, Neutron, Nova, Cinder
Workload phases: system up; creating virtual instances and networks; creating and attaching volumes; de-provisioning
• The actual failure propagation (storage not activated)
• Many «false positives» (messages in different order or skipped)

Contributions
• A novel approach for failure propagation analysis
  • Fault injection + distributed tracing + anomaly detection
  • Driving idea: probabilistic model (variable-order Markov) of events under fault-free conditions
• Case study: OpenStack
  • High accuracy (false «anomalies» and actual failure symptoms are not confused with each other)
  • Low computational cost
  • Quick training (few training traces required)

Overview of the approach
Step 1: Instrument communication APIs (REST, message queues, ...) for tracing
Step 2: Run the system without fault injection; collect fault-free traces
Step 3: Perform fault injection; collect a faulty trace
Model training of normal behavior (on the fault-free traces)
Step 4: Anomaly detection on the faulty traces
Step 5: Report results to the human analyst
Presentation: event timelines (one per node: A, B, C), e.g., «Something unexpected happened in C!»
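Step 1 can be pictured with a minimal sketch, assuming a Python service whose API handlers can be wrapped by a decorator. The names `traced`, `timelines`, `node-A`, and the `create_volume` handler are hypothetical and only illustrate how one event timeline per node could be collected; the slides do not describe the actual instrumentation.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory store: node name -> list of (timestamp, event name).
timelines = defaultdict(list)

def traced(node, event):
    """Record an event on the node's timeline every time the call is made."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            timelines[node].append((time.monotonic(), event))
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical API handler standing in for a REST or message-queue endpoint.
@traced(node="node-A", event="create_volume")
def create_volume(size_gb):
    return {"size_gb": size_gb, "status": "available"}

create_volume(1)
events_on_a = [name for _, name in timelines["node-A"]]
```

A real deployment would ship these events to a distributed tracing backend instead of a local dictionary.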

Definition of anomalous event
Events (e.g., API calls) are represented as «symbols»: create_volume = A, create_instance = B, ...
Event timelines:
Normal (i.e., fault-free) trace: A B C D
Faulty trace (fault-injection experiment): A C E
• «Common» events (same type, same order) are non-anomalous: A, C
• «Omitted» events (did not happen in the faulty trace) are anomalous: B, D
• «Spurious» events (only happened in the faulty trace) are anomalous: E
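The three event classes can be sketched with a longest-common-subsequence (LCS) alignment of the two traces. This is a minimal illustration on the example traces from this slide; the function name `classify_events` is an assumption, not the authors' implementation.

```python
from typing import List, Tuple

def classify_events(normal: List[str], faulty: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (common, omitted, spurious) events from an LCS alignment."""
    n, m = len(normal), len(faulty)
    # lcs[i][j] = length of the LCS of normal[i:] and faulty[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if normal[i] == faulty[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    common, omitted, spurious = [], [], []
    i = j = 0
    while i < n and j < m:
        if normal[i] == faulty[j]:
            common.append(normal[i])      # same type, same order
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            omitted.append(normal[i])     # only in the fault-free trace
            i += 1
        else:
            spurious.append(faulty[j])    # only in the faulty trace
            j += 1
    omitted.extend(normal[i:])
    spurious.extend(faulty[j:])
    return common, omitted, spurious

common, omitted, spurious = classify_events(list("ABCD"), list("ACE"))
```

On the slide's traces this yields A and C as common, B and D as omitted, and E as spurious.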

Anomaly detection approach
Faulty trace: A B A C D
Fault-free traces:
A B A E
A B A E E
A B A E
A B A E D
...
Sequence alignment selects the most similar fault-free trace: A B A E D
Non-common events: C (faulty trace only), E (fault-free trace only)
Probabilistic model: P(C|A B A), P(E|A B A)
• If low probability, it is a spurious anomaly (should not normally happen)
• If high probability, it is an omission anomaly (should normally happen)
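The decision rule on this slide can be sketched as follows, under several assumptions: `difflib.SequenceMatcher` stands in for the sequence alignment step, plain conditional frequencies stand in for the probabilistic model, and the `THRESHOLD` and `MAX_CONTEXT` values are hypothetical. The traces are the ones shown on the slide.

```python
from collections import Counter
from difflib import SequenceMatcher

FAULT_FREE = [list("ABAE"), list("ABAEE"), list("ABAE"), list("ABAED")]
FAULTY = list("ABACD")
THRESHOLD = 0.5    # hypothetical cut-off between "low" and "high" probability
MAX_CONTEXT = 3    # hypothetical maximum number of preceding events used as context

def most_similar(trace, candidates):
    """Pick the fault-free trace closest to the faulty one (difflib similarity ratio)."""
    def similarity(candidate):
        return SequenceMatcher(None, candidate, trace, autojunk=False).ratio()
    return max(candidates, key=similarity)

def cond_prob(symbol, context, traces):
    """Empirical P(symbol | context), counted over the fault-free traces."""
    k = len(context)
    nexts = Counter()
    for t in traces:
        for i in range(k, len(t)):
            if t[i - k:i] == context:
                nexts[t[i]] += 1
    total = sum(nexts.values())
    return nexts[symbol] / total if total else 0.0

def detect(faulty, fault_free):
    reference = most_similar(faulty, fault_free)
    matcher = SequenceMatcher(None, reference, faulty, autojunk=False)
    anomalies = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # common events are non-anomalous
        for i in range(i1, i2):  # events only in the fault-free trace
            ctx = reference[max(0, i - MAX_CONTEXT):i]
            if cond_prob(reference[i], ctx, fault_free) >= THRESHOLD:
                anomalies.append(("omission", reference[i]))
        for j in range(j1, j2):  # events only in the faulty trace
            ctx = faulty[max(0, j - MAX_CONTEXT):j]
            if cond_prob(faulty[j], ctx, fault_free) < THRESHOLD:
                anomalies.append(("spurious", faulty[j]))
    return anomalies

anomalies = detect(FAULTY, FAULT_FREE)
```

Here E always follows A B A in the fault-free traces, so its absence is reported as an omission, while C never follows A B A and is reported as spurious.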

Variable-order Markov models (VMM)
• VMMs are a popular and powerful technique for probabilistic modeling of sequences
  • E.g., in bioinformatics and compression algorithms (RAR)
• States represent observable events
• The probability of an event depends on the sequence of previous events

VMM example (Belazzougui and Cunial, 2016)
[Figure: suffix tree over the symbols a, c, g, t]
S = agatagatcgcctgtcgatcgatgaattaaccgat ... (time)
Variable-length «context», e.g., accga:
P(a|accga)
P(c|accga)
P(g|accga)
P(ε|accga)
The VMM uses a suffix tree to represent all suffixes in the training set
For each suffix, the VMM learns the conditional probability of symbols given the suffix
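A minimal sketch of the idea, assuming a dictionary of contexts in place of a real suffix tree and no smoothing for unseen symbols. The class name `SimpleVMM`, the `max_order` parameter, and the toy training string are hypothetical. Prediction uses the longest suffix of the context that was seen during training.

```python
from collections import Counter, defaultdict

class SimpleVMM:
    """Toy variable-order Markov model: context tuple -> next-symbol counts."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        self.counts = defaultdict(Counter)

    def train(self, sequence):
        # For every position, count the symbol under each context of length 0..max_order.
        for i, symbol in enumerate(sequence):
            for k in range(self.max_order + 1):
                if i - k < 0:
                    break
                self.counts[tuple(sequence[i - k:i])][symbol] += 1

    def prob(self, symbol, context):
        ctx = tuple(context)
        ctx = ctx[max(0, len(ctx) - self.max_order):]
        # Back off from the longest suffix of the context to the empty context.
        for start in range(len(ctx) + 1):
            suffix = ctx[start:]
            if suffix in self.counts:
                nexts = self.counts[suffix]
                return nexts[symbol] / sum(nexts.values())
        return 0.0  # model has not been trained

vmm = SimpleVMM(max_order=3)
vmm.train("abcabcabd")               # toy sequence, not the one on the slide
p_known = vmm.prob("c", "xab")       # "xab" unseen, backs off to the context "ab"
p_fallback = vmm.prob("a", "zz")     # no suffix seen, falls back to symbol frequency
```

In the toy sequence, "ab" is followed by c twice and by d once, so the first query backs off to a probability of 2/3.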

Why VMM?
• «Plain» Markov models
  • The memoryless property does not apply
  • E.g., in OpenStack, before creating a volume, an instance must have been created and initialized!
• Hidden Markov models (separate observations from states)
  • Difficult to tune the number/probabilities of hidden states
• Recurrent Neural Networks
  • Tailored for massive training sets, high overhead
• It is desirable to train the model with a small number of fault-free executions

Case study: OpenStack
OpenStack services: Nova, Horizon, Cinder, Neutron, Glance, Keystone, Swift
API requests: instance create, volume create, volume attach, ...
Internal message queues

Experimental setup
• Workloads
  • New IaaS deployment: creates new instances, volumes, networks from scratch
  • Network management: creates and re-configures virtual networks with virtual routers, etc.
  • Storage management: updates, rebuilds and boots instances
• Fault injections (2137 experiments, 1432 failures)
  • Throw exception
  • Wrong return value
  • Wrong parameter value
  • Delay
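The four fault types can be illustrated with a small wrapper around a call. This is a sketch under stated assumptions, not the injection tool used in the experiments: `inject_fault`, its option names, and the `attach_volume` function are all hypothetical.

```python
import functools
import time

def inject_fault(kind, **opts):
    """Wrap a function so that it exhibits one of the four fault types."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if kind == "exception":
                raise opts.get("exc", RuntimeError("injected fault"))
            if kind == "wrong_param":
                # Corrupt the first positional argument.
                args = (opts["value"],) + args[1:]
            if kind == "delay":
                time.sleep(opts.get("seconds", 0.01))
            result = func(*args, **kwargs)
            if kind == "wrong_return":
                return opts.get("value")
            return result
        return wrapper
    return decorator

# Hypothetical target call standing in for an internal API of the system.
def attach_volume(volume_id):
    return f"attached:{volume_id}"

wrong_return = inject_fault("wrong_return", value=None)(attach_volume)("v1")
wrong_param = inject_fault("wrong_param", value="bogus")(attach_volume)("v1")
delayed = inject_fault("delay", seconds=0.01)(attach_volume)("v1")
```

Each experiment would enable one such fault and then collect the resulting faulty trace.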

Results – False positives
[Figure: % of false positives vs. number of training traces (5 to 20); one panel per workload: New depl., Network, Storage; (sequence alignment alg.)]
• VMM trained with up to 20 fault-free traces
• Ground truth built on 200+ fault-free traces and manual analysis
• VMM filters more than half of the false anomalies

Results – False negatives
• VMM filters out some anomalies (false negatives) in less than 1% of failures
• Negligible risk of missing a real failure in the dataset
• Only non-propagating failures are missed

Workload type     LCS      LCS with VMM
New deployment    6.35%    7.00%
Network mgmt.     0%       0.91%
Storage mgmt.     0.88%    1.37%

Results – Computational cost
• The computational cost grows linearly
• It is small enough for practical purposes
  • E.g., training takes ~5 minutes for up to 40 fault-free traces
[Figures: training time (s) vs. number of training traces (5 to 40); training time (s) vs. number of events per trace (0 to 2500); classification time (s) vs. number of events per trace (0 to 2500)]

Conclusion
• We presented an approach for analyzing execution traces of distributed systems under fault injection
• The approach addresses non-determinism by using a probabilistic model for sequence analysis
• Future work: discovering failure modes in large fault injection datasets through clustering
