Enhancing Failure Propagation Analysis in
Cloud Computing Systems
Domenico Cotroneo, Luigi De Simone, Pietro Liguori,
Roberto Natella, Nematollah Bidokhti
DIETI, Università degli Studi di Napoli Federico II, Italy
Futurewei Technologies, Inc., USA
ISSRE 2019, Berlin, Germany, October 28-31, 2019
Failure propagation in cloud systems
Fault
Storage, network, software, ...
Failure propagation across
components and layers
Designers need to anticipate
how failures can possibly
propagate
Analyzing failures with fault injection
Fault Injection
• Network partitions
• Latency
• Node crashes
• ...
Fault-Injection Experiment
• Monitoring tools
• Workload logs
• System logs
• Distributed tracing
Failure propagation analysis is still too cumbersome!
• Large volumes of data
• Noise in the data
(«false anomalies», not actually caused
by the fault!)
• Black-box systems
Spotting differences between normal
and failed executions (OpenStack example)
[Figure: message timelines across REST API, Neutron, Nova, and Cinder, comparing a normal execution (no faults) with an execution under fault injection. Phases: creating virtual instances and networks; creating and attaching volumes; system up; de-provisioning.]
• The actual failure propagation (storage not activated)
• Many «false positives» (messages in different order or skipped)
Contributions
• A novel approach for failure propagation analysis
• Fault injection + distributed tracing + anomaly detection
• Driving idea: probabilistic model (variable-order Markov) of events under fault-free conditions
• Case study: OpenStack
• High accuracy (false «anomalies» are not mistaken for actual failure symptoms)
• Low computational cost
• Quick training (few training traces required)
Overview of the approach
Step 1: Instrument communication APIs (REST, Msg Queues, ...) for tracing
Step 2: Run the system without fault injection; collect fault-free traces
Step 3: Perform fault injection; collect a faulty trace
Step 4: Anomaly detection on the faulty traces
Step 5: Report results to the human analyst
[Figure: traces from nodes A, B, C feed model training of normal behavior and anomaly detection; results are presented as event timelines (one per node), e.g. "Something unexpected happened in C!"]
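Step 1 can be illustrated with a Python decorator that records one event per call to a communication API. This is only a minimal sketch, not the instrumentation used in the study: `traced`, `TRACE`, `timeline`, and the two stub API functions are hypothetical names, and the trace is kept in memory rather than sent to a tracing backend.

```python
import functools
import time

# Hypothetical in-memory trace; a real deployment would ship events to a collector.
TRACE = []

def traced(node):
    """Record (timestamp, node, event name) for every call to the wrapped API."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            TRACE.append((time.monotonic(), node, func.__name__))
            return func(*args, **kwargs)
        return wrapper
    return decorator

@traced(node="C")
def create_volume(size_gb):
    # Stub standing in for a real API call.
    return {"size_gb": size_gb}

@traced(node="B")
def create_instance(name):
    # Stub standing in for a real API call.
    return {"name": name}

def timeline(node):
    """Return the ordered event names observed on one node."""
    return [event for _, n, event in TRACE if n == node]

create_instance("vm-1")
create_volume(10)
create_volume(20)
```

Calling `timeline("C")` then yields the per-node event sequence that the later steps treat as a trace.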
Definition of anomalous event
Event timelines (over time t)
Normal (i.e., fault-free) trace: A B C D
Faulty trace (fault-injection experiment): A C E
Events (e.g., API calls) represented as «symbols»
create_volume = A
create_instance = B
...
«Common» events (same type, same order) are non-anomalous: A, C
«Omitted» events (missing from the faulty trace) are anomalous: B, D
«Spurious» events (occurring only in the faulty trace) are anomalous: E
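The three event classes can be computed by aligning the two traces. The sketch below assumes the alignment is a longest common subsequence (the results table labels its baseline LCS). The function names `lcs_table` and `classify` are illustrative, and the input is the normal trace A B C D and faulty trace A C E from this slide.

```python
def lcs_table(a, b):
    """Dynamic-programming table: t[i][j] = LCS length of a[i:] and b[j:]."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                t[i][j] = 1 + t[i + 1][j + 1]
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])
    return t

def classify(normal, faulty):
    """Split events into common, omitted, and spurious along one LCS alignment."""
    t = lcs_table(normal, faulty)
    common, omitted, spurious = [], [], []
    i = j = 0
    while i < len(normal) and j < len(faulty):
        if normal[i] == faulty[j]:
            common.append(normal[i])      # same type, same order
            i += 1
            j += 1
        elif t[i + 1][j] >= t[i][j + 1]:
            omitted.append(normal[i])     # present only in the normal trace
            i += 1
        else:
            spurious.append(faulty[j])    # present only in the faulty trace
            j += 1
    omitted.extend(normal[i:])
    spurious.extend(faulty[j:])
    return common, omitted, spurious

common, omitted, spurious = classify(list("ABCD"), list("ACE"))
```

For this input the result is common = A, C; omitted = B, D; spurious = E.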
Anomaly detection approach
Faulty trace: A B A C D
Fault-free traces:
A B A E
A B A E E
A B A E
A B A E D
...
Sequence alignment of the faulty trace with the most similar fault-free trace:
A B A C D (faulty)
A B A E D (fault-free)
Non-common events: C (only in the faulty trace), E (only in the fault-free trace)
Probabilistic model: P(C|A B A), P(E|A B A)
If an event found only in the faulty trace has low probability, it is a spurious anomaly (should not normally happen)
If an event missing from the faulty trace has high probability, it is an omission anomaly (should normally happen)
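The two probabilities on this slide can be estimated by simple counting over the fault-free traces. The following is a sketch under stated assumptions: a fixed context A B A, the four fault-free traces listed on the slide, and a cut-off `THRESHOLD` of 0.5 that is an illustrative value, not one taken from the slides.

```python
from collections import Counter

def conditional_probability(traces, context, symbol):
    """Estimate P(symbol | context) by counting occurrences in fault-free traces."""
    context = tuple(context)
    k = len(context)
    followers = Counter()
    for trace in traces:
        for i in range(len(trace) - k):
            if tuple(trace[i:i + k]) == context:
                followers[trace[i + k]] += 1
    total = sum(followers.values())
    return followers[symbol] / total if total else 0.0

fault_free = [list("ABAE"), list("ABAEE"), list("ABAE"), list("ABAED")]

p_c = conditional_probability(fault_free, "ABA", "C")
p_e = conditional_probability(fault_free, "ABA", "E")

THRESHOLD = 0.5  # assumed cut-off, not taken from the slides
spurious_anomaly = p_c < THRESHOLD   # C appears only in the faulty trace
omission_anomaly = p_e >= THRESHOLD  # E is missing from the faulty trace
```

Here C never follows A B A in the fault-free traces, while E always does, so both non-common events are reported as anomalies.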
Variable-order Markov models (VMM)
• VMMs are a popular and powerful technique for probabilistic modeling of sequences
• E.g., in bioinformatics and compression algorithms (RAR)
• States represent observable events
• The probability of an event depends on the sequence of previous events
VMM example (Belazzougui and Cunial, 2016)
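[Figure: suffix tree over the symbols a, c, g, t, built from the training sequence S = agatagatcgcctgtcgatcgatgaattaaccgat ... (ordered by time). For the variable-length «context» accga, the model gives P(a|accga), P(c|accga), P(g|accga), P(ε|accga).]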
The VMM uses a suffix tree to
represent all suffixes in the training set
For each suffix, the VMM learns the
conditional probability of symbols
given the suffix
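A toy version of this idea is sketched below. It is not the suffix-tree implementation from the cited work: a plain dictionary of contexts stands in for the suffix tree, there is no smoothing, and `SimpleVMM`, `max_order`, and the back-off to the longest previously seen suffix are assumptions made for illustration. The training string is the visible part of S from the slide, with the elided remainder left out.

```python
from collections import Counter, defaultdict

class SimpleVMM:
    """Toy variable-order Markov model with longest-context back-off."""

    def __init__(self, max_order=5):
        self.max_order = max_order
        # Maps each context string (length 0..max_order) to next-symbol counts.
        self.counts = defaultdict(Counter)

    def train(self, sequence):
        for i, symbol in enumerate(sequence):
            for k in range(min(self.max_order, i) + 1):
                self.counts[sequence[i - k:i]][symbol] += 1

    def probability(self, symbol, history):
        """P(symbol | longest suffix of history seen during training)."""
        for k in range(min(self.max_order, len(history)), -1, -1):
            context = history[len(history) - k:]
            if context in self.counts:
                followers = self.counts[context]
                return followers[symbol] / sum(followers.values())
        return 0.0

model = SimpleVMM(max_order=5)
model.train("agatagatcgcctgtcgatcgatgaattaaccgat")

p_t = model.probability("t", "accga")         # full context accga was seen
p_a = model.probability("a", "accga")
p_backoff = model.probability("t", "ttcga")   # ttcga unseen, falls back to tcga
```

The context length used for each prediction varies with what was observed in training, which is the property that distinguishes a VMM from a fixed-order Markov chain.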
Why VMM?
• «Plain» Markov models
• The memoryless property does not apply
• E.g., in OpenStack, before creating a volume, an instance must have been created and initialized!
• Hidden Markov models (separate observations from states)
• Difficult to tune the number/probabilities of hidden states
• Recurrent Neural Networks
• Tailored for massive training sets, high overhead
• It is desirable to train the model with a small number of fault-free executions
Case study: OpenStack
[Figure: OpenStack components (Nova, Horizon, Cinder, Neutron, Glance, Keystone, Swift) connected by API requests (instance create, volume create, volume attach, ...) and internal message queues]
Experimental setup
• Workloads
• New IaaS deployment: creates new instances, volumes, networks from scratch
• Network management: creates and re-configures virtual networks with virtual routers, etc.
• Storage management: updates, rebuilds and boots instances
• Fault injections (2137 experiments, 1432 failures)
• Throw exception
• Wrong return value
• Wrong parameter value
• Delay
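The four fault types listed above can be emulated in Python with a wrapper around the target function. This is a hedged sketch rather than the injection tool used in the experiments: `inject_fault`, its `kind` strings and parameters, and the `attach_volume` target are all hypothetical.

```python
import functools
import time

def inject_fault(kind, **params):
    """Wrap a function so each call exhibits one of the four fault types."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if kind == "throw_exception":
                raise params.get("exception", RuntimeError("injected fault"))
            if kind == "wrong_parameter":
                # Overwrite one keyword argument with a corrupted value.
                kwargs[params["name"]] = params["value"]
            if kind == "delay":
                time.sleep(params.get("seconds", 0.01))
            result = func(*args, **kwargs)
            if kind == "wrong_return":
                # Discard the real result and return a corrupted one.
                return params.get("value")
            return result
        return wrapper
    return decorator

def attach_volume(instance, volume, *, mode="rw"):
    """Hypothetical target function standing in for a real API."""
    return f"{volume}->{instance}:{mode}"

faulty_return = inject_fault("wrong_return", value=None)(attach_volume)
faulty_param = inject_fault("wrong_parameter", name="mode", value="??")(attach_volume)
faulty_raise = inject_fault("throw_exception")(attach_volume)
faulty_delay = inject_fault("delay", seconds=0.01)(attach_volume)
```

Each wrapped variable behaves like `attach_volume` except for the single injected fault, so one experiment can enable exactly one fault at a time.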
Results – False positives
[Figure: % false positives (0 to 7) vs. number of training traces (5 to 20), one panel per workload (New depl., Network, Storage); the legend includes the sequence alignment alg.]
• VMM trained with up to 20 fault-free traces
• Ground truth built on 200+ fault-free traces and manual analysis
• VMM filters more than half of the false anomalies
Results – False negatives
• VMM filters out some anomalies (false negatives) in less than 1% of failures
• Negligible risk of missing a real failure in the dataset
• Only non-propagating failures are missed

Workload type     LCS      LCS with VMM
New deployment    6.35%    7.00%
Network mgmt.     0%       0.91%
Storage mgmt.     0.88%    1.37%
Results – Computational cost
• The computational cost grows linearly
• It is small enough for practical purposes
• E.g., training takes ~5 minutes for up to 40 fault-free traces
[Figure: three panels of time (s). Training time (wrt #traces): number of training traces from 5 to 40. Training time (wrt #evts) and Classification (wrt #evts): number of events per trace from 0 to 2500.]
Conclusion
• We presented an approach for analyzing execution traces of distributed systems under fault injection
• The approach addresses non-determinism by using a probabilistic model for sequence analysis
• Future work: discovering failure modes in large fault injection datasets through clustering
[Figure: overview of the approach. Fault-free traces and a faulty trace from nodes A, B, C feed model training of normal behavior and anomaly detection; results are presented as event timelines (one per node), e.g. "Something unexpected happened in C!"]
