Exploring Failure Transparency and the Limits of
Generic Recovery
David E. Lowell, Subhachandra Chandra, Peter M. Chen
Presented by Miroslav Cupak
10/29/2012
Introduction
failure recovery as an illusion of failure-free
operation
goal of the paper:
2 invariants for failure transparency - save/lose
work
evaluation of real-life applications
Guaranteeing Failure Transparency
requirements:
automated
fast
generic
tools:
commit events
rollback of a process
reexecution
problems:
undoable and redoable operations
some things hard to undo
some things hard to redo
Guaranteeing Failure Transparency: Failures
Guaranteeing Failure Transparency: Stop Failures
consistent recovery
Recovery is consistent if and only if there exists a
complete, failure-free execution of the computation
that would result in a sequence of visible events
equivalent to the sequence of visible events
actually output in the failed and recovered run.
ignore ordering and duplicates
non-deterministic events
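A minimal sketch of the equivalence test implied by "ignore ordering and duplicates", assuming visible events can be represented as plain strings. The function name and the event labels are hypothetical, and the paper's formal definition of equivalence may be stricter than this literal reading of the slide.

```python
def equivalent_visible_events(recovered_run, failure_free_run):
    """Compare two sequences of visible events, ignoring order and duplicates."""
    return set(recovered_run) == set(failure_free_run)


# "draw A" is output twice because the process re-executed it after a rollback.
recovered = ["prompt", "draw A", "draw A", "draw B"]
reference = ["prompt", "draw A", "draw B"]
print(equivalent_visible_events(recovered, reference))  # True
```

Under this reading, a run that repeats or reorders output is still consistent, but a run that emits an event no failure-free execution could produce is not.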
Guaranteeing Failure Transparency: Save Work
save-work invariant
A computation is guaranteed consistent recovery
from stop failures if and only if for each executed
non-deterministic event $e_p^i$ that causally precedes a
visible or commit event $e$, process $p$ executes a
commit event $e_p^j$ such that $e_p^j$ happens-before (or is
atomic with) $e$, and $i \le j$.
save-work visible and save-work orphan invariant
many protocols
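The save-work invariant can be illustrated with a small trace checker. This is a simplified sketch for a single process only: it assumes a trace is a list of event kinds, and it does not model commits that are atomic with a visible event or causal dependencies between processes. The event labels and the function name are assumptions made for the illustration.

```python
ND, VISIBLE, COMMIT, OTHER = "nd", "visible", "commit", "other"


def upholds_save_work(trace):
    """Return True if every non-deterministic event in a single-process
    trace is followed by a commit before the next visible event."""
    uncommitted_nd = False
    for kind in trace:
        if kind == ND:
            uncommitted_nd = True       # work that could be lost on a crash
        elif kind == COMMIT:
            uncommitted_nd = False      # all earlier events are now saved
        elif kind == VISIBLE and uncommitted_nd:
            return False                # output depends on unsaved non-determinism
    return True


print(upholds_save_work([ND, COMMIT, VISIBLE]))  # True
print(upholds_save_work([ND, OTHER, VISIBLE]))   # False
```

Committing after every non-deterministic event and committing just before every visible event both satisfy this check, which is one way to see why many protocols are possible.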
Guaranteeing Failure Transparency: Protocol Space
Guaranteeing Failure Transparency: Propagation Failures
non-determinism helpful
fixed vs transient events
Guaranteeing Failure Transparency: Lose Work
lose-work invariant
Application-generic recovery from propagation
failures is guaranteed to be possible if and only if
the application executes no commit event on a
dangerous path.
single/multi-process dangerous paths
algorithms
techniques for adding non-determinism
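A minimal sketch of a lose-work check on a recorded trace. It makes a strong simplifying assumption: the dangerous path is taken to be every event from fault activation up to the crash, whereas the paper defines dangerous paths more carefully in terms of fixed and transient events. The trace format, indices, and function name are hypothetical.

```python
def commits_on_dangerous_path(trace, fault_activation, crash):
    """Return True if a commit occurs between fault activation and the crash.

    trace is a list of event kinds; the dangerous path is assumed to be
    trace[fault_activation:crash].
    """
    return any(kind == "commit" for kind in trace[fault_activation:crash])


trace = ["nd", "commit", "other", "commit", "other"]
# The fault activates at index 2 and the process crashes after index 4.
print(commits_on_dangerous_path(trace, 2, 5))  # True
# If the fault activates only at index 4, no commit lies on the path.
print(commits_on_dangerous_path(trace, 4, 5))  # False
```

A commit on the dangerous path saves state that already contains the effects of the fault, so rolling back to it leads to the same crash again.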
Evaluation: Performance Penalty
4 applications: nvi, magic, xpilot, TreadMarks
failure transparency: Discount Checking
modified to intercept non-deterministic system
calls
reliable memory: Rio File Cache
transactions: Vista transaction library
7 protocols: CAND, CPVS, CBNDVS,
CAND-LOG, CBNDVS-LOG, CPV-2PC,
CBNDV-2PC
Evaluation: Performance Penalty - Results
24 performance analyses (no. of checkpoints,
increase in execution time)
interesting observations:
commit frequency decreases and performance
improves as a protocol's coordinates in the
protocol space increase (except for xpilot)
at least 1 protocol performs well for each app
disk-based recovery with reasonably low overhead
for interactive apps
Evaluation: Conflicts (Lose-work vs Save-work)
sample: 50 crashes for each of 7 fault types for
2 applications (nvi, postgres)
application faults (commit on dangerous path
after fault activation)
lose-work violation: 35%
guess for conflicts based on non-deterministic bug
distribution in other applications: 90%
OS faults
failure to recover: 3% (postgres) - 15% (nvi)
propagation failures: 10% (postgres) - 41% (nvi)
Related work
a lot of work on transactions, fault tolerance and
recovery from stop failures
notable papers:
E. N. Elnozahy et al.: A Survey of
Rollback-Recovery Protocols in Message-Passing
Systems (thorough analysis of checkpointing &
log-based recovery protocols, over 240 references)
S. Chandra: Evaluating the Recovery-Related
Properties of Software Faults (PhD thesis
explaining most of the failure-recovery terms used
in the paper)
D. E. Lowell: Theory and Practice of Failure
Transparency (consistent recovery, commit
invariant, protocol space, discount checking)
D. E. Lowell, P. M. Chen: Discount Checking (fast,
lightweight checkpointing system)
Conclusions & Contributions
extending failure transparency from stop failures to
propagation failures
invariant for surviving propagation failures
evaluation of propagation failures for which
consistent recovery is not possible
failure transparency for stop failures possible,
help from the application needed for
propagation failures
Questions?
The Discount Checking system used in the evaluations is
built on top of the Vista transaction library. Why are
transactions needed?
What are the transactions needed for?
DC knows nothing about the semantics of
application operations, so it cannot use them for
rollback
transactions can be easily used to implement
checkpointing
interval between checkpoints = body of a
transaction
taking a checkpoint = transaction committing
process state rollback to the last checkpoint =
transaction aborting
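The mapping above can be sketched in a few lines. This is an illustration only and not the Vista or Discount Checking API: the class name, its methods, and the use of an in-memory deep copy in place of reliable memory are all assumptions.

```python
import copy


class TransactionalCheckpointer:
    """Toy model: the interval between checkpoints is one transaction."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)  # last committed state

    def commit(self):
        """Take a checkpoint by committing the current transaction."""
        self._checkpoint = copy.deepcopy(self.state)

    def abort(self):
        """Roll process state back to the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


proc = TransactionalCheckpointer({"counter": 0})
proc.state["counter"] = 1
proc.commit()                  # checkpoint taken with counter == 1
proc.state["counter"] = 99     # work done inside the next transaction
proc.abort()                   # failure: undo everything since the commit
print(proc.state["counter"])   # 1
```

Because abort restores the whole state without interpreting it, the checkpointer needs no knowledge of what the application's operations mean.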
Applications failed to recover in up to 15% of cases
using Discount Checking, which can only roll
back to the last checkpoint. Why don’t they use
multiple checkpoints? Would it help?
Would using multiple checkpoints help? Why don’t they use them?
it would definitely help to decrease the recovery
failure rate
but it wouldn’t solve all the problems (starting
states)
it’s probably not used because of
implementation difficulty (the system is based on
transactions, so this would require
another layer of encapsulation)
performance overhead would increase
Failure transparency is great. What are the
drawbacks?
What are the drawbacks of failure transparency?
increased cost
complicated debugging (masked failures)
interference with other components
lower priority of fault correction
How representative are the results of the
evaluations?
How representative are the results of the evaluations?
small sample of different types of applications
with prevailing types of events
even though only 2 applications tested for
performance, significantly different results
only 1 checkpointing system
part of the statistics inferred from other
programs
the paper is 12 years old: new hardware, possibly
different bottlenecks, and newer, possibly more
efficient protocols
As a developer trying to make an application
transparent to failures, would you use the same
approach as the authors did for their evaluations
(Discount Checking)? Why?
Would you use Discount Checking?
good solution for many use cases
amazingly flexible, very easy setup for existing
applications compared to other systems we
mentioned
works out of the box, but extensible if needed
low performance overhead

