Silent error resilience in
numerical time-stepping schemes
Austin Benson
arbenson@stanford.edu
Stanford University
ICME Colloquium, Jan. 26 2015
Joint work with
Sven Schmit, Stanford
Rob Schreiber, HP Labs
code + data: http://stanford.edu/~arbenson/silent.html
paper: Intl. J. of High Performance Computing Applications, 2014
1
 Computer systems are getting bigger and more complicated.
 Software systems are getting bigger and more complicated.
 Pushing energy limits.
 Things break.
2
What breaks?
 Hardware wears out
 Bit flips from cosmic rays
 Data races and other software bugs
 Firmware bugs
Silent errors are errors in application state that
have escaped low-level error detection.
3
What can we do?
 Checkpoint/restart: Occasionally save state of
system. If error is detected, restart.
Does not scale. How to detect errors?
 Other algorithm-based fault tolerance (ABFT): clever techniques
tailored to particular algorithms.
 This work: Error detection for iterative
computation in general, numerical time-stepping
schemes in particular.
4
Spot the error!
5
At time step 120, a single entry in the
right-hand side of the Crank-Nicolson and
Backward Euler linear solves was multiplied by 0.995.
6
General algorithm:
 “Base method” generates sequence B1, B2, …
 “Auxiliary method” generates sequence A1, A2, …
 If Di = ||Bi – Ai|| is abnormal, possible error
7
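The general algorithm above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the toy problem u' = -u, the step size, the helper names (`difference_sequence`, `midpoint`, `euler`, `faulty_midpoint`) and the injected fault are all assumptions made for the example, and the infinity norm is used for ||.||.

```python
from typing import Callable, List

State = List[float]
Step = Callable[[int, State], State]


def max_norm_diff(b: State, a: State) -> float:
    """Infinity norm of b - a."""
    return max(abs(x - y) for x, y in zip(b, a))


def difference_sequence(base_step: Step, aux_step: Step,
                        u0: State, n_steps: int) -> List[float]:
    """Return [D_1, ..., D_n] with D_i = ||B_i - A_i||.

    Both methods step from the current base state; only the base
    result is carried forward to the next step.
    """
    u = list(u0)
    diffs = []
    for n in range(n_steps):
        b = base_step(n, u)
        a = aux_step(n, u)
        diffs.append(max_norm_diff(b, a))
        u = b
    return diffs


# Toy problem (an assumption for illustration): u' = -u with h = 0.1.
H = 0.1


def midpoint(n: int, u: State) -> State:
    """Base method: second-order midpoint rule applied to u' = -u."""
    return [x * (1 - H + H * H / 2) for x in u]


def euler(n: int, u: State) -> State:
    """Auxiliary method: first-order forward Euler applied to u' = -u."""
    return [x * (1 - H) for x in u]


def faulty_midpoint(n: int, u: State) -> State:
    """Hypothetical silent error: scale the base result by 0.995 at step 5."""
    out = midpoint(n, u)
    return [x * 0.995 for x in out] if n == 5 else out


clean = difference_sequence(midpoint, euler, [1.0], 10)
faulty = difference_sequence(faulty_midpoint, euler, [1.0], 10)
```

In this toy run the two sequences agree up to the faulty step, where the difference falls to roughly a tenth of its fault-free value. Deciding what counts as "abnormal" is left to a separate detector.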
Base method:
high-order numerical integration scheme:
Runge-Kutta 5
Auxiliary method:
lower-order scheme: Runge-Kutta 4
Difference:
Di = ||Bi – Ai||
Re-purposing an old idea for step-size control
[Fehlberg, 1969], [Dormand and Prince, 1980]
8
Key idea: re-use data
RK 1/2 scheme for u’ = f(t, u)
Second-order scheme has local error O(h^3)
No extra function evaluations.
Provides O(h^2) check.
9
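A minimal sketch of one RK 1/2 step, following the formulas in the editor's note for this slide: the stage k1 is computed once and shared by the midpoint (base) update and the forward Euler (auxiliary) update. The scalar test problem u' = -u and the `counted` wrapper are assumptions added only to show the evaluation count.

```python
from typing import Callable, Tuple


def rk12_step(f: Callable[[float, float], float],
              t: float, u: float, h: float) -> Tuple[float, float]:
    """One step of the RK 1/2 pair; returns (u_base, D)."""
    k1 = f(t, u)                                    # shared stage
    u_base = u + h * f(t + h / 2, u + h * k1 / 2)   # midpoint rule, 2nd order
    u_aux = u + h * k1                              # forward Euler, 1st order
    return u_base, abs(u_aux - u_base)


def counted(f: Callable[[float, float], float]):
    """Wrap f so that the number of evaluations can be inspected."""
    def wrapper(t: float, u: float) -> float:
        wrapper.calls += 1
        return f(t, u)
    wrapper.calls = 0
    return wrapper


f = counted(lambda t, u: -u)   # hypothetical test problem u' = -u
_, d_coarse = rk12_step(f, 0.0, 1.0, 0.1)
_, d_fine = rk12_step(f, 0.0, 1.0, 0.05)
```

Each step costs the two evaluations the midpoint rule needs anyway, and halving h reduces the check value D by about a factor of four, consistent with an O(h^2) check.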
Key idea: re-use data
Implicit solve that is stable (e.g., Backward Euler)
Explicit solve checks (e.g., Forward Euler)
It is OK that the explicit solve may be unstable. (Why?)
10
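A sketch of the implicit/explicit pairing for a 1D heat equation, using the forms A U^B_{n+1} = U^B_n and U^A_{n+1} = B U^B_n from the editor's note with A = I - rL and B = I + rL. Several things here are assumptions for illustration: zero Dirichlet boundaries, no forcing term, a sine initial condition, the function names, and a fault that hits only the right-hand side of the implicit solve. The grid parameters are the ones listed in the editor's notes.

```python
import math
from typing import List, Optional, Tuple


def apply_explicit(u: List[float], r: float) -> List[float]:
    """Return (I + r*L) u with L = tridiag(1, -2, 1), zero Dirichlet ends."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append(u[i] + r * (left - 2.0 * u[i] + right))
    return out


def solve_implicit(rhs: List[float], r: float) -> List[float]:
    """Solve (I - r*L) x = rhs with the Thomas algorithm."""
    n = len(rhs)
    sub, diag, sup = -r, 1.0 + 2.0 * r, -r
    c = [0.0] * n
    d = [0.0] * n
    c[0] = sup / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):                 # forward elimination
        denom = diag - sub * c[i - 1]
        c[i] = sup / denom
        d[i] = (rhs[i] - sub * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def checked_step(u: List[float], r: float,
                 fault_index: Optional[int] = None,
                 fault_factor: float = 1.0) -> Tuple[List[float], float]:
    """Backward Euler step with a Forward Euler check; returns (U_base, D)."""
    rhs = list(u)
    if fault_index is not None:
        rhs[fault_index] *= fault_factor  # hypothetical injected fault
    u_base = solve_implicit(rhs, r)       # base: Backward Euler
    u_aux = apply_explicit(u, r)          # auxiliary: one Forward Euler step
    d = max(abs(b - a) for b, a in zip(u_base, u_aux))
    return u_base, d


NU, DX, DT = 1.0 / 100, 1.0 / 160, 1.0 / 100
R = NU * DT / DX ** 2                     # about 2.56
u0 = [math.sin(math.pi * (i + 1) * DX) for i in range(159)]
_, d_clean = checked_step(u0, R)
_, d_fault = checked_step(u0, R, fault_index=79, fault_factor=0.995)
```

With these parameters r is about 2.56, well above the r <= 1/2 limit for iterating Forward Euler stably. In this sketch the explicit step is restarted from the base state every time, so it is only ever applied once to a given state.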
Lots of these checks for
numerical time-stepping algorithms…
 Backward/Forward Euler
 Richardson/Crank-Nicolson
 Runge-Kutta 1/2, 2/3, 4/5
 Adams-Bashforth linear multistep method 2/3, 4/5
 Explicit check on implicit scheme
 Extrapolation
11
Exercise in step detection (change point detection)
Algorithmic details in the paper. Main parameters:
Relative jump
Variance change
12
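The two quantities can be computed directly from the sequence D, using the definitions of J and V in the editor's note for this slide. The decision rule below (flag a step if J or V exceeds a threshold) and the threshold names `tau_jump` and `tau_var` are assumptions for illustration; the actual algorithm is in the paper. Zero-variance and zero-D guards are omitted.

```python
from statistics import variance
from typing import List, Sequence


def relative_jump(d: Sequence[float], n: int) -> float:
    """J_{n+1} = (D_{n+1} - D_n) / D_n."""
    return (d[n + 1] - d[n]) / d[n]


def variance_change(d: Sequence[float], n: int, p: int) -> float:
    """V_{n+1} = Var(D_{n-p+1}, ..., D_{n+1}) / Var(D_{n-p}, ..., D_n)."""
    return variance(d[n - p + 1:n + 2]) / variance(d[n - p:n + 1])


def flag_steps(d: Sequence[float], p: int,
               tau_jump: float, tau_var: float) -> List[int]:
    """Indices n+1 where J or V exceeds its (hypothetical) threshold."""
    flagged = []
    for n in range(p, len(d) - 1):
        if (relative_jump(d, n) > tau_jump
                or variance_change(d, n, p) > tau_var):
            flagged.append(n + 1)
    return flagged


# Synthetic difference sequence: smooth decay with one spike at index 8.
d = [0.005 * 0.905 ** i for i in range(12)]
d[8] *= 50.0
flagged = flag_steps(d, p=3, tau_jump=2.0, tau_var=10.0)
```

On this synthetic sequence only the spiked index is flagged: once the spike is inside both variance windows their ratio returns to about one.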
Experimental setup:
 Solve heat equation for T time steps and
artificially inject error at one time step.
 Do this many times with different
types of errors.
 True positive rate:
#(real errors detected) / #(trials)
 False positive rate:
#(non-errors “detected”) / #(time steps)
13
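The two rates can be tallied as follows. This sketch assumes each trial records the injected time step and the list of flagged steps, that the false positive denominator is the total number of time steps over all trials, and that a flag counts as a detection if it falls within a hypothetical `window` of steps after the injection; none of these conventions is stated on the slide.

```python
from typing import List, Sequence, Tuple

Trial = Tuple[int, List[int]]   # (injected step, flagged steps)


def detection_rates(trials: Sequence[Trial], n_steps: int,
                    window: int = 0) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate)."""
    detected = 0
    false_alarms = 0
    for injected_step, flagged in trials:
        hits = [s for s in flagged
                if injected_step <= s <= injected_step + window]
        if hits:
            detected += 1
        false_alarms += len(flagged) - len(hits)
    tpr = detected / len(trials)
    fpr = false_alarms / (n_steps * len(trials))
    return tpr, fpr


# Three hypothetical trials of 200 time steps, each with a fault at step 120.
trials = [(120, [120]), (120, [50, 121]), (120, [])]
tpr_strict, fpr_strict = detection_rates(trials, n_steps=200)
tpr_tardy, fpr_tardy = detection_rates(trials, n_steps=200, window=1)
```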
Are large errors easier to detect?
Local truncation error (LTE)-normalized error
Output when no fault is injected.
Output when fault is injected.
14
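A small sketch of the LTE-normalized error from the editor's note for this slide, L_i = ||B_i - B̂_i|| / ||B̂_i - Â_i||. It assumes the hatted quantities are the fault-free outputs, which is how the slide labels read, and it uses the infinity norm. The function name and the sample vectors are made up for the example.

```python
from typing import Sequence


def max_norm(v: Sequence[float]) -> float:
    """Infinity norm of a vector."""
    return max(abs(x) for x in v)


def lte_normalized_error(b_fault: Sequence[float],
                         b_clean: Sequence[float],
                         a_clean: Sequence[float]) -> float:
    """Difference caused by the fault, in units of the base/auxiliary gap."""
    caused_by_fault = max_norm([x - y for x, y in zip(b_fault, b_clean)])
    lte_estimate = max_norm([x - y for x, y in zip(b_clean, a_clean)])
    return caused_by_fault / lte_estimate


# Made-up outputs at one time step.
ratio = lte_normalized_error(b_fault=[1.0, 2.5],
                             b_clean=[1.0, 2.0],
                             a_clean=[0.75, 2.0])
```

A ratio well above one means the fault moved the base output by more than the estimated local truncation error.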
Error injection:
Multiply single entry of RHS
in linear solves by
z ~ N(1, 5e-5) at a single
time step
15
Error injection:
Multiply q(x, t) at one
discrete x by z ~ N(1, 0.1)
at a single time step
16
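Both injection experiments multiply one value by a random factor z at a single time step. The sketch below assumes the second parameter of N(1, ·) is a variance, which the slides do not say, and uses a hypothetical helper name and a fixed seed for repeatability.

```python
import math
import random
from typing import List, Tuple


def inject_multiplicative_fault(values: List[float], index: int,
                                variance: float,
                                rng: random.Random) -> Tuple[List[float], float]:
    """Return a copy of values with one entry scaled by z ~ N(1, variance)."""
    z = rng.gauss(1.0, math.sqrt(variance))   # gauss takes a standard deviation
    out = list(values)
    out[index] *= z
    return out, z


rng = random.Random(0)
rhs = [1.0, 2.0, 3.0, 4.0]                    # stand-in right-hand side
faulty_rhs, z = inject_multiplicative_fault(rhs, 2, 5e-5, rng)
```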
Takeaways
 We have a general framework for detecting silent errors.
 Numerical integration is our central application.
 We detect large errors more easily.
 Not too many false positives.
17
Resilience: what do we need to discuss?
 How many silent errors are there? How worried should we be?
 Do we need systems solutions or algorithmic solutions? Both?
 “Defense in depth” is good. But how easy are ABFT methods to
incorporate into existing solvers?
18
Tardy error detection
20


Editor's Notes

  • #6: $u_t = \frac{1}{100}u_{xx} + 0.1\left(\sin(2\pi t) + \cos(2\pi x)\right)$, $t \in [0, 2]$, $x \in [0, 1]$, $u(x, 0) = x(x-1)$, $\Delta x = 1/160$, $\Delta t = 1/100$
  • #7: $u_t = \frac{1}{100}u_{xx} + 0.1\left(\sin(2\pi t) + \cos(2\pi x)\right)$, $t \in [0, 2]$, $x \in [0, 1]$, $u(x, 0) = x(x-1)$, $\Delta x = 1/160$, $\Delta t = 1/100$
  • #10: $\textcolor{blue}{k_1^{B}} = f(t_n, u_n^{B})$, $u^{B}_{n+1} = u_n^{B} + hf\left(t_n + h/2, u_n^{B} + h\textcolor{blue}{k_1^{B}}/2\right)$, $u_{n+1}^{A} = u_n^{B} + h\textcolor{blue}{k_1^{B}}$, $D_{n+1} = \| u_{n+1}^{A} - u_{n+1}^{B} \|$
  • #11: $AU^{B}_{n+1} = \textcolor{blue}{U^{B}_{n}}$, $U^{A}_{n+1} = B\textcolor{blue}{U^{B}_{n}}$, $D_{n+1} = \| U^{B}_{n+1} - U^{A}_{n+1} \|$
  • #13: $D_{n+1} = \| B_{n+1} - A_{n+1} \|_{\infty}$, $J_{n+1} = \frac{D_{n+1} - D_n}{D_n}$, $V_{n+1} = \frac{\text{Var}(D_{n-p+1}, \ldots, D_{n+1})}{\text{Var}(D_{n-p}, \ldots, D_{n})}$
  • #15: $L_i = \frac{\| B_i - \hat{B}_i \|}{\| \hat{B}_i - \hat{A}_i \|} \approx \frac{\text{Difference caused by error}}{\text{local truncation error}}$
  • #16: $u_t = 0.001u_{xx} + (1 - \sqrt{1 - 4(t - t^2)}) / (2 - 2t)$, $u(x, 0) = 6|x - 1/2| - 3$
  • #17: $u_t = 0.01u_{xx} + q(x, t)$, \quad $q(x, t) = xe^{-t/2}$ $u(x, 0) = 4x(x-1)(x-2)$