Systems failure – a socio-technical perspective




Complex software systems
• Multi-purpose. Organisational systems that support different functions within an organisation
• System of systems. Usually distributed and normally constructed by integrating existing systems/components/services
• Unlimited. Not subject to limitations derived from the laws of physics (so, no natural constraints on their size)
• Data intensive. System data orders of magnitude larger than code; long-lifetime data
• Dynamic. Changing quickly in response to changes in the business environment

Systems of systems
• Operational independence
• Managerial independence
• Multiple stakeholder viewpoints
• Evolutionary development
• Emergent behaviour
• Geographic distribution

Complex system realities
• There is no definitive specification of what the system should ‘do’ and it is practically impossible to create such a specification
• The complexity of the system is such that it is not ‘understandable’ as a whole
• It is likely that, at all times, some parts of the system will not be fully operational
• Actors responsible for different parts of the system are likely to have conflicting goals

System failure

System dependability model
• System fault: a system characteristic that can (but need not) lead to a system error
• System error: an erroneous system state that can (but need not) lead to a system failure
• System failure: externally-observed, unexpected and undesirable system behaviour (the whole chain is illustrated in the sketch below)

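A minimal Python sketch of how the fault → error → failure chain plays out in code. The bed-counting function and its off-by-one bug are hypothetical, chosen to foreshadow the hospital example on the next slide.

```python
# Sketch of the fault -> error -> failure chain (hypothetical example).

def count_free_beds(beds):
    """Count unoccupied beds in a ward list.

    System fault: the loop skips the last bed (an off-by-one bug).
    The fault is latent; it need not cause an error on every call.
    """
    free = 0
    for i in range(len(beds) - 1):   # fault: should be range(len(beds))
        if beds[i] == "free":
            free += 1
    return free

# System error: an erroneous internal state. If the last bed happens
# to be free, the computed count is wrong; if it is occupied, the
# fault stays silent and no error occurs.
beds = ["occupied", "free", "free"]
reported = count_free_beds(beds)     # reports 1; the actual count is 2

# System failure: the erroneous state becomes externally observable
# and undesirable, e.g. admissions are told fewer beds are free.
actual = beds.count("free")
if reported != actual:
    print(f"failure observed: reported {reported}, actual {actual}")
```
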
A hospital system
• A hospital system is designed to maintain information about available beds for incoming patients and to provide information about the number of beds to the admissions unit.
• It is assumed that the hospital has a number of empty beds and this changes over time. The variable B reflects the number of empty beds known to the system (modelled in the sketch below).
• Sometimes the system reports that the number of empty beds is the actual number available; sometimes the system reports that fewer than the actual number are available.
• In circumstances where the system reports that an incorrect number of beds is available, is this a failure?

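A toy Python model of the reporting behaviour described above; the class name and the random ‘lag’ are assumptions for illustration, not part of the original system.

```python
import random

class BedSystem:
    """Toy model: the system's view (B) lags reality, so it sometimes
    reports the actual number of empty beds and sometimes fewer,
    but never more."""

    def __init__(self, actual_empty):
        self.actual_empty = actual_empty

    def reported_empty(self):
        # Under-reporting stands in for updates the system has not
        # yet received (discharges, transfers, cleaning delays).
        lag = random.randint(0, 2)
        return max(0, self.actual_empty - lag)

hospital = BedSystem(actual_empty=5)
B = hospital.reported_empty()
print(f"reported B = {B}, actual = {hospital.actual_empty}")
# Is B != actual a failure? As the bed management study below shows,
# users said no: what mattered was whether a patient could be admitted.
print("can admit patient:", B > 0)
```
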
What is failure?
• Technical, engineering view: a failure is ‘a deviation from a specification’.
• An oracle can examine a specification, observe a system’s behaviour and detect failures (see the sketch below).
• Failure is an absolute: the system has either failed or it hasn’t.

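The engineering view fits in a few lines: a hypothetical oracle that treats any deviation from the specification as a failure. The spec and observed values are invented for illustration.

```python
# Sketch of the engineering view: an oracle compares observed
# behaviour against a specification and gives a binary verdict.

def oracle(spec, observed):
    """Failure as an absolute: any deviation from spec is a failure."""
    return all(observed.get(key) == expected
               for key, expected in spec.items())

spec = {"reported_beds": 5}       # what the system 'should' do
observed = {"reported_beds": 4}   # what it actually did

print("conforms to spec:", oracle(spec, observed))   # False -> failure
# The following slides argue that this binary view is too simple:
# whether a deviation matters is a judgement made by an observer.
```
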
Bed management system
• The percentage of system users who considered the system’s incorrect reporting of the number of available beds to be a failure was 0%.
• Mostly, the number did not matter so long as it was greater than 1. What mattered was whether or not patients could be admitted to the hospital.
• When the hospital was very busy (available beds = 0), then people understood that it was practically impossible for the system to be accurate.
• They used other methods to find out whether or not a bed was available for an incoming patient.

Failure is a judgement
• Specifications are a gross simplification of reality for complex systems.
• Users don’t read and don’t care about specifications.
• Whether or not system behaviour should be considered to be a failure depends on the observer’s judgement.
• This judgement depends on:
  – The observer’s expectations
  – The observer’s knowledge and experience
  – The observer’s role
  – The observer’s context or situation
  – The observer’s authority

Failures are inevitable
• Technical reasons
  – When systems are composed of opaque and uncontrolled components, the behaviour of these components cannot be completely understood
  – Failures can often be considered to be failures in data rather than failures in behaviour
• Socio-technical reasons
  – Changing contexts of use mean that the judgement on what constitutes a failure changes as the effectiveness of the system in supporting work changes
  – Different stakeholders will interpret the same behaviour in different ways because of different interpretations of ‘the problem’

Conflict inevitability
• Impossible to establish a set of requirements where stakeholder conflicts are all resolved
• Therefore, successful operation of a system for one set of stakeholders will inevitably mean ‘failure’ for another set of stakeholders
• Groups of stakeholders in organisations are often in perennial conflict (e.g. managers and clinicians in a hospital). The support delivered by a system depends on the power held at any given time by a stakeholder group.

Normal failures
• ‘Failures’ are not just catastrophic events but normal, everyday system behaviour that disrupts normal work and means that people have to spend more time on a task than necessary
• A system failure occurs when a direct or indirect user of a system has to carry out extra work, over and above that normally required to carry out some task, in response to some inappropriate or unexpected system behaviour
• This extra work constitutes the cost of recovery from system failure

The Swiss Cheese model

Failure trajectories
• Failures rarely have a single cause. Generally, they arise because several events occur simultaneously, as in this data-loss trajectory (sketched in code below):
  – Loss of data in a critical system
    • User mistypes a command and instructs data to be deleted
    • System does not check and ask for confirmation of the destructive action
    • No backup of the data is available
• A failure trajectory is a sequence of undesirable events that coincide in time, usually initiated by some human action. It represents a failure in the defensive layers of the system.

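A sketch of the data-loss trajectory above, treating each defensive layer as a predicate over the event sequence; all names and structure are hypothetical.

```python
# A failure trajectory passes through the defences only if every
# layer has a 'hole' for this particular sequence of events.

def trajectory_reaches_failure(events, defences):
    return not any(defence(events) for defence in defences)

events = {
    "command": "delete *",     # initiating human action: a mistyped,
                               # destructive command
    "confirmed": False,        # the system never asked for confirmation
    "backup_exists": False,    # no backup of the data was available
}

defences = [
    lambda e: e["confirmed"],        # defence 1: confirmation prompt
    lambda e: e["backup_exists"],    # defence 2: backup and restore
]

print("data lost:", trajectory_reaches_failure(events, defences))  # True
# Restoring any single layer (a confirmation prompt or a backup)
# blocks the trajectory, even though the initiating mistype remains.
```
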
Vulnerabilities and defences
• Vulnerabilities
  – Faults in the (socio-technical) system which, if triggered by a human or technical error, can lead to system failure
  – e.g. a missing check on input validity
• Defences
  – System features that avoid, tolerate or recover from human error
  – e.g. type checking that disallows assignment of values of the wrong type (both defences are sketched in code below)
• When an adverse event happens, the key question is not ‘whose fault was it?’ but ‘why did the system defences fail?’

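Both defences named above, sketched with hypothetical names; note that Python's isinstance check is a runtime stand-in for the compile-time type checking the slide has in mind.

```python
def set_bed_count(value):
    """Guard a state change with the two defences from the slide."""
    # Defence 1: type check, rejecting values of the wrong type.
    if not isinstance(value, int):
        raise TypeError("bed count must be an integer")
    # Defence 2: input validity check; the vulnerability on the
    # slide is precisely this check being missing.
    if value < 0:
        raise ValueError("bed count cannot be negative")
    return value

# A human error (typing '1O' instead of '10') triggers a defence
# instead of silently corrupting the system state.
try:
    set_bed_count("1O")
except TypeError as err:
    print("defence caught the error:", err)
```
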
Reason’s Swiss Cheese Model

Active failures
• Active failures
  – Active failures are the unsafe acts committed by people who are in direct contact with the system, or failures in the system technology.
  – Active failures have a direct and usually short-lived effect on the integrity of the defences.
• Latent conditions
  – Fundamental vulnerabilities in one or more layers of the socio-technical system, such as system faults, system and process misfit, alarm overload, inadequate maintenance, etc.
  – Latent conditions may lie dormant within the system for many years before they combine with active failures and local triggers to create an accident opportunity.

Defensive layers
• Complex IT systems should have many defensive layers:
  – some are engineered: alarms, physical barriers, automatic shutdowns;
  – others rely on people: surgeons, anaesthetists, pilots, control room operators;
  – and others depend on procedures and administrative controls.
• In an ideal world, each defensive layer would be intact.
• In reality, they are more like slices of Swiss cheese, having many holes. Unlike in the cheese, these holes are continually opening, shutting, and shifting their location.

Dynamic vulnerabilities
• While some vulnerabilities are static (e.g. programming errors), others are dynamic and depend on the context where the system is used.
• For example (the second case is sketched in code below):
  – vulnerabilities may be related to human actions whose performance depends on workload, state of mind, etc. An operator may be distracted and forget to check something
  – vulnerabilities may depend on configuration: checks may depend on particular programs being up and running, so if program A is running in a system then a check may be made, but if program B is running, the check is not made

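A sketch of the configuration-dependent case: the same bad input is rejected under one configuration and silently accepted under another. The program names and the check itself are hypothetical.

```python
def process_reading(value, running_programs):
    """The validity check exists only when 'program_a' is deployed."""
    if "program_a" in running_programs:
        if value < 0:
            raise ValueError("negative reading rejected")
    # With program_b instead, the same input passes unchecked:
    # the vulnerability is dynamic, opened by the configuration.
    return value * 2

try:
    process_reading(-5, {"program_a"})       # defence present: rejected
except ValueError as err:
    print("config A:", err)

print("config B:", process_reading(-5, {"program_b"}))   # accepted: -10
```
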
Recovering from failure

Coping with failure
• People are good at coping with unexpected situations when things go wrong.
  – They can take the initiative, adopt responsibilities and, where necessary, break the rules or step outside the normal process of doing things.
  – People can prioritise and focus on the essence of a problem.

Recovery strategies
• Local knowledge
  – Who to call; who knows what; where things are
• Process reconfiguration
  – Doing things in a different way from that defined in the ‘standard’ process
  – Work-arounds, breaking the rules (safe violations)
• Redundancy and diversity
  – Maintaining copies of information in different forms from that maintained in a software system
  – Informal information annotation
  – Using multiple communication channels
• Trust
  – Relying on others to cope

Design for recovery
• Holistic systems engineering
  – Software systems design has to be seen as part of a wider process of socio-technical systems engineering
• We cannot build ‘correct’ systems
  – We must therefore design systems that allow the broader socio-technical system to recognise, diagnose and recover from failures
• Extend current systems to support recovery
• Develop recovery support systems as an integral part of systems of systems

Recovery strategy
• Designing for recovery is a holistic approach to system design and not (just) the identification of ‘recovery requirements’
• It should support the natural ability of people and organisations to cope with problems (one such design decision is sketched below):
  – Ensure that system design decisions do not increase the amount of recovery work required
  – Make system design decisions that make it easier to recover from problems (i.e. reduce the extra work required), through:
    • Earlier recognition of problems
    • Visibility, to make hypotheses easier to formulate
    • Flexibility, to support recovery actions

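One possible recovery-friendly design decision, sketched with hypothetical names: journalling destructive actions provides visibility (what happened, for formulating hypotheses) and flexibility (a cheap undo), reducing the extra work of recovery.

```python
class RecoverableStore:
    """Key-value store that journals destructive actions."""

    def __init__(self):
        self.data = {}
        self.journal = []      # visibility: a record of what happened

    def delete(self, key):
        # Record enough context to reverse the action later.
        self.journal.append(("delete", key, self.data.get(key)))
        self.data.pop(key, None)

    def undo_last(self):       # flexibility: a direct recovery action
        action, key, old_value = self.journal.pop()
        if action == "delete" and old_value is not None:
            self.data[key] = old_value

store = RecoverableStore()
store.data["ward-3"] = "4 beds free"
store.delete("ward-3")         # inappropriate or mistaken behaviour...
store.undo_last()              # ...recovered with minimal extra work
print(store.data)              # {'ward-3': '4 beds free'}
```
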
Key points
• Failures are inevitable in complex systems because multiple stakeholders see these systems in different ways and because there is no single manager of these systems
• Failures are a judgement – they are not absolute – but depend on the system observer
• The Swiss cheese model is a failure model based on active failures (trigger events) and latent conditions (system vulnerabilities)
• People have developed strategies for coping with failure, and systems should not be designed to make coping more difficult
