Designing Complex Systems for Recovery (LSCITS EngD 2011)

Design for Recovery, York EngD Programme, 2010

Slide 1

Design for recovery

Prof. Ian Sommerville


Slide 2

Objectives

•  To discuss the notion of ‘failure’ in software systems

•  To explain why this conventional notion of ‘failure’ is not
appropriate for many LSCITS

•  To propose an approach to failure management in LSCITS
based on recoverability rather than failure avoidance


Slide 3

Complex IT systems

•  Organisational systems that support different functions
within an organisation

•  Can usually be considered as systems of systems, i.e.
different parts are systems in their own right

•  Usually distributed and normally constructed by
integrating existing systems/components/services

•  Not subject to limitations derived from the laws of
physics (so, no natural constraints on their size)

•  Data intensive, with very long lifetime data

•  An integral part of wider socio-technical systems


Slide 4

What is failure?

•  From a reductionist perspective, a failure can be
considered to be ‘a deviation from a specification’.

•  An oracle can examine a specification and observe a
system’s behaviour and detect failures.

•  Failure is an absolute - the system has either failed or it
hasn’t

•  Of course, some failures are more serious than others; it
is widely accepted that failures with minor consequences
are to be expected and tolerated


Slide 5

A question to the audience

•  A hospital system is designed to maintain information about available
beds for incoming patients and to provide information about the
number of beds to the admissions unit.

•  It is assumed that the hospital has a number of empty beds and this
changes over time. The variable B reflects the number of empty beds
known to the system.

•  Sometimes the system reports that the number of empty beds is the
actual number available; sometimes the system reports that fewer
than the actual number are available.

•  In circumstances where the system reports that an incorrect number
of beds are available, is this a failure?


Slide 6

Bed management system

•  The percentage of system users who considered the
system’s incorrect reporting of the number of available
beds to be a failure was 0%.

•  Mostly, the number did not matter so long as it was
greater than 1. What mattered was whether or not
patients could be admitted to the hospital.

•  When the hospital was very busy (available beds = 0),
then people understood that it was practically impossible
for the system to be accurate.

•  They used other methods to find out whether or not a
bed was available for an incoming patient.
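
The contrast between the oracle view of failure and the users' view can be sketched in a few lines. This is only an illustration: both function names are hypothetical, and the second function encodes one possible reading of the users' judgement (the reported number matters only if it changes the answer to "can a patient be admitted?"), which is an assumption rather than something the users stated.

```python
def oracle_failure(reported_beds: int, actual_beds: int) -> bool:
    """Reductionist view: any deviation from the specification is a failure."""
    return reported_beds != actual_beds


def admission_affected(reported_beds: int, actual_beds: int) -> bool:
    """Assumed user view: the report only matters if it changes the
    answer to 'can this patient be admitted?'."""
    return (reported_beds > 0) != (actual_beds > 0)


# The system under-reports: 3 beds shown, 5 actually empty.
print(oracle_failure(3, 5))      # the specification is violated
print(admission_affected(3, 5))  # a patient can still be admitted
```

The same behaviour is classed differently by the two functions, which is the point: the classification depends on who is judging, not only on the specification.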


Slide 7

Failure is a judgement

•  Specifications are a simplification of reality.

•  Users don’t read and don’t care about specifications

•  Whether or not system behaviour should be considered to
be a failure depends on the judgement of an observer of that
behaviour

•  This judgement depends on:

•  The observer’s expectations

•  The observer’s knowledge and experience

•  The observer’s role

•  The observer’s context or situation

•  The observer’s authority


Slide 8

System failure

•  ‘Failures’ are not just catastrophic events but normal,
everyday system behaviour that disrupts normal work and
that means that people have to spend more time on a task
than necessary

•  A system failure occurs when a direct or indirect user of
a system has to carry out extra work, over and above
that normally required to carry out some task, in
response to some inappropriate system behaviour

•  This extra work constitutes the cost of recovery from
system failure


Slide 9

Failures are inevitable

•  Technical reasons

•  When systems are composed of opaque and uncontrolled components,
the behaviour of these components cannot be completely understood

•  Failures often can be considered to be failures in data rather than failures
in behaviour

•  Socio-technical reasons

•  Changing contexts of use mean that the judgement on what constitutes a
failure changes as the effectiveness of the system in supporting work
changes

•  Different stakeholders will interpret the same behaviour in different
ways because of different interpretations of ‘the problem’


Slide 10

Conflict inevitability

•  Impossible to establish a set of requirements where
stakeholder conflicts are all resolved

•  Therefore, successful operation of a system for one set of
stakeholders will inevitably mean ‘failure’ for another set
of stakeholders

•  Groups of stakeholders in organisations are often in
perennial conflict (e.g. managers and clinicians in a
hospital). The support delivered by a system depends on
the power held at some time by a stakeholder group.


Slide 11

Where are we?

•  Large-scale information systems are inevitably complex
systems

•  Such systems cannot be created using a reductionist
approach

•  Failures are a judgement and this may change over time

•  Failures are inevitable and cannot be engineered out of a
system


Slide 12

The way forward

•  Software design has to be seen as part of a wider process
of LSCITS engineering

•  We need to accept that technical system failures will
always occur and examine how we can design these
systems to allow the broader socio-technical systems, in
which these technical systems are used, to recognise,
diagnose and recover from these failures


Slide 13

Software dependability

•  A reductionist approach to software dependability takes the view that
software failures are a consequence of software faults

•  Techniques to improve dependability include

•  Fault avoidance

•  Fault detection

•  Fault tolerance

•  These approaches have taken us quite a long way in improving
software dependability. However, further progress is unlikely to be
achieved by further improvement of these techniques as they rely on
a reductionist view of failure.


Slide 14

Failure recovery

•  Recognition

•  Recognise that inappropriate behaviour has occurred

•  Hypothesis

•  Formulate an explanation for the unexpected behaviour

•  Recovery

•  Take steps to compensate for the problem that has arisen


Slide 15

Coping with failure

•  Socio-technical systems are remarkably robust because
people are good at coping with unexpected situations
when things go wrong.

•  We have the unique ability to apply previous experience from
different areas to unseen problems.

•  Individuals can take the initiative, adopt responsibilities and,
where necessary, break the rules or step outside the normal
process of doing things.

•  People can prioritise and focus on the essence of a problem


Slide 16

Recovering from failure

•  Local knowledge

•  Who to call; who knows what; where things are

•  Process reconfiguration

•  Doing things in a different way from that defined in the ‘standard’ process

•  Work-arounds, breaking the rules (safe violations)

•  Redundancy and diversity

•  Maintaining copies of information in different forms from that maintained
in a software system

•  Informal information annotation

•  Using multiple communication channels

•  Trust

•  Relying on others to cope


Slide 17

Design for recovery

•  The aim of a strategy of design for recovery is to:

•  Ensure that system design decisions do not increase the amount of
recovery work required

•  Make system design decisions that make it easier to recover from
problems (i.e. reduce extra work required)

•  Earlier recognition of problems

•  Visibility to make hypotheses easier to formulate

•  Flexibility to support recovery actions

•  Designing for recovery is a holistic approach to system design and not
(just) the identification of ‘recovery requirements’

•  Should support the natural ability of people and organisations to cope with
problems


Slide 18

Problems

•  Security and recoverability

•  Automation hiding

•  Process tyranny

•  Multi-organisational systems


Slide 19

Security and recoverability

•  There is an inherent tension between security and
recoverability

•  Recoverability

•  Relies on trusting operators of the system not to abuse privileges
that they may have been granted to help recover from problems

•  Security

•  Relies on mistrusting users and restricting access to information
on a ‘need to know’ basis


Slide 20

Automation hiding

•  A problem with automation is that information becomes subject to
organisational policies that restrict access to that information.

•  Even when access is not restricted, we don’t have any shared culture
in how to organise a large information store

•  Generally, authorisation models maintained by the system are based
on normal rather than exceptional operation.

•  When problems arise and/or when people are unavailable, breaking
the rules to solve these problems is made more difficult.


Slide 21

Process tyranny

•  Increasingly, there is a notion that ‘standard’ business
processes can be defined and embedded in systems that
support these processes

•  Implicitly or explicitly, the system enforces the use of the
‘standard’ process

•  But this assumes three things:

•  The standard process is always appropriate

•  The standard process has anticipated all possible failures

•  The system can respond in a timely way to process changes


Slide 22

Multi-organisational systems

•  Many rules enforced in different ways by different systems.

•  No single manager or owner of the system. Who do you call when
failures occur?

•  Information is distributed - users may not be aware of where
information is located, who owns information, etc.

•  Processes involve remote actors so process reconfiguration is more
difficult

•  Restricted information channels (e.g. help unavailable outside normal
business hours; no phone numbers published, etc.)

•  Lack of trust. Owners of components will blame other components
for system failure. Learning is inhibited and trust compromised.


Slide 23

Design guidelines

•  Local knowledge

•  Process reconfiguration

•  Redundancy and diversity


Slide 24

Local knowledge

•  Local knowledge includes knowledge of who does what,
how authority structures can be bypassed, what rules can
be broken, etc.

•  Impossible to replicate entirely in distributed systems but
some steps can be taken

•  Maintain information about the provenance of data

•  Who provided the data, where the data came from, when it
was created, edited, etc.

•  Maintain organisational models

•  Who is responsible for what, contact details
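
As a minimal sketch of the two guidelines above, the following keeps a provenance trail with each data item and a small organisational model alongside it. All names here (`TrackedValue`, `ProvenanceEvent`, `ORG_MODEL`, `who_to_call`) and the sample entries are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceEvent:
    action: str   # "created" or "edited"
    actor: str    # who provided or changed the data
    source: str   # where the data came from
    at: datetime  # when it happened


@dataclass
class TrackedValue:
    """A value that carries its own history."""
    value: object = None
    history: list = field(default_factory=list)

    def set(self, value, actor, source):
        # The first write is a creation; every later write is an edit.
        action = "edited" if self.history else "created"
        self.value = value
        self.history.append(
            ProvenanceEvent(action, actor, source, datetime.now(timezone.utc))
        )


# Organisational model: who is responsible for what, with contact details.
# The entry below is placeholder data.
ORG_MODEL = {
    "Ward 7 bed data": {"responsible": "ward manager", "contact": "placeholder"},
}


def who_to_call(topic):
    return ORG_MODEL.get(topic)


beds = TrackedValue()
beds.set(4, actor="ward clerk", source="Ward 7 whiteboard")
beds.set(3, actor="admissions nurse", source="phone call to Ward 7")
```

When the value looks wrong, `beds.history` shows who last touched it and where the figure came from, and `who_to_call` gives a starting point for finding the person responsible.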


Slide 25

Process reconfiguration

•  Make workflows explicit rather than embedding them in the software

•  Not just ‘continue’ buttons! Users should know where they are and
where they are supposed to go

•  Support workflow navigation/interruption/restart

•  Design systems with an ‘emergency mode’ where the system
changes from enforcing policies to auditing actions

•  This would allow the rules to be broken but the system would maintain a
log of what has been done and why so that subsequent investigations
could trace what happened

•  Support ‘Help, I’m in trouble!’ as well as ‘Help, I need information?’
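
The ‘emergency mode’ idea can be sketched as a gate that enforces permissions in normal operation and, in emergency mode, lets the action through while logging what was done and why. This is an illustrative sketch, not a prescribed design: `PolicyGate`, `PolicyViolation` and the example permissions are hypothetical, and requiring a non-empty reason for every override is an added assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


class PolicyViolation(Exception):
    """Raised when an action is blocked in normal (enforcing) mode."""


@dataclass
class PolicyGate:
    permissions: dict              # user -> set of permitted actions
    emergency_mode: bool = False
    audit_log: list = field(default_factory=list)

    def perform(self, user: str, action: str, reason: str = "") -> str:
        if action in self.permissions.get(user, set()):
            return "allowed"
        if not self.emergency_mode:
            # Normal operation: the policy is enforced.
            raise PolicyViolation(f"{user} is not permitted to {action}")
        if not reason:
            raise ValueError("an override needs a reason for later investigation")
        # Emergency operation: allow the action, but record what and why.
        self.audit_log.append({
            "user": user,
            "action": action,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return "allowed-under-audit"


gate = PolicyGate(permissions={"nurse": {"view_record"}})
gate.emergency_mode = True
result = gate.perform("nurse", "edit_prescription",
                      reason="doctor unreachable, patient in pain")
```

The rule can be broken, but the entry in `gate.audit_log` lets a subsequent investigation trace what happened.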


Slide 26

Redundancy and diversity

•  Maintaining a single ‘golden copy’ of data may be efficient but it may
not be effective or desirable

•  Encourage the creation of ‘shadow systems’ and provide import
and export from these systems

•  Allow schemas to be extended

•  Schemas for data are rarely designed for problem solving. Always
allow informal extension (a free text box) so that annotations,
explanations and additional information can be provided

•  Maintain organisational models

•  To allow for multi-channel communications when things go wrong
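
A minimal sketch of informal schema extension and of import/export for shadow systems follows. It assumes JSON as the interchange format and a hypothetical `BedRecord` type; neither comes from the slides.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BedRecord:
    ward: str
    empty_beds: int
    notes: list = field(default_factory=list)  # free-text annotations
    extra: dict = field(default_factory=dict)  # informal schema extension

    def annotate(self, author: str, text: str) -> None:
        self.notes.append({"author": author, "text": text})


def export_records(records) -> str:
    """Serialise records so a shadow system (e.g. a spreadsheet) can use them."""
    return json.dumps([asdict(r) for r in records], indent=2)


def import_records(payload: str) -> list:
    """Read records back, including any notes and extra fields."""
    return [BedRecord(**item) for item in json.loads(payload)]


record = BedRecord(ward="Ward 7", empty_beds=3)
record.annotate("admissions nurse", "One bed held for a transfer due at 14:00")
record.extra["isolation_rooms_free"] = 1

round_trip = import_records(export_records([record]))
```

The `notes` and `extra` fields survive the export and import, so explanations added during problem solving are not lost when data moves between the main system and a shadow copy.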


Slide 27

Summary

•  A reductionist approach to software engineering is no longer viable,
on its own, for complex systems engineering

•  Improving existing software engineering methods will help but will
not deal with the problems of complexity that are inherent in
distributed systems of systems

•  We must learn to live with normal, everyday failures

•  Design for recovery involves designing so that the work required to
recover from a failure is minimised

•  Recovery strategies include supporting information redundancy and
annotation and maintaining organisational models
