TxB/4 Disaster Recovery & Business Continuity. A Ten Times Better (TxB)© Approach.
Fire at Hayes Sub-Station near Heathrow, 21st March 2025


This is a special posting in my Ten Times Better (TxB) series. It follows from the surprising news on Friday and Saturday this week regarding the Heathrow power outage. It’s a bit longer than normal, as it is an immediate response, and to repeat the old adage – I didn’t have time to write a shorter note.

I must say two things at the outset:-

1) The impacts of the outage should not have happened in my view, and were preventable. This is especially the case for what appears to be a single event (the sub-station outage at Hayes) and considering the critical nature of Heathrow as the world-class hub for the UK. The personal, corporate, and commercial impact on passengers, goods and trade, and reputation, even for one day of outage, along with the costs of recovery and any remediation work, will be very high indeed.

2) This is not actually simply about Heathrow - I am not an airport infrastructure expert. However, I have taken the incident and applied the basic concepts of disaster recovery and business continuity for IT infrastructure, which I hope will be largely applicable to all infrastructure installations. There may be some specific areas which will differ – but the observations below are the absolute basics to get right for such a critical system. I am aiming to provide some guidance on design, planning, backup systems, testing, risk assessment, commercial arrangements and responsive execution to prevent and mitigate large-scale power outages and their impact.

I am sure that many, if not all, of the items below are actually implemented at Heathrow, but, notwithstanding that, the event did happen, which points to a weakness or failing somewhere in the broader system, implementation or operation. I am not seeking to apportion blame, but this event should be recognised. The impact has been pretty dramatic, and recovery/remediation and potential compensation costs are possibly set to run into £millions.

So here goes …

1. Risk Assessment and Business Impact Analysis

If you have read my TxB 002© Risk, Compliance and Controls posting, you will know that I am a firm believer that business priorities should be driven by risk assessment, leadership and management – not just something to be done to tick a box and then put in the top drawer whilst the “real” work is done.

This is especially true for a business operation such as Heathrow that is so important and critical to UK plc in so many ways – including trade, freight, business, the smooth flow of politics and diplomacy, and of course the all-important business of holidays.

The importance of risk, compliance and controls needs to be permanently, constantly and dynamically woven through the entire operation, and needs to drive business, investment and operational decisions. People need to be trained in risk identification, mitigation, and the execution of effective strategies to avoid risks, including, importantly, understanding the real level of residual risk that the business can truthfully bear. This is essential, and people really need to be cognisant that this is real and practical and not an abstract concept.

Heathrow is so critical to the UK that this risk management process should also be overseen or at least checked and challenged by an external independent body such as a regulator or perhaps another large-scale airport operator.

I am reasonably confident in saying that I believe this outage would not have happened with a pure public operator such as, for instance, TfL, and believe that an independent review should be conducted as to whether the airport operator, which is essentially a regulated monopoly, has sufficient checks and challenges to enable optimal decisions to arise.

One of my old bosses – George Muir, who was Director General of ATOC (the Association of Train Operating Companies) – used to say we should “over-cure” critical problems such as those affecting passenger safety. In my view, he was right. These are areas where a different lens of risk versus cost should be applied.

Critical Systems Identification: A full assessment needs to identify the essential systems. For an airport this might include air traffic control, runway lighting, terminal operations, and IT networks and security. For this list a robust and appropriate design needs to be produced and implemented – these systems must be able to operate without interruption at all times. Life and limb are at stake here.

Interdependency Mapping: An analysis of how failures in one system may cascade into others and impact overall operations, along with how to prevent and staunch this cascading, how to zone systems and introduce “firebreaks”, how to self-heal/recover, how to load balance, and how to provide early warnings of failure.

Scenario Planning: A full range of scenarios should be investigated, involving all stakeholders and in particular people who really know the business and have been there and done it, along with some independent risk experts. You should bring in these outside experts to test and challenge the risks, the risk framework and the understanding of risk in the organisation, to ensure “it could never happen here” and other unhelpful orthodoxies cannot prevail. Initially nothing should be dismissed – this really is “what if an asteroid landed on Heathrow” stuff – then an assessment can be made of risk probability versus impact, versus the costs of mitigating (a simple scoring sketch follows below). Evaluate scenarios like grid failures, cyberattacks, sabotage and other black swans, estimating operational, financial, compensatory, business operations losses, remediation and reputational impacts.
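To make the probability-versus-impact assessment concrete, here is a minimal, illustrative risk-register scoring sketch in Python. The scenario names, scores and mitigation costs are invented for illustration only (not Heathrow data), and any real assessment would use a far richer model with calibrated inputs.

```python
# Minimal, illustrative risk-register scoring sketch (example figures only).
# Each scenario gets a likelihood and impact score (1 = lowest, 5 = highest);
# exposure = likelihood x impact, and scenarios are ranked so mitigation spend
# can be weighed against the exposure it would remove.

from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    likelihood: int          # 1-5, chance of the event occurring
    impact: int              # 1-5, severity if it does occur
    mitigation_cost: float   # estimated cost (GBP) to mitigate

    @property
    def exposure(self) -> int:
        return self.likelihood * self.impact

scenarios = [
    Scenario("Single sub-station outage", likelihood=3, impact=5, mitigation_cost=25_000_000),
    Scenario("Cyberattack on power management", likelihood=3, impact=4, mitigation_cost=5_000_000),
    Scenario("Fuel supply failure for generators", likelihood=2, impact=4, mitigation_cost=1_000_000),
]

# Rank by exposure so the highest-risk scenarios are examined first.
for s in sorted(scenarios, key=lambda s: s.exposure, reverse=True):
    print(f"{s.name}: exposure={s.exposure}, mitigation cost=£{s.mitigation_cost:,.0f}")
```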

2. Backup Power Systems and Redundancy

Multiple Power Feeds: Establish connections to at least two independent grids to avoid single points of failure. These also need to come in via independent, totally geographically separated routes, and ideally need to be from two separate power supply and distribution companies to offset the risk of a power company failing to supply. There is absolutely no point in saying that the power company systems are so well-designed with multiple redundant systems that failure will never occur – it will, and planning needs to recognise this. Each individual power feed needs to provide over-capacity – i.e. the total power demand of the infrastructure should sit well within the design capacity of each supplying sub-station, taking into account an appropriate view of the possible failures and reduced capacity that may be evident from time to time in normal operations.
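As a simple illustration of the over-capacity principle, the sketch below checks whether each feed on its own could carry the site’s peak demand with headroom to spare. The capacities, demand figure and headroom margin are assumed example values, not real figures for any airport.

```python
# Illustrative sizing check: every individual feed should be able to carry
# the site's full peak demand on its own, with a safety margin on top.
# All figures are made-up examples.

def each_feed_sufficient(feed_capacities_mw, peak_demand_mw, headroom=0.2):
    """True if every individual feed can carry peak demand alone,
    with the given fractional headroom on top."""
    return all(cap >= peak_demand_mw * (1 + headroom) for cap in feed_capacities_mw)

feeds = [70.0, 70.0]   # MW rating of each independent feed (example values)
peak = 45.0            # MW site peak demand (example value)
print(each_feed_sufficient(feeds, peak))  # True: 70 MW >= 45 MW * 1.2 = 54 MW
```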

Primary Backup Sources: Fully integrated UPS (Uninterruptible Power Supply) systems need to be implemented, with sufficient capacity to provide immediate power to the identified critical systems and bridge the gap until generators or other more permanent power supplies are brought into service. If diesel or gas generators are in use, these need to be cycled regularly, and in addition a guaranteed supply or reservoir of fuel from at least two independent suppliers needs to be agreed. Once these UPS and back-up generator solutions have engaged, buying some operational time at least for critical systems, a rapid assessment needs to take place to switch over to permanent alternatives, if this has not already happened automatically through some form of “hot”/automated transfer or through re-balancing of load amongst the sub-station power supplies, depending on the design and commercial arrangements. What cannot happen in such critical infrastructure is that these transfers to alternative power rely on “cold” standby systems, or untested systems, or indeed no alternatives being available.
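The ordering described above (UPS carries the critical load instantly, generators must pick up the load well inside the UPS autonomy window, otherwise escalate) can be sketched as a simple control sequence. The function names, callbacks and timings below are hypothetical placeholders for illustration, not a real control-system API.

```python
# Sketch of the failover ordering: mains lost -> UPS bridges immediately ->
# generators started -> load transferred before the UPS runs out, else escalate.
# start_generator, transfer_load_to and alert_cmt are placeholder callables.

import time

UPS_BRIDGE_SECONDS = 600      # assumed UPS autonomy for the critical load
TRANSFER_RETRY_SECONDS = 5    # how often to re-attempt the generator transfer

def on_mains_failure(start_generator, transfer_load_to, alert_cmt):
    """Illustrative failover sequence for the critical load."""
    alert_cmt("Mains lost - critical load now on UPS")
    started_at = time.monotonic()
    start_generator()
    while time.monotonic() - started_at < UPS_BRIDGE_SECONDS:
        if transfer_load_to("generator"):   # returns True once the load is carried
            alert_cmt("Critical load transferred to generator")
            return True
        time.sleep(TRANSFER_RETRY_SECONDS)
    alert_cmt("Generator transfer failed before UPS exhaustion - invoke manual plan")
    return False

# Example invocation with trivial stubs (a real system would wire in actual controls):
if __name__ == "__main__":
    on_mains_failure(start_generator=lambda: None,
                     transfer_load_to=lambda target: True,
                     alert_cmt=print)
```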

Business Partnering: Appropriate arrangements should be in place for partnering with power providers, and they need to be part of the overall system and of airport management and leadership. They need to understand and demonstrate their power system design right down to a component/failure level, and demonstrate not just resilience and reliability but how they will maintain their systems and respond and react to catastrophic failure to provide seamless, uninterrupted power.

3. Zoning and Load Prioritisation

Critical Zones Isolation: Separate power circuits need to be provided for the various zones, including, as a minimum, air traffic control, emergency services and runway lighting, security and security access, and perhaps even on a terminal-by-terminal basis, so that critical and emergency/safety systems can continue to operate and perhaps even some partial operations at individual terminals can continue.

Tiered Load Management and Automated Re-allocation: Implement automated load management, load sharing and load shedding systems and protocols, ensuring that there is sufficient capacity in lower-tier sub-systems to take on the full load/failed load from other sub-systems. It goes without saying that there needs to be a suitable degree of geographic separation – resilience, such as component resilience, within the same building or location, whilst good design for low-grade or localised failure, only gives limited resilience to wider or more catastrophic events. In addition, a tiering system needs to be in place to automatically maintain essential operations while deactivating lower-priority systems, and to be able to re-allocate power capacity to systems as the outage progresses and recovery commences.
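A minimal sketch of tiered load shedding follows: given the capacity still available, keep loads in priority order and shed the rest. The tiers, load names and megawatt figures are invented examples.

```python
# Illustrative tiered load-shedding: highest-priority (tier 1) loads are kept
# first, and anything that no longer fits within the available capacity is shed.

loads = [
    ("Air traffic control",     1, 3.0),    # (name, tier, demand in MW)
    ("Runway lighting",         1, 2.0),
    ("Security & access",       2, 4.0),
    ("Terminal 5 partial ops",  3, 10.0),
    ("Retail & advertising",    4, 6.0),
]

def shed_to_capacity(loads, available_mw):
    """Return (kept, shed) lists, keeping highest-priority loads first."""
    kept, shed, used = [], [], 0.0
    for name, tier, demand in sorted(loads, key=lambda l: l[1]):
        if used + demand <= available_mw:
            kept.append(name)
            used += demand
        else:
            shed.append(name)
    return kept, shed

kept, shed = shed_to_capacity(loads, available_mw=12.0)
print("Keep:", kept)   # the tier 1 and tier 2 loads fit within 12 MW
print("Shed:", shed)
```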

Independent and Geographically Separate Power Feeds: Ensure isolated power supplies and, if possible, at least two independent power suppliers, with full load capability for the whole operation or as a very minimum for high-risk areas to reduce cascading effects.

4. Testing, Maintenance, and Drills

When I was on the railway we had one system which we all agreed would be a very big deal if it ever failed and we could not continue operations. We had a very expensive automated hot standby system to accommodate critical events, but even though we all agreed we needed to test the invocation to ensure that our “insurance policy” would actually work, due to the nature of the operation of the railway we could never gain approval to invoke the tested cutover to the standby system, i.e. metaphorically “pull the plug” on the live system. When the failure came, it was a big deal: the standby system only partially worked, and this created some real problems for us. Testing and re-testing periodically, under actual expected load conditions, is absolutely essential for large-scale infrastructure installations.

Periodic Scheduled Testing: This needs to be planned to take place on a regular basis consistent with the risk of occurrence and impact, and whilst it should be designed to minimise disruption, it needs to take place at a sub-system and overall system level, to stress test possible scenarios at anticipated load volumes, to prevent outages like the one I experienced whilst on the railway, and possibly the one at Heathrow yesterday.

Full-Scale Crisis Simulations and Blackout Drills: Organise annual exercises involving all stakeholders, from airline operators and airport operations to business partners, suppliers and emergency responders. These should again cover all scenarios, including total blackout drills, and should be run in full dark mode as though the incident has actually happened, not just as a paper exercise. A fully open and “no-fault” findings reporting system should be positively encouraged, and each and every finding should be acted upon to close it down and improve the overall resilience and response to actual and potential future events.

Preventive Maintenance: Regularly inspect and service UPS units, back-up generators, and other battery systems to ensure readiness. Make sure that power service providers and partners provide comprehensive records of their own maintenance and fixes, consistent with any procurement or commercial arrangements that exist, and act on any deviations. Make sure that the preventive maintenance systems and processes are updated to reflect the state of any equipment over time – especially those which degrade over time.

Data-Driven Analysis and Improvements: Utilise AI, SCADA and predictive analytics to detect and address potential faults before failures occur. Stochastic and heuristic assessments can detect early signs of equipment failure by monitoring outputs, frequencies, electrical quality, and operational modes. This enables timely fault isolation and fixes.
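As a very simple illustration of this kind of early-warning heuristic, the sketch below flags a sensor reading that drifts well outside its recent baseline using a rolling z-score. Real SCADA and AI tooling is far more sophisticated; the threshold and readings here are assumptions for illustration only.

```python
# Simple heuristic early-warning check: flag a reading (e.g. transformer
# temperature or supply frequency) that sits far outside its recent baseline.

from statistics import mean, stdev

def is_anomalous(history, latest, z_threshold=3.0):
    """True if the latest reading is more than z_threshold standard
    deviations away from the mean of the recent history."""
    if len(history) < 10 or stdev(history) == 0:
        return False  # not enough baseline data to judge
    z = abs(latest - mean(history)) / stdev(history)
    return z > z_threshold

temps = [62.1, 61.8, 62.4, 62.0, 61.9, 62.2, 62.3, 61.7, 62.1, 62.0]
print(is_anomalous(temps, 62.3))  # False - within normal variation
print(is_anomalous(temps, 71.5))  # True - investigate before it becomes a failure
```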

5. Power Infrastructure Design and Smart Grid Integration

Ring-Fenced Redundant Architecture: It is absolutely essential that the design incorporates redundant power feed lines and grids, which should be geographically separated and, if possible, using separate power providers. This should also incorporate micro-grids and zoning of systems to enhance system resiliency.

Fault-Tolerant Engineering: Systems need to be designed and built with redundancy and with over-capacity for load. Further, the systems themselves need to detect and isolate faults and re-route power/self-heal to maintain uninterrupted service, and to provide sufficient quality output via AI, SCADA or similar to give timely, informative and intuitive information so that further preventive or fix actions can be taken to minimise follow-on disruption. In all cases any shutdowns should also be designed to be “graceful” and not give “dirty” output which may further damage sub-systems.

Alternative Solutions: For really critical infrastructure consider designing and running small micro-grid “self” power stations to diversify and reduce the reliance on external power suppliers.

6. Commercial Arrangements and Emergency Fuel Supply

Power Purchase Agreements: Secure contracts with multiple power providers to ensure continuous supply during extraordinary or peak demand, or more specifically during an outage or other event (e.g. a strike) – and ensure that these alternatives can be invoked rapidly if need be.

Fuel Contracts: Establish long-term emergency fuel supply agreements to guarantee availability during extended outages, and maintain an adequate on-site reserve of fuel, plus delivery arrangements, to run the generators at full load until a further delivery can be scheduled and fully delivered.
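A back-of-envelope check of whether an on-site fuel reserve is adequate might look like the sketch below. All of the figures (burn rate, generator count, tank size, delivery lead time) are assumed illustrative values, not real contract numbers.

```python
# Back-of-envelope fuel reserve check: will the on-site tank run the
# generators at full load until a contracted re-supply can arrive?
# All figures are illustrative assumptions.

generator_burn_lph = 400        # litres per hour at full load (assumed)
number_of_generators = 6        # generators running simultaneously (assumed)
tank_capacity_litres = 150_000  # on-site reserve (assumed)
resupply_lead_time_h = 48       # contracted delivery lead time (assumed)

hours_of_runtime = tank_capacity_litres / (generator_burn_lph * number_of_generators)
print(f"On-site fuel lasts {hours_of_runtime:.1f} h; re-supply takes {resupply_lead_time_h} h")
print("Reserve sufficient" if hours_of_runtime > resupply_lead_time_h else "Reserve too small")
```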

7. Immediate Outage Response and Crisis Management

Automated Failover Activation: It is absolutely essential that critical systems, and as many other systems as commercially viable, are configured to instantly transition to back-up power sources with minimal human intervention. This back-up needs to be smooth and quick, and must not introduce damaging or “dirty” power. It should not be left to manual processes and, in my view, should not be operated on a “cold” standby basis – as a minimum it should be hot standby, and even better if a form of load sharing is undertaken as a matter of course, so that transfer to a smaller sub-set of systems in the event of failure can be smoother.

Crisis Management Team (CMT): It is critical that you have a fully documented and fully trained crisis management team, with clear roles and communications responsibilities, which not only can be mobilised efficiently and effectively, but is ready and willing to step up and lead in the event of an outage: to restore power in a stepped, hierarchical and agreed priority order, as smoothly and quickly as possible, and to gather learnings on the events leading up to and including the outage for future improvements. It goes without saying that this crisis management team needs to have trained and drilled its response, and be committed to fully open communications – so that it is not the first time the team has been through it when an actual event occurs.

Communication Protocols: Deploy pre-scripted public announcements and other forms of communications (e.g. SMS, or an agreed app) for staff and passengers, ensuring timely information dissemination. Make sure these communications are sent out on an agreed regular and periodic basis to give confidence that the crisis management is proceeding as planned.

8. Technology Integration and Real-Time Monitoring

SCADA and AI Systems: Implement advanced monitoring for early detection of anomalies and predictive warning of upcoming or potential issues in power systems.

Predictive Maintenance Tools: Use analytics to predict potential failures and schedule proactive repairs.

IoT Sensors: Deploy sensors across critical infrastructure to provide continuous feedback on system health.

Dashboard Integration: Centralise monitoring data for real-time and automated decision-making during both normal operations and emergencies, and provide simple, actionable dashboards to relevant personnel, with well-designed alarm systems that alert personnel early and fast and are not ignored.
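As a minimal illustration of centralised alarm evaluation over sensor feeds, the sketch below compares readings against configured bands and raises named alarms for the dashboard. The sensor names and limits are hypothetical examples.

```python
# Minimal centralised alarm check: each reading is compared against its
# configured band, and out-of-band readings raise a named alarm message.

ALARM_LIMITS = {
    "substation_A_voltage_kv": (9.8, 11.2),
    "ups_room_temp_c":         (10.0, 35.0),
    "generator_fuel_pct":      (40.0, 100.0),
}

def evaluate_alarms(readings):
    """Return a list of alarm messages for readings outside their limits."""
    alarms = []
    for sensor, value in readings.items():
        low, high = ALARM_LIMITS.get(sensor, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            alarms.append(f"ALARM: {sensor} = {value} (expected {low}-{high})")
    return alarms

print(evaluate_alarms({"substation_A_voltage_kv": 0.0, "generator_fuel_pct": 85.0}))
```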

9. Training, Communication, and Human Factors

Staff Training Programs: Regularly train operational staff on what to look for and what to expect in the period leading up to an outage, the crisis response and the role of the crisis management team, emergency protocols, communications protocols, system resets, and manual override procedures.

Simulation-Based Learning: Use virtual, paper and live drills to familiarise teams with outage scenarios and response strategies. You need to really design these to stress the individuals involved and make them work so that the learning is remembered. Use unusual scenarios that make people think and really lay down muscle memory to respond to events.

Clear Communication Channels: Establish dedicated communication lines among departments to ensure swift information flow. Make it clear who communicates with whom and when, to prevent management distraction and to make sure that there are no conflicting messages, so that the focus remains clearly on fixing the problems caused by the event. Above all – be open and non-judgemental and dismiss nothing.

10. Security, Cyber Resilience, and Regulatory Compliance

My next TxB post is on Cyber resilience, so I will not spend too much time on this point here, but in general:-

Cybersecurity Measures: Strengthen, fortify and monitor digital and IT control systems against cyber threats that could compromise power management, and keep these up to date and current with ongoing protective monitoring and threat tests on a zero-trust basis.

Physical Security: Physical security, ingress and egress must be access-controlled and sufficient to minimise and reduce any threat. In particular, perimeter fencing should be high enough and robust enough, with sufficient “clean/sterile” space between fences, to provide a further level of protection. I would also have thought that near an airport provision should be made for damage from above, especially with easy access to drones and other threats. Enhanced surveillance and patrols should also be deployed.

Continuous Audits: Regularly review security protocols and infrastructure resilience through third-party audits against best practice or regulatory guidelines, to ensure ongoing compliance and improvement. Implement any agreed recommendations as soon as possible. A fully independent check and challenge regime, which is empowered to really probe and test the provisions and protections against a disaster occurrence, is really important.

Conclusion

By integrating these 10 critical points – from rigorous risk assessment, robust design and backup systems, to advanced monitoring, comprehensive training and stringent security measures – an infrastructure operator, including an airport operator, can achieve a high level of resilience against large-scale power outages. Implementing these best practices ensures that operations remain safe, efficient, and reliable even in the face of unforeseen disruptions, ultimately keeping the airport “flying high” and “powered-up” no matter what challenges arise.

 
