CrowdStrike's July 19th Outage: A Lesson in Third-Party Dependencies and Patch Management

On July 19th, 2024, a significant service disruption impacted CrowdStrike customers worldwide. While not a security breach, the incident was triggered by a faulty configuration update pushed to the Falcon sensor, causing widespread crashes on Windows hosts and significant disruption. The root cause was traced to an error within CrowdStrike's own systems, not a third-party vendor as initially assumed. However, the event highlights the critical need for robust internal processes around change management, testing, validation, and contingency planning. It also underscores the interconnectedness of modern software ecosystems, where even a seemingly minor update can have widespread repercussions.

Third-Party Dependencies and Patch Management

While CrowdStrike's outage wasn't directly caused by a third-party vendor, it reminds us of the risks associated with external dependencies. Many organisations rely on third-party software and services, which can introduce vulnerabilities and instability if not managed properly. Patch management becomes crucial in such scenarios: the disciplined, regular application of software updates to address security vulnerabilities and bugs without introducing new problems in the process.

When a third-party vendor releases a patch, organisations need to assess its potential impact on their systems, thoroughly test it in a controlled environment, and then deploy it in phases to minimise disruption. Effective vendor management involves establishing clear communication channels with vendors, understanding their patch release cycles, and having contingency plans in place should a vendor's patch cause issues.
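As a rough sketch of what such a phased, ring-based deployment can look like in practice, the snippet below walks a vendor patch through a pilot ring, an early-adopter ring, and production, halting at the first failed health check. The ring names and the apply_patch and health_check functions are hypothetical placeholders for an organisation's own tooling, not any specific vendor's mechanism.

```python
# A minimal sketch of a phased (ring-based) vendor patch rollout.
# All names below (rings, hosts, apply_patch, health_check) are hypothetical
# placeholders for an organisation's own deployment and monitoring tooling.

ROLLOUT_RINGS = [
    {"name": "pilot", "hosts": ["test-vm-01", "test-vm-02"]},            # controlled test environment
    {"name": "early-adopters", "hosts": ["it-laptop-01", "it-laptop-02"]},
    {"name": "production", "hosts": ["app-server-01", "app-server-02"]},
]


def apply_patch(host: str, patch_id: str) -> None:
    """Placeholder for whatever mechanism actually installs the vendor patch."""
    print(f"Applying {patch_id} to {host}")


def health_check(host: str) -> bool:
    """Placeholder health probe: service up, no crash loops, key metrics normal."""
    return True


def phased_rollout(patch_id: str) -> bool:
    """Deploy a vendor patch one ring at a time, stopping at the first failure."""
    for ring in ROLLOUT_RINGS:
        print(f"--- Ring: {ring['name']} ---")
        for host in ring["hosts"]:
            apply_patch(host, patch_id)
        # In practice the ring would 'soak' for an agreed period before being judged healthy.
        if not all(health_check(h) for h in ring["hosts"]):
            print(f"Health checks failed in ring {ring['name']}; halting rollout.")
            return False
    print("Patch rolled out to all rings.")
    return True


if __name__ == "__main__":
    phased_rollout("vendor-patch-2024-07")
```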

Preventing Future Outages: Lessons Learned

  1. Rigorous Testing and Validation: Comprehensive testing procedures are essential for all software updates, whether developed internally or sourced from third parties. Thorough testing in controlled environments can help identify and rectify potential issues before they affect production systems.
  2. Phased Rollouts and Canary Deployments: Implementing updates in phases, or using canary deployments (where updates are rolled out to a small subset of users first), allows problems to be detected early and limits the scope of impact when something unforeseen does go wrong, minimising disruption to users. A minimal sketch of this pattern, combined with the rollback mechanism in point 3, follows this list.
  3. Robust Rollback Mechanisms: The ability to quickly revert to a previous stable state is crucial when an update causes unexpected problems. Having well-defined rollback procedures can help minimise downtime and ensure service continuity.
  4. Change Management and Approval Processes: Implementing rigorous change management processes ensures that all system changes are carefully reviewed, tested, and approved before deployment. This helps mitigate the risk of unintended consequences and disruptions.
  5. Communication and Transparency: Timely and transparent communication with customers during an outage is essential. CrowdStrike acknowledged the issue quickly and provided regular updates, which helped maintain trust and manage expectations.
  6. Continuous Improvement: Treating every incident as a learning opportunity is key to preventing similar issues in the future. Conducting a thorough post-mortem analysis and implementing corrective actions can strengthen processes and enhance overall system resilience.
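As a rough illustration of points 2 and 3 above, the sketch below shows a canary release that pushes an update to a small percentage of a fleet first and automatically reverts to the last known-good version if the observed error rate crosses a threshold. The version names, canary percentage, threshold, and telemetry function are illustrative assumptions, not a description of CrowdStrike's actual deployment tooling.

```python
# A minimal sketch of a canary deployment with automatic rollback.
# Version identifiers, the canary percentage, the error-rate threshold and the
# telemetry source are illustrative assumptions only.

import random

CANARY_PERCENT = 5           # roll the update out to ~5% of hosts first
ERROR_RATE_THRESHOLD = 0.02  # roll back if more than 2% of canary hosts report errors


def deploy(version: str, hosts: list[str]) -> None:
    print(f"Deploying {version} to {len(hosts)} host(s)")


def observed_error_rate(hosts: list[str]) -> float:
    """Placeholder for real telemetry (crash reports, failed health probes, etc.)."""
    return random.uniform(0.0, 0.05)


def canary_release(new_version: str, stable_version: str, hosts: list[str]) -> bool:
    canary_count = max(1, len(hosts) * CANARY_PERCENT // 100)
    canary_hosts, remaining_hosts = hosts[:canary_count], hosts[canary_count:]

    deploy(new_version, canary_hosts)

    error_rate = observed_error_rate(canary_hosts)
    if error_rate > ERROR_RATE_THRESHOLD:
        # Robust rollback: revert the canary hosts to the last known-good version.
        print(f"Error rate {error_rate:.1%} exceeds threshold; rolling back.")
        deploy(stable_version, canary_hosts)
        return False

    # Canary looks healthy: continue the phased rollout to the remaining hosts.
    deploy(new_version, remaining_hosts)
    return True


if __name__ == "__main__":
    fleet = [f"host-{i:03d}" for i in range(100)]
    canary_release("agent-7.2", "agent-7.1", fleet)
```

The key design choice is that the rollback path is defined before the rollout begins: the last known-good version is an explicit input, so reverting is a routine operation rather than an improvised one.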

Conclusion

The CrowdStrike outage on July 19th, 2024, serves as a valuable reminder of the fragility of modern software systems and the potential for even minor errors to have widespread consequences. By prioritising robust testing, phased rollouts, effective rollback mechanisms, transparent communication, continuous improvement, and vigilant vendor management, organisations can minimise the risk of similar incidents and ensure the reliability of their services.

 

Ben Sugden

Head of Technology | InfoSec & Governance (Certified ISO 27001 Practitioner) | Enterprise Architecture & Agile Change | Open Group Certified (TOGAF & ArchiMate) | Driving Digital Transformation at Abzorb


Well said Ossama. Whilst smaller organisations, with a smaller IT budget, may struggle to implement all of those lessons in full, the spirit of those lessons can be maintained. Keeping a sensible eye and firm hand on vendor patches, and ensuring risks are assessed for business-critical devices, should help in that regard. Your point on learning from incidents is central to our own incident management processes; it is the key to continual improvement.


Good advice, Ossama. I haven't had the chance to understand how MS Windows is integrated with CrowdStrike's software, but as an OS it shouldn't just fall on its face when a dependency or consumed software or service is faulty.

Muhammad Aamir

IT Team Lead in Higher Education (King's College London)


Great read Ossama. Keep up the good work!

Andrzej. Bubez.

Next Challenge Wanted! BAU Manager | IT Support Strategist | Change & Incident Management Specialist | Let’s Talk!


Because these days everyone wants "cloud" solutions!

