Amazon AWS Disruption
Suhas A. Kelkar, Director, Incubator Team
May 2nd, 2011 (slide footer: 5/6/2011)

What Happened…
Easter Weekend AWS Disruption… Affected region

Amazon Basics
Image Credit: DongWoo Lee from blog.edog.net

AWS Basics
Regions
  • Geographically five regions: US East (N. Virginia), US West (N. California), EU (Ireland), APAC (Singapore) and APAC (Tokyo)
  • AWS EC2 SLA: 99.95% availability for each region
  • Each region contains one or more Availability Zones
Availability Zones
  • Distinct locations engineered to be insulated from failures in other Availability Zones in the same region
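
To put the 99.95% figure in perspective, the short sketch below converts an availability percentage into the downtime it still permits. The function name and the 365-day measurement period are assumptions made for illustration; the slides do not say how the SLA period is measured.

```python
def allowed_downtime_hours(availability_pct: float, period_hours: float = 365 * 24) -> float:
    """Return the downtime (in hours) that an availability target still permits.

    availability_pct: target such as 99.95 (percent).
    period_hours: length of the measurement period; a 365-day year is assumed here.
    """
    return (1 - availability_pct / 100) * period_hours


if __name__ == "__main__":
    yearly = allowed_downtime_hours(99.95)
    print(f"99.95% availability permits about {yearly:.2f} hours of downtime per year")
```

Under these assumptions a 99.95% target leaves roughly 4.38 hours of permitted downtime per year.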


Amazon cloud failure

  • 10. EBS (Elastic Block Store)
  • 12. One can create a file system on top of EBS volumes
  • 13. EBS volumes are kept in a single Availability Zone and can be attached only to instances in that same zone
  • 14. Volumes are automatically replicated within the same Availability Zone
  • 15. Control plane services coordinate user requests and propagate them to EBS clusters
  • [Figure: EBS Architecture. Regions 1..5; control plane services; Availability Zones 1..n; EBS Cluster 1 to EBS Cluster n, each with Node 1, Node 2, … Node n connected by a primary high-bandwidth network and a secondary low-bandwidth replication network.]
  • 16. Amazon Cloud Disruption Post Mortem: The Trigger
  • 17. Traffic was incorrectly shifted onto the lower-capacity EBS network
  • 18. Many nodes in the affected AZ became completely isolated and lost connection to their replicas
  • 20. After the incorrect traffic shift was rolled back, the previously isolated nodes began searching the EBS cluster for available server space so they could re-mirror data
  • 21. The free capacity of the cluster was soon exhausted, leaving many nodes stuck in a loop searching for free space
  • 22. This led to a re-mirroring storm where a large number of volumes were effectively “stuck” while the nodes searched on and on
  • 23. Why did this affect other AZs?
  • 24. The EBS cluster became unable to service “create volume” API requests
  • 25. This caused thread starvation in the EBS control plane, affecting service to other AZs
  • Amazon Cloud Disruption Post Mortem (contd.): The 2 major factors at the root of this problem
  • 26. Nodes failing to find new nodes did not back off aggressively enough
  • 27. Race condition in the code (bug) that caused nodes to fail incorrectly when they were closing a large number of replication requests.
  • Failure walk-through [Figure: control plane services and their threads above Availability Zone 1, which holds EBS Cluster 1 (Node 1, Node 2, … Node n) on a primary high-bandwidth network and a secondary low-bandwidth replication network.]
      1. Traffic usually flows through the primary high-bandwidth network
      2. Traffic incorrectly shifted (manual error) to the low-bandwidth secondary network
      3. This caused congestion in the secondary network
      4. Nodes assumed their replica destination had failed
      5. Mistake quickly realized and traffic shift rolled back
      6. Re-mirroring storm (due to previous node isolation)
      7. Free space runs out, nodes get stuck in a loop, volumes get stuck
      8. API requests from the control plane get stuck, holding up threads in the control plane
  • Services to other AZs were affected due to thread starvation in the control plane. The control plane essentially experienced a distributed DoS attack!
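
The last step of the walk-through, where stuck API requests hold control plane threads until none are left for other AZs, can be illustrated with a toy model. This is a minimal sketch and not the EBS control plane: the pool size, the tick-based timing, and the `timeout` parameter are all hypothetical, and the timeout is simply one common mitigation that the slides themselves do not describe.

```python
from typing import List, Optional


def simulate(pool_size: int, ticks: int, timeout: Optional[int]) -> int:
    """Count how many healthy-zone requests a shared thread pool manages to serve.

    Toy model (all parameters hypothetical): on every tick one request arrives
    for a degraded zone and never completes, and one request arrives for a
    healthy zone and completes within the same tick if a thread is free.
    """
    busy: List[int] = []  # start ticks of requests currently holding a thread
    served_healthy = 0
    for now in range(ticks):
        # With a timeout, threads held longer than `timeout` ticks are released.
        if timeout is not None:
            busy = [started for started in busy if now - started < timeout]
        # The degraded-zone request takes a thread (if any) and hangs on it.
        if len(busy) < pool_size:
            busy.append(now)
        # The healthy-zone request only needs a free thread for this tick.
        if len(busy) < pool_size:
            served_healthy += 1
    return served_healthy


if __name__ == "__main__":
    print("no timeout:  ", simulate(pool_size=4, ticks=20, timeout=None))
    print("with timeout:", simulate(pool_size=4, ticks=20, timeout=2))
```

In this model, without a timeout the hung requests occupy every thread after a few ticks and healthy-zone requests stop being served; with a timeout the pool keeps spare threads and every healthy-zone request goes through.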
  • 29. Other AWS Services such as the Relational Database Service (RDS) and EC2 instances rely on the EBS Control Plane for their data volume needs.
  • 30. RDS in particular uses multiple EBS volumes simultaneously.
  • 31. Hence this issue, which started in one AZ, quickly affected services in the other AZs of the same region
  • 33. Applications should not rely on a single Region or AZ
  • 34. Appropriate policies and a change management process (requiring manual approvals for risky changes) need to be implemented
  • 35. Automation instead of manual changes would have helped prevent errors
  • 36. Modify the re-mirroring search algorithm to back off more aggressively in case of a large-scale interruption
  • 37. Make highly reliable multi-AZ deployments easier to design and operate so that customer adoption accelerates
  • 38. Finally, having a failover hybrid or heterogeneous environment would have helped.
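
The recommendation to back off more aggressively can be sketched as a retry loop with capped exponential backoff and random jitter. This is an illustrative sketch only, not Amazon's re-mirroring algorithm: `find_replica_target`, `has_free_space`, and the default `base`, `cap`, and `attempts` values are hypothetical names and numbers chosen for the example.

```python
import random
import time
from typing import Callable, Optional


def find_replica_target(
    has_free_space: Callable[[], bool],
    base: float = 0.5,
    cap: float = 60.0,
    attempts: int = 8,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> bool:
    """Search for free space, waiting longer (with jitter) after each failure.

    has_free_space: hypothetical probe that returns True once space is found.
    base, cap: the wait ceiling doubles from `base` seconds up to `cap` seconds.
    Returns True if space was found within `attempts` tries, otherwise False.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        if has_free_space():
            return True
        if attempt < attempts - 1:
            # Full jitter: wait a random time up to the capped exponential
            # ceiling so that many nodes do not all retry at the same moment.
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng.uniform(0, ceiling))
    return False


if __name__ == "__main__":
    answers = iter([False, False, True])
    waits = []
    found = find_replica_target(lambda: next(answers), sleep=waits.append)
    print(found, [round(w, 2) for w in waits])
```

Injecting `sleep` and `rng` keeps the sketch testable without real waiting; a node that gives up after the final attempt would report failure instead of searching indefinitely.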