Cloud War Stories 
Crashes on Live TV
Foxes TV show 
Concept 
• A Live TV broadcast that documented the lives of urban foxes in cities 
around the country 
• The show was a nature documentary 
• The website provided a second-screen experience, allowing users to:
• Ask the presenters questions 
• Track live foxes in real time using GPS technologies 
• The broadcast took place on a public holiday, so larger-than-average 
audiences were predicted 
• The TV audience was predicted to be ~3000 viewers, of which 10% 
would use the second screen
Media vs the Cloud 
Scenario 
• Predicted traffic profile 
• 300 users a second 
• Load testing carried out 
• Passed all the tests 
• Announcements going live at a specific time: 
• Expecting short 3 min traffic spikes throughout the show 
• 4 spikes expected during the broadcast
What testing showed us 
Autoscale doesn’t work in short spaces of time! 
Why it failed: 
• Autoscale reacts to a CloudWatch metric for load 
• We set a very low load threshold on the metric to 
ensure capacity was added in plenty of time 
• However, average spin-up times for EC2 
instances are 3-5 mins 
• Our traffic spikes were predicted to last 3 
mins 
• This meant that autoscale would not react 
in time to prevent the site crashing
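
The timing problem above can be sketched as simple arithmetic. This is only an illustration, not the team's tooling: the function name and the optional detection delay are assumptions, and only the 3-5 minute spin-up range and the 3 minute spike length come from the slides.

```python
def capacity_arrives_in_time(spike_seconds, spin_up_seconds, detection_seconds=0):
    """Return True if new instances are serving traffic before the spike ends.

    detection_seconds is a hypothetical allowance for the load alarm to
    fire; it is not a figure from the talk.
    """
    return detection_seconds + spin_up_seconds < spike_seconds


SPIKE_SECONDS = 3 * 60  # predicted spike length from the slides

for spin_up_minutes in (3, 4, 5):  # quoted EC2 spin-up range
    in_time = capacity_arrives_in_time(SPIKE_SECONDS, spin_up_minutes * 60)
    print(f"spin-up {spin_up_minutes} min: capacity in time = {in_time}")
```

Even with an instant alarm, none of the quoted spin-up times beats a 3 minute spike, which is why lowering the threshold alone could not help.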
The Solution 
With public announcements you have an advantage: 
You know when the announcements will take place, which allows you to “pre-warm” 
your systems. You can: 
• Manually scale your systems 
• This removes the reliance on autoscale 
• Pre-warm your web cache 
• Programmatically access your site so that content is already in the 
caches 
• This stops the initial rush of users knocking your site offline
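
A cache pre-warm can be as simple as requesting each important page once before the announcement. The sketch below is a minimal version under stated assumptions: `prewarm` and `fetch_status` are hypothetical names, the URL list is whatever pages users are expected to hit first, and it uses only the Python standard library rather than anything named in the talk.

```python
import urllib.request


def fetch_status(url, timeout=10):
    """Request one URL and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def prewarm(urls, fetch=fetch_status):
    """Request every URL once so caches in front of the origin hold a copy.

    Returns a dict mapping each URL to its status code, or to an error
    string if the request failed, so cold or broken pages are visible
    before the announcement rather than during it.
    """
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
            results[url] = f"failed: {exc}"
    return results
```

Calling `prewarm(["https://example.com/", "https://example.com/live-map"])` (placeholder URLs) shortly before each announcement would populate the caches. Because any change that invalidates caching makes them cold again, it seems sensible to re-run it after every deployment.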
What happened on the night! 
• Despite all this testing, the system still crashed! 
• During the crash we put up a holding page to explain to users what was happening. 
• This also crashed!
What went wrong? 
• Unprecedented audience figures of 15000 viewers and 5000 hits a second at 
peak on the website 
• Last-minute changes to the site had invalidated caching, so we had a large rush of 
users hitting the origin 
• This wasn’t our only issue, however! 
• As more people hit the cold cache, there 
were more calls to the ELB 
• This caused the ELB to autoscale 
• You can’t stop this happening 
• Each time the ELB scaled, users were 
disconnected 
• This caused them to hit refresh and 
double the problem
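
To put the figures above in proportion, the arithmetic can be written out. The overshoot ratios use only numbers from the slides; `load_with_refreshes` is a deliberately crude toy model (an assumption, not a measurement) in which every disconnected user sends one extra request by refreshing.

```python
PREDICTED_VIEWERS = 3_000
TESTED_HITS_PER_SECOND = 300
ACTUAL_VIEWERS = 15_000
ACTUAL_PEAK_HITS_PER_SECOND = 5_000


def overshoot(actual, predicted):
    """How many times larger the actual figure was than the forecast."""
    return actual / predicted


def load_with_refreshes(hits_per_second, disconnected_fraction):
    """Toy model: each disconnected user adds one extra request by refreshing."""
    return hits_per_second * (1 + disconnected_fraction)


print(f"viewers: {overshoot(ACTUAL_VIEWERS, PREDICTED_VIEWERS):.1f}x the forecast")
print(f"peak hits: {overshoot(ACTUAL_PEAK_HITS_PER_SECOND, TESTED_HITS_PER_SECOND):.1f}x the tested rate")
print(f"load if everyone refreshes: {load_with_refreshes(ACTUAL_PEAK_HITS_PER_SECOND, 1.0):.0f} hits/s")
```

In this model, "doubling the problem" corresponds to a disconnected fraction of 1.0, that is, every user being dropped and refreshing once.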
Lessons Learned 
• Despite our testing for 300 users a second, we didn’t know at what point the site would crash 
• To address this we took the following steps: 
• Future tests were handled by specialist load testers 
• Amazon Partners 
• We tested to destruction 
• This way we could fail the site gracefully 
• Even though Amazon Elastic Load Balancers (ELBs) are a highly available service, we took a 
new approach: 
• Dual Elastic Load Balancers were deployed 
• We used AWS Route 53 to provide weighted DNS, ensuring a 50/50 split of traffic 
• We used Route 53 to provide failure-detection health checks 
• As one ELB scaled, the traffic went to the other and no users were disconnected 
• All static assets (images, JavaScript, CSS and video) were stored in S3, away from the main 
platform, relieving pressure in the event of an origin rush
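
The 50/50 weighted DNS split can be sketched as the change batch that Route 53 accepts for weighted alias records. This is an assumption about how such a setup might be expressed, not the configuration used on the project: the function names, record name, load balancer DNS names and hosted zone IDs are all hypothetical placeholders, and `EvaluateTargetHealth` is just one way to get health-based failover.

```python
def weighted_alias_change(record_name, set_identifier, elb_dns_name,
                          elb_hosted_zone_id, weight):
    """Build one UPSERT change for a weighted alias record pointing at an ELB."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": set_identifier,  # must be unique per weighted record
            "Weight": weight,                 # relative weight; equal values split evenly
            "AliasTarget": {
                "HostedZoneId": elb_hosted_zone_id,
                "DNSName": elb_dns_name,
                "EvaluateTargetHealth": True,  # stop answering with an unhealthy target
            },
        },
    }


def dual_elb_change_batch(record_name, elbs, weight=50):
    """Build a change batch giving every listed ELB the same weight.

    elbs is a list of (set_identifier, dns_name, hosted_zone_id) tuples.
    """
    return {
        "Comment": "Split traffic evenly across two load balancers",
        "Changes": [
            weighted_alias_change(record_name, ident, dns_name, zone_id, weight)
            for ident, dns_name, zone_id in elbs
        ],
    }


# Hypothetical values for illustration only.
batch = dual_elb_change_batch(
    "www.example.com.",
    [
        ("elb-a", "elb-a.example.elb.amazonaws.com.", "ZONEIDA"),
        ("elb-b", "elb-b.example.elb.amazonaws.com.", "ZONEIDB"),
    ],
)
print([c["ResourceRecordSet"]["Weight"] for c in batch["Changes"]])
```

With boto3 installed and credentials configured, a batch like this could be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, alongside the `HostedZoneId` of the DNS zone.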
Improved Architecture
Getting it right

More from Richard Harvey (20)

