CLOUDFAIL
SCALING TO INFINITY – BUT NOT BEYOND
Kunal Johar
MARCH 14, 2013
π Day
What would you do?
•

You take your senior design project to the next level

•

You have some traction – 10-15 people a week using it

•

A game-changing opportunity hits you in the face

•

You need to scale to tens of thousands of users per week
Act as If
•

Scaling is no big deal, right?

•

Amazon’s Elastic Cloud; Rackspace’s Infinite Capacity

•

50,000 is a small number even in O(N^2)

•

I’m sure I can figure it out
“We are counting on you”
•

Our organization depends on this software for our annual operating budget

•

This year was a total disaster. Multi-week outages.

•

We need you to tell us that this will work, that the system won’t go down, no
matter how much traffic we send to it
No Problem
•

“The old vendor was amateur hour”

•

We’ll distribute the load across multiple servers

•

We’ll load test

•

We’ll scale up

•

DON’T WORRY
MAY 20, 2013
Paperwork Signed – Now the Challenge Begins
Our Software Does it all (soon)
•

It was a Brutal Summer
•

We had 12 weeks to learn, architect, and build what ended up being 1,800 hours’ worth of
features

•

The margin for error was Zero

•

We also had to make sure our system would scale to meet the super-surge of traffic in
January
Full Team Buy-In
•

The stakes were known to everyone.

•

If we succeeded, we’d pivot ourselves to the top of the market.

•

If we failed, half the team would be out of work

•

Our client called failure “Mutually Assured Destruction”
SEPTEMBER 2, 2013
Lots of Overtime, Heat, Stress, Anxiety. But we did it.
Memo to Developers
Load Test or Beta Test?
•

From the September 1 launch date right up to today, we have been hit with new
feature requests

•

“Oh! I forgot about that – but it’s really important”

•

How do you balance engineering priorities vs feature priorities?
How to Construct a Load Test
• Write custom scripts that simulate real users using your app
    • Selenium Web Driver + Sauce Labs
    • Browser Mob (Neustar)
    • Load Impact

• Write a custom handler that simulates the user payload
    • Loader.io
Our Loader.io Script PayLoad
•

POST 100 KB of data

•

Simulate Save to Database

•

GET 100 KB of data from Database
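
The three payload steps above can be sketched as a small handler. This is only an illustration, not the script from the talk: the function names are hypothetical, and an in-memory SQLite table stands in for the real database, which the slides do not describe.

```python
import os
import sqlite3

PAYLOAD_BYTES = 100 * 1024  # 100 KB, the payload size named on the slide


def open_store():
    """Create an in-memory SQLite table standing in for the real database (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payloads (id INTEGER PRIMARY KEY, body BLOB)")
    return conn


def handle_post(conn, body):
    """Simulate the save: store the posted bytes and return the new row id."""
    cur = conn.execute("INSERT INTO payloads (body) VALUES (?)", (body,))
    conn.commit()
    return cur.lastrowid


def handle_get(conn, row_id):
    """Simulate the read: fetch the stored bytes, or None if the id is unknown."""
    row = conn.execute(
        "SELECT body FROM payloads WHERE id = ?", (row_id,)
    ).fetchone()
    return row[0] if row else None


def simulate_user(conn):
    """One simulated user: POST 100 KB, save it, then GET the same 100 KB back."""
    body = os.urandom(PAYLOAD_BYTES)
    row_id = handle_post(conn, body)
    return handle_get(conn, row_id) == body
```

Every simulated user here does the same cheap round trip, so a test built this way measures how fast the system handles that one request shape and nothing else.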
The Actual Load Test
300+ Users Per Second!
•

Whoo hoo!

•

300 users per second must mean what? Thousands of users per minute!

•

I report a very successful load test to the client and leave the rest to
wishful thinking
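
The back-of-the-envelope reasoning on this slide can be written out. The 300 per second figure is from the load test; the 220 figure comes from the Friday, January 10 slide, and treating it as people per minute is an assumption based on the other traffic numbers in the deck. The function names are made up for this sketch.

```python
def users_per_minute(users_per_second):
    """Convert a per-second rate into a per-minute rate."""
    return users_per_second * 60


def headroom_ratio(test_users_per_second, observed_users_per_minute):
    """How many times larger the load-test rate looks than real traffic."""
    return users_per_minute(test_users_per_second) / observed_users_per_minute


LOAD_TEST_RATE = 300        # users per second, from the load test
FRIDAY_PEAK_PER_MINUTE = 220  # assumed to be people per minute

print(users_per_minute(LOAD_TEST_RATE))                      # 18000
print(headroom_ratio(LOAD_TEST_RATE, FRIDAY_PEAK_PER_MINUTE))  # roughly 82
```

On paper that is about 80 times more capacity than the traffic that later took the site down, which is why the raw rate alone was a poor predictor.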
SURVIVORSHIP BIAS
http://youarenotsosmart.com/2013/05/23/survivorship-bias/
Survivorship Bias
The misconception
You should focus on the successful if you wish to be successful
The truth
When failure becomes invisible, the difference between failure and success may also
become invisible
Survivorship Bias
•

“A Cabal of Geniuses” assembled at the request of
the White House

•

Top women mathematicians (human computers),
Nobel Prize Winners, researchers formed the
Statistical Research Group
Keeping Airplanes in the Sky
•

At its lowest, the survivability of a WWII
bomber was 50% on a mission

•

“Ghosts already” is how airmen
were known

•

“How,” the Army Air Force asked,
“could they improve the odds of a
bomber making it home?”
Armor
•

Military commanders inspected the planes that made it back

•

Ideally they could put armor on the whole plane, but then it wouldn’t fly

•

Tons of bullet holes in key areas of the fuselage, wings, near the gunners

•

The army was about to add plating to these parts of the bombers
Armor
•

The scientists successfully argued
“Survivorship Bias”

•

Stop looking at the survivors – it is the
other parts of the plane that need more
armor!
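
The scientists' argument can be illustrated with a small simulation. Everything numeric here is invented for the sketch: the loss probabilities per area are hypothetical, and each plane takes exactly one hit in a uniformly random area.

```python
import random
from collections import Counter

# Hypothetical chance that a single hit in each area brings the plane down.
LOSS_PROBABILITY = {"engine": 0.8, "cockpit": 0.6, "fuselage": 0.1, "wings": 0.1}


def fly_missions(n_planes, seed=0):
    """Return (hits on all planes, hits seen on the planes that came home)."""
    rng = random.Random(seed)
    areas = list(LOSS_PROBABILITY)
    all_hits, survivor_hits = Counter(), Counter()
    for _ in range(n_planes):
        area = rng.choice(areas)      # every area is equally likely to be hit
        all_hits[area] += 1
        if rng.random() >= LOSS_PROBABILITY[area]:
            survivor_hits[area] += 1  # only these planes get inspected
    return all_hits, survivor_hits


all_hits, survivor_hits = fly_missions(10_000)
print("hits on every plane:", dict(all_hits))
print("hits on survivors:  ", dict(survivor_hits))
```

Hits are spread evenly across all planes, yet the returning planes show far more fuselage and wing damage than engine damage, because planes hit in the engine rarely come back to be counted.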
WHAT IS “CLOUDSCALE”
LOL
WE DON’T DO THAT
Zack’s first comment as I concluded that presentation
Our Architecture
PaaS / IaaS
WEEK OF JANUARY 6
Every Day is a Record Traffic Day
Scale up on IaaS
•

Someone is trying to generate a
150-page PDF

•

The norm is 10-15 pages…

•

“OutOfMemoryException”
Thursday, January 9, 2014
Whoo Hoo!
•

No Issues on our highest
traffic day ever!

•

“Can’t wait till that
number hits 250 per
minute!”

•

“Tomorrow will be our
biggest day yet!”
Friday, January 10, 2014
• Approximately 12:00 Noon
    • Site traffic is around 185 people per minute, 50 fewer than the previous day’s high
    • 1 out of every 12 hits times out
    • According to Rackspace, a node is failing on CloudSites and will be taken out of rotation
    • About 10 complaints so far, but I email “Everything is under control”

• Approximately 12:30 PM
    • Traffic falls to about 150 people per minute
    • Things are fine
    • Phew
Friday, January 10, 2014
•

At 1:00 PM we have a job interview for a new support person

•

I have live chat open with Rackspace and am hopping back and forth between it and the
interview. Not the best way to hire someone

•

1:45 PM: the interview is over, and I learn traffic is at 220+ people per minute.

•

The site is pretty much dead

•

While I work on the issue, my phone is ringing with a frightened customer. Our
help desk is filling up with complaints non-stop

•

With a stone-cold face, I walk to my teammates. “This is bad. I need help”
Backup Plan
•

I knew CloudSites had some limit, but I had a plan to shift traffic at a moment’s
notice in a worst case situation
Backup Plan Now in Play
•

Using CloudFlare, a service that lets us rapidly change DNS records, traffic was
redirected to the super server

•

1 second later
Backup Plan Part II (Scale Up)
•

OK – I’ll spin up the most powerful server I can buy.

•

64 GB RAM

•

32 vCPU
Backup Plan Part II
•

19 seconds later
3:25 PM
• Rackspace gives me a one-time “boost” to capacity

• Lets me know about “HTE” for the future….
    • “If you are having a high traffic event, let us know in advance”

• I kiss the floor. My company is saved by the whim of my hosting company
9:00 PM
•

Zack and I finish responding to customer complaints

•

It would be weeks before I could sleep normally again
What the heck happened?
•

The initial load test was testing people submitting one application at a time

•

The PDF issue was actually a harbinger of things to come

•

Thursday had record traffic, but Friday had people doing “Finalization” (commits)

•

Our commit code was very slow, and used a lot of RAM. As a server would get overloaded,
the app pool would restart – this would add load to other servers

•

Demand > Supply caused a chain reaction in which servers kept failing until more
supply was added
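
The chain reaction described above can be modeled in a few lines. This is a toy model under loud assumptions: load is spread evenly, servers drop out one at a time rather than restarting and rejoining, and the server counts and per-server capacity are invented, since the slides give none.

```python
def simulate_cascade(total_load, servers, capacity_per_server):
    """Return the count of healthy servers after each redistribution round."""
    healthy = servers
    history = [healthy]
    while healthy > 0:
        load_each = total_load / healthy
        if load_each <= capacity_per_server:
            break                # demand fits supply, so the pool is stable
        healthy -= 1             # one overloaded server drops out of the pool
        history.append(healthy)  # its share now lands on the remaining servers
    return history


# Hypothetical pool: 4 servers, each able to absorb 50 people per minute.
print(simulate_cascade(185, 4, 50))  # [4]
print(simulate_cascade(185, 3, 50))  # [3, 2, 1, 0]
print(simulate_cascade(220, 4, 50))  # [4, 3, 2, 1, 0]
print(simulate_cascade(220, 5, 50))  # [5]
```

In this model, losing a single node or gaining a little traffic tips a stable pool into total failure, and only adding supply stops the slide.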
Our Future Plans
•

I’m too scared of PaaS for a
complex use case!

•

Not enough data to know when
things fail.
Thanks!
Kunal Johar
kjohar@alumni.gwu.edu
