Succeeding in the Cloud – the guidebook of Fail
Steve Poole – IBM
Making Java Real Since Version 0.9
DevOps Practitioner @spoole167
This talk
• Comes from personal and team experience as the leader of a DevOps team
• Comes from weekly consultancy etc. with product teams and external customers
Agenda of Fail
• Fail 0 – Believing Migration to Cloud is easy
• Fail 1 – No Clarity of Purpose
• Fail 2 – Lack of education
• Fail 3 – Not kicking the tires enough first
• Fail 4 – Ignoring unpleasant discoveries
• Fail 5 – Fudging the hard decisions
• Fail 6 – Lack of preparation
• Fail 7 – Not enough exercise
• Fail 8 – Too much excitement
• Fail 9 – Big bang deployment
• Fail A – A few other things
Fail 0.0 : Believing Migration to Cloud is Easy
• ‘Cloud’ is not easy
• It may be self-service but don’t be fooled
• It may look like a nice walk into the forest to grandma’s house...
• Get yourself together for a large and painful exercise.
• Ever moved a Data Centre?
• Experience is key.
• Staff. Who’s going to do this – are they qualified?
• Prepare to change your plans
• Most migrations require architectural design changes within the first 6 months
• Half of all projects fail
• Half of all projects will need significant increases in budget
• Think it through
• Projects fail later on when new objectives get added. Be clear on your ultimate goal.
Emigration, not Migration (Migration suggests it’s something you want to do annually)
Fail 1.0: No Clarity of Purpose
• There are many reasons for moving applications to the ‘Cloud’
• There are many types of application
• There are many ‘Clouds’ to move to
• What’s the chance of you getting it right first time?
• What’s the consequence of failure?
• Will you even know it has failed in time to recover?
• Clarity of purpose reduces your risk
• Clarity of purpose gives you focus
Fail 2.0 : Lack of Education or RTFM!
Not understanding the communications process
• How do they talk to you?
• What’s the ticketing system?
• How do you get told of a problem?
• How do you get told of planned outages?
• How much notice do you get for planned outages?
• How do you raise a problem?
• How do you ESCALATE?
• What is the communications SLA here?
• Know your rights
DOH! Ask me about passwords
Fail 2.1 : Lack of Education or RTFM!
I was a single point of failure, and I didn’t even know it
I thought I was in control of my account until I needed my password reset
I had no idea where the reset email was going
Cloud support could trigger the reset but wouldn’t/couldn’t tell me more.
They suggested I go to my Admin, which I thought was me.
Turns out there’s a corporate owner of the accounts. It took me days to resolve.
Fail 2.2 : Lack of Education or RTFM!
The one thing you should remember from this talk
We’re techies. We get excited about APIs. We understand APIs.
Moving to the Cloud means giving your data, applications, security etc. to a 3rd party.
That means the ‘API’ extends into the human world. The contract and its SLA define what you can and cannot do when using Cloud services.
Cloud providers benefit from economies of scale and have large numbers of customers, just like the more usual service providers you use at home: gas, electricity, broadband, satellite TV.
You know how that can work at home. Cloud provisioning is much more complicated.
Fail 2.3 : Lack of Education or RTFM!
Not understanding the Service Level Agreement
• Does it have location specific differences?
• How is the SLA measured?
• How well defined are the criteria?
• How are issues resolved?
• What are your responsibilities?
• If you don’t know your SLA you will fail
Example: Can you assess free capacity? If a location is at capacity, what happens?
Fail 2.4 : Lack of Education or RTFM!
Not understanding the Service Level Agreement (2)
• True story
• Go to a service provider SLA dashboard
• Service says SLA availability of 99.5%
• I think that means
• Turns out that actual availability is 95.8%
Allowed downtime (figures from https://uptime.is/):
Claimed:
Daily: 43.2s
Weekly: 5m 2.4s
Monthly: 21m 54.9s
Yearly: 4h 22m 58.5s
Actual (95.8%):
Daily: 1h 0m 28.8s
Weekly: 7h 3m 21.6s
Monthly: 1d 6h 40m 49.3s
Yearly: 15d 8h 9m 52.0s
Fail 2.5 : Lack of Education or RTFM!
Not understanding the Service Level Agreement (3)
• True story
• The difference is because the provider has a planned daily outage of 1hr
• They still claim 99.5%
• It gets worse.
• Outages beyond their control don’t ‘count’ either.
Downtime at 85% availability:
• Daily: 3h 36m 0.0s
• Weekly: 1d 1h 12m 0.0s
• Monthly: 4d 13h 34m 21.9s
• Yearly: 54d 18h 52m 22.8s
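The downtime figures on the SLA slides can be reproduced with a few lines of arithmetic. The sketch below is not the uptime.is implementation; it assumes a year of 365.2425 days and a month of one twelfth of that, chosen because those values reproduce the quoted numbers. One thing the arithmetic shows: the first set of figures quoted above (43.2s daily) corresponds to 99.95% availability, whereas 99.5% would allow 7m 12s per day, so the sketch prints both.

```python
DAY = 24 * 60 * 60  # seconds in a day

# Period lengths in seconds (year/month lengths are assumptions, see lead-in).
PERIODS = {
    "Daily": DAY,
    "Weekly": 7 * DAY,
    "Monthly": 365.2425 * DAY / 12,
    "Yearly": 365.2425 * DAY,
}

def downtime_seconds(availability_pct, period_seconds):
    """Seconds of downtime permitted in a period at a given availability."""
    return (1 - availability_pct / 100) * period_seconds

def fmt(seconds):
    """Render seconds in the 'Xd Xh Xm X.Xs' style used on the slides."""
    seconds = round(seconds, 1)
    d, rem = divmod(seconds, DAY)
    h, rem = divmod(rem, 3600)
    m, s = divmod(rem, 60)
    parts = []
    if d:
        parts.append(f"{int(d)}d")
    if d or h:
        parts.append(f"{int(h)}h")
    if d or h or m:
        parts.append(f"{int(m)}m")
    parts.append(f"{s:.1f}s")
    return " ".join(parts)

for pct in (99.95, 99.5, 95.8, 85.0):
    print(f"{pct}% availability")
    for name, length in PERIODS.items():
        print(f"  {name}: {fmt(downtime_seconds(pct, length))}")
```

Running the same calculation against your own provider's stated percentage, and then against what you actually measure, is a quick way to see how far apart the two are.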
Fail 2.6 : Lack of Education or RTFM!
Not understanding the cost model
• Units of cost.
• CPU / RAM / Network / Storage / IP Addresses …
• Penalty costs if you overrun?
• When does the time start and end?
• Do costs change by location?
DOH! Ask me about GPUs
Fail 2.7 : Lack of Education or RTFM!
Not understanding the cost model
• I’m testing new GPU support in IBM’s JVM 8.0
• IBM has GPU support in SoftLayer
• Amazon has GPU support in AWS
• I want to do some scale performance testing
• Got my VirtualBox and Ansible config
• Point it at AWS. Deploy < 1hr x 2
• Costs me $39?
• Other charges included
p2.16xlarge: 16 GPU, 64 vCPU, 732 GB RAM, $14/hr
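The gap between the hourly rate and the bill is easy to see once it is written down. This is a rough sketch, assuming two runs of just under an hour each and billing per started hour; the slide states neither the exact run length nor the billing granularity, so `hours_per_run=0.9` and the round-up rule are illustrative assumptions. Only the $14/hr rate and the $39 total come from the slide.

```python
import math

def compute_cost(runs, hours_per_run, hourly_rate):
    """Compute-only cost, rounding each run up to whole billed hours (assumed)."""
    billed_hours = runs * math.ceil(hours_per_run)
    return billed_hours * hourly_rate

HOURLY_RATE = 14.00  # $/hr for p2.16xlarge, as quoted on the slide
BILL = 39.00         # total bill, as quoted on the slide

# Two deployments of under an hour each (0.9 is a hypothetical duration).
compute = compute_cost(runs=2, hours_per_run=0.9, hourly_rate=HOURLY_RATE)
other = BILL - compute

print(f"compute: ${compute:.2f}, other charges: ${other:.2f}")
```

Under these assumptions, compute accounts for $28 and the remainder is the "other charges": the units of cost you did not think to ask about.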
Fail 2.8 : Lack of Education or RTFM!
Not understanding how security and compliance are managed
• What are the security, compliance and image update policies?
• How did they handle the last pervasive vulnerability?
• Firewalls – do you get one for free? Can you configure it? What’s the default policy for firewalls with deployments?
• SSL certificates – do you own and manage them or do they offer a service?
• How do you access your VMs? (ssh, telnet, web?)
• Passwords vs keys?
• Where are the keys kept?
• Can you retrieve the keys in an emergency?
You do understand penetration attack vectors?
Fail 2.9 : Lack of Education or RTFM!
Misunderstanding what APIs exist
• Are there APIs for all the actions you want to perform?
• Are they symmetrical?
• Do any need human interaction to complete?
• Are the APIs proprietary or standard?
• Are there plugins for IaC tools?
DOH! Ask me about VM termination APIs
Fail 2.A : Lack of Education or RTFM!
Lack of a Community
• What do others think of this Cloud?
• Is there an active DevOps community?
• Do you see active participation from the Cloud provider?
Fail 3.0 : Not Kicking the tires enough first
Poor assumptions about ‘how things work’
• For instance:
• “I don’t need a public IP address for my VM as I have a private gateway”
• “Now I can’t do apt-get update!”
• “What do you mean I have to buy public IP addresses?”
Fail 3.1 : Not Kicking the tires enough first
• If you don’t start with IaC techniques from Day 1 you will fail.
• Environments are all different
• Is your memory really that good?
• You must encode.
• Trying by hand and then encoding into IaC
• Helps you learn about your target environments (APIs, anyone?)
• Builds up an IaC asset base you’ll need in the future.
“The human touch”
Fail 3.2 : Not Kicking the tires enough first
• Get a buddy - “Extreme Deployment”
• Install VirtualBox and Vagrant
• Build a Vagrantfile for an environment you care about
• Provision locally: “vagrant up --provider=virtualbox”
• Pick a Cloud. (Use the ‘free tier’!)
• Try to deploy a VM by hand.
• Now do “vagrant up --provider=XXXXXXX”
• Examine the differences.
• Add more and repeat
Look for how IP addresses are allocated.
Look at the options for memory size, networking, disk space, disk types (IO speeds)
What CPUs can you get?
What OSes can you provision?
What architectures are available?
What’s the cost?
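One way to make "examine the differences" concrete is to write down what each environment actually gave you and compare the two records. The sketch below is only an illustration: the attribute names and values are hypothetical, and in practice you would fill them in from what you observe after each `vagrant up`.

```python
def diff_environments(local, cloud):
    """Return {attribute: (local_value, cloud_value)} for attributes that differ.

    Attributes missing on one side are reported with None for that side.
    """
    keys = sorted(set(local) | set(cloud))
    return {
        key: (local.get(key), cloud.get(key))
        for key in keys
        if local.get(key) != cloud.get(key)
    }

# Hypothetical observations from a local VirtualBox VM and a free-tier cloud VM.
local = {"memory_mb": 2048, "cpus": 2, "public_ip": False, "disk_type": "ssd"}
cloud = {"memory_mb": 1024, "cpus": 1, "public_ip": True, "disk_type": "ssd",
         "hourly_cost_usd": 0.0}

for key, (local_value, cloud_value) in diff_environments(local, cloud).items():
    print(f"{key}: local={local_value!r} cloud={cloud_value!r}")
```

Keeping these records alongside your IaC files gives you a written trail of the assumptions that turned out to differ between environments.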
Fail 3.3 : Not Kicking the tires enough first
Try another Cloud
Try someone's IaC pattern
Ansible script to deploy a Docker swarm
Go wild:
Try to deploy OpenStack on your laptop (with 32GB)
https://www.rdoproject.org/
Now do it all again with Docker
Fail 3.4 : Not Kicking the tires enough first
Not understanding that your initial deploys are the least secure
• How long until your newly deployed VM is attacked? Anywhere from 20 seconds to 40 minutes
• So deploying and then adding vulnerability patches is not the right answer
• War story:
• Customer deploys a VM to Cloud.
• VM gets hacked immediately
• Customer patches the VM.
• Customer keeps the VM and uses it in production
• Customer gets a bill for $500,000 of network traffic. The VM is now being used to host warez
Fail 3.5 : Not Kicking the tires enough first
• Time to think about security
• If you don’t get your security posture defined before you deploy you’ll fail and possibly get some interesting bills
• Maybe you’ll go out of business.
• Worst case (maybe) is you have provided a gateway into your company network
• Regular Vulnerability scanning & fixing.
• Keys not passwords
• Specific IP address access for VMs
• Whitelisted access to internal systems (inside your firewall)
• Whitelisted access to remote systems (on the internet) …
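The idea behind "specific IP address access" and whitelisting can be sketched with Python's standard `ipaddress` module. This only illustrates the rule (deny unless the source is explicitly listed); in a real deployment you would express it in your cloud's firewall or security-group configuration. The networks below are hypothetical documentation-range addresses.

```python
import ipaddress

# Hypothetical allowlist: one office network and one bastion host.
ALLOWED_SOURCES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]

def is_allowed(source_ip, allowed=ALLOWED_SOURCES):
    """Return True only if source_ip falls inside an allowlisted network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allowed)

for candidate in ("203.0.113.42", "198.51.100.7", "198.51.100.8"):
    verdict = "allow" if is_allowed(candidate) else "deny"
    print(f"{candidate}: {verdict}")
```

The design choice worth copying is the default: anything not on the list is denied, so a newly deployed VM starts closed rather than open.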
Fail 4.0 – Ignoring unpleasant discoveries
• Not all the OSes you want are there
• Performance of the Cloud is less than you expected
• Now you know what multi-tenancy means.
• Managing VMs in the Cloud is complicated
• Keeping systems secure and compliant is hard
• Deployment times vary (and fail unexpectedly)
• Debugging problems remotely is difficult
• It costs more than you realized.
• Cost is your responsibility. (No one is going to help you save money!)
• Clouds fill up
So now you know some of those ‘unexpected’ restrictions
Initial cloud deployments are juicy targets for the bad guys
Fail 4.1 – Ignoring unpleasant discoveries
• Deploy anyway.
• Just run with a smaller JVM heap
• OK, I get that it won’t scale – deploy anyway and we’ll fit scaling in later
• You’ll just have to deploy with a small budget for VMs
• Use the public multitenancy option – it’s cheaper.
• Can’t you add some sort of cache?
I’m impressed by the number of customers who can change the laws of physics
Fail 5.0 – Fudging the hard decisions
• You have to pick one. Changing your mind later is going to be expensive and complicated
• IaC is critical but it’s not magic.
Not realizing Clouds are sticky
Many of my consultancy discussions started with a company saying to itself:
“It’s ok. If Cloud XXX is too expensive we’ll just move over to YYY”
Fail 5.1 – Fudging the hard decisions
For instance:
A large rich-client application used in-house in multiple locations. The plan was to consolidate into the Cloud.
Network traffic between client and servers was measured in TBs per day.
To reduce costs, the plan was to create special proxies/data caches on-prem.
Consequence: increased complexity of design, poor performance, an untried new system -> fail.
They should have spent the money on replacing the rich client with a web-based one.
Compromising the architecture because of cost
Unexpected expensive items (such as network costs) can drive you to weird hybrid configurations that increase complexity and ultimately fail
Fail 5.2 – Fudging the hard decisions
Offering | RAM Cost (2015) | CPUs
IBM Bluemix (CF) | $24.15 GB/Month | 4 vCPUs per instance
IBM Bluemix (Containers) | $9.94 GB/Month | 4 vCPUs per GB
run.pivotal.io | $21.60 GB/Month | 4 vCPUs per instance
Heroku (Hobby) | $14.00 GB/Month | 1 "CPU share" per 512MB in an instance
Heroku (Professional) | $50.00 GB/Month | 1 "CPU share" per 512MB in an instance
Amazon EC2 (SLES) | $16.56 GB/Month | 1 vCPU per 4GB in an instance
Not understanding the cost projections
Old data for example only
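A cost projection from a price list like this is just multiplication, but writing it out makes the spread between offerings obvious. The rates below are copied from the 2015 table above and are for illustration only; the 8 GB footprint is a hypothetical figure, and the sketch ignores CPU, network and storage, which the table does not price.

```python
# $ per GB of RAM per month, from the 2015 table above (old data, example only).
RAM_COST_PER_GB_MONTH = {
    "IBM Bluemix (CF)": 24.15,
    "IBM Bluemix (Containers)": 9.94,
    "run.pivotal.io": 21.60,
    "Heroku (Hobby)": 14.00,
    "Heroku (Professional)": 50.00,
    "Amazon EC2 (SLES)": 16.56,
}

def monthly_ram_cost(gb, rates=RAM_COST_PER_GB_MONTH):
    """Monthly RAM-only cost per offering for a footprint of `gb` gigabytes."""
    return {name: round(rate * gb, 2) for name, rate in rates.items()}

costs = monthly_ram_cost(8)  # hypothetical 8 GB application footprint
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:.2f}/month")
```

Even on RAM alone the most expensive option here costs about five times the cheapest, which is why the projection has to be redone with current prices and your real footprint.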
Fail 6.0 – Lack of preparation
Driving straight into live deployment
Premature deployment based on happy path will ultimately fail
It is critical that you have exercised an end-to-end deployment and support model before you go live
So many projects fail because of problems later.
Even simple applications need security, logging and monitoring
Fail 6.1 – Lack of preparation
Not having a solid monitoring and diagnostics solution
Most successful cloud applications consider their monitoring solution to be the most critical part of their system
If your monitoring solution fails, you’re running blind
Build the monitoring system and then exercise it
Break things, scale things, build runaway jobs
Figure out what is important and monitor it
Now build dashboards
Do you get the events you need when you need them?
Are you measuring end user response times?
Fail 6.2 – Lack of preparation
Not having enough dashboards!
My team was a traditional IT one.
We responded to tickets, so customers always found the problem first
We added dashboards and an objective: “First to Know”
We moved from being last to know to being the one to tell the customer.
Dashboards allowed my team to see issues clearly when there was a failure and when trends showed bad things were going to happen.
Dashboards changed my team’s attitudes. They made automation and monitoring more acceptable
Fail 6.3 – Lack of preparation
Not having a robust and automated deployment solution
After your application goes live things will go wrong
It’s not just about having a robust application design.
How quickly you can remediate issues is dependent on your ability to deliver those fixes
Design for Failure. “Everything fails, all the time” (Werner Vogels, CTO, Amazon.com)
Your deployment solution is your disaster recovery solution
Fail 6.4 – Lack of preparation
If your deployment solution is not your disaster recovery solution:
Cloud location goes off-line -> can you fail over to a new location?
What happens if your database gets corrupted?
Where is your data backed up to?
Can you get the data back into the Cloud fast enough?
Who does the backups?
When was the last backup taken?
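"Can you get the data back into the Cloud fast enough?" is a question you can put a rough number on before you need the answer. The sketch below estimates restore time from data size and link speed. The 80% link efficiency and the example sizes are assumptions for illustration, and it ignores restore-side work such as database replay.

```python
def restore_hours(data_gb, bandwidth_mbps, efficiency=0.8):
    """Hours needed to upload data_gb gigabytes over a bandwidth_mbps link.

    data_gb is decimal gigabytes, bandwidth_mbps is megabits per second,
    efficiency is the assumed usable fraction of the nominal bandwidth.
    """
    bits = data_gb * 8 * 1000**3
    effective_bits_per_second = bandwidth_mbps * 1_000_000 * efficiency
    return bits / effective_bits_per_second / 3600

# Hypothetical example: a 5 TB backup over a 500 Mbit/s link.
hours = restore_hours(data_gb=5000, bandwidth_mbps=500)
print(f"Estimated restore time: {hours:.1f} hours")
```

If the estimate is longer than the outage your business can tolerate, the backup location or the network path needs to change before the first disaster, not after it.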
Fail 7.0 – Not enough exercise
Not testing how your application scales
Scale testing reveals bottlenecks
Even just running two instances can be revealing
Break things too (chaos monkey)
Your aim is to understand how well your application can react to demand
Scale across Cloud locations: do data costs increase? Do response times get worse? Do timeouts occur?
Scale testing reveals design issues in application and infrastructure: things you want to know about before you go live. It also tells you whether your monitoring is going to be any use
Fail 7.1 – Not enough exercise
Failing to scale appropriately costs money
[Chart: Demand vs. Provisioned capacity over periods a to j, scale 0 to 120]
Fail 7.2 – Not enough exercise
Failing to scale appropriately costs money
[Chart: Demand vs. Provisioned capacity over periods a to j, scale 0 to 120]
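The point of the Demand vs. Provisioned charts can be expressed as two sums: capacity you paid for but did not use, and demand you could not serve. The series below are made-up numbers on the same 0 to 120 scale over ten periods; the original chart data is not recoverable from the slides.

```python
def provisioning_gap(demand, provisioned):
    """Return (idle_capacity, unmet_demand), each summed over all periods."""
    idle = sum(max(p - d, 0) for d, p in zip(demand, provisioned))
    unmet = sum(max(d - p, 0) for d, p in zip(demand, provisioned))
    return idle, unmet

# Hypothetical demand for periods a..j and a fixed provisioning level.
demand = [20, 40, 60, 100, 80, 40, 20, 60, 110, 30]
provisioned = [60] * len(demand)

idle, unmet = provisioning_gap(demand, provisioned)
print(f"idle capacity paid for: {idle} units")
print(f"demand not served:      {unmet} units")
```

Idle capacity is money spent for nothing, and unmet demand is lost business or failed requests; scaling that tracks demand shrinks both numbers.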
Fail 8.0 – Too much excitement
Projects can fail because of an excess of enthusiasm
“Let’s take the opportunity to rewrite the application”
“Let’s use this new tech”
Often fails due to a lack of situational awareness of the state of play in the industry
It’s easy to get carried away.
Fail 9.0 – Staged deployment
Going from Lift and Shift to what?
You can lift and shift. It’s probably going to bite you: unexpected dependencies on local items such as C:/ or local services and servers (authentication servers etc.)
Consider your options
The “strangler pattern” – staged conversion to microservices
Time for a rewrite?
Look at new options – “serverless”?
BTW – adding in sufficient debug capability can be just as expensive and increase risk
How far into the woods do you want to go?
Fail A.0 - A few other things
• Cloud providers often offer additional services
• Why build your own when you can use a provided one?
• Skill sets
• We have lots of tech experts but not that many systems experts.
• Take a look at your team. Do they have the skills and experience you need?
• IaC & DevOps skills?
• Some parts of your process are going to become more critical than before
• Who’s doing the data backups?
• Who owns your build and test infrastructure?
• Deployment process
• How long does it take to deploy a change?
• Does your team understand the importance of the process?
Wrap up
• Moving anything into a cloud environment is always a challenge
• Lack of clarity around why you want to do this will cost you money, sleep and probably doom the project
• Be sure your team is skilled and committed. It’s their sleep too
• Most of the projects that fail – fail because of the approach. Not the technology
• But not understanding the economic drivers on systems will also lead to fail
Fail to adapt -> Fail
How you design, code, deploy, debug, support etc. will be affected by the metrics and limits imposed on you.
Financial metrics and limits always change behavior. They also create opportunity
You will have to learn new techniques and tools
Applications have to get leaner and meaner
https://www.flickr.com/photos/beigephotos/
Thank you

Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail

  • 1. The guidebook of FailSucceeding in the Cloud
  • 2. Steve Poole – IBM Making Java Real Since Version 0.9 DevOps Practitioner @spoole167
  • 3. This talk • Come from personal and team experiences as a Leader of a DevOps team • Comes from weekly consultancy etc with product teams and external customers
  • 4. Agenda of Fail • Fail 0 – Believing Migration to Cloud is easy • Fail 1 – No Clarity of Purpose • Fail 2 – Lack of education • Fail 3 – Not kicking the tires enough first • Fail 4 – Ignoring unpleasant discoveries • Fail 5 – Fudging the hard decisions • Fail 6 – Lack of preparation • Fail 7 – Not enough exercise • Fail 8 – Too much excitement • Fail 9 - Big bang deployment • Fail A – A few other things
  • 5. Fail 0.0 : Believing Migration to Cloud is Easy • ‘Cloud’ is not easy • It may be self-service but don’t be fooled • It may look like a nice walk into the forest to grandma’s house. .. • Get yourself together for a large and painful exercise. • Ever moved a Data Centre? • Experience is key. • Staff. Who’s going to do this – are they qualified? • Prepare to change your plans • Most migrations require architectural design changes within the first 6 months • Half of all projects fail • Half of all projects will need significant increases in budget • Think it through • Projects fail later on when new objectives get added be clear on your ultimate goal Emigration not Migration (Migration suggests its something you want to do annually)
  • 6. Fail 1.0: No Clarity of Purpose • There are many reasons for moving applications to the ’Cloud’ • There are many types of application • There are many ‘Clouds’ to move to • What’s the chance of you getting it right first time? • What’s the consequence of failure? • Do you even know if you’ll even know it’s failed in time to recover? • Clarity of purpose reduces your risk • Clarity of purpose gives you focus
  • 7. Not understanding the communications process • How do they talk to you? • What’s the ticketing system? • How do you get told of a problem? • How do you get told of planned outages? • How much notice do you get for planned outages? • How do you raise a problem? • How do you ESCALATE? • What is the communications SLA here? • Know your rights Fail 2.0 : Lack of Education or RTFM! DOH! Ask me about passwords
  • 8. I was a single point of failure And I didn’t even know it I think I’m in control of my account until I need my password reset I had no idea where the reset email was going to Cloud support could trigger the reset but wouldn’t/couldn’t tell me more. Suggested I go to my Admin!! - Which I thought was me. Turns out there’s a corporate owner of the accounts. Took me days to resolve. Fail 2.1 : Lack of Education or RTFM!
  • 9. The one thing you should remember from this talk We’re techies. We get excited about APIs. We understand APIs Moving to the Cloud means giving your data, applications, security etc to a 3rd Party. That means the ‘API’ extends into the human world. The contract and it’s SLA defines what you can and cannot do when using Cloud services Cloud providers benefit from economies of scale and have large numbers of customers Just like the more usual service providers you use at home. Gas, Electricity, Broadband, Satellite TV. You know how that can work at home. Cloud Provisioning is much more complicated.. Fail 2.2 : Lack of Education or RTFM!
  • 10. Not understanding the Service Level Agreement • Does it have location specific differences? • How is the SLA measured? • How well defined are the criteria? • How are issues resolved? • What are your responsibilities? • If you don’t know your SLA you will fail Fail 2.3 : Lack of Education or RTFM! Example: Can you assess free capacity? If a location is at capacity what happens?
  • 11. Not understanding the Service Level Agreement (2) • True story • Go to a service provider SLA dashboard • Service says SLA available of 99.5% • I think that means  • Turns out that actual availability is 95.8 Fail 2.4 : Lack of Education or RTFM! https://uptime.is/ Daily: 43.2s Weekly: 5m 2.4s Monthly: 21m 54.9s Yearly: 4h 22m 58.5 Daily: 1h 0m 28.8s Weekly: 7h 3m 21.6s Monthly: 1d 6h 40m 49.3s Yearly: 15d 8h 9m 52.0s
  • 12. • True story • The difference is because the provider has a planned daily outage of 1hr • They still claim 99.5% • Get’s worse. • Outages beyond their control don’t ‘count’ either. Fail 2.5 : Lack of Education or RTFM! Not understanding the Service Level Agreement (3) •Daily: 3h 36m 0.0s •Weekly: 1d 1h 12m 0.0s •Monthly: 4d 13h 34m 21.9s •Yearly: 54d 18h 52m 22.8s 85%
  • 13. Not understanding the cost model • Units of cost. • CPU / RAM / Network / Storage / IP Addresses …. • Penalty costs if you overrun? • When does the time start and end? • Costs change by location? Fail 2.6 : Lack of Education or RTFM! DOH! Ask me about GPUs
  • 14. Not understanding the cost model • I’m testing new GPU support In IBM’s JVM 8.0 • IBM has GPU support in SoftLayer • Amazon has GPU support in AWS • I want to do some scale performance testing • Got my VirtualBox and Ansible config • Point it at AWS. Deploy < 1hr x 2 • Costs me $39 ? • Other charges included  Fail 2.7 : Lack of Education or RTFM! p2.16xlarge 16 GPU 64 vCPU 732 GB ram $14/hr
  • 15. Not understanding how security and compliance is managed • What are the security, compliance and image update policies? • How did they handle the last pervasive vulnerability? • Firewalls – do you get one for free? Can you configure it? What’s the default policy for firewalls with deployments? • SSL certificates – do you own and manage or do they offer a service? • How do you access your VMs ? (ssh, telnet, web?) • Passwords vs keys? • Where are the keys kept? • Can you retrieve the keys in an emergency? Fail 2.8 : Lack of Education or RTFM! You do understand penetration attack vectors?
  • 16. Misunderstanding what APIs exist • Are there APIs for all the actions you want to perform • Are they symmetrical? • Do any need human interaction to complete? • Are the APIs proprietary or standard? • Are there plugins for IaC tools? Fail 2.9 : Lack of Education or RTFM! DOH! Ask me about VM termination APIs
  • 17. Lack of a Community • What do others think of this Cloud? • Is there an active DevOps community? • Do you see active participation from the Cloud provider? Fail 2.A : Lack of Education or RTFM!
  • 18. Fail 3.0 : Not Kicking the tires enough first Poor assumptions about ‘how things work’ • For instance: • “I don’t need a public IP address for my VM as I have a private gateway” • “Now I can’t do apt-get update!” • “What do you mean I have to buy public IP addresses?”
  • 19. Fail 3.1 : Not Kicking the tires enough first • If you don’t start with IaC techniques from Day 1 you will fail. • Environments are all different • Is your memory really that good? • You must encode. • Trying by hand and then encoding into IaC • helps you learn about your target environments (APIs anyone?) • builds up an IaC asset base you’ll need in the future. “The human touch”
  • 20. Fail 3.2 : Not Kicking the tires enough first • Get a buddy - “Extreme Deployment” • Install VirtualBox and Vagrant • Build a Vagrantfile for an environment you care about • Provision locally: “vagrant up --provider=virtualbox” • Pick a Cloud. (Use the ‘free tier’!) • Try to deploy a VM by hand. • Now do “vagrant up --provider=XXXXXXX” • Examine the differences. • Add more and repeat. Look for how IP addresses are allocated. Look at the options for memory size, networking, disk space, disk types (IO speeds). What CPUs can you get? What OSes can you provision? What architectures are available? What’s the cost?
  • 21. Fail 3.3 : Not Kicking the tires enough first • Try another Cloud • Try someone's IaC pattern: an Ansible script to deploy a Docker swarm • Go wild: try to deploy OpenStack on your laptop (with 32GB) https://www.rdoproject.org/ • Now do it all again with Docker
  • 22. Not understanding that your initial deploys are the least secure • How long until your newly deployed VM is attacked? 20 seconds -> 40 minutes • So deploying and then adding vulnerability patches is not the right answer • War story: • Customer deploys a VM to Cloud. • VM gets hacked immediately. • Customer patches the VM. • Customer keeps the VM and uses it in production. • Customer gets a bill for $500,000 of network traffic. VM is now being used to host warez. Fail 3.4 : Not Kicking the tires enough first
  • 23. Fail 3.5 : Not Kicking the tires enough first • Time to think about security • If you don’t get your security posture defined before you deploy you’ll fail and possibly get some interesting bills • Maybe you’ll go out of business. • Worst case (maybe) is you have provided a gateway into your company network • Regular Vulnerability scanning & fixing. • Keys not passwords • Specific IP address access for VMs • Whitelisted access to internal systems (inside your firewall) • Whitelisted access to remote systems (on the internet) …
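The “specific IP address access” and whitelisting items above can be expressed as a simple allowlist check. This is only a sketch using Python's standard `ipaddress` module. The function name and the address ranges (taken from the reserved documentation ranges) are hypothetical, and in practice the rule belongs in your provider's firewall or security group configuration rather than in application code.

```python
import ipaddress


def is_allowed(source_ip, allowed_networks):
    """Return True if source_ip falls inside any allowlisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed_networks)


# Hypothetical allowlist: one office range and one single admin host
ALLOWED = ["203.0.113.0/24", "198.51.100.7/32"]

print(is_allowed("203.0.113.42", ALLOWED))   # True
print(is_allowed("192.0.2.10", ALLOWED))     # False
```

Starting from “deny everything, then allow named sources” is the posture the slide argues for. Define it before the first VM is deployed, not after.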
  • 24. Fail 4.0 – Ignoring unpleasant discoveries • Not all the OSes you want are there • Performance of the Cloud is less than you expected • Now you know what multi-tenancy means. • Managing VMs in the Cloud is complicated • Keeping systems secure and compliant is hard • Deployment times vary (and fail unexpectedly) • Debugging problems remotely is difficult • It costs more than you realized. • Cost is your responsibility. (No one is going to help you save money!) • Clouds fill up. So now you know some of those ‘unexpected’ restrictions. Initial cloud deployments are juicy targets for the bad guys.
  • 25. Fail 4.1 – Ignoring unpleasant discoveries • Deploy anyway. • Just run with a smaller JVM heap • OK, I get that it won’t scale – deploy anyway and we’ll fit scaling later • You’ll just have to deploy with a small budget for VMs • Use the public multi-tenancy option – it’s cheaper. • Can’t you add some sort of cache? I’m impressed by the number of customers who can change the rules of physics
  • 26. Fail 5.0 – Fudging the hard decisions • You have to pick one. Changing your mind later is going to be expensive and complicated • IaC is critical but it’s not magic. Not realizing Clouds are sticky Many of my consultancy discussions started with a company saying to itself: “It’s ok. If Cloud XXX is too expensive we’ll just move over to YYY”
  • 27. Fail 5.1 – Fudging the hard decisions Compromising the architecture because of cost. For instance: A large rich-client application used in-house in multiple locations. Plan was to consolidate into the Cloud. Network traffic between client and servers measured in TBs per day. To reduce costs, plan was to create special proxies/data caches on-prem. Consequence: increased complexity of design, poor performance, untried new system -> fail. Should have spent the money on replacing the rich client with a web-based one. Unexpected expensive items (such as network costs) can drive you to weird hybrid configurations that increase complexity and ultimately fail.
  • 28. Fail 5.2 – Fudging the hard decisions Not understanding the cost projections (old data, for example only)
    Offering | RAM cost (2015) | CPUs
    IBM Bluemix (CF) | $24.15 per GB/month | 4 vCPUs per instance
    IBM Bluemix (Containers) | $9.94 per GB/month | 4 vCPUs per GB
    run.pivotal.io | $21.60 per GB/month | 4 vCPUs per instance
    Heroku (Hobby) | $14.00 per GB/month | 1 "CPU share" per 512MB in an instance
    Heroku (Professional) | $50.00 per GB/month | 1 "CPU share" per 512MB in an instance
    Amazon EC2 (SLES) | $16.56 per GB/month | 1 vCPU per 4GB in an instance
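A cost projection from figures like these is a few lines of code, which is a good reason to do it before choosing a provider. The sketch below uses the 2015 per-GB RAM prices from this slide purely as sample data. They are old and for example only, and the function names are illustrative. It also ignores CPU, network and storage charges, which is exactly the kind of simplification that produces surprises.

```python
# 2015 RAM prices from the slide, in dollars per GB per month (example data only)
RAM_COST_PER_GB_MONTH = {
    "IBM Bluemix (CF)": 24.15,
    "IBM Bluemix (Containers)": 9.94,
    "run.pivotal.io": 21.60,
    "Heroku (Hobby)": 14.00,
    "Heroku (Professional)": 50.00,
    "Amazon EC2 (SLES)": 16.56,
}


def monthly_ram_cost(offering, gb_per_instance, instances=1):
    """Projected monthly RAM cost for a deployment, rounded to cents."""
    rate = RAM_COST_PER_GB_MONTH[offering]
    return round(rate * gb_per_instance * instances, 2)


def rank_by_cost(gb_per_instance, instances=1):
    """All offerings as (cost, name) pairs, cheapest first."""
    return sorted(
        (monthly_ram_cost(name, gb_per_instance, instances), name)
        for name in RAM_COST_PER_GB_MONTH
    )


# Ten instances with 4 GB each, cheapest offering first
for cost, name in rank_by_cost(4, instances=10):
    print(f"{name}: ${cost:.2f}/month")
```

Because the table prices RAM per GB but allocates CPU in different ways per offering, the cheapest row for memory is not necessarily the cheapest for your workload.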
  • 29. Fail 6.0 – Lack of preparation Driving straight into live deployment • Premature deployment based on the happy path will ultimately fail. • It is critical that you have exercised an end-to-end deployment and support model before you go live. • So many projects fail because of problems later. • Even simple applications need security, logging and monitoring.
  • 30. Fail 6.1 – Lack of preparation Not having a solid monitoring and diagnostics solution • Most successful cloud applications consider their monitoring solution to be the most critical part of their system. • If your monitoring solution fails – you’re running blind. • Build the monitoring system and then exercise it. • Break things, scale things, build runaway jobs. • Figure out what is important and monitor it. • Now build dashboards. • Do you get the events you need when you need them? • Are you measuring end user response times?
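“Are you measuring end user response times?” usually comes down to watching a high percentile rather than the average, because a handful of very slow requests can hide behind a healthy mean. A minimal sketch, assuming you already collect response times in milliseconds, is shown below. The nearest-rank percentile method, the sample data and the 500 ms threshold are all illustrative choices, not a recommendation from the talk.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def breaches_threshold(samples_ms, pct, threshold_ms):
    """True if the chosen percentile of response times exceeds the threshold."""
    return percentile(samples_ms, pct) > threshold_ms


# Hypothetical response times in milliseconds, with one slow outlier
response_ms = [120, 130, 110, 900, 125, 140, 135, 115, 128, 122]

print(percentile(response_ms, 50))               # 125
print(percentile(response_ms, 95))               # 900
print(breaches_threshold(response_ms, 95, 500))  # True
```

The median looks fine here while the 95th percentile does not, which is the kind of signal a dashboard should surface before a customer raises a ticket.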
  • 31. Fail 6.2 – Lack of preparation Not having enough dashboards! • My team was a traditional IT one. It responded to tickets – so customers always found the problem first. • We added dashboards and an objective: “First to Know”. • We moved from being last to know to being the one to tell the customer. • Dashboards allowed my team to see issues clearly when there was a failure and when trends showed bad things were going to happen. • Dashboards changed my team’s attitudes. They make automation and monitoring more acceptable.
  • 32. Fail 6.3 – Lack of preparation Not having a robust and automated deployment solution • After your application goes live things will go wrong. • It’s not just about having a robust application design. • How quickly you can remediate issues is dependent on your ability to deliver those fixes. • Design for Failure. "Everything fails, all the time". Werner Vogels, CTO Amazon.com • Your deployment solution is your disaster recovery solution.
  • 33. Fail 6.4 – Lack of preparation If your deployment solution is not your disaster recovery solution • Cloud location goes off-line -> can you fail over to a new location? • What happens if your database gets corrupted? • Where is your data backed up to? • Can you get the data back into the Cloud fast enough? • Who does the backups? • When was the last backup taken?
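Two of these questions, “When was the last backup taken?” and “Can you get the data back into the Cloud fast enough?”, can be turned into checks you run routinely. The sketch below is illustrative only: the function names, the maximum acceptable backup age and the bandwidth figure are assumptions, and the restore estimate ignores protocol overhead and any provider-side import time.

```python
from datetime import datetime, timedelta, timezone


def backup_is_stale(last_backup, max_age, now=None):
    """True if the most recent backup is older than the age you can tolerate."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup > max_age


def restore_seconds(data_gb, bandwidth_mbit_per_s):
    """Best-case time to push data_gb (decimal GB) over the given link."""
    bits = data_gb * 8e9
    return bits / (bandwidth_mbit_per_s * 1e6)


# Hypothetical values: last backup 30 hours ago, tolerance of 24 hours
now = datetime(2017, 1, 2, 12, 0, tzinfo=timezone.utc)
last = now - timedelta(hours=30)
print(backup_is_stale(last, timedelta(hours=24), now=now))   # True

# 1000 GB over a 100 Mbit/s link, in hours
print(restore_seconds(1000, 100) / 3600)
```

If the restore estimate is longer than the outage your business can survive, that is a design problem to fix before go-live rather than during the incident.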
  • 34. Fail 7.0 – Not enough exercise Scale testing reveals bottlenecks Even just running two instances can be revealing Break things too (chaos monkey) Your aim is to understand how well your application can react to demand Scale across Cloud locations - Data costs increase? Response times get worse? Timeouts occur? Scale testing reveals design issues in application and infrastructure. Things you want to know about before you go live. And tells you if your monitoring is going to be any use Not testing how your application scales
  • 35. Fail 7.1 – Not enough exercise Failing to scale appropriately costs money [Chart: Demand vs. Provisioned capacity for periods a to j, y-axis 0 to 120]
  • 36. Fail 7.2 – Not enough exercise Failing to scale appropriately costs money [Chart: Demand vs. Provisioned capacity for periods a to j, y-axis 0 to 120]
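The two charts compare demand with provisioned capacity over time. The gap between the lines is what costs money: capacity above demand is paid for but idle, and demand above capacity is work you could not serve. A minimal sketch of that comparison follows. The numbers are made up for illustration and are not read from the original charts.

```python
def provisioning_gap(demand, provisioned):
    """Return (wasted, unmet) capacity units summed over all periods."""
    if len(demand) != len(provisioned):
        raise ValueError("series must have the same length")
    wasted = sum(max(p - d, 0) for d, p in zip(demand, provisioned))
    unmet = sum(max(d - p, 0) for d, p in zip(demand, provisioned))
    return wasted, unmet


# Hypothetical demand over six periods
demand = [20, 40, 60, 100, 80, 40]

# Provisioning for the peak everywhere wastes capacity
print(provisioning_gap(demand, [100] * 6))   # (260, 0)

# Provisioning below the peak saves money but drops demand
print(provisioning_gap(demand, [60] * 6))    # (80, 60)

# Scaling to match demand removes both gaps
print(provisioning_gap(demand, demand))      # (0, 0)
```

Multiply the wasted units by your unit price and the unmet units by the cost of a failed request, and you have a first estimate of what not scaling appropriately costs.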
  • 37. Fail 8.0 – Too much excitement Projects can fail because of an excess of enthusiasm • “Let’s take the opportunity to rewrite the application” • “Let’s use this new tech” • Often fails due to a lack of situational awareness of the state of play in the industry • It’s easy to get carried away.
  • 38. Fail 9.0 – Staged deployment Going from Lift and Shift to what? • You can lift and shift. It’s probably going to bite you. • Unexpected dependencies on local items such as C:/ or local services and servers (authentication servers etc.) • Consider your options • The “strangler pattern” – staged conversion to microservices • Time for a rewrite? • Look at new options - “serverless”? • BTW – adding in sufficient debug capability can be just as expensive and increase risk. How far into the woods do you want to go?
  • 39. Fail A.0 - A few other things • Cloud providers often offer additional services • Why build your own when you can use a provided one? • Skill sets • We have lots of tech experts but not that many systems experts. • Take a look at your team. Do they have the skills and experience you need? • IaC & DevOps skills? • Some parts of your process are going to become more critical than before • Who’s doing the data backups? • Who owns your build and test infrastructure? • Deployment process • How long does it take to deploy a change? • Does your team understand the importance of the process?
  • 40. Wrap up • Moving anything into a cloud environment is always a challenge • Lack of clarity around why you want to do this will cost you money, sleep and probably doom the project • Be sure your team is skilled and committed. It’s their sleep too • Most of the projects that fail – fail because of the approach. Not the technology • But not understanding the economic drivers on systems will also lead to fail
  • 41. Fail to adapt -> Fail • How you design, code, deploy, debug, support etc. will be affected by the metrics and limits imposed on you. • Financial metrics and limits always change behavior. They also create opportunity. • You will have to learn new techniques and tools. • Applications have to get leaner and meaner. https://www.flickr.com/photos/beigephotos/