Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail

Succeeding in the Cloud: the guidebook of Fail

Steve Poole – IBM
Making Java Real Since Version 0.9
DevOps Practitioner @spoole167

This talk
• Comes from personal and team experience as the leader of a DevOps team
• Comes from weekly consultancy with product teams and external customers

Agenda of Fail
• Fail 0 – Believing Migration to Cloud is easy
• Fail 1 – No Clarity of Purpose
• Fail 2 – Lack of education
• Fail 3 – Not kicking the tires enough first
• Fail 4 – Ignoring unpleasant discoveries
• Fail 5 – Fudging the hard decisions
• Fail 6 – Lack of preparation
• Fail 7 – Not enough exercise
• Fail 8 – Too much excitement
• Fail 9 - Big bang deployment
• Fail A – A few other things

Fail 0.0 : Believing Migration to Cloud is Easy
• ‘Cloud’ is not easy
• It may be self-service but don’t be fooled
• It may look like a nice walk into the forest to grandma’s house...
• Get yourself together for a large and painful exercise.
• Ever moved a Data Centre?
• Experience is key.
• Staff. Who’s going to do this – are they qualified?
• Prepare to change your plans
• Most migrations require architectural design changes within the first 6 months
• Half of all projects fail
• Half of all projects will need significant increases in budget
• Think it through
• Projects fail later on when new objectives get added. Be clear on your ultimate goal
Emigration, not Migration (Migration suggests it’s something you want to do annually)

Fail 1.0: No Clarity of Purpose
• There are many reasons for moving applications to the ’Cloud’
• There are many types of application
• There are many ‘Clouds’ to move to
• What’s the chance of you getting it right first time?
• What’s the consequence of failure?
• Will you even know it has failed in time to recover?
• Clarity of purpose reduces your risk
• Clarity of purpose gives you focus

Not understanding the communications process
• How do they talk to you?
• What’s the ticketing system?
• How do you get told of a problem?
• How do you get told of planned outages?
• How much notice do you get for planned outages?
• How do you raise a problem?
• How do you ESCALATE?
• What is the communications SLA here?
• Know your rights

Fail 2.0 : Lack of Education or RTFM!
DOH! Ask me about passwords

I was a single point of failure
And I didn’t even know it
I thought I was in control of my account until I needed my password reset
I had no idea where the reset email was going
Cloud support could trigger the reset but wouldn’t/couldn’t tell me more.
They suggested I go to my admin, which I thought was me.
Turns out there’s a corporate owner of the accounts. It took me days to resolve.

The one thing you should remember from this talk
We’re techies. We get excited about APIs. We understand APIs.
Moving to the Cloud means giving your data, applications, security etc. to a 3rd party.
That means the ‘API’ extends into the human world. The contract and its SLA define what you can and cannot do when using Cloud services.
Cloud providers benefit from economies of scale and have large numbers of customers, just like the more usual service providers you use at home: gas, electricity, broadband, satellite TV.
You know how that can work at home. Cloud provisioning is much more complicated.

Not understanding the Service Level Agreement
• Does it have location specific differences?
• How is the SLA measured?
• How well defined are the criteria?
• How are issues resolved?
• What are your responsibilities?
• If you don’t know your SLA you will fail
Example: Can you assess free capacity? If a location is at capacity, what happens?

Not understanding the Service Level Agreement (2)
• True story
• Go to a service provider’s SLA dashboard
• Service says SLA availability of 99.95%
• I think that means downtime of (figures from https://uptime.is/):
  Daily: 43.2s
  Weekly: 5m 2.4s
  Monthly: 21m 54.9s
  Yearly: 4h 22m 58.5s
• Turns out that actual availability is 95.8%, which means downtime of:
  Daily: 1h 0m 28.8s
  Weekly: 7h 3m 21.6s
  Monthly: 1d 6h 40m 49.3s
  Yearly: 15d 8h 9m 52.0s

Not understanding the Service Level Agreement (3)
• True story
• The difference is because the provider has a planned daily outage of 1hr
• They still claim 99.95%
• Gets worse.
• Outages beyond their control don’t ‘count’ either.
• 85% availability means downtime of:
  Daily: 3h 36m 0.0s
  Weekly: 1d 1h 12m 0.0s
  Monthly: 4d 13h 34m 21.9s
  Yearly: 54d 18h 52m 22.8s
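
The downtime figures quoted on these slides follow directly from the availability percentage, so it is worth being able to reproduce them when reading an SLA. The following is a minimal sketch, not the calculator the slides link to: the function names are made up for illustration, and it assumes a 365.2425-day year with a month taken as one twelfth of that.

```python
# Sketch: convert an availability percentage into permitted downtime.
# Assumption: a year is 365.2425 days and a month is 1/12 of a year.

SECONDS_PER_DAY = 24 * 60 * 60
PERIODS = {
    "Daily": SECONDS_PER_DAY,
    "Weekly": 7 * SECONDS_PER_DAY,
    "Monthly": 365.2425 / 12 * SECONDS_PER_DAY,
    "Yearly": 365.2425 * SECONDS_PER_DAY,
}


def downtime_seconds(availability_pct, period_seconds):
    """Seconds of downtime permitted in a period at the given availability."""
    return (1 - availability_pct / 100) * period_seconds


def format_duration(seconds):
    """Render seconds in the style '1h 0m 28.8s'."""
    seconds = round(seconds, 1)  # round first so we never print '60.0s'
    days, rest = divmod(seconds, SECONDS_PER_DAY)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    parts = []
    if days:
        parts.append(f"{int(days)}d")
    if days or hours:
        parts.append(f"{int(hours)}h")
    if days or hours or minutes:
        parts.append(f"{int(minutes)}m")
    parts.append(f"{secs:.1f}s")
    return " ".join(parts)


if __name__ == "__main__":
    for pct in (99.95, 95.8, 85.0):
        print(f"{pct}%")
        for name, length in PERIODS.items():
            print(f"  {name}: {format_duration(downtime_seconds(pct, length))}")
```

Running the same arithmetic against the numbers in your own contract, including any planned or excluded outages, shows quickly whether the headline percentage is what you will actually get.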

Not understanding the cost model
• Units of cost.
• CPU / RAM / Network / Storage / IP Addresses ….
• Penalty costs if you overrun?
• When does the time start and end?
• Costs change by location?
DOH! Ask me about GPUs

Not understanding the cost model
• I’m testing new GPU support in IBM’s JVM 8.0
• IBM has GPU support in SoftLayer
• Amazon has GPU support in AWS
• I want to do some scale performance testing
• Got my VirtualBox and Ansible config
• Point it at AWS. Deploy < 1hr x 2
• Costs me $39?
• Other charges included
p2.16xlarge: 16 GPUs, 64 vCPUs, 732 GB RAM, $14/hr
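
One plausible reading of that bill can be sketched in a few lines. This is an illustration only: it assumes each partial hour is billed as a full hour (check the provider's actual billing granularity), the session lengths are invented, and `instance_cost` is a hypothetical helper, not a provider API.

```python
import math


def instance_cost(session_minutes, hourly_rate):
    """Cost of several sessions when every started hour is billed in full.

    Assumption: per-hour billing that rounds partial hours up.
    """
    billed_hours = sum(math.ceil(minutes / 60) for minutes in session_minutes)
    return billed_hours * hourly_rate


# Two deployments, each under an hour (the durations are illustrative guesses).
compute = instance_cost([50, 45], hourly_rate=14)
other_charges = 39 - compute  # whatever is left of the $39 bill

if __name__ == "__main__":
    print(f"compute: ${compute}, other charges: ${other_charges}")
```

Under these assumptions the instance time accounts for $28, leaving about $11 for the other billable units (storage, network, IP addresses and so on).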

Not understanding how security and compliance is managed
• What are the security, compliance and image update policies?
• How did they handle the last pervasive vulnerability?
• Firewalls – do you get one for free? Can you configure it? What’s the default policy
for firewalls with deployments?
• SSL certificates – do you own and manage or do they offer a service?
• How do you access your VMs ? (ssh, telnet, web?)
• Passwords vs keys?
• Where are the keys kept?
• Can you retrieve the keys in an emergency?
You do understand penetration attack vectors?

Misunderstanding what APIs exist
• Are there APIs for all the actions you want to perform?
• Are they symmetrical?
• Do any need human interaction to complete?
• Are the APIs proprietary or standard?
• Are there plugins for IaC tools?
DOH! Ask me about VM termination APIs

Lack of a Community
• What do others think of this Cloud?
• Is there an active DevOps community?
• Do you see active participation from the Cloud provider?
Fail 2.A : Lack of Education or RTFM!

Fail 3.0 : Not Kicking the tires enough first
Poor assumptions about ’how things work’
• For instance:
• “I don’t need a public IP address for my VM as I have a private gateway”
• “Now I can’t do apt-get update!”
• “What do you mean I have to buy public IP addresses?”

• If you don’t start with IaC techniques from Day 1 you will fail.
• Environments are all different
• Is your memory really that good?
• You must encode.
• Trying by hand and then encoding into IaC
• helps you learn about your target environments (APIs, anyone?)
• builds up an IaC asset base you’ll need in the future.
“The human touch”

• Get a buddy - “Extreme Deployment”
• Install VirtualBox and Vagrant
• Build a Vagrantfile for an environment you care about
• Provision locally “vagrant up --provider=virtualbox”
• Pick a Cloud. (Use the ’free tier’!)
• Try to deploy a VM by hand.
• Now do “vagrant up --provider=XXXXXXX”
• Examine the differences..
• Add more and repeat
Look for how IP addresses are allocated.
Look at the options for memory size, networking, disk space, disk types (IO speeds)
What CPUs can you get?
What OSes can you provision?
What architectures are available?
What’s the cost?
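
The “examine the differences” step is easier to repeat if what you observe is written down as data rather than remembered. A minimal sketch of that idea follows; every value in it is hypothetical, and `diff_environments` is an illustrative helper, not part of Vagrant or any cloud SDK.

```python
def diff_environments(local, cloud):
    """Return {setting: (local_value, cloud_value)} for every setting that differs."""
    keys = sorted(set(local) | set(cloud))
    return {
        key: (local.get(key), cloud.get(key))
        for key in keys
        if local.get(key) != cloud.get(key)
    }


# Hypothetical observations from a local VirtualBox VM and a cloud VM.
local_vm = {
    "os": "ubuntu-16.04",
    "memory_mb": 2048,
    "cpus": 2,
    "public_ip": None,
    "disk_type": "vdi",
}
cloud_vm = {
    "os": "ubuntu-16.04",
    "memory_mb": 2048,
    "cpus": 1,
    "public_ip": "billed separately",
    "disk_type": "ssd",
}

if __name__ == "__main__":
    for setting, (local_value, cloud_value) in diff_environments(local_vm, cloud_vm).items():
        print(f"{setting}: local={local_value!r} cloud={cloud_value!r}")
```

Keeping these records per cloud gives you a running list of the assumptions that did not survive contact with each environment.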

Try another Cloud
Try someone's IaC pattern
Ansible script to deploy a docker swarm
Go wild:
Try to deploy OpenStack on your laptop (with 32GB)
https://www.rdoproject.org/
Now do it all again with Docker

Not understanding that your initial deploys are the least secure
• How long until your newly deployed VM is attacked? Anywhere from 20 seconds to 40 minutes
• So deploying and then adding vulnerability patches is not the right answer
• War story:
• Customer deploys a VM to Cloud.
• VM gets hacked immediately
• Customer patches the VM.
• Customer keeps the VM and uses it in production
• Customer gets a bill for $500,000 of network traffic. The VM is now being used to host warez

• Time to think about security
• If you don’t get your security posture defined before you deploy, you’ll fail and possibly get some interesting bills
• Maybe you’ll go out of business.
• Worst case (maybe) is you have provided a gateway into your company network
• Regular Vulnerability scanning & fixing.
• Keys not passwords
• Specific IP address access for VMs
• Whitelisted access to internal systems (inside your firewall)
• Whitelisted access to remote systems (on the internet) …
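
“Specific IP address access” and whitelisting both come down to checking a source address against a short list of permitted networks. Real deployments would express this in the provider's firewall or security-group rules; the sketch below only illustrates the logic using Python's standard `ipaddress` module. The networks are documentation ranges (RFC 5737) standing in for real ones, and `is_allowed` is a hypothetical helper.

```python
import ipaddress

# Hypothetical allowlist: RFC 5737 documentation ranges, not real networks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),      # e.g. an office range
    ipaddress.ip_network("198.51.100.17/32"),  # e.g. a single jump host
]


def is_allowed(source_ip, networks=ALLOWED_NETWORKS):
    """True if source_ip falls inside any permitted network; deny by default."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    for ip in ("192.0.2.44", "198.51.100.17", "198.51.100.18", "203.0.113.5"):
        print(ip, "allowed" if is_allowed(ip) else "denied")
```

The point of the default-deny shape is that a freshly deployed VM is unreachable from everywhere except the addresses you have deliberately listed.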

Fail 4.0 – Ignoring unpleasant discoveries
• Not all the OSes you want are there
• Performance of the Cloud is less than you expected
• Now you know what multi-tenancy means.
• Managing VMs in the Cloud is complicated
• Keeping systems secure and compliant is hard
• Deployment times vary (and fail unexpectedly)
• Debugging problems remotely is difficult
• It costs more than you realized.
• Cost is your responsibility. (No one is going to help you save money!)
• Clouds fill up
So now you know some of those ‘unexpected’ restrictions
Initial cloud deployments are juicy targets for the bad guys

Fail 4.1 – Ignoring unpleasant discoveries
• Deploy anyway.
• Just run with a smaller JVM heap
• OK, I get it won’t scale – deploy anyway and we’ll fit scaling in later
• You’ll just have to deploy with a small budget for VMs
• Use the public multitenancy option – it’s cheaper.
• Can’t you add some sort of cache?
I’m impressed by the number of customers who can change the rules of physics

Fail 5.0 – Fudging the hard decisions
Not realizing Clouds are sticky
Many of my consultancy discussions started with a company saying to itself:
“It’s ok. If Cloud XXX is too expensive we’ll just move over to YYY”
• You have to pick one. Changing your mind later is going to be expensive and complicated
• IaC is critical but it’s not magic.

Compromising the architecture because of cost
Unexpected expensive items (such as network costs) can drive you to weird hybrid configurations that increase complexity and ultimately fail
For instance:
A large rich-client application used in-house in multiple locations. The plan was to consolidate into the Cloud.
Network traffic between client and servers was measured in TBs/day
To reduce costs, the plan was to create special proxies/data caches on-prem
Consequence: increased complexity of design, poor performance, untried new system -> fail.
Should have spent the money on replacing the rich client with a web-based one.

Not understanding the cost projections
Old data, for example only

Offering                  RAM Cost (2015)   CPUs
IBM Bluemix (CF)          $24.15 GB/Month   4 vCPUs per instance
IBM Bluemix (Containers)  $ 9.94 GB/Month   4 vCPUs per GB
run.pivotal.io            $21.60 GB/Month   4 vCPUs per instance
Heroku (Hobby)            $14.00 GB/Month   1 "CPU share" per 512MB in an instance
Heroku (Professional)     $50.00 GB/Month   1 "CPU share" per 512MB in an instance
Amazon EC2 (SLES)         $16.56 GB/Month   1 vCPU per 4GB in an instance
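
A projection from per-GB prices like these is simple multiplication, which is exactly why it is easy to skip. The sketch below uses the 2015 RAM prices listed on the slide (old data, for example only); the application size and instance count are invented, and `monthly_ram_cost` is a hypothetical helper.

```python
# RAM price in dollars per GB per month, taken from the 2015 figures on the slide.
RAM_PRICE_PER_GB_MONTH = {
    "IBM Bluemix (CF)": 24.15,
    "IBM Bluemix (Containers)": 9.94,
    "run.pivotal.io": 21.60,
    "Heroku (Hobby)": 14.00,
    "Heroku (Professional)": 50.00,
    "Amazon EC2 (SLES)": 16.56,
}


def monthly_ram_cost(gb_per_instance, instances=1):
    """Monthly RAM cost per offering for a given footprint, in dollars."""
    total_gb = gb_per_instance * instances
    return {
        offering: round(price * total_gb, 2)
        for offering, price in RAM_PRICE_PER_GB_MONTH.items()
    }


if __name__ == "__main__":
    # Hypothetical footprint: 4 instances of 2 GB each.
    costs = monthly_ram_cost(gb_per_instance=2, instances=4)
    for offering, cost in sorted(costs.items(), key=lambda item: item[1]):
        print(f"{offering:26} ${cost:8.2f}")
```

This compares RAM only. The CPU allocation differs per offering, so the cheapest line here is not necessarily the cheapest way to run a CPU-bound application.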

Fail 6.0 – Lack of preparation
Driving straight into live deployment
Premature deployment based on happy path will ultimately fail
It is critical that you have exercised an end-to-end deployment and support model before you go live
So many projects fail because of problems later.
Even simple applications need security, logging and monitoring

Not having a solid monitoring and diagnostics solution
Most successful cloud applications consider their monitoring solution to be the most critical part of their system
If your monitoring solution fails – you’re running blind
Build the monitoring system and then exercise it
Break things, scale things, build runaway jobs
Figure out what is important and monitor it
Now build dashboards
Do you get the events you need when you need them?
Are you measuring end user response times?
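
For end user response times, an average hides the slow requests that users actually complain about, so a percentile is usually a better thing to put on a dashboard. A minimal sketch using the nearest-rank method follows; the sample data and the one-second target are invented, and the function names are illustrative rather than taken from any monitoring product.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breaches_target(samples, pct, target_seconds):
    """True if the given percentile of response times exceeds the target."""
    return percentile(samples, pct) > target_seconds


# Hypothetical response times in seconds; one request was very slow.
response_times = [0.21, 0.25, 0.19, 0.30, 0.22, 0.24, 2.80, 0.23, 0.26, 0.20]

if __name__ == "__main__":
    print("median:", percentile(response_times, 50))
    print("p95:", percentile(response_times, 95))
    print("p95 over 1s target:", breaches_target(response_times, 95, 1.0))
```

Here the median looks healthy while the 95th percentile does not, which is the kind of signal a “First to Know” team wants before a ticket arrives.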

Not having enough dashboards!
My team was a traditional IT one.
Responded to tickets – so customers always found the problem first
We added dashboards and an objective: “First to Know”
We moved from being last to know to being the one to tell the customer.
Dashboards allowed my team to see issues clearly when there was a failure and when trends showed bad things were going to happen.
Dashboards changed my team’s attitudes. They made automation and monitoring more acceptable

Not having a robust and automated deployment solution
After your application goes live things will go wrong
It’s not just about having a robust application design.
How quickly you can remediate issues is dependent on your ability to deliver those fixes
Design for Failure. "Everything fails, all the time". Werner Vogels, CTO, Amazon.com
Your deployment solution is your disaster recovery solution

If your deployment solution is not your disaster recovery solution
Cloud location goes off-line -> can you fail over to a new location?
What happens if your database gets corrupted?
Where is your data backed up to?
Can you get the data back into the Cloud fast enough?
Who does the backups?
When was the last backup taken?

Fail 7.0 – Not enough exercise
Not testing how your application scales
Scale testing reveals bottlenecks
Even just running two instances can be revealing
Break things too (chaos monkey)
Your aim is to understand how well your application can react to demand
Scale across Cloud locations - Data costs increase? Response times get worse? Timeouts occur?
Scale testing reveals design issues in application and infrastructure: things you want to know about before you go live. It also tells you if your monitoring is going to be any use

Failing to scale appropriately costs money
[Chart: Demand vs. Provisioned capacity, y-axis 0 to 120, x-axis periods a to j]
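
The cost of a mismatch between demand and provisioned capacity can be put into numbers once both series are recorded. The sketch below is illustrative only: the chart's actual values are not recoverable here, so both series are invented, and `provisioning_gap` is a hypothetical helper.

```python
def provisioning_gap(demand, provisioned):
    """Return (idle, unmet) summed over all periods.

    idle:  capacity that was paid for but not used
    unmet: demand that the provisioned capacity could not serve
    """
    idle = sum(max(p - d, 0) for d, p in zip(demand, provisioned))
    unmet = sum(max(d - p, 0) for d, p in zip(demand, provisioned))
    return idle, unmet


# Hypothetical series for periods a to j (units are arbitrary).
demand = [20, 40, 60, 100, 80, 40, 20, 60, 110, 30]
provisioned = [60, 60, 60, 60, 100, 100, 60, 60, 60, 120]

if __name__ == "__main__":
    idle, unmet = provisioning_gap(demand, provisioned)
    print(f"idle capacity paid for: {idle}, demand not served: {unmet}")
```

Idle capacity is money spent for nothing, and unmet demand is lost business or failed requests, so both numbers are costs of scaling late or in the wrong direction.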

Fail 8.0 – Too much excitement
Projects can fail because of an excess of enthusiasm
“Let’s take the opportunity to rewrite the application”
“Let’s use this new tech”
Often fails due to a lack of situational awareness of the state of play in the industry
It’s easy to get carried away.

Fail 9.0 – Staged deployment
Going from Lift and Shift to what?
You can lift and shift. It’s probably going to bite you: unexpected dependencies on local items such as C:/ or local services and servers (authentication servers etc.)
Consider your options
The “strangler pattern” – staged conversion to micro services
Time for a rewrite?
Look at new options - “serverless” ?
BTW – adding in sufficient debug capability can be just as expensive and increase risk
How far into the woods do you want to go?

Fail A.0 - A few other things
• Cloud providers often offer additional services
• Why build your own when you can use a provided one?
• Skill sets
• We have lots of tech experts but not that many systems experts.
• Take a look at your team. Do they have the skills and experience you need?
• IaC & DevOps skills?
• Some parts of your process are going to become more critical than before
• Who’s doing the data backups?
• Who owns your build and test infrastructure?
• Deployment process
• How long does it take to deploy a change?
• Does your team understand the importance of the process?

Wrap up
• Moving anything into a cloud environment is always a challenge
• Lack of clarity around why you want to do this will cost you money, sleep and probably doom the project
• Be sure your team is skilled and committed. It’s their sleep too
• Most of the projects that fail – fail because of the approach. Not the technology
• But not understanding the economic drivers on systems will also lead to fail

Fail to adapt -> Fail
How you design, code, deploy, debug, support etc. will be affected by the metrics and limits imposed on you.
Financial metrics and limits always change behavior. They also create opportunity
You will have to learn new techniques and tools
Applications have to get leaner and meaner
https://www.flickr.com/photos/beigephotos/
