Going to the
CLOUD!
DISCLAIMER:
This talk is about work in progress. Completeness
and accuracy aren't guaranteed beyond best effort
Starting point
●
Old hardware
●
A lot of profitable legacy software
●
OpenStack + bare metal
●
Working CI/CD
●
Working configuration management
●
Small infrastructure team
●
Software is an essential business component, but our
business is not software
●
Developers are on call for production application issues
Cloud considerations
●
Scaling
– Cloud systems let you scale in smaller increments on demand
●
Variability in demand
– Low variability in demand for computing resources supports staying in-house
– Highly variable systems benefit from moving to the cloud far more
●
Legal issues
– Privacy regulations in the EU itself
●
Also different laws between different EU countries
– Brexit
●
Software design
– Observability must be built into the software
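The point about variability in demand can be illustrated with a rough cost comparison. This is only a sketch: the prices and demand profiles below are made-up numbers, not figures from this talk, and the assumption is that owned hardware is cheaper per server-hour but has to be sized for the peak.

```python
def in_house_cost(demand, price_per_server_hour):
    """Fixed capacity: pay for peak capacity in every hour."""
    return max(demand) * len(demand) * price_per_server_hour


def cloud_cost(demand, price_per_server_hour):
    """On-demand capacity: pay only for what each hour needs."""
    return sum(demand) * price_per_server_hour


# Servers needed per hour over one day (hypothetical profiles).
steady = [10] * 24            # flat load
spiky = [2] * 20 + [10] * 4   # same peak, mostly idle

IN_HOUSE_PRICE = 1.0  # assumed cost units per server-hour
CLOUD_PRICE = 1.5     # assumed cost units per server-hour

for name, demand in (("steady", steady), ("spiky", spiky)):
    print(name,
          in_house_cost(demand, IN_HOUSE_PRICE),
          cloud_cost(demand, CLOUD_PRICE))
```

With these assumed numbers the steady profile is cheaper in-house, while the spiky profile is cheaper on demand, even though the per-hour cloud price is higher.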
Vendor Choices
●
Already using Docker
●
Already moving to microservices
●
Moving from Mesos to Kubernetes was easy
●
This made Google's Cloud offering a slightly better
choice than Amazon
– Google being cheaper helped a bit
●
Neither was cheaper than running our own hardware
– Savings mostly come from the lack of a dedicated operations
group, and from being able to avoid some HA requirements
The technical research phase
●
Lasted about half a year
●
Focus on two main areas:
– How to manage infrastructure manually at the
vendor
– Tooling and automation
Why manual work?
●
Familiarisation
– Terminology
●
Concepts
●
Discover limitations
– There are a lot of those
– Some more interesting than others (load balancing,
IPv6, DNS, ...)
Choosing automation tools
●
Shell scripts
– Via gcloud + gsutil
●
Ansible
– We had Ansible experience
– Built some systems with Ansible
– Very limited in what it can do without using gcloud
●
Puppet
– Was not a serious contender six months ago
●
Terraform
– The best of the lot
●
It has improved a lot since this slideset was first made
Configuration management
●
Stateless systems implemented in a 12-factor style are
best put in containers and managed via Kubernetes
– Alternatively, use what Google calls managed groups and
spin up VMs automatically in case of crashes
●
We still need configuration management for systems
which aren't in a container
●
Puppet was the obvious choice, because we were
already using it
– It doesn’t matter which specific tool you use, but use one.
Inventory
●
There isn't a nice CMDB out there yet, which
can automagically provision VMs in the cloud
and provide information to config-mgmt and
orchestration tools
– We currently hack our way around this by using
tags and the Google API
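A tag-based inventory of this kind might look roughly like the sketch below. It is not the tooling used in this project: the instance records are hand-written sample data shaped like Compute Engine instance resources (network tags under `tags.items`), and the function name is hypothetical.

```python
from collections import defaultdict


def inventory_by_tag(instances):
    """Group instance names by network tag.

    `instances` is a list of dicts shaped like Compute Engine instance
    resources, where network tags live under instance["tags"]["items"].
    """
    groups = defaultdict(list)
    for inst in instances:
        for tag in inst.get("tags", {}).get("items", []):
            groups[tag].append(inst["name"])
    return {tag: sorted(names) for tag, names in groups.items()}


# Hand-written sample data standing in for an API response.
sample = [
    {"name": "web-1", "tags": {"items": ["web", "puppet"]}},
    {"name": "web-2", "tags": {"items": ["web", "puppet"]}},
    {"name": "db-1", "tags": {"items": ["db", "puppet"]}},
    {"name": "scratch"},  # untagged instances are simply left out
]

print(inventory_by_tag(sample))
```

The resulting mapping can then be handed to configuration management or orchestration tools as a group-to-hosts inventory.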
Moving into high speed
●
One meeting
– Three people
– Thirty minutes
●
Decided on goals for a proof of concept
– Complete automation
– Custom tooling around the application
– Fixed target application for a test deployment
●
Took us about three months of full-time effort to wrap
up the PoC
Tools of choice
●
Terraform
– This is a pretty fast-moving tool
– They have good documentation
●
For some value of good.
– Getting your first bits and pieces working is harder than it
should be, but the rest then follows pretty easily
●
Puppet
– New Puppet repo, ignoring a lot of legacy.
– Jumped Puppet version
– Discarded large parts of the module approach recommended in
Puppet documentation
Terraform
●
Base network project
– All network related things are done in this project
●
Other projects use an instance group with a
mostly standard template
– They reference network configs from the base
project
●
Google metadata is used to tie together Puppet
and Terraform
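One way such a metadata link could work is sketched below, under stated assumptions: Terraform sets custom metadata attributes on each instance, and a small script on the VM turns them into Facter external facts for Puppet. The attribute names (`puppet_role`, `puppet_env`) are hypothetical and the talk does not say this is how it was done. The metadata URL and the `Metadata-Flavor: Google` header are the standard Compute Engine metadata server interface.

```python
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"
)


def metadata_request(key):
    """Build a request for one custom metadata attribute of this instance."""
    return urllib.request.Request(
        METADATA_URL + key,
        headers={"Metadata-Flavor": "Google"},  # required by the metadata server
    )


def fetch_attributes(keys):
    """Fetch the given attributes; this only works on a Compute Engine VM."""
    result = {}
    for key in keys:
        with urllib.request.urlopen(metadata_request(key), timeout=2) as resp:
            result[key] = resp.read().decode("utf-8").strip()
    return result


def to_external_facts(attributes):
    """Render attributes as key=value lines, the plain-text format
    Facter accepts for external facts."""
    return "".join(
        f"{key}={value}\n" for key, value in sorted(attributes.items())
    )


# Offline example with hypothetical attribute names.
print(to_external_facts({"puppet_role": "webserver", "puppet_env": "production"}))
```

On a real VM, the output of `to_external_facts(fetch_attributes([...]))` would be written to a `.txt` file in Facter's external facts directory so that Puppet can classify the node.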
Shared backends
●
We started with a simple backend for Terraform,
with no remote state.
– This does not scale to many users, but for the initial proof
of concept was useful.
●
We then spent a few days very carefully refactoring
this into per project state, with the shared state
being remote in a cloud storage bucket.
– https://charity.wtf/2016/03/30/terraform-vpc-and-why-you-want-a-tfstate-file-per-env/
is a pretty good horror story of what could go wrong
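Per-project remote state can be sketched as one backend configuration per project, all pointing at the same bucket but with different prefixes. The example below generates Terraform's JSON configuration syntax for the `gcs` backend; the bucket name, project names and prefix layout are hypothetical, not the ones used here.

```python
import json


def gcs_backend(bucket, project):
    """Return a Terraform JSON configuration selecting the "gcs" backend,
    with a separate state prefix for each project."""
    return {
        "terraform": {
            "backend": {
                "gcs": {
                    "bucket": bucket,
                    "prefix": f"terraform/state/{project}",
                }
            }
        }
    }


# Hypothetical bucket and project names.
for project in ("base-network", "webshop"):
    print(f"# {project}/backend.tf.json")
    print(json.dumps(gcs_backend("example-tfstate", project), indent=2))
```

Keeping a distinct prefix per project means a mistake in one project's state cannot overwrite another's, which is the failure mode the linked post describes.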
* Documentation
* API
* Stateful data
* IPv6
* Secrets
Google Cloud Documentation
●
Lags behind software
●
Is often inconsistent
●
This has not changed in about three years
– This is not limited to Google though.
API
●
Quite inconsistent in some regards
– Particularly about referencing other properties
– Name or reference?
●
Needs actual examples
– A lot of examples
●
This has not really improved since I first wrote this talk
Stateful data
●
There are no good answers for high availability
●
Google offers multiple options for storage
– Some of these are more reliable than others
– But they are more complex to use
– Or involve code changes
●
Maintenance can cause outages
– Automatic failover for Cloud SQL needs a whole zone to fail, so
maintenance can cause an unexpected outage
●
You may need to run your own database systems for more
reliable access to structured data
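Whichever database option is chosen, applications may need to tolerate short outages such as a maintenance window. The sketch below shows one common application-side mitigation, retrying with exponential backoff. It is an illustration rather than something described in this talk, and the function names and the simulated failure are hypothetical.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep,
                    retry_on=(ConnectionError,)):
    """Run operation(), retrying with exponential backoff on connection errors."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)


# Simulated database call that fails twice before succeeding.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("database unavailable")
    return "row"


delays = []  # record the delays instead of actually sleeping
print(call_with_retry(flaky_query, sleep=delays.append), delays)
```

Retries only cover brief interruptions; they do not replace a highly available data store.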
IPv6
●
Google does not put its money where its
mouth is with respect to IPv6
– IPv6 support is very limited in the compute
environment
●
We started off by routing IPv6 traffic to our
loadbalancers in the legacy environment and
then proxying to IPv4 in Google
– This is no longer needed
Secrets
●
If you have containers, Google supports encrypted secrets.
●
Using Vault from HashiCorp looks like a good option, but you
still need to code applications to use those secrets instead of
reading from a config file
●
Anything else which works with your configuration management
system is a good idea (eyaml with Puppet, for example)
– You still have the problem of managing a few master
encryption keys
●
We tested hiera-vault, but performance was terrible
Loadbalancing
●
Google’s load balancer offering is limited in
some ways compared to more advanced
load balancers such as F5
●
We chose to replace the hardware LBs with
simple IP based load balancer + nginx proxies
– Note that code which tracks IP addresses or does
geolocation needs to change to handle this.
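Once requests pass through proxies, the address the application sees is the proxy's, so the client address has to be recovered from a header such as `X-Forwarded-For`. The sketch below shows one common way to do that; the proxy range `10.0.0.0/8` and the function name are assumptions for illustration, not the configuration used here.

```python
import ipaddress


def client_ip(remote_addr, forwarded_for, trusted_proxies):
    """Return the client address for a request that may have passed
    through trusted proxies.

    Walks X-Forwarded-For from right to left and returns the first
    address that is not one of our own proxies.
    """
    trusted = [ipaddress.ip_network(net) for net in trusted_proxies]

    def is_trusted(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in trusted)

    if not is_trusted(remote_addr):
        return remote_addr  # direct connection: ignore any header
    hops = [h.strip() for h in forwarded_for.split(",") if h.strip()]
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    return remote_addr


PROXIES = ["10.0.0.0/8"]  # hypothetical address range of the nginx proxies
print(client_ip("10.0.0.5", "203.0.113.7, 10.0.0.9", PROXIES))
```

The header is only honoured when the connection itself comes from a trusted proxy, since clients can otherwise set it to anything they like.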
Monitoring
●
Stackdriver looks promising for log management
– It has quite a few retention limitations
– New pricing makes it cheaper to run an ELK stack,
depending on log volume
●
Stackdriver is a good replacement for the ELK
stack, but not for high quality analytics/monitoring
●
There isn't a really good alternative to running your
own time-series database
– Especially if you use that data for alerting
Legacy code
●
Plan on migrating it wholesale
– Even if you plan to rewrite it
●
Rewrites will take longer than you plan for
– Even your planned migrations will take longer than
expected, because of environmental assumptions.
●
Legacy code does not benefit from moving to the cloud
– You are just running it in an environment with
different assumptions on latency and reliability
Spectre/Meltdown impact
●
CPU utilisation doubles
– We are currently on rather over-provisioned
hardware, so actual impact is minimal
●
Anything which does a lot of system calls is
slowed quite a bit
– Large data import went from 26 hours to 56 hours
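The numbers above work out as follows. The import times are from the slide; the utilisation figures in the headroom check are hypothetical and only illustrate why over-provisioned hardware absorbs a doubling of CPU utilisation.

```python
before_hours = 26
after_hours = 56

slowdown = after_hours / before_hours
extra_hours = after_hours - before_hours
print(f"slowdown: {slowdown:.2f}x, extra time: {extra_hours} hours")


def fits_after_doubling(utilisation_before):
    """True if a host still has capacity once its CPU utilisation doubles."""
    return utilisation_before * 2 <= 1.0


# Hypothetical hosts: one over-provisioned, one already fairly busy.
for utilisation in (0.4, 0.6):
    print(utilisation, fits_after_doubling(utilisation))
```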
Summary
●
Cloud migration is a business decision, but remember that costs will
probably increase
– Monitor your costs closely: you will discover a number of ways in which money is
wasted in the cloud (debug logging, for example).
●
Outsourcing your L1 operations team to people who do not care about
your business needs still has the same problems as a decade or two ago
●
Choosing which provider to go with often involves small differences based
on your existing stack
●
The tooling available is still very raw, and we are still discovering
operational design patterns
●
Migrating to the cloud may require a wholesale change in process
– If you are in a large ITIL shop, that will require a huge change.
?

OSDC 2018 | Migrating to the cloud by Devdas Bhagat

  • 2. DISCLAIMER: This talk is about work in progress. Completeness and accuracy aren't guaranteed beyond best effort
  • 4. Starting point ● Old hardware ● A lot of profitable legacy software
  • 5. Starting point ● Old hardware ● A lot of profitable legacy software ● Openstack + bare metal
  • 6. Starting point ● Old hardware ● A lot of profitable legacy software ● Openstack + bare metal ● Working CI/CD
  • 7. Starting point ● Old hardware ● A lot of profitable legacy software ● Openstack + bare metal ● Working CI/CD ● Working configuration management
  • 8. Starting point ● Old hardware ● A lot of profitable legacy software ● Openstack + bare metal ● Working CI/CD ● Working configuration management ● Small infrastructure team
  • 9. Starting point ● Old hardware ● A lot of profitable legacy software ● Openstack + bare metal ● Working CI/CD ● Working configuration management ● Small infrastructure team ● Software is an essential business component, but our business is not software
  • 10. Starting point ● Old hardware ● A lot of profitable legacy software ● Openstack + bare metal ● Working CI/CD ● Working configuration management ● Small infrastructure team ● Software is an essential business component, but our business is not software ● Developers are on call for production application issues
  • 11. Cloud considerations ● Scaling – Cloud systems let you scale in smaller increments on demand
  • 12. Cloud considerations ● Scaling – Cloud systems let you scale in smaller increments on demand ● Variability in demand – Low variability in demand for computing resources supports staying in-house – Highly variable systems benefit from moving to the cloud far more
  • 13. Cloud considerations ● Scaling – Cloud systems let you scale in smaller increments on demand ● Variability in demand – Low variability in demand for computing resources supports staying in-house – Highly variable systems benefit from moving to the cloud far more ● Legal issues – Privacy regulations in the EU itself ● Also different laws between different EU countries – Brexit
  • 14. Cloud considerations ● Scaling – Cloud systems let you scale in smaller increments on demand ● Variability in demand – Low variability in demand for computing resources supports staying in-house – Highly variable systems benefit from moving to the cloud far more ● Legal issues – Privacy regulations in the EU itself ● Also different laws between different EU countries – Brexit ● Software design – Observability must be built into the software
  • 17. Vendor Choices ● Already using Docker ● Already moving to microservices
  • 18. Vendor Choices ● Already using Docker ● Already moving to microservices ● Moving from Mesos to Kubernetes was easy
  • 19. Vendor Choices ● Already using Docker ● Already moving to microservices ● Moving from Mesos to Kubernetes was easy ● This made Google's Cloud offering a slightly better choice than Amazon – Google being cheaper helped a bit
  • 20. Vendor Choices ● Already using Docker ● Already moving to microservices ● Moving from Mesos to Kubernetes was easy ● This made Google's Cloud offering a slightly better choice than Amazon – Google being cheaper helped a bit ● Neither was cheaper than running our own hardware – Savings mostly come from the lack of a dedicated operations group, and from being able to avoid some HA requirements
  • 21. The technical research phase ● Lasted about half a year
  • 22. The technical research phase ● Lasted about half a year ● Focus on two main areas: – How to manage infrastructure manually at the vendor – Tooling and automation
  • 24. Why manual work? ● Familiarisation – Terminology ● Concepts
  • 25. Why manual work? ● Familiarisation – Terminology ● Concepts ● Discover limitations – There are a lot of those – Some more interesting than others (load balancing, IPv6, DNS, ...)
  • 26. Choosing automation tools ● Shell scripts – Via gcloud + gsutil
  • 27. Choosing automation tools ● Shell scripts – Via gcloud + gsutil ● Ansible – We had Ansible experience – Built some systems with ansible – Very limited in what it can do without using gcloud
  • 28. Choosing automation tools ● Shell scripts – Via gcloud + gsutil ● Ansible – We had Ansible experience – Built some systems with ansible – Very limited in what it can do without using gcloud ● Puppet – Was not a serious contender six months ago
  • 29. Choosing automation tools ● Shell scripts – Via gcloud + gsutil ● Ansible – We had Ansible experience – Built some systems with ansible – Very limited in what it can do without using gcloud ● Puppet – Was not a serious contender six months ago ● Terraform – The best of the lot ● It has improved a lot since this slideset was first made
  • 30. Configuration management ● Stateless systems implemented in a 12-factor style are best put in containers and managed via Kubernetes – Alternatively, use what Google calls managed groups and spin up VMs automatically in case of crashes
  • 31. Configuration management ● Stateless systems implemented in a 12-factor style are best put in containers and managed via Kubernetes – Alternatively, use what Google calls managed groups and spin up VMs automatically in case of crashes ● We still need configuration management for systems which aren't in a container
  • 32. Configuration management ● Stateless systems implemented in a 12-factor style are best put in containers and managed via Kubernetes – Alternatively, use what Google calls managed groups and spin up VMs automatically in case of crashes ● We still need configuration management for systems which aren't in a container ● Puppet was the obvious choice, because we were already using it – It doesn’t matter which specific tool you use, but use one.
  • 33. Inventory ● There isn't a nice CMDB out there yet, which can automagically provision VMs in the cloud and provide information to config-mgmt and orchestration tools – We currently hack our way around this by using tags and the Google API
  • 34. Moving into high speed ● One meeting – Three people – Thirty minutes
  • 35. Moving into high speed ● One meeting – Three people – Thirty minutes ● Decided on goals for a proof of concept – Complete automation – Custom tooling around the application – Fixed target application for a test deployment
  • 36. Moving into high speed ● One meeting – Three people – Thirty minutes ● Decided on goals for a proof of concept – Complete automation – Custom tooling around the application – Fixed target application for a test deployment ● Took us about three months of full time effort to wrap up the PoC
  • 37. Tools of choice ● Terraform – This is a pretty fast moving tool – They have good documentation ● For some value of good. – Getting your first bits and pieces working are harder than they should be, but the rest then follow pretty easily
  • 38. Tools of choice ● Terraform – This is a pretty fast moving tool – They have good documentation ● For some value of good. – Getting your first bits and pieces working are harder than they should be, but the rest then follow pretty easily ● Puppet – New Puppet repo, ignoring a lot of legacy. – Jumped Puppet version – Discarded large parts of the module approach recommended in Puppet documentation
  • 39. Terraform ● Base network project – All network related things are done in this project
  • 40. Terraform ● Base network project – All network related things are done in this project ● Other projects use instance groups with a mostly standard template – They reference network configs from the base project
  • 41. Terraform ● Base network project – All network related things are done in this project ● Other projects use an instance group with a mostly standard template – They reference network configs from the base project ● Google metadata is used to tie together Puppet and Terraform
  • 42. Shared backends ● We started with a simple backend for Terraform, with no remote state. – This does not scale to many users, but for the initial proof of concept was useful.
  • 43. Shared backends ● We started with a simple backend for Terraform, with no remote state. – This does not scale to many users, but for the initial proof of concept was useful. ● We then spent a few days very carefully refactoring this into per project state, with the shared state being remote in a cloud storage bucket. – https://charity.wtf/2016/03/30/terraform-vpc-and-why-you- want-a-tfstate-file-per-env/ is a pretty good horror story of what could go wrong
  • 44. * Documentation * API * Stateful data * IPv6 * Secrets
  • 46. Google Cloud Documentation ● Lags behind software ● Is often inconsistent
  • 47. Google Cloud Documentation ● Lags behind software ● Is often inconsistent ● This has not changed in about three years – This is not limited to Google though.
  • 48. API ● Quite inconsistent in some regards – Particularly about referencing other properties – Name or reference?
  • 49. API ● Quite inconsistent in some regards – Particularly about referencing other properties – Name or reference? ● Needs actual examples – A lot of examples ● This has not really improved since I first wrote this talk
  • 53. Stateful data ● There are no good answers for high availability ● Google offers multiple options for storage – Some of these are more reliable than others – But they are more complex to use – Or involve code changes ● Maintenance can cause outages – Automatic failover for Cloud SQL needs a whole zone to fail, so routine maintenance can cause an unexpected outage ● You may need to run your own database systems for more reliable access to structured data
  • 55. IPv6 ● Google does not put its money where its mouth is with respect to IPv6 – IPv6 support is very limited in the compute environment ● We started off by routing IPv6 traffic to our load balancers in the legacy environment and then proxying to IPv4 in Google – This is no longer needed
  • 59. Secrets ● If you have containers, Google supports encrypted secrets. ● Using Vault from HashiCorp looks like a good option, but you still need to code applications to use those secrets instead of reading from a config file ● Anything else which works with your configuration management system is a good idea (eyaml with Puppet, for example) – You still have the problem of managing a few master encryption keys ● We tested hiera-vault, but performance was terrible
  • 60. Loadbalancing ● Google’s load balancer offering is limited in some ways compared with more advanced tools such as F5s ● We chose to replace the hardware LBs with a simple IP-based load balancer + nginx proxies – Note that code which tracks IP addresses or does geolocation needs to change to handle this.
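The note about IP tracking can be illustrated with a small sketch. Once requests pass through a proxy, the socket peer is the proxy rather than the visitor, so application code has to recover the client address from a forwarding header. The example below assumes the proxies append to the conventional `X-Forwarded-For` header and that the proxy network ranges are known; the function name and the ranges are hypothetical.

```python
import ipaddress
from typing import Optional


def client_ip(remote_addr: str, x_forwarded_for: Optional[str],
              trusted_proxies: list[str]) -> str:
    """Return the best guess at the real client address behind trusted proxies."""
    nets = [ipaddress.ip_network(n) for n in trusted_proxies]

    def trusted(addr: str) -> bool:
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in nets)

    # If the direct peer is not one of our proxies, the header cannot be trusted.
    if not trusted(remote_addr):
        return remote_addr

    hops = [h.strip() for h in (x_forwarded_for or "").split(",") if h.strip()]
    # Walk from the nearest hop outwards and stop at the first untrusted address.
    for hop in reversed(hops):
        try:
            if not trusted(hop):
                return hop
        except ValueError:
            break  # Malformed entry: stop rather than trust anything beyond it.
    return remote_addr


if __name__ == "__main__":
    print(client_ip("10.0.0.5", "203.0.113.7, 10.0.0.9", ["10.0.0.0/8"]))
```

Walking the header from the right and trusting only known proxy ranges avoids accepting an address that a client simply wrote into the header itself.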
  • 63. Monitoring ● Stackdriver looks promising for log management – It has quite a few retention limitations – New pricing makes it cheaper to run an ELK stack, depending on log volume ● Stackdriver is a good replacement for the ELK stack, but not for high quality analytics/monitoring ● There isn't a really good alternative to running your own time-series database – Especially if you use that data for alerting
  • 65. Legacy code ● Plan on migrating it wholesale – Even if you plan to rewrite it ● Rewrites will take longer than you plan for – Even your planned migrations will take longer than expected, because of environmental assumptions. ● This does not benefit from moving to the cloud – You are just running it in an environment with different assumptions on latency and reliability
  • 66. Spectre/Meltdown impact ● CPU utilisation doubles – We are currently on rather over-provisioned hardware, so actual impact is minimal ● Anything which does a lot of system calls is slowed quite a bit – Large data import went from 26 hours to 56
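To put the data import figure in proportion, the arithmetic can be worked through as below. Only the 26 and 56 hour figures come from the slide; the helper names are made up for this sketch.

```python
def slowdown_factor(before_hours: float, after_hours: float) -> float:
    """How many times longer the job takes after the change."""
    return after_hours / before_hours


def percent_increase(before_hours: float, after_hours: float) -> float:
    """Extra runtime as a percentage of the original runtime."""
    return (after_hours - before_hours) / before_hours * 100


if __name__ == "__main__":
    factor = slowdown_factor(26, 56)
    extra = percent_increase(26, 56)
    print(f"{factor:.2f}x slower, +{extra:.0f}% runtime")  # 2.15x slower, +115% runtime
```

A factor of just over two is in line with the doubling of CPU utilisation mentioned on the same slide.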
  • 71. Summary ● Cloud migration is a business decision, but remember that costs will probably increase – Monitor your costs closely; you will discover a number of ways in which money is wasted in the cloud (debug logging, for example). ● Outsourcing your L1 operations team to people who do not care about your business needs still has the same problems as a decade or two ago ● Choosing which provider to go with often comes down to small differences based on your existing stack ● The tooling available is still very raw, and we are still discovering operational design patterns ● Migrating to the cloud may require a wholesale change in process – If you are in a large ITIL shop, that will require a huge change.
  • 72. ?