Automating Disaster Recovery
TechOps Adventures with Terraform
Context
Terraform (verb)
“transform (a planet) so as to resemble the earth, especially so that it can support human life.”
Terraform (noun)
OBJECTIVES: Automate, Automate, Automate
What is a Pod?
All of the components required to provide LogicMonitor for customers
Tomcat
Kafka
TSDB
MySQL
Relay
Global Resources:
APIs
HAProxy
Redis
S3
SQS
ELBs
Sitemonitor
Proxy
SMTP
Render
ECSSG
DNS
… what’s next?
ElasticSearch
Rserve
IAM
Horizontally scalable Cell Architecture
Conflict
• Runbook (Cookbook)
• CLI or web interface
• Co-worker’s .bash_history
• Crossing your fingers?
The Old Way
• Infrastructure as code (self documenting, repeatable)
• Provision and de-provision (important!)
• Scalable (change two parameters to create a new Pod)
Terraform
Terraform - change control
Terraform - preview changes
AMIs
Disaster Strikes
Terraform Puppet Ansible
Questions?
Come find anyone wearing LogicMonitor shirts

AWS and Terraform for Disaster Recovery

Editor's Notes

  • #2: Hello. I’m Randall Thomson, Sr. TechOps Engineer at LogicMonitor. Our TechOps team manages the infrastructure that provides the LogicMonitor service to our customers. We straddle the line between SRE and DevOps, whatever you want to call it nowadays. We are always juggling our time between reactive and proactive tasks. This talk is about what our team has done to automate disaster recovery using Terraform and AWS.
  • #3: I tend to jump right into the nitty-gritty, so I want to spend a couple of brief slides on the two main subjects of this talk: Terraform and Pods (I will keep referring to these two things). Ask audience: Who has heard of or has experience with Terraform? Who has provisioned, or still provisions, AWS resources via the web portal? CLI? Other orchestration tools?
  • #4: This is Elon Musk’s Disaster Recovery plan for Earth. Not what I will be talking about today but definitely something fun to Google afterwards.
  • #5: Terraform is an open source tool by HashiCorp (Vagrant, Packer, Consul, Vault). To quote from the website: “Terraform enables you to safely and predictably create, change, and improve production infrastructure. It is an open source tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.”
  • #6: Context part #2 - A Pod. LogicMonitor uses a Cell Architecture design we internally refer to as “Pods”. These, in addition to a handful of global resources, are the infrastructure that powers the LogicMonitor service for our customers. Most of our Pods follow a hybrid cloud model, where some of the resources are in our own datacenters and the rest are in AWS. The list on screen is only a subset (always changing), but as you can see there are a lot of resources that go into building a Pod. Lots of nuts & bolts.
  • #7: So this leads us to one of our challenges. How do you provide a reliable way to scale and keep your disaster recovery plan up-to-date?
  • #8: 15m (Resolution) Open with the Old Way of creating infrastructure (CLI, web interface). In the past, if we wanted to replicate how an existing server was built, we would have to look up the documentation (if any) and then assess whether any manual changes had been made (cross your fingers, or read through a co-worker’s .bash_history). Black magic. This led to inconsistencies between environments that should ideally be exactly the same.
  • #9: Cue Terraform. The Terraform code serves both as documentation of how infrastructure is built and as a description of existing infrastructure. With Terraform you can provision new infrastructure to be the same as the old, and also keep your older infrastructure up-to-date as you make changes along the way. Terraform is able to provision all of the resources that make up our pods except our bare-metal servers. Our DR plan utilizes a 100% AWS Cloud pod design with no data center dependencies. Scalable. Worthwhile to maintain, as it serves as the single source of truth. Documentation is always up-to-date. It turned processes we used to fear into near-thoughtless tasks.
  • #10: - HCL, Modules, Projects, and Directory Organization. Private vs public facing resources. Data Providers. Terraform projects and modules can, or rather should, be stored in a code repository (but not your state files), even in a single-person shop. This gives you all the normal benefits of a software project, but for your infrastructure: revision history and proper change control. We use modules (reusable resource provisioners) as templates for our various application servers. We define projects to represent our various pods (and global resources). Each AWS environment has a distinct Terraform code repository. Terraform can operate across multiple AWS environments, but this gets complicated quickly. Suggest: Make use of data providers (Terraform's data sources) so that you are not hard-coding values in your code. For example, look up network ranges or AMI IDs.
  • #11: 5m - The ability to preview changes is useful when creating new resources and especially important when modifying old ones. It’s like a diff output showing additions, subtractions and changes. Somewhat colorized. You can (and will) configure various resources to ignore certain types of changes over time, for cases when you don’t need your older resources modified. For example, AMI IDs. You may change the AMI over time, but you don’t need to re-provision older servers, as Puppet keeps them up-to-date.
  • #12: The Complication. I want to make a brief side note on AMIs and the spectrum of generic vs ready-to-run. We have about a dozen different types of application servers. For us it made sense to build an AMI that gets us about 95% of what we need and let Puppet do the final tweaks. For some it may be best to have a dozen different AMIs ready-to-run. The time savings can be dramatic when your instances don’t need a lot of post-configuration. It's another example of where you have to put a lot more work up-front to save time later. There are a variety of tools for building AMIs. We happen to use Packer, not because of any Terraform integration but simply because it does its one job very well. Also, make sure you copy your AMIs to any region where you may need to perform DR tasks.
  • #13: At this point you may be wondering what all this has to do with Disaster Recovery. So here’s where we are today. We agreed as a group that any resource we provision in AWS must be done via terraform. All of our pods are described in terraform projects. As it so happened, in a serendipitous way, our Disaster Recovery plan was born. We no longer needed one way to provision our production infrastructure and a different method for our DR plan. With Terraform it’s basically the same in either case.
  • #14: 10m - The day has come. Your datacenter lost power. It’s 5am and you’ve been up half the night with your toddler. How much thinking do you want to have to do? How much thinking will you even be capable of? Likely very little. terraform plan; terraform apply. Copy the project file and repeat. (Hope your VPN works, and that you have AMIs in the target regions.)
  • #15: 10m - We’re currently making use of Terraform to manage our QA environments as well. There is always room for improvement. We are looking at ways to automate application deployment in DR situations. One example we’re testing is using IAM roles combined with EC2 user-data scripts to fetch our WAR files directly from S3. Another example would be having a CI/CD tool (such as Bamboo) run the terraform commands. Then even your boss manager could do it.
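
Note #10 describes modules as templates, projects as pods, and lookups in place of hard-coded values. A minimal sketch of what such a pod project could look like follows, assuming Terraform 0.12 or later syntax and the AWS provider. The variable names, the AMI name pattern, the VPC tag, and the `./modules/app_server` module with its inputs are all hypothetical and are not taken from the talk.

```hcl
# Hypothetical pod project: the two inputs that distinguish one pod from another.
variable "pod_name" {
  type        = string
  description = "Short identifier for the pod (illustrative)"
}

variable "region" {
  type        = string
  description = "AWS region the pod is built in"
}

provider "aws" {
  region = var.region
}

# Look up the newest matching AMI instead of pasting an AMI ID into the code.
# The name pattern "app-base-*" is an assumption for this sketch.
data "aws_ami" "app_base" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["app-base-*"]
  }
}

# Look up an existing VPC by tag so its network range is never hard-coded.
data "aws_vpc" "pod" {
  tags = {
    Name = var.pod_name
  }
}

# Reusable template for one type of application server.
# The module path and its input names are hypothetical.
module "tomcat" {
  source       = "./modules/app_server"
  name         = "${var.pod_name}-tomcat"
  ami_id       = data.aws_ami.app_base.id
  allowed_cidr = data.aws_vpc.pod.cidr_block
}
```

With a layout like this, a second pod would differ only in the values supplied for `pod_name` and `region`, which is one way to read the "change two parameters to create a new Pod" bullet on the slides. Running `terraform plan` and then `terraform apply` in the project directory previews and applies the result.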
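
Note #11 mentions configuring resources to ignore certain changes, with AMI IDs as the example. In Terraform this is typically done with the `lifecycle` block's `ignore_changes` argument. The sketch below assumes Terraform 0.12 or later syntax; the resource and variable names are illustrative only.

```hcl
variable "ami_id" {
  type        = string
  description = "AMI used when the instance is first created (illustrative)"
}

variable "instance_type" {
  type    = string
  default = "t3.medium" # hypothetical size
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = var.instance_type

  lifecycle {
    # A newer AMI ID in the configuration will not force existing
    # instances to be replaced; only newly created ones pick it up.
    ignore_changes = [ami]
  }
}
```

With this in place, `terraform plan` should no longer propose replacing the instance when only the AMI value has changed, which matches the note's point that configuration management keeps the older servers current.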
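
Note #12 ends with the reminder to copy AMIs to any region where DR may be needed. One possible way to keep that copy under Terraform's control is the AWS provider's `aws_ami_copy` resource, sketched below. The talk does not say how the team performs the copy, so this is an assumption, and the regions, names, and variable are placeholders.

```hcl
variable "source_ami_id" {
  type        = string
  description = "ID of the AMI built in the primary region (illustrative)"
}

# Provider alias pointing at the hypothetical DR region.
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Creates a copy of the source AMI in the DR provider's region.
resource "aws_ami_copy" "app_base_dr" {
  provider          = aws.dr
  name              = "app-base-dr-copy"
  source_ami_id     = var.source_ami_id
  source_ami_region = "us-east-1" # hypothetical primary region
}
```

The copy is created in the region of the aliased provider, so a pod project targeting that region can find it with an AMI lookup instead of a hard-coded ID.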
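
Note #15 mentions testing IAM roles combined with EC2 user-data scripts to fetch WAR files from S3. The following is only a sketch of how that idea could be wired together, assuming Terraform 0.12 or later syntax and an AMI that already has the AWS CLI installed. The bucket, object key, role names, instance type, and target path are all hypothetical.

```hcl
variable "ami_id" {
  type = string
}

variable "artifact_bucket" {
  type        = string
  description = "S3 bucket holding the WAR files (illustrative)"
}

variable "war_key" {
  type        = string
  description = "Object key of the WAR file to deploy (illustrative)"
}

# Role that EC2 instances are allowed to assume.
resource "aws_iam_role" "app" {
  name = "app-war-fetch"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# Read-only access to objects in the artifact bucket.
resource "aws_iam_role_policy" "read_wars" {
  name = "read-war-files"
  role = aws_iam_role.app.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "arn:aws:s3:::${var.artifact_bucket}/*"
    }]
  })
}

resource "aws_iam_instance_profile" "app" {
  name = "app-war-fetch"
  role = aws_iam_role.app.name
}

resource "aws_instance" "app" {
  ami                  = var.ami_id
  instance_type        = "t3.medium"
  iam_instance_profile = aws_iam_instance_profile.app.name

  # Runs at first boot; the instance profile supplies the credentials.
  user_data = <<-EOF
    #!/bin/bash
    aws s3 cp "s3://${var.artifact_bucket}/${var.war_key}" /opt/app/app.war
  EOF
}
```

Because the instance obtains credentials from its instance profile, no access keys need to be baked into the AMI or the user-data script.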