HowTo DR

Josh Berkus
PostgreSQL Experts
SCALE 2014
Disaster Recovery
“The process, policies and procedures that are related to preparing for recovery or continuation of technology infrastructure which are vital to an organization after a natural or human-induced disaster.”
Wikipedia, February 2014
Disaster Recovery
Restoring services after the unexpected.
Disaster Recovery
Limiting:
1. Downtime
2. Data Loss
Do you have a DR Plan?
Is it fairly complete?
Have you tested it?
Dilbert is a copyright of Scott Adams. Used here as parody. All rights reserved to Scott Adams.
Threat Model
● server failure
● getting hacked
● natural disaster
● storage failure
● traffic spike
● network failure
● admin error
● bad update
● OS / VM problem
● software bugs
Accepting Loss
The Nines
Nines      Down/Year
99.9%      9 hours
99.99%     1 hour
99.999%    5 minutes
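The downtime figures for each level of nines are rounded. A minimal sketch of the arithmetic behind them, assuming a 365-day year (the function name is illustrative and not from the talk):

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for nines in (99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(nines)
    print(f"{nines}%: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Under that assumption the three levels work out to roughly 8.8 hours, 53 minutes and 5.3 minutes per year, which the slide rounds to 9 hours, 1 hour and 5 minutes.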
The Nines
● Treats all downtime causes as identical
  – except the ones it ignores
● Doesn't address data loss
● Really “Business Continuity”
● also unrealistic
Disaster            Downtime   Data Loss
Server Failure
Network Failure
Admin Error
Bad Update
Storage Failure
Getting Hacked
Natural Disaster
Detect
Disaster            Downtime   Data Loss
Server Failure      0          0
Network Failure     0          0
Admin Error         0          0
Bad Update          0          0
Storage Failure     0          0
Getting Hacked      0          0
Natural Disaster    0          0
Detect              10 yrs     10 yrs
$$$$$$$$$$$$$$$$$$$$$$$
Disaster            Downtime   Data Loss
Server Failure      5min       1min
Network Failure     3hrs       10min
Admin Error         1hr        1hr
Bad Update          1hr        1hr
Storage Failure     5min       30min
Getting Hacked      1hr        1hr
Natural Disaster    6hrs       1hr
Detect              3 mo       3 mo
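One way to make accepted losses like these actionable is to keep them as data and compare them with what a recovery drill actually measured. The sketch below is only an illustration: the `Target` class, the `TARGETS` mapping and `meets_target` are hypothetical names, the values are copied from the per-disaster figures above, and the Detect row is left out.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Target:
    downtime: timedelta   # longest acceptable outage
    data_loss: timedelta  # longest acceptable window of lost data


# Values copied from the slide; the structure itself is an assumption.
TARGETS = {
    "server failure": Target(timedelta(minutes=5), timedelta(minutes=1)),
    "network failure": Target(timedelta(hours=3), timedelta(minutes=10)),
    "admin error": Target(timedelta(hours=1), timedelta(hours=1)),
    "bad update": Target(timedelta(hours=1), timedelta(hours=1)),
    "storage failure": Target(timedelta(minutes=5), timedelta(minutes=30)),
    "getting hacked": Target(timedelta(hours=1), timedelta(hours=1)),
    "natural disaster": Target(timedelta(hours=6), timedelta(hours=1)),
}


def meets_target(disaster: str, downtime: timedelta, data_loss: timedelta) -> bool:
    """Return True if a measured recovery stayed within the accepted losses."""
    target = TARGETS[disaster]
    return downtime <= target.downtime and data_loss <= target.data_loss
```

A drill that restored service after a storage failure in 10 minutes would fail this check even with no data lost, because the accepted downtime for that case is 5 minutes.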
Your DR Plan
Elements of a Plan
1. Backups/Replicas
2. Replacements
3. Procedures
4. People
Backups
● pg_dump
● mysqldump
● rsync
● zfs snapshot
● SAN exportable snapshot
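As an illustration of a periodic backup with the first tool on the list, the sketch below builds a `pg_dump` command line that writes a timestamped custom-format dump. The database name, backup directory and file naming scheme are assumptions made for the example; only the command is built here, so nothing touches a database until it is passed to `subprocess.run`.

```python
from datetime import datetime, timezone
from pathlib import Path


def pg_dump_command(dbname: str, backup_dir: str, now: datetime) -> list[str]:
    """Build a pg_dump command that writes a timestamped custom-format dump."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = Path(backup_dir) / f"{dbname}-{stamp}.dump"
    return ["pg_dump", "--format=custom", f"--file={target}", dbname]


if __name__ == "__main__":
    # "appdb" and the directory are hypothetical example values.
    cmd = pg_dump_command("appdb", "/var/backups/pg", datetime.now(timezone.utc))
    print(" ".join(cmd))
    # To actually take the backup: subprocess.run(cmd, check=True)
```

Keeping the timestamp in the file name makes it easy to see how old the newest backup is, which is the data loss interval a restore would incur.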
Backups++
● Periodic
● Portable
● Simple
● Recover point-in-time
Backups--
● Slow to restore
● Data loss interval
Backups
● Good for:
  – natural disaster
  – admin error, bad update
  – software bugs
  – getting hacked
● Bad for everything else
Replication
● Postgres binary replication
● MySQL replication
● Redundant HBase nodes
● Redis clustering
● DRBD
● GlusterFS etc.
Replication++
● Continuous
● Fast failover
● Low data loss
Replication--
● Extra hardware
● Complex
● High-maintenance
● Can hurt performance
● Can replicate failures
Replication
● Good For:
  – server, storage, network failure
● Bad For:
  – admin error, getting hacked
  – software bugs
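The "good for" lists for backups and replication can be combined into a quick coverage check against the threat model. This is a minimal sketch: the sets are transcribed from the slides, while the names `THREATS`, `GOOD_FOR` and `uncovered` are hypothetical.

```python
# Threats from the threat-model slide.
THREATS = {
    "server failure", "getting hacked", "natural disaster", "storage failure",
    "traffic spike", "network failure", "admin error", "bad update",
    "OS / VM problem", "software bugs",
}

# What each mechanism is listed as good for on the slides.
GOOD_FOR = {
    "backups": {"natural disaster", "admin error", "bad update",
                "software bugs", "getting hacked"},
    "replication": {"server failure", "storage failure", "network failure"},
}


def uncovered(threats: set[str], mechanisms: list[str]) -> set[str]:
    """Return the threats that none of the chosen mechanisms is good for."""
    covered = set().union(*(GOOD_FOR[m] for m in mechanisms))
    return threats - covered


print(sorted(uncovered(THREATS, ["backups"])))
print(sorted(uncovered(THREATS, ["backups", "replication"])))
```

With both mechanisms in place, the only threats left uncovered by these lists are traffic spikes and OS / VM problems; with backups alone, every hardware and network failure remains uncovered.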
Continuous Backup
● Also “PITR”
● Continuous like replication
● Partial recovery like backups
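Point-in-time recovery generally pairs periodic base backups with a continuous log of changes: recovery starts from the newest base backup completed at or before the target time, then replays changes up to that time. The sketch below covers only the selection step, with a hypothetical function name and backup completion times supplied by the caller.

```python
from datetime import datetime


def pick_base_backup(backup_times: list[datetime], target: datetime) -> datetime:
    """Return the newest base backup completed at or before the recovery target."""
    candidates = [t for t in backup_times if t <= target]
    if not candidates:
        raise ValueError("no base backup is old enough for this recovery target")
    return max(candidates)


backups = [datetime(2014, 2, 20), datetime(2014, 2, 21), datetime(2014, 2, 22)]
# Recover to just before a bad update on the afternoon of the 21st.
print(pick_base_backup(backups, datetime(2014, 2, 21, 15, 0)))
```

The error case matters in practice: a recovery target older than the oldest retained base backup cannot be reached, however complete the change log is.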
… where you gonna restore those backups to?
Replacing Services
● servers
● network
● storage
● OS image
● software reversion
Procedures
Written Procedures
3AM is not the time to improvise
Procedures
… for each recovery step
… for deciding what steps
Database Server Does Not Respond
1. Determine if physical server is down
   a. if network is down, use plan N1.
2. If not, try to restart database using command …
3. Still down? Fail over to replica using command …
4. Check replica.
5. Not working? Restore backup to test server 1 using command ...
Good: detailed written procedures
Better: written procedures with pastable commands
Best: tested single-command scripts
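A single-command script for a procedure like "Database Server Does Not Respond" might chain the recovery steps so that each one runs only when the previous one did not restore service. The sketch below is an assumption about how that could look: the step actions are placeholders, because the slides elide the real commands, and a real script would wrap the site's own tested commands.

```python
from typing import Callable, Optional, Sequence, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_until_recovered(steps: Sequence[Step]) -> Optional[str]:
    """Run recovery steps in order; return the name of the first that succeeds."""
    for name, action in steps:
        print(f"trying: {name}")
        try:
            if action():
                print(f"recovered by: {name}")
                return name
        except Exception as exc:  # a failing step must not abort the runbook
            print(f"{name} raised {exc!r}")
    print("all steps failed: time for a meeting")
    return None


# Placeholder actions standing in for the elided commands.
steps = [
    ("restart database", lambda: False),
    ("fail over to replica", lambda: True),
    ("restore backup to test server", lambda: True),
]
print(run_until_recovered(steps))
```

Because the order of steps lives in one list, the decision of what to try next is written down ahead of time instead of being made at 3AM.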
Fallback Procedures
● Sometimes recovery fails
● Have fallback procedures
● If the fallback fails
  … time for a meeting!
Never ever ever ever improvise.
People
Who You Gonna Call?
Know who to call
● on call staff
● experts in each service
● consultants/contractors
● vendors
● required authorizations
Contact Book
● Include as much contact information as possible
● Put copies in more than one place
  – including paper!
● Keep it up to date
Test Your DR
Good: when you create the procedure
Better: quarterly
Best: as part of daily/weekly provisioning
An untested backup is one which doesn't work.
DR in the Cloud
“It's a cloud, right? That means it's redundant, right?”
… not necessarily for your servers
unless you pay for it!
Some new problems
● Instance failure
● Resource overcommit
● Zone failures
● Admin error at scale
Some new solutions
● Redundant services
  – RDS, VIP, S3
● Rapid server deployment
● Cheap replicas
… otherwise pretty much the same.
backup locations
● shared instance storage (EBS)
  – fast failover for instance fail
● long-term storage API (S3)
  – redundant
  – large
Use your rapid deploy!
● Continuous backup to S3
● Deploy scripts + server images
  – Chef/Salt/Puppet/etc. helps here
● = fast recovery
  – with low running costs
DR Tips
● Have multiple copies of your plan
  – in multiple locations
● A SAN is not a DR solution
● One form of backup is seldom enough
Questions?
● Josh Berkus
  – www.databasesoup.com
  – www.pgexperts.com
● Coming up:
  – NYC pgDay April 3-4
  – pgCon May 21-24

Copyright 2013 PostgreSQL Experts Inc. Released under the Creative Commons
Share-Alike 3.0 License. All images, logos and trademarks are the property of their
respective owners and are used under principles of fair use unless otherwise noted.
