Disaster Recovery - On-Premise & Cloud

CLOUDCONF 2014
Database: backup and disaster recovery in the Cloud
Walter Dal Mut
@walterdalmut – www.corley.it – walterdalmut.com

DISASTER RECOVERY
Disaster recovery (DR) is about preparing for and recovering from a
disaster.

DISASTER
Any event that has a negative impact on
your business continuity or finances could be termed a disaster.

WHY ARE WE TALKING ABOUT DR?
• Over 70% of businesses involved in a major fire either do not reopen, or subsequently fail
within 3 years of the fire. (Source: continuitycentral.com)
• 80% of businesses affected by a major incident either never re-open or close within
18 months. (Source: Axa)
• 70 percent of companies go out of business after a major data loss. (Source:
continuitycentral.com)
• 80% of businesses suffering a computer disaster, who have no disaster recovery plans, go
out of business. (Source: “A Bridge Too Far”, IBM Business Recovery Service & Cranfield,
1993)
• A recent study from Gartner, Inc., found that 90 percent of companies that experience
data loss go out of business within two years.
• 80 percent of companies without well-conceived data protection and recovery strategies
go out of business within 2 years of a major disaster. (Source: US National Archives and
Records Administration)

RTO – RECOVERY TIME OBJECTIVE
This is the duration of time and the service level to which a business
process must be restored after a disaster.

RTO – WHAT DOES IT IMPLY?
• We have a system that records 1000 transactions per hour
• We take a snapshot of the system at 03:00 am (every day)
• At 10:00 am a disaster event occurs
• We spend 1 hour sorting things out for the backup (off-site retrieval, preparation, etc.)
• The recovery operation takes 4 hours to get back to operating (at minimum
service level)
• 1 hour + 4 hours = 5 hours: this is the RECOVERY TIME OBJECTIVE
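
The arithmetic of this example can be sketched in a few lines of Python. This is only an illustration: the variable names and the calendar date are assumptions, while the durations come from the bullets above.

```python
from datetime import datetime, timedelta

# Figures from the example; the calendar date is arbitrary (assumption).
disaster_at = datetime(2014, 1, 1, 10, 0)
preparation = timedelta(hours=1)  # sorting out the off-site backup
restore = timedelta(hours=4)      # restore up to minimum service level

recovery_time = preparation + restore
back_online_at = disaster_at + recovery_time

print(recovery_time)          # 5:00:00
print(back_online_at.time())  # 15:00:00
```

With these numbers the service is back at minimum level at 15:00, five hours after the disaster.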

RPO – RECOVERY POINT OBJECTIVE
This describes the acceptable amount of data loss measured in time.

RPO – WHAT DOES IT IMPLY?
• We have a system that records 1000 transactions per hour
• We take a snapshot of the system at 03:00 am (every day)
• At 10:00 am a disaster event occurs
• In this case we lose around 7000 transactions
  • 1000 transactions from 03:00 to 04:00
  • 1000 transactions from 04:00 to 05:00
  • …
• But: with a daily snapshot we are accepting up to 24 hours of data loss, i.e. 24000 transactions (RPO)
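
The same numbers can be checked with a short Python sketch. The constant names and the calendar date are illustrative assumptions; the rates and times are the ones from the bullets above.

```python
from datetime import datetime

TRANSACTIONS_PER_HOUR = 1000
SNAPSHOT_INTERVAL_HOURS = 24  # one snapshot per day

# The calendar date is arbitrary (assumption); only the times matter.
last_snapshot = datetime(2014, 1, 1, 3, 0)
disaster_at = datetime(2014, 1, 1, 10, 0)

# Data written after the last snapshot is lost.
hours_lost = (disaster_at - last_snapshot).total_seconds() / 3600
lost_transactions = int(hours_lost * TRANSACTIONS_PER_HOUR)

# Worst case: the disaster strikes just before the next snapshot.
worst_case_transactions = SNAPSHOT_INTERVAL_HOURS * TRANSACTIONS_PER_HOUR

print(lost_transactions)        # 7000
print(worst_case_transactions)  # 24000
```

The actual loss depends on when the disaster happens; the RPO is the worst case that the snapshot schedule allows.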

DISASTER RECOVERY STRATEGIES
Local tape backup → Online backup → Pilot-Light → Warm Stand-by → And more…
Cost: $ → $$$ → $$$$$$
Recovery time: Days → Seconds

ON-PREMISE & CLOUD
Use cloud resources in order to provide business continuity

Disaster Recovery & Cloud?
• On Demand
  • We can allocate and release new resources whenever we need
• Cost Effective
  • Pay-as-you-go model. We pay only for the resources that we are effectively
  using
• Scalable
  • We can scale freely and adapt our strategy thanks to autoscaling and
  other mechanisms
• Secure
  • Control doesn’t mean security

FOCUS ON DATABASES
We will focus on MySQL, but the same ideas apply to your own infrastructure
without any problem.

BACKUP & RESTORE
Take a snapshot of a system and restore it when you need it

RTO & RPO?
Things to remember…

RTO
Which resources can affect my RTO?

RESOURCES ALLOCATION
How fast can we set up all the resources (e.g. instances, network, etc.)?

DB RESTORE
How long does the database restore take?

RPO
Which resources can affect my RPO?

DB SNAPSHOT
How much time do we need to recover all the data from our snapshot?

Backup & Restore – RPO & RTO
Configuration
• Resources Allocation
  • ???
• Restore Operation
  • ???
• DNS
  • TTL 30 minutes
• Snapshot
  • Every 24 hours
Effects
• RTO – Recovery Time Objective
  • 30 minutes + ??? + ???
• RPO – Recovery Point Objective
  • 24 hours
• Downtime per month
  • 99.8% availability: 86.23 minutes
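
As a rough cross-check of the downtime figure, the allowed downtime can be derived from the availability percentage. This is a sketch that assumes a 30-day month (the slide does not say which month length it uses), and the function name is illustrative.

```python
def monthly_downtime_minutes(availability_percent, days_in_month=30):
    """Minutes of downtime per month allowed by an availability target.

    days_in_month=30 is an assumption, not a value from the slides.
    """
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (1 - availability_percent / 100)


print(round(monthly_downtime_minutes(99.8), 2))  # 86.4
```

With a 30-day month this gives about 86.4 minutes, close to the 86.23 minutes quoted above; the small gap presumably comes from a slightly different month length.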

COSTS ON S3 (AWS)
$0.085 per GB, durability 99.999999999%
$0.068 per GB, durability 99.99%
$0.010 per GB, durability 99.999999999% [Glacier]

Pilot light
We keep a minimal set of resources always running, which lets us bring up
the whole system when needed

Replication
Basically, pilot-light is based on database replication strategies.
For MySQL, async replication is used as the base strategy.
http://www.slideshare.net/corleycloud/mysql-scale-out-cloudparty-2013-milano-talent-garden

READ REPLICA ON A CLOUD PROVIDER

RESOURCES ALLOCATION
Running and configuring new instances typically takes a couple of minutes.
You always have to keep an eye on resources and timings.

DNS PROPAGATION
DNS takes a little while to propagate new addresses (Time To Live)

DB REPLICATION
Remember that Master/Slave replication is ASYNC!
That implies a replication LAG, and that lag impacts your RPO!

MONITOR YOUR INFRASTRUCTURE
Setting an RPO of about 20 minutes implies that your replication LAG
should always be under 20 minutes!
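
A minimal sketch of such a check in Python. It assumes the replica status has already been fetched as a dictionary (for example from the output of MySQL's SHOW SLAVE STATUS, which exposes the lag as Seconds_Behind_Master); how the row is fetched is left out, and the function name is hypothetical.

```python
RPO_SECONDS = 20 * 60  # the 20-minute RPO used in this example


def lag_within_rpo(slave_status, rpo_seconds=RPO_SECONDS):
    """Return True when the replication lag is known and below the RPO.

    slave_status is assumed to be one row of SHOW SLAVE STATUS as a dict.
    """
    lag = slave_status.get("Seconds_Behind_Master")
    # MySQL reports NULL (None here) when replication is not running:
    # an unknown lag cannot satisfy the RPO, so treat it as a violation.
    if lag is None:
        return False
    return lag < rpo_seconds


print(lag_within_rpo({"Seconds_Behind_Master": 90}))    # True
print(lag_within_rpo({"Seconds_Behind_Master": 1500}))  # False
print(lag_within_rpo({"Seconds_Behind_Master": None}))  # False
```

A check like this would typically run periodically and raise an alert whenever it returns False.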

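Pilot Light – RPO & RTO
Configuration
• Resources Allocation
  • 20 minutes
• DNS
  • TTL 30 minutes
• Replication LAG
  • 20 minutes
Effects
• RTO – Recovery Time Objective
  • 50 minutes
• RPO – Recovery Point Objective
  • 20 minutes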

COSTS ON AWS
$0.06 per hour, 1 m1.small: ~$43 per month
$0.05 per GB EBS
$0.05 per 1 million I/O requests EBS

WARM STANDBY
Extends pilot-light resource allocation and preparation

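Warm Stand-by – RPO & RTO
Configuration
• Resources Allocation
  • 5 minutes
• DNS
  • TTL 30 minutes
• Replication LAG
  • 20 minutes
Effects
• RTO – Recovery Time Objective
  • 35 minutes
• RPO – Recovery Point Objective
  • 20 minutes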
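
The two sets of figures can be reproduced with a small sketch. It assumes that the RTO is the resource allocation time plus the 30-minute DNS TTL from the Backup & Restore example, and that the RPO equals the replication lag; the dictionary layout and function names are illustrative.

```python
DNS_TTL_MINUTES = 30  # TTL from the Backup & Restore example (assumed unchanged)

# Allocation time and replication lag in minutes, as in the two examples.
strategies = {
    "pilot light": {"allocation": 20, "replication_lag": 20},
    "warm stand-by": {"allocation": 5, "replication_lag": 20},
}


def rto_minutes(strategy):
    # Assumption: resources are allocated first, then DNS has to propagate.
    return strategy["allocation"] + DNS_TTL_MINUTES


def rpo_minutes(strategy):
    # With async replication the data at risk is whatever is still in the lag.
    return strategy["replication_lag"]


for name, strategy in strategies.items():
    print(name, rto_minutes(strategy), rpo_minutes(strategy))
# pilot light 50 20
# warm stand-by 35 20
```

Under these assumptions warm stand-by only improves the RTO; the RPO is the same because both rely on the same replication.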

COSTS ON AWS
0.06$ per hour 2 m1.small~86$ per month
0.05$ per GB EBS
0.05$ per 1 million I/O requests EBS
ELB 20$ per month
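
The monthly figures can be estimated with a short sketch. It assumes a 30-day month and leaves out the EBS charges, since they depend on the volume size and I/O, which are not given here; the names are illustrative and the prices are the 2014 figures quoted above.

```python
HOURS_PER_MONTH = 24 * 30   # assumption: 30-day month
M1_SMALL_PER_HOUR = 0.06    # USD, as quoted above
ELB_PER_MONTH = 20.0        # USD, as quoted above


def monthly_cost(instances, with_elb=False):
    """Rough monthly cost in USD, excluding EBS storage and I/O."""
    cost = instances * M1_SMALL_PER_HOUR * HOURS_PER_MONTH
    if with_elb:
        cost += ELB_PER_MONTH
    return cost


print(round(monthly_cost(1), 2))                 # 43.2
print(round(monthly_cost(2, with_elb=True), 2))  # 106.4
```

So the warm stand-by setup costs roughly two and a half times the single pilot-light instance, before storage.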

PILOT LIGHT VS WARM STAND-BY
In our examples, pilot light actually looks much more effective than warm stand-by.
Doesn’t it?

DEPENDS ON ASSUMPTIONS
We assume that we don’t need to scale out our database, and that
scaling it up is enough!
Resource allocation for new read replicas? How long does it take?
