Disaster Recovery - Business & Technology

Varrow Madness
March 15, 2012

Andrew Miller
Technical Consultant
t: @andriven  w: www.thinkmeta.net

One Big Reason to Do This

Expectations for Disaster Recovery ≠ IT Capabilities for Disaster Recovery

What is a Disaster?
• Disaster: An event that affects a service or system such
that significant effort is required to restore the original
performance level.
» IT Service Management Forum

• But what does that look like IN OUR ENVIRONMENT?
• What disaster and recovery scenarios should we plan for?
• Where do we begin?
• How do we do it?

Disaster Recovery vs. Operational Recovery
• Disaster Recovery
– To cope with & recover from an IT crisis that moves work to an
alternative system in a non-routine way.
– A real “disaster” is large in scope and impact
– DR typically implies failure of the primary data center and recovery to an
alternate site
• Operational Recovery
– Addresses more “routine” types of failures (server, network, storage,
etc.)
– Events are smaller in scope and impact than a full “disaster”
– Typically implies recovering to alternate equipment within the primary
data center
• Business expectations for recovery timeframes are typically
shorter for “operational recovery” issues than for a true “disaster”
• Each should have its own clearly defined objectives

Risks, Threats and Vulnerabilities

Risk is a function of the likelihood of a given threat
acting upon a particular potential vulnerability,
and the resulting impact of that adverse event on
the organization.
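
As a rough illustration of that definition, the sketch below scores threats such as human error or an extended power outage as likelihood times impact and ranks them. The multiplicative form, the 1-5 scales, the `Threat` class, and every score are assumptions made for this example; none of them come from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def risk(self) -> int:
        # Assumption: risk is modeled as likelihood x impact.
        return self.likelihood * self.impact


def rank_by_risk(threats):
    """Return the threats sorted from highest to lowest risk score."""
    return sorted(threats, key=lambda t: t.risk, reverse=True)


# Hypothetical scores, for illustration only.
threats = [
    Threat("Extended power outage", likelihood=2, impact=4),
    Threat("Earthquake / Volcano", likelihood=1, impact=5),
    Threat("Human error", likelihood=4, impact=3),
]

for t in rank_by_risk(threats):
    print(f"{t.name}: {t.risk}")
```

A real assessment would likely score each threat against specific vulnerabilities rather than assign one number per threat; the single score here only keeps the sketch short.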

Some threats that can cause Disasters…
• Human Error
• Localized IT systems /
network failure
• Extended power outage
• Telecommunications outage
• Storm / Weather damage
• Earthquake / Volcano
• Fire in the facility
• Facility flooding
• Local evacuation
• Cyber attack
• Sabotage

(Varrow) Disaster Recovery Approach
• Interviews with key personnel to understand Business Process priorities
and establish Business Impact Analysis (BIA).
• Review existing IT production infrastructure, including applications,
servers, storage, network, and external connectivity. Identify Risks and
Gaps.
• Establish Disaster Impact Scenarios and Disaster Recovery strategies to
meet requirements.
• Recommend Roadmap for establishing recovery capabilities and
documenting plans.
• Implement required recovery capabilities.
• Develop framework and content for IT DR Plan.
• Develop maintenance and test procedures for IT DR Plan.
• Address Business Continuity requirements and planning as appropriate.

What is the Business Impact Analysis?
• A conversation between IT and key stakeholders to
understand:
– What are the most time-critical and information-critical
business processes?
– How does the business REALLY rely upon IT Service and
Application availability?
– What are the Student, Financial, Regulatory, Reputational,
and other impacts of IT Service and Application
unavailability?
– What availability or recoverability capabilities are justifiable
based on these requirements, potential impact, and costs?

Disaster Recovery: Key Measures

[Figure: timeline from 5 a.m. to 7 p.m. with a disaster declared at 10 a.m. The Recovery Point Objective (RPO) is shown before the event and the Recovery Time Objective (RTO) after it.]
• RPO: Amount of data lost from failure, measured as the amount of time from a disaster event
• RTO: Targeted amount of time to restart a business service after a disaster event

Disaster Recovery: Key Measures
• Recovery Time Objective (RTO)
Maximum duration of disruption of service
• Recovery Point Objective (RPO)
Point in time to which application data is recovered / Maximum data loss
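
To make the two measures concrete, here is a minimal sketch that computes the data-loss window and the outage duration for one incident and compares them with the objectives. The function names, timestamps, and objective values are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, failure: datetime) -> timedelta:
    """Time between the last recoverable copy of the data and the failure."""
    return failure - last_recovery_point


def downtime(failure: datetime, service_restored: datetime) -> timedelta:
    """Time between the failure and the service being available again."""
    return service_restored - failure


def meets_objectives(loss: timedelta, outage: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True only if both the RPO and the RTO were met."""
    return loss <= rpo and outage <= rto


# Hypothetical incident: last good backup at 5 a.m., failure at 10 a.m.,
# service running again at 7 p.m.
last_backup = datetime(2012, 3, 15, 5, 0)
failure = datetime(2012, 3, 15, 10, 0)
restored = datetime(2012, 3, 15, 19, 0)

loss = data_loss_window(last_backup, failure)   # 5 hours of data lost
outage = downtime(failure, restored)            # 9 hours of disruption

# Hypothetical objectives: RPO of 4 hours, RTO of 12 hours.
print(meets_objectives(loss, outage,
                       rpo=timedelta(hours=4), rto=timedelta(hours=12)))
```

In this made-up case the outage is within the RTO, but five hours of lost data exceeds the four-hour RPO, so the check fails.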

[Figure: Recovery Point (left) and Recovery Time (right) scales, each running from Weeks through Days, Hours, Minutes, and Seconds toward Real Time at the center, plotted against Cost.]

BIA - Example Priority Tiers
• Priority 1: High Availability / Immediate Recovery
– Services whose unavailability for more than a brief period can have a severe impact on customers or time-critical business operations.
• Priority 2: 1-2 day recovery
– Services whose unavailability significantly impacts customers or business operations.
• Priority 3: 3-5 day recovery
– Services which can tolerate up to five days of disruption in a disaster.
• Priority 4: 6-10 day recovery
– Services which can tolerate up to ten days of disruption in a disaster.
– Priority 3 and 4 systems may be restored in less time, depending on the situation. However, higher priority functions will be restored first.
• Priority 5: “Best effort” recovery
– Non-critical services which can tolerate two weeks or more of disruption in a disaster. These systems will be restored on a best-effort basis, after other more critical systems have been restored and ongoing operations have resumed.
– Priority 5 systems may be restored in less time, depending on the situation. However, higher priority functions will be restored first. In some cases, systems deemed to not be required for continued operations may not be restored.
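
The example tiers can be sketched as a small classifier that maps the longest disruption a service can tolerate to a tier and then orders services for restoration. The function names, the reading of "brief period" as under one day, the handling of the 10-to-14-day gap, and the sample services are all assumptions for illustration.

```python
def priority_tier(tolerable_days: float) -> int:
    """Map the longest tolerable disruption (in days) to a priority tier."""
    if tolerable_days < 0:
        raise ValueError("tolerable_days must be non-negative")
    if tolerable_days < 1:    # assumption: "brief period" means under a day
        return 1
    if tolerable_days <= 2:   # 1-2 day recovery
        return 2
    if tolerable_days <= 5:   # 3-5 day recovery
        return 3
    if tolerable_days <= 10:  # 6-10 day recovery
        return 4
    return 5                  # assumption: anything longer is best effort


def restore_order(services: dict) -> list:
    """Order service names so that higher-priority tiers come first."""
    return sorted(services, key=lambda name: (priority_tier(services[name]), name))


# Hypothetical services and tolerances in days, for illustration only.
services = {
    "Payroll": 4,
    "Intranet wiki": 21,
    "Email": 0.5,
    "Order entry": 2,
}

for name in restore_order(services):
    print(priority_tier(services[name]), name)
```

In practice the tolerances would come out of the BIA conversations rather than a hard-coded table.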

What does it take to RECOVER from an IT Disaster?
• Data Protection
– Backups, Replication
• Recovery Facility
– Location to rebuild IT infrastructure or provision services
• Data Recovery & Storage
– Get Data into a form that is usable
• Servers / Compute Capacity
– Sufficient servers or virtual compute capacity to actually run the applications
• Network, Voice, and Data Communications
– Connect servers, storage and workers
– Connect the recovery site to work sites
– Communicate with customers
– Includes network, telecom, demarcation equipment; cabling; telecom provisioning
• DR Plan
– Documented and tested procedures for what to do, and how to do it
• People

Example Disaster Recovery Strategies
• Priority 1: 4 hour RTO or less
– Disaster Recovery Strategy: Establish hot site for systems and data in a secondary data center at a remote location that is unlikely to be impacted by a local or regional event.
– Data Protection Approach: Replicate / remote mirror / short interval remote disk-to-disk backup
• Priority 2: 24-48 hour RTO
– Disaster Recovery Strategy: Maintain sufficient remote physical or virtual infrastructure for restoration. Ensure sufficient space/power in recovery facility.
– Data Protection Approach: Remote disk-to-disk backup
• Priority 3: 72 hour RTO
– Disaster Recovery Strategy: Ensure ability to quickly acquire infrastructure for restoration. Ensure […] facility.
– Data Protection Approach: Tape (with sufficient off-site rotation) or remote disk-to-disk backup
• Priority 4: 1-2 week RTO
– Disaster Recovery Strategy: Ensure ability to quickly acquire infrastructure for restoration. Ensure […] facility.
– Data Protection Approach: Tape (with sufficient off-site rotation) or remote disk-to-disk backup
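
One way to read the example strategies is as a lookup: given the RTO a service requires, pick the least demanding tier whose RTO still fits. The sketch below assumes the upper end of each RTO range is the figure to compare against (so 48 hours for Priority 2 and 336 hours for Priority 4); that conservative choice and the `select_strategy` name are assumptions, not part of the deck.

```python
# (priority, RTO upper bound in hours, data protection approach),
# ordered from least to most demanding tier.
STRATEGIES = [
    (4, 14 * 24, "Tape (with sufficient off-site rotation) or remote disk-to-disk backup"),
    (3, 72, "Tape (with sufficient off-site rotation) or remote disk-to-disk backup"),
    (2, 48, "Remote disk-to-disk backup"),
    (1, 4, "Replicate / remote mirror / short interval remote disk-to-disk backup"),
]


def select_strategy(required_rto_hours: float):
    """Return (priority, approach) for the least demanding tier that meets the RTO."""
    for priority, max_rto_hours, approach in STRATEGIES:
        # Assumption: a tier qualifies only if its worst-case RTO fits.
        if max_rto_hours <= required_rto_hours:
            return priority, approach
    raise ValueError("no tier in the example table meets an RTO under 4 hours")


# Hypothetical requirement: a service that must be back within 72 hours.
print(select_strategy(72))
```

Because the comparison uses the worst case of each range, a service needing a 36-hour RTO lands in Priority 1 rather than Priority 2; comparing against the lower bound instead would be the less cautious alternative.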

Storage Arrays + Replication
[Figure: RecoverPoint replication between a production site and an optional disaster recovery site. Application servers at the production site and standby servers at the recovery site connect through SANs to storage arrays. A RecoverPoint appliance at each site provides a local copy and bi-directional remote replication/recovery over Fibre Channel/WAN, with production LUNs and local journals at the production site and a remote copy and journal at the recovery site.]
• Host-based write splitter
• Fabric-based write splitter
• Symmetrix VMAXe, VNX-, and CLARiiON-based write splitter
[Figure: Site A (Primary) and Site B (Recovery) each run vCenter Server, Site Recovery Manager, and vSphere; the sites are linked by vSphere Replication or storage-based replication.]
• vSphere Replication
– Simple, cost-efficient replication for Tier 2 applications and smaller sites
• Storage-based Replication
– High-performance replication for business-critical applications in larger sites
