Disaster recovery with OpenNebula
Carlo Daffara
First, let me get
some coffee.
“Disaster recovery (DR) involves a set of policies and
procedures to enable the recovery or continuation of vital
technology infrastructure and systems following a natural
or human-induced disaster. Disaster recovery focuses on
the IT or technology systems supporting critical business
functions, as opposed to business continuity, which
involves keeping all essential aspects of a business
functioning despite significant disruptive events. Disaster
recovery is therefore a subset of business continuity.”
80% of businesses affected by a major
incident either never re-open or close
within 18 months (Source: Axa)
From “Understanding the Cost of Data Center Downtime: An Analysis of the Financial Impact on Infrastructure Vulnerability”, Ponemon Research
“Let’s begin with one very interesting fact. According to a
survey completed in 2010, human error is responsible for
40% of all data loss, as compared to just 29% for hardware
or system failures. An earlier IBM study determined data
loss due to human error was as high as 80%” (From:
“Business continuity and disaster recovery planning for IT
professionals”, Elsevier press, 2014)
The recovery time objective (RTO) is the targeted duration of
time and a service level within which a business process must
be restored after a disaster (or disruption) in order to avoid
unacceptable consequences associated with a break in
business continuity.
The recovery point objective (RPO) is the maximum tolerable
period in which data might be lost from an IT service due to a
major incident.
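As a rough worked example of how the RPO constrains a snapshot-based design, the sketch below assumes periodic snapshots that are then copied to the DR site. The function names and the one-hour transfer time are illustrative assumptions, not figures from the slides.

```python
from datetime import timedelta


def worst_case_data_loss(snapshot_interval: timedelta,
                         transfer_time: timedelta) -> timedelta:
    """Worst-case data-loss window for periodic snapshot replication.

    A write made just after a snapshot is taken is only protected once the
    next snapshot has been taken AND fully copied to the DR site.
    """
    return snapshot_interval + transfer_time


def meets_rpo(snapshot_interval: timedelta,
              transfer_time: timedelta,
              rpo_target: timedelta) -> bool:
    """True if the replication schedule keeps data loss within the RPO."""
    return worst_case_data_loss(snapshot_interval, transfer_time) <= rpo_target


# Assumed schedule: a snapshot every 12 hours, about 1 hour to replicate it.
window = worst_case_data_loss(timedelta(hours=12), timedelta(hours=1))
print(window)
print(meets_rpo(timedelta(hours=12), timedelta(hours=1), timedelta(hours=24)))
```

Under these assumptions a 12-hour snapshot cycle cannot satisfy an RPO of 12 hours, because the copy time has to be added to the interval.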
“Alternative storage-based replication solutions cost a
minimum of $10,000 per terabyte of data covered plus
ongoing maintenance. For the composite organization’s
225 protected VMs with an average size of 100 gigabytes
(GB), the three year costs for licenses and maintenance are
estimated at $328,500” (Forrester research, “The Total
Economic Impact of VMware vCenter Site Recovery
Manager”, 2013)
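The arithmetic behind the quoted figures can be checked with a few lines. This is only a sanity check of the numbers in the quote, assuming decimal units (1 TB = 1000 GB); the function names are made up for the example.

```python
def protected_terabytes(vm_count: int, avg_vm_size_gb: float) -> float:
    """Total protected data in TB, assuming 1 TB = 1000 GB."""
    return vm_count * avg_vm_size_gb / 1000


def minimum_replication_cost(protected_tb: float, cost_per_tb: float) -> float:
    """Lower bound on the cost, before ongoing maintenance."""
    return protected_tb * cost_per_tb


tb = protected_terabytes(225, 100)            # figures from the quote
floor = minimum_replication_cost(tb, 10_000)  # $10,000 per TB minimum
print(tb, floor)
```

The quoted three-year estimate of $328,500 sits above this floor, which is consistent with the "plus ongoing maintenance" qualifier.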
3 simple rules to make a working DR:
Rule 1: never put all eggs in one
basket (be it hardware, software, cloud)
Customer buys full DR and snapshot capability from local
data center; data center updates SAN firmware and loses
everything. Customer discovers that snapshots and
backups were kept in the same SAN with everything else.
In electronics, an opto-isolator, also called an optocoupler,
photocoupler, or optical isolator, is a component that transfers
electrical signals between two isolated circuits by using light.
Opto-isolators prevent high voltages from affecting the system
receiving the signal.
Rule 2: RTO and RPO are usually
different from VM to VM
Needs to be
replicated
constantly
No one cares
if this dies
Rule 3: design a reliable oracle
Oracle of
Delphi
How the others do it:
How we do it:
Our approach takes advantage of three
individual factors:
● LizardFS’ thinly-provisioned snapshots
● online replication of chunks & tiering
● OpenNebula’s datastores
# An example of configuration of goals. It contains the default values.
1 1 : _
2 2 : _ _
3 3 : _ _ _
4 4 : _ _ _ _
5 5 : _ _ _ _ _
# (...)
20 20 : _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# But you don't have to specify all of them -- defaults will be assumed.
# You can define your own custom goals using labels if you use them, e.g.:
# 14 min_two_locations: _ locationA locationB # one copy in A, one in B, third anywhere
# 15 fast_access : ssd _ _ # one copy on ssd, two additional on any drives
# 16 two_manufacturers: WD HT # one on WD disk, one on HT disk
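To make the goal syntax concrete, here is a minimal sketch that reads lines in the `id name : labels` form shown above and checks whether a goal pins copies to specific labels. The parser is a hypothetical helper written for illustration, not part of LizardFS, and the sample uses the `min_two_locations` example with its comment marker removed.

```python
def parse_goals(text: str) -> dict[int, tuple[str, list[str]]]:
    """Parse 'id name : label label ...' lines; '#' starts a comment."""
    goals = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        head, _, labels = line.partition(":")
        goal_id, name = head.split()
        goals[int(goal_id)] = (name, labels.split())
    return goals


def covers_labels(labels: list[str], required: list[str]) -> bool:
    """True if every required label has a copy assigned in this goal."""
    return all(label in labels for label in required)


SAMPLE = """\
# id name : labels ('_' means any chunkserver)
3 3 : _ _ _
14 min_two_locations: _ locationA locationB  # one copy in A, one in B
"""

goals = parse_goals(SAMPLE)
name, labels = goals[14]
print(name, labels, covers_labels(labels, ["locationA", "locationB"]))
```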
● Most disasters are “local”, for example a fire
in the server room or a flood
● Two different DR sites, one near (e.g. next
building/other side of the building) and one
far (external datacenter)
● The near DR site receives a copy of the chunks
that are part of the marked datastores
● Remote snapshots are handled in the same
way: we take a full snapshot of the
datastore, and differentially replicate it
● We use the “snapshot of snapshot” approach
to avoid the cost of deduplication
● This way we can prioritize sync queues, and
at the receiving end we get a complete,
decoupled and working OpenNebula
For example, average dedup cost for ZFS: 5 to 30 GB of dedup table data for every TB of pool data, assuming an average block size of 64K.
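To get a feel for that range, the sketch below simply scales the 5 to 30 GB per TB figure quoted above to a given pool size. The 10 TB pool is an assumed example, and the function does nothing more than multiply.

```python
def dedup_table_range_gb(pool_tb: float,
                         low_gb_per_tb: float = 5,
                         high_gb_per_tb: float = 30) -> tuple[float, float]:
    """Scale the quoted per-TB dedup table cost to a whole pool."""
    return pool_tb * low_gb_per_tb, pool_tb * high_gb_per_tb


low, high = dedup_table_range_gb(10)  # assumed 10 TB pool
print(f"dedup table: {low:.0f} to {high:.0f} GB")
```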
Local (VM changes only in snapshots):
/var/lib/one/datastore → DRSNAP12H
/var/lib/one/snapshots → <yyyymmddhh> → DRSNAP12H
Remote (no chunk changes in snapshots):
/var/lib/one/datastore → DRSNAP12H
/var/lib/one/snapshots → <yyyymmddhh> → DRSNAP12H
inplace rsync (25x speedup)
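A plausible reading of "inplace rsync" is rsync's `--inplace` option. By default rsync builds a new copy of each changed file and renames it into place; with `--inplace` it writes the changed blocks directly into the existing destination file, so on a snapshotting, chunk-based store the unchanged chunks are presumably left untouched. The sketch below only builds such a command line. The host name `dr-remote` and the exact directory layout are assumptions, and the slides do not show the actual invocation.

```python
import shlex


def rsync_inplace_cmd(src_dir: str, remote_host: str, dst_dir: str) -> list[str]:
    """Build an rsync command that updates destination files in place."""
    return [
        "rsync",
        "-a",          # archive mode: recurse, keep permissions and times
        "--inplace",   # write changed blocks into the existing file
        src_dir.rstrip("/") + "/",
        f"{remote_host}:{dst_dir.rstrip('/')}/",
    ]


# Hypothetical host and paths, loosely following the layout on the slide.
cmd = rsync_inplace_cmd("/var/lib/one/snapshots", "dr-remote",
                        "/var/lib/one/snapshots")
print(shlex.join(cmd))
```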
virsh# domblkstat instance-0012 --device vda
vda rd_req 128
vda rd_bytes 2344448
vda wr_req 234
vda wr_bytes 618496
vda flush_operations 2
vda rd_total_times 106512819
vda wr_total_times 960359872
vda flush_total_times 1741727
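One way such counters could feed the sync prioritization is to sample them twice and compute a per-VM write rate. The sketch below parses output in the format shown above; the idea of ranking VMs by write rate is an assumption about how the numbers are used, not something the slides spell out.

```python
def parse_domblkstat(output: str) -> dict[str, int]:
    """Turn 'device counter value' lines into a {counter: value} dict."""
    stats = {}
    for line in output.splitlines():
        parts = line.split()
        # Counter lines have exactly three fields ending in an integer;
        # the prompt/command line and blank lines are skipped.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[1]] = int(parts[2])
    return stats


def write_rate(before: dict[str, int], after: dict[str, int],
               seconds: float) -> float:
    """Average bytes written per second between two samples."""
    return (after["wr_bytes"] - before["wr_bytes"]) / seconds


SAMPLE = """\
virsh# domblkstat instance-0012 --device vda
vda rd_req 128
vda rd_bytes 2344448
vda wr_req 234
vda wr_bytes 618496
vda flush_operations 2
"""

stats = parse_domblkstat(SAMPLE)
print(stats["wr_bytes"], stats["wr_req"])
```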
Our “pilot light” approach: a running OpenNebula on two
nodes, with its own LizardFS store, running only two VMs: the
Oracle and the Tester.
The Oracle checks if DR is needed, and may need human
confirmation before executing the DR failover. If confirmation
is given, it takes the latest valid snapshotted datastore,
softlinks it and imports the VMs (through snapshots, so it’s
instantaneous).
The Tester makes a snapshot of the current stable snapshot,
imports the VMs and runs them in a separate, non-routed
vnet, then executes a test to see if everything works (workload
dependent), then deletes the intermediate snapshots.
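The Oracle's "take the latest valid snapshot and softlink it" step might look roughly like the sketch below. It assumes snapshot directories named `<yyyymmddhh>` and a hypothetical marker file left by the Tester after a successful run; neither the marker nor the function names come from the slides, and importing the VMs into OpenNebula is left out.

```python
from __future__ import annotations

import os
from pathlib import Path

MARKER = ".dr-tested-ok"  # hypothetical file written by the Tester on success


def latest_valid_snapshot(snapshots_dir: Path) -> Path | None:
    """Newest <yyyymmddhh> directory that carries the Tester's marker."""
    candidates = [
        p for p in snapshots_dir.iterdir()
        if p.is_dir() and len(p.name) == 10 and p.name.isdigit()
        and (p / MARKER).exists()
    ]
    # Fixed-width timestamps sort chronologically as plain strings.
    return max(candidates, key=lambda p: p.name, default=None)


def activate(snapshot: Path, link: Path) -> None:
    """Point `link` at `snapshot`, replacing any previous link atomically."""
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(snapshot, tmp)
    os.replace(tmp, link)  # rename over the old symlink in one step


def failover(snapshots_dir: Path, link: Path, confirmed: bool) -> Path | None:
    """Do nothing without human confirmation; otherwise switch the link."""
    if not confirmed:
        return None
    snapshot = latest_valid_snapshot(snapshots_dir)
    if snapshot is not None:
        activate(snapshot, link)
    return snapshot
```

Keeping the confirmation check at the top of `failover` means an unconfirmed call never touches the datastore link.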
Only critical VMs (RTO < 30 minutes) are handled this way.
For the VMs with a higher RTO, buy one week of hardware on
demand, auto-install a node with Puppet or Ansible, and make
it join the OpenNebula cloud.
Usually deployed in 30 minutes. Other vendors guarantee <15 minutes.
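The split between the two recovery paths can be written down as a simple rule. In this sketch the tier names and the example VMs are made up; only the 30-minute threshold comes from the text above.

```python
PILOT_LIGHT_RTO_MINUTES = 30  # threshold quoted above


def recovery_tier(rto_minutes: float) -> str:
    """Pick a recovery path for a VM from its RTO in minutes."""
    if rto_minutes < PILOT_LIGHT_RTO_MINUTES:
        return "pilot-light"          # restarted on the always-on DR nodes
    return "on-demand-hardware"       # waits for rented, auto-installed nodes


# Hypothetical inventory: VM name -> RTO in minutes.
vms = {"erp-db": 15, "intranet-wiki": 240}
plan = {name: recovery_tier(rto) for name, rto in vms.items()}
print(plan)
```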
Ideal for harsh indoor environments that
require protection from falling dirt or liquid,
dust, light splashing, oil or coolant seepage.
Its NEMA Zone 4 rating also makes it perfect
for facilities located in earthquake-prone
seismic zones or any environment prone to
extreme vibration such as factories, power
stations, construction areas, shipping
facilities, warehouses, processing plants,
railroads, airports and military installations.
● Have a “big red button” to stop DR if
needed. Sometimes you are already fighting
a fire here, and you know it’s better not to
move everything in flight.
● Have two people who are competent as DR
firefighters, and give them a second phone
with a rechargeable card. And make sure
both don’t go on vacation together. (Hint:
don’t choose two people married to each other)
● Use a gateway machine to provide a
consistent internal IP scheme, and two
different configurations for the gateway
router to provide unmodified routing for the
remaining VMs
● Aggregate functionality in a single VM (for
example, one that manages logs) to
optimize writes
● I favor consistency, so I tend to avoid
application-level replication, unless it’s
native to the app (e.g. NoSQL). Otherwise
you have different solutions for different
machines (e.g. quorum group in MS
replication with same UUID…)
● Try to reduce write amplification for
databases, especially MySQL, e.g. with TokuDB
and its fractal tree indexes
Thank you!
Carlo Daffara
@cdaffara
linkedin.com/in/cdaffara
