High Availability and Disaster Recovery
in IBM PureApplication System
Scott Moonen <smoonen@us.ibm.com>
Agenda
• Principles and definitions
• HA and DR tools in PureApplication System
• Composing tools to meet your requirements
• Caveats
• Resources
Principles and definitions
Principles and definitions: HA and DR
• Business continuity
Ability to recover business operations within specified parameters in case of specified disasters
• Continuous availability
Operation of a system in which unplanned outages interrupt operation for at most about 5 minutes
per year (“five nines,” or 99.999% availability)
• High availability
Operation of a system in which unplanned outages interrupt operation for at most a few seconds
or minutes while failover occurs. Often used as an umbrella term that includes continuous
availability.
• Disaster recovery
Operation of a system with a plan and process for reconstructing or recovering operations in a
separate location in case of disaster.
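As a side note (an illustration added to this transcript, not a slide from the original deck), the “five nines” figure above is just arithmetic on the unavailable fraction of a year:

    # Downtime budget per year implied by an availability target.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    def downtime_minutes_per_year(availability: float) -> float:
        """Unplanned downtime allowed per year for a given availability fraction."""
        return (1.0 - availability) * MINUTES_PER_YEAR

    for target in (0.999, 0.9999, 0.99999):
        print(f"{target:.5f} -> {downtime_minutes_per_year(target):.1f} minutes/year")
    # 99.999% availability leaves roughly 5.3 minutes of downtime per year.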
Principles and definitions: Active, Passive, etc.
• Active–Active
A system where continuous or high availability is achieved by having active operation in multiple
locations
• Active–Standby (or “warm standby”)
A system where high availability is achieved by having active operation in one location with
another location or locations able to become active within seconds or minutes, without a
“failover” of responsibility
• Active–Passive (or “cold standby”)
A system where high availability or disaster recovery is achieved by having active operation in
one location with another location or locations able to become active within minutes or hours
after a “failover” of responsibility
Principles and definitions: RTO and RPO
• RTO: recovery time objective
How long it takes for an HA or DR procedure to bring a system back into operation
• RPO: recovery point objective
How much data (measured in elapsed time) might be lost in the event of a disaster
[RPO spectrum from zero through seconds, minutes, and hours to days: mirrored file systems sit near zero, replicated file systems in the seconds-to-minutes range, and backup and restore at the hours-to-days end.]
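To make the RPO axis concrete, here is a small, hypothetical worst-case calculation for asynchronous replication that ships changes on a fixed interval (an added illustration; the deck's own block storage replication is continuous and quotes an RPO of up to 1 second):

    # Worst-case data loss for interval-based asynchronous replication:
    # the primary can fail just before the next shipment completes.
    def worst_case_rpo_seconds(replication_interval_s: float, transit_s: float) -> float:
        """Upper bound, in seconds of writes, on data lost at the replica."""
        return replication_interval_s + transit_s

    # Changes shipped every 30s over a link with ~1s transit time:
    print(worst_case_rpo_seconds(30.0, 1.0))  # 31.0 seconds of writes at risk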
Principles and definitions: Scenarios
• Metropolitan distance: multiple data centers within 100–300km
–High availability is achievable using Active–Active or Active–Standby solutions that involve active
mirroring of data between sites.
–Disaster recovery with zero RPO is achievable using Active–Passive solutions that involve replication of
data between sites.
• Regional to global distance: multiple data centers beyond 200–300km
Disaster recovery with nonzero RPO is achievable using Active–Passive solutions that involve
replication of data between sites.
Principles and definitions: Personas
• Application architect
Responsible for planning the application design in such a way that high availability or disaster
recovery is achievable (e.g., separating application from data)
• Infrastructure administrator
Responsible for configuring and managing infrastructure in such a way as to achieve the ability
to implement high availability or disaster recovery (e.g., configuring and managing disk
mirroring or replication)
• Application administrator
Responsible for deploying and managing the components of an application in such a way as to
achieve high availability or disaster recovery (e.g., deploying the application in duplicate
between two sites and orchestrating the failover of the application and its disks together with
the infrastructure administrator)
Principles: Automation and repeatability
• Automate all aspects of your application’s deployment and configuration
–Using PureApplication patterns, pattern components, script packages, customized images
–Using external application lifecycle tooling such as IBM UrbanCode Deploy
• Why? This achieves rapid and confident repeatability of your application deployment, allowing:
–Quality and control: lower risk and chance of error
–Agility and simplicity
• Quickly recover application if you need to redeploy it
• Quickly deploy your application at separate sites for HA or DR purposes
• Quickly deploy new versions of the application for test or upgrade purposes
• Create a continuous integration lifecycle for faster and more frequent application deployment and testing
–Portability: deploy to other cloud environments (e.g., PureApplication Service)
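To make the automation principle above concrete, here is a minimal sketch of scripting a pattern deployment over HTTP. The endpoint path, payload fields, and credential handling are hypothetical placeholders, not the documented PureApplication REST API; the point is that a deployment captured as a script can be replayed at a second site for HA or DR.

    # Hypothetical sketch: trigger a pattern deployment through a REST call.
    import requests

    SYSTEM = "https://pureapp.example.com"   # placeholder console URL
    AUTH = ("deployer", "secret")            # use real credential handling in practice

    def deploy_pattern(pattern_id: str, environment_profile: str, name: str) -> str:
        """Request a new deployment of an existing pattern; return its instance id."""
        payload = {
            "pattern": pattern_id,
            "environment_profile": environment_profile,
            "deployment_name": name,
        }
        resp = requests.post(f"{SYSTEM}/api/deployments",   # hypothetical endpoint
                             json=payload, auth=AUTH, timeout=60)
        resp.raise_for_status()
        return resp.json()["deployment_id"]

Re-running the same script against a second system (or PureApplication Service) is what gives you the rapid, repeatable redeployment described in the bullets above.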
Principles: Separation of application and data
• Ensure that all persistent data (transaction logs, database contents, etc.) is stored on disks
separate from the application or database software itself
• Why? This multiplies your recovery options because it decouples your strategy for application
and data recovery, which often must be addressed in different ways:
–Application recovery may involve backup & restore, re–deployment, or multiple deployment
Often the application cannot be replicated due to infrastructure entanglement
–Data recovery may involve backup & restore, replication, or mirroring
• This also allows additional flexibility for development and test cycles, for example:
–Deploy new versions of the application or database server and connect to original data
–Deploy test instances of the application using copies of the production data
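A small runnable check of the separation principle above: confirm that the application install path and the persistent data path live on different block devices, so the data volume can be detached, copied, or replicated independently. The two paths are placeholders for your own layout.

    import os

    APP_PATH = "/opt/myapp"    # application / middleware install (placeholder)
    DATA_PATH = "/data/mydb"   # persistent data on its own volume (placeholder)

    def on_separate_devices(path_a: str, path_b: str) -> bool:
        """True if the two paths are served by different block devices."""
        return os.stat(path_a).st_dev != os.stat(path_b).st_dev

    if on_separate_devices(APP_PATH, DATA_PATH):
        print("OK: data sits on its own volume and can be recovered separately")
    else:
        print("WARNING: application and data share a device; recovery options are coupled")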
Principles: Transaction consistency
If your application stores data in multiple locations (e.g., transaction logs on file server and transactions in
database), then you must ensure that either:
• The “lower” statements of record are replicated with total consistency together with the “higher”
statements of record, or else
• The “lower” statements of record are at all times replicated in advance of the “higher” statements of
record.
This ensures that you do not replicate inconsistent data (e.g., transaction log indicates a transaction is
committed but the transaction is not present in the database). So, for example:
• Your database and fileserver disks are replicated together with strict consistency, or instead
• Your database is replicated synchronously (zero RPO) but your fileserver asynchronously (nonzero RPO).
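A sketch of what a post-failover consistency check for this rule might look like: flag any transaction the replicated log claims is committed but that is missing from the replicated database. The in-memory data structures stand in for your own log reader and database query.

    from typing import Iterable, Set

    def committed_ids_from_log(log_entries: Iterable[dict]) -> Set[str]:
        """Transaction ids the replicated log claims are committed."""
        return {e["txid"] for e in log_entries if e.get("state") == "committed"}

    def orphaned_transactions(log_entries: Iterable[dict], db_txids: Set[str]) -> Set[str]:
        """Committed-in-log transactions missing from the database replica.
        A non-empty result means the two replicas were captured inconsistently."""
        return committed_ids_from_log(log_entries) - db_txids

    log = [{"txid": "t1", "state": "committed"}, {"txid": "t2", "state": "committed"}]
    db = {"t1"}                               # t2 never reached the database replica
    print(orphaned_transactions(log, db))     # {'t2'}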
HA and DR tools in
PureApplication System
Tools: Compute node availability
• PureApplication System offers two options for planning for failure of compute nodes:
–Cloud group HA, if enabled, will reserve 1/n CPU and memory overhead on each compute node in a
cloud group containing n compute nodes. If one compute node fails, all VMs will be recovered into this
reserved space on the remaining nodes.
–System HA allows you to designate one or more compute nodes as spares for all cloud groups that are
enabled for system HA. This allows you to (1) allocate more than one spare and (2) share a spare
between multiple cloud groups.
• If neither cloud group HA nor system HA is enabled and a compute node fails, the system will
attempt to recover as many VMs as possible on the remaining nodes in the cloud group, in
priority order.
• VMs being recovered will experience an outage equivalent to being rebooted.
• Recommendation: always enable cloud group HA or system HA
–This ensures your workload capacity is restored quickly after a compute node failure
–This also ensures that workload does not need to be stopped for planned compute node maintenance
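A worked example (added here, not from the deck) of the capacity trade-off between the two options: cloud group HA reserves roughly one node's worth of capacity in every enabled cloud group, while system HA sets aside whole spare nodes that the enabled cloud groups share.

    def cloud_group_ha_overhead(cloud_groups: int, cores_per_node: int) -> int:
        """Cloud group HA reserves 1/n of each of n nodes, i.e. about one
        node's worth of capacity per cloud group."""
        return cloud_groups * cores_per_node

    def system_ha_overhead(spare_nodes: int, cores_per_node: int) -> int:
        """System HA reserves whole spare nodes shared by all enabled cloud groups."""
        return spare_nodes * cores_per_node

    # Three cloud groups of 32-core nodes: per-group reservations cost ~96 cores,
    # while a single shared spare costs 32 cores.
    print(cloud_group_ha_overhead(3, 32))   # 96
    print(system_ha_overhead(1, 32))        # 32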
Tools: Block storage
Block storage volumes in PureApplication System:
• May be up to 8TB in size
• Are allocated and managed independently of VM storage, and can be attached and detached
• Are not included in VM snapshots
• Can be cloned (copied)
• Can be exported to and imported from external SCP servers
• Can be grouped for time–consistent cloning or export of multiple volumes
Tools: Shared block storage
• Block storage volumes may be shared (simultaneously attached) by virtual machines
–On the same system
Note: this is supported on Intel, and on Power beginning with V2.2.
–Between systems. Notes:
• This is supported only for external block storage that resides outside of the system (see later slide).
• This is supported on Intel. Support on Power is forthcoming.
• This allows for creation of highly available clusters (GPFS, GFS, DB2 pureScale, Windows cluster)
–A clustering protocol is necessary for sharing of the disk
–The IBM GPFS pattern (see later slide) supports GPFS clusters on a single rack using shared block
storage, but does not support cross–system clusters using shared external block storage
• Restrictions
–Storage volumes must be specifically created as “shared” volumes
–Special placement techniques are required in the pattern to ensure anti–collocation of VMs
–IBM GPFS pattern supports clustering (see below)
Tools: Block storage replication
Two PureApplication Systems can be connected for replication of block storage
• Connectivity options
–Fibre Channel connectivity supported beginning in V2.0
–TCP/IP connectivity supported beginning in V2.2
• Volumes are selected for replication individually
–Replicate in either direction
–Replicate synchronously up to 3ms latency (~300km), asynchronously up to 80ms latency (~8000km).
RPO for asynchronous replication is up to 1 second.
• All volumes are replicated together with strict consistency
• Target volume must not be attached while replication is taking place
• Replication may be terminated (unplanned failover) or reversed in place (planned failover).
Reverse in place requires volume to be unattached on both sides.
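An orchestration sketch of a planned failover (“reverse in place”) that respects the restrictions above. Every helper is a printing stub standing in for whatever console, REST, or CLI action your environment actually provides; only the ordering is the point.

    def _step(msg: str) -> None:
        print(msg)   # stand-in for a real management action

    def planned_failover(volume_id: str) -> None:
        _step(f"quiesce the workload using {volume_id} on system A")
        _step(f"detach {volume_id} on system A")
        _step(f"confirm {volume_id} is unattached on system B")  # required before reversing
        _step("reverse replication so system B becomes the source")
        _step(f"attach {volume_id} on system B")
        _step("restart the workload on system B")

    planned_failover("db-data-01")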
Tools: External block storage
• PureApplication System can connect to external SVC, V7000, V9000 devices:
–Allows for block and block “shared” volumes to be accessed by VMs on PureApplication System.
Base VM disks cannot reside on external storage.
–Depending on extent size, allows for volumes larger than 8TB in size
–Requires both TCP/IP and Fibre Channel connectivity to the external device
• All volume management is performed outside of the system
–Volumes are allocated and deleted by admin on external device
–Alternate storage providers, RAID configurations, or combinations of HDD and SSD may be used
–Volumes may be mirrored externally (e.g., SVC–managed mirroring across multiple devices)
–Volumes may be replicated externally (e.g., SVC to SVC replication between data centers)
• Advanced scenarios: sharing access to the same SVC cluster or V7000, or to replicated volumes:
–Two systems sharing access to cluster or to replicated volumes
–PureApplication System and PureApplication Software sharing access to cluster or replicated volumes
Tools: IBM GPFS (General Parallel File System) / Spectrum Scale
• GPFS is:
–A shared filesystem (like NFS)
–Optionally: a clustered filesystem (unlike NFS) providing HA and high performance.
Note: clustering supported on Power Systems beginning with V2.2.
–Optionally: mirrored between cloud groups or systems
• A tiebreaker (on third rack or external system) is required for quorum
• Mirroring is not recommended above 1–3ms (~100–300km) latency
–Optionally: (using block storage or external storage replication) replicated between systems
[Diagram: four GPFS topologies (Shared, Clustered, Mirrored, Replicated), each showing clients, GPFS servers, and their data; the mirrored topology spans two locations and includes a tiebreaker.]
Tools: Multi–system deployment
• Connect systems in a “deployment subdomain” for cross–system pattern deployment
–Virtual machines for individual vsys.next or vapp deployments may be distributed across systems
–Allows for easier deployment and management of highly available applications using a single pattern
–Systems may be located in same or different data centers
• Notes and restrictions
–Up to four systems may be connected (limit is two systems prior to V2.2)
–Inter–system network latencies must be less than 3ms (~300km)
–An external 1GB iSCSI tiebreaker target must be configured for quorum purposes
–Special network configuration is required for inter–system management communications
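Before committing to a multi-system topology, it is worth sanity-checking the latency budget above. A small, generic probe (placeholder host and port; measure against your actual inter-site management path):

    import socket, statistics, time

    def tcp_rtt_ms(host: str, port: int, samples: int = 10) -> float:
        """Median TCP connect round-trip time in milliseconds."""
        results = []
        for _ in range(samples):
            start = time.perf_counter()
            with socket.create_connection((host, port), timeout=5):
                pass
            results.append((time.perf_counter() - start) * 1000.0)
        return statistics.median(results)

    rtt = tcp_rtt_ms("peer-system.example.com", 443)   # placeholder peer address
    print(f"median RTT {rtt:.2f} ms", "(within the 3ms budget)" if rtt < 3.0 else "(too high)")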
Composing tools
to meet your requirements
Scenario: Test application, middleware, or schema update
Copy block storage from production application for use in testing
[Diagram: the production deployment (App, Database, Data volume) alongside a test deployment (Test App, Database) attached to a copy of the Data volume.]
Scenario: Update application or middleware
When both the current and new application and middleware can share the same database
without conflict (e.g., no changes to database schema), you can run the newer version of the
application or middleware side by side for testing, and then eventually direct clients to the new
version and retire the old version.
[Diagram: the existing App and a new App V2 running side by side against the same Database and Data volume.]
Scenario: Backward incompatible updates to database or schema
In some cases, a new version of an application, database server, or database schema may be
unable to coexist with the existing application. In this case, you can use the “copy” strategy on a
previous slide to test the upgrade of your application. When you are ready to promote the new
version to production, you can detach the block storage from the existing deployment and attach
it to the upgraded deployment.
[Diagram: the Data volume is detached from the existing App/Database deployment and attached to the upgraded App V2/DB V2 deployment.]
Scenario: HA planning for compute node failure
Principles:
• Deploy multiple instances of each service so that each service continues if one instance is lost
• Enable cloud group or system HA so that failed instances can be recovered quickly
[Diagram: a load balancer in front of two App instances, a DB primary and DB secondary linked by HADR (each with its own Data volume), and a GPFS cluster holding shared Data.]
Scenario: recovery planning for VM failure or corruption
Three scenarios:
• Backup and restore of the VM itself is feasible if it can be recovered in place
• If the VM cannot be recovered:
–If the VM is part of a horizontally scalable cluster, you can scale
in to remove the failed VM and scale out to create a new VM
–If the VM is not horizontally scalable, you must plan to re–deploy it:
• You can deploy the entire pattern again and recover the data to it
• You may be able to deploy a new pattern that recreates only the failed VM,
and use manual or scripted configuration to reconnect it to your existing
deployment
Scenario: recovery planning for database corruption
You may use your database’s own capabilities for backup and restore, import and export.
Alternatively, you may use block storage copies (and optionally export and import) to back up your
database. Attach the backup copy (importing it beforehand if necessary) to restore.
[Diagram: a copy of the production Data volume is taken (and optionally exported and imported); the backup copy is attached in place of the corrupted volume to restore.]
Scenario: HA planning for system or site failure
• As with planning for compute node failure, deploy multiple instances: now across systems.
• You may deploy separately on each system, or use multi–system deployment across systems.
• Distance at which HA is possible is limited.
• GPFS clustering is optional. It can provide additional throughput and also additional availability
on a single system.
[Diagram: a load balancer spanning System A and System B, each running an App instance and a GPFS cluster; the DB primary on System A and the DB secondary on System B are linked by HADR, the GPFS data is mirrored between systems, and a tiebreaker sits outside both.]
Scenario: Two–tier HA planning for system or site failure
• Compared to the previous slide, if you desire HA both within a site and also between sites, you
must duplicate your application, database and filesystem both within and between sites.
• Native database replication between sites must be synchronous, or may be asynchronous if you
have no need of GPFS (see the transaction consistency principle above).
[Diagram: Site A runs the App, a DB primary and DB secondary linked by HADR, and a GPFS cluster; Site B runs another App (which may be standby), a further HADR DB secondary, and a GPFS cluster mirrored with Site A's; a load balancer or DNS fronts both sites and a tiebreaker provides quorum.]
Scenario: DR planning for rack or site failure
• You should expect nonzero RPO if the sites are too far apart to allow synchronous replication
• Applications must be quiesced at the recovery site because replicated disks are inaccessible
• Here the database is replicated using disk replication for transaction consistency. You can use
native database replication (as in the two–tier HA scenario above) only if it is synchronous, or
asynchronously only if you have no need of GPFS (see the transaction consistency principle above).
[Diagram: System A runs the App, an HADR DB primary and secondary, and a GPFS cluster; System B holds a quiesced copy of the same stack; the GPFS data and the database disks are replicated from System A to System B, with a load balancer or DNS in front.]
Scenario: horizontal scaling and bursting
• The base scaling policy lets you scale horizontally, manually or in some cases automatically, by
adding new instances of a virtual machine with clustered software components.
• When using multi–system deployment, horizontally scaled virtual machines will be distributed
as much as possible across systems referenced in your environment profile
• An alternate approach, especially in heterogeneous environments like PureApplication System
and PureApplication Service, is to deploy new pattern instances for scaling or bursting, and
federate them together.
Caveats
Caveats: Networking considerations
• Some middleware is sensitive to IP addresses and hostnames (e.g., WAS), so for DR purposes
you may need to plan to duplicate either IP addresses or hostnames in your backup data center
• Both HA architectures and zero–RPO DR architectures are sensitive to latency. If latency is too
high you can experience poor write throughput or even mirroring or replication failure. For
these cases you should ideally plan for less than 1ms (~100km) of latency between sites.
• You must also plan for adequate network throughput between sites when mirroring or
replicating.
• HA architectures require the use of a tiebreaker to govern quorum–leader determination in case
of a network split. In a multi–site HA design, you should plan to locate the quorum at a third
location, with equally low latency.
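A back-of-the-envelope helper for the throughput point above (an added illustration, not a sizing tool): sustained replication bandwidth must at least cover the data change rate, or the replica falls behind and your effective RPO grows.

    def required_mbps(changed_gb_per_hour: float, headroom: float = 1.5) -> float:
        """Approximate sustained bandwidth (megabits/s) needed to keep up,
        with a headroom factor for bursts and protocol overhead."""
        bytes_per_s = changed_gb_per_hour * 1024**3 / 3600
        return bytes_per_s * 8 / 1e6 * headroom

    # About 50 GB of changed data per hour needs on the order of 180 Mbit/s sustained.
    print(round(required_mbps(50), 1))   # ~179.0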
Caveats: Middleware–specific considerations
• Combining both mirroring and replication (Active–Active–Passive–Passive)
–The IBM GPFS pattern does not support combining both mirroring and replication
–This combination is possible for other middleware (e.g., DB2 as in the DR scenario above), but you must manually
determine and designate which instance is Primary or Secondary at the time of recovery
• Carefully read your middleware’s recommendations for configuring HA. For example:
–IBM WebSphere recommends against cross–site cells
–The IBM DB2 HADR pattern preconfigures a reservationless IP–based tiebreaker, which is not
recommended
–IBM DB2 HADR provides a variety of synchronization modes with different RPO characteristics
• Ensure your middleware tolerates attaching existing storage if you replicate or copy volumes
–The IBM DB2 HADR pattern requires an empty disk when first deploying. You can attach a new disk or
replicate into this disk only after deployment.
–The IBM GPFS pattern does not support attaching existing GPFS disks
Caveats: Virtual machine backup and restore
The power and flexibility of PureApplication patterns means that your PureApplication VMs are
tightly integrated both within a single deployment, and with the system on which they are
deployed.
Because of this tight integration, you cannot use backup and restore techniques to recover your
PureApplication VMs unless you are recovering to the exact same virtual machine that was
previously backed up.
Your cloud strategy for recovering corrupted deployments should build on the efficiency and
repeatability of patterns so that you are able to re–deploy in the event of extreme failure
scenarios such as accidental virtual machine deletion or total system failure.
Caveats: Practice, practice, practice
Because of the complexity of HA and DR implementation, and especially because of some of the
caveats we have noted and which you may encounter in your unique situation, it is vital for you to
practice all aspects of your HA or DR implementation and lifecycle before you roll it out into
production.
This includes testing network bandwidth and latency to their expected limits. It also includes
simulating failures and verifying and perfecting your procedures for recovery and also for failback.
Resources
Resources
• Implementing High Availability and Disaster Recovery in IBM PureApplication Systems V2
http://www.redbooks.ibm.com/abstracts/sg248246.html
• “Implement multisystem management and deployment with IBM PureApplication System”
http://www.ibm.com/developerworks/websphere/techjournal/1506_vanrun/1506_vanrun-trs.html
• “Demystifying virtual machine placement in IBM PureApplication System”
http://www.ibm.com/developerworks/websphere/library/techarticles/1605_moonen-trs/1605_moonen.html
Resources, continued
• “High availability (again) versus continuous availability”
http://www.ibm.com/developerworks/websphere/techjournal/1004_webcon/1004_webcon.html
• “Can I run a WebSphere Application Server cell over multiple data centers?”
http://www.ibm.com/developerworks/websphere/techjournal/0606_col_alcott/0606_col_alcott.html#sec1d
• “Increase DB2 availability”
http://www.ibm.com/developerworks/data/library/techarticle/dm-1406db2avail/index.html
• “HADR sync mode”
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/DB2HADR/page/HADR%20sync%20mode
  • 38. Resources, continued • “High availability (again) versus continuous availability” http://www.ibm.com/developerworks/websphere/techjournal/1004_webcon/1004_webcon.ht ml • “Can I run a WebSphere Application Server cell over multiple data centers?” http://www.ibm.com/developerworks/websphere/techjournal/0606_col_alcott/0606_col_alcot t.html#sec1d • “Increase DB2 availability” http://www.ibm.com/developerworks/data/library/techarticle/dm-1406db2avail/index.html • “HADR sync mode” https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/DB2HADR/pag e/HADR%20sync%20mode