IBM MQ High Availability 2019

IBM MQ High Availability
David Ware
IBM MQ Chief Architect
© 2019 IBM Corporation 1

© 2018 IBM Corporation
High availability
2
A system is said to be available if it is able
to perform its required function, such as
successfully process requests from
users.
A requirement, or a capability, of a
system to be operational for a greater
proportion of time than is common for
other, less important, systems.
Available Highly Available
Often, greater availability means greater complexity and cost
A sliding scale
0% 100%

Measuring availability
3
Target
95%
99%
99.5%
99.9%
99.99%
99.999%
99.9999%
Impacts on availability
o Applying maintenance
o Likelihood of outages (meantime to
failure, and speed of recovery)
o Operational errors
Overall availability is the combined
availability of all components
o The platform
o The middleware
o The applications
Yearly outage
~ 18 days
< 4 days
< 2 days
< 9 hours
~ 1 hour
~ 5 minutes
~ 30 seconds
‘5 nines’

Messaging system availability
4
Asynchronous messaging can improve application availability by providing a buffer but the
messaging system itself must be highly available to achieve that
Redundancy
Multiple active options
available for applications to
connect
Routing
Ability to route messages
around failures
Message availability
Critical messages are not
locked to a single runtime
and quickly available from
elsewhere

Messaging system availability
5
Redundancy
connect
Routing
around failures
elsewhere
Required for continuous
availabilities Only required for certain message flows

Messaging system availability – MQ
6
Redundancy
connect
Routing
around failures
elsewhere
Application client
connectivity
MQ Cluster routing MQ ‘HA’

Application client connectivity

Decouple the applications from queue managers
Applications locally bound to a queue manager will
limit the availability of the solution.
Running applications remote from the queue
managers, always connecting as MQ clients,
decouples the application and system runtimes,
enabling higher availability.

Step 1
Connect the application as a client
Benefits:
o Ability to support solutions where a queue manager
may fail-over between systems (more later).
o Separates application system requirements from the
queue manager’s, reducing maintenance conflicts and
therefore, availability.
o Restart times on either side can be reduced.
Should be relatively invisible to the application
o Don’t hardcode that connection configuration!
o Use client auto-reconnect to hide a queue manager
restart from the application
Qmgr
App

Step 2
Allow the application to connect to a set of queue managers
Benefits:
o Applications can continue to interact with MQ even
whilst a queue manager is failing over or unavailable
during maintenance
o With multiple applications connected, only a subset
will be impacted by a queue manager outage
How does your application find the queue manager?
o Network routing
o Connection name lists
o Client Channel Definition Tables
The application may need to re-evaluate how it exploits all of
MQ’s capabilities
o Message ordering may change if it is currently
expected across connections
o Applications may be reliant on transitory state:
o Dynamic queues and subscriptions
o Reply messages
o XA transaction recovery
Qmgr
App
Qmgr Qmgr
This might not work for all applications

What do CCDTs enable?
These provide encapsulation and abstraction of
connection information for applications, hiding the
MQ architecture and configuration from the
application
They also enable security, high availability and
workload balancing of clients
Applications simply connect to an abstracted
“queue manager” name (which doesn’t need to be
the actual queue manager name – use an ‘*’)
CCDT defines which real queue managers the
application will connect to. Which could be a single
queue manager or a group of
Across a group, selection can be ordered or
randomised and weighted
CCDT
QMGRP1: QM1
QMGRP2: QM2
QMGRP2: QM3
Application 1
connect: *QMGRP1
Application 2
connect: *QMGRP2
Application 2
connect: *QMGRP2
QM1
QM2
QM3

Creating the CCDTs
CCDTs can be used to represent connection details to multiple queue
managers
Prior to MQ 9.1.2, a CCDT is generated by using MQ tooling to define
CLNTCONN channels that identify the SVRCONN channels
Define multiple CLNTCONNs in a central place to generate the CCDT
It doesn’t have to be any of the queue managers owning the SVRCONNs
Pre-MQ V8: You needed a dedicated queue manager for this purpose
MQ V8+: Use runmqsc –n to remove the need for a queue manager
MQ 9.1.2 added the ability to build your own JSON format CCDT files.
This has also added the ability to define multiple channels of the same name on
different queue managers
A single CCDT for your MQ estate or one per application?
A single CCDT can be easier to create but updates can be expensive
Separate CCDTs make it easier to update when an application’s needs change
Create the QMgrs and define
their SVRCONNs
Centrally define all the
CLNTCONNs that represent
all the QMgrs
Take the CCDT and make
available to the connecting
applications
{
“channel”:[
{
“name”:”ABC”,
”queueManager”:”A”
},
{
“name”:”ABC”,
”queueManager”:”B”
},
]
}

Accessing the CCDTs
CCDT files need to be accessible to the applications connecting to MQ
?
CCDT
HTTP
server
CCDT
QM1
QM2
QM3
Either accessible through the client’s filesystem
User needs to manage distribution of CCDT files themselves
Or remotely over HTTP or FTP
Available for JMS/XMS applications for a number of releases
Added for MQI applications in MQ V9 LTS
MQI App
MQCONN(QMGRP2)
CCDT
QMGR1
QMGR2
QMGR3
CCDT
QM1
QM2
QM3
QM2 QM3QM1 QM2 QM3QM1
MQI App
MQCONN(QMGRP2)
MQI App
MQCONN(QMGRP2)
MQI App
MQCONN(QMGRP2)
CCDT
QMGR1
QMGR2
QMGR3
CCDT
QMGR1
QMGR2
QMGR3
CCDT
QMGR1
QMGR2
QMGR3
MQI App
MQCONN(QMGRP2)
MQI App
MQCONN(QMGRP2)
MQI App
MQCONN(QMGRP2)
MQI App
MQCONN(QMGRP2)

MQ Cluster routing

Routing on availability with MQ Clusters
15
• MQ Clusters provide a way to route messages
based on availability
• In a cluster there can be multiple potential targets
for any message. This alone can improve the
availability of the solution, always providing an
option to process new messages.
• A queue manager in a cluster also has the ability to
route new and old messages based on the
availability of the channels, routing messages to
running queue managers.
• Clustering can also be used to route messages to
active consuming applications.
• Clustering is used by many customers who
operate critical services at scale
• Available on all supported MQ platforms
ServiceService
App 1App 1Client
Service
Queue Manager Queue Manager Queue Manager
Queue Manager

MQ ‘HA’
(message availability)

Message high availability
17
– Consider a single message
– Tied to a single runtime, on a single piece of
hardware
– Any failure locks it away until recovery
completes
The problem The objective
– Messages are not tied to a single anything
– In the event of a failure, there is a fast route
to access the message

18
– Messages are highly available, through
replication
– Only one runtime is the leader and has access
to the messages at a time
– A failure results in a new leader taking over
– Any message is available from any runtime at
any time
– Coordinated access to each message
– A failed runtime does not prevent access to a
message by another runtime
Active / active messages Active / passive messages

19
Active / active messages Active / passive messages
IBM MQ Distributed HA solutions
(but there’s nothing to stop you having multiple different
queue managers to share the load and the risk
More later)
IBM MQ for z/OS shared queues

Coupling Facility
MQ for z/OS shared queues
20
–MQ queue-sharing groups (QSGs)
– Available with z/OS parallel sysplex
• A tightly coupled cluster of independent z/OS instances
– Multiple queue managers are members of a queue
sharing group (QSG)
– Shared queues are held in the Parallel SysPlex
Coupling Facility
• A highly optimised and resilient z/OS technology
– All queue managers in a QSG can access the same
shared queues and their messages
– Benefits:
ü Messages remain available even if a queue manager
fails
ü Pull workload balancing
ü Applications can connect to the group using a QSG
name
ü Removes affinity to a specific queue manager
Queue Sharing Group
QMgr QMgr
QMgr
App
App
App

IBM MQ Distributed HA solutions
21
MQ managed
The resilient data and the automatic takeover is
provided by the MQ system
Externally managed
External mechanisms are relied on to protect
the data and provide automatic takeover
capabilities
System managed HA
QMgr QMgr
Multi-instance
queue managers
QMgr QMgr
MQ Appliance
QMgr
QMgr
Replicated data
queue managers
QMgr QMgr QMgr
new

Externally managed HA

System managed HA
23
The HA manager monitors the MQ system (e.g. a queue
manager in a VM or container), on detecting a failure it
will start a new system, remount storage and reroute
network traffic
– Relies on external, highly available, storage (e.g. SAN)
– A queue manager is unaware of the HA system
– Availability depends on speed to detect problems and
to restart all layers of the system required (e.g. VM
and queue manager)
Examples:
– HA Clusters
• Veritas Cluster Server, IBM PowerHA, Microsoft Cluster
Server
– Cloud platforms
• IBM Cloud, AWS EC2, Azure
– Containers
• Kubernetes, Docker Swarm
QMgr QMgr
HA manager
– Some systems can be relatively slow to restart
– Additional cost of infrastructure
– Multiple moving parts to configure and manage
active
active
passive
passive

Multi-instance queue managers
24
QMgr
– Only as reliable as the network attached storage
– Matching the MQ requirements to filesystem
behaviour can be tricky
– No IP address takeover, use client configuration
instead
– All queue manager data is held on network attached
storage (e.g. NFS, IBM Spectrum Scale).
– Two systems are running, both have an instance of
the same queue manager, pointed to the same
storage. One is active, the other is in standby.
– A failure of the active instance is detected by the
standby through regularly attempting to take
filesystem locks.
– The queue manager with the locks is the active
instance.
– Faster takeover, less to restart
– Cheaper – less specialised software or administration
skills needed
– Wide platform coverage, Windows, Unix, Linux
QMgr
active
active
active
passive

MQ managed HA

© 2018 IBM Corporation© 2018 IBM Corporation
o A pair of MQ Appliances are connected together and
configured as an HA group
o Queue managers created on one appliance can be
automatically replicated, along with all the MQ data, to the
other
o Appliances monitor each other
o Automatic failover, plus manual failover for
migration or maintenance
o Independent failover for queue managers so both
appliances can run workload (active / active load)
o Optional IP address associated with an HA queue
manager, automatically adopted by the active HA
appliance – single logical endpoint for client apps
o No persistent data loss on failure
o No external storage
o No additional skills required
IBM MQ Appliance
QMgrQMgr
QMgrQMgr
QMgr
active
active
QMgr
active
passive

MQ HA Group
Node 2 Node 3Node 1
Synchronous data replication
27
Replicated Data Queue Managers
o Linux only, MQ Advanced HA solution with no
need for a shared file system or HA cluster
o MQ configures the underlying resources to make
setup and operations natural to an MQ user
o Three-way replication for quorum support
o Synchronous data replication for once and once
only transactional delivery of messages
o Active/passive queue managers with automatic
takeover
o Per queue manager control to support
active/active utilisation of nodes
o Per queue manager IP address to provide simple
application setup
o Supported on RHEL v7 x86-64 only
Monitoring
App
Network
New in V9.0.4 CD / V9.1 LTS MQ Advanced
for Linux

MQ HA Group
Node 2 Node 3Node 1
Synchronous data replication
28
Monitoring
App
Networko Linux only, MQ Advanced HA solution with no
need for a shared file system or HA cluster
o MQ configures the underlying resources to make
setup and operations natural to an MQ user
o Three-way replication for quorum support
o Synchronous data replication for once and once
only transactional delivery of messages
o Active/passive queue managers with automatic
takeover
o Per queue manager control to support
active/active utilisation of nodes
o Per queue manager IP address to provide simple
application setup
o Supported on RHEL v7 x86-64 only
New in V9.0.4 CD / V9.1 LTS MQ Advanced
for Linux

MQ HA Group
Node 2 Node 3Node 1
29
QM 1
App
NetworkRecommended deployment pattern:
o Spread the workload across multiple queue
managers and distribute them across all three
nodes
o Even better, more than one queue manager per
node for better failover distribution
o Use MQ Clusters for additional routing of
messages to work around problems
o Pair this with the new Uniform Cluster capability
http://ibm.biz/MQ-UniCluster
MQ licensing is aligned to maximise benefits
o One full IBM MQ Advanced license and two
High Availability Replica licenses (previously
named Idle Standby)
App App

Node 2Node 1
data replication
36
External/MQ managed HA with RDQM App
Disaster recovery
9.0.5 CD MQ Advanced added the ability to build a
looser coupled pair of nodes for data replication with
manual failover
Data replication can be
Asynchronous for systems separated by a high
latency network
Synchronous for systems on a low latency
network
No automatic takeover means no need for a third
node to provide a quorum
Think 2018 / March 21, 2018 / © 2018 IBM Corporation

© 2019 IBM Corporation41
High Availability with Kubernetes
The RDQM solution does not apply to container
environments
High availability of the MQ data requires highly
available replicated storage
Container orchestrators such as Kubernetes handle
much of the monitoring and restart
responsibilities…
Node 1 Node 3Node 2
HA network
storage
Pod 1
StatefulSet, Replicas=1
Kubernetes

environments
responsibilities…
…but not all. StatefullSets such as MQ are not
automatically restarted following a Kubernetes
node failure
Node 3Node 2
HA network
storage
Kubernetes

environments
responsibilities…
…but not all. StatefullSets such as MQ are not
automatically restarted following a Kubernetes
node failure
The MQ container image and Certified Container
now supports a two-replica multi-instance queue
manager deployment pattern to handle Kubernetes
node failures
Node 1 Node 3Node 2
HA network
storage
Pod 1
multi-instance
active
Pod 2
multi-instance
standby
Kubernetes
IBM MQ 9.1.3 CD

Cost of a restart

Speed of failover
45
System managed
QMgr QMgr
Multi-instance
queue managers
QMgr QMgr
MQ Appliance
QMgr
QMgr
Replicated data
queue managers
QMgr QMgr QMgr
Starting a whole VM can be slow
Containers can b e much quicker
Recovering data from network attached
storage can be slow
Detect failure
Restart underlying system
Start queue manager
Recover messaging state
Reconnect applications
Reliant on storage configuration
Reconnecting applications
o The applications must also detect the failure
before attempting to reconnect
o Make sure the channel heartbeat interval is
set suitably low
Message state recovery
o This time is very dependent on the workload of
the queue manager
o High persistent traffic load and deep queues
can significantly increase the time needed
9.1.1/9.1.2 made
significant improvements
https://developer.ibm.com/messaging/2019/06/12/improved-switch-fail-over-times-in-mq-v9-1-2/

What, where?

Which HA fits where
47
Shared
queues
System
managed
Multi instance RDQM Appliance HA
z/OS ✓
Distributed
platforms
✓ ✓ ✓**
Containers ✓* ✓
MQ Appliance ✓
* This will depend heavily on the capabilities of the container management layer
** RHEL x86 only, bare metal and virtual machines

Pulling it all together

Building a highly available system
Decouple the applications from the
underlying MQ infrastructure
MQ
App
App
App
App
App
App
App
App
Service requestor or
event emitters
Service provider or
long running consumer

High availability of the individual MQ runtimes
should be baked into the design
• Remove any single point of failure
App
App
App
App
App
App
App
App
Make each queue manager
highly available
Have multiple equivalent queue managers

The applications should be designed and
configured to maximise the availability of the
MQ runtime
App
App
App
App
App
App
App
App
Use CCDTs and design
applications to be able to
connect to one of many
queue managers
Connect instances of
your service applications
to more than one queue
manager
MQ 9.1.2 introduced the
Uniform Cluster capability
to aid in building these
topologies
http://ibm.biz/MQ-UniCluster

IBM MQ High Availability 2019

More Related Content

What's hot (20)

Similar to IBM MQ High Availability 2019 (20)

More from David Ware (9)

Recently uploaded (20)

IBM MQ High Availability 2019