High Availability
on Linux
Roger Zhou
zzhou@suse.com
openSUSE.Asia
Summit 2015
SUSE
way
© SUSE, All rights reserved.
CURIOSITY in the land
of Linux High Availability
Agenda
• HA architectural components
• Use case examples
• Future outlook & Demo
What is Cluster?
• HPC (super computing)
• Load Balancer (Very high capacity)
• High Availability
‒ 99.999% availability ≈ 5 minutes of downtime per year
‒ SPOF(single point of failure)
‒ Murphy's Law
"Everything that can go wrong will
go wrong"
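The "five nines" figure can be checked with a little arithmetic. The sketch below is only an illustration; the function name and the choice of an average year of 365.25 days are assumptions, not part of the talk.

```python
# Minutes in an average year (365.25 days, to account for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the downtime budget for a few common availability targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% availability -> {minutes:.2f} minutes of downtime per year")
```

For 99.999% this works out to roughly 5.26 minutes per year, which is where the "5 minutes" rule of thumb comes from.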
"HA", widely used, often confusing
• VMware vSphere HA
‒ Hypervisor and hardware level. Closed source.
‒ Agnostic to the guest OS inside the VM.
• SUSE HA
‒ Inside Linux OS.
‒ That said, Windows needs its own Windows HA solution.
• Different Industries
‒ We are Enterprise.
‒ Hadoop HA (dot-com)
‒ OpenStack HA (IaaS)
‒ OpenSAF (telecom)
History of HA in Linux OS
• 1990s: the Heartbeat project. Two nodes only.
• Early 2000s: Heartbeat 2.0 became too complex.
‒ Industry demanded a split:
1) one component for cluster membership
2) one component for resource management
• Today: ClusterLabs.org
‒ Pacemaker + Corosync: in the early days, a completely different solution.
‒ Meanwhile merged with the Heartbeat project.
‒ 2015 HA Summit
Typical HA Problem - Split Brain
• Clustering
‒ multiple nodes share the same resources.
• Split partitions run the same service
‒ This breaks data integrity!
• Two key concepts as the solution:
‒ Fencing
Cluster doesn't accept any confusing state.
STONITH - "shoot the other node in the head".
‒ Quorum
It stands for "majority". No quorum, then no actions,
no resource management, no fencing.
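The quorum rule can be illustrated with a strict-majority check. This is a minimal sketch assuming one vote per node; the function names are hypothetical and this is not how Corosync implements its quorum service.

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """A partition is quorate only if it holds a strict majority of all votes."""
    return votes_present > total_votes // 2


def may_manage_resources(votes_present: int, total_votes: int) -> bool:
    """No quorum means no actions: no resource management and no fencing."""
    return has_quorum(votes_present, total_votes)


if __name__ == "__main__":
    # A 5-node cluster splits into partitions of 3 and 2 nodes:
    # only the 3-node side keeps quorum, so only one side runs the service.
    for partition_size in (3, 2):
        print(partition_size, "of 5 nodes -> quorate:", has_quorum(partition_size, 5))

    # A 4-node cluster splits 2/2: neither side has a strict majority.
    print("2 of 4 nodes -> quorate:", has_quorum(2, 4))
```

Because at most one partition can hold a strict majority, two split partitions can never both decide to run the same service.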
HA Hardware Components
• Multiple networks
‒ A user network for end user access.
‒ A dedicated network for cluster communication/heartbeat.
‒ A dedicated storage network infrastructure.
• Network Bonding
‒ a.k.a. link aggregation
• Fencing/STONITH devices
‒ remote power switch
• Shared storage
‒ NAS (NFS/CIFS), SAN (FC/iSCSI)
Architectural Software Components
"clusterlabs.org"
‒ Corosync
‒ Pacemaker
‒ Resource Agents
‒ Fencing/STONITH Devices
‒ UI (crmsh and Hawk2)
‒ Booth for GEO Cluster
Outside of "clusterlabs.org"
‒ LVS: Layer 4, IP + port, kernel space.
‒ HAProxy: Layer 7 / HTTP, user space.
‒ Shared filesystem: OCFS2 / GFS2
‒ Block device replication:
DRBD, cLVM mirroring, cluster-md
‒ Shared storage:
SAN (FC / FCoE / iSCSI)
‒ Multipathing
Software Components in Detail
Corosync: messaging and membership
• Consensus algorithm
‒ "Totem Single Ring Ordering and Membership protocol"
• Closed Process Group
‒ Analogous to the TCP/IP three-way handshake.
‒ Membership handling.
‒ Message ordering.
• A quorum system
‒ notifies apps when quorum is achieved or lost.
• In-memory object database
‒ for Configuration engines and Service engines.
‒ Shared-nothing cluster.
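The idea behind token-based message ordering can be sketched as follows. This is a heavily simplified illustration, not the Totem protocol itself: it assumes a fixed ring, no message loss, no retransmission and no membership changes, and all class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """The token carries the highest sequence number assigned so far."""
    seq: int = 0


@dataclass
class Node:
    name: str
    outbox: list = field(default_factory=list)     # messages waiting for the token
    delivered: list = field(default_factory=list)  # messages delivered, in order


def circulate(ring, token, rounds=1):
    """Pass the token around the ring; only the token holder may broadcast."""
    for _ in range(rounds):
        for holder in ring:
            while holder.outbox:
                # The holder stamps each message with the next global sequence number.
                token.seq += 1
                message = (token.seq, holder.name, holder.outbox.pop(0))
                # Broadcast to every member, including the sender.
                for member in ring:
                    member.delivered.append(message)


if __name__ == "__main__":
    a, b, c = Node("a"), Node("b"), Node("c")
    a.outbox.append("start resource")
    b.outbox.extend(["node joined", "stop resource"])
    circulate([a, b, c], Token())
    # Every node ends up with the same messages in the same order.
    print(a.delivered == b.delivered == c.delivered)
```

Since sequence numbers are handed out only by the current token holder, every member sees one agreed total order of messages.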
Pacemaker: the resource manager
• The brain of the cluster.
• Policy engine for decision making.
‒ To start/stop resources on a node according to the score.
‒ To monitor resources according to interval.
‒ To restart resources if monitor fails.
‒ To fence/STONITH a node if stop operation fails.
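The decisions listed above can be sketched as two small functions. This is an illustration only and not Pacemaker's policy engine; the function names, the node names, and the rule that a negative score excludes a node are assumptions of the sketch.

```python
from typing import Dict, Optional, Set


def choose_node(scores: Dict[str, int], online: Set[str]) -> Optional[str]:
    """Pick the online node with the highest score for a resource.

    Assumption of this sketch: a node with a negative score is never chosen.
    """
    candidates = {node: score for node, score in scores.items()
                  if node in online and score >= 0}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def recovery_action(failed_operation: str) -> str:
    """Map a failed operation to the reaction described on the slide."""
    if failed_operation == "monitor":
        return "restart"  # restart the resource if monitoring fails
    if failed_operation == "stop":
        return "fence"    # fence/STONITH the node if the stop operation fails
    return "none"


if __name__ == "__main__":
    scores = {"node1": 100, "node2": 50}
    print(choose_node(scores, {"node1", "node2"}))  # both nodes online
    print(choose_node(scores, {"node2"}))           # node1 is gone
    print(recovery_action("stop"))
```

A failed stop leads to fencing because the cluster can no longer be sure the resource is really inactive on that node.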
Shoot The Other Node In The Head
• Data integrity does not tolerate any confusing state. Before migrating resources to another node in the cluster, the cluster must confirm that the suspect node really is down.
• STONITH is mandatory for *enterprise* Linux HA clusters.
Popular STONITH devices
• APC PDU
‒ network-based power switch
• Standard Protocols Integrated with Servers
‒ Intel AMT, HP iLO, Dell DRAC, IBM IMM, IPMI Alliance
• Software libraries
‒ to deal with KVM, Xen and VMware VMs.
• Software based
‒ SBD (STONITH Block Device) to do self termination.
The last implicit option in the fencing topology.
• NOTE: Fencing devices can be chained.
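Chaining fencing devices can be pictured as trying each device in order until one confirms the node is off. The sketch below is hypothetical: the device names and callables stand in for real fence agents and are not an actual fencing API.

```python
from typing import Callable, List, Tuple

# A fence device is modelled as (name, function that returns True on success).
FenceDevice = Tuple[str, Callable[[str], bool]]


def fence_node(target: str, chain: List[FenceDevice]) -> str:
    """Try each device in order; return the name of the first one that succeeds."""
    for name, power_off in chain:
        if power_off(target):
            return name
    # Without a confirmed fence, the cluster must not recover the resources.
    raise RuntimeError(f"could not fence {target}")


if __name__ == "__main__":
    # Hypothetical chain: the management board is unreachable, the PDU works,
    # and SBD is kept as the last option.
    chain = [
        ("ipmi", lambda node: False),
        ("apc-pdu", lambda node: True),
        ("sbd", lambda node: True),
    ]
    print("fenced via", fence_node("node2", chain))
```

The order of the chain matters: later devices are only used when the earlier ones fail.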
Resource Agents (RAs)
• Write an RA for your application
• LSB shell scripts:
start / stop / monitor
• More than a hundred contributors in the upstream GitHub repository.
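The start / stop / monitor contract can be sketched in a few lines. Real resource agents are usually shell scripts; this Python version is only an illustration. The state file is a hypothetical stand-in for a real service, and the exit codes follow the OCF convention (0 = success, 3 = unimplemented, 7 = not running), which is an assumption beyond what the slide states.

```python
import os
import sys
import tempfile

# Exit codes following the OCF resource agent convention (assumed here).
OCF_SUCCESS = 0
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING = 7

# Hypothetical state file standing in for a real service.
STATE_FILE = os.path.join(tempfile.gettempdir(), "demo-ra.state")


def monitor() -> int:
    """Report whether the resource is currently running."""
    return OCF_SUCCESS if os.path.exists(STATE_FILE) else OCF_NOT_RUNNING


def start() -> int:
    """Start the resource, then verify it with monitor."""
    with open(STATE_FILE, "w") as handle:
        handle.write("running\n")
    return monitor()


def stop() -> int:
    """Stop the resource; stopping an already stopped resource still succeeds."""
    if os.path.exists(STATE_FILE):
        os.remove(STATE_FILE)
    return OCF_SUCCESS


ACTIONS = {"start": start, "stop": stop, "monitor": monitor}


def main(argv) -> int:
    """Dispatch on the action name given as the first argument."""
    if len(argv) != 2 or argv[1] not in ACTIONS:
        return OCF_ERR_UNIMPLEMENTED
    return ACTIONS[argv[1]]()


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Invoked as a script with `start`, `stop` or `monitor` as its single argument, it reports the result through its exit code, which is how a cluster manager would learn the state of the resource.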
Cluster Filesystem
• OCFS2 / GFS2
‒ On the shared storage.
‒ Multiple nodes concurrently access the same filesystem.
http://clusterlabs.org/doc/fr/Pacemaker/1.1/html/Clusters_from_Scratch/_pacemaker_architecture.html
Cluster Block Device
• DRBD
‒ network based raid1.
‒ high performance data replication over network.
• cLVM2 + cmirrord
‒ Clustered lvm2 mirroring.
‒ Multiple nodes can manipulate volumes on the shared disk.
‒ clvmd distributes LVM metadata updates in the cluster.
‒ Data replication speed is way too slow.
• Cluster md raid1
‒ multiple nodes use the shared disks as md-raid1.
‒ High performance raid1 solution in cluster.
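"Network-based RAID 1" means that every write lands on a local copy and on a peer copy. The toy model below is a sketch under assumptions: the class and method names are hypothetical, blocks are kept in dictionaries, and the "remember what the peer missed, copy it later" behaviour is a simplification rather than DRBD's actual implementation.

```python
class MirroredDevice:
    """Toy model of a mirrored block device with one local and one peer copy."""

    def __init__(self):
        self.local = {}             # block number -> data on this node
        self.peer = {}              # block number -> data on the peer node
        self.peer_connected = True
        self.out_of_sync = set()    # blocks the peer missed while disconnected

    def write(self, block: int, data: bytes) -> None:
        """Write locally and, if the peer is reachable, to the peer as well."""
        self.local[block] = data
        if self.peer_connected:
            self.peer[block] = data
        else:
            self.out_of_sync.add(block)

    def reconnect_and_resync(self) -> None:
        """Copy only the blocks the peer missed, then resume mirroring."""
        for block in sorted(self.out_of_sync):
            self.peer[block] = self.local[block]
        self.out_of_sync.clear()
        self.peer_connected = True


if __name__ == "__main__":
    device = MirroredDevice()
    device.write(0, b"hello")
    device.peer_connected = False   # replication link goes down
    device.write(1, b"world")
    device.reconnect_and_resync()
    print(device.local == device.peer)
```

Tracking only the missed blocks keeps the resynchronisation short compared with copying the whole device again.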
Cluster Examples
NFS Server (Highly Available NAS)
Cluster Example in Diagram
• VM1: DRBD Master, LVM, Filesystem (ext4), NFS exports, Virtual IP
• VM2: DRBD Slave, LVM, Filesystem (ext4), NFS exports, Virtual IP
• Failover between VM1 and VM2: Pacemaker + Corosync
• Kernel to kernel: network raid1
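A failover of this stack can be thought of as stopping the layers top-down on one node and starting them bottom-up on the other. The sketch below assumes that the order in which the diagram lists the layers is the start order; the function names are hypothetical and "start" is used loosely (for DRBD it would really be a promotion to master).

```python
# Layers of the NFS example, in the (assumed) start order.
STACK = ["DRBD Master", "LVM", "Filesystem (ext4)", "NFS exports", "Virtual IP"]


def start_order(stack):
    """Resources start from the bottom of the stack to the top."""
    return list(stack)


def stop_order(stack):
    """Resources stop in the reverse of the start order."""
    return list(reversed(stack))


def failover(stack, from_node, to_node):
    """Return the ordered steps that move the whole stack to another node."""
    steps = [f"stop {resource} on {from_node}" for resource in stop_order(stack)]
    steps += [f"start {resource} on {to_node}" for resource in start_order(stack)]
    return steps


if __name__ == "__main__":
    for step in failover(STACK, "VM1", "VM2"):
        print(step)
```

The virtual IP moves last on the way up, so clients only reach the new node once the NFS exports underneath are ready.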
HA iSCSI Server ( Active/Passive )
Cluster Example in Diagram
• VM1: DRBD Master, LVM, iSCSITarget, iSCSILogicalUnit, Virtual IP
• VM2: DRBD Slave, LVM, iSCSITarget, iSCSILogicalUnit, Virtual IP
• Failover between VM1 and VM2: Pacemaker + Corosync
• Kernel to kernel: network raid1
Cluster FS - OCFS2 on shared disk
Cluster Example in Diagram
• Hosts: Host1, Host2, Host3
• Per-host stack: Kernel, Cluster MD-RAID1 ( /dev/md127 ), OCFS2
• Shared content: Virtual Machine Images and Configuration ( /mnt/images/ )
• Replication / Backup
• VM migration between hosts
• Pacemaker + Corosync
Future outlook
• Upstream activities
‒ OpenStack: from the control plane into the compute domain.
‒ Scalability of corosync/pacemaker
‒ Docker adoption
‒ “Zero” Downtime HA VM
‒ ...
Join us ( all *open-source* )
• Play with Leap 42.1: http://www.opensuse.org
Doc: https://www.suse.com/documentation/sle-ha-12/
• Report and Fix Bugs: http://bugzilla.opensuse.org
• Discussion: opensuse-ha@opensuse.org
• HA ClusterLabs: http://clusterlabs.org/
• General HA Users: users@clusterlabs.org
Demo + Q&A + Have fun