Overlapping Ring Monitoring Algorithm in TIPC
Jon Maloy, Ericsson Canada Inc. Montreal
April 7th 2017
PURPOSE
When a cluster node becomes unresponsive due to crash, reboot or lost connectivity we want to:
 Have all affected connections on the remaining nodes aborted
 Inform other users who have subscribed for cluster connectivity events
 Within a well-defined, short interval from the occurrence of the event
COMMON SOLUTIONS
1) Crank up the connection keepalive timer
 Network and CPU load quickly get out of hand when there are thousands of connections
 Does not provide a neighbor monitoring service that can be used by others
2) Dedicated full-mesh framework of per-node daemons with frequently probed connections
 Even here, monitoring traffic becomes overwhelming when cluster size > 100 nodes
 Does not automatically abort any other connections
TIPC SOLUTION: HIERARCHY + FULL MESH
 Full-mesh framework of frequently probed node-to-node “links”
 At kernel level
 Provides a generic neighbor monitoring service
 Each link endpoint keeps track of all connections to its peer node
 Issues “ABORT” messages to its local socket endpoints when connectivity to the peer node is lost
 Even this solution causes excessive traffic beyond ~100 nodes
 CPU load grows with ~N
 Network load grows with ~N*(N-1)
OTHER SOLUTION: RING
 Each node monitors its two nearest neighbors by heartbeats
 Low monitoring network overhead; increases by only ~2*N
 Node loss can also be detected through loss of an iterating token
 Both solutions are offered by Corosync
 Hard to handle accidental network partitioning
 How do we detect loss of nodes not adjacent to the fracture point in the opposite partition?
 Consensus on ring topology required
OTHER SOLUTION: GOSSIP PROTOCOL
 Each node periodically transmits its known network view to a randomly selected set of known neighbors
 Each node knows and monitors only a subset of all nodes
 Scales extremely well
 Used by the BitTorrent client Tribler
 Non-deterministic delay until all cluster nodes are informed
 Potentially very long because of the periodic and random nature of event propagation
 Unpredictable number of generations to reach the last node
 Extra network overhead because of duplicate information spreading
THE CHALLENGE
Finding an algorithm which:
 Has the scalability of Gossip, but with
 A deterministic set of peer nodes for each node to monitor and receive updates from
 A predictable number of propagation generations before all nodes are reached
 Predictable, well-defined and short event propagation delay
 Has the lightweight properties of ring monitoring, but
 Is able to handle accidental network partitioning
 Has the full-mesh link connectivity of TIPC, but
 Does not require full-mesh active monitoring
THE ANSWER: OVERLAPPING RING MONITORING
 Sort all cluster nodes into a circular list
 All nodes use same algorithm and criteria
 Select the next [√N] - 1 downstream nodes in the list as the “local domain” to be actively monitored
 CPU load increases by only ~√N
 Distribute a record describing the local domain to all other nodes in the cluster
 Select and monitor a set of “head” nodes outside the local domain, so that no node is more than two active monitoring hops away
 There will be [√N] - 1 such nodes
 Guarantees failure discovery even under accidental network partitioning
 Each node now monitors 2 x (√N – 1) neighbors
• 6 neighbors in a 16 node cluster
• 56 neighbors in an 800 node cluster
 All nodes use this algorithm
 In total 2 x (√N - 1) x N actively monitored links
• 96 links in a 16 node cluster
• 44,800 links in an 800 node cluster
[(√N – 1) local domain destinations + (√N – 1) remote “head” destinations] x N = 2 x N x (√N – 1) actively monitored links
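As an illustration of the selection rule above, here is a minimal C sketch, assuming a membership array already sorted into the circular list; all names are hypothetical (the real implementation lives in net/tipc/monitor.c in the Linux kernel):

/* Minimal sketch of the monitoring view selection described above. */
#include <math.h>

struct node {
    unsigned int addr;          /* node identity, also the sort key */
    int actively_monitored;     /* set if we probe this node directly */
};

/* nodes[] holds the cluster membership sorted into the circular list,
 * rotated so that nodes[0] is the local node. */
void select_monitoring_view(struct node *nodes, int n)
{
    int dom = (int)ceil(sqrt(n)) - 1;   /* local domain size, [sqrt(N)] - 1 */
    int i;

    /* Local domain: the next dom downstream neighbors. */
    for (i = 1; i <= dom && i < n; i++)
        nodes[i].actively_monitored = 1;

    /* Heads: the first node of each following domain, so that no node
     * is more than two active monitoring hops away. */
    for (i = dom + 1; i < n; i += dom + 1)
        nodes[i].actively_monitored = 1;
}

For N = 16 this marks nodes 1-3 as the local domain and nodes 4, 8 and 12 as heads, i.e. the six actively monitored neighbors quoted above.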
LOSS OF LOCAL DOMAIN NODE
[Figure: a state change of a local domain node is detected]
 A domain record is sent to all other nodes in the cluster when any state change (discovery, loss, re-establishment) is detected in a local domain node
 The record carries a generation id, so the receiver can tell whether it really contains a change before it starts parsing and applying it
 It is piggy-backed on the regular unicast link state/probe messages, which must always be sent out after a domain state change
 It may be sent several times, until the receiver acknowledges reception of the current generation
 Because probing is driven by a background timer, it may take up to 375 ms (configurable) until all nodes are updated
[Figure: the domain record is distributed to all other nodes in the cluster]
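A domain record might look roughly like the sketch below; this is a hypothetical layout (the actual wire format is struct tipc_mon_domain in net/tipc/monitor.c). The generation check is what lets a receiver discard duplicates without parsing:

/* Hypothetical sketch of a domain record and its generation check. */
#include <stdint.h>

#define MAX_DOMAIN 64

struct domain_record {
    uint16_t len;                 /* record length in bytes */
    uint16_t gen;                 /* generation id, bumped on every change */
    uint16_t ack_gen;             /* latest generation received from peer */
    uint16_t member_cnt;          /* number of local domain members */
    uint64_t up_map;              /* one bit per member: up or down */
    uint32_t members[MAX_DOMAIN]; /* addresses of the domain members */
};

/* Receiver side: parse and apply only if the generation advanced.
 * The signed cast keeps the comparison correct across wrap-around. */
int domain_rec_is_new(const struct domain_record *rec, uint16_t applied_gen)
{
    return (int16_t)(rec->gen - applied_gen) > 0;
}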
LOSS OF ACTIVELY MONITORED HEAD NODE
[Figure: node failure detected; brief confirmation probing of the lost node’s domain members; the monitoring view after recalculation]
 The two-hop criterion plus confirmation probing eliminates the network partitioning problem
 If we really have a partition, the worst-case failure detection time will be
 Tfailmax = 2 x active failure detection time
 Active failure detection time is configurable
 50 ms – 10 s
 Default 1.5 s in TIPC/Linux 4.7
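A minimal sketch of the confirmation step, reusing the types from the earlier sketches; the helper functions are assumptions, not real TIPC functions:

/* Hypothetical helpers, assumed to exist for this sketch: */
const struct domain_record *last_record_from(const struct node *head);
void start_confirmation_probe(unsigned int addr);
void recalc_monitoring_view(void);

/* When an actively monitored head is lost, its domain members are no
 * longer within two hops, so probe them directly before recalculating. */
void on_head_node_lost(const struct node *head)
{
    const struct domain_record *rec = last_record_from(head);
    int i;

    /* In a true partition every member fails the probe too, adding at
     * most one more detection interval (Tfailmax = 2 x detection time). */
    for (i = 0; i < rec->member_cnt; i++)
        start_confirmation_probe(rec->members[i]);

    /* Recalculate the monitoring view so a surviving node takes over
     * as head for the members that are still reachable. */
    recalc_monitoring_view();
}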
LOSS OF INDIRECTLY MONITORED NODE
[Figure: the actively monitoring neighbors discover the failure, then report it to the rest of the cluster]
 Max one event propagation hop
 Near uniform failure detection time across the whole cluster
 Tfailmax = active failure detection time + (1 x event propagation hop time)
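Putting the two formulas together with the defaults quoted in this deck (1.5 s active detection; the 375 ms probe-timer bound mentioned earlier is assumed here as the worst-case propagation hop):

/* Back-of-envelope worst-case failure detection times. */
#include <stdio.h>

int main(void)
{
    double t_detect = 1.5;   /* default active failure detection time, s */
    double t_hop    = 0.375; /* assumed worst-case propagation hop, s */

    /* Head node lost behind a partition: confirmation probing adds one
     * extra detection interval. */
    printf("partitioned head node: %.3f s\n", 2 * t_detect);

    /* Indirectly monitored node: neighbors detect, then report. */
    printf("indirectly monitored:  %.3f s\n", t_detect + t_hop);
    return 0;
}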
DIFFERING NETWORK VIEWS
Case 1: A node has discovered a peer that nobody else is monitoring
 Actively monitor that node
 Add it to its circular list according to the algorithm (as a local domain member or “head”)
 Handle its domain members according to the algorithm (“applied” or “non-applied”)
 Continue calculating the monitoring view from the next peer
Case 2: A node is unable to discover a peer that others are monitoring
 Don’t add the peer to the circular list
 Ignore it during the calculation of the monitoring view
 Keep it as “non-applied” in the copies of received domain records
 Apply it to the monitoring view if it is discovered at a later moment
Transiently, this happens all the time, and must be considered a normal situation
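The two cases reduce to one reconciliation rule: a reported domain member is only "applied" to the monitoring view once the local node has discovered it itself. A sketch, with a hypothetical helper:

struct peer {
    unsigned int addr;
    int discovered;   /* we have direct contact with this peer */
    int applied;      /* counted when calculating the monitoring view */
};

void add_to_circular_list(unsigned int addr);   /* hypothetical helper */

/* Called for every member found in a received domain record. */
void handle_domain_member(struct peer *p)
{
    if (!p->discovered) {
        /* Case 2: keep as "non-applied" and ignore it during the
         * monitoring view calculation; it may be applied later. */
        p->applied = 0;
        return;
    }
    /* Case 1, or a later discovery: insert into the circular list and
     * include it when the monitoring view is recalculated. */
    p->applied = 1;
    add_to_circular_list(p->addr);
}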
STATUS LISTING OF 16 NODE CLUSTER
[Screenshot: monitoring status listing for a 16 node cluster]
STATUS LISTING OF 600 NODE CLUSTER
THE END