Distributed Systems Theory
for Mere Mortals
Ensar Basri Kahveci
Distributed Systems Engineer, Hazelcast
1
Disclaimer Notice
In this presentation, I talk about distributed systems theory based on my own understanding.
First of all, distributed systems theory is hard. It also covers a wide range of topics.
So, my statements might be wrong or incomplete!
Please raise any point that you find confusing or think is wrong.
2
Agenda
- Defining distributed systems
- System Models
- Time and Order
- Consensus, FLP Result, Failure Detectors
- Consensus Algorithms: 2PC, 3PC, Paxos and others...
3
“A DISTRIBUTED SYSTEM IS ONE IN WHICH THE
FAILURE OF A COMPUTER YOU DID NOT EVEN
KNOW EXISTED CAN RENDER YOUR OWN
COMPUTER UNUSABLE”
Leslie Lamport
4
What is a distributed system?
- Collection of entities (machines, nodes, processes...)
- trying to solve a common problem,
- linked by a network and communicating via passing messages,
- having uncertain and partial knowledge of the system.
5
About being distributed…
- Independent failures
- Some servers might fail while others work correctly.
- Non-negligible message transmission delays
- The interconnection between servers has lower bandwidth and higher latency than that
available within a single server.
- Unreliable communication
- The connections between servers are unreliable compared to the connections within a
server.
6
System Models
7
Interaction Models
- Synchronous
- Asynchronous
- Partially-synchronous
8
Failure Modes
- Fail-stop
- Fail-recover
- Omission failures
- Arbitrary failures (Byzantine)
9
Time and Order
10
Time and Order
- We use time to:
- order events
- measure the duration between events
- In the asynchronous model, nodes have local clocks, which can drift without bound.
- Components of a distributed system behave in an unpredictable manner.
- Failures, rates of advance, delays in network packets etc.
- We cannot assume synchronized clocks while designing our algorithms in the asynchronous
model.
- Clock synchronization methods help us a lot but don’t fix the problem completely.
11
The Idea: Ordering Events
- We don’t have the notion of “now” in distributed systems.
- To what extent do we need it?
- We don’t need absolute clock synchronization.
- If machines don’t interact with each other, why bother synchronizing their clocks?
- For a lot of problems, processes need to agree on the order in which
events occur, rather than the time at which they occur.
12
Ordering Events: Logical Clocks
- We can use Logical Clocks (=Lamport Clocks) [1] to order events in a
distributed system.
- Logical clocks rely on counters and the communication between nodes.
- Each node maintains a local counter value.
- happened-before relationship ( “→” )
- If events a and b are events in the same process, and a comes before b, then a → b
- If a is the sending of a message and b is its receipt, then a → b
- If a → b and b → c, then a → c
- If neither a → b nor b → a holds, a and b are concurrent.
- Partial ordering and total ordering of the events (a minimal sketch follows below)
13
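The Lamport clock update rules fit in a few lines. Below is a minimal sketch in Java (the class and method names are my own, not from any particular library): every local or send event increments the counter, and every receive jumps the counter past the sender’s timestamp, which gives the clock condition a → b ⇒ C(a) < C(b).

// A minimal Lamport clock sketch (illustrative names, not a library API).
public final class LamportClock {
    private long counter = 0;

    // A local event or the sending of a message: just tick the counter.
    public synchronized long tick() {
        return ++counter;
    }

    // Receiving a message stamped with the sender's counter:
    // jump ahead of everything we have seen so far.
    public synchronized long onReceive(long senderCounter) {
        counter = Math.max(counter, senderCounter) + 1;
        return counter;
    }
}

Attaching the returned timestamp to every outgoing message gives a partial order; breaking ties with a node id turns it into a total order.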
Clock Condition
- For any events a, b: if a → b, then C(a) < C(b).
- Can we also infer the reverse?
- If p1 → q2 and q2 → q3, then C(q3) > C(p1).
- Causality: if p1 causes q2 and q2 causes q3, then p1
causes q3.
- p3 and q3 are concurrent events according to
the happened-before relationship.
- Can we infer if there is any causality by comparing C(p3)
and C(q3)?
Image taken from [1] 14
Vector Clocks and Causality
- We use vector clocks to infer
causalities by comparing clock values.
- If V(a) < V(b), then a causally precedes b (a minimal sketch follows below).
Image taken from [2] 15
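A vector clock keeps one counter per node, so comparing two timestamps recovers causality. Below is a minimal sketch in Java for a fixed set of n nodes (illustrative names, not a library API): a causally precedes b exactly when a’s vector is less than or equal to b’s in every entry and strictly less in at least one.

import java.util.Arrays;

// A minimal vector clock sketch for a fixed set of n nodes.
public final class VectorClock {
    private final long[] v;
    private final int myIndex;

    public VectorClock(int n, int myIndex) {
        this.v = new long[n];
        this.myIndex = myIndex;
    }

    // Local event or message send: increment only our own entry.
    public synchronized long[] tick() {
        v[myIndex]++;
        return Arrays.copyOf(v, v.length);
    }

    // Message receive: take the element-wise max with the sender's vector, then tick.
    public synchronized long[] onReceive(long[] other) {
        for (int i = 0; i < v.length; i++) {
            v[i] = Math.max(v[i], other[i]);
        }
        return tick();
    }

    // a causally precedes b iff a <= b in every entry and a != b.
    public static boolean causallyPrecedes(long[] a, long[] b) {
        boolean strictlyLess = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyLess = true;
        }
        return strictlyLess;
    }
}

If neither causallyPrecedes(a, b) nor causallyPrecedes(b, a) holds, a and b are concurrent.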
Are Logical Clocks our only option?
- Google Spanner [3] uses NTP, GPS, and atomic clocks to synchronize the
local clocks of the machines as much as possible.
- It doesn’t pretend that clocks are perfectly synchronized.
- It introduces the uncertainty of clocks into its TrueTime API.
- CockroachDB [4] uses Hybrid Logical Clocks [5], which combine logical
clocks and physical clocks to infer causality (a minimal sketch follows below).
16
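A rough sketch of the HLC update rules from [5], again in Java and with illustrative names: the logical component l tracks the largest physical timestamp seen so far, and the counter c breaks ties among events that share the same l. Real implementations also bound how far l may run ahead of the physical clock.

// A minimal Hybrid Logical Clock sketch following the update rules in [5].
public final class HybridLogicalClock {
    private long l = 0; // logical component: max physical time seen so far
    private long c = 0; // counter to break ties within the same l

    // Local or send event.
    public synchronized long[] now() {
        long pt = System.currentTimeMillis();
        long newL = Math.max(l, pt);
        c = (newL == l) ? c + 1 : 0;
        l = newL;
        return new long[] { l, c };
    }

    // Receive event, given the sender's (ml, mc) timestamp.
    public synchronized long[] onReceive(long ml, long mc) {
        long pt = System.currentTimeMillis();
        long newL = Math.max(Math.max(l, ml), pt);
        if (newL == l && newL == ml)  c = Math.max(c, mc) + 1;
        else if (newL == l)           c = c + 1;
        else if (newL == ml)          c = mc + 1;
        else                          c = 0;
        l = newL;
        return new long[] { l, c };
    }
}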
Consensus
17
Consensus
- The problem of having a set of processes agree on a value.
- leader election, state machine replication, deciding to commit a transaction etc.
- Validity: the value agreed upon must have been proposed by some
process
- Termination: at least one non-faulty process eventually decides
- Agreement: all deciding processes agree on the same value
18
Liveness and Safety Properties
- Liveness: A “good” thing happens during execution of an algorithm
- Safety: Some “bad” thing never happens during execution of an algorithm
19
FLP Result (Fischer, Lynch and Paterson) [6]
- Distributed consensus is not always possible ...
- with reliable message delivery
- with a single crash-stop failure
- … in the asynchronous model, because we cannot differentiate between a crashed
process and a slow process.
- No algorithm can always guarantee termination in the presence of crashes.
- It is related to the liveness property, not the safety property.
20
Detecting failures: why aren’t you “talking to me”?
21
Unreliable Failure Detectors by Chandra and Toueg [7]
- Distributed failure detectors which are allowed to make mistakes (a minimal heartbeat-based sketch follows below)
- Each process has a local state to keep the list of processes that it suspects have failed
- A local failure detector can make 2 types of mistakes
- suspecting processes that haven’t actually crashed ⇒ ACCURACY property
- not suspecting processes that have actually crashed ⇒ COMPLETENESS property
- Degrees of completeness
- strong completeness, weak completeness
- Degrees of accuracy
- strong accuracy, weak accuracy, eventually strong accuracy, eventually weak accuracy
22
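To make the two mistake types concrete, here is a minimal heartbeat-based detector sketch in Java (names are illustrative, not Hazelcast or any library API). Suspecting a slow but alive node is an accuracy mistake; never suspecting a dead node would be a completeness mistake. Doubling the timeout after every wrong suspicion is the usual trick for approaching eventual accuracy once the network stabilizes.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

// A minimal heartbeat-based failure detector sketch with an adaptive timeout.
public final class HeartbeatFailureDetector {
    private static final long INITIAL_TIMEOUT_MILLIS = 1_000;

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final Map<String, Long> timeoutMillis = new ConcurrentHashMap<>();
    private final Set<String> suspected = new CopyOnWriteArraySet<>();

    // Called whenever a heartbeat message arrives from a node.
    public void onHeartbeat(String node) {
        lastHeartbeat.put(node, System.currentTimeMillis());
        if (suspected.remove(node)) {
            // We suspected a live node (accuracy mistake): double its timeout
            // so that such mistakes become rarer over time.
            timeoutMillis.merge(node, INITIAL_TIMEOUT_MILLIS, (old, ignored) -> old * 2);
        }
    }

    // Called periodically, e.g. by a scheduled task.
    public void checkNodes() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            long timeout = timeoutMillis.getOrDefault(e.getKey(), INITIAL_TIMEOUT_MILLIS);
            if (now - e.getValue() > timeout) {
                suspected.add(e.getKey());
            }
        }
    }

    public Set<String> suspectedNodes() {
        return suspected;
    }
}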
Classes of Failure Detectors
- Perfect Failure Detector (P)
- Strongly Complete: Every faulty process is eventually permanently suspected by every non-faulty
process.
- Strongly Accurate: No process is suspected (by anybody) before it crashes.
- Eventually Strong Failure Detector (⋄S)
- Strongly Complete
- Eventually Weakly Accurate: After some initial period of confusion, some non-faulty process is never
suspected.
- The consensus problem can be solved with an Eventually Strong Failure Detector (⋄S)
with f < n / 2 failures in the asynchronous model. [7], [8]
- As long as you hear from the majority, you can solve consensus. ⇒ SAFETY
- Every correct process eventually decides. No blocking forever. ⇒ LIVENESS
23
Consensus Algorithms
2PC, 3PC, Paxos, Raft and the others
24
Two-Phase Commit (2PC) [9]
- With no failures, it satisfies Validity,
Termination, and Agreement.
- C crashes before Phase 1: No problem
- C crashes before Phase 2: A can ask B
what it has voted for.
- C and A crash before Phase 2: The
protocol blocks!
- The protocol blocks even with fail-stop failures
(the simplest failure model); a minimal coordinator sketch follows below.
25
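A minimal coordinator-side sketch of 2PC in Java (the Participant interface and all names are assumptions for illustration, not an actual API): Phase 1 collects votes, Phase 2 broadcasts the decision. The blocking scenario above corresponds to the coordinator crashing between the two loops, after participants have voted YES but before they learn the outcome.

import java.util.List;

// Illustrative participant interface: vote in Phase 1, obey the decision in Phase 2.
interface Participant {
    boolean prepare(String txId); // Phase 1: vote YES (true) or NO (false)
    void commit(String txId);     // Phase 2: commit
    void abort(String txId);      // Phase 2: abort
}

final class TwoPhaseCommitCoordinator {
    boolean execute(String txId, List<Participant> participants) {
        // Phase 1: ask everyone to prepare; a single NO vote aborts the transaction.
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare(txId)) {
                allYes = false;
                break;
            }
        }
        // Phase 2: broadcast the decision.
        // (A coordinator crash right here is the blocking scenario.)
        for (Participant p : participants) {
            if (allYes) p.commit(txId); else p.abort(txId);
        }
        return allYes;
    }
}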
Three-Phase Commit (3PC) [10]
- The main problem of 2PC is that the
participants don’t know the outcome
of the voting before they actually
take action (commit / abort).
- We add a new step for this ⇒
3PC
- 3PC is non-blocking and it
handles fail-stop failures.
- What about fail-recover, network
partitions, the asynchronous
model?
26
Paxos [11], [12]
- It chooses to sacrifice liveness to maintain safety
- It doesn’t terminate when the network behaves asynchronously and
terminates only when synchronicity returns.
- It doesn’t block as long as a majority of the participants is available.
- The correct run is similar to 2PC.
- 2 new mechanisms:
- Ordering of proposals so that we can determine which proposal should
be accepted: sequence numbers
- Require a majority instead of all participants (a minimal acceptor sketch follows below)
Image taken from [23] 27
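To make the two mechanisms concrete, here is a minimal single-decree Paxos acceptor sketch in Java (names are illustrative, and a real acceptor must persist its state before replying). A proposer first gathers promises from a majority with prepare(n), adopts the value attached to the highest-numbered accepted proposal it sees (or its own value if none), and then asks the same majority to accept(n, v).

// A minimal single-decree Paxos acceptor sketch (illustrative, not a full implementation).
final class PaxosAcceptor {
    static final class Promise {
        final long acceptedN;        // -1 if nothing accepted yet
        final Object acceptedValue;  // null if nothing accepted yet
        Promise(long acceptedN, Object acceptedValue) {
            this.acceptedN = acceptedN;
            this.acceptedValue = acceptedValue;
        }
    }

    private long promisedN = -1;   // highest proposal number promised so far
    private long acceptedN = -1;   // proposal number of the accepted value, if any
    private Object acceptedValue;

    // Phase 1: on prepare(n), promise to ignore proposals below n and report
    // any value already accepted, so the proposer must adopt it.
    synchronized Promise prepare(long n) {
        if (n <= promisedN) {
            return null; // reject: we already promised a higher proposal number
        }
        promisedN = n;
        return new Promise(acceptedN, acceptedValue);
    }

    // Phase 2: on accept(n, v), accept unless a higher-numbered prepare arrived meanwhile.
    synchronized boolean accept(long n, Object v) {
        if (n < promisedN) {
            return false;
        }
        promisedN = n;
        acceptedN = n;
        acceptedValue = v;
        return true;
    }
}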
Paxos
- The original paper “The Part-time Parliament” [11] is difficult to read as it explains the
algorithm using an analogy with Greek democracy.
- Submitted in 1990, published in 1998, after being explained in another paper [17] in 1996.
- “The Paxos algorithm, when presented in plain English, is very simple” Paxos Made Simple [12]
- Cheap Paxos [13], Fast Paxos [14] and many other variations…
- Paxos Made Live [15]: There are significant gaps between the description of the Paxos algorithm and the
needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas
scattered in the literature and make several relatively small protocol extensions. The cumulative effort will
be substantial and the final system will be based on an unproven protocol.
- Paxos Made Moderately Complex [16]: For anybody who has ever tried to implement it, Paxos is by no
means a simple protocol, even though it is based on relatively simple invariants. This paper provides
imperative pseudo-code for the full Paxos (or Multi-Paxos) protocol without shying away from discussing
various implementation details. 28
Raft: In search of an understandable consensus algorithm [18]
- A new consensus algorithm with understandability being one of its design
goals.
- It divides the problem into parts:
- leader election, log replication, safety and membership changes
- Also discusses implementation details
- More than 80 implementations on its website [19]
29
Other Consensus Algorithms
- Viewstamped Replication [20], [21]
- Another consensus algorithm. It is less popular than Paxos.
- Raft has a lot of similarities to it.
- Zab [22]
- Implemented in ZooKeeper
- Many variants of Paxos...
30
References
[1] Lamport, Leslie. "Time, clocks, and the ordering of events in a distributed system." Communications of the ACM 21.7 (1978): 558-565.
[2] Raynal, Michel, and Mukesh Singhal. "Logical time: Capturing causality in distributed systems." Computer 29.2 (1996): 49-56.
[3] Corbett, James C., et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31.3 (2013): 8.
[4] https://github.com/cockroachdb/cockroach
[5] Kulkarni, Sandeep S., et al. "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases." (2014).
[6] Fischer, Michael J., Nancy A. Lynch, and Michael S. Paterson. "Impossibility of distributed consensus with one faulty process." Journal of the ACM (JACM) 32.2 (1985): 374-382.
[7] Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
[8] Chandra, Tushar Deepak, Vassos Hadzilacos, and Sam Toueg. "The weakest failure detector for solving consensus." Journal of the ACM (JACM) 43.4 (1996): 685-722.
[9] Gray, James N. "Notes on database operating systems." Operating Systems. Springer Berlin Heidelberg, 1978. 393-481.
[10] Skeen, Dale. "Nonblocking commit protocols." Proceedings of the 1981 ACM SIGMOD international conference on Management of data. ACM, 1981.
[11] Lamport, Leslie. "The part-time parliament." ACM Transactions on Computer Systems (TOCS) 16.2 (1998): 133-169.
[12] Lamport, Leslie. "Paxos made simple." ACM Sigact News 32.4 (2001): 18-25.
[13] Lamport, Leslie, and Mike Massa. "Cheap paxos." Dependable Systems and Networks, 2004 International Conference on. IEEE, 2004.
[14] Lamport, Leslie. "Fast paxos." Distributed Computing 19.2 (2006): 79-103.
[15] Chandra, Tushar D., Robert Griesemer, and Joshua Redstone. "Paxos made live: an engineering perspective." Proceedings of the twenty-sixth annual ACM symposium on
Principles of distributed computing. ACM, 2007.
[16] Van Renesse, Robbert, and Deniz Altinbuken. "Paxos made moderately complex." ACM Computing Surveys (CSUR) 47.3 (2015): 42.
[17] Lampson, Butler. "How to build a highly available system using consensus." Distributed Algorithms (1996): 1-17.
[18] Ongaro, Diego, and John Ousterhout. "In search of an understandable consensus algorithm." 2014 USENIX Annual Technical Conference (USENIX ATC 14). 2014.
[19] https://raft.github.io/
[20] Oki, Brian M., and Barbara H. Liskov. "Viewstamped replication: A new primary copy method to support highly-available distributed systems." Proceedings of the seventh
annual ACM Symposium on Principles of distributed computing. ACM, 1988.
[21] Liskov, Barbara, and James Cowling. "Viewstamped replication revisited." (2012).
[22] Junqueira, Flavio P., Benjamin C. Reed, and Marco Serafini. "Zab: High-performance broadcast for primary-backup systems." 2011 IEEE/IFIP 41st International Conference
on Dependable Systems & Networks (DSN). IEEE, 2011.
[23] http://the-paper-trail.org/blog/consensus-protocols-paxos/ 31
Thank you!
Stay tuned for the next episode...
32