Distributed Consensus:
Making Impossible
Possible
QCon London
Tuesday 29/3/2016
Heidi Howard
PhD Student @ University of Cambridge
heidi.howard@cl.cam.ac.uk
@heidiann360
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
distributed-consensus
Presented at QCon London
www.qconlondon.com
What is Consensus?
“The process by which we reach agreement over
system state between unreliable machines connected
by asynchronous networks”
Why?
• Distributed locking
• Banking
• Safety critical systems
• Distributed scheduling and coordination
Anything which requires guaranteed agreement
A walk through history
We are going to take a journey through the
developments in distributed consensus, spanning 3
decades.
We are going to search for answers to questions like:
• how do we reach consensus?
• what is the best method for reaching consensus?
• can we even reach consensus?
• what’s next in the field?
FLP Result
off to a slippery start
Impossibility of distributed
consensus with one faulty process
Michael Fischer, Nancy Lynch
and Michael Paterson
ACM SIGACT-SIGMOD
Symposium on Principles of
Database Systems
1983
FLP
We cannot guarantee agreement in an asynchronous
system where even one host might fail.
Why?
We cannot reliably detect failures. We cannot know
for sure the difference between a slow host/network
and a failed host
NB: We can still guarantee safety; the issue is limited to
guaranteeing liveness.
Solution to FLP
In practice:
We accept that sometimes the system will not be
available. We mitigate this using timers and backoffs.
In theory:
We assume a weak form of synchrony in the system,
e.g. messages arrive within a year.
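The "timers and backoffs" mitigation can be sketched as a randomized exponential backoff. This is a minimal sketch under assumptions, not something shown in the talk: the helper name `backoff_delays` and all of its parameters are hypothetical.

```python
import random

def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=6, rng=None):
    """Yield randomized, exponentially growing retry delays in seconds.

    Hypothetical helper: a node that suspects a failure waits one of these
    delays before retrying, so competing nodes are unlikely to retry in step.
    """
    rng = rng or random.Random()
    limit = base
    for _ in range(attempts):
        # Pick uniformly in [0, limit] to desynchronize competing nodes.
        yield rng.uniform(0.0, limit)
        # Grow the upper limit, but never beyond the cap.
        limit = min(cap, limit * factor)

delays = list(backoff_delays(rng=random.Random(42)))
```

The randomness matters as much as the growth: two nodes that time out together and retry after identical delays would keep colliding.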
Paxos
Lamport’s original consensus algorithm
The Part-Time Parliament
Leslie Lamport
ACM Transactions on Computer Systems
May 1998
Paxos
The original consensus algorithm for reaching
agreement on a single value.
• two phase process: prepare and commit
• majority agreement
• monotonically increasing numbers
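The three ingredients above (two phases, majority agreement, increasing numbers) can be sketched in a few lines. This is a minimal in-memory sketch, not the talk's code: the names `Acceptor` and `propose` are hypothetical, messages are replaced by direct method calls, and the slide vocabulary (promise, commit) is used for the two phases.

```python
class Acceptor:
    """One node's state: highest promised number P and last commit C."""

    def __init__(self):
        self.promised = 0        # P: highest proposal number promised
        self.committed = None    # C: (number, value) last committed, if any

    def promise(self, n):
        # Phase 1: promise to ignore proposals numbered below n, and
        # report any value this node has already committed.
        if n > self.promised:
            self.promised = n
            return True, self.committed
        return False, None

    def commit(self, n, value):
        # Phase 2: accept unless a higher number has been promised since.
        if n >= self.promised:
            self.promised = n
            self.committed = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases; return the committed value, or None on failure."""
    majority = len(acceptors) // 2 + 1
    replies = [a.promise(n) for a in acceptors]
    granted = [prev for ok, prev in replies if ok]
    if len(granted) < majority:
        return None
    # If any node already committed a value, adopt the highest-numbered one.
    earlier = [prev for prev in granted if prev is not None]
    if earlier:
        value = max(earlier)[1]
    acks = sum(a.commit(n, value) for a in acceptors)
    return value if acks >= majority else None


nodes = [Acceptor() for _ in range(3)]
first = propose(nodes, 13, "B")    # Bob's request, proposal number 13
second = propose(nodes, 22, "A")   # Alice's later request, proposal number 22
```

In this run the second proposal learns of the earlier commit during Phase 1, so it re-commits "B" rather than "A". That is the rule that keeps a single value agreed once a majority has committed it.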
Paxos Example - Failure Free
[Diagram sequence: three nodes (1, 2, 3), each showing a promised number P and a committed value C, both initially empty]
1. Incoming request from Bob (B).
2. Phase 1: the node handling Bob's request sets P: 13 and asks the other nodes "Promise (13) ?".
3. Phase 1: the other nodes set P: 13 and reply OK.
4. Phase 2: the handling node sets C: 13, B and asks the other nodes "Commit (13, B) ?".
5. Phase 2: the other nodes set C: 13, B and reply OK.
6. All three nodes now hold P: 13 and C: 13, B. OK is returned and Bob is granted the lock.
Paxos Example - Node Failure
[Diagram sequence: three nodes (1, 2, 3), each showing P and C, both initially empty]
1. Incoming request from Bob (B). Phase 1: the node handling the request sets P: 13 and asks "Promise (13) ?".
2. Phase 1: the other nodes set P: 13 and reply OK.
3. Phase 2: the handling node sets C: 13, B and asks "Commit (13, B) ?".
4. Phase 2: only one other node sets C: 13, B. The third node still holds P: 13 with C empty.
5. Alice (A) would also like the lock.
6. Phase 1: the node with C empty takes Alice's request, sets P: 22 and asks "Promise (22) ?".
7. Phase 1: one node sets P: 22 and replies OK (13, B), reporting the value it has already committed. The remaining node stays at P: 13, C: 13, B.
8. Phase 2: the node handling Alice's request sets C: 22, B and asks "Commit (22, B) ?", re-proposing Bob's value rather than Alice's.
9. Phase 2: the node that replied sets C: 22, B and replies OK. The answer to Alice's request is NO.
Paxos Example - Conflict
[Diagram sequence: three nodes (1, 2, 3) receiving competing requests from Bob (B) and Alice (A); C stays empty on every node throughout]
1. Phase 1 - Bob: all nodes set P: 13.
2. Phase 1 - Alice: all nodes set P: 21.
3. Phase 1 - Bob: all nodes set P: 33.
4. Phase 1 - Alice: all nodes set P: 41.
Paxos Summary
Clients must wait two round trips (2 RTT) to the
majority of nodes. Sometimes longer.
The system will continue as long as a majority of
nodes are up
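The "majority of nodes" condition translates into simple arithmetic. A small sketch, with hypothetical helper names:

```python
def majority(n):
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1

def tolerated_failures(n):
    """Nodes that can be down while a majority is still up."""
    return n - majority(n)

# cluster size -> (majority size, failures tolerated)
table = {n: (majority(n), tolerated_failures(n)) for n in (3, 4, 5, 7)}
```

Note that going from 3 to 4 nodes raises the majority size without tolerating any extra failure.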
Multi-Paxos
Lamport’s leader-driven consensus algorithm
Paxos Made Moderately Complex
Robbert van Renesse and Deniz
Altinbuken
ACM Computing Surveys
April 2015
Not the original, but highly recommended
Multi-Paxos
Lamport’s insight:
Phase 1 is not specific to the request so can be done
before the request arrives and can be reused.
Implication:
Bob now only has to wait one RTT
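The saving can be illustrated by counting round trips when Phase 1 is run once and then reused. This is a simplified sketch under loud assumptions: the names `Replica` and `Leader` are hypothetical, calls are in-memory, and the log is assumed empty when the leader takes over, so recovery of previously committed entries (which a real implementation needs) is omitted.

```python
class Replica:
    """Hypothetical replica keeping one promise for the whole log."""

    def __init__(self):
        self.promised = 0
        self.log = {}            # slot index -> (number, value)

    def promise(self, n):
        # Phase 1 covers every slot, so it is not tied to any one request.
        if n > self.promised:
            self.promised = n
            return True
        return False

    def commit(self, n, slot, value):
        # Phase 2 for a single log slot.
        if n >= self.promised:
            self.promised = n
            self.log[slot] = (n, value)
            return True
        return False


class Leader:
    def __init__(self, replicas, n):
        self.replicas, self.n, self.next_slot = replicas, n, 0
        self.majority = len(replicas) // 2 + 1
        # Phase 1 runs once, before any request arrives (one round trip).
        self.round_trips = 1
        oks = sum(r.promise(n) for r in replicas)
        self.elected = oks >= self.majority

    def request(self, value):
        """Commit one client value using Phase 2 only (one round trip)."""
        if not self.elected:
            return None
        slot, self.next_slot = self.next_slot, self.next_slot + 1
        self.round_trips += 1
        acks = sum(r.commit(self.n, slot, value) for r in self.replicas)
        return slot if acks >= self.majority else None


replicas = [Replica() for _ in range(3)]
leader = Leader(replicas, n=13)
slots = [leader.request(v) for v in ("B", "A", "B")]
```

Three requests cost four round trips in total here, instead of the six that three independent two-phase runs would need.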
State Machine
Replication
fault-tolerant services using consensus
Implementing Fault-Tolerant
Services Using the State Machine
Approach: A Tutorial
Fred Schneider
ACM Computing Surveys
1990
State Machine Replication
A general technique for making a service, such as a
database, fault-tolerant.
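The idea can be sketched with a deterministic state machine that is fed an agreed log. This is an illustrative sketch only: the `KVStore` class and the command format are hypothetical, and the log is simply written down here where a real system would obtain it from a consensus algorithm.

```python
class KVStore:
    """A tiny deterministic state machine: integer variables by name."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, arg = command
        if op == "set":
            self.state[key] = arg
        elif op == "add":
            self.state[key] = self.state.get(key, 0) + arg
        return self.state.get(key)


# What consensus would agree on: the same commands in the same order.
log = [("set", "C", 1), ("add", "C", 1), ("set", "B", 0)]

replicas = [KVStore() for _ in range(3)]
for replica in replicas:
    for command in log:
        replica.apply(command)
```

Because `apply` is deterministic, every replica ends in the same state, so any surviving replica can answer for the service.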
[Diagrams: first a single Application with two Clients; then three Application replicas, two Clients, a Network and Consensus components]
CAP Theorem
You cannot have your cake and eat it
CAP Theorem
Eric Brewer
Presented at Symposium on
Principles of Distributed
Computing, 2000
Consistency, Availability &
Partition Tolerance - Pick Two
[Diagram: four nodes (1, 2, 3, 4) with clients B and C]
Paxos Made Live
How Google uses Paxos
Paxos Made Live - An Engineering
Perspective
Tushar Chandra, Robert Griesemer
and Joshua Redstone
ACM Symposium on Principles of
Distributed Computing
2007
Paxos Made Live
Paxos Made Live documents the challenges in
constructing Chubby, a distributed coordination
service built using Multi-Paxos and SMR.
Isn’t this a solved problem?
“There are significant gaps between the description
of the Paxos algorithm and the needs of a real-world
system.
In order to build a real-world system, an expert needs
to use numerous ideas scattered in the literature and
make several relatively small protocol extensions.
The cumulative effort will be substantial and the final
system will be based on an unproven protocol.”
Challenges
• Handling disk failure and corruption
• Dealing with limited storage capacity
• Effectively handling read-only requests
• Dynamic membership & reconfiguration
• Supporting transactions
• Verifying safety of the implementation
Fast Paxos
Like Multi-Paxos, but faster
Fast Paxos
Leslie Lamport
Microsoft Research Tech Report
MSR-TR-2005-112
Fast Paxos
Paxos: Any node can commit a value in 2 RTTs
Multi-Paxos: The leader node can commit a value in
1 RTT
But, what about any node committing a value in 1
RTT?
Fast Paxos
We can bypass the leader node for many operations,
so any node can commit a value in 1 RTT.
However, we must either:
• reduce the number of failures we guarantee to
tolerate, or
• increase the size of the quorum, or
• a combination of both
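The trade-off can be made concrete by computing quorum sizes. This sketch assumes the intersection requirement from the Fast Paxos report (any two fast quorums and any classic quorum must share a node) and majority-sized classic quorums; the function names are hypothetical.

```python
def classic_quorum(n):
    """Majority quorum, as used by Paxos and Multi-Paxos."""
    return n // 2 + 1

def fast_quorum(n, q_classic=None):
    """Smallest fast quorum size satisfying the assumed requirement."""
    q_classic = classic_quorum(n) if q_classic is None else q_classic
    for q_fast in range(q_classic, n + 1):
        # Three sets of sizes q_fast, q_fast and q_classic drawn from n nodes
        # are guaranteed a common node exactly when the sizes sum to more than 2n.
        if 2 * q_fast + q_classic > 2 * n:
            return q_fast
    return None

# cluster size -> (classic quorum, fast quorum)
sizes = {n: (classic_quorum(n), fast_quorum(n)) for n in (3, 4, 5, 7)}
# Failures tolerated while the fast path can still be used.
fast_failures = {n: n - q_fast for n, (_, q_fast) in sizes.items()}
```

With 5 nodes, for example, the fast path needs 4 replies and so survives only 1 failure, whereas a classic majority of 3 survives 2.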
Egalitarian Paxos
Don’t restrict yourself unnecessarily
There Is More Consensus in
Egalitarian Parliaments
Iulian Moraru, David G. Andersen,
Michael Kaminsky
SOSP 2013
also see Generalized Consensus and Paxos
Egalitarian Paxos
The basis of SMR is that every replica of an
application receives the same commands in the
same order.
However, sometimes the ordering can be relaxed…
Total Ordering
C=1 B? C=C+1 C? B=0 B=C
Partial Ordering
[Diagram: the same six commands (C=1, B?, C=C+1, C?, B=0, B=C) arranged as a partial order]
Many possible orderings
[Diagram: three reorderings of the same six commands]
Egalitarian Paxos
Allow requests to be out-of-order if they are
commutative.
Conflict becomes much less common.
Works well in combination with Fast Paxos.
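One way to decide which requests may be reordered is to compare the variables they read and write. The sketch below applies that test to the six example commands (C=1, B?, C=C+1, C?, B=0, B=C). The read/write-set rule is an assumption for illustration, not the conflict definition used by Egalitarian Paxos itself, and the names are hypothetical.

```python
# Each command is modelled as (variables read, variables written).
COMMANDS = {
    "C=1":   (set(),  {"C"}),
    "B?":    ({"B"},  set()),
    "C=C+1": ({"C"},  {"C"}),
    "C?":    ({"C"},  set()),
    "B=0":   (set(),  {"B"}),
    "B=C":   ({"C"},  {"B"}),
}

def conflicts(a, b):
    """True if swapping a and b could change a result, so their order matters."""
    reads_a, writes_a = COMMANDS[a]
    reads_b, writes_b = COMMANDS[b]
    # A write conflicts with any read or write of the same variable.
    return bool(writes_a & (reads_b | writes_b) or writes_b & reads_a)

# Pairs of distinct commands that may be applied in either order.
free_pairs = sorted(
    (a, b) for a in COMMANDS for b in COMMANDS
    if a < b and not conflicts(a, b)
)
```

Only conflicting pairs need an agreed order; everything in `free_pairs` can be committed without coordinating their relative position.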
Viewstamped
Replication Revisited
the forgotten algorithm
Viewstamped Replication Revisited
Barbara Liskov and James
Cowling
MIT Tech Report
MIT-CSAIL-TR-2012-021
Viewstamped Replication
Revisited (VRR)
Interesting and well-explained variant of SMR + Multi-Paxos.
Key features:
• Round robin leader election
• Dynamic Membership
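Round-robin leader election can be sketched as picking the leader from the view number. This is a minimal sketch: the function name is hypothetical, and numbering views from 0 and ordering replicas by name are assumptions made for illustration.

```python
def leader_for_view(view, replicas):
    """Each new view hands leadership to the next replica in a fixed order."""
    ordered = sorted(replicas)   # every node must derive the same order
    return ordered[view % len(ordered)]

replicas = ["node3", "node1", "node2"]
leaders = [leader_for_view(v, replicas) for v in range(5)]
```

Since every node can compute the leader from the view number alone, no votes are needed to decide who leads a view.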
Raft Consensus
Paxos made understandable
In Search of an Understandable
Consensus Algorithm
Diego Ongaro and John
Ousterhout
USENIX Annual Technical
Conference
2014
Raft
Raft has taken the wider community by storm, due to
its understandable description.
It’s another variant of SMR with Multi-Paxos.
Key features:
• Really strong leadership - all other nodes are
passive
• Dynamic membership and log compaction
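[State diagram: Follower, Candidate, Leader]
• Startup/Restart -> Follower
• Follower -> Candidate: Timeout
• Candidate -> Candidate: Timeout
• Candidate -> Leader: Win
• Candidate -> Follower: Step down
• Leader -> Follower: Step down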
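The state diagram can be written as a small transition table. This is a sketch only: the mapping of the labels (Timeout, Win, Step down) to arrows is an assumption based on the state diagram in the Raft paper, and the names `Role`, `TRANSITIONS` and `step` are hypothetical.

```python
from enum import Enum

class Role(Enum):
    FOLLOWER = "Follower"
    CANDIDATE = "Candidate"
    LEADER = "Leader"

# (current role, event) -> next role, using the labels from the diagram.
TRANSITIONS = {
    (Role.FOLLOWER, "timeout"): Role.CANDIDATE,
    (Role.CANDIDATE, "timeout"): Role.CANDIDATE,   # election restarts
    (Role.CANDIDATE, "win"): Role.LEADER,
    (Role.CANDIDATE, "step down"): Role.FOLLOWER,
    (Role.LEADER, "step down"): Role.FOLLOWER,
}

def step(role, event):
    """Return the next role; unlisted events leave the role unchanged."""
    return TRANSITIONS.get((role, event), role)

role = Role.FOLLOWER            # every node starts (or restarts) as a follower
history = [role]
for event in ("timeout", "timeout", "win", "step down"):
    role = step(role, event)
    history.append(role)
```

In this run the node times out twice as a candidate before winning, then steps down again.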
Ios
Why do things yourself, when you can delegate them?
to appear
Ios
The issue with leader-driven algorithms like Multi-
Paxos, Raft and VRR is that throughput is limited to
one node.
Ios allows a leader to safely and dynamically
delegate their responsibilities to other nodes in the
system.
Hydra
consensus for geo-replication
to appear
Hydra
Distributed consensus for systems which span
multiple datacenters.
We use Ios for replication within the datacenter and an
Egalitarian Paxos-like protocol across datacenters.
The system has a clear leader but most requests
simply bypass the leader.
[Diagram sequence: nine nodes (1-9) spread across three datacenters (Tokyo, West Coast, East Coast), handling a request from Bob (B)]
The road we travelled
• 2 impossibility results: CAP & FLP
• 1 replication method: State Machine Replication
• 6 consensus algorithms: Paxos, Multi-Paxos, Fast
Paxos, Egalitarian Paxos, Viewstamped Replication
Revisited & Raft
• 2 future algorithms: Ios & Hydra
How strong is the leadership?
[Diagram: a spectrum from Strong Leadership to Leaderless, with the categories Leader driven, Leader with Delegation and Leader only when needed. Algorithms placed on it: Raft, VRR, Multi-Paxos, Ios, Hydra, Fast Paxos, Paxos, Egalitarian Paxos]
Who is the winner?
Depends on the award:
• Best for minimum latency: VRR
• Easiest to understand: Raft
• Best for WANs (conflicts rare): Egalitarian Paxos
• Best for WANs (conflicts common): Fast Paxos
Future
1. More algorithms offering a compromise between
strong leadership and leaderless
2. More understandable consensus algorithms
3. Achieving consensus is getting cheaper, even in
challenging settings
4. Deployment with micro-services and unikernels
5. Self-scaling replication - adapting resources to
maintain resilience level.
Stops we drove past
We have seen one path through history, but many
more exist.
• Alternative replication techniques e.g. chain
replication and primary backup replication
• Alternative failure models e.g. nodes acting
maliciously
• Alternative domains e.g. sensor networks, mobile
networks, between cores
Summary
Do not be discouraged by
impossibility results and dense
abstract academic papers.
Consensus is useful and achievable.
Find the right algorithm for your
specific domain.

