Using a consensus algorithm and a
distributed store in designing a
distributed system
Atin Mukherjee
GlusterFS Hacker
@mukherjee_atin
Topics
● What is consensus in a distributed system?
● What is the CAP theorem?
● Different distributed system design approaches
● Challenges in the design of a distributed system
● What is the RAFT algorithm and how does it work?
● Distributed store
● Combining RAFT and a distributed store, in the form of
technologies like consul, etcd, and zookeeper
● Q & A
What is consensus in a distributed
system?
● Consensus – An agreement, but on what and
between whom?
● On what → whether the op/transaction should be committed
or not
● Between whom → the nodes forming the distributed system
● Quorum – floor(n/2) + 1 nodes, i.e. a strict majority of the n nodes
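The quorum formula can be checked with a few lines of Python. This is a minimal sketch; the function names `quorum` and `tolerated_failures` are illustrative and not taken from the slides.

```python
def quorum(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n nodes."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return n // 2 + 1  # integer division, i.e. floor(n/2) + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be lost while a quorum is still reachable."""
    return n - quorum(n)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"n={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 without tolerating any extra failure, which is why odd cluster sizes are the usual choice.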
CAP theorem
● At most two of the following three guarantees can be provided at the same time
– Consistency (all nodes see the same data at the
same time)
– Availability (a guarantee that every request
receives a response about whether it succeeded or
failed)
– Partition tolerance (the system continues to
operate despite arbitrary message loss or failure of
part of the system)
Design approaches of distributed
system
● No meta data – all nodes share their data with
one another
● Meta data server – one node holds the data and the
others fetch it from that node
So which one is better?
Probably neither of them? Ask yourself for a
minute...
Challenges in design of a distributed
system
● No meta data
– N * N exchange of network messages
– Not scalable when N is in the hundreds or
thousands
– Initialization time can be very high
– Can end up in a situation like “whom to believe,
whom not to”, popularly known as split brain
– How to undo a transaction locally?
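To put rough numbers on the N * N point, here is a small back-of-the-envelope sketch. It assumes one message per ordered pair of nodes per round for the no-meta-data design, and one request plus one reply per follower for a leader-based design; both counting models are simplifying assumptions, not measurements.

```python
def full_mesh_messages(n: int) -> int:
    """Messages per round when every node sends its state to every other node."""
    return n * (n - 1)


def leader_based_messages(n: int) -> int:
    """Messages per round when one leader contacts n-1 followers and each replies."""
    return 2 * (n - 1)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"N={n}: full mesh={full_mesh_messages(n)}, "
              f"leader based={leader_based_messages(n)}")
```

At N = 1000 the full-mesh exchange needs 999000 messages per round, against 1998 for the leader-based model.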
Challenges in design of a distributed
system contd...
● MDS (Meta data server)
– SPOF (single point of failure)
So is this the only drawback?
– How about having replicas? Then how many replicas?
– Additional network hop, lower performance
RAFT – A consensus algorithm
● Key features
– Leader/follower based model
– Leader election
– Normal operation
– Safety and consistency after leader changes
– Neutralizing old leaders
– Client interactions
– Configuration changes
RAFT : Server states
● Server state transitions between follower, candidate, and leader (diagram not included)
RAFT : Terms
● Each term is divided into two parts
– Election
– Normal operation
● At most 1 leader per term
● Some terms have no leader: a failed election,
e.g. because of a split vote
● Each server maintains its current term value
RAFT : Replicated state machine
● A picture is worth a thousand words... (diagram not included)
RAFT : Different RPCs
● RequestVote RPCs – A candidate sends these to the other
nodes to get itself elected as leader
● AppendEntries RPCs – Normal operation
workload
● AppendEntries RPCs with no entries – Heartbeat
messages that the leader sends to all followers
to assert its presence
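The two RPC types can be pictured as plain message records. The sketch below uses field names adapted from the Raft paper (term, candidate id, last log index/term, and so on); the slides do not list the fields, so the exact shape here is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RequestVote:
    term: int             # candidate's current term
    candidate_id: str     # who is asking for the vote
    last_log_index: int   # index of the candidate's last log entry
    last_log_term: int    # term of the candidate's last log entry


@dataclass
class AppendEntries:
    term: int             # leader's current term
    leader_id: str        # lets followers redirect clients
    prev_log_index: int   # index of the entry just before the new ones
    prev_log_term: int    # term of that entry
    entries: List[Tuple[int, str]] = field(default_factory=list)  # (term, command)
    leader_commit: int = 0  # highest index the leader knows to be committed

    def is_heartbeat(self) -> bool:
        """A heartbeat is simply an AppendEntries RPC that carries no entries."""
        return not self.entries
```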
RAFT : Leader Election
● current_term++
● Follower->Candidate
● Self vote
● Send RequestVote RPCs to all other servers, retry until either:
– Votes are received from a majority of servers (become leader)
– An RPC is received from a valid leader (return to follower)
– The election timeout elapses (increment term, start a new election)
● Election properties
– Safety – allow at most one winner per term
– Liveness – some candidate must eventually win
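The election steps above can be sketched as a single synchronous round. This is a simplified illustration under stated assumptions: there is no network, no timeout, and no log comparison (that check is covered on the next slide), and the `Server` class and `run_election` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    current_term: int = 0
    voted_for: Optional[str] = None
    state: str = "follower"

    def handle_request_vote(self, term: int, candidate: str) -> bool:
        """Grant at most one vote per term (the safety property)."""
        if term < self.current_term:
            return False                     # stale candidate
        if term > self.current_term:
            self.current_term = term         # newer term: forget the old vote
            self.voted_for = None
            self.state = "follower"
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False                         # already voted for someone else


def run_election(candidate: Server, peers: List[Server]) -> bool:
    candidate.current_term += 1              # current_term++
    candidate.state = "candidate"            # Follower -> Candidate
    candidate.voted_for = candidate.name     # self vote
    votes = 1
    for peer in peers:                       # RequestVote to all other servers
        if peer.handle_request_vote(candidate.current_term, candidate.name):
            votes += 1
    if votes >= (len(peers) + 1) // 2 + 1:   # majority of the whole cluster
        candidate.state = "leader"
        return True
    return False
```

Liveness is not modelled here; in RAFT it comes from randomized election timeouts, which make repeated split votes unlikely.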
RAFT : Picking the best leader
● Candidates include log info in RequestVote
RPCs: the index & term of their last log entry
● Voting server V denies its vote if its own log is more
complete, i.e. if
(votingServerLastTerm > candidateLastTerm) ||
((votingServerLastTerm == candidateLastTerm) &&
(votingServerLastIndex > candidateLastIndex))
● But is this enough to have crash consistency?
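The vote-denial condition on this slide translates directly into a small predicate. The function name `voter_denies` and the snake_case parameter names are illustrative; the logic is the condition shown above.

```python
def voter_denies(voter_last_term: int, voter_last_index: int,
                 candidate_last_term: int, candidate_last_index: int) -> bool:
    """True if the voter's log is more complete than the candidate's."""
    return (voter_last_term > candidate_last_term or
            (voter_last_term == candidate_last_term and
             voter_last_index > candidate_last_index))
```

A voter whose last entry is (term 3, index 4) denies a candidate at (term 2, index 9): a later term wins regardless of log length.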
RAFT : New commitment rules
● For a leader to decide an entry is committed:
– It must be stored on a majority of servers, and
– At least one new entry from the leader's term must also
be stored on a majority of servers
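A rough sketch of the two commitment conditions follows. It assumes the leader tracks, per follower, the highest log index known to be replicated there (called `match_index` here, a name borrowed from the Raft paper), and that log indices start at 1. The helper names are hypothetical.

```python
from typing import Dict, List


def replicated_on_majority(index: int, match_index: Dict[str, int],
                           cluster_size: int) -> bool:
    """The leader always holds its own entries, so it counts as one replica."""
    replicas = 1 + sum(1 for m in match_index.values() if m >= index)
    return replicas >= cluster_size // 2 + 1


def is_committed(index: int, leader_log_terms: List[int],
                 match_index: Dict[str, int], current_term: int,
                 cluster_size: int) -> bool:
    # Rule 1: the entry itself is stored on a majority of servers.
    if not replicated_on_majority(index, match_index, cluster_size):
        return False
    # Rule 2: some entry from the leader's own term is also on a majority.
    # Terms never decrease along the log, so only indices >= index matter.
    for i in range(index, len(leader_log_terms) + 1):
        if (leader_log_terms[i - 1] == current_term and
                replicated_on_majority(i, match_index, cluster_size)):
            return True
    return False
```

With a 5-node cluster, an entry from term 2 that sits on three servers is still not committed by a term-3 leader until a term-3 entry has reached a majority as well.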
RAFT : Log inconsistency
● Leader repairs follower log entries by
– Deleting extraneous entries
– Filling in missing entries from the leader's log
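The repair can be illustrated with both logs side by side. This sketch assumes a global view of the leader's and follower's logs as lists of (term, command) pairs; a real leader discovers the matching point incrementally over AppendEntries RPCs rather than comparing whole logs. It also relies on the RAFT property that two entries with the same index and term are identical. The function name is hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def repair_follower_log(leader_log: List[Entry],
                        follower_log: List[Entry]) -> List[Entry]:
    # Find the length of the longest prefix on which both logs agree,
    # comparing terms at each position.
    match = 0
    while (match < len(leader_log) and match < len(follower_log)
           and leader_log[match][0] == follower_log[match][0]):
        match += 1
    # Delete extraneous entries after the matching prefix,
    # then fill in the missing entries from the leader.
    return follower_log[:match] + leader_log[match:]
```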
RAFT : Neutralizing old leaders
● Sender includes its term in every RPC
● If the sender's term is older than the receiver's term, the
RPC is rejected; if the receiver's term is older, the receiver steps down to
follower, updates its term, and processes the RPC
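The term comparison can be sketched as a small gatekeeper that runs before any RPC is handled. The `Node` class and `accept_rpc` function are illustrative names, assumed for this example only.

```python
from dataclasses import dataclass


@dataclass
class Node:
    current_term: int
    state: str  # "follower", "candidate" or "leader"


def accept_rpc(receiver: Node, sender_term: int) -> bool:
    """Return True if the RPC should be processed, False if it is rejected."""
    if sender_term < receiver.current_term:
        # Stale sender (e.g. an old leader): reject the RPC.
        return False
    if sender_term > receiver.current_term:
        # Receiver is behind: step down, adopt the newer term, then process.
        receiver.current_term = sender_term
        receiver.state = "follower"
    return True
```

An old leader that was partitioned away is neutralized this way: its RPCs are rejected by up-to-date servers, and the first RPC it receives from the new leader makes it step down.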
RAFT : Client protocol
● Send commands to leader
– If leader is unknown, send to any server
– If the contacted server is not the leader, it will redirect to the leader
● Client gets the response back only after the full cycle at the leader (command logged, committed, and executed)
● If the request times out
– Client re-issues the command to another server
– Client attaches a unique id to each command to avoid duplicate
execution
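The duplicate-execution guard can be sketched as a state machine that remembers the result of every command id it has applied. The `StateMachine` class and its counter-style commands are assumptions made for the example; only the unique-id idea comes from the slide.

```python
import uuid
from typing import Dict


class StateMachine:
    def __init__(self) -> None:
        self.data: Dict[str, int] = {}
        self.results: Dict[str, int] = {}  # command id -> result already returned

    def apply(self, command_id: str, key: str, delta: int) -> int:
        """Add delta to key exactly once per command id."""
        if command_id in self.results:
            # Retry of a command that was already executed: replay the result.
            return self.results[command_id]
        self.data[key] = self.data.get(key, 0) + delta
        self.results[command_id] = self.data[key]
        return self.data[key]


if __name__ == "__main__":
    sm = StateMachine()
    cid = uuid.uuid4().hex        # client-generated unique id
    sm.apply(cid, "x", 1)
    sm.apply(cid, "x", 1)         # re-issued after a timeout, not applied again
    print(sm.data)
```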
Joint consensus phase
● 2 phase approach
● Need majority of both old and new
configurations for election and commitment
● Configuration change is just a log entry, applied
immediately on receipt (committed or not)
● Once joint consensus is committed, begin
replicating the log entry for the final configuration
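The "majority of both old and new" requirement can be written as a check over two sets of server names. This is a minimal sketch; `majority` and `joint_majority` are illustrative helper names.

```python
from typing import Set


def majority(acks: Set[str], config: Set[str]) -> bool:
    """True if the acknowledging servers include a majority of config."""
    return len(acks & config) >= len(config) // 2 + 1


def joint_majority(acks: Set[str], old: Set[str], new: Set[str]) -> bool:
    """During joint consensus, both configurations must agree separately."""
    return majority(acks, old) and majority(acks, new)
```

For example, with old = {a, b, c} and new = {c, d, e}, acknowledgements from {c, d, e} are a majority of the new configuration but not of the old one, so nothing can be elected or committed on that basis alone.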
Distributed store
● A common store which can be shared by
different nodes
● In the form of key-value pairs for ease of use
● Several such distributed key-value store
implementations are available.
etcd
● Name comes from “/etc distributed”
● Open source distributed consistent key value store
● Highly available and reliable
● Sequentially consistent
● Watchable
● Exposed via HTTP
● Runtime reconfigurable (scaling feature)
● Durable (snapshot backup/restore)
● Time to live keys (have a time out)
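To make "watchable" and "time to live keys" concrete, here is a toy, single-process key-value store. It is not the etcd API and not distributed; the `ToyStore` class, its method names, and the injectable clock are all assumptions made purely to illustrate the two behaviours.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class ToyStore:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}  # key -> (value, expiry)
        self._watchers: Dict[str, List[Callable[[str, str], None]]] = {}

    def watch(self, key: str, callback: Callable[[str, str], None]) -> None:
        """Register a callback that fires on every write to key."""
        self._watchers.setdefault(key, []).append(callback)

    def put(self, key: str, value: str, ttl: Optional[float] = None) -> None:
        expires = self._clock() + ttl if ttl is not None else None
        self._data[key] = (value, expires)
        for callback in self._watchers.get(key, []):
            callback(key, value)

    def get(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and self._clock() >= expires:
            del self._data[key]   # key has timed out
            return None
        return value
```

Passing the clock in as a parameter keeps the expiry logic testable without sleeping.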
etcd contd...
● Bootstrapping using RAFT
● Proxy mode in node
● Cluster configuration – etcdctl member
add/remove/list
● Similar projects such as consul and zookeeper are also
available.
Why etcd
● Vibrant community
● Used by 500+ applications, such as kubernetes and
cloud foundry
● 150+ developers
● Stable releases
References
● https://raftconsensus.github.io/
● https://www.youtube.com/watch?v=YbZ3zDzDnrw
● https://github.com/coreos/etcd#etcd
● https://consul.io/
Q & A
THANK YOU
