DISTRIBUTED 
COMPUTING 
FOR NEW BLOODS 
RAYMOND TAY 
I WORK AT HEWLETT-PACKARD LABS SINGAPORE 
@RAYMONDTAYBL
What is a 
Distributed 
System?
What is a 
Distributed 
System? 
It’s been 
there for 
a long 
time
What is a 
Distributed 
System? 
It’s been 
there for 
a long 
time 
You did not realize it
DISTRIBUTED 
SYSTEMS ARE 
THE NEW 
FRONTIER 
whether YOU like it or not
DISTRIBUTED 
SYSTEMS ARE 
THE NEW 
FRONTIER 
whether YOU like it or not 
LAN Games 
Mobile 
Databases 
ATMs 
Social Media
WHAT IS THE 
ESSENCE OF 
DISTRIBUTED 
SYSTEMS?
WHAT IS THE 
ESSENCE OF 
DISTRIBUTED 
SYSTEMS? 
IT IS ABOUT THE 
ATTEMPT TO 
OVERCOME THE 
LIMITS OF 
INFORMATION 
TRAVEL 
WHILE 
INDEPENDENT 
PROCESSES FAIL
Information flows at 
the speed of light! 
MESSAGE 
X Y
When 
independent 
things DON'T fail. 
I’ve sent it. I’ve received it. 
X Y 
I’ve received it. I’ve sent it.
When independent 
things fail, 
independently. 
X Y 
NETWORK FAILURE
When independent 
things fail, 
independently. 
X Y 
NODE FAILURE
There is no difference 
between a slow node 
and a dead node
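As a concrete illustration (not from the slides): a minimal Python sketch of a timeout-based probe. The host, port and threshold are hypothetical; the point is that the absence of a reply before the deadline is all we ever observe, whether the node is dead, merely slow, or the network dropped the message.

```python
import socket

TIMEOUT_SECONDS = 2.0  # arbitrary threshold, chosen only for illustration

def probe(host: str, port: int) -> bool:
    """Return True if the node answered within the timeout.

    A False result cannot distinguish a dead node from a slow node
    (or a lost packet): all we know is that no reply arrived in time.
    """
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT_SECONDS):
            return True
    except OSError:
        return False
```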
Why do we NEED it? 
Scalability 
when the needs of the 
system outgrow what a 
single node can provide
Why do we NEED it? 
Scalability 
when the needs of the 
system outgrow what a 
single node can provide 
Availability 
Enabling resilience 
when a node fails
IS IT REALLY THAT 
HARD?
IS IT REALLY THAT 
HARD? 
The answer lies in 
knowing what is 
possible and what is not 
possible…
IS IT REALLY THAT 
HARD? 
Even widely accepted 
facts are challenged…
THE FALLACIES OF 
DISTRIBUTED COMPUTING 
The network is reliable 
(Peter Deutsch and other fellows at Sun Microsystems)
THE FALLACIES OF 
DISTRIBUTED COMPUTING 
The network is reliable
THE FALLACIES OF 
DISTRIBUTED COMPUTING 
The network is reliable 
The network is secure 
(Peter Deutsch and other fellows at Sun Microsystems)
THE FALLACIES OF 
DISTRIBUTED COMPUTING 
The network is reliable 
The network is secure 
2013 cost of cyber 
crime study: 
United States
THE FALLACIES OF 
DISTRIBUTED COMPUTING 
The network is reliable 
The network is secure 
The network is homogeneous 
(Peter Deutsch and other fellows at Sun Microsystems)
THE FALLACIES OF 
DISTRIBUTED COMPUTING 
The network is reliable 
The network is secure 
The network is homogeneous 
Latency is zero 
(Peter Deutsch and other fellows at Sun Microsystems; Ingo Rammer on latency vs bandwidth)
THE FALLACIES OF 
DISTRIBUTED COMPUTING 
The network is reliable 
The network is secure 
The network is homogeneous 
Latency is zero 
Bandwidth is infinite 
(Peter Deutsch and other fellows at Sun Microsystems)
THE FALLACIES OF 
DISTRIBUTED COMPUTING 
The network is reliable 
The network is secure 
The network is homogeneous 
Latency is zero 
Bandwidth is infinite 
Topology doesn’t change 
(Peter Deutsch and other fellows at Sun Microsystems)
THE FALLACIES OF 
DISTRIBUTED COMPUTING 
The network is reliable 
The network is secure 
The network is homogeneous 
Latency is zero 
Bandwidth is infinite 
Topology doesn’t change 
There is one administrator 
(Peter Deutsch and other fellows at Sun Microsystems)
THE FALLACIES OF 
DISTRIBUTED COMPUTING 
The network is reliable 
The network is secure 
The network is homogeneous 
Latency is zero 
Bandwidth is infinite 
Topology doesn’t change 
There is one administrator 
Transport cost is zero 
(Peter Deutsch and other fellows at Sun Microsystems)
WHAT WAS TRIED IN THE 
PAST? 
DISTRIBUTED OBJECTS 
REMOTE PROCEDURE CALL 
DISTRIBUTED SHARED MUTABLE STATE
WHAT WAS ATTEMPTED? 
DISTRIBUTED OBJECTS 
REMOTE PROCEDURE CALL 
DISTRIBUTED SHARED MUTABLE STATE 
• Typically, programming constructs that “abstract” away 
the fact that there are local and remote objects 
• Ignores latencies 
• Handles failures with this attitude → “I dunno what to 
do… YOU (i.e. the invoker) handle it”.
WHAT WAS ATTEMPTED? 
DISTRIBUTED OBJECTS 
REMOTE PROCEDURE CALL 
DISTRIBUTED SHARED MUTABLE STATE 
• Assumes the synchronous processing model 
• Asynchronous RPC variants still try to 
mimic synchronous RPC…
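A hedged sketch of why synchronous RPC leaks distribution to the caller. The endpoint and payload here are hypothetical; the shape of the problem is that the stub looks like a local function, yet latency and partial failure surface as exceptions the invoker must interpret on its own.

```python
import json
import urllib.request

def remote_add(a: int, b: int) -> int:
    """A naive synchronous RPC stub: it looks like a local call,
    but latency and failure leak straight through to the caller."""
    payload = json.dumps({"a": a, "b": b}).encode()
    req = urllib.request.Request(
        "http://calculator.internal/add",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # If the network partitions or the server is slow, the caller blocks
    # and eventually gets an exception: "YOU handle it".
    with urllib.request.urlopen(req, timeout=1.0) as resp:
        return json.loads(resp.read())["result"]

# On a timeout, the caller must decide what it means:
# did the remote add happen, or not? The stub cannot say.
```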
WHAT WAS ATTEMPTED? 
DISTRIBUTED OBJECTS 
REMOTE PROCEDURE CALL 
DISTRIBUTED SHARED MUTABLE STATE 
• Found in Distributed Shared Memory systems, i.e. 
“1” address space partitioned into “x” address spaces across “x” nodes 
• e.g. JavaSpaces ⇒ danger of two independent 
processes both committing successfully (conflicting updates).
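To make that lost-update danger concrete, here is a small illustrative sketch; threads stand in for independent processes sharing one mutable address space. This is not JavaSpaces code, just the failure mode in miniature.

```python
import threading
import time

balance = 100  # shared mutable state; imagine it lives in one shared address space

def withdraw(amount: int) -> None:
    global balance
    current = balance           # read
    time.sleep(0.01)            # stand-in for network / scheduling delay
    balance = current - amount  # write based on a now-stale read

# Two "independent" writers both commit successfully, yet one update is lost.
t1 = threading.Thread(target=withdraw, args=(30,))
t2 = threading.Thread(target=withdraw, args=(50,))
t1.start(); t2.start(); t1.join(); t2.join()
print(balance)  # 70 or 50, never the intended 20
```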
YOU CAN APPRECIATE 
THAT IT IS 
VERY HARD TO SOLVE 
IN ITS ENTIRETY
The question really 
is “How can we do 
better?”
The question really 
is “How can we do 
better?” 
To start, we 
need to 
understand 2 
results
FLP IMPOSSIBILITY 
RESULT 
The FLP result shows that in 
an asynchronous model where even 
one process may crash, there is no 
deterministic distributed algorithm 
that is guaranteed to solve the 
consensus problem.
Consensus is a fundamental problem in fault-tolerant 
distributed systems. Consensus involves multiple servers 
agreeing on values. Once they reach a decision on a value, 
that decision is final. 
Typical consensus algorithms make progress when any 
majority of their servers are available; for example, a cluster 
of 5 servers can continue to operate even if 2 servers fail. If 
more servers fail, they stop making progress (but will never 
return an incorrect result). 
Paxos (1989), Raft (2013)
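The majority arithmetic described above can be stated in a few lines. This is only a sketch of the quorum rule, not of Paxos or Raft themselves.

```python
def quorum(cluster_size: int) -> int:
    """Smallest majority of the cluster."""
    return cluster_size // 2 + 1

def can_make_progress(cluster_size: int, failed: int) -> bool:
    """Majority-based consensus keeps going while a quorum is still alive."""
    return cluster_size - failed >= quorum(cluster_size)

assert can_make_progress(5, 2)      # 3 of 5 alive: still a majority
assert not can_make_progress(5, 3)  # 2 of 5 alive: stalls, but never answers incorrectly
```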
CAP THEOREM 
CAP CONJECTURE 
(2000) established 
as a theorem in 2002 
Consistency 
Availability 
Partition Tolerance
CAP THEOREM 
CAP CONJECTURE 
(2000) established 
as a theorem in 2002 
Consistency 
Availability 
Partition Tolerance 
• Consistency ⇒ all nodes should see the same data, 
eventually
CAP THEOREM 
CAP CONJECTURE 
(2000) established 
as a theorem in 2002 
Consistency 
Availability 
Partition Tolerance 
• Consistency ⇒ all nodes should see the same data, 
eventually 
• Availability ⇒ the system is still working when node(s) 
fail
CAP THEOREM 
CAP CONJECTURE 
(2000) established 
as a theorem in 2002 
Consistency 
Availability 
Partition Tolerance 
• Consistency ⇒ all nodes should see the same data, 
eventually 
• Availability ⇒ the system is still working when node(s) 
fail
CAP THEOREM 
CAP CONJECTURE 
(2000) established 
as a theorem in 2002 
Consistency 
Availability 
Partition Tolerance 
• Consistency ⇒ all nodes should see the same data, 
eventually 
• Availability ⇒ System is still working when node(s) 
fails • Partition-Tolerance ⇒ System is still working on 
arbitrary message loss
How does knowing 
FLP and CAP help me?
How does knowing 
FLP and CAP help me? 
It’s really about asking which 
you would sacrifice when parts of 
the system fail: Consistency 
or Availability?
You can choose C over A 
It must preserve reads and writes, 
i.e. linearizability: a read returns 
the last completed write (on ANY 
replica), e.g. via Two-Phase Commit
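A minimal, in-memory sketch of the Two-Phase-Commit pattern named above, with an invented participant interface. A real implementation must also log decisions durably and cope with a crashed coordinator; this sketch deliberately omits that.

```python
from typing import Iterable, List, Protocol

class Participant(Protocol):
    def prepare(self, txn_id: str) -> bool: ...   # vote yes/no
    def commit(self, txn_id: str) -> None: ...
    def abort(self, txn_id: str) -> None: ...

def two_phase_commit(txn_id: str, participants: Iterable[Participant]) -> bool:
    """Phase 1: collect votes. Phase 2: commit only if everyone voted yes.

    Choosing C over A: if any participant is unreachable or votes no,
    the whole transaction aborts, so replicas never diverge.
    """
    members: List[Participant] = list(participants)
    try:
        votes = [p.prepare(txn_id) for p in members]
    except Exception:          # a slow or dead participant looks the same as a "no"
        votes = [False]
    if all(votes):
        for p in members:
            p.commit(txn_id)
        return True
    for p in members:
        p.abort(txn_id)
    return False
```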
You can choose A over C 
It might return stale 
reads …
You can choose A over C 
It might return stale 
reads … 
Dynamo uses vector clocks; 
Cassandra uses a clever form 
of last-write-wins
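A small sketch of the vector-clock idea as described in the Dynamo paper: each replica bumps its own counter, and two clocks that do not descend from one another mark concurrent, conflicting writes. The names and structure here are illustrative only.

```python
from typing import Dict

VectorClock = Dict[str, int]   # node id -> logical counter

def increment(clock: VectorClock, node: str) -> VectorClock:
    bumped = dict(clock)
    bumped[node] = bumped.get(node, 0) + 1
    return bumped

def descends(a: VectorClock, b: VectorClock) -> bool:
    """True if `a` has seen every event that `b` has (a >= b component-wise)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(a: VectorClock, b: VectorClock) -> str:
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "concurrent writes: merge in the application, or pick by timestamp (last-write-wins)"

# Replicas X and Y both update the same key while partitioned:
x = increment({}, "X")   # {'X': 1}
y = increment({}, "Y")   # {'Y': 1}
print(reconcile(x, y))   # neither descends from the other -> conflict
```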
You cannot NOT choose P 
Partition Tolerance is mandatory 
in distributed systems
To scale, partition 
To be resilient, replicate
Partitioning and Replication (diagram): data A, B, C is split into key 
ranges [0..100] and [101..200], and each range is replicated across 
Node 1 and Node 2
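A toy placement function loosely matching the picture above; the exact primary/replica assignment is an assumption made for illustration.

```python
from typing import List

# Two key ranges, each stored on a primary node and replicated on the other.
RANGES = [
    (range(0, 101),   ["Node 1", "Node 2"]),   # primary first, then replica
    (range(101, 201), ["Node 2", "Node 1"]),
]

def replicas_for(key: int) -> List[str]:
    for key_range, nodes in RANGES:
        if key in key_range:
            return nodes
    raise KeyError(f"key {key} falls outside all partitions")

print(replicas_for(42))    # ['Node 1', 'Node 2']
print(replicas_for(150))   # ['Node 2', 'Node 1']
```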
Replication is strongly 
related to fault-tolerance 
Fault-tolerance relies 
on reaching consensus 
Paxos (1989), ZAB, Raft 
(2013)
Consistent Hashing is often used 
for PARTITIONING 
and REPLICATION 
C-H is commonly used for load-balancing 
web objects, and in 
DynamoDB, Cassandra, Riak and 
Memcached
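A bare-bones consistent-hash ring (one point per node, no virtual nodes) to show the idea; production rings in the systems named above add virtual nodes and replication on top of this.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring, kept deliberately simple."""
    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)
        self._points = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's position.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
# Adding or removing one node only remaps keys adjacent to it on the ring,
# not (on average) the whole keyspace, unlike a plain hash(key) % N scheme.
```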
• If your problem fits into the memory of a 
single machine, you don’t need a 
distributed system 
• If you really need to design / build a 
distributed system, the following helps: 
• Design for failure 
• Use FLP and CAP to critique 
systems 
• Algorithms that work on a single node 
may not work in distributed mode 
• Avoid coordination between nodes 
• Learn to estimate your capacity
Jeff Dean - Google
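This slide presumably points at Jeff Dean's oft-quoted latency numbers, which support the "learn to estimate your capacity" advice above. The figures below are rounded, order-of-magnitude values from his talks and should be treated as rough guides for back-of-envelope estimates, not exact data.

```python
# Approximate, order-of-magnitude latencies (rounded from Jeff Dean's
# widely circulated "numbers everyone should know"); rough guides only.
NS = 1e-9
LATENCY = {
    "main memory reference":              100 * NS,
    "read 1 MB sequentially from RAM":    250_000 * NS,
    "round trip within same datacenter":  500_000 * NS,
    "read 1 MB sequentially from disk":   20_000_000 * NS,
    "round trip CA <-> Netherlands":      150_000_000 * NS,
}

# Back-of-envelope: fanning one request out to 100 in-datacenter replicas,
# sequentially vs. in parallel.
rtt = LATENCY["round trip within same datacenter"]
print(f"sequential: {100 * rtt * 1e3:.1f} ms, parallel: {rtt * 1e3:.1f} ms")
```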
REFERENCES 
• http://queue.acm.org/detail.cfm?id=2655736 “The network is reliable” 
• http://cacm.acm.org/blogs/blog-cacm/83396-errors-in-database-systems-eventual-consistency-and-the-cap-theorem/fulltext 
• http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf 
• “Towards Robust Distributed Systems,” http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf 
• “Why Do Computers Stop and What Can be Done About It,” Tandem Computers Technical Report 85.7, Cupertino, CA, 1985. http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf 
• http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf 
• http://codahale.com/you-cant-sacrifice-partition-tolerance/ 
• http://dl.acm.org/citation.cfm?id=258660 “Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web”