MAKE YOUR CHOICE
CONSISTENCY, AVAILABILITY, PARTITION
Andrea Giuliano
@bit_shark
DISTRIBUTED SYSTEMS
WHAT A DISTRIBUTED SYSTEM IS
“A distributed system is a software system in which
components located on networked computers communicate
and coordinate their actions by passing messages”
DISTRIBUTED SYSTEMS
EXAMPLES
DISTRIBUTED SYSTEMS
REPLICATION
REPLICATED SERVICE
PROPERTIES
CONSISTENCY
AVAILABILITY
CONSISTENCY
The result of operations is predictable
CONSISTENCY
Strong consistency
all replicas return the same value for the same object
Weak consistency
different replicas can return different values for the same object
STRONG VS WEAK CONSISTENCY
Strong consistency
ACID databases (Atomic, Consistent, Isolated, Durable)
Weak consistency
BASE databases (Basically Available, Soft-state, Eventually consistent)
EXAMPLE: CONSISTENCY
put(price, 10)
get(price)
price = 10
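The strong vs. weak distinction can be made concrete with a toy sketch (invented for this deck, not any real datastore): two replicas with asynchronous propagation, where a client reading the lagging replica observes a stale value.

```python
# Toy illustration: two replicas with asynchronous propagation,
# showing how weak consistency can expose a stale read.

class Replica:
    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def get(self, key, default=0):
        return self.store.get(key, default)

primary, secondary = Replica(), Replica()

primary.put("price", 10)          # the write lands on one replica first
stale = secondary.get("price")    # a read on the other replica still sees 0

secondary.put("price", 10)        # propagation eventually happens...
fresh = secondary.get("price")    # ...and the replicas converge on 10
```

Under strong consistency, the first `secondary.get("price")` would have to return 10 (or block) instead of the stale 0.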
AVAILABILITY
Every request received by a non-failing node results in a response
EXAMPLE: AVAILABILITY
COMMUNICATION
PARTITION TOLERANCE
the system continues to operate even in the presence of partitions
PARTITION TOLERANCE
Network failure
a group at each side of a faulty network entity (switch, backbone)
Process failure
system split in two groups: correct nodes and crashed nodes
CAP THEOREM
“Of three properties of shared-data systems
(data consistency, system availability and
tolerance to network partitions) only two can
be achieved at any given moment in time.”
THE PROOF
CAP THEOREM
Every replica starts with price = 0. At t1 a client in partition 1 issues put(price, 10); at t2 a client in partition 2 issues get(price). Since the two partitions cannot communicate, the get() can either answer with the stale value price = 0 (the system is not consistent) or give no response at all (the system is not available).
CAP THEOREM IN PRACTICE
CONSISTENCY + PARTITION TOLERANCE
➡ distributed databases
➡ distributed locking
➡ majority protocol
➡ active/passive replication
➡ quorum-based systems
e.g. BigTable
CAP THEOREM
AVAILABILITY + PARTITION TOLERANCE
➡ web caches
➡ stateless systems
➡ DNS
e.g. DynamoDB
CAP THEOREM
CONSISTENCY + AVAILABILITY
➡ single-site databases
➡ cluster databases
➡ LDAP
DYNAMO
REQUIREMENTS
DYNAMO
“customers should be able to view and add items
to their shopping cart even if disks are failing,
network routes are flapping, or data centers are
being destroyed by tornados.”
➡ reliable
➡ highly scalable
➡ always available
SIMPLE INTERFACE
DYNAMO
get(key)
locates the object replicas associated with the key and returns
a single object, or a list of objects with conflicting versions,
along with a context.
put(key, context, object)
determines where the replicas of the object should be
placed based on the associated key. The context
includes information such as the version of the object.
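The shape of this interface can be sketched with a hypothetical single-node mock (names and the version scheme are invented here; the real system distributes data across nodes and versions it with vector clocks):

```python
# Hypothetical single-node mock of Dynamo's two-operation interface.
# Only the call shapes match the slides; versioning here is a plain
# counter, not the vector clocks the real system uses.

class TinyStore:
    def __init__(self):
        self.data = {}                      # key -> (version, object)

    def get(self, key):
        """Return (objects, context): a list with one object (or, in the
        real system, several conflicting versions) plus an opaque context."""
        version, obj = self.data.get(key, (0, None))
        objects = [obj] if obj is not None else []
        return objects, {"version": version}

    def put(self, key, context, obj):
        """Store obj under key; the context carries the version
        information later used for conflict detection."""
        self.data[key] = (context["version"] + 1, obj)

store = TinyStore()
objects, context = store.get("cart")        # ([], {'version': 0})
store.put("cart", context, ["item-1"])
objects, context = store.get("cart")        # ([['item-1']], {'version': 1})
```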
REPLICATION: THE CHOICE
DYNAMO
Synchronous replica coordination
‣ strong consistency
‣ trades off availability
Optimistic replication
‣ high availability
‣ higher probability of conflicts
CONFLICTS: WHEN
DYNAMO
At write time
‣ writes may be rejected
At read time
‣ an “always writable” datastore
CONFLICTS: WHO
DYNAMO
The data store
‣ e.g. a “last write wins” policy
The application
‣ resolution as an implementation detail
A RING TO RULE THEM ALL
DYNAMO
PARTITIONING: THE RING
DYNAMO
Nodes A–G are placed on a hash ring; each data item is hashed to a position on the ring and assigned to the first node found walking clockwise.
REPLICATION
DYNAMO
With N = 3, each key is replicated on the coordinator and its clockwise successors: D will store keys in the ranges (A, B], (B, C], (C, D].
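The ring and its preference lists can be sketched as follows, assuming one position per node and MD5 as the hash function (the real system assigns many virtual nodes per physical node and ranks reachable nodes):

```python
import hashlib

# Sketch of Dynamo-style ring partitioning with N-way replication.
# Assumption: one ring position per node; real deployments use
# many virtual nodes per physical node.

RING_BITS = 32

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** RING_BITS)

nodes = ["A", "B", "C", "D", "E", "F", "G"]
ring = sorted((ring_hash(n), n) for n in nodes)

def preference_list(key: str, n: int = 3):
    """First n distinct nodes found walking clockwise from the key's
    position: the coordinator plus its n-1 successors on the ring."""
    h = ring_hash(key)
    ordered = [name for pos, name in ring if pos >= h]
    ordered += [name for pos, name in ring if pos < h]   # wrap around
    return ordered[:n]

print(preference_list("price"))   # the 3 replicas responsible for the key
```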
DATA VERSIONING
DYNAMO
put()
may return before the update has been propagated to
all replicas.
get()
a subsequent get() may return an object that does not
have the latest update.
RECONCILIATION
DYNAMO
Syntactic reconciliation
‣ the new version subsumes the previous one
Semantic reconciliation
‣ conflicting versions of the same object
VECTOR CLOCK
DYNAMO
Definition
‣ list of (node, counter) pairs
D1: [Sx,1] (write handled by Sx)
D2: [Sx,2] (write handled by Sx)
D3: [Sx,2], [Sy,1] (write handled by Sy)
D4: [Sx,2], [Sz,1] (write handled by Sz)
D5: [Sx,3], [Sy,1], [Sz,1] (reconciled and written by Sx)
D3 and D4 both descend from D2 but were written on different nodes, so they conflict; D5 reconciles them.
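The D1…D5 scenario above can be replayed with a minimal vector-clock sketch (clocks as dicts; `descends` is the subsumption test Dynamo uses for syntactic reconciliation):

```python
# Vector clocks as {node: counter} dicts: a minimal sketch of the
# comparison used for syntactic reconciliation.

def increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if clock a subsumes clock b (a happened after b)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def merge(a, b):
    """Element-wise maximum: used when reconciling divergent versions."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

d1 = increment({}, "Sx")             # {'Sx': 1}
d2 = increment(d1, "Sx")             # {'Sx': 2}
d3 = increment(d2, "Sy")             # {'Sx': 2, 'Sy': 1}
d4 = increment(d2, "Sz")             # {'Sx': 2, 'Sz': 1}

print(descends(d3, d2))                     # True: d3 subsumes d2
print(descends(d3, d4), descends(d4, d3))   # False False: conflict

d5 = increment(merge(d3, d4), "Sx")  # reconciled and written by Sx
print(d5)                            # {'Sx': 3, 'Sy': 1, 'Sz': 1}
```

Since neither d3 nor d4 descends from the other, syntactic reconciliation fails and the conflict is handed to semantic reconciliation (the application).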
PUT() AND GET()
DYNAMO
R
‣ minimum number of nodes that must participate
in a successful read operation
W
‣ minimum number of nodes that must participate
in a successful write operation
PUT() AND GET()
DYNAMO
put()
‣ the coordinator generates the vector clock for the new version and
writes the new version locally
‣ the new version is sent to the N highest-ranked reachable nodes
‣ the write is successful if at least W-1 nodes respond
get()
‣ the coordinator requests all existing versions of the data
‣ the coordinator waits for R responses before returning the result
‣ the coordinator returns all the causally unrelated versions
‣ divergent versions are reconciled and written back
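Why do R and W matter? When R + W > N, every read quorum must intersect every write quorum, so a read always contacts at least one node that saw the latest write. A brute-force check (my illustration, not part of the deck) makes this concrete:

```python
from itertools import combinations

# Brute-force check (not a proof) that R + W > N forces every
# read quorum to share at least one replica with every write quorum.

def quorums_overlap(n, r, w):
    replicas = range(n)
    return all(set(reads) & set(writes)
               for reads in combinations(replicas, r)
               for writes in combinations(replicas, w))

print(quorums_overlap(3, 2, 2))   # True:  2 + 2 > 3, quorums always overlap
print(quorums_overlap(3, 1, 1))   # False: 1 + 1 <= 3, stale reads possible
```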
SLOPPY QUORUM
DYNAMO
With N = 3, reads and writes are performed on the first N healthy nodes on the ring: if a node in the preference list is unreachable, its replica is temporarily stored on the next healthy node.
WHY IS IT AP?
DYNAMO
‣ requests are served even if some replicas are not available
‣ if a node is down, the write is stored on another node
‣ consistency conflicts are resolved at read time or in the
background
‣ eventually, all the replicas converge
‣ concurrent read/write operations can make distinct clients
see distinct versions of the same key
BIGTABLE
REQUIREMENTS
GOOGLE BIGTABLE
‣ scales to petabytes of data
‣ thousands of machines
‣ high availability
‣ high performance
DATA MODEL
GOOGLE BIGTABLE
‣ a sparse, distributed, persistent, multi-dimensional
sorted map
(row: string, column: string, time: int64) → string
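That signature can be sketched as a toy map (nothing like the real storage engine) keyed by (row, column, timestamp), with rows kept in lexicographic order and reads serving the newest version:

```python
# Toy sketch of Bigtable's data model: a map keyed by
# (row, column family:qualifier, timestamp). Reads return the
# value with the largest timestamp; rows sort lexicographically.

table = {}

def put_cell(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get_cell(row, column):
    """Return the value with the largest timestamp for (row, column)."""
    versions = {ts: v for (r, c, ts), v in table.items()
                if r == row and c == column}
    return versions[max(versions)] if versions else None

put_cell("com.example", "contents:", 1, "<html>v1...")
put_cell("com.example", "contents:", 2, "<html>v2...")
put_cell("com.cnn.www", "anchor:cnnsi.com", 1, "CNN")

print(get_cell("com.example", "contents:"))   # newest version: <html>v2...
print(sorted({r for (r, c, ts) in table}))    # rows in lexicographic order
```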
ROWS
GOOGLE BIGTABLE
‣ arbitrary strings
‣ reads/writes under a single row key are atomic
‣ data is maintained in lexicographic order by row key
‣ each row range is called a tablet
e.g. maps.google.com is stored under the row key com.google.maps
COLUMNS
GOOGLE BIGTABLE
‣ column keys are grouped into sets called column families
‣ a column family must be created before data can be
stored under any column key in that family
‣ column keys are named family:qualifier
‣ access control and both disk and memory
accounting are performed at the column-family level
TIMESTAMPS
GOOGLE BIGTABLE
e.g. the contents: column of row com.example holds two versions of the page (<html>…), one at timestamp t1 and one at t2.
DATA MODEL: EXAMPLE
GOOGLE BIGTABLE

row keys (sorted)  | language: | contents:              | anchor:cnnsi.com | anchor:mylook.ca
com.cnn.www        | en        | <!DOCTYPE html PUBLIC… | “cnn”            | “cnn.com”
com.cnn.www/foo    | en        | <!DOCTYPE html PUBLIC… |                  |
com.example        | en        | <!DOCTYPE html PUBLIC… |                  |
DIFFERENCES WITH RDBMS
GOOGLE BIGTABLE

RDBMS            | BIGTABLE
query language   | specific API
joins            | no referential integrity
explicit sorting | sorting defined a priori in the column family
ARCHITECTURE
GOOGLE BIGTABLE
Google File System (GFS)
‣ stores data files and logs
Google SSTable
‣ stores BigTable data
Chubby
‣ a highly available distributed lock service
COMPONENTS
GOOGLE BIGTABLE
library
‣ linked into every client
one master server
‣ assigns tablets to tablet servers
‣ detects the addition and expiration of tablet servers
‣ balances tablet-server load
‣ garbage-collects files in GFS
‣ handles schema changes
many tablet servers
‣ each manages ten to a thousand tablets
‣ handles read and write requests to its tablets
‣ splits tablets that have grown too large
COMPONENTS
GOOGLE BIGTABLE
Clients read and write data directly through the tablet servers; the master handles tablet assignment and metadata, so client data traffic does not flow through it.
STARTUP AND GROWTH
GOOGLE BIGTABLE
Tablet location hierarchy: a Chubby file points to the root tablet (the 1st Metadata tablet), which points to the other metadata tablets, which in turn point to the user tablets (UserTable1 … UserTableN).
TABLET ASSIGNMENT
GOOGLE BIGTABLE
tablet server
‣ when started, creates and acquires a lock in Chubby
master
‣ grabs a unique master lock in Chubby
‣ scans Chubby to find live tablet servers
‣ asks each tablet server which tablets it holds
‣ scans the Metadata table to learn the full set of tablets
‣ builds the set of unassigned tablets, for future tablet
assignment
WHY IS IT CP?
GOOGLE BIGTABLE
‣ if the master dies, the services it provides stop functioning
‣ if a tablet server dies, its tablets become unavailable
‣ if Chubby dies, BigTable can no longer execute
synchronization operations or serve client requests
‣ the Google File System is itself a CP system
$ WHOAMI
Andrea Giuliano
@bit_shark
www.andreagiuliano.it
joind.in/13224
Please rate the talk!
REFERENCES
G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store”
F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data”
Assets:
https://farm1.staticflickr.com/41/86744006_0026864df8_b_d.jpg
https://farm9.staticflickr.com/8305/7883634326_4e51a1a320_b_d.jpg
https://farm5.staticflickr.com/4145/4958650244_65b2eddffc_b_d.jpg
https://farm4.staticflickr.com/3677/10023456065_e54212c52e_b_d.jpg
https://farm4.staticflickr.com/3076/2871264822_261dafa44c_o_d.jpg
https://farm1.staticflickr.com/7/6111406_30005bdae5_b_d.jpg
https://farm4.staticflickr.com/3928/15416585502_92d5e608c7_b_d.jpg
https://farm8.staticflickr.com/7046/6873109431_d3b5199f7d_b_d.jpg
https://farm4.staticflickr.com/3007/2835755867_c530b0e0c6_o_d.jpg
https://farm3.staticflickr.com/2788/4202444169_2079db9580_o_d.jpg
https://farm1.staticflickr.com/55/129619657_907b480c7c_b_d.jpg
https://farm5.staticflickr.com/4046/4368269562_b3e05e3f06_b_d.jpg
https://farm8.staticflickr.com/7344/12137775834_d0cecc5004_k_d.jpg
https://farm5.staticflickr.com/4073/4895191036_1cb9b58d75_b_d.jpg
https://farm4.staticflickr.com/3144/3025249284_b77dec2d29_o_d.jpg
https://www.flickr.com/photos/avardwoolaver/7137096221
