1. Megastore and Spanner
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology (Tehran Polytechnic)
2. Motivation
Storage requirements of today’s interactive online applications:
• Scalability (a billion internet users)
• Rapid development
• Responsiveness (low latency)
• Durability and consistency (never lose data)
• Fault tolerance (no unplanned/planned downtime)
• Easy operations (minimize confusion; support is expensive)
These requirements are in conflict.
4. Motivation
Relational DBMS, e.g., MySQL, MS SQL, Oracle RDB
• Rich set of features
• Difficult to scale to massive read and write volumes
NoSQL, e.g., BigTable, Dynamo, Cassandra
• Highly scalable
• Limited API
6. NewSQL Databases
NoSQL scalability + RDBMS ACID
E.g., Megastore and Spanner
8. Megastore
Started in 2006 for app development at Google.
Google’s wide-area replicated data store.
Adds (limited) transactions to wide-area replicated data stores.
Used by GMail, Google+, Android Market, Google App Engine, ...
9. Megastore
Megastore is layered on:
• GFS (distributed file system)
• Bigtable (scalable NoSQL data store, one per datacenter)
BigTable is cluster-level structured storage, while Megastore is a geo-scale structured database.
[http://cse708.blogspot.jp/2011/03/megastore-providing-scalable-highly.html]
11. Entity Group
The data is partitioned into a collection of entity groups (EGs).
13. Entity Group Replication (1/2)
Each entity group is independently and synchronously replicated over a wide area.
Megastore’s replication system provides a single consistent view of the data stored in its underlying replicas.
14. Entity Group Replication (2/2)
Synchronous replication: a low-latency implementation of Paxos.
Basic Paxos is not used: it is a poor match for high-latency links.
• Writes require at least two inter-replica round-trips to reach consensus: a prepare round and an accept round.
• Reads require one inter-replica round-trip: a prepare round.
Megastore instead uses a modified version of Paxos: fast reads, fast writes.
18. Entity Group Transaction (1/3)
Within each EG: full ACID semantics.
Transaction management using Write-Ahead Logging (WAL).
BigTable feature: the ability to store multiple values for the same row/column with different timestamps.
This enables Multiversion Concurrency Control (MVCC) within EGs.
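A minimal sketch of that MVCC idea, assuming a toy in-memory store (all names hypothetical; Bigtable itself keys cells by (row, column, timestamp)):

```python
import bisect  # requires Python 3.10+ for bisect's key= parameter

class VersionedStore:
    """Toy MVCC cell store: each (row, column) keeps all of its
    timestamped versions, mirroring Bigtable's multi-version cells."""

    def __init__(self):
        self.cells = {}  # (row, column) -> list of (ts, value), sorted by ts

    def write(self, row, column, ts, value):
        # Assumes unique timestamps per cell, so tuple order follows ts.
        bisect.insort(self.cells.setdefault((row, column), []), (ts, value))

    def read_at(self, row, column, ts):
        """Snapshot read: latest value whose timestamp is <= ts."""
        versions = self.cells.get((row, column), [])
        i = bisect.bisect_right(versions, ts, key=lambda v: v[0])
        return versions[i - 1][1] if i > 0 else None

s = VersionedStore()
s.write("user1", "status", ts=10, value="draft")
s.write("user1", "status", ts=20, value="posted")
print(s.read_at("user1", "status", ts=15))  # -> "draft"
```

Because a transaction’s mutations are applied at its commit timestamp, a reader at an earlier timestamp never observes them.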
22. Entity Group Transaction (2/3)
Read consistency, in three flavors:
• Current: waits until all previously committed writes are applied, then reads the last committed value.
• Snapshot: does not wait; reads the last committed value as-is.
• Inconsistent: ignores the state of the log and reads the latest values directly (data may be stale).
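A sketch of the three read modes against a toy per-EG write-ahead log (all names hypothetical):

```python
class EntityGroupLog:
    """Toy per-EG write-ahead log; entries are in log-position order."""
    def __init__(self, entries):
        self.entries = entries  # dicts: {"value", "committed", "applied"}

    def catch_up(self):
        # Replay committed-but-unapplied entries before answering.
        for e in self.entries:
            if e["committed"]:
                e["applied"] = True

    def last_committed(self):
        vals = [e["value"] for e in self.entries if e["committed"]]
        return vals[-1] if vals else None

def current_read(log):
    log.catch_up()               # wait for in-flight committed writes
    return log.last_committed()

def snapshot_read(log):
    return log.last_committed()  # don't wait

def inconsistent_read(log):
    # Ignore the log's state entirely; result may be stale or uncommitted.
    return log.entries[-1]["value"] if log.entries else None

log = EntityGroupLog([
    {"value": "v1", "committed": True,  "applied": True},
    {"value": "v2", "committed": False, "applied": False},  # in-flight write
])
print(current_read(log), snapshot_read(log), inconsistent_read(log))  # v1 v1 v2
```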
26. Entity Group Transaction (3/3)
Write consistency:
• Determine the next available log position.
• Assign the WAL mutations a timestamp higher than any previous one.
• Employ Paxos to settle contention for the log position.
• Based on optimistic concurrency control: when multiple writers race for the same log position, only one wins; the rest notice the victorious write, abort, and retry their operations.
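A sketch of that optimistic write loop, with a single-machine stand-in for the per-log-position Paxos instance (all names hypothetical):

```python
class ToyLog:
    """Toy replicated WAL: chosen[i] is the value that consensus
    picked for log position i (a real system runs Paxos here)."""
    def __init__(self):
        self.chosen = []

    def propose(self, position, proposal):
        # Stand-in for one Paxos instance: the first proposal for a
        # position wins; later proposers learn the winner instead.
        if position == len(self.chosen):
            self.chosen.append(proposal)
        return self.chosen[position]

def commit_write(log, mutation, max_retries=10):
    """Race for the next free log position; a loser observes the
    victorious write, aborts, and retries at a later position."""
    for _ in range(max_retries):
        pos = len(log.chosen)                            # next available position
        ts = log.chosen[-1][0] + 1 if log.chosen else 1  # exceed all prior timestamps
        proposal = (ts, mutation)
        if log.propose(pos, proposal) is proposal:       # did our proposal win?
            return pos
        # else: a competing write won position `pos`; retry at the next slot
    raise RuntimeError("transaction aborted: too much write contention")

log = ToyLog()
print(commit_write(log, "write-A"), commit_write(log, "write-B"))  # 0 1
```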
30. Across Entity Group Transactions (1/3)
Across entity groups: only limited consistency guarantees.
Two methods:
• Asynchronous messaging (queues)
• Two-Phase Commit (2PC)
31. Across Entity Group Transactions (2/3)
Queues
Provide transactional messaging between EGs.
Each message is either:
• Synchronous: the sending and receiving entity group are the same.
• Asynchronous: the sending and receiving entity groups differ.
Useful for operations that affect many EGs.
32. Across Entity Group Transactions (3/3)
Two-Phase Commit
Atomicity is satisfied, but at the cost of high latency.
34. Limitations of Existing Systems
BigTable
• Scalability
• High throughput
• High performance
• Transactional scope limited to a single row
• Only eventually-consistent replication across datacenters
35. Limitations of Existing Systems
Megastore
• Replicated ACID transactions
• Schematized semi-relational tables
• Synchronous replication across datacenters
• Poor write throughput
• Lack of a query language
36. Spanner
Bridging the gap between Megastore and Bigtable.
SQL transactions + high throughput
37. Spanner
A global-scale database with strict transactional guarantees.
Global scale
• Across datacenters
• Scales up to millions of nodes, hundreds of datacenters, and trillions of database rows
Strict transactional guarantees
• General transactions (even inter-row)
• Reliable even during wide-area natural disasters
42. Spanner Organization (1/2)
Universe: a Spanner deployment.
Zone: analogous to a deployment of BigTable servers; the unit of physical isolation. Each zone has:
• One zonemaster: assigns data to spanservers.
• Location proxies: used by clients to locate the spanservers assigned to serve their data.
• Thousands of spanservers: serve data to clients.
46. Spanner Organization (2/2)
The universe master: a console that displays status information about all the zones.
The placement driver: handles automated movement of data across zones.
47. Spanserver Software Stack (1/4)
Each spanserver is responsible for 100 to 1000 instances of a data structure called a tablet (similar to a BigTable tablet).
A tablet implements the mapping (key: string, timestamp: int64) → string.
Data and logs are stored on Colossus (the successor of GFS).
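A minimal sketch of that mapping as a plain dictionary (the real tablet state lives in B-tree-like files plus a write-ahead log on Colossus):

```python
class Tablet:
    """Toy tablet: (key: str, timestamp: int) -> str. Unlike Bigtable,
    the timestamp is part of the data model, not a hidden version tag."""
    def __init__(self):
        self.data: dict[tuple[str, int], str] = {}

    def put(self, key: str, ts: int, value: str) -> None:
        self.data[(key, ts)] = value

    def get(self, key: str, ts: int) -> str | None:
        return self.data.get((key, ts))

t = Tablet()
t.put("users/42", ts=100, value="alice")
print(t.get("users/42", ts=100))  # -> alice
```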
48. Spanserver Software Stack (2/4)
A single Paxos state machine on top of each tablet: consistent replication.
Paxos group: all machines involved in an instance of Paxos.
The Paxos implementation supports long-lived leaders with time-based leader leases.
51. Spanserver Software Stack (3/4)
Writes must initiate the Paxos protocol at the leader.
Reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date.
53. Spanserver Software Stack (4/4)
Transaction manager: supports distributed transactions.
• Present at every replica that is a Paxos leader.
54. Transactions Involving Only One Paxos Group
This is the case for most transactions.
A long-lived Paxos leader:
• The transaction manager: the participant leader.
• The other replicas in the group: participant slaves.
A lock table for concurrency control:
• Handles multiple concurrent transactions.
• Maintained by the Paxos leader.
• Maps ranges of keys to lock states.
• Two-phase locking.
• Wound-wait for deadlock avoidance: a young transaction is aborted if an older transaction needs a resource that the young transaction holds (see the sketch below).
Single-group transactions can bypass the transaction manager.
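A sketch of the wound-wait rule, assuming transactions carry their start timestamps (lower timestamp = older; names hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Txn:
    name: str
    start_ts: int  # lower = older

def wound_wait(requester: Txn, holder: Txn) -> str:
    """Deadlock avoidance: an older requester 'wounds' (aborts) a
    younger lock holder; a younger requester simply waits."""
    if requester.start_ts < holder.start_ts:
        return f"abort {holder.name}: wounded by older {requester.name}"
    return f"{requester.name} waits for {holder.name}"

print(wound_wait(Txn("T_old", 1), Txn("T_young", 2)))  # young holder dies
print(wound_wait(Txn("T_young", 2), Txn("T_old", 1)))  # young requester waits
```

Waits only ever go from younger to older transactions, so a waiting cycle, and hence a deadlock, cannot form.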
61. Transactions Involving Multiple Paxos Groups
One of the participant groups is chosen as the coordinator:
• The participant leader of that group is referred to as the coordinator leader.
• The slaves of that group are the coordinator slaves.
The groups’ leaders coordinate to perform two-phase commit.
The state of each transaction manager is stored in the underlying Paxos group (and is therefore replicated).
64. Data Model and Directories
65. Data Model
An application creates one or more databases in a universe.
Each database can contain an unlimited number of schematized tables.
Table
• Rows and columns.
• Must have an ordered set of one or more primary-key columns.
• The primary key uniquely identifies each row.
Hierarchies of tables
• Clients must partition tables into one or more hierarchies of tables (see the key-layout sketch below).
• The table at the top of a hierarchy: the directory table.
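A sketch of how such a hierarchy lays out keys, loosely following the paper’s Users/Albums example: each child row’s key is prefixed by its parent’s key, so one user’s rows form a contiguous range (all names hypothetical):

```python
# Directory table Users(uid) with child table Albums(uid, aid):
# interleaving stores each Albums row under its parent's key prefix.
rows = {
    ("Users", 1): "alice",
    ("Users", 1, "Albums", 10): "vacation",
    ("Users", 1, "Albums", 11): "pets",
    ("Users", 2): "bob",
    ("Users", 2, "Albums", 20): "food",
}

# Every key sharing the prefix ("Users", 1) is contiguous in sorted
# key order: exactly the rows that form one directory.
print(sorted(k for k in rows if k[:2] == ("Users", 1)))
```

This common-prefix layout is what the next slides call a directory.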
69. Directory (1/2)
A set of contiguous keys that share a common prefix.
All data in a directory has the same replication configuration.
The directory is the smallest unit whose geographic-replication properties can be specified by an application.
A Paxos group may contain multiple directories.
73. Directory (2/2)
Spanner might move a directory:
• To shed load from a Paxos group.
• To put directories that are frequently accessed together into the same group.
• To move a directory into a group that is closer to its accessors.
79. Key Innovation
Spanner knows what time it is.
80. Time Synchronization (1/2)
Is synchronizing time at global scale possible?
Synchronizing time within and between datacenters is extremely hard and uncertain.
Serializing requests is therefore impossible at global scale.
83. Time Synchronization (2/2)
Idea: accept uncertainty, but keep it small and quantify it (using GPS and atomic clocks).
84. TrueTime API
TT.now() returns a TTinterval [earliest, latest] that is guaranteed to contain the absolute time at which TT.now() was invoked.
TT.after(t) is true if t has definitely passed; TT.before(t) is true if t has definitely not arrived.
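A minimal sketch of the API’s shape, assuming the local clock’s error is bounded by a known ε (the real implementation derives ε from GPS and atomic-clock time masters; the paper reports ε of roughly 1–7 ms):

```python
import time
from dataclasses import dataclass

@dataclass
class TTinterval:
    earliest: float
    latest: float

class TrueTime:
    """Toy TrueTime: expose bounded clock uncertainty instead of
    pretending the local clock is exact."""
    def __init__(self, epsilon: float = 0.007):  # assumed ~7 ms bound
        self.epsilon = epsilon

    def now(self) -> TTinterval:
        t = time.time()  # local clock reading
        return TTinterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        return self.now().earliest > t   # t has definitely passed

    def before(self, t: float) -> bool:
        return self.now().latest < t     # t has definitely not arrived
```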
85. How TrueTime Is Implemented
Each machine runs a daemon that polls a variety of time masters:
• Masters chosen from nearby datacenters
• Masters from farther datacenters
• Armageddon masters (equipped with atomic clocks rather than GPS)
The daemon reaches a consensus about the correct timestamp.
87. External Consistency (1/2)
Example: Jerry unfriends Tom, then writes a controversial comment.
If the system serializes the comment before the unfriending, Jerry will be in trouble!
89. External Consistency (2/2)
External consistency, formally: if the commit of T1 precedes the initiation of a new transaction T2 in wall-clock (physical) time, then the commit of T1 must also precede the commit of T2 in the serial order.
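In the Spanner paper’s notation, with $t_{abs}(e)$ the absolute time of event $e$ and $s_i$ the commit timestamp of transaction $T_i$:

$$t_{abs}(e_1^{commit}) < t_{abs}(e_2^{start}) \implies s_1 < s_2$$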
90. Snapshot Reads
Read in the past without locking.
The client can specify a timestamp for the read, or an upper bound on the timestamp.
Each replica tracks a value called safe time, t_safe: the maximum timestamp at which the replica is up-to-date.
A replica can satisfy a read at any t ≤ t_safe.
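A sketch of that serving rule (hypothetical names; how a replica computes t_safe from its Paxos and transaction-manager state is elided):

```python
def serve_snapshot_read(versions, t, t_safe):
    """A replica may answer a lock-free read at timestamp t only if
    t <= t_safe; otherwise the client must retry elsewhere or later."""
    if t > t_safe:
        raise RuntimeError("replica not sufficiently up-to-date at t")
    # MVCC read: latest version with timestamp <= t.
    eligible = [(ts, v) for ts, v in versions if ts <= t]
    return max(eligible)[1] if eligible else None

versions = [(5, "v1"), (9, "v2")]                    # (timestamp, value)
print(serve_snapshot_read(versions, t=7, t_safe=8))  # -> v1
```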
94. Read-Only Transactions
Assign a timestamp s_read and do a snapshot read at s_read:
s_read = TT.now().latest
This guarantees external consistency.
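A sketch of why this works, reusing the toy ε-bounded clock from the TrueTime sketch (names hypothetical): any transaction that committed before the read started has a commit timestamp ≤ TT.now().latest, so a snapshot at s_read sees it.

```python
import time

EPS = 0.007  # assumed clock-uncertainty bound, as in the TrueTime sketch

def read_only_txn(snapshot_read_at):
    s_read = time.time() + EPS       # s_read = TT.now().latest
    return snapshot_read_at(s_read)  # lock-free snapshot read at s_read

print(read_only_txn(lambda t: f"db state as of {t:.3f}"))
```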
95. Read-Write Transactions (1/3)
A leader must only assign timestamps within the interval of its leader lease.
Timestamps must be assigned in monotonically increasing order.
If transaction T1 commits before T2 starts, then T2’s commit timestamp must be greater than T1’s commit timestamp.
98. Read-Write Transactions (2/3)
Clients buffer writes.
The client chooses a coordinator group, which initiates two-phase commit.
Each non-coordinator participant leader chooses a prepare timestamp and logs a prepare record through Paxos, then notifies the coordinator.
99. Read-Write Transactions (3/3)
The coordinator assigns a commit timestamp s_i that is no less than all prepare timestamps and no less than TT.now().latest.
The coordinator ensures that clients cannot see any data committed by T_i until TT.after(s_i) is true. This is done by commit wait: waiting until the absolute time has definitely passed s_i before committing.
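A sketch of the commit-timestamp rule and commit wait under the same toy ε-bounded clock as above (names hypothetical):

```python
import time

EPS = 0.007  # assumed clock-uncertainty bound

def tt_now():
    t = time.time()
    return (t - EPS, t + EPS)          # (earliest, latest)

def coordinator_commit(prepare_timestamps):
    # s_i >= every prepare timestamp and >= TT.now().latest.
    s = max(max(prepare_timestamps), tt_now()[1])
    # Commit wait: block until TT.after(s), i.e. until the absolute
    # time has definitely passed s, before applying and replying.
    while not tt_now()[0] > s:
        time.sleep(EPS / 2)
    return s

print(coordinator_commit([time.time(), time.time() + 0.001]))
```

The wait costs roughly 2ε per commit, which is why keeping the uncertainty ε small matters so much.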
101. Summary
Megastore
Entity groups (EGs)
Within an EG: ACID transactions using Paxos
Across EGs: queues and two-phase commit
102. Summary
Spanner
Replica consistency: the Paxos protocol
Concurrency control: two-phase locking
Transaction coordination: two-phase commit
Timestamps (via TrueTime) for transactions and data items