Megastore and Spanner
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology
(Tehran Polytechnic)
Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 1 / 54
Motivation
Storage requirements of today’s interactive online applications.
• Scalability (a billion internet users)
• Rapid development
• Responsiveness (low latency)
• Durability and consistency (never lose data)
• Fault tolerant (no unplanned/planned downtime)
• Easy operations (minimize confusion, support is expensive)
These requirements are in conflict.
Motivation
Relational DBMS, e.g., MySQL, MS SQL, Oracle RDB
• Rich set of features
• Difficult to scale to the massive amount of reads and writes.
NoSQL, e.g., BigTable, Dynamo, Cassandra
• Highly Scalable
• Limited API
NewSQL Databases
NoSQL scalability + RDBMS ACID
E.g., Megastore and Spanner
Megastore
Megastore
Started in 2006 for app development at Google.
Google’s wide-area replicated data store.
Adds (limited) transactions to wide-area replicated data stores.
GMail, Google+, Android Market, Google App Engine, ...
Megastore
Megastore layered on:
• GFS (Distributed file system)
• Bigtable (NoSQL scalable data store per datacenter)
BigTable is cluster-level structured storage, while Megastore is a geo-scale structured database.
[http://cse708.blogspot.jp/2011/03/megastore-providing-scalable-highly.html]
Entity Group (1/2)
The data is partitioned into a collection of entity groups (EG).
Entity Group (2/2)
Entity Group Replication (1/2)
Each entity group is independently and synchronously replicated over a wide area.
Megastore’s replication system provides a single consistent view of
the data stored in its underlying replicas.
Entity Group Replication (2/2)
Synchronous replication: a low-latency implementation of Paxos.
Basic Paxos is not used: it is a poor match for high-latency links.
• Writes require at least two inter-replica round-trips to achieve consensus: a prepare round and an accept round
• Reads require one inter-replica round-trip: a prepare round
Megastore uses a modified version of Paxos with fast reads and fast writes.
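To make the two round-trips of basic Paxos concrete, here is a minimal in-memory sketch of single-decree Paxos. It is a textbook illustration, not Megastore's modified protocol; the names `Acceptor` and `propose` are hypothetical, and network calls are replaced by direct method calls.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                           # highest proposal number promised
    accepted: Optional[Tuple[int, str]] = None   # (number, value) last accepted

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered proposal was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one proposal; return the chosen value, or None if it failed."""
    majority = len(acceptors) // 2 + 1

    # Round-trip 1: prepare round.
    replies = [a.prepare(n) for a in acceptors]
    promises = [prior for ok, prior in replies if ok]
    if len(promises) < majority:
        return None

    # If some acceptor already accepted a value, adopt the one with the
    # highest proposal number instead of our own value.
    prior_values = [p for p in promises if p is not None]
    if prior_values:
        value = max(prior_values, key=lambda p: p[0])[1]

    # Round-trip 2: accept round.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None
```

Each call to `propose` contacts the acceptors twice, which is the cost that becomes expensive when the acceptors sit in different datacenters.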
Entity Group Transaction (1/3)
Within each EG: full ACID semantics
Transaction management using Write Ahead Logging (WAL).
BigTable feature: multiple values can be stored in the same row/column with different timestamps.
This enables Multiversion Concurrency Control (MVCC) in EGs.
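The idea of timestamped cell versions can be sketched as a small in-memory store. This is only an illustration of multiversion reads under assumed names (`VersionedStore`, `write`, `read`); it is not Bigtable's or Megastore's API.

```python
import bisect
from collections import defaultdict


class VersionedStore:
    """Keeps every version of a cell, ordered by timestamp."""

    def __init__(self):
        self._timestamps = defaultdict(list)   # key -> sorted list of timestamps
        self._values = defaultdict(dict)       # key -> {timestamp: value}

    def write(self, key, timestamp, value):
        # A new timestamp adds a version; an existing one is overwritten.
        if timestamp not in self._values[key]:
            bisect.insort(self._timestamps[key], timestamp)
        self._values[key][timestamp] = value

    def read(self, key, timestamp):
        """Return the newest value written at or before `timestamp`, or None."""
        versions = self._timestamps.get(key, [])
        i = bisect.bisect_right(versions, timestamp)
        if i == 0:
            return None
        return self._values[key][versions[i - 1]]
```

Because old versions are never overwritten, a reader that fixes its timestamp sees a stable view while writers keep adding newer versions.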
Entity Group Transaction (2/3)
Read consistency
• Current: ensures that all previously committed writes are applied, then reads the last committed value.
• Snapshot: doesn't wait; reads the last fully applied committed values.
• Inconsistent: ignores the state of the log and reads the latest values directly (data may be stale).
Entity Group Transaction (3/3)
Write consistency
• Determine the next available log position.
• Assign the WAL mutations a timestamp higher than any previous one.
• Use Paxos to settle contention for the log position.
• Based on optimistic concurrency: if multiple writers target the same log position, only one wins; the rest notice the victorious write, abort, and retry their operations.
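The write path above can be sketched as a compare-and-append loop. This is a single-process illustration: `ReplicatedLog.try_append` is a hypothetical stand-in for the Paxos round that decides which writer owns a log position.

```python
class ReplicatedLog:
    """Toy write-ahead log; one entry per committed transaction."""

    def __init__(self):
        self.entries = []

    def next_position(self):
        return len(self.entries)

    def try_append(self, position, mutation):
        # Stand-in for Paxos: the append succeeds only if nobody else
        # has already claimed this log position.
        if position != len(self.entries):
            return False
        self.entries.append(mutation)
        return True


def commit(log, make_mutation, max_retries=5):
    """Optimistically claim the next log position, retrying on conflict."""
    for _ in range(max_retries):
        position = log.next_position()
        mutation = make_mutation(position)   # re-run the transaction logic
        if log.try_append(position, mutation):
            return position
    raise RuntimeError("gave up after repeated conflicts")
```

A writer that loses the race simply observes the new log position on its next attempt and retries there, so no locks are held across replicas.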
Across Entity Group Transaction (1/3)
Across entity groups: limited consistency guarantees
Two methods:
• Asynchronous messaging (queue)
• Two-Phase-Commit (2PC)
Across Entity Group Transaction (2/3)
Queues
Provide transactional messaging between EGs.
Each message has a single sending and a single receiving entity group:
• Synchronous: the sending and receiving entity group are the same.
• Asynchronous: the sending and receiving entity groups differ.
Useful to perform operations that affect many EGs.
Across Entity Group Transaction (3/3)
Two-Phase Commit
Atomicity is satisfied.
High latency
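A minimal sketch of two-phase commit shows where both the atomicity and the latency come from. The `Participant` class and `two_phase_commit` function are hypothetical names for illustration; failure handling, logging, and timeouts are omitted.

```python
class Participant:
    """One entity group taking part in the transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Vote yes only if the local part of the transaction can commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if every participant committed, False if all aborted."""
    # Phase 1: collect a vote from every participant (one round-trip each).
    votes = [p.prepare() for p in participants]

    # Phase 2: broadcast the unanimous decision (a second round-trip each).
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

The coordinator cannot decide until the slowest participant has voted, and every participant is contacted twice, which is why this path is slower than a transaction inside a single entity group.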
Spanner
Limitations of Existing Systems
BigTable
• Scalability
• High throughput
• High performance
• Transactional scope limited to single row
• Eventually-consistent replication support across data-centers
Limitations of Existing Systems
Megastore
• Replicated ACID transactions
• Schematized semi-relational tables
• Synchronous replication support across data-centers
• Performance (poor write throughput)
• Lack of query language
Spanner
Bridging the gap between Megastore and Bigtable.
SQL transactions + high throughput
Spanner
Global scale database with strict transactional guarantees.
Global scale
• Across datacenters
• Scale up to millions of nodes, hundreds of datacenters, trillions of
database rows
Strict transactional guarantees
• General transactions (even inter-row)
• Reliable even during wide-area natural disasters
Spanner Implementation
Spanner Organization (1/2)
Universe: Spanner deployment
Zones: analogous to a deployment of BigTable servers (the unit of physical isolation)
• One zonemaster: assigns data to spanservers
• The proxies: used by clients to locate the spanservers assigned to
serve their data
• Thousands of spanservers: serve data to clients
Spanner Organization (2/2)
The universe master: a console that displays status information
about all the zones.
The placement driver: handles automated movement of data across
zones.
Spanserver Software Stack (1/4)
Each spanserver is responsible for 100-1000 instances of a data structure called a tablet (similar to a BigTable tablet).
Tablet mapping: (key: string, timestamp: int64) → string
Data and logs stored on Colossus (successor of GFS).
Spanserver Software Stack (2/4)
A single Paxos state machine on top of each tablet: consistent replication
Paxos group: all machines involved in an instance of Paxos.
The Paxos implementation supports long-lived leaders with time-based leader leases.
Spanserver Software Stack (3/4)
Writes must initiate the Paxos protocol at the leader.
Reads access state directly from the underlying tablet at any replica
that is sufficiently up-to-date.
Spanserver Software Stack (4/4)
Transaction manager: to support distributed transactions
• At every replica that is a leader.
Transactions Involving Only One Paxos Group
This is the case for most transactions.
A long-lived Paxos leader.
• The transaction manager: participant leader
• The other replicas in the group: participant slaves
A lock table for concurrency control.
• Multiple concurrent transactions.
• Maintained by the Paxos leader.
• Maps ranges of keys to lock states.
• Two-phase locking.
• Wound-wait for deadlock avoidance: a younger transaction is aborted if an older transaction needs a resource that the younger one holds.
A transaction that involves only one Paxos group can bypass the transaction manager.
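The wound-wait rule from the lock table above can be sketched as follows. This is a simplified illustration: it locks single keys rather than key ranges, uses the transaction's start timestamp as its age (smaller means older), and the `LockTable` name and return strings are assumptions.

```python
class LockTable:
    """Toy exclusive-lock table with wound-wait conflict resolution."""

    def __init__(self):
        self.holder = {}   # key -> start timestamp of the transaction holding it

    def acquire(self, key, txn_ts):
        held_by = self.holder.get(key)
        if held_by is None or held_by == txn_ts:
            self.holder[key] = txn_ts
            return "granted"
        if txn_ts < held_by:
            # The requester is older: the younger holder is wounded
            # (aborted) and the lock passes to the older transaction.
            self.holder[key] = txn_ts
            return "wound_holder"
        # The requester is younger: it waits for the older holder.
        return "wait"

    def release(self, key, txn_ts):
        if self.holder.get(key) == txn_ts:
            del self.holder[key]
```

Since only younger transactions ever wait for older ones, a cycle of waiting transactions cannot form, which is what rules out deadlock.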
Transactions Involving Multiple Paxos Groups
One of the participant groups is chosen as the coordinator.
• The participant leader of that group will be referred to as the
coordinator leader.
• The slaves of that group as coordinator slaves.
The groups' leaders coordinate to perform two-phase commit.
The state of each transaction manager is stored in the underlying Paxos group (and therefore is replicated).
Data Model and Directories
Data Model
An application creates one or more databases in a universe.
Each database can contain an unlimited number of schematized
tables.
Table
• Rows and columns
• Must have an ordered set of one or more primary key columns
• Primary key uniquely identifies each row
Hierarchies of tables
• Tables must be partitioned by client into one or more hierarchies of
tables
• Table in the top: directory table
Directory (1/2)
Set of contiguous keys that share a common prefix.
All data in a directory has the same replication configuration.
The smallest unit whose geographic replication properties can be
specified by an application.
A Paxos group may contain multiple directories.
Directory (2/2)
Spanner might move a directory:
• To shed load from a Paxos group.
• To put directories that are frequently accessed together into the
same group.
• To move a directory into a group that is closer to its accessors.
Example
TrueTime and Consistency
Key Innovation
Spanner knows what time it is.
Time Synchronization (1/2)
Is synchronizing time at the global scale possible?
Synchronizing time within and between datacenters is extremely
hard and uncertain.
Serialization of requests is impossible at global scale.
Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 39 / 54
Time Synchronization (2/2)
Idea: accept the uncertainty, keep it small, and quantify it (using GPS and atomic clocks).
TrueTime API
TT.now() returns a TTinterval that is guaranteed to contain the absolute time during which TT.now() was invoked.
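The interval-based API can be sketched on top of an ordinary local clock. This is an illustration, not Google's implementation: the class names, the injectable `clock` parameter, and the fixed `epsilon` of 7 ms are assumptions, and a real deployment derives the uncertainty bound from its GPS and atomic-clock time masters. `TT.before` is the companion to `TT.after` in the Spanner paper.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float   # absolute time is definitely not before this
    latest: float     # absolute time is definitely not after this


class TrueTime:
    def __init__(self, epsilon=0.007, clock=time.time):
        self.epsilon = epsilon   # assumed uncertainty bound, in seconds
        self.clock = clock       # injectable so the sketch can be tested

    def now(self):
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        """True if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t):
        """True if t has definitely not arrived yet."""
        return self.now().latest < t
```

Callers never see a single "current time"; they reason only about what is definitely in the past or definitely in the future.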
How TrueTime Is Implemented? (1/2)
How TrueTime Is Implemented? (2/2)
A daemon polls a variety of time masters and reaches a consensus about the correct timestamp:
• Masters chosen from nearby datacenters
• Masters from farther datacenters
• Armageddon masters
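One way to picture how a daemon could reconcile answers from several masters is Marzullo's algorithm, which the Spanner paper says the daemons use in a variant form. The sketch below is the textbook version, not Spanner's code; each master's answer is modelled as a hypothetical (lo, hi) interval that should contain the true time.

```python
def marzullo(intervals):
    """Return (lo, hi, count): the range covered by the most source intervals."""
    events = []
    for lo, hi in intervals:
        events.append((lo, -1))   # an interval starts
        events.append((hi, +1))   # an interval ends
    # Sort by offset; at equal offsets, starts (-1) come before ends (+1).
    events.sort()

    best = count = 0
    best_lo = best_hi = None
    for i, (offset, kind) in enumerate(events):
        count -= kind             # +1 on a start, -1 on an end
        if count > best:
            best = count
            best_lo = offset
            best_hi = events[i + 1][0]   # overlap lasts until the next event
    return best_lo, best_hi, best
```

A master whose interval does not intersect the majority range contributes nothing to the result, which is how an outlier clock gets rejected.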
External Consistency (1/2)
Jerry unfriends Tom so that he can write a controversial comment.
If the system orders the comment before the unfriend, Tom can still see it and Jerry will be in trouble!
External Consistency (2/2)
External consistency, formally: if the commit of T1 precedes the initiation of a new transaction T2 in wall-clock (physical) time, then the commit of T1 must also precede the commit of T2 in the serial order.
Snapshot Reads
Read in the past without locking.
The client can specify a timestamp for the read, or an upper bound on the timestamp.
Each replica tracks a value called safe time, t_safe, which is the maximum timestamp at which the replica is up-to-date.
A replica can satisfy a read at any t ≤ t_safe.
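The safe-time rule can be sketched as a simple filter over replicas. The `Replica` record and the function name are hypothetical; how a replica actually computes its safe time is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    t_safe: float   # highest timestamp at which this replica is up-to-date


def replicas_able_to_serve(replicas, read_timestamp):
    """Names of replicas that can answer a snapshot read at read_timestamp."""
    return [r.name for r in replicas if read_timestamp <= r.t_safe]
```

For example, with safe times of 105, 98, and 100, a read at timestamp 100 can be served by the first and third replica, while the second must catch up first.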
Read-only Transactions
Assign a timestamp s_read and do a snapshot read at s_read.
s_read = TT.now().latest()
This guarantees external consistency.
Read-Write Transactions (1/3)
Leader must only assign timestamps within the interval of its leader
lease.
Timestamps must be assigned in monotonically increasing order.
If transaction T1 commits before T2 starts, T2’s commit timestamp
must be greater than T1’s commit timestamp.
Read-Write Transactions (2/3)
Clients buffer writes.
The client chooses a coordinator group, which initiates two-phase commit.
Each non-coordinator participant leader chooses a prepare timestamp, logs a prepare record through Paxos, and notifies the coordinator.
Read-Write Transactions (3/3)
The coordinator assigns a commit timestamp s_i that is no less than all prepare timestamps and no less than TT.now().latest().
The coordinator ensures that clients cannot see any data committed by T_i until TT.after(s_i) is true. This is done by commit wait: the coordinator waits until absolute time has passed s_i before committing.
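The timestamp choice and the commit wait can be sketched with a simulated clock. Everything here is illustrative: `SimulatedTrueTime`, the `EPSILON` value of 4 ms, and the polling interval are assumptions, and simulated time is advanced manually instead of sleeping.

```python
EPSILON = 0.004   # assumed clock uncertainty in seconds (hypothetical value)


class SimulatedTrueTime:
    """Stand-in for TrueTime driven by a manually advanced clock."""

    def __init__(self, start):
        self.t = start

    def earliest(self):
        return self.t - EPSILON

    def latest(self):
        return self.t + EPSILON

    def after(self, s):
        return self.earliest() > s

    def sleep(self, seconds):
        self.t += seconds   # advance simulated time instead of really waiting


def choose_commit_timestamp(prepare_timestamps, tt):
    """Commit timestamp: no less than any prepare timestamp or TT.now().latest."""
    return max([tt.latest(), *prepare_timestamps])


def commit_wait(s, tt, poll_interval=0.001):
    """Block until s is definitely in the past; return the number of polls."""
    polls = 0
    while not tt.after(s):
        tt.sleep(poll_interval)
        polls += 1
    return polls
```

When the commit timestamp equals TT.now().latest, the wait lasts about twice the clock uncertainty, so a smaller uncertainty directly shortens commit latency.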
Summary
Megastore
Entity Groups (EG)
Within an EG: ACID transactions using Paxos
Across EGs: using queue and two-phase commit
Summary
Spanner
Replica consistency: using the Paxos protocol
Concurrency control: using two-phase locking
Transaction coordination: using two-phase commit
Timestamps for transactions and data items
Questions?

MegaStore and Spanner

  • 1. Megastore and Spanner Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 1 / 54
  • 3. Motivation
    Storage requirements of today’s interactive online applications:
      • Scalability (a billion internet users)
      • Rapid development
      • Responsiveness (low latency)
      • Durability and consistency (never lose data)
      • Fault tolerance (no unplanned/planned downtime)
      • Easy operations (minimize confusion, support is expensive)
    These requirements are in conflict.
  • 5. Motivation
    Relational DBMS, e.g., MySQL, MS SQL, Oracle RDB
      • Rich set of features
      • Difficult to scale to a massive volume of reads and writes
    NoSQL, e.g., BigTable, Dynamo, Cassandra
      • Highly scalable
      • Limited API
  • 6. NewSQL Databases
    NoSQL scalability + RDBMS ACID
    E.g., Megastore and Spanner
  • 7. Megastore
  • 8. Megastore
    Started in 2006 for app development at Google.
    Google’s wide-area replicated data store.
    Adds (limited) transactions to wide-area replicated data stores.
    Used by GMail, Google+, Android Market, Google App Engine, ...
  • 10. Megastore
    Megastore is layered on:
      • GFS (distributed file system)
      • Bigtable (scalable NoSQL data store per datacenter)
    Bigtable is cluster-level structured storage, while Megastore is a geo-scale structured database.
    [http://cse708.blogspot.jp/2011/03/megastore-providing-scalable-highly.html]
  • 11. Entity Group (1/2)
    The data is partitioned into a collection of entity groups (EG).
  • 12. Entity Group (2/2)
  • 13. Entity Group Replication (1/2)
    Each entity group is independently and synchronously replicated over a wide area.
    Megastore’s replication system provides a single consistent view of the data stored in its underlying replicas.
  • 17. Entity Group Replication (2/2)
    Synchronous replication: a low-latency implementation of Paxos.
    Basic Paxos is not used: it is a poor match for high-latency links.
      • Writes require at least two inter-replica round-trips to achieve consensus: a prepare round and an accept round
      • Reads require one inter-replica round-trip: a prepare round
    Megastore uses a modified version of Paxos: fast reads, fast writes.
  • 21. Entity Group Transaction (1/3)
    Within each EG: full ACID semantics.
    Transaction management uses write-ahead logging (WAL).
    Bigtable feature: the ability to store multiple values for the same row/column with different timestamps.
    This enables multiversion concurrency control (MVCC) within EGs.
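To make the multi-version idea concrete, here is a minimal sketch of a cell that keeps several timestamped values and serves a read at a chosen timestamp. The class and method names (`VersionedCell`, `write`, `read_at`) are hypothetical and are not part of Bigtable's or Megastore's API.

```python
import bisect


class VersionedCell:
    """One row/column cell holding several timestamped versions (illustrative only)."""

    def __init__(self):
        self._timestamps = []  # kept sorted in ascending order
        self._values = []      # values aligned with _timestamps

    def write(self, timestamp, value):
        # Insert the new version while keeping the timestamps sorted.
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def read_at(self, timestamp):
        # Return the newest version whose timestamp is <= the requested one.
        i = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[i - 1] if i else None


cell = VersionedCell()
cell.write(10, "draft")
cell.write(20, "final")
```

Because readers pick a timestamp and writers only add new versions, a reader at timestamp 15 is unaffected by the later write at timestamp 20.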
  • 25. Entity Group Transaction (2/3)
    Read consistency
      • Current: waits until all previously committed writes have been applied, then reads the last committed value.
      • Snapshot: doesn’t wait; reads at the last fully applied transaction, even if some committed writes are not yet applied.
      • Inconsistent: ignores the state of the log and reads the latest values directly (data may be stale).
  • 29. Entity Group Transaction (3/3)
    Write consistency
      • Determines the next available log position.
      • Assigns the mutations in the WAL a timestamp higher than any previous one.
      • Employs Paxos to settle contention for the log position.
      • Based on optimistic concurrency: in case of multiple writers to the same log position, only one will win, and the rest will notice the victorious write, abort, and retry their operations.
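The optimistic write path above can be sketched as follows. This is a single-process illustration under stated assumptions: `ReplicatedLog.try_append` is a hypothetical stand-in for a Paxos round on one log position, and no real replication takes place.

```python
class ReplicatedLog:
    """Stand-in for an entity group's write-ahead log (consensus is simulated locally)."""

    def __init__(self):
        self.entries = []

    def next_position(self):
        return len(self.entries)

    def try_append(self, position, mutation):
        # Stands in for a Paxos round: only the first proposal for a position wins.
        if position != len(self.entries):
            return False
        self.entries.append(mutation)
        return True


def commit(log, make_mutation, max_retries=5):
    """Optimistic write: find the next position, propose, and retry on conflict."""
    for _ in range(max_retries):
        position = log.next_position()
        mutation = make_mutation(position)
        if log.try_append(position, mutation):
            return position
    raise RuntimeError("too much contention on the log")


log = ReplicatedLog()
# Two writers both observed position 0; only the first proposal is accepted.
first_won = log.try_append(0, "a")
second_won = log.try_append(0, "b")
# The loser retries through commit() and lands on the next position.
retry_position = commit(log, lambda p: f"b@{p}")
```

The losing writer never blocks; it simply notices that its position was taken and retries, which is why this scheme works best when contention within an entity group is low.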
  • 30. Across Entity Group Transaction (1/3)
    Across entity groups: limited consistency guarantees.
    Two methods:
      • Asynchronous messaging (queues)
      • Two-phase commit (2PC)
  • 31. Across Entity Group Transaction (2/3)
    Queues
    Provide transactional messaging between EGs.
    Each message has a single sending and a single receiving entity group:
      • Synchronous delivery: the sender and receiver are the same entity group.
      • Asynchronous delivery: the sender and receiver are different entity groups.
    Useful for performing operations that affect many EGs.
  • 32. Across Entity Group Transaction (3/3)
    Two-phase commit
    Atomicity is guaranteed, at the cost of high latency.
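A minimal sketch of two-phase commit across entity groups, assuming an in-memory coordinator and hypothetical `Participant` objects. It shows only the vote-then-decide structure and omits logging, timeouts, and failure recovery.

```python
class Participant:
    """Hypothetical entity-group participant that votes, then applies the decision."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes (and hold the changes) or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):
        # Phase 2: apply the coordinator's single decision.
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare and collect all votes.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the same decision to everyone.
    for p in participants:
        p.finish(decision)
    return decision


ok_groups = [Participant("eg1"), Participant("eg2")]
ok_result = two_phase_commit(ok_groups)

bad_groups = [Participant("eg1"), Participant("eg2", can_commit=False)]
bad_result = two_phase_commit(bad_groups)
```

The two sequential rounds of messages are the source of the high latency noted above, especially when participants sit in different datacenters.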
  • 33. Spanner
  • 34. Limitations of Existing Systems
    BigTable
      • Scalability
      • High throughput
      • High performance
      • Transactional scope limited to a single row
      • Eventually consistent replication support across datacenters
  • 35. Limitations of Existing Systems
    Megastore
      • Replicated ACID transactions
      • Schematized semi-relational tables
      • Synchronous replication support across datacenters
      • Performance (poor write throughput)
      • Lack of a query language
  • 36. Spanner
    Bridging the gap between Megastore and Bigtable.
    SQL transactions + high throughput
  • 39. Spanner
    Global-scale database with strict transactional guarantees.
    Global scale
      • Across datacenters
      • Scales up to millions of nodes, hundreds of datacenters, trillions of database rows
    Strict transactional guarantees
      • General transactions (even inter-row)
      • Reliable even during wide-area natural disasters
  • 40. Spanner Implementation
  • 45. Spanner Organization (1/2)
    Universe: a Spanner deployment.
    Zones: analogous to a deployment of Bigtable servers (the unit of physical isolation).
      • One zonemaster: assigns data to spanservers
      • The location proxies: used by clients to locate the spanservers assigned to serve their data
      • Thousands of spanservers: serve data to clients
  • 46. Spanner Organization (2/2)
    The universe master: a console that displays status information about all the zones.
    The placement driver: handles automated movement of data across zones.
  • 47. Spanserver Software Stack (1/4)
    Each spanserver is responsible for 100-1000 instances of a data structure called a tablet (similar to a Bigtable tablet).
    Tablet mapping: (key: string, timestamp: int64) → string
    Data and logs are stored on Colossus (the successor to GFS).
  • 50. Spanserver Software Stack (2/4)
    A single Paxos state machine on top of each tablet: consistent replication.
    Paxos group: all machines involved in an instance of Paxos.
    The Paxos implementation supports long-lived leaders with time-based leader leases.
  • 52. Spanserver Software Stack (3/4)
    Writes must initiate the Paxos protocol at the leader.
    Reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date.
  • 53. Spanserver Software Stack (4/4)
    Transaction manager: supports distributed transactions.
      • Runs at every replica that is a leader.
  • 60. Transactions Involving Only One Paxos Group
    This is the case for most transactions.
    A long-lived Paxos leader.
      • The transaction manager: participant leader
      • The other replicas in the group: participant slaves
    A lock table for concurrency control.
      • Supports multiple concurrent transactions.
      • Maintained by the Paxos leader.
      • Maps ranges of keys to lock states.
      • Two-phase locking.
      • Wound-wait for deadlock avoidance: a younger transaction is aborted if an older transaction needs a resource held by the younger one.
    A transaction that involves only one Paxos group can bypass the transaction manager.
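The wound-wait rule can be sketched as a single decision function. The `Txn` type and `request_lock` helper are hypothetical names for illustration; transaction age is modelled by a start timestamp, where a smaller value means an older transaction.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    name: str
    start_ts: int          # smaller timestamp = older transaction
    aborted: bool = False


def request_lock(requester, holder):
    """Wound-wait: return 'granted', 'wound', or 'wait' for a lock request."""
    if holder is None or holder.aborted:
        return "granted"
    if requester.start_ts < holder.start_ts:
        # An older requester wounds (aborts) the younger holder.
        holder.aborted = True
        return "wound"
    # A younger requester waits for the older holder to finish.
    return "wait"


old = Txn("T1", start_ts=1)
young = Txn("T2", start_ts=2)
older_requests = request_lock(old, young)            # old needs a lock young holds
younger_requests = request_lock(Txn("T3", 3), old)   # a younger txn needs old's lock
```

Since only younger transactions ever wait on older ones, a cycle of waiting transactions cannot form, which is what rules out deadlock.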
  • 63. Transactions Involving Multiple Paxos Groups
    One of the participant groups is chosen as the coordinator.
      • The participant leader of that group is referred to as the coordinator leader.
      • The slaves of that group are referred to as coordinator slaves.
    The groups’ leaders coordinate to perform two-phase commit.
    The state of each transaction manager is stored in the underlying Paxos group (and is therefore replicated).
  • 64. Data Model and Directories
  • 68. Data Model
    An application creates one or more databases in a universe.
    Each database can contain an unlimited number of schematized tables.
    Table
      • Rows and columns
      • Must have an ordered set of one or more primary-key columns
      • The primary key uniquely identifies each row
    Hierarchies of tables
      • Tables must be partitioned by the client into one or more hierarchies of tables
      • The table at the top of a hierarchy: directory table
  • 72. Directory (1/2)
    A set of contiguous keys that share a common prefix.
    All data in a directory has the same replication configuration.
    The smallest unit whose geographic replication properties can be specified by an application.
    A Paxos group may contain multiple directories.
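As a rough illustration of "contiguous keys that share a common prefix", the sketch below groups sorted composite keys by their first component. The user/album keys and the `group_into_directories` helper are assumptions made for this example and are not taken from the slides.

```python
from collections import defaultdict


def group_into_directories(keys, prefix_len=1):
    """Group key tuples by a shared prefix; each group is one hypothetical directory."""
    directories = defaultdict(list)
    # Sorting makes the keys of each directory contiguous in key order.
    for key in sorted(keys):
        directories[key[:prefix_len]].append(key)
    return dict(directories)


keys = [("user1", "album2"), ("user2", "album1"), ("user1", "album1"), ("user1",)]
dirs = group_into_directories(keys)
```

Here every key beginning with `user1` lands in one directory, so that user's row and all of its album rows would share one replication configuration and could be moved together.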
  • 73. Directory (2/2)
    Spanner might move a directory:
      • To shed load from a Paxos group.
      • To put directories that are frequently accessed together into the same group.
      • To move a directory into a group that is closer to its accessors.
  • 74. Example
  • 78. TrueTime and Consistency
  • 79. Key Innovation
    Spanner knows what time it is.
  • 82. Time Synchronization (1/2)
    Is synchronizing time at the global scale possible?
    Synchronizing time within and between datacenters is extremely hard and uncertain.
    Serialization of requests is impossible at global scale.
  • 83. Time Synchronization (2/2)
    Idea: accept the uncertainty, keep it small, and quantify it (using GPS and atomic clocks).
  • 84. TrueTime API
    TT.now() returns a TTinterval that is guaranteed to contain the absolute time at which TT.now() was invoked.
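A toy model of the interval-based clock, assuming a fixed uncertainty bound `epsilon` around a local clock. The real system derives its bound from GPS and atomic-clock time masters; this `TrueTime` class is only a hypothetical sketch of the interface shape (`now`, `after`, `before`).

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: a local clock plus a fixed, assumed uncertainty bound."""

    def __init__(self, epsilon=0.005, clock=time.time):
        self.epsilon = epsilon
        self.clock = clock

    def now(self):
        # The true time is assumed to lie somewhere inside this interval.
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True only if t has definitely passed.
        return self.now().earliest > t

    def before(self, t):
        # True only if t has definitely not arrived yet.
        return self.now().latest < t


# A fixed fake clock makes the behaviour easy to inspect.
tt = TrueTime(epsilon=1.0, clock=lambda: 100.0)
interval = tt.now()
```

With a clock reading of 100 and an uncertainty of 1, any timestamp between 99 and 101 is neither definitely past nor definitely future, so both `after` and `before` return false for it.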
  • 85. How Is TrueTime Implemented? (1/2)
  • 86. How Is TrueTime Implemented? (2/2)
    A daemon on each machine polls a variety of time masters:
      • Masters chosen from nearby datacenters
      • Masters from farther datacenters
      • Armageddon masters (equipped with atomic clocks)
    From these polls, the daemon reaches a consensus about the correct timestamp.
  • 88. External Consistency (1/2)
    Jerry unfriends Tom in order to write a controversial comment.
    If serial order is as above, Jerry will be in trouble!
  • 89. External Consistency (2/2)
    External consistency, formally: if the commit of T1 preceded the initiation of a new transaction T2 in wall-clock (physical) time, then the commit of T1 should also precede the commit of T2 in the serial order.
  • 93. Snapshot Reads
    Read in the past without locking.
    The client can specify a timestamp for the read, or an upper bound on the timestamp.
    Each replica tracks a value called safe time, tsafe, which is the maximum timestamp at which the replica is up-to-date.
    A replica can satisfy a read at any t ≤ tsafe.
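The safe-time check can be sketched as below. As a simplifying assumption, `t_safe` here is just the largest timestamp the replica has applied; the `Replica` class and its methods are hypothetical and ignore pending prepared transactions.

```python
class Replica:
    """Hypothetical replica storing (key, timestamp) -> value plus a safe time."""

    def __init__(self):
        self.versions = {}   # key -> list of (timestamp, value) in timestamp order
        self.t_safe = 0

    def apply(self, key, timestamp, value):
        # Assumes writes arrive in increasing timestamp order.
        self.versions.setdefault(key, []).append((timestamp, value))
        self.t_safe = max(self.t_safe, timestamp)

    def snapshot_read(self, key, t):
        # A read at t is only safe if the replica is up-to-date through t.
        if t > self.t_safe:
            raise RuntimeError("replica is not caught up to the requested timestamp")
        result = None
        for ts, value in self.versions.get(key, []):
            if ts <= t:
                result = value
        return result


replica = Replica()
replica.apply("k", 10, "a")
replica.apply("k", 20, "b")
```

A read at timestamp 15 returns the version written at 10 without taking any lock, while a read at timestamp 25 is refused because this replica cannot yet know whether a write with a timestamp of 25 or less is still on its way.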
  • 94. Read-only Transactions
    Assign a timestamp sread and do a snapshot read at sread.
    sread = TT.now().latest()
    This guarantees external consistency.
  • 97. Read-Write Transactions (1/3)
    A leader must only assign timestamps within the interval of its leader lease.
    Timestamps must be assigned in monotonically increasing order.
    If transaction T1 commits before T2 starts, T2’s commit timestamp must be greater than T1’s commit timestamp.
  • 98. Read-Write Transactions (2/3)
    Clients buffer writes.
    The client chooses a coordinator group, which initiates two-phase commit.
    Each non-coordinator participant leader chooses a prepare timestamp, logs a prepare record through Paxos, and notifies the coordinator.
  • 99. Read-Write Transactions (3/3)
    The coordinator assigns a commit timestamp si no less than all prepare timestamps and TT.now().latest().
    The coordinator ensures that clients cannot see any data committed by Ti until TT.after(si) is true.
    This is done by commit wait: the coordinator waits until absolute time has passed si before committing.
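Commit wait can be sketched with a manually advanced fake clock so that the waiting is visible and deterministic. `FakeTrueTime` and `commit_with_wait` are hypothetical names; the timestamp choice follows the rule stated above (no less than every prepare timestamp and the latest bound of the current time).

```python
class FakeTrueTime:
    """Toy TrueTime driven by a manually advanced clock (an assumption for clarity)."""

    def __init__(self, epsilon, start=0.0):
        self.epsilon = epsilon
        self.t = start

    def now(self):
        # Returns (earliest, latest) bounds on the true time.
        return (self.t - self.epsilon, self.t + self.epsilon)

    def after(self, ts):
        # True only once ts has definitely passed.
        return self.now()[0] > ts

    def advance(self, dt):
        self.t += dt


def commit_with_wait(tt, prepare_timestamps, tick=1.0):
    # Choose a commit timestamp >= every prepare timestamp and >= TT.now().latest.
    s = max(list(prepare_timestamps) + [tt.now()[1]])
    waited = 0.0
    # Commit wait: do not release the commit until s is definitely in the past.
    while not tt.after(s):
        tt.advance(tick)   # stands in for sleeping on a real clock
        waited += tick
    return s, waited


tt = FakeTrueTime(epsilon=2.0, start=100.0)
commit_ts, waited = commit_with_wait(tt, [95.0, 101.0])
```

The wait is on the order of twice the clock uncertainty, which is why keeping the uncertainty small matters directly for write latency.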
  • 100. Summary
  • 101. Summary
    Megastore
    Entity groups (EG)
    Within an EG: ACID, using Paxos
    Across EGs: using queues and two-phase commit
  • 102. Summary
    Spanner
    Replica consistency: using the Paxos protocol
    Concurrency control: using two-phase locking
    Transaction coordination: using two-phase commit
    Timestamps for transactions and data items
  • 103. Questions?