Architecture of a Geo-Distributed SQL Database
CockroachDB
Peter Mattis (@petermattis), Co-founder & CTO
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/cockroachdb-distributed-sql/
Presented at QCon New York
www.qconnewyork.com
CockroachDB: Geo-distributed SQL Database
Make Data Easy
• Distributed
○ Horizontally scalable to grow with your application
• Geo-distributed
○ Handle datacenter failures
○ Place data near usage
○ Push computation near data
• SQL
○ Lingua franca for rich data storage
○ Schemas, indexes, and transactions make app development easier
AGENDA
● Introduction
● Ranges and Replicas
● Transactions
● SQL Data in a KV World
● SQL Execution
● SQL Optimization
Distributed, Replicated, Transactional KV*
• Keys and values are strings
○ Lexicographically ordered by key
• Multi-version concurrency control (MVCC)
○ Values are never updated “in place”, newer versions shadow older versions
○ Tombstones are used to delete values
○ Provides snapshot to each transaction
• Monolithic key-space
* Not exposed for external usage
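The MVCC bullets above can be illustrated with a small in-memory sketch. This is not CockroachDB's storage code: the class name `MVCCStore`, the integer timestamps, and the `TOMBSTONE` marker are all assumptions made for illustration, and writes are assumed to arrive in increasing timestamp order.

```python
# Minimal MVCC sketch: every write appends a new version; nothing is updated in place.
TOMBSTONE = object()  # hypothetical marker for a deleted value


class MVCCStore:
    def __init__(self):
        # key -> list of (timestamp, value), appended in increasing timestamp order
        self.versions = {}

    def put(self, key, ts, value):
        self.versions.setdefault(key, []).append((ts, value))

    def delete(self, key, ts):
        # A delete is just another version: a tombstone that shadows older values.
        self.put(key, ts, TOMBSTONE)

    def get(self, key, read_ts):
        # Snapshot read: return the newest version written at or before read_ts.
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= read_ts:
                return None if value is TOMBSTONE else value
        return None


store = MVCCStore()
store.put("carl", 1, "v1")
store.put("carl", 5, "v2")
store.delete("carl", 9)
```

A reader at timestamp 3 still sees `v1` even after the later write and delete, which is the snapshot behaviour the slide describes.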
Monolithic Key Space
[Diagram: the DOGS table as one key space holding the keys carl, dagne, figment, jack, lady, lula, muddy, peetey, pinetop, sooshi, stella, zee]
Monolithic logical key space
● Ordered lexicographically by key
Ranges
Key space divided into contiguous ~64MB ranges
Ranges are small enough to be moved/split quickly
Ranges are large enough to amortize indexing overhead
[Diagram: the DOGS key space divided into three ranges: carl, dagne, figment, jack / lady, lula, muddy, peetey / pinetop, sooshi, stella, zee]
Range Indexing
Index structure used to locate ranges (very much like a B-tree)
[Diagram: a range index with entries 1: carl - jack, 2: lady - peetey, 3: pinetop - zee, each pointing at the range that holds those keys]
Ordered Range Scans
Ordered keys enable efficient range scans
dogs >= “muddy” AND dogs <= “stella”
[Diagram: the scan uses the range index (1: carl - jack, 2: lady - peetey, 3: pinetop - zee) to visit only ranges 2 and 3, returning the keys from muddy through stella in order]
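A rough sketch of how an ordered range index supports both point lookups and range scans. The flat `range_starts` list stands in for the B-tree-like index on the slides; the function names and the in-memory lists are assumptions for illustration only.

```python
import bisect

# Hypothetical range index: each range is identified by its inclusive start key.
range_starts = ["carl", "lady", "pinetop"]
ranges = [
    ["carl", "dagne", "figment", "jack"],
    ["lady", "lula", "muddy", "peetey"],
    ["pinetop", "sooshi", "stella", "zee"],
]


def find_range(key):
    # Binary search for the last range whose start key is <= key.
    return max(bisect.bisect_right(range_starts, key) - 1, 0)


def scan(lo, hi):
    # Visit only the ranges that can hold keys in [lo, hi], in key order.
    out = []
    for i in range(find_range(lo), find_range(hi) + 1):
        out.extend(k for k in ranges[i] if lo <= k <= hi)
    return out


result = scan("muddy", "stella")
```

Because ranges are contiguous and ordered, the scan never touches the first range (carl - jack).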
Transactional Updates
Transactions used to insert records into ranges
[Diagram: INSERT[sunny] is routed through the range index (1: carl - jack, 2: lady - peetey, 3: pinetop - zee) to range 3. Space available in range? YES]
[Diagram: INSERT[sunny] succeeds; range 3 now holds pinetop, sooshi, stella, sunny, zee in key order]
Range Splits
BUT… what happens when a range is full?
[Diagram: INSERT[rudy] targets range 3, which holds pinetop, sooshi, stella, sunny, zee. Space available in range? NO]
Ranges are automatically split, a new range index is created & order maintained
[Diagram: split range and insert. Range 3 becomes pinetop, rudy, sooshi and a new range 4 holds stella, sunny, zee. The index now lists 1: carl - jack, 2: lady - peetey, 3: pinetop - sooshi, 4: stella - zee]
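The insert-then-split behaviour can be sketched as follows. `MAX_KEYS` is a made-up stand-in for the ~64MB size threshold, and splitting at the midpoint is an assumption; the slides do not say how the split point is chosen.

```python
import bisect

MAX_KEYS = 5  # hypothetical stand-in for the ~64MB size threshold


def insert(ranges, key):
    # ranges: list of sorted key lists, themselves ordered by first key.
    starts = [r[0] for r in ranges]
    i = max(bisect.bisect_right(starts, key) - 1, 0)
    bisect.insort(ranges[i], key)
    if len(ranges[i]) > MAX_KEYS:
        # Range is full: split at the midpoint; key order is preserved.
        mid = len(ranges[i]) // 2
        ranges[i:i + 1] = [ranges[i][:mid], ranges[i][mid:]]
    return ranges


ranges = [
    ["carl", "dagne", "figment", "jack"],
    ["lady", "lula", "muddy", "peetey"],
    ["pinetop", "sooshi", "stella", "zee"],
]
insert(ranges, "sunny")  # fits in the third range
insert(ranges, "rudy")   # third range is now over the limit, so it splits
```

With these toy limits the result matches the slide: range 3 ends at sooshi and a new range 4 starts at stella.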
Raft and Replication
Ranges (~64MB) are the unit of replication
Each range is a Raft group
(Raft is a consensus replication protocol)
Default to 3 replicas, though this is configurable
• Important system ranges default to 5 replicas
• Note: 2 replicas don’t make sense in consensus replication: a quorum of 2 is 2, so losing either replica blocks progress
[Diagram: the replicas of one range forming a Raft group]
Raft and Replication
Raft provides “atomic replication” of commands
Commands are proposed by the leaseholder replica and distributed to the follower replicas, but only accepted when a quorum of replicas have acknowledged receipt
* Leaseholder == Raft leader
[Diagram: a Raft group spread over node1 to node4, with one replica marked LEASEHOLDER]
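The quorum rule can be written down in a few lines. This is a sketch of the majority arithmetic only, not of Raft itself; the function names are made up for illustration.

```python
def quorum(replicas):
    # A majority of replicas must acknowledge a command.
    return replicas // 2 + 1


def tolerated_failures(replicas):
    # Replicas that can be lost while a quorum is still reachable.
    return replicas - quorum(replicas)


def is_accepted(acks, replicas):
    # acks counts the replicas (proposer included) that acknowledged the command.
    return acks >= quorum(replicas)
```

With 3 replicas one failure is tolerated and with 5 replicas two are, while 2 replicas tolerate none, which is why the previous slide says 2 replicas do not make sense.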
Range Leases
Reads with consensus
Reads must talk to a quorum of replicas
[Diagram: READ[carl] against the three DOGS ranges replicated across node1 to node4]
Reads without consensus
One replica is chosen as the leaseholder
● Coordinates writes (proposal, key locking)
● Performs reads
[Diagram: READ[carl] is served by the leaseholder replica]
Replica Placement
Each Range is a Raft state machine
A Range has 1 or more Replicas
Placement signals:
● Space
● Diversity
● Load
● Latency
[Diagram: replicas of the three DOGS ranges placed across node1 to node4]
Replica Placement: Diversity
Diversity optimizes placement of replicas across “failure domains”
● Disk
● Single machine
● Rack
● Datacenter
● Region
[Diagram: replicas of the three DOGS ranges spread across node1, node2, node4, node5, and node6]
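One way to picture a diversity heuristic is to score candidate nodes by how many failure domains they share with the replicas that already exist. The locality tuples and the scoring rule below are illustrative assumptions, not CockroachDB's actual allocator.

```python
def shared_prefix(a, b):
    # Number of leading locality tiers (region, datacenter, rack) two nodes share.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_most_diverse(existing, candidates):
    # Choose the candidate sharing the fewest failure domains with current replicas.
    return min(candidates, key=lambda c: sum(shared_prefix(c, e) for e in existing))


existing = [("us-east", "dc1", "rack1"), ("us-east", "dc2", "rack7")]
candidates = [("us-east", "dc1", "rack2"), ("us-west", "dc3", "rack4")]
choice = pick_most_diverse(existing, candidates)
```

Here the us-west candidate wins because it shares no region, datacenter, or rack with either existing replica.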
Replica Placement: Load
Load
Balances placement using heuristics that consider real-time usage metrics of the data itself
[Diagram: the three DOGS ranges on node1 and node3, with one range highlighted]
This range is high load as it is accessed more than others
While we show this for ranges within a single table, this is also applicable across all ranges across ALL tables, which is the more typical situation
Replica Placement: Latency & Geo-partitioning
[Diagram: the DOGS keys prefixed with a region and grouped by that prefix]
USE/dagne, USE/figment, USE/muddy, USE/stella
USW/jack, USW/lady, USW/peetey, USW/pinetop
EU/carl, EU/lula, EU/sooshi, EU/zee
We apply a constraint that indicates regional placement so we can ensure low latency access or jurisdictional control of data
Rebalancing Replicas
[Diagram: node1 to node4 plus a NEW node5]
Scale: Add a node
If we add a node to the cluster, CockroachDB automatically redistributes replicas to even load across the cluster
Uses the replica placement heuristics from previous slides to decide which node to add to and which to remove from
Movement is decomposed into adding a replica followed by removing a replica
Rebalancing Replicas
[Diagram: node1 to node5, with one node lost]
Loss of a node: Permanent Failure
If a node goes down, the Raft group realizes a replica is missing and replaces it with a new replica on an active node
Uses the replica placement heuristics from previous slides
The failed replica is removed from the Raft group and a new replica created. The leaseholder sends a snapshot of the Range’s state to bring the new replica up to date.
Rebalancing Replicas
[Diagram: node1 to node5, with one node briefly unavailable]
Loss of a node: Temporary Failure
If a node goes down for a moment, the leaseholder can “catch up” any replica that is behind
The leaseholder can send commands to be replayed OR it can send a snapshot of the current Range data. We apply heuristics to decide which is most efficient for a given failure.
Transactions
Atomicity, Consistency, Isolation, Durability
Serializable Isolation
• As if the transactions are run in a serial order
• Gold standard isolation level
• Make Data Easy - weaker isolation levels are too great a burden
Transactions can span arbitrary ranges
Conversational
• The full set of operations is not required up front
Transactions
Raft provides atomic writes to individual ranges
Bootstrap transaction atomicity using Raft atomic writes
Transaction record atomically flipped from PENDING to COMMIT
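A toy model of how one atomic write to a transaction record can make writes on several ranges visible together. The word "intent" for a provisional write, the dictionaries, and the function names are assumptions for this sketch; the slides only state that the record flips from PENDING to COMMIT.

```python
# Hypothetical in-memory model of a transaction record plus provisional writes.
txn_records = {}  # txn id -> "PENDING" or "COMMIT"
intents = {}      # key -> (txn id, value)


def begin(txn):
    txn_records[txn] = "PENDING"


def write(txn, key, value):
    # Provisional write: visible to readers only once its transaction commits.
    intents[key] = (txn, value)


def commit(txn):
    # A single write to one record makes every write of the txn visible at once.
    txn_records[txn] = "COMMIT"


def read(key):
    if key not in intents:
        return None
    txn, value = intents[key]
    return value if txn_records[txn] == "COMMIT" else None


begin("TXN1")
write("TXN1", "sunny", "row-sunny")
write("TXN1", "ozzie", "row-ozzie")
before = (read("sunny"), read("ozzie"))
commit("TXN1")
after = (read("sunny"), read("ozzie"))
```

Readers see either neither row or both, never one without the other, because visibility hinges on a single record.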
Distributed Transactions
INSERT INTO dogs VALUES ('sunny'), ('ozzie')
[Diagram sequence: node1 to node4 hold the replicas of three ranges (carl - jack, lady - peetey, pinetop - zee). The node that receives the statement acts as the GATEWAY.]
● BEGIN TXN1: a transaction record TXN1: PENDING is created (shown under “transactions”)
● WRITE[sunny]: sunny is written to the pinetop - zee range and replicated to its replicas, then ACKed to the gateway
● WRITE[ozzie]: ozzie is written to the lady - peetey range and replicated to its replicas, then ACKed to the gateway
● COMMIT: the transaction record flips to TXN1: COMMIT
● ACK is returned to the client
Transactions: Pipelining
[Diagram: two timelines (t), Serial and Pipelined, for BEGIN, WRITE[sunny], WRITE[ozzie], COMMIT]
Serial
● BEGIN, WRITE[sunny]: txn:sunny (pending), then sunny
● WRITE[ozzie]: ozzie
● COMMIT: txn:sunny (commit)[keys: sunny, ozzie]
Pipelined
● WRITE[sunny], WRITE[ozzie]: sunny and ozzie are written back to back
● COMMIT: txn:sunny (staged)[keys: sunny, ozzie]
Committed once all operations complete
We replaced the centralized commit marker with a distributed one
* “Proved” with TLA+
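The distributed commit marker can be read as a predicate over the transaction record and the writes it lists. The record layout below mirrors the slide's `txn:sunny (staged)[keys: sunny, ozzie]` notation, but the dictionary shape and the function name are assumptions for illustration.

```python
def is_committed(record, written_keys):
    # record: {"status": ..., "keys": [...]}
    # written_keys: keys whose writes have completed.
    if record["status"] == "commit":
        return True
    if record["status"] == "staged":
        # Distributed commit marker: the staged record plus every listed write.
        return all(k in written_keys for k in record["keys"])
    return False


record = {"status": "staged", "keys": ["sunny", "ozzie"]}
partial = is_committed(record, {"sunny"})
complete = is_committed(record, {"sunny", "ozzie"})
```

The commit decision no longer waits on a final write to a single record; it holds as soon as the staged record and all listed writes are in place.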
SQL
Structured Query Language
Declarative, not imperative
• These are the results I want vs perform these operations in this sequence
Relational data model
• Typed: INT, FLOAT, STRING, ...
• Schemas: tables, rows, columns, foreign keys
SQL: Tabular Data in a KV World
SQL data has columns and types?!?
How do we store typed and columnar data in a distributed, replicated,
transactional key-value store?
• The SQL data model needs to be mapped to KV data
• Reminder: keys and values are strings, lexicographically ordered by key
SQL Data Mapping: Inventory Table
CREATE TABLE inventory (
    id INT PRIMARY KEY,
    name STRING,
    price FLOAT
)
ID  Name   Price
1   Bat    1.11
2   Ball   2.22
3   Glove  3.33
Key  Value
/1   "Bat",1.11
/2   "Ball",2.22
/3   "Glove",3.33
SQL Data Mapping: Inventory Table
CREATE TABLE inventory (
    id INT PRIMARY KEY,
    name STRING,
    price FLOAT
)
ID  Name   Price
1   Bat    1.11
2   Ball   2.22
3   Glove  3.33
Key                   Value
/<Table>/<Index>/1    "Bat",1.11
/<Table>/<Index>/2    "Ball",2.22
/<Table>/<Index>/3    "Glove",3.33
SQL Data Mapping: Inventory Table
CREATE TABLE inventory (
    id INT PRIMARY KEY,
    name STRING,
    price FLOAT
)
ID  Name   Price
1   Bat    1.11
2   Ball   2.22
3   Glove  3.33
Key                    Value
/inventory/primary/1   "Bat",1.11
/inventory/primary/2   "Ball",2.22
/inventory/primary/3   "Glove",3.33
SQL Data Mapping: Inventory Table
CREATE TABLE inventory (
    id INT PRIMARY KEY,
    name STRING,
    price FLOAT,
    INDEX name_idx (name)
)
ID  Name   Price
1   Bat    1.11
2   Ball   2.22
3   Glove  3.33
Key                             Value
/inventory/name_idx/"Bat"/1     ∅
/inventory/name_idx/"Ball"/2    ∅
/inventory/name_idx/"Glove"/3   ∅
SQL Data Mapping: Inventory Table
CREATE TABLE inventory (
    id INT PRIMARY KEY,
    name STRING,
    price FLOAT,
    INDEX name_idx (name)
)
ID  Name   Price
1   Bat    1.11
2   Ball   2.22
3   Glove  3.33
4   Bat    4.44
Key                             Value
/inventory/name_idx/"Bat"/1     ∅
/inventory/name_idx/"Ball"/2    ∅
/inventory/name_idx/"Glove"/3   ∅
SQL Data Mapping: Inventory Table
CREATE TABLE inventory (
    id INT PRIMARY KEY,
    name STRING,
    price FLOAT,
    INDEX name_idx (name)
)
ID  Name   Price
1   Bat    1.11
2   Ball   2.22
3   Glove  3.33
4   Bat    4.44
Key                             Value
/inventory/name_idx/"Bat"/1     ∅
/inventory/name_idx/"Ball"/2    ∅
/inventory/name_idx/"Glove"/3   ∅
/inventory/name_idx/"Bat"/4     ∅
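The mapping on these slides can be sketched as two small encoders. The readable string keys follow the slide notation; the helper names are made up, and a real encoding would need to be order-preserving for every column type, which these strings are not (for example, id 10 would sort before id 2).

```python
def primary_kv(table, row):
    # Primary index: the key holds the primary key, the value the other columns.
    return (f"/{table}/primary/{row['id']}", (row["name"], row["price"]))


def index_kv(table, index, column, row):
    # Secondary index: the primary key is appended so duplicate names stay unique.
    return (f'/{table}/{index}/"{row[column]}"/{row["id"]}', None)


rows = [
    {"id": 1, "name": "Bat", "price": 1.11},
    {"id": 2, "name": "Ball", "price": 2.22},
    {"id": 3, "name": "Glove", "price": 3.33},
    {"id": 4, "name": "Bat", "price": 4.44},
]
primary_keys = [primary_kv("inventory", r)[0] for r in rows]
index_keys = sorted(index_kv("inventory", "name_idx", "name", r)[0] for r in rows)
```

Sorting the index keys places both "Bat" entries next to each other, distinguished only by the trailing id, which is why the non-unique index stores the primary key in the key.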
SQL Execution
Relational operators
• Projection (SELECT <columns>)
• Selection (WHERE <filter>)
• Aggregation (GROUP BY <columns>)
• Join (JOIN), union (UNION), intersect (INTERSECT)
• Scan (FROM <table>)
• Sort (ORDER BY)
○ Technically, not a relational operator
SQL Execution
• Relational expressions have input expressions and scalar expressions
○ For example, a “filter” expression has 1 input expression and a scalar expression that
filters the rows from the child
○ The scan expression has zero inputs
• Query plan is a tree of relational expressions
• SQL execution takes a query plan and runs the operations to completion
SQL Execution: Example
SELECT name
FROM inventory
WHERE name >= 'b' AND name < 'c'
SQL Execution: Scan
SELECT name
FROM inventory
WHERE name >= 'b' AND name < 'c'
Plan so far: Scan inventory
SQL Execution: Filter
SELECT name
FROM inventory
WHERE name >= 'b' AND name < 'c'
Plan so far: Scan inventory -> Filter name >= 'b' AND name < 'c'
SQL Execution: Project
SELECT name
FROM inventory
WHERE name >= 'b' AND name < 'c'
Plan: Scan inventory -> Filter name >= 'b' AND name < 'c' -> Project name -> Results
SQL Execution: Index Scans
SELECT name
FROM inventory
WHERE name >= 'b' AND name < 'c'
The filter gets pushed into the scan
Plan: Scan inventory@name ['b' - 'c') -> Project name -> Results
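A query plan as a tree of relational operators can be sketched with Python generators, each pulling rows from its input. The lowercase sample rows are hypothetical (chosen so the predicate matches something), and the operator functions are illustrative rather than CockroachDB's execution engine.

```python
def scan(table):
    # Leaf operator: zero inputs, yields every row of the table.
    yield from table


def filter_rows(rows, predicate):
    # One input plus a scalar expression evaluated per row.
    return (r for r in rows if predicate(r))


def project(rows, columns):
    # Keep only the requested columns.
    return ({c: r[c] for c in columns} for r in rows)


inventory = [  # hypothetical rows
    {"id": 1, "name": "bat", "price": 1.11},
    {"id": 2, "name": "ball", "price": 2.22},
    {"id": 3, "name": "glove", "price": 3.33},
]

plan = project(
    filter_rows(scan(inventory), lambda r: "b" <= r["name"] < "c"),
    ["name"],
)
results = list(plan)
```

Running the plan to completion is just draining the root operator; an index scan would replace the scan and filter pair with one operator that yields only the matching key span.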
SQL Execution: Correctness
Correct SQL execution involves lots of bookkeeping
• User-defined tables and indexes
• Queries refer to table and column names
• Execution uses table and column IDs
• NULL handling
SQL Execution: Performance
Performant SQL execution
• Tight, well-written code
• Operator specialization
○ hash group by, stream group by
○ hash join, merge join, lookup join, zig-zag join
• Distributed execution
SQL Execution: Group By
SELECT COUNT(*), country
FROM customers
GROUP BY country
Name     Country
Bob      United States
Hans     Germany
Jacques  France
Marie    France
Susan    United States
SQL Execution: Hash Group By
SELECT COUNT(*), country
FROM customers
GROUP BY country
Name     Country
Bob      United States
Hans     Germany
Jacques  France
Marie    France
Susan    United States
Hash table after each input row:
After Bob:      United States 1
After Hans:     United States 1, Germany 1
After Jacques:  United States 1, Germany 1, France 1
After Marie:    United States 1, Germany 1, France 2
After Susan:    United States 2, Germany 1, France 2
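The hash group-by walk-through corresponds to a loop that keeps one counter per group in a hash table. A minimal sketch, with the function name chosen here for illustration:

```python
customers = [
    ("Bob", "United States"),
    ("Hans", "Germany"),
    ("Jacques", "France"),
    ("Marie", "France"),
    ("Susan", "United States"),
]


def hash_group_by_count(rows):
    # One hash-table entry per distinct country; input order does not matter.
    counts = {}
    for _name, country in rows:
        counts[country] = counts.get(country, 0) + 1
    return counts


counts = hash_group_by_count(customers)
```

The operator must hold every group in memory until the input is exhausted, since any later row could still update any group.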
SQL Execution: Group By Revisited
SELECT COUNT(*), country
FROM customers
GROUP BY country
Name     Country
Bob      United States
Hans     Germany
Jacques  France
Marie    France
Susan    United States
SQL Execution: Sort on Grouping Column(s)
SELECT COUNT(*), country
FROM customers
GROUP BY country
Name     Country
Jacques  France
Marie    France
Hans     Germany
Bob      United States
Susan    United States
SQL Execution: Streaming Group By
SELECT COUNT(*), country
FROM customers
GROUP BY country
Name     Country
Jacques  France
Marie    France
Hans     Germany
Bob      United States
Susan    United States
Output after each input row:
After Jacques:  France 1
After Marie:    France 2
After Hans:     France 2, Germany 1
After Bob:      France 2, Germany 1, United States 1
After Susan:    France 2, Germany 1, United States 2
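When the input is already sorted on the grouping column, only the current group needs to be tracked. A minimal sketch of a streaming group-by, with the function name chosen here for illustration:

```python
sorted_customers = [
    ("Jacques", "France"),
    ("Marie", "France"),
    ("Hans", "Germany"),
    ("Bob", "United States"),
    ("Susan", "United States"),
]


def streaming_group_by_count(sorted_rows):
    # Input must already be sorted on the grouping column.
    current, count = None, 0
    for _name, country in sorted_rows:
        if country != current:
            if current is not None:
                # The group changed, so the previous group is complete.
                yield current, count
            current, count = country, 0
        count += 1
    if current is not None:
        yield current, count


groups = list(streaming_group_by_count(sorted_customers))
```

Each group is emitted as soon as the next one begins, so memory use is constant regardless of the number of groups.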
Distributed SQL Execution
Network latencies and throughput are important considerations in geo-distributed setups
Push fragments of computation as close to the data as possible
Distributed SQL Execution: Streaming Group By
SELECT COUNT(*), country
FROM customers
GROUP BY country
[Diagram: three nodes, built up in stages]
● Each node runs Scan customers over its local data
● Each node runs a local Group-By “country” over its scan output
● A final Group-By “country” combines the three local results
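The two-stage plan can be sketched as per-node partial counts followed by a final merge. The split of rows across three nodes is invented for illustration, and `collections.Counter` stands in for the group-by operators.

```python
from collections import Counter


def local_group_by(rows):
    # Runs next to the data: each node counts only the rows it stores.
    return Counter(country for _name, country in rows)


def final_group_by(partials):
    # Runs once: merges the small partial results instead of shipping raw rows.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


node_rows = [  # hypothetical placement of customer rows on three nodes
    [("Bob", "United States"), ("Hans", "Germany")],
    [("Jacques", "France")],
    [("Marie", "France"), ("Susan", "United States")],
]
result = final_group_by(local_group_by(rows) for rows in node_rows)
```

Only one small counter per node crosses the network, which is the point of pushing computation close to the data.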
SQL Optimization
An optimizer explores many plans that are logically equivalent to a given query and chooses the best one
[Diagram: Parse -> (AST) -> Prep -> (Memo) -> Search -> (Plan) -> Execute]
Steps shown along the pipeline:
• Parse SQL
• Resolve Names
• Check Types
• Fold Constants
• Report Semantic Errors
• Compute properties
• Retrieve and attach stats
• Cost-independent transformations
• Cost-based transformations
SQL Optimization: Cost-Independent Transformations
• Some transformations always make sense
○ Constant folding
○ Filter push-down
○ Decorrelating subqueries*
○ ...
• These transformations are cost-independent
○ If the transformation can be applied to the query, it is applied
• Domain Specific Language for transformations
○ Compiled down to code which efficiently matches query fragments in the memo
○ ~200 transformations currently defined
* Actually cost-based, but we’re treating it as cost-independent right now
SQL Optimization: Filter Push-Down
SELECT * FROM a JOIN b WHERE x > 10
Initial plan:
Scan a@primary, Scan b@primary -> Join -> Filter x > 10 -> Results
After filter push-down:
[Diagram: the same plan with Filter x > 10 moved below the Join, so rows are filtered before they are joined]
SQL Optimization: Cost-Based Transformations
• Some transformations are not universally good
○ Index selection
○ Join reordering
○ ...
• These transformations are cost-based
○ When should the transformation be applied?
○ Need to try both paths and maintain both the original and transformed query
○ State explosion: thousands of possible query plans
■ Memo data structure maintains a forest of query plans
○ Estimate cost of each query, select query with lowest cost
• Costing
○ Based on table statistics and estimating cardinality of inputs to relational expressions
SQL Optimization: Cost-based Index Selection
The index to use for a query is affected by multiple factors
• Filters and join conditions
• Required ordering (ORDER BY)
• Implicit ordering (GROUP BY)
• Covering vs non-covering (i.e. is an index-join required)
• Locality
SQL Optimization: Cost-based Index Selection
SELECT *
FROM a
WHERE x > 10
ORDER BY y
Required orderings affect index selection
Sorting is expensive if there are a lot of rows
Sorting can be the better option if there are few rows
[Diagram: candidate plans for the query, built up one at a time]
● Scan a@primary -> Filter x > 10 -> Sort y
● Scan a@x [10 - ) -> Sort y
● Scan a@y -> Filter x > 10
[Diagram: the plans annotated with row counts of 10 and 100,000, with one plan marked Lowest Cost]
[Diagram: the same plans annotated with row counts of 50,000 and 100,000, with one plan marked Lowest Cost]
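A toy cost model shows why the cheapest plan depends on how many rows match the filter. The unit costs below (1 per row scanned or filtered, n·log2(n) per sort) are invented for this sketch and are not CockroachDB's cost model; it also ignores whether the index on x covers the query.

```python
import math


def sort_cost(rows):
    # Comparison-sort cost; free for zero or one row.
    return rows * math.log2(rows) if rows > 1 else 0.0


def plan_costs(total_rows, matching_rows):
    # Hypothetical unit costs: 1 per row scanned, 1 per row filtered.
    return {
        "a@primary + filter + sort": 2 * total_rows + sort_cost(matching_rows),
        "a@x + sort": matching_rows + sort_cost(matching_rows),
        "a@y + filter": 2 * total_rows,
    }


def cheapest(total_rows, matching_rows):
    costs = plan_costs(total_rows, matching_rows)
    return min(costs, key=costs.get)


few = cheapest(100_000, 10)
many = cheapest(100_000, 50_000)
```

Under these assumed costs, scanning the index on x and sorting wins when few rows match, while reading the index on y in order and filtering wins when many rows match, in line with the two statements about sorting on the slide.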
Locality-Aware SQL Optimization
Network latencies and throughput are important considerations in geo-distributed setups
Duplicate read-mostly data in each locality
Plan queries to use data from the same locality
Locality-Aware SQL Optimization
Three copies of the postal_codes table data
Use replication constraints to pin the copies to different geographic regions (US-East, US-West, EU)
CREATE TABLE postal_codes (
    id INT PRIMARY KEY,
    code STRING,
    INDEX idx_eu (id) STORING (code),
    INDEX idx_usw (id) STORING (code)
)
Locality-Aware SQL Optimization
Optimizer includes locality in cost model
Automatically selects index from same locality: primary, idx_eu, or idx_usw
CREATE TABLE postal_codes (
    id INT PRIMARY KEY,
    code STRING,
    INDEX idx_eu (id) STORING (code),
    INDEX idx_usw (id) STORING (code)
)
SELECT * FROM postal_codes
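Locality-aware index selection can be pictured as a lookup from the gateway's region to the index pinned there. The region names and the assumption that the primary index lives in US-East are inferred from the slide's list of regions and index names; the real optimizer folds locality into its cost model rather than using a table like this.

```python
# Hypothetical mapping from index name to the region its replicas are pinned to.
index_locality = {
    "primary": "us-east",
    "idx_eu": "eu",
    "idx_usw": "us-west",
}


def pick_index(gateway_region):
    # Prefer an index pinned to the gateway's region; fall back to the primary.
    for index, region in index_locality.items():
        if region == gateway_region:
            return index
    return "primary"
```

All three indexes store the same columns, so any of them can answer the query; choosing the local one avoids a cross-region round trip.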
Conclusion
● Distributed, replicated, transactional key-value store
● Monolithic key space
● Raft replication of ranges (~64MB)
● Replica placement signals: space, diversity, load, latency
● Pipelined transaction operations
● Mapping SQL data to KV storage
● Distributed SQL execution
● Distributed SQL optimization
www.cockroachlabs.com
github.com/cockroachdb/cockroach
Thank You
A Simple Transaction
[Diagram: three ranges (carl - jack, lady - peetey, pinetop - zee), each replicated across the cluster’s nodes]
INSERT INTO dogs VALUES ('sunny');
A Simple Transaction: One Range
INSERT INTO dogs VALUES ('sunny');
[Diagram sequence: the same cluster, with one node acting as the GATEWAY]
● The GATEWAY issues BEGIN, WRITE[sunny], COMMIT
● sunny is written to the pinetop - zee range and replicated to its replicas
● ACK is returned
NOTE: a gateway can be ANY CockroachDB instance. It can find the leaseholder for any range and execute a transaction
Ranges
CockroachDB implements order-preserving data distribution
• Automates sharding of key/value data into “ranges”
• Supports efficient range scans
• Requires an indexing structure
Foundational capability that enables efficient distribution
of data across nodes within a CockroachDB cluster
* This approach is also used by Bigtable (tablets), HBase (regions) & Spanner (ranges)
PDF
Vienna Feb 2015: Cassandra: How it works and what it's good for!
PPTX
Talk About Apache Cassandra
PPTX
Talk about apache cassandra, TWJUG 2011
PPTX
HBase in Practice
PDF
Scalable Data Storage Getting You Down? To The Cloud!
Design Patterns for Distributed Non-Relational Databases
Design Patterns For Distributed NO-reational databases
Cassandra
Cassandra - A Decentralized Structured Storage System
Bigtable
Cassandra: Open Source Bigtable + Dynamo
A Guide to the Post Relational Revolution
Schemaless Databases
Google
Cloud storage
Cassandra an overview
Cassandra Talk: Austin JUG
Understanding and building big data Architectures - NoSQL
Cassandra for Sysadmins
Breaking the Relational Headlock: A Survey of NoSQL Datastores
Vienna Feb 2015: Cassandra: How it works and what it's good for!
Talk About Apache Cassandra
Talk about apache cassandra, TWJUG 2011
HBase in Practice
Scalable Data Storage Getting You Down? To The Cloud!
Ad

More from C4Media (20)

PDF
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
PDF
Next Generation Client APIs in Envoy Mobile
PDF
Software Teams and Teamwork Trends Report Q1 2020
PDF
Understand the Trade-offs Using Compilers for Java Applications
PDF
Kafka Needs No Keeper
PDF
High Performing Teams Act Like Owners
PDF
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
PDF
Service Meshes- The Ultimate Guide
PDF
Shifting Left with Cloud Native CI/CD
PDF
CI/CD for Machine Learning
PDF
Fault Tolerance at Speed
PDF
Architectures That Scale Deep - Regaining Control in Deep Systems
PDF
ML in the Browser: Interactive Experiences with Tensorflow.js
PDF
Build Your Own WebAssembly Compiler
PDF
User & Device Identity for Microservices @ Netflix Scale
PDF
Scaling Patterns for Netflix's Edge
PDF
Make Your Electron App Feel at Home Everywhere
PDF
The Talk You've Been Await-ing For
PDF
Future of Data Engineering
PDF
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Next Generation Client APIs in Envoy Mobile
Software Teams and Teamwork Trends Report Q1 2020
Understand the Trade-offs Using Compilers for Java Applications
Kafka Needs No Keeper
High Performing Teams Act Like Owners
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Service Meshes- The Ultimate Guide
Shifting Left with Cloud Native CI/CD
CI/CD for Machine Learning
Fault Tolerance at Speed
Architectures That Scale Deep - Regaining Control in Deep Systems
ML in the Browser: Interactive Experiences with Tensorflow.js
Build Your Own WebAssembly Compiler
User & Device Identity for Microservices @ Netflix Scale
Scaling Patterns for Netflix's Edge
Make Your Electron App Feel at Home Everywhere
The Talk You've Been Await-ing For
Future of Data Engineering
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More

Recently uploaded (20)

PDF
Empathic Computing: Creating Shared Understanding
PPTX
Big Data Technologies - Introduction.pptx
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Electronic commerce courselecture one. Pdf
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Approach and Philosophy of On baking technology
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
KodekX | Application Modernization Development
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPT
Teaching material agriculture food technology
Empathic Computing: Creating Shared Understanding
Big Data Technologies - Introduction.pptx
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
NewMind AI Weekly Chronicles - August'25 Week I
Encapsulation_ Review paper, used for researhc scholars
Electronic commerce courselecture one. Pdf
Dropbox Q2 2025 Financial Results & Investor Presentation
MIND Revenue Release Quarter 2 2025 Press Release
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
sap open course for s4hana steps from ECC to s4
Programs and apps: productivity, graphics, security and other tools
Building Integrated photovoltaic BIPV_UPV.pdf
Approach and Philosophy of On baking technology
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Chapter 3 Spatial Domain Image Processing.pdf
KodekX | Application Modernization Development
Digital-Transformation-Roadmap-for-Companies.pptx
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Teaching material agriculture food technology

CockroachDB: Architecture of a Geo-Distributed SQL Database