How to build TiDB
PingCAP
About me
● Infrastructure engineer / CEO of PingCAP
● Working on open source projects: TiDB/TiKV
https://github.com/pingcap/tidb
https://github.com/pingcap/tikv
Email: liuqi@pingcap.com
Let’s say we want to build a NewSQL Database
● From the beginning
● What’s wrong with the existing DBs?
○ Relational databases
○ NoSQL
We have a key-value store (RocksDB)
● Good start, RocksDB is fast and stable.
○ Atomic batch write
○ Snapshot
● However… It’s a local embedded kv store.
○ Can’t tolerate machine failures
○ Scalability depends on the capacity of the disk
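To make those two RocksDB properties concrete, here is a toy, in-memory Go sketch (not RocksDB's actual API) of what "atomic batch write" and "snapshot" give you: a batch is applied all-or-nothing, and a snapshot is unaffected by later writes.

package main

import (
	"fmt"
	"sync"
)

// localKV is a toy stand-in for an embedded store such as RocksDB.
// It only illustrates the two properties the deck relies on:
// atomic batch writes and consistent snapshot reads.
type localKV struct {
	mu   sync.RWMutex
	data map[string]string
}

func newLocalKV() *localKV { return &localKV{data: map[string]string{}} }

// WriteBatch applies all mutations under one lock, so readers never
// observe a half-applied batch (the "atomic batch write" property).
func (kv *localKV) WriteBatch(batch map[string]string) {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	for k, v := range batch {
		kv.data[k] = v
	}
}

// Snapshot returns an immutable copy; later writes do not affect it
// (the "snapshot" property).
func (kv *localKV) Snapshot() map[string]string {
	kv.mu.RLock()
	defer kv.mu.RUnlock()
	snap := make(map[string]string, len(kv.data))
	for k, v := range kv.data {
		snap[k] = v
	}
	return snap
}

func main() {
	kv := newLocalKV()
	kv.WriteBatch(map[string]string{"a": "1", "b": "2"}) // applied atomically
	snap := kv.Snapshot()
	kv.WriteBatch(map[string]string{"a": "100"})
	fmt.Println(snap["a"], kv.Snapshot()["a"]) // 1 100
}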
Let’s fix Fault Tolerance
● Use Raft to replicate data
○ Key features of Raft
■ Strong leader: the leader does most of the work and issues all log updates
■ Leader election
■ Membership changes
● Implementation:
○ Ported from etcd
Let’s fix Fault Tolerance
[Diagram: three machines, each running RocksDB, with the copies replicated to one another via Raft]
That’s cool
● Basically we have a lite version of etcd or zookeeper.
○ It does not support the watch command and some other features
● Let’s make it better.
How about Scalability?
● What if we SPLIT data into many regions?
○ We get many Raft groups.
○ Region = Contiguous Keys
● Hash partitioning or Range partitioning
○ Redis: Hash partitioning
○ HBase: Range partitioning
That’s Cool, but...
● But what if we want to scan data?
○ How to support API: scan(startKey, endKey, limit)
● So, we need a globally ordered map
○ Can’t use hash partitioning
○ Use range partitioning
■ Region 1 -> [a - d]
■ Region 2 -> [e - h]
■ …
■ Region n -> [w - z]
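A minimal Go sketch of how scan(startKey, endKey, limit) can be routed once the key space is range-partitioned: walk the regions that overlap the requested range in key order and stop at the limit. The region names, boundaries and data below are made up for illustration.

package main

import (
	"fmt"
	"sort"
)

// region covers the half-open key range [start, end).
type region struct {
	name       string
	start, end string
	kv         map[string]string // this replica's data, keyed by raw key
}

// scan routes a range query to the regions that overlap [startKey, endKey)
// in key order, which is only possible because the regions partition a
// globally ordered key space.
func scan(regions []region, startKey, endKey string, limit int) []string {
	var out []string
	for _, r := range regions {
		if r.end <= startKey || r.start >= endKey {
			continue // region does not overlap the requested range
		}
		keys := make([]string, 0, len(r.kv))
		for k := range r.kv {
			if k >= startKey && k < endKey {
				keys = append(keys, k)
			}
		}
		sort.Strings(keys)
		for _, k := range keys {
			if len(out) == limit {
				return out
			}
			out = append(out, k+"="+r.kv[k])
		}
	}
	return out
}

func main() {
	regions := []region{
		{"region-1", "a", "e", map[string]string{"apple": "1", "dog": "2"}},
		{"region-2", "e", "i", map[string]string{"egg": "3", "fish": "4"}},
		{"region-n", "w", "{", map[string]string{"zebra": "5"}},
	}
	fmt.Println(scan(regions, "a", "z", 3)) // [apple=1 dog=2 egg=3]
}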
How to scale? (1/2)
● That’s simple
● Just Split && Move
[Diagram: Region 1 splits into Region 1 and Region 2, and one of them can then be moved to another node]
How to scale? (2/2)
● Raft comes to the rescue again
○ Using Raft Membership changes, 2 steps:
■ Add a new replica
■ Destroy old region replica
Scale-out (initial state)
[Diagram: Regions 1, 2 and 3 are each replicated three times across Nodes A, B, C and D; Region 1's leader (marked *) is on Node A]
Scale-out (add new node)
1) Transfer leadership of Region 1 from Node A to Node B
[Diagram: Node E joins the cluster; Region 1's leadership moves from Node A to Node B using Raft's leadership transfer]
Scale-out (balancing)
2) Add a replica of Region 1 on Node E
[Diagram: a new Region 1 replica is created on Node E while the existing replicas keep serving]
Scale-out (balancing)
3) Remove the Region 1 replica from Node A
[Diagram: the Region 1 replica on Node A is destroyed, leaving the regions balanced across Nodes B, C, D and E]
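A rough Go sketch of the three scheduling steps the scale-out slides walk through, applied to an in-memory view of replica placement. Node and region names are illustrative; in the real system these decisions are made by the Placement Driver introduced on the next slides.

package main

import "fmt"

// placement tracks, per region, which nodes hold a replica and who leads.
type placement struct {
	replicas map[string][]string // region -> nodes holding a replica
	leader   map[string]string   // region -> current leader node
}

// transferLeader uses Raft's leadership-transfer extension: no data moves.
func (p *placement) transferLeader(region, to string) {
	p.leader[region] = to
}

// addReplica is the first half of a Raft membership change.
func (p *placement) addReplica(region, node string) {
	p.replicas[region] = append(p.replicas[region], node)
}

// removeReplica is the second half: drop the replica on the old node.
func (p *placement) removeReplica(region, node string) {
	out := p.replicas[region][:0]
	for _, n := range p.replicas[region] {
		if n != node {
			out = append(out, n)
		}
	}
	p.replicas[region] = out
}

func main() {
	p := &placement{
		replicas: map[string][]string{"region-1": {"A", "B", "C"}},
		leader:   map[string]string{"region-1": "A"},
	}
	// The three steps from the scale-out slides, moving Region 1 off Node A.
	p.transferLeader("region-1", "B") // 1) transfer leadership A -> B
	p.addReplica("region-1", "E")     // 2) add a replica on the new Node E
	p.removeReplica("region-1", "A")  // 3) remove the replica on Node A
	fmt.Println(p.leader["region-1"], p.replicas["region-1"]) // B [B C E]
}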
Now we have a distributed key-value store
● We want to keep replicas in different datacenters
○ For HA: any node might crash, even a whole data center
○ And to balance the workload
● So, we need the Placement Driver (PD) to act as the cluster manager, for:
○ Replication constraint
○ Data movement
Placement Driver
● Concept comes from Spanner
● Provide the God’s view of the whole cluster
● Store the metadata
○ Clients cache the placement information.
● Maintain the replication constraint
○ 3 replicas, by default
● Data movement
○ For balancing the workload
● It’s a cluster too, of course.
○ Thanks to Raft.
[Diagram: the Placement Driver is itself a three-node cluster, kept consistent with Raft]
Placement Driver
● Rebalance without moving data.
○ Raft: Leadership transfer extension
● Moving data is a slow operation.
● We need fast rebalance.
TiKV: The whole picture
[Diagram: clients talk to four TiKV nodes (Store 1-4) over RPC; each node hosts several Region replicas (Node1: Regions 1 and 3; Node2: Regions 1, 2 and 3; Node3: Regions 1 and 2; Node4: Regions 2 and 3), the replicas of each Region form a Raft group spanning the nodes, and the Placement Driver manages the cluster]
That’s Cool, but hold on...
● It could be cooler if we have:
○ MVCC
○ ACID Transaction
■ Transaction mode: Google Percolator (2PC)
MVCC (Multi-Version Concurrency Control)
● Each transaction sees a snapshot of the database as of the transaction's start time; any changes made by the transaction are not visible to other transactions until it commits.
● Data is tagged with versions
○ Key_version: value
● Lock-free snapshot reads
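A minimal sketch of the "Key_version: value" idea: every write is kept under its version, and a reader with a snapshot timestamp simply picks the newest version at or below it, taking no locks. The types below are illustrative.

package main

import "fmt"

// versionedValue is one "Key_version: value" entry.
type versionedValue struct {
	version uint64
	value   string
}

// mvccStore keeps every version of every key, newest first.
type mvccStore struct {
	data map[string][]versionedValue
}

func (s *mvccStore) put(key, value string, version uint64) {
	s.data[key] = append([]versionedValue{{version, value}}, s.data[key]...)
}

// get returns what a transaction with snapshot timestamp snapTS sees:
// the newest version written at or before snapTS. No locks are taken,
// which is why snapshot reads are lock-free.
func (s *mvccStore) get(key string, snapTS uint64) (string, bool) {
	for _, vv := range s.data[key] {
		if vv.version <= snapTS {
			return vv.value, true
		}
	}
	return "", false
}

func main() {
	s := &mvccStore{data: map[string][]versionedValue{}}
	s.put("Bob", "$10", 5)
	s.put("Bob", "$3", 7)
	v, _ := s.get("Bob", 6) // a transaction that started at ts 6
	fmt.Println(v)          // $10 -- it cannot see the ts-7 write
}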
Transaction API style (go code)
txn := store.Begin() // start a transaction
txn.Set([]byte("key1"), []byte("value1"))
txn.Set([]byte("key2"), []byte("value2"))
err := txn.Commit() // commit the transaction
if err != nil {
txn.Rollback()
}
I want to write code like this.
Transaction Model
● Inspired by Google Percolator
● 3 column families
○ cf:lock: an uncommitted transaction is writing this cell; it contains a pointer to the primary lock
○ cf:write: stores the commit timestamp of the data
○ cf:data: stores the data itself
Transaction Model
Bob wants to transfer $7 to Joe.
Key | Bal:Data | Bal:Lock | Bal:Write
Bob | 6:       | 6:       | 6: data @ 5
    | 5: $10   | 5:       | 5:
Joe | 6:       | 6:       | 6: data @ 5
    | 5: $2    | 5:       | 5:
Transaction Model
Prewrite: Bob's row gets its new balance at ts 7 and holds the primary lock.
Key | Bal:Data | Bal:Lock        | Bal:Write
Bob | 7: $3    | 7: I am Primary | 7:
    | 6:       | 6:              | 6: data @ 5
    | 5: $10   | 5:              | 5:
Joe | 6:       | 6:              | 6: data @ 5
    | 5: $2    | 5:              | 5:
Transaction Model
Prewrite: Joe's row gets its new balance at ts 7; its lock points to the primary lock on Bob.
Key | Bal:Data | Bal:Lock           | Bal:Write
Bob | 7: $3    | 7: I am Primary    | 7:
    | 6:       | 6:                 | 6: data @ 5
    | 5: $10   | 5:                 | 5:
Joe | 7: $9    | 7: Primary@Bob.bal | 7:
    | 6:       | 6:                 | 6: data @ 5
    | 5: $2    | 5:                 | 5:
Transaction Model (commit point)
At commit ts 8, write records pointing at the ts-7 data are added.
Key | Bal:Data | Bal:Lock        | Bal:Write
Bob | 8:       | 8:              | 8: data @ 7
    | 7: $3    | 7: I am Primary | 7:
    | 6:       | 6:              | 6: data @ 5
    | 5: $10   | 5:              | 5:
Joe | 8:       | 8:              | 8: data @ 7
    | 7: $9    | 7: Primary@Bob  | 7:
    | 6:       | 6:              | 6: data @ 5
    | 5: $2    | 5:              | 5:
Transaction Model
The primary lock (on Bob) is removed; Joe's secondary lock is still in place.
Key | Bal:Data | Bal:Lock       | Bal:Write
Bob | 8:       | 8:             | 8: data @ 7
    | 7: $3    | 7:             | 7:
    | 6:       | 6:             | 6: data @ 5
    | 5: $10   | 5:             | 5:
Joe | 8:       | 8:             | 8: data @ 7
    | 7: $9    | 7: Primary@Bob | 7:
    | 6:       | 6:             | 6: data @ 5
    | 5: $2    | 5:             | 5:
Transaction Model
Joe's lock is removed as well; the transfer is complete.
Key | Bal:Data | Bal:Lock | Bal:Write
Bob | 8:       | 8:       | 8: data @ 7
    | 7: $3    | 7:       | 7:
    | 6:       | 6:       | 6: data @ 5
    | 5: $10   | 5:       | 5:
Joe | 8:       | 8:       | 8: data @ 7
    | 7: $9    | 7:       | 7:
    | 6:       | 6:       | 6: data @ 5
    | 5: $2    | 5:       | 5:
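Tying the walkthrough above together, here is a heavily simplified Go sketch of the two phases: prewrite puts the data and locks in place at the start timestamp (one key holding the primary lock), and commit adds the write record at the commit timestamp and releases the lock, primary first. The conflict checks a real implementation must do are omitted, and the types are illustrative.

package main

import "fmt"

type cell struct {
	ts    uint64
	value string
}

// row holds the three column families from the slides.
type row struct {
	data, lock, write []cell
}

type store map[string]*row

func (s store) getRow(key string) *row {
	if s[key] == nil {
		s[key] = &row{}
	}
	return s[key]
}

// prewrite writes the new value at startTS and locks the cell, recording
// which key holds the primary lock. A real implementation would first check
// for conflicting locks and for writes committed after startTS.
func (s store) prewrite(key, value string, startTS uint64, primary string) {
	r := s.getRow(key)
	r.data = append(r.data, cell{startTS, value})
	r.lock = append(r.lock, cell{startTS, "primary@" + primary})
}

// commit records "data @ startTS" in the write column at commitTS and
// releases the lock. Committing the primary key is the commit point;
// secondaries can be committed lazily afterwards.
func (s store) commit(key string, startTS, commitTS uint64) {
	r := s.getRow(key)
	r.write = append(r.write, cell{commitTS, fmt.Sprintf("data @ %d", startTS)})
	var kept []cell
	for _, l := range r.lock {
		if l.ts != startTS {
			kept = append(kept, l)
		}
	}
	r.lock = kept
}

func main() {
	s := store{}
	// Bob transfers $7 to Joe, as in the tables above (start ts 7, commit ts 8).
	s.prewrite("Bob", "$3", 7, "Bob") // Bob holds the primary lock
	s.prewrite("Joe", "$9", 7, "Bob")
	s.commit("Bob", 7, 8) // commit point
	s.commit("Joe", 7, 8)
	fmt.Println(s["Bob"].write, s["Joe"].lock) // [{8 data @ 7}] []
}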
TiKV: Architecture overview (Logical)
Layers, top to bottom: Transaction → MVCC → RaftKV → Local KV Storage (RocksDB)
● Highly layered
● Using Raft for consistency and scalability
● No distributed file system
○ For better performance and lower latency
TiKV: Highly layered (API angle)
Transaction: txn_get(key, txn_start_ts)
MVCC: MVCC_get(key, ver)
RaftKV: raft_get(key)
Local KV Storage (RocksDB): get(key)
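A small sketch of that layering in code: each layer's read is implemented on top of the layer below, so a transactional get ends up as a plain get on the local store. The method bodies are placeholders; only the call chain matters.

package main

import "fmt"

type localKV struct{ data map[string]string }

func (s *localKV) get(key string) string { return s.data[key] }

type raftKV struct{ store *localKV }

// raftGet would normally go through the Raft layer (e.g. the leader or a
// read index) before answering; here it simply reads the local store.
func (r *raftKV) raftGet(key string) string { return r.store.get(key) }

type mvccKV struct{ raft *raftKV }

// mvccGet would normally pick the newest version at or below ver; here the
// version is simply appended to the key for illustration.
func (m *mvccKV) mvccGet(key string, ver uint64) string {
	return m.raft.raftGet(fmt.Sprintf("%s_%d", key, ver))
}

type txnKV struct{ mvcc *mvccKV }

// txnGet reads at the transaction's start timestamp, giving it a
// consistent snapshot.
func (t *txnKV) txnGet(key string, startTS uint64) string {
	return t.mvcc.mvccGet(key, startTS)
}

func main() {
	kv := &localKV{data: map[string]string{"Bob_5": "$10"}}
	txn := &txnKV{&mvccKV{&raftKV{kv}}}
	fmt.Println(txn.txnGet("Bob", 5)) // $10
}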
That’s really really Cool
● We have A Distributed Key-Value Database with
○ Geo-Replication / Auto Rebalance
○ ACID Transaction support
○ Horizontal Scalability
What if we support SQL?
● SQL is simple and very productive
● We want to write code like this:
SELECT COUNT(*) FROM user
WHERE age > 20 and age < 30;
And this...
BEGIN;
INSERT INTO person VALUES('tom', 25);
INSERT INTO person VALUES('jerry', 30);
COMMIT;
First of all, map table data to key value store
● What happens behind the scenes:
CREATE TABLE user (
id INT PRIMARY KEY,
name TEXT,
email TEXT
);
Mapping table data to kv store
Key    -> Value
user/1 -> dongxu | huang@pingcap.com
user/2 -> tom | tom@pingcap.com
...    -> ...
INSERT INTO user VALUES (1, "dongxu", "huang@pingcap.com");
INSERT INTO user VALUES (2, "tom", "tom@pingcap.com");
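A sketch of that row encoding in Go; the key and value layout ("user/<id>", fields joined by " | ") follows the slide, while a real encoding would be binary and order-preserving.

package main

import (
	"fmt"
	"strings"
)

type user struct {
	ID    int
	Name  string
	Email string
}

// encodeRow turns one row of the user table into the key-value pair shown
// on the slide: "user/<id>" -> "<name> | <email>".
func encodeRow(u user) (key, value string) {
	key = fmt.Sprintf("user/%d", u.ID)
	value = strings.Join([]string{u.Name, u.Email}, " | ")
	return key, value
}

func main() {
	k, v := encodeRow(user{1, "dongxu", "huang@pingcap.com"})
	fmt.Println(k, "=>", v) // user/1 => dongxu | huang@pingcap.com
}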
Secondary index is necessary
● Global index
○ All indexes in TiDB are transactional and fully consistent
○ Stored as separate key-value pairs in TiKV
● Keyed by a concatenation of the index prefix and primary key in TiKV
○ For example: table := {id, name}, where id is the primary key. To index the name column, for a row r := (1, 'tom') we store another kv pair:
■ name_index/tom_1 => nil
■ (another row (2, 'tom') would produce name_index/tom_2 => nil)
○ For a unique index on name:
■ name_index/tom => 1
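A sketch of both index encodings from this slide; the helper and its signature are illustrative, not TiDB's actual codec.

package main

import "fmt"

// indexEntry builds the key-value pair for a secondary index on the name
// column of table {id, name}, following the layout on the slide.
func indexEntry(name string, id int, unique bool) (key, value string) {
	if unique {
		// Unique index: the name alone identifies the row,
		// so the row id goes into the value.
		return fmt.Sprintf("name_index/%s", name), fmt.Sprintf("%d", id)
	}
	// Non-unique index: append the row id to keep keys distinct,
	// and store nothing in the value.
	return fmt.Sprintf("name_index/%s_%d", name, id), ""
}

func main() {
	k1, _ := indexEntry("tom", 1, false)
	k2, v2 := indexEntry("tom", 1, true)
	fmt.Println(k1)     // name_index/tom_1
	fmt.Println(k2, v2) // name_index/tom 1
}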
Index is just not enough...
● Can we push down filters?
○ select count(*) from person
where age > 20 and age < 30
● It should be much faster, maybe 100x
○ Fewer RPC round trips
○ Less data transferred
Predicate pushdown
[Diagram: the TiDB server knows that Regions 1, 2 and 5 store the person table's data, so it pushes the filter "age > 20 and age < 30" down to the TiKV nodes (Node1, Node2, Node3) that host those regions]
But TiKV doesn’t know the schema
● A key-value database doesn't have any information about tables and rows
● Coprocessor comes for help:
○ Concept comes from HBase
○ Inject your own logic to data nodes
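A sketch of what the pushed-down predicate looks like on the storage side once the coprocessor can decode rows: each node filters and counts its own regions' rows locally, and only the partial counts travel back to TiDB. The row struct and sample data are made up.

package main

import "fmt"

type person struct {
	Name string
	Age  int
}

// countInRegion stands in for a coprocessor handler on a TiKV node: it scans
// one region's rows locally, applies the pushed-down predicate, and returns
// only a count, so the rows themselves never cross the network.
func countInRegion(rows []person, pred func(person) bool) int {
	n := 0
	for _, r := range rows {
		if pred(r) {
			n++
		}
	}
	return n
}

func main() {
	// The predicate from the slides: age > 20 AND age < 30.
	pred := func(p person) bool { return p.Age > 20 && p.Age < 30 }

	// Rows held by three different regions (illustrative data).
	region1 := []person{{"tom", 25}, {"ann", 40}}
	region2 := []person{{"jerry", 30}}
	region5 := []person{{"bob", 22}, {"eve", 29}}

	// TiDB only has to add up the partial counts returned by each node.
	total := countInRegion(region1, pred) +
		countInRegion(region2, pred) +
		countInRegion(region5, pred)
	fmt.Println(total) // 3
}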
What about drivers for every language?
● We have to build drivers for Java, Python, PHP, C/C++, Rust, Go…
● It needs lots of time and code.
○ Trust me, you don’t want to do that.
OR...
● We just build a protocol layer that is compatible with MySQL. Then we have
all the MySQL drivers.
○ All the tools
○ All the ORMs
○ All the applications
● That’s what TiDB does.
Schema change in distributed RDBMS?
● A must-have feature!
● But you don’t want to lock the whole table while changing schema.
○ Usually a distributed database stores tons of data spanning multiple machines
● We need a non-blocking schema change algorithm
● Thanks to F1 again
○ Similar to "Online, Asynchronous Schema Change in F1" (VLDB 2013, Google)
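A sketch of the state progression that paper relies on, as it might look in code: an index being added moves through intermediate states one step at a time, so any two schema versions live in the cluster at once are compatible and no table lock is needed. The enum and transition function are illustrative, not TiDB's actual implementation.

package main

import "fmt"

// schemaState follows the intermediate states used by the F1 paper for
// adding an index without locking the table.
type schemaState int

const (
	stateAbsent     schemaState = iota // no node knows about the index
	stateDeleteOnly                    // nodes delete index entries but never read or add them
	stateWriteOnly                     // nodes maintain index entries but queries don't use them
	statePublic                        // the index is fully usable
)

func (s schemaState) String() string {
	return [...]string{"absent", "delete-only", "write-only", "public"}[s]
}

// advance moves exactly one state forward. Because the cluster only ever
// runs two adjacent states at the same time, every pair of versions that
// can coexist is compatible, and no table lock is needed.
func advance(s schemaState) schemaState {
	if s == statePublic {
		return statePublic
	}
	return s + 1
}

func main() {
	s := stateAbsent
	for s != statePublic {
		next := advance(s)
		fmt.Printf("%s -> %s (wait until every node has loaded this version)\n", s, next)
		s = next
	}
}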
Architecture (The whole picture)
[Diagram: Applications → MySQL Clients (e.g. JDBC) → MySQL Protocol → TiDB → RPC → TiKV; TiDB plays the role F1 does at Google, and TiKV the role of Spanner]
Testing
● Testing in a distributed system is really hard
Embed testing into your design
● Design for testing
● Get tests from community
○ Lots of tests in MySQL drivers/connectors
○ Lots of ORMs
○ Lots of applications (record & replay)
And more
● Fault injection
○ Hardware
■ disk error
■ network card
■ cpu
■ clock
○ Software
■ file system
■ network & protocol
And more
● Simulate everything
○ Network example:
https://github.com/pingcap/tikv/pull/916/commits/3cf0f7248b32c3c523927eed5ebf82aabea481ec
Distributed testing
● Jepsen
● Namazu
○ ZooKeeper:
■ Found ZOOKEEPER-2212, ZOOKEEPER-2080 (race): (blog article)
○ Etcd:
■ Found etcdctl bug #3517 (timing specification), fixed in #3530. The fix also provided a hint for #3611
■ Reproduced flaky tests {#4006, #4039}
○ YARN:
■ Found YARN-4301 (fault tolerance), reproduced flaky tests {1978, 4168, 4543, 4548, 4556}
More to come
● Distributed query plan - WIP
● Change history (binlog) - WIP
● Run TiDB on top of Kubernetes
Thanks
Q&A
https://github.com/pingcap/tidb
https://github.com/pingcap/tikv