The Story of RocksDB
Embedded Key-Value Store for Flash and RAM

Dhruba Borthakur & Haobo Xu
Database Engineering@Facebook
Monday, December 9, 13

A Client-Server Architecture with disks
• Application Server to Database Server: network roundtrip = 50 microseconds
• Database Server to locally attached disks: disk access = 10 milliseconds

Client-Server Architecture with fast storage
• Application Server to Database Server: network roundtrip = 50 microseconds
• Database Server storage: SSD = 100 microseconds, RAM = 100 nanoseconds
• Latency dominated by network

Architecture of an Embedded Database
• Storage attached directly to application servers
• SSD = 100 microseconds, RAM = 100 nanoseconds
• Network roundtrip to a separate Database Server = 50 microseconds

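The latency figures on the architecture slides can be combined into a back-of-the-envelope comparison. The sketch below is only arithmetic on those numbers. The function names and the model (one storage access per request, plus one network roundtrip when the database is not embedded) are simplifying assumptions, not something the slides state.

```python
# Latency figures quoted on the slides, in microseconds.
NETWORK_ROUNDTRIP_US = 50.0
DISK_ACCESS_US = 10_000.0  # 10 milliseconds
SSD_ACCESS_US = 100.0
RAM_ACCESS_US = 0.1        # 100 nanoseconds


def request_latency_us(storage_us: float, embedded: bool) -> float:
    """Assumed model: one storage access, plus one network roundtrip
    unless the database is embedded in the application server."""
    return storage_us + (0.0 if embedded else NETWORK_ROUNDTRIP_US)


def network_share(storage_us: float) -> float:
    """Fraction of client-server request latency spent on the network."""
    return NETWORK_ROUNDTRIP_US / request_latency_us(storage_us, embedded=False)


if __name__ == "__main__":
    for name, cost in [("disk", DISK_ACCESS_US),
                       ("SSD", SSD_ACCESS_US),
                       ("RAM", RAM_ACCESS_US)]:
        remote = request_latency_us(cost, embedded=False)
        local = request_latency_us(cost, embedded=True)
        print(f"{name}: client-server {remote:.1f} us, embedded {local:.1f} us, "
              f"network share {network_share(cost):.1%}")
```

Under this model the network is a rounding error with disks, but it accounts for a third of the latency with SSD and almost all of it with RAM.
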
Any pre-existing embedded databases?

Open Source key-value stores
1. Berkeley DB
2. SQLite
3. Kyoto TreeDB
4. LevelDB

FB Proprietary key-value stores
1. High performance
2. No transaction log
3. Fixed size keys

Comparison of open source databases

Random Reads
LevelDB         129,000 ops/sec
Kyoto TreeDB    151,000 ops/sec
SQLite3         134,000 ops/sec

Random Writes
LevelDB         164,000 ops/sec
Kyoto TreeDB     88,500 ops/sec
SQLite3           9,860 ops/sec

HBase and HDFS (in April 2012)

Random Reads
HDFS (1 node)    93,000 ops/sec
HBase (1 node)   35,000 ops/sec

Details of this experiment:
http://hadoopblog.blogspot.com/2012/05/hadoop-and-solid-state-drives.html

Log Structured Merge Architecture
• Write Request from Application: Transaction log, plus Read Write data in RAM
• Read Only data in RAM on disk
• Periodic Compaction
• Scan Request from Application

LevelDB has low write rates
Facebook Application 1:
• Write rate of only 2 MB/sec per machine
• Only one CPU was used
We developed multithreaded compaction:
• 10x improvement in write rate
• 100% of CPUs are in use

LevelDB has stalls
Facebook Feed:
• P99 latencies were tens of seconds
• Single-threaded compaction
We implemented thread aware compaction:
• Dedicated thread(s) to flush memtable
• Pipelined memtables
• P99 reduced to less than a second

LevelDB has high write amplification
• Facebook Application 2:
  • Level Style Compaction
  • Write amplification of 70, which is very high

Two compactions by LevelDB Style Compaction:

           Stage 1     Stage 2     Stage 3
Level-0    5 bytes
Level-1    6 bytes     11 bytes
Level-2    10 bytes    10 bytes    10 bytes

Our solution: lower write amplification
• Facebook Application 2:
  • We implemented Universal Style Compaction
  • Start from newest file, include next file in candidate set if
    • Candidate set size >= size of next file

Single compaction by Universal Style Compaction:

           Stage 1     Stage 2
Level-0    5 bytes
Level-1    6 bytes
Level-2    10 bytes    10 bytes

Write amplification reduced to <10

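The candidate-selection rule stated on this slide can be sketched in a few lines. This is a minimal illustration of the rule exactly as the slide words it, not the production picker: the function name `pick_candidates` and the list-of-sizes input are hypothetical, and any extra tuning the real implementation applies is not modeled.

```python
from typing import List


def pick_candidates(file_sizes_newest_first: List[int]) -> List[int]:
    """Return indexes of the files to compact together.

    Rule from the slide: start from the newest file and keep including
    the next (older) file while the candidate set's total size is
    greater than or equal to the size of that next file.
    """
    if not file_sizes_newest_first:
        return []
    candidates = [0]
    candidate_bytes = file_sizes_newest_first[0]
    for idx in range(1, len(file_sizes_newest_first)):
        next_size = file_sizes_newest_first[idx]
        if candidate_bytes < next_size:
            break  # next file is larger than everything picked so far
        candidates.append(idx)
        candidate_bytes += next_size
    return candidates


if __name__ == "__main__":
    # Hypothetical sizes, newest file first.
    print(pick_candidates([4, 3, 6, 20]))
```

With the hypothetical sizes above, the three small files are merged in one pass and the large, old file is left alone, which is the reason fewer bytes get rewritten. A result with a single index means there is nothing to merge yet.
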
LevelDB has high read amplification
• Secondary Index Service:
  • LevelDB does not use blooms for scans
• We implemented prefix scans
  • Range scans within same key prefix
  • Blooms created for prefix
  • Reduces read amplification

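A rough sketch of how a bloom filter built over key prefixes lets a prefix scan skip whole files. Everything here is a toy under stated assumptions: `BloomFilter`, `SstFile`, `prefix_scan`, the fixed 4-byte prefix length, and the SHA-256 based hashing are illustrative choices, not RocksDB's actual classes or filter format.

```python
import hashlib


class BloomFilter:
    """Small bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit set

    def _positions(self, item: bytes):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def may_contain(self, item: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


PREFIX_LEN = 4  # assumed fixed-length key prefix


class SstFile:
    """Toy sorted file that keeps a bloom over key prefixes, not whole keys."""

    def __init__(self, items: dict):
        self.items = sorted(items.items())
        self.prefix_bloom = BloomFilter()
        for key, _ in self.items:
            self.prefix_bloom.add(key[:PREFIX_LEN])

    def scan_prefix(self, prefix: bytes):
        return [(k, v) for k, v in self.items if k.startswith(prefix)]


def prefix_scan(files, prefix: bytes):
    """Range scan within one prefix; returns (results, files actually read)."""
    assert len(prefix) == PREFIX_LEN
    results, files_read = [], 0
    for sst in files:
        if sst.prefix_bloom.may_contain(prefix):
            files_read += 1
            results.extend(sst.scan_prefix(prefix))
    return sorted(results), files_read
```

The toy does not deduplicate a key that appears in several files. It only shows that files whose bloom rejects the prefix are never opened, which is where the reduction in read amplification comes from.
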
LevelDB: read modify write = 2X IOs
• Counter increments
  • Get value, value++, Put value
  • LevelDB uses 2X IOPS
• We implemented MergeRecord
  • Put “++” operation in MergeRecord
  • Background compaction merges all MergeRecords
  • Uses only 1X IOPS

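The difference between read-modify-write and merge records can be illustrated with a toy counter store. This is a sketch under assumptions: `CounterStore` and its methods are hypothetical names, and the `reads` and `writes` fields count only foreground operations issued by the application, not background compaction work.

```python
class CounterStore:
    """Toy store that supports both ways of incrementing a counter."""

    def __init__(self):
        self.base = {}     # key -> fully merged value
        self.pending = {}  # key -> list of unmerged deltas (merge records)
        self.reads = 0     # foreground reads issued by the application
        self.writes = 0    # foreground writes issued by the application

    def increment_read_modify_write(self, key):
        """Get value, value++, Put value: one read plus one write."""
        self.reads += 1
        value = self.base.get(key, 0)
        self.writes += 1
        self.base[key] = value + 1

    def merge(self, key, delta):
        """Record the delta only: one write, no read."""
        self.writes += 1
        self.pending.setdefault(key, []).append(delta)

    def compact(self):
        """Background step: fold all pending deltas into the base value."""
        for key, deltas in self.pending.items():
            self.base[key] = self.base.get(key, 0) + sum(deltas)
        self.pending.clear()

    def get(self, key):
        """A read combines the base value with any deltas not yet compacted."""
        return self.base.get(key, 0) + sum(self.pending.get(key, []))
```

For 100 increments the read-modify-write path issues 200 foreground operations and the merge path issues 100, matching the 2X versus 1X figures on the slide. The cost moves to reads of not-yet-compacted keys, which have to combine the pending deltas.
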
LevelDB has a Rigid Design
• LevelDB Design
  • Cannot tune system, fixed file sizes
• We wanted a pluggable architecture
  • Pluggable compaction filter, e.g. TimeToLive
  • Pluggable memtable/sstable for RAM/Flash
  • Pluggable Compaction Algorithm

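As an illustration of the pluggable compaction filter idea, here is a minimal TimeToLive sketch. It assumes each value is stored together with its write timestamp and that the filter is a callback consulted once per entry during compaction. The names `make_ttl_filter` and `compact` and the entry layout are hypothetical, not RocksDB's interface.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Entry = Tuple[bytes, float]                        # (value, written_at seconds)
CompactionFilter = Callable[[bytes, Entry], bool]  # returns True to drop


def make_ttl_filter(ttl_seconds: float,
                    now: Optional[Callable[[], float]] = None) -> CompactionFilter:
    """Build a filter that drops entries older than ttl_seconds."""
    clock = now or time.time

    def should_drop(key: bytes, entry: Entry) -> bool:
        _, written_at = entry
        return clock() - written_at > ttl_seconds

    return should_drop


def compact(data: Dict[bytes, Entry], drop: CompactionFilter) -> Dict[bytes, Entry]:
    """Toy compaction: rewrite the data, skipping entries the filter rejects."""
    return {key: entry for key, entry in data.items() if not drop(key, entry)}
```

Expired data is removed as a side effect of compaction, so no separate delete traffic is needed. The injectable clock exists only to make the sketch easy to test.
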
The changes we made to LevelDB

Inherited from LevelDB
• Log Structured Merge DB
• Gets/Puts/Scans of keys
• Forward and Reverse Iteration

RocksDB
• 10x higher write rate
• Fewer stalls
• 7x lower write amplification
• Blooms for range scans
• Ability to avoid read-modify-write
• Optimizations for flash or RAM
• And many more…

RocksDB is born!
• Key-Value persistent store
• Embedded
• Optimized for fast storage
• Server workloads

What is it not?
• Not distributed
• No failover
• Not highly available: if the machine dies, you lose your data

RocksDB API
▪ Keys and values are arbitrary byte arrays.
▪ Data is stored sorted by key.
▪ The basic operations are Put(key, value), Get(key), Delete(key) and Merge(key, delta).
▪ Forward and backward iteration is supported over the data.

RocksDB Architecture
• Memory: Active MemTable (receives the Write Request) and ReadOnly MemTable
• Persistent Storage: log files and sst files in the LSM
• Switch: Active MemTable becomes a ReadOnly MemTable; the log is switched as well
• Flush: ReadOnly MemTable is written out as an sst file
• Compaction: runs over the sst files
• Read Request: consults the MemTables and the sst files

Log Structured Merge Tree -- Writes
▪ Log Structured Merge Tree
▪ New Puts are written to memory and optionally to the transaction log
▪ The log write sync option can also be specified for each individual write
▪ We say RocksDB is optimized for writes. What does this mean?

RocksDB Write Path
• Write Request goes to the Active MemTable and the log
• Switch: Active MemTable becomes a ReadOnly MemTable; the log is switched as well
• Flush: ReadOnly MemTable is written out as an sst file in the LSM
• Compaction: runs over the sst files

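The write path can be sketched as a toy: each Put optionally appends to a log (with an optional per-write sync) and always lands in the active memtable, which is switched and flushed as a sorted run once it is full. `ToyLsmWriter`, the length-prefixed record layout, and the synchronous flush are assumptions made for illustration. The slides describe dedicated flush threads, which this sketch does not model.

```python
import os
import tempfile


class ToyLsmWriter:
    """Toy write path: log append, memtable insert, switch, flush."""

    def __init__(self, wal_path: str, memtable_limit: int = 4):
        self.wal = open(wal_path, "ab")
        self.memtable_limit = memtable_limit
        self.active = {}     # active memtable
        self.sst_files = []  # each flush appends one sorted list of pairs

    def put(self, key: bytes, value: bytes,
            write_log: bool = True, sync: bool = False) -> None:
        if write_log:
            # Assumed record layout: 4-byte lengths followed by the bytes.
            self.wal.write(len(key).to_bytes(4, "big") + key +
                           len(value).to_bytes(4, "big") + value)
            self.wal.flush()
            if sync:
                os.fsync(self.wal.fileno())  # per-write durability option
        self.active[key] = value
        if len(self.active) >= self.memtable_limit:
            self._switch_and_flush()

    def _switch_and_flush(self) -> None:
        # Switch: the full memtable becomes read-only, a fresh one takes writes.
        read_only, self.active = self.active, {}
        # Flush: write the read-only memtable out in sorted key order.
        self.sst_files.append(sorted(read_only.items()))

    def close(self) -> None:
        self.wal.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        writer = ToyLsmWriter(os.path.join(tmp, "wal.log"), memtable_limit=2)
        writer.put(b"b", b"2")
        writer.put(b"a", b"1", sync=True)
        print(writer.sst_files)
        writer.close()
```

Writes never modify existing files in place: the only storage traffic in the foreground is a sequential log append, and sorted files are produced in bulk at flush time.
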
Log Structured Merge Tree -- Reads
▪ Data could be in memory or on disk
▪ Consult multiple files to find the latest instance of the key
▪ Use bloom filters to reduce IO

RocksDB Read Path
• Read Request consults the Active MemTable, the ReadOnly MemTable, and the sst files in the LSM
• Memory: Active MemTable and ReadOnly MemTable
• Persistent Storage: log files and sst files, with Blooms on the sst files
• Flush and Compaction continue in the background

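The read path can be sketched as a lookup that checks the memtables first and then the sst files from newest to oldest, using a per-file bloom filter to skip files that cannot contain the key. `Bloom`, `Sst`, and `get` are hypothetical names, the hashing scheme is an arbitrary choice, and deletes are left out to keep the sketch short.

```python
import hashlib


class Bloom:
    """Small bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, key: bytes):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Sst:
    """Toy immutable file with a bloom filter over its keys."""

    def __init__(self, items: dict):
        self.items = dict(items)
        self.bloom = Bloom()
        for key in self.items:
            self.bloom.add(key)


def get(key: bytes, active: dict, read_only: dict, ssts_newest_first: list):
    """Return the latest value for key, or None if it is not present."""
    for memtable in (active, read_only):  # newest data lives in memory
        if key in memtable:
            return memtable[key]
    for sst in ssts_newest_first:
        if not sst.bloom.may_contain(key):
            continue  # bloom says "definitely absent": skip the file IO
        if key in sst.items:
            return sst.items[key]
    return None
```

The search stops at the first hit because newer sources shadow older ones, so the cost of a read depends on how many files must be consulted before the key is found.
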
RocksDB: Open & Pluggable
• Write Request from Application
• Get or Scan Request from Application
• Customizable WAL (Transaction log)
• Blooms
• Pluggable Memtable format in RAM
• Pluggable sst data format on storage
• Pluggable Compaction

Example: Customizable WALogging
• The in-house replication solution wants to be able to embed an arbitrary blob in the RocksDB WAL stream for log annotation
• Use Case: Indicate where a log record came from in multi-master replication
• Solution: A Put that only speaks to the log

Replication Layer, in one write batch:
  PutLogData(“I came from Mars”)
  Put(k1,v1)

Result of the Write Request:
• Active MemTable: k1 -> v1
• log: “I came from Mars”, k1/v1

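The behavior described here, a record that reaches the log but never the memtable, can be modeled with a toy write batch. The class and method names below (`WriteBatch`, `put_log_data`, `ToyDb`) are Python stand-ins chosen to mirror the slide's PutLogData and Put calls. They are not the library's API, and the in-memory list standing in for the WAL is an assumption.

```python
class WriteBatch:
    """Toy batch of records that are logged and applied together."""

    def __init__(self):
        self.records = []

    def put(self, key: bytes, value: bytes) -> None:
        self.records.append(("put", key, value))

    def put_log_data(self, blob: bytes) -> None:
        # Annotation that only speaks to the log.
        self.records.append(("log_data", blob))


class ToyDb:
    def __init__(self):
        self.log = []       # stand-in for the WAL stream
        self.memtable = {}  # stand-in for the active memtable

    def write(self, batch: WriteBatch) -> None:
        # The whole batch, annotations included, goes to the log.
        self.log.append(list(batch.records))
        for record in batch.records:
            if record[0] == "put":
                _, key, value = record
                self.memtable[key] = value
            # "log_data" records are deliberately not applied to the memtable.


if __name__ == "__main__":
    db = ToyDb()
    batch = WriteBatch()
    batch.put_log_data(b"I came from Mars")
    batch.put(b"k1", b"v1")
    db.write(batch)
    print(db.memtable)
    print(db.log)
```

A replication layer tailing the log sees the annotation next to the data it describes, while readers of the database never see it.
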
Example: Pluggable SST format
• One Facebook use case needs extremely fast responses but can tolerate some loss of durability
• Quick hack: mount sst in tmpfs
• Still not performant:
  • existing sst format is block based
• Solution: A much simpler format that just stores sorted key/value pairs sequentially
  • no blocks, no caching, mmap the whole file
  • build efficient lookup index on load

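A minimal sketch of such a format: sorted key/value pairs written back to back, the whole file mapped into memory, and an index built once at load time. The record layout (two 4-byte big-endian lengths, then key and value), the names `write_plain_table` and `PlainTableReader`, and the choice of binary search over an in-memory key list are all assumptions for illustration. The slide does not say what index the real format builds.

```python
import bisect
import mmap
import os
import struct
import tempfile


def write_plain_table(path: str, items: dict) -> None:
    """Write sorted key/value pairs sequentially, with no blocks."""
    with open(path, "wb") as f:
        for key, value in sorted(items.items()):
            f.write(struct.pack(">II", len(key), len(value)) + key + value)


class PlainTableReader:
    """mmap the whole (non-empty) file and index it on load."""

    def __init__(self, path: str):
        self.file = open(path, "rb")
        self.data = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)
        self.keys = []         # sorted keys, used for binary search
        self.value_spans = []  # (start, end) offsets of each value
        offset = 0
        while offset < len(self.data):
            key_len, value_len = struct.unpack_from(">II", self.data, offset)
            key_start = offset + 8
            value_start = key_start + key_len
            self.keys.append(self.data[key_start:value_start])
            self.value_spans.append((value_start, value_start + value_len))
            offset = value_start + value_len

    def get(self, key: bytes):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            start, end = self.value_spans[i]
            return self.data[start:end]  # read straight from the mapping
        return None

    def close(self) -> None:
        self.data.close()
        self.file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "table.sst")
        write_plain_table(path, {b"b": b"2", b"a": b"1"})
        reader = PlainTableReader(path)
        print(reader.get(b"a"))
        reader.close()
```

Because the file sits in RAM-backed storage, a lookup is a binary search plus a slice of the mapping, with no block decoding and no separate cache to manage.
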
Example: Blooms for MemTable
• Same use case: after we optimized sst access, memtable lookup became a major cost in queries
• Problem: Get has to go through memtable lookups that eventually return no data
• Solution: Just add a bloom filter to memtable!

RocksDB Read Path
• Read Request consults the Active MemTable, the ReadOnly MemTable, and the sst files in the LSM
• Memory: Active MemTable and ReadOnly MemTable, each with Blooms
• Persistent Storage: log files and sst files, with Blooms on the sst files
• Flush and Compaction continue in the background

Example: Pluggable memtable format
• Another Facebook use case has a distinct load phase where no query is issued.
• Problem: write throughput is limited by single writer thread
• Solution: A new memtable representation that does not keep keys sorted

Example: Pluggable memtable format
• Memory: Unsorted MemTable (receives the Write Request) and ReadOnly MemTable
• Persistent Storage: log files and sst files in the LSM
• Switch: Unsorted MemTable becomes a ReadOnly MemTable; the log is switched as well
• Sort, Flush: the ReadOnly MemTable is sorted and then written out as an sst file
• Compaction: runs over the sst files
• Read Request: consults the MemTables and the sst files

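The unsorted memtable idea can be sketched as an append-only list that is sorted only once, at flush time. `UnsortedMemTable` and `sort_and_flush` are hypothetical names, and resolving duplicate keys by letting the later write win is an assumption about the intended semantics.

```python
class UnsortedMemTable:
    """Toy memtable for a load phase with no queries: appends only."""

    def __init__(self):
        self.entries = []  # (key, value) pairs in arrival order

    def put(self, key: bytes, value: bytes) -> None:
        # Constant-time append; no ordering is maintained on the write path.
        self.entries.append((key, value))

    def sort_and_flush(self):
        """Sort once at flush time and return the pairs for an sst file."""
        latest = {}
        for key, value in self.entries:
            latest[key] = value  # assumed: later writes to a key win
        self.entries = []
        return sorted(latest.items())


if __name__ == "__main__":
    memtable = UnsortedMemTable()
    memtable.put(b"b", b"1")
    memtable.put(b"a", b"2")
    memtable.put(b"b", b"3")
    print(memtable.sort_and_flush())
```

The trade-off is that point lookups against such a memtable would need a full scan, which is acceptable only because the load phase issues no queries.
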
Possible workloads for RocksDB?
▪ Serving data to users via a website
▪ A spam detection backend that needs fast access to data
▪ A graph search query that needs to scan a dataset in realtime
▪ Distributed Configuration Management Systems
▪ Fast serve of Hive Data
▪ A Queue that needs a high rate of inserts and deletes

Futures
• Scale linearly with number of CPUs
  • 32, 64 or higher core machines
  • ARM processors
• Scale linearly with storage IOPS
  • Striped flash cards
  • RAM & NVRAM storage

Come Hack with us
• RocksDB is Open Sourced
  • http://rocksdb.org
  • Developers group https://www.facebook.com/groups/rocksdb.dev/
• Help us HACK RocksDB
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

  • 1. The Story of RocksDB Embedded Key-Value Store for Flash and RAM Dhruba Borthakur & Haobo Xu Database Engineering@Facebook Monday, December 9, 13
  • 4. A Client-Server Architecture with disks Application Server Network roundtrip = 50 micro sec Database Server Disk access = 10 milli seconds Locally attached Disks Monday, December 9, 13
  • 5. Client-Server Architecture with fast storage Application Server Network roundtrip = 50 micro sec Database Server 100 microsecs SSD Latency dominated by network Monday, December 9, 13 100 nanosecs RAM
  • 6. Architecture of an Embedded Database Application Server Monday, December 9, 13 Network roundtrip = 50 micro sec Database Server
  • 7. Architecture of an Embedded Database Application Server Monday, December 9, 13 Network roundtrip = 50 micro sec Database Server
  • 8. Architecture of an Embedded Database Network roundtrip = 50 micro sec Application Server 100 microsecs SSD Monday, December 9, 13 100 nanosecs RAM Database Server
  • 9. Architecture of an Embedded Database Network roundtrip = 50 micro sec Application Server 100 microsecs SSD 100 nanosecs RAM Storage attached directly to application servers Monday, December 9, 13 Database Server
  • 10. Any pre-existing embedded databases? Open Source Monday, December 9, 13 FB Proprietary
  • 11. Any pre-existing embedded databases? Key-value stores 1. 2. 3. 4. Berkeley DB SQLite Kyoto TreeDB LevelDB Open Source Monday, December 9, 13 FB Proprietary
  • 12. Any pre-existing embedded databases? Key-value stores Open Source 1. High Performant 2. No transaction log 3. Fixed size keys FB Proprietary Monday, December 9, 13
  • 13. Comparison of open source databases Monday, December 9, 13
  • 14. Comparison of open source databases Random Reads Monday, December 9, 13
  • 15. Comparison of open source databases Random Reads Random Writes Monday, December 9, 13
  • 16. Comparison of open source databases Random Reads LevelDB Kyoto TreeDB SQLite3 Random Writes Monday, December 9, 13 129,000 ops/sec 151,000 ops/sec 134,000 ops/sec
  • 17. Comparison of open source databases Random Reads LevelDB Kyoto TreeDB SQLite3 129,000 ops/sec 151,000 ops/sec 134,000 ops/sec Random Writes LevelDB Kyoto TreeDB SQLite3 Monday, December 9, 13 164,000 ops/sec 88,500 ops/sec 9,860 ops/sec
  • 18. HBase and HDFS (in April 2012) Details of this experiment: http://hadoopblog.blogspot.com/2012/05/hadoop-and-solid-state-drives.html Monday, December 9, 13
  • 19. HBase and HDFS (in April 2012) Random Reads Details of this experiment: http://hadoopblog.blogspot.com/2012/05/hadoop-and-solid-state-drives.html Monday, December 9, 13
  • 20. HBase and HDFS (in April 2012) Random Reads HDFS (1 node) HBase (1 node) 93,000 ops/sec 35,000 ops/sec Details of this experiment: http://hadoopblog.blogspot.com/2012/05/hadoop-and-solid-state-drives.html Monday, December 9, 13
  • 21. Log Structured Merge Architecture Read Write data in RAM Monday, December 9, 13
  • 22. Log Structured Merge Architecture Write Request from Application Read Write data in RAM Monday, December 9, 13
  • 23. Log Structured Merge Architecture Write Request from Application Read Write data in RAM Monday, December 9, 13
  • 24. Log Structured Merge Architecture Write Request from Application Read Write data in RAM Monday, December 9, 13
  • 25. Log Structured Merge Architecture Write Request from Application Read Write data in RAM Transaction log Monday, December 9, 13
  • 26. Log Structured Merge Architecture Write Request from Application Read Write data in RAM Transaction log Monday, December 9, 13
  • 27. Log Structured Merge Architecture Write Request from Application Read Write data in RAM Transaction log Monday, December 9, 13
  • 28. Log Structured Merge Architecture Write Request from Application Read Write data in RAM Read Only data in RAM on disk Monday, December 9, 13 Transaction log
  • 29. Log Structured Merge Architecture Write Request from Application Periodic Compaction Read Write data in RAM Read Only data in RAM on disk Monday, December 9, 13 Transaction log
  • 30. Log Structured Merge Architecture Scan Request from Application Periodic Compaction Read Write data in RAM Read Only data in RAM on disk Monday, December 9, 13 Write Request from Application Transaction log
  • 31. Leveldb has low write rates Facebook Application 1: • Write rate 2 MB/sec only per machine • Only one cpu was used Monday, December 9, 13
  • 32. Leveldb has low write rates Facebook Application 1: • Write rate 2 MB/sec only per machine • Only one cpu was used We developed multithreaded compaction 10x improvement on write rate Monday, December 9, 13 + 100% of cpus are in use
  • 34. LevelDB has stalls
    Facebook Feed:
    • P99 latencies were tens of seconds
    • Single-threaded compaction
    We implemented thread-aware compaction:
    • Dedicated thread(s) to flush the memtable
    • Pipelined memtables
    • P99 reduced to less than a second
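The idea of pipelined memtables with a dedicated flush thread can be sketched as follows. This is a toy model, not RocksDB code: `PipelinedDB`, `memtable_limit` and the in-memory `sst_files` list are hypothetical names, and reads against memtables that are waiting to be flushed are left out.

```python
import queue
import threading


class PipelinedDB:
    """Toy store: writers never wait for a flush to finish."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit   # hypothetical size threshold
        self.active = {}                       # memtable accepting writes
        self.flush_queue = queue.Queue()       # read-only memtables awaiting flush
        self.sst_files = []                    # stand-in for files on storage
        self._flusher = threading.Thread(target=self._flush_loop)
        self._flusher.start()

    def put(self, key, value):
        self.active[key] = value
        if len(self.active) >= self.memtable_limit:
            # Switch: hand the full memtable to the flush thread and keep
            # accepting writes in a fresh one without waiting.
            self.flush_queue.put(self.active)
            self.active = {}

    def _flush_loop(self):
        while True:
            memtable = self.flush_queue.get()
            if memtable is None:               # shutdown signal
                return
            self.sst_files.append(sorted(memtable.items()))

    def close(self):
        self.flush_queue.put(None)
        self._flusher.join()


db = PipelinedDB(memtable_limit=2)
for i in range(5):
    db.put(f"key{i}", i)
db.close()
print(len(db.sst_files), db.active)
```

Because the writer only enqueues the full memtable, a slow flush delays the background thread rather than the application's write call, which is the effect the slide attributes to the reduced P99.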
  • 36. LevelDB has high write amplification
    Facebook Application 2:
    • Level Style Compaction
    • Write amplification of 70, which is very high
    [Diagram: Stage 1, Stage 2, Stage 3; Level-0: 5 bytes; Level-1: 6 bytes, 11 bytes; Level-2: 10 bytes, 10 bytes, 10 bytes; caption: two compactions by LevelDB Style Compaction]
  • 38. Our solution: lower write amplification
    Facebook Application 2:
    • We implemented Universal Style Compaction
    • Start from the newest file; include the next file in the candidate set if
      candidate set size >= size of next file
    [Diagram: Stage 1, Stage 2; Level-0: 5 bytes; Level-1: 6 bytes; Level-2: 10 bytes, 10 bytes; caption: single compaction by Universal Style Compaction]
    • Write amplification reduced to <10
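The candidate-selection rule on this slide can be sketched in a few lines. This is only a reading of the rule as stated (newest file first, keep adding while the accumulated size is at least the size of the next file); the function name and the file sizes are made up for illustration, and any further tuning knobs are not modeled.

```python
def pick_universal_candidates(file_sizes_newest_first):
    """Return the sizes of the files chosen for one compaction.

    Start from the newest file and keep adding the next (older) file
    while the accumulated candidate size is at least as large as it.
    """
    if not file_sizes_newest_first:
        return []
    candidates = [file_sizes_newest_first[0]]
    total = candidates[0]
    for size in file_sizes_newest_first[1:]:
        if total < size:
            break  # next file is bigger than everything picked so far
        candidates.append(size)
        total += size
    return candidates


# Hypothetical file sizes in bytes, newest first.
picked = pick_universal_candidates([1, 1, 2, 4, 100])
print(picked)
```

With these sizes the four small files are merged together and the large, old file is left alone, so its bytes are not rewritten. Avoiding repeated rewrites of large files is how a rule like this keeps write amplification down.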
  • 41. LevelDB has high read amplification
    Secondary Index Service:
    • LevelDB does not use blooms for scans
    We implemented prefix scans:
    • Range scans within the same key prefix
    • Blooms created for the prefix
    • Reduces read amplification
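A minimal sketch of a prefix bloom, assuming a fixed-length key prefix: each file gets a bloom filter built over the prefixes of its keys, and a prefix scan skips any file whose filter rules the prefix out. `BloomFilter`, `PREFIX_LEN`, `build_prefix_bloom` and `files_to_scan` are hypothetical names, and the hashing scheme is chosen for clarity, not speed.

```python
import hashlib


class BloomFilter:
    """Small bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


PREFIX_LEN = 4  # assumed fixed prefix length, for illustration only


def build_prefix_bloom(keys):
    """Index the prefix of every key rather than the whole key."""
    bloom = BloomFilter()
    for key in keys:
        bloom.add(key[:PREFIX_LEN])
    return bloom


def files_to_scan(files, prefix):
    """Return the names of files whose bloom cannot rule out the prefix."""
    return [name for name, bloom in files.items() if bloom.may_contain(prefix)]


files = {
    "sst-1": build_prefix_bloom([b"user:1", b"user:2"]),
    "sst-2": build_prefix_bloom([b"post:7", b"post:9"]),
}
print(files_to_scan(files, b"user"))
```

A scan for the `user` prefix always includes `sst-1`; other files are skipped unless their filter happens to give a false positive.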
  • 44. LevelDB: read-modify-write = 2X IOs
    Counter increments:
    • Get value, value++, Put value
    • LevelDB uses 2X IOPS
    We implemented MergeRecord:
    • Put the “++” operation in a MergeRecord
    • Background compaction merges all MergeRecords
    • Uses only 1X IOPS
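The difference between the two approaches can be illustrated with a toy store that counts simulated storage operations. `CounterStore` and its counters are hypothetical; the merge records here are plain integer deltas, and compaction is modeled as a method call, not a background thread.

```python
class CounterStore:
    """Toy key-value store that counts simulated storage reads and writes."""

    def __init__(self):
        self.base = {}        # key -> fully merged integer value
        self.pending = {}     # key -> list of unmerged integer deltas
        self.reads = 0
        self.writes = 0

    def get(self, key):
        self.reads += 1
        return self.base.get(key, 0) + sum(self.pending.get(key, []))

    def put(self, key, value):
        self.writes += 1
        self.base[key] = value
        self.pending.pop(key, None)

    def merge(self, key, delta):
        # Blind write: record the delta without reading the current value.
        self.writes += 1
        self.pending.setdefault(key, []).append(delta)

    def compact(self):
        # Background work: fold every pending delta into its base value.
        for key, deltas in self.pending.items():
            self.base[key] = self.base.get(key, 0) + sum(deltas)
        self.pending.clear()


rmw = CounterStore()
for _ in range(1000):
    rmw.put("likes", rmw.get("likes") + 1)   # Get, value++, Put

merged = CounterStore()
for _ in range(1000):
    merged.merge("likes", 1)                 # one write, no read
merged.compact()

print(rmw.reads + rmw.writes, merged.reads + merged.writes)
```

The read-modify-write loop issues one read and one write per increment, while the merge loop issues only the write, which matches the 2X versus 1X figures on the slide.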
  • 47. LevelDB has a rigid design
    LevelDB design:
    • Cannot tune the system; fixed file sizes
    We wanted a pluggable architecture:
    • Pluggable compaction filter, e.g. TimeToLive
    • Pluggable memtable/sstable for RAM/Flash
    • Pluggable compaction algorithm
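As an illustration of a pluggable compaction filter, the sketch below passes a time-to-live rule into a simplified compaction step. Everything here is an assumption made for the example: the `(key, value, written_at)` entry layout, the `ttl_filter` and `compact` names, and the injected `now` parameter used to keep the example deterministic.

```python
import time


def ttl_filter(ttl_seconds, now=None):
    """Build a filter that tells compaction to drop entries older than the TTL."""
    def should_drop(key, value, written_at):
        current = time.time() if now is None else now
        return current - written_at > ttl_seconds
    return should_drop


def compact(entries, should_drop):
    """Simplified compaction step: keep only entries the filter does not drop.

    entries: iterable of (key, value, written_at) tuples, sorted by key.
    """
    return [(k, v, ts) for (k, v, ts) in entries if not should_drop(k, v, ts)]


entries = [("a", "1", 900), ("b", "2", 990)]
survivors = compact(entries, ttl_filter(ttl_seconds=60, now=1000))
print(survivors)
```

Because the filter is just a callable handed to the compaction step, a different expiry or cleanup policy can be swapped in without touching the compaction code itself.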
  • 50. The changes we made to LevelDB
    Inherited from LevelDB:
    • Log Structured Merge DB
    • Gets/Puts/Scans of keys
    • Forward and reverse iteration
    RocksDB:
    • 10X higher write rate
    • Fewer stalls
    • 7x lower write amplification
    • Blooms for range scans
    • Ability to avoid read-modify-write
    • Optimizations for flash or RAM
    • And many more…
  • 52. RocksDB is born!
    • Key-value persistent store
    • Embedded
    • Optimized for fast storage
    • Server workloads
  • 54. What is it not?
    • Not distributed
    • No failover
    • Not highly available: if the machine dies, you lose your data
  • 55. RocksDB API
    ▪ Keys and values are arbitrary byte arrays.
    ▪ Data is stored sorted by key.
    ▪ The basic operations are Put(key, value), Get(key), Delete(key) and Merge(key, delta).
    ▪ Forward and backward iteration is supported over the data.
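To make the shape of this API concrete, here is a small in-memory stand-in with byte-array keys, sorted storage, and iteration in both directions. It is not the RocksDB interface: `SortedStore` and its method names are hypothetical, and Merge is left out of this sketch.

```python
import bisect


class SortedStore:
    """Toy store: byte-array keys kept in sorted order."""

    def __init__(self):
        self._keys = []     # sorted list of bytes keys
        self._values = {}   # key -> value

    def put(self, key: bytes, value: bytes):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def delete(self, key: bytes):
        if key in self._values:
            del self._values[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def scan(self, start: bytes = b"", reverse=False):
        """Yield (key, value) pairs forward from `start`, or backward from it."""
        if reverse:
            idx = bisect.bisect_right(self._keys, start) if start else len(self._keys)
            keys = reversed(self._keys[:idx])
        else:
            keys = self._keys[bisect.bisect_left(self._keys, start):]
        for key in keys:
            yield key, self._values[key]


store = SortedStore()
for k in (b"b", b"a", b"c"):
    store.put(k, k.upper())
print([k for k, _ in store.scan()])
print([k for k, _ in store.scan(reverse=True)])
```

Keys come back in byte order regardless of insertion order, which is what makes range scans and reverse iteration cheap to express.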
  • 60. RocksDB Architecture
    [Diagram labels: Memory; Persistent Storage; Write Request; Read Request; Active MemTable; Switch; ReadOnly MemTable; log; Flush; LSM; sst files; Compaction]
  • 61. Log Structured Merge Tree: Writes
    ▪ Log Structured Merge Tree
    ▪ New Puts are written to memory and optionally to the transaction log
    ▪ A log write sync option can also be specified for each individual write
    ▪ We say RocksDB is optimized for writes; what does this mean?
  • 63. RocksDB Write Path
    [Diagram labels: Write Request; Active MemTable; Switch; ReadOnly MemTable; log; Flush; LSM; sst files; Compaction]
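The write path described above (memory first, transaction log optional, sync chosen per write) can be sketched as follows. `TinyDB`, the `write_log` and `sync` parameters, and the length-prefixed record layout are assumptions made for this example and do not describe RocksDB's log format.

```python
import os
import tempfile


class TinyDB:
    """Toy write path: optional log append, then an in-memory update."""

    def __init__(self, log_path):
        self.memtable = {}
        self.log = open(log_path, "ab")

    def put(self, key: bytes, value: bytes, write_log=True, sync=False):
        if write_log:
            # Length-prefixed record appended to the transaction log.
            record = (len(key).to_bytes(4, "big") + key +
                      len(value).to_bytes(4, "big") + value)
            self.log.write(record)
            self.log.flush()                  # hand the bytes to the OS
            if sync:
                os.fsync(self.log.fileno())   # force them to stable storage
        self.memtable[key] = value

    def close(self):
        self.log.close()


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = TinyDB(log_path)
db.put(b"k1", b"v1")                     # logged, not synced
db.put(b"k2", b"v2", sync=True)          # logged and fsynced
db.put(b"k3", b"v3", write_log=False)    # memory only
db.close()
print(os.path.getsize(log_path))
```

Every write is a sequential append plus an in-memory update, with no in-place update of existing data. The per-write options let a caller trade durability for latency one write at a time.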
  • 64. Log Structured Merge Tree: Reads
    ▪ Data could be in memory or on disk
    ▪ Consult multiple files to find the latest instance of the key
    ▪ Use bloom filters to reduce IO
  • 68. RocksDB Read Path
    [Diagram labels: Memory; Persistent Storage; Read Request; Active MemTable; ReadOnly MemTable; log; Flush; LSM; sst files; Blooms; Compaction]
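A rough sketch of the read path: search from newest to oldest, and consult a per-file filter before paying for a file read. The names `SstFile`, `lookup` and `TOMBSTONE` are hypothetical. For brevity the filter is an exact Python set standing in for a bloom filter, which would occasionally report false positives.

```python
TOMBSTONE = object()  # marker for a deleted key


class SstFile:
    def __init__(self, entries):
        self.entries = dict(entries)
        # Stand-in for a bloom filter: an exact set never gives false
        # positives, whereas a real bloom filter occasionally does.
        self.filter = set(self.entries)
        self.reads = 0

    def get(self, key):
        self.reads += 1   # counts simulated IO
        return self.entries.get(key)


def lookup(key, memtables, sst_files):
    """Search newest to oldest and return the first instance of the key."""
    for table in memtables:                 # active first, then read-only
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    for sst in sst_files:                   # newest file first
        if key not in sst.filter:
            continue                        # filter says "definitely absent"
        value = sst.get(key)
        if value is not None:
            return None if value is TOMBSTONE else value
    return None


memtables = [{"a": "new"}, {"b": TOMBSTONE}]
newer = SstFile({"c": "3"})
older = SstFile({"a": "old", "b": "old", "c": "old"})
print(lookup("a", memtables, [newer, older]),
      lookup("b", memtables, [newer, older]),
      lookup("c", memtables, [newer, older]))
```

The first hit wins because newer data shadows older data, and a delete marker hides any older value for the same key. Files whose filter rejects the key are never read.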
  • 78. RocksDB: Open & Pluggable
    [Diagram labels: Write Request from Application; Get or Scan Request from Application; Customizable WAL; Transaction log; Blooms; Pluggable Memtable format in RAM; Pluggable sst data format on storage; Pluggable Compaction]
  • 79. Example: Customizable WALogging
    • An in-house replication solution wants to embed an arbitrary blob in the RocksDB WAL stream for log annotation
    • Use case: indicate where a log record came from in multi-master replication
    • Solution: a Put that only speaks to the log
  • 81. Example: Customizable WALogging
    Replication layer, in one write batch:
    • PutLogData(“I came from Mars”)
    • Put(k1,v1)
    [Diagram: after the write request, the Active MemTable holds k1 and v1, while the log holds “I came from Mars” followed by k1/v1]
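The behaviour in this example, a record that reaches the log but not the memtable, can be modeled with a toy write batch. The Python names `WriteBatch`, `put_log_data` and `write` below are illustrative stand-ins for the slide's PutLogData and Put calls, and the log is a plain list, not a file.

```python
class WriteBatch:
    def __init__(self):
        self.records = []

    def put(self, key, value):
        self.records.append(("put", key, value))

    def put_log_data(self, blob):
        # Annotation that is written to the log but never applied to the memtable.
        self.records.append(("log_data", blob, None))


def write(batch, log, memtable):
    """Append the whole batch to the log, then apply only the real updates."""
    log.append(list(batch.records))
    for kind, key, value in batch.records:
        if kind == "put":
            memtable[key] = value


log, memtable = [], {}
batch = WriteBatch()
batch.put_log_data("I came from Mars")
batch.put("k1", "v1")
write(batch, log, memtable)
print(log, memtable)
```

Because the annotation travels in the same batch as the update, a reader of the log stream sees the origin marker next to the record it describes, while queries against the store never see it.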
  • 82. Example: Pluggable SST format
    • One Facebook use case needs extremely fast responses but can tolerate some loss of durability
    • Quick hack: mount sst in tmpfs
    • Still not performant: the existing sst format is block based
    • Solution: a much simpler format that just stores sorted key/value pairs sequentially
      • no blocks, no caching, mmap the whole file
      • build an efficient lookup index on load
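A minimal sketch of such a format, under assumptions of my own: each pair is stored as two little-endian 32-bit lengths followed by the key and value bytes, the whole file is memory-mapped, and the lookup index built on load is a sorted key list searched with binary search. The real format and index are not described on the slide, so `write_plain_file` and `PlainReader` are purely illustrative.

```python
import bisect
import mmap
import os
import struct
import tempfile


def write_plain_file(path, items):
    """Store sorted key/value pairs back to back, each length-prefixed."""
    with open(path, "wb") as f:
        for key, value in sorted(items):
            f.write(struct.pack("<II", len(key), len(value)) + key + value)


class PlainReader:
    def __init__(self, path):
        self._file = open(path, "rb")
        # Map the whole file; there are no blocks and no separate cache.
        self._data = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._keys, self._offsets = [], []
        pos = 0
        while pos < len(self._data):        # one pass to build the index
            klen, vlen = struct.unpack_from("<II", self._data, pos)
            self._keys.append(self._data[pos + 8:pos + 8 + klen])
            self._offsets.append((pos + 8 + klen, vlen))
            pos += 8 + klen + vlen

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            return None
        start, vlen = self._offsets[i]
        return self._data[start:start + vlen]

    def close(self):
        self._data.close()
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "data.plain")
write_plain_file(path, [(b"k2", b"v2"), (b"k1", b"v1")])
reader = PlainReader(path)
print(reader.get(b"k1"))
```

Values are read straight out of the mapped file, so a lookup costs one index search plus a slice of the mapping.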
  • 83. Example: Blooms for MemTable
    • Same use case: after we optimized sst access, memtable lookup became a major cost in queries
    • Problem: Get has to go through memtable lookups that eventually return no data
    • Solution: just add a bloom filter to the memtable!
  • 87. RocksDB Read Path
    [Diagram labels: Memory; Persistent Storage; Read Request; Blooms (shown three times); Active MemTable; ReadOnly MemTable; log; Flush; LSM; sst files; Compaction]
  • 88. Example: Pluggable memtable format
    • Another Facebook use case has a distinct load phase during which no queries are issued
    • Problem: write throughput is limited by the single writer thread
    • Solution: a new memtable representation that does not keep keys sorted
  • 93. Example: Pluggable memtable format
    [Diagram labels: Memory; Persistent Storage; Write Request; Read Request; Unsorted MemTable; Switch; ReadOnly MemTable; log; Sort, Flush; LSM; sst files; Compaction]
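The unsorted memtable idea can be sketched as an append-only list that is sorted only once, at flush time. `UnsortedMemTable` is a hypothetical name, and the sketch ignores reads during the load phase, which the use case says do not occur.

```python
class UnsortedMemTable:
    """Toy memtable for a load phase: appends now, sorts only at flush."""

    def __init__(self):
        self._entries = []          # (key, value) in arrival order

    def put(self, key, value):
        self._entries.append((key, value))   # O(1), no ordering work

    def flush(self):
        """Sort once at flush time; the newest write for each key wins."""
        latest = {}
        for key, value in self._entries:
            latest[key] = value
        self._entries = []
        return sorted(latest.items())


table = UnsortedMemTable()
for key, value in [("c", 1), ("a", 2), ("b", 3), ("a", 4)]:
    table.put(key, value)
flushed = table.flush()
print(flushed)
```

Moving the ordering cost from every insert to a single sort at flush time shortens the work done on the writer thread, which is the bottleneck the slide names.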
  • 94. Possible workloads for RocksDB?
    ▪ Serving data to users via a website
    ▪ A spam detection backend that needs fast access to data
    ▪ A graph search query that needs to scan a dataset in real time
    ▪ Distributed configuration management systems
    ▪ Fast serving of Hive data
    ▪ A queue that needs a high rate of inserts and deletes
  • 97. Futures
    • Scale linearly with the number of CPUs
      • 32, 64 or higher core machines
      • ARM processors
    • Scale linearly with storage IOPS
      • Striped flash cards
      • RAM & NVRAM storage
  • 100. Come Hack with us
    • RocksDB is open sourced
    • http://rocksdb.org
    • Developers group: https://www.facebook.com/groups/rocksdb.dev/
    • Help us HACK RocksDB