Making the case for write-optimized database algorithms
Mark Callaghan
Member of Technical Staff, Facebook
RocksDB
• Embedded key-value storage engine
MyRocks
• RocksDB storage engine for MySQL
• Coming to Percona Server and MariaDB Server
MongoRocks
• RocksDB storage engine for MongoDB
• In Percona Server today
RocksDB, MyRocks & MongoRocks
• Good read efficiency
• Better write efficiency
• Best space efficiency
RocksDB, MyRocks & MongoRocks
• Use less SSD
• Use lower endurance SSD
In some cases, better read efficiency is possible
Efficient performance is the goal
Benchmarketing?
Sysbench read-write, in-memory
[Chart: MyRocks vs InnoDB; y-axis 0 to 200000, x-axis 1 to 128]
Better is not just about throughput
               tpmC    iostat rKB/t   iostat wKB/t   vmstat CPU/t   Size (GB)   p99 response (ms)
MyRocks+zlib   95680   2.19           2.02           528            82          29.9
InnoDB         91981   2.67           7.49           400            222         18.4

tpcc-mysql, 1000 warehouses, IO-bound
• QPS - average throughput
• QoS - worst case throughput
• Efficiency - hardware per query
Peak performance is overrated
Performance
Performance & Efficiency
Meet performance goals
Then optimize for efficiency
RUM or RWS?
• RUM = Read, Update, Memory
• RWS = Read, Write, Space
Efficiency
• Synonym for amplification
• Related to, but not equivalent to, performance.
An algorithm can’t be optimal for all of read, write & space amplification
• See Designing Access Methods: The RUM Conjecture
• daslab.seas.harvard.edu/rum-conjecture
RUM Conjecture
For any useful database algorithm there exists another
useful algorithm that has better read, write or space
amplification.
Define: optimal
Read, Write
• physical work per logical request
Space
• sizeof(database files) / sizeof(data)
Define: amplification
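The definitions on this slide can be turned into a small calculation. The sketch below is illustrative only: the counter names and the sample numbers are hypothetical and are not measurements from the talk. It treats storage reads as the "physical work" for reads and bytes written as the "physical work" for writes, which is one possible choice among the work metrics listed on the next slide.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hypothetical counters for one workload run (all names are assumptions)."""
    logical_reads: int          # read requests issued by the application
    logical_bytes_written: int  # bytes the application asked to write
    storage_reads: int          # physical read operations sent to storage
    storage_bytes_written: int  # bytes actually written to storage
    database_file_bytes: int    # size of the database files on disk
    data_bytes: int             # size of the live user data


def read_amplification(c: Counters) -> float:
    # Physical work (storage reads) per logical read request.
    return c.storage_reads / c.logical_reads


def write_amplification(c: Counters) -> float:
    # Physical bytes written per logical byte written.
    return c.storage_bytes_written / c.logical_bytes_written


def space_amplification(c: Counters) -> float:
    # sizeof(database files) / sizeof(data)
    return c.database_file_bytes / c.data_bytes


# Made-up sample values, chosen only to exercise the formulas.
sample = Counters(
    logical_reads=1_000,
    logical_bytes_written=1_000_000,
    storage_reads=1_500,
    storage_bytes_written=8_000_000,
    database_file_bytes=110 * 2**30,
    data_bytes=100 * 2**30,
)
print(read_amplification(sample))
print(write_amplification(sample))
print(space_amplification(sample))
```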
Storage
• random operations
• bytes read
• bytes written
Define: work
CPU
• blocks (un)compressed
• memory/cache operations
• hash searches
• key comparisons
• CPU seconds
Basic operations:
• point read
• range read
• put
• delete
Complex operations:
• query
• transaction
Define: operation
Update is one of:
• put
• point read, put
• point read, delete, put
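The three forms of update can be sketched against a toy in-memory key-value store. This is only an illustration: the function names are hypothetical, and reading the third form as "the update changes the key" is an assumption, not something the slide states.

```python
from typing import Callable, Dict, Optional

# Toy store standing in for a storage engine (assumption for illustration).
store: Dict[str, int] = {}


def point_read(key: str) -> Optional[int]:
    return store.get(key)


def put(key: str, value: int) -> None:
    store[key] = value


def delete(key: str) -> None:
    store.pop(key, None)


def blind_update(key: str, value: int) -> None:
    # Form 1: put. The old value is never read.
    put(key, value)


def read_modify_write(key: str, fn: Callable[[Optional[int]], int]) -> None:
    # Form 2: point read, put. The new value depends on the old one.
    put(key, fn(point_read(key)))


def move_key(old_key: str, new_key: str) -> None:
    # Form 3: point read, delete, put. Here the entry is re-keyed.
    value = point_read(old_key)
    if value is not None:
        delete(old_key)
        put(new_key, value)


blind_update("a", 1)
read_modify_write("a", lambda v: (v or 0) + 1)
move_key("a", "b")
print(store)
```

The forms differ in how many basic operations each update costs, which is why the definition of "update" matters when comparing write efficiency.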
Everything is relative
                  tpmC    iostat rKB/t   iostat wKB/t   vmstat CPU/t   Size (GB)
MyRocks+zlib      95680   2.19           2.02           528            82
InnoDB            91981   2.67           7.49           400            222
InnoDB / MyRocks  0.96    1.22           3.71           0.76           2.71

tpcc-mysql, 1000 warehouses, IO-bound
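The ratio row can be reproduced directly from the two measured rows. The short sketch below uses only the numbers on this slide; the variable names are made up for the example.

```python
# Values copied from the tpcc-mysql results on this slide.
metrics = ["tpmC", "rKB/t", "wKB/t", "CPU/t", "Size (GB)"]
myrocks = [95680, 2.19, 2.02, 528, 82]
innodb = [91981, 2.67, 7.49, 400, 222]

# InnoDB / MyRocks for each metric, rounded to two decimals.
ratios = {name: round(i / m, 2) for name, m, i in zip(metrics, myrocks, innodb)}

for name, ratio in ratios.items():
    print(f"{name:10s} InnoDB/MyRocks = {ratio:.2f}")
```

A ratio above 1 means InnoDB used more of that resource per transaction (or, for tpmC, delivered more throughput) than MyRocks in this test.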
Efficiency: B-Tree
Example: InnoDB
Read
• logN key compares
Write
• sizeof(page) / sizeof(row)
Space
• 1.5X if leaf pages are 2/3 full
Update-in-Place B-Tree
• InnoDB
Copy-on-Write B-Tree
• WiredTiger
• LMDB
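The three B-Tree estimates above can be written as simple functions. This is a back-of-the-envelope sketch, not a model of InnoDB: it assumes base-2 logarithms, assumes a whole page is written back for each changed row, and uses a 16 KB page with a 128-byte row purely as example inputs.

```python
import math


def btree_key_compares(n_rows: int) -> float:
    # Read: about log2(N) key comparisons to find one row.
    return math.log2(n_rows)


def btree_write_amp(page_bytes: int, row_bytes: int) -> float:
    # Write: sizeof(page) / sizeof(row), assuming one changed row per page write.
    return page_bytes / row_bytes


def btree_space_amp(leaf_fill_fraction: float) -> float:
    # Space: pages that are only partly full waste the remainder.
    return 1.0 / leaf_fill_fraction


print(btree_key_compares(2**30))        # comparisons for about a billion rows
print(btree_write_amp(16 * 1024, 128))  # 16 KB page, 128-byte row (assumed sizes)
print(btree_space_amp(2 / 3))           # leaf pages 2/3 full
```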
Efficiency: leveled LSM
Example: RocksDB, Cassandra

Read
• logN + log(N/10) + log(N/100) + log(N/1000) key compares
• point reads can use bloom filter
Write
• rewrite previously written rows
• worse than size-tiered LSM, better than B-Tree
Space
• 1.1X
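The read-cost expression for a leveled LSM can be evaluated and compared with the single log N term of a B-Tree. The sketch assumes base-2 logarithms, a fanout of 10 between levels, and four levels, matching the four terms in the formula above; it does not model bloom filters or caching.

```python
import math


def leveled_lsm_key_compares(n_rows: int, fanout: int = 10, levels: int = 4) -> float:
    # log(N) + log(N/10) + log(N/100) + log(N/1000): one search per level,
    # with each smaller level holding 1/fanout of the rows of the next larger one.
    return sum(math.log2(n_rows / fanout**i) for i in range(levels))


def btree_key_compares(n_rows: int) -> float:
    return math.log2(n_rows)


n = 1_000_000_000  # example row count (assumption)
lsm = leveled_lsm_key_compares(n)
btree = btree_key_compares(n)
print(f"leveled LSM: {lsm:.1f} compares, B-Tree: {btree:.1f} compares")
```

Under these assumptions the LSM does several times more key comparisons per lookup, which is the CPU cost the bloom filter helps to avoid for point reads.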
Efficiency: size-tiered LSM
Example: RocksDB, Cassandra, HBase
Read
• more than leveled LSM
• point reads can use bloom filter
Write
• rewrite previously written rows
• better than leveled LSM, better than B-Tree
Space
• ~2X, worse than leveled LSM
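Both LSM slides note that point reads can use a bloom filter. The sketch below is a generic textbook Bloom filter, written to show why it lets a point read skip a sorted run that cannot contain the key; it is not the RocksDB implementation, and the class name, sizing, and hashing scheme are assumptions.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes) -> Iterator[int]:
        # Double hashing: derive num_hashes bit positions from two 64-bit values.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key: bytes) -> bool:
        # False means the key is definitely absent, so the run can be skipped.
        # True means the key may be present (false positives are possible).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# One filter per sorted run; 10 bits per key and 7 hashes are example settings.
keys = [f"user:{i}".encode() for i in range(1000)]
run_filter = BloomFilter(num_bits=10 * len(keys), num_hashes=7)
for k in keys:
    run_filter.add(k)

print(run_filter.may_contain(b"user:42"))
```

A filter never reports a stored key as absent, so skipping a run on a negative answer is always safe; the cost is extra memory per key, which is part of the read versus space trade-off discussed later.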
Efficiency: summary
                  Read             Write   Space
B-Tree            best             -       good
leveled LSM       good for point   good    best
size-tiered LSM   -                best    -
Theory meets practice
Access distribution
• LSM benefits from skew
Cache
• B-Tree - prefer to have index in cache
• LSM - prefer to have all but largest level in cache

IO costs are hard to predict
Linkbench, IO-bound
               TPS     iostat r/t   iostat wKB/t   CPU usecs/t   Size (GB)   p99 update (ms)
MyRocks+zlib   28965   1.03         1.25           999           374         1
InnoDB         21474   1.16         19.70          914           14xx        6
InnoDB+zlib    20734   1.07         14.59          1199          880         6
MyRocks: best throughput & QoS, most efficient
Space efficiency
• Fragmentation
• Fixed page size
• More per-row metadata
• No prefix encoding (InnoDB)
Why did RocksDB beat a B-Tree?
Write efficiency
• Uses more space = more data to write
• Working set larger than cache
• sizeof(page) / sizeof(row)
• Double write buffer (InnoDB)
Page size & write amplification
               Page size   TPS     iostat wKB/t
MyRocks+zlib   16 KB       28965   1.25
InnoDB         4 KB        24845   6.13
InnoDB         8 KB        24352   10.52
InnoDB         16 KB       21414   19.70
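The effect of page size on write volume can be read off the numbers on this slide. The sketch below uses only those values; the variable names are made up for the example.

```python
# wKB/t values copied from this slide.
myrocks_wkb_per_t = 1.25
innodb_wkb_per_t = {4: 6.13, 8: 10.52, 16: 19.70}  # page size in KB -> wKB/t

# KB written per transaction by InnoDB relative to MyRocks+zlib.
relative_to_myrocks = {
    page_kb: round(wkb / myrocks_wkb_per_t, 1)
    for page_kb, wkb in innodb_wkb_per_t.items()
}

# Growth in InnoDB wKB/t each time the page size doubles.
growth_per_doubling = {
    "4 KB -> 8 KB": round(innodb_wkb_per_t[8] / innodb_wkb_per_t[4], 2),
    "8 KB -> 16 KB": round(innodb_wkb_per_t[16] / innodb_wkb_per_t[8], 2),
}

print(relative_to_myrocks)
print(growth_per_doubling)
```

In these results each doubling of the page size increases the KB written per transaction by somewhat less than 2X, in line with the sizeof(page) / sizeof(row) estimate being an upper bound on the page-write component.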
Advantage B-Tree
• Fewer key comparisons
• Less IO for range queries
Read efficiency: B-Tree vs LSM?
Advantage LSM
• Uses less space = more data in cache
• Prefix key encoding when uncompressed
• Efficient writes save IO for reads
• Read-free index maintenance
• Bloom filter
Performance is complex
Save on writes, spend more on reads
             MyRocks+zlib TPS   InnoDB TPS   Ratio (MyRocks / InnoDB)
Disk array   2195               414          5.3
Slow SSD     23484              10143        2.3
Fast SSD     28965              21414        1.4
Read versus Write efficiency
• Indexes - more & wider
Read versus Space efficiency
• Bloom filters
• Compression
• Indexes
Write versus Space efficiency
• RocksDB fanout
• Size-tiered vs leveled
• GC & defragmentation frequency
Trading between R, W and S efficiency
Write versus space amplification
[Chart: Space vs write amplification for a log-based algorithm. x-axis: space amplification (1.11, 1.25, 1.33, 1.67, 2); y-axis: write amplification (0 to 10)]
• space amplification = 100 / %full
• write amplification = 100 / (100 - %full)
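The two formulas can be tabulated to show the trade-off. The %full values below are chosen (as an assumption) so that the resulting space amplification matches the tick labels of the chart: 1.11, 1.25, 1.33, 1.67, and 2.

```python
def space_amplification(pct_full: float) -> float:
    # space amplification = 100 / %full
    return 100 / pct_full


def write_amplification(pct_full: float) -> float:
    # write amplification = 100 / (100 - %full)
    return 100 / (100 - pct_full)


print("%full  space-amp  write-amp")
for pct_full in (90, 80, 75, 60, 50):
    sa = space_amplification(pct_full)
    wa = write_amplification(pct_full)
    print(f"{pct_full:5d}  {sa:9.2f}  {wa:9.2f}")
```

Keeping the log 90% full costs about 10 bytes written per byte of new data, while letting it run 50% full cuts that to 2 at the price of doubling the space used.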
One size doesn’t fit all
• B-Tree + LSM sharing one redo log
Adaptive algorithms
• DBA sets high-level goals
• Algorithm adapts to achieve them
More open source
• MyRocks in MariaDB Server & Percona Server
• MongoRocks in Percona Server
• More features in RocksDB
What comes next?
More performance results are coming
• YCSB
• sysbench
• time series
• bulk load
• tpcc-mysql
rocksdb.org
mongorocks.org
github.com/facebook/mysql-5.6
Thank you
smalldatum.blogspot.com
twitter.com/markcallaghan
