A Technical Introduction to WiredTiger
info@wiredtiger.com
Presenter
•  Keith Bostic
•  Co-architect WiredTiger
•  Senior Staff Engineer MongoDB
Ask questions as we go, or
keith.bostic@mongodb.com
3
This presentation is not…
•  How to write stand-alone WiredTiger apps
– contact info@wiredtiger.com
•  How to configure MongoDB with WiredTiger for your workload
4
WiredTiger
•  Embedded database engine
– general purpose toolkit
– high performing: scalable throughput with low latency
•  Key-value store (NoSQL)
•  Schema layer
– data typing, indexes
•  Single-node
•  OO APIs
– Python, C, C++, Java
•  Open Source
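To make the API bullets above concrete, here is a minimal sketch of the embedded C API in the style of WiredTiger's standard access example: open a connection, create a table, and use a cursor to insert and scan. The home directory, table name and formats are illustrative, and error handling is omitted for brevity.

```c
#include <stdio.h>
#include <wiredtiger.h>

int
main(void)
{
    WT_CONNECTION *conn;
    WT_SESSION *session;
    WT_CURSOR *cursor;
    const char *key, *value;

    /* Open (or create) a database directory and get a session. */
    wiredtiger_open("WT_HOME", NULL, "create", &conn);
    conn->open_session(conn, NULL, NULL, &session);

    /* Create a table of string keys and string values. */
    session->create(session, "table:access", "key_format=S,value_format=S");

    /* Cursors are the unit of data access: insert a pair, then scan. */
    session->open_cursor(session, "table:access", NULL, NULL, &cursor);
    cursor->set_key(cursor, "key1");
    cursor->set_value(cursor, "value1");
    cursor->insert(cursor);

    cursor->reset(cursor);
    while (cursor->next(cursor) == 0) {
        cursor->get_key(cursor, &key);
        cursor->get_value(cursor, &value);
        printf("%s -> %s\n", key, value);
    }

    conn->close(conn, NULL);
    return (0);
}
```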
5
Deployments
•  Amazon AWS
•  ORC/Tbricks: financial trading solution
And, most important of all:
•  MongoDB: next-generation document store
You may have seen this:
or this…
8
MongoDB’s Storage Engine API
•  Allows different storage engines to "plug-in"
– different workloads have different performance characteristics
– mmapV1 is not ideal for all workloads
– more flexibility
•  mix storage engines on same replica set/sharded cluster
•  Opportunity to innovate further
– HDFS, encrypted, other workloads
•  WiredTiger is MongoDB’s general-purpose workhorse
Topics
Ø  WiredTiger Architecture
•  In-memory performance
•  Record-level concurrency
•  Compression
•  Durability and the journal
•  Future features
10
Motivation for WiredTiger
•  Traditional engines struggle with modern hardware:
– lots of CPU cores
– lots of RAM
•  Avoid thread contention for resources
– lock-free algorithms, for example, hazard pointers
– concurrency control without blocking
•  Hotter cache, more work per I/O
– big blocks
– compact file formats
11
WiredTiger Architecture
[Architecture diagram: the Python, C and Java APIs sit on top of a schema and cursor layer; inside the WiredTiger engine, transactions, snapshots and the cache drive row storage, column storage and page read/write, with logging and block management below; on disk, block management writes the database files and logging writes the log files.]
12
Column-store, LSM
•  Column-store
– implemented inside the B+tree
– 64-bit record-number keys
– keys are implicit: a record's key is determined by its position in the tree
– variable-length or fixed-length
•  LSM
– forest of B+trees (row-store or column-store)
– bloom filters (fixed-length column-store)
•  Mix-and-match
– sparse, wide table: column-store primary, LSM indexes
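As a rough sketch of how these storage choices are selected: the table type is chosen with session->create() configuration strings. The table names below are made up, and exact option spellings should be checked against the WiredTiger documentation for your version.

```c
#include <wiredtiger.h>

/* Given an open session, create tables with different underlying storage. */
static void
create_tables(WT_SESSION *session)
{
    /* Row store: ordered, variable-length keys and values. */
    session->create(session, "table:rows", "key_format=S,value_format=S");

    /* Variable-length column store: 64-bit record-number keys ("r"). */
    session->create(session, "table:cols", "key_format=r,value_format=S");

    /* Fixed-length column store: record-number keys, 8-bit values. */
    session->create(session, "table:bits", "key_format=r,value_format=8t");

    /* LSM: a forest of B+trees, with Bloom filters on the chunks. */
    session->create(session, "table:lsm",
        "type=lsm,key_format=S,value_format=S,lsm=(bloom=true)");
}
```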
Topics
ü  WiredTiger Architecture
Ø  In-memory performance
•  Record-level concurrency
•  Compression
•  Durability and the journal
•  Future features
14
Trees in cache
[Diagram: B+trees in cache; root pages point to internal pages, which point to leaf pages, all through ordinary in-memory pointers; a non-resident child is referenced in the tree but has not yet been read into cache.]
15
Hazard Pointers
[Diagram: the same in-cache trees; before using a page, a reader publishes a hazard pointer to it and issues a memory flush (step 1), so eviction threads can tell the page is in use.]
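A conceptual sketch of the publish-then-recheck pattern the diagram and the hazard-pointer idea imply, not WiredTiger's actual code: the thread-slot array, types and function names are invented for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 64
typedef struct page PAGE;

/* One hazard slot per thread, on non-shared cache lines in a real system. */
static _Atomic(PAGE *) hazard[MAX_THREADS];

/* Reader side: publish the page pointer, then re-check it is still in the
 * tree before dereferencing it; clear hazard[tid] again when done. */
static PAGE *
page_acquire(int tid, _Atomic(PAGE *) *slot_in_tree)
{
    for (;;) {
        PAGE *page = atomic_load(slot_in_tree);
        if (page == NULL)
            return NULL;                  /* not resident: caller must read it in */
        atomic_store(&hazard[tid], page); /* publish (acts as the memory flush) */
        if (atomic_load(slot_in_tree) == page)
            return page;                  /* still in the tree: safe to use */
        atomic_store(&hazard[tid], NULL); /* raced with eviction: retry */
    }
}

/* Eviction side bears the cost: scan all slots before discarding a page. */
static bool
page_can_evict(PAGE *page)
{
    for (int i = 0; i < MAX_THREADS; i++)
        if (atomic_load(&hazard[i]) == page)
            return false;
    return true;
}
```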
16
Pages in cache
[Diagram: data files on disk hold page images; in cache, a clean page is an on-disk page image plus an in-memory index, and a dirty page additionally carries a set of updates layered on top of its image.]
17
Skiplists
•  Updates stored in skiplists
– ordered linked lists with forward “skip” pointers
•  William Pugh, 1989
– Pugh's claims: simpler than balanced trees, as fast as binary search, less space
– in practice: roughly binary-search performance, helped by cache prefetch
– but more space than a packed, pre-existing data set
•  Implementation
– insert without locking
– forward/backward traversal without locking, while inserting
– removal requires locking
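A conceptual sketch of a lock-free skiplist insert, not WiredTiger's implementation: only the lowest level is shown, and the node type and helper are invented for illustration (duplicate keys are not handled). The point is that the new node is fully built before a single atomic compare-and-swap publishes it, so concurrent readers never see a partially constructed entry.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    const char *key;
    _Atomic(struct node *) next;   /* level-0 pointer only, for brevity */
} NODE;

static bool
skiplist_insert(_Atomic(NODE *) *head, const char *key)
{
    NODE *node = malloc(sizeof(NODE));
    if (node == NULL)
        return false;
    node->key = key;

    for (;;) {
        /* Walk to the insert position: prev points at the first key >= ours. */
        _Atomic(NODE *) *prev = head;
        NODE *cur = atomic_load(prev);
        while (cur != NULL && strcmp(cur->key, key) < 0) {
            prev = &cur->next;
            cur = atomic_load(prev);
        }

        /* Point the new node at its successor, then publish it atomically. */
        atomic_store(&node->next, cur);
        if (atomic_compare_exchange_strong(prev, &cur, node))
            return true;            /* readers now see the new node */
        /* Lost a race with a concurrent insert at the same spot: retry. */
    }
}
```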
18
In-memory performance
•  Cache trees/pages optimized for in-memory access
•  Follow pointers to traverse a tree
•  No locking to read or write
•  Keep updates separate from initial data
– updates are stored in skiplists
– updates are atomic in almost all cases
•  Do structural changes (eviction, splits) in background threads
Topics
ü  WiredTiger Architecture
ü  In-memory performance
Ø  Record-level concurrency
•  Compression
•  Durability and the journal
•  Future features
20
Multiversion Concurrency Control (MVCC)
•  Multiple versions of records maintained in cache
•  Readers see most recently committed version
– read-uncommitted or snapshot isolation available
– configurable per-transaction or per-handle
•  Writers can create new versions concurrent with readers
•  Concurrent updates to a single record cause write conflicts
– one of the updates wins
– other generally retries with back-off
•  No locking, no lock manager
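A sketch of what this looks like through the public transaction API, assuming a session and cursor opened as in the earlier example; the retry limit and back-off policy are illustrative, not recommendations.

```c
#include <unistd.h>
#include <wiredtiger.h>

/* Update a single record under snapshot isolation, retrying with back-off
 * if a concurrent writer wins the conflict (WT_ROLLBACK). */
static int
update_with_retry(WT_SESSION *session, WT_CURSOR *cursor,
    const char *key, const char *value)
{
    int ret;

    for (int attempt = 0;; attempt++) {
        session->begin_transaction(session, "isolation=snapshot");
        cursor->set_key(cursor, key);
        cursor->set_value(cursor, value);
        ret = cursor->update(cursor);
        if (ret == 0)
            return (session->commit_transaction(session, NULL));

        session->rollback_transaction(session, NULL);
        if (ret != WT_ROLLBACK || attempt >= 10)
            return (ret);              /* real error, or give up retrying */
        usleep(1000u << attempt);      /* back off before retrying */
    }
}
```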
21
Pages in cache
[Diagram: the same cache layout as before, with the dirty page's updates now shown held in a skiplist.]
22
MVCC In Action
[Diagram: an on-disk page image with its in-memory index; successive updates (update1, update2, …), each tagged with a transaction ID and value, are chained in front of the on-page value; a reader's snapshot decides which entry in the chain, or the on-page value itself, is visible.]
Topics
ü  WiredTiger Architecture
ü  In-memory performance
ü  Record-level concurrency
Ø  Compression
•  Durability and the journal
•  Future features
24
Block manager
•  Block allocation
– fragmentation
– allocation policy
•  Checksums
– block compression is at a higher level
•  Checkpoints
– involved in durability guarantees
•  Opaque address cookie
– stored as internal page key’s “value”
•  Pluggable
25
Write path
[Diagram: the same cache layout; when a dirty page is written, its on-disk image and in-memory updates are reconciled during the write into new page images, which go to newly allocated space in the data files.]
26
In-memory Compression
•  Prefix compression
– index keys usually have a common prefix
– rolling, computed per block; keys may need to be instantiated in memory for search performance
•  Huffman/static encoding
– burns CPU
•  Dictionary lookup
– single value per page
•  Run-length encoding
– column-store values
27
On-disk Compression
•  Compression algorithms:
– snappy [default]: good compression, low overhead
– LZ4: good compression, low overhead, better page layout
– zlib: better compression, high overhead
– pluggable
•  Optional
– compressing filesystem instead
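A sketch of how these options are typically expressed as session->create() configuration (block_compressor, prefix_compression, dictionary); availability of a given compressor depends on how WiredTiger was built, and the values shown are illustrative.

```c
#include <wiredtiger.h>

/* Given an open session, create a table with on-disk block compression plus
 * the in-memory encodings from the previous slide (illustrative values). */
static void
create_compressed_table(WT_SESSION *session)
{
    session->create(session, "table:docs",
        "key_format=S,value_format=u,"
        "block_compressor=snappy,"    /* or lz4 / zlib, if built in */
        "prefix_compression=true,"    /* shared key prefixes */
        "dictionary=1000");           /* per-page dictionary, up to 1000 entries */
}
```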
28
Compression in Action
[Chart: compression results for the flights database.]
Topics
ü  WiredTiger Architecture
ü  In-memory performance
ü  Record-level concurrency
ü  Compression
Ø  Durability and the journal
•  Future features
30
Journal and Recovery
•  Write-ahead logging (aka journal) enabled by default
•  Only written at transaction commit
– only write redo records
•  Log records are compressed
•  Group commit for concurrency
•  Automatic log archival / removal
– bounded by checkpoint frequency
•  On startup, find a consistent checkpoint in the metadata
– use the checkpoint to figure out how much to roll forward
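A sketch of enabling the journal at the connection level through wiredtiger_open(); the log and checkpoint configuration groups exist in the public API, but the specific values here are illustrative.

```c
#include <wiredtiger.h>

/* Open a connection with the write-ahead log (journal) enabled, log records
 * compressed, and a background checkpoint every 60 seconds, which also
 * bounds how much log must be kept and replayed at startup. */
static int
open_with_journal(const char *home, WT_CONNECTION **connp)
{
    return (wiredtiger_open(home, NULL,
        "create,"
        "log=(enabled=true,compressor=snappy),"
        "checkpoint=(wait=60)",
        connp));
}
```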
31
Durability without Journaling
•  MongoDB’s MMAP storage requires the journal for consistency
– running with “nojournal” is unsafe
•  WiredTiger is a no-overwrite data store
– with “nojournal”, updates since the last checkpoint may be lost
– data will still be consistent
– checkpoints every N seconds by default
•  Replication can guarantee durability
– the network is generally faster than disk I/O
Topics
ü  WiredTiger Architecture
ü  In-memory performance
ü  Record-level concurrency
ü  Compression
ü  Durability and the journal
Ø  Future features
33
34
What’s next for WiredTiger?
•  Our Big Year of Tuning
– applications doing “interesting” things
– stalls during checkpoints with 100GB+ caches
– MongoDB capped collections
•  Encryption
•  Advanced transactional semantics
– updates not stable until confirmed by replica majority
35
WiredTiger LSM support
•  Random insert workloads
•  Data set much larger than cache
•  Query performance less important
•  Background maintenance overhead acceptable
•  Bloom filters
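For such workloads, an LSM tree is created and tuned through the lsm configuration group; the sizes and Bloom-filter parameters below are illustrative, not recommendations.

```c
#include <wiredtiger.h>

/* Given an open session, create an LSM tree for a random-insert,
 * larger-than-cache workload; Bloom filters let point lookups skip
 * chunks that cannot contain the key. */
static void
create_lsm_table(WT_SESSION *session)
{
    session->create(session, "table:ingest",
        "type=lsm,key_format=S,value_format=u,"
        "lsm=(chunk_size=100MB,bloom=true,"
        "bloom_bit_count=16,bloom_hash_count=8)");
}
```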
36
37
Benchmarks
Mark Callaghan at Facebook: http://smalldatum.blogspot.com/
Thanks!
Questions?
Keith Bostic
keith.bostic@mongodb.com


Editor's Notes

  • #3: Feel free to ask questions as we go, hopefully there will be a few minutes for Q&A at the end. Also happy to discuss by email, let me know how we can help.
  • #5: Build a toolkit: one path is to build special-purpose engines to handle specific workloads, another path is to handle complex/changing workloads.
  • #8: This kind of positive feedback isn’t common in engineering groups.
  • #10: The structure for the rest of the talk.
  • #11: Traditional storage engine designs struggle with modern hardware. I/O, relative to memory, is a worse trade than ever before: trade CPU for I/O wherever possible. Big block I/O: if we have to do I/O, bring in a lot of data.
  • #12: Moderately complex inside. Outside: APIs, handle + method. A top-level schema layer, where every table and its associated indexes are defined. Operations are transactionally protected, implemented in terms of in-memory snapshots. Operations go to a row-store engine (ordered key/value pairs) or a column-store engine (the key is a 64-bit record number). Block management is intended to be pluggable itself. On disk, key/value pair files and log files.
  • #13: From now on, I’m going to mostly be talking about “trees” in-memory, without distinguishing what type of tree – here’s the overview, after which it’s just a page in-memory.
  • #14: WiredTiger focuses on in-memory performance: I/O means you’ve lost the war, you’re only trying to retreat gracefully.
  • #15: WiredTiger does have root pages and internal pages, with leaf pages at the bottom: binary search of each page yields the child page for the subsequent search. Importantly, pointers in memory are not disk offsets (in many engines, in-memory objects find each other using disk offsets, so, for example, a transition from an internal page to a leaf page means handing the cache subsystem a disk offset and getting back the in-memory pointer, possibly after a read). The WiredTiger in-memory tree is exactly that: an in-memory optimized tree. This is good (fast in-memory operations), this is bad (we have a translation step after reading, and before writing, disk blocks). We knew we wanted to modify our in-memory representation without changing our on-disk format (avoid upgrades!), and we knew a lot of our compression algorithms would require translation before writing anyway (for example, our on-disk pages have no indexing information, which saves about 15-20% of the disk footprint in some workloads).
  • #16: To make this efficient, pointers need to be protected: once data is larger than cache, there needs to be a check to ensure a pointer is valid. There's a background thread doing eviction of pages. Hazard pointers can be thought of as micro-logging. A reader or writer publishes the memory address of a page it wants on a non-shared cache line; after that publish, if the pointer is still valid, it can proceed. Eviction threads must check those locations to ensure a page is not currently in use: eviction bears the burden, readers/writers go fast. Design principle: application threads never wait; shift work from application threads to system-internal threads.
  • #17: Different in-memory representation vs. on-disk – opportunities to optimize plus some interesting features. Read the on-disk page into cache, generally as a read-only image. Add indexing information (for binary search) on top of that image. Updates are layered on top of that image, including new key/value pairs inserted between existing keys. If the page grows too large, background threads will deal with it. Lots of magic in traversal: threads must go back and forth between the original image and the updated image. Writing a page: combine the previous page image with in-memory changes; if the result is too big, split; allocate new space in the file; always write multiples of the allocation size.
  • #19: To summarize: note we haven't talked about locking at all: application threads can retrieve and update data without ever acquiring a lock. Justin Levandoski's Bw-tree work: they've avoided taking pages out of circulation during splits, interesting.
  • #21: Readers don’t block writers, writers don’t block readers, writers don’t block writers, again, no locks have been acquired.
  • #22: If there have been no updates, then it’s easy, the on-page item is the correct item for any query. If there are updates, each update has associated with it a transaction ID, and that transaction ID combined with the transaction’s snapshot, determines the correct value.
  • #23: The index references the original page image; once updates are installed, readers/writers have to check for updates. The first update in the list is generally the one we want; if it's not yet visible, other updates are checked. If no update is visible, the value on the original page must be the one we want. All updates are done with atomic operations, swapping a new pointer into place, so readers run concurrently with updates. Writing the page to disk requires processing all of this information and determining the values to write.
  • #26: Page-write transforms the in-memory version: selecting values to write, page-splitting, all sorts of compression, checksums. Checkpoints are simply another "snapshot reader", so they run concurrently with other readers and writers. Writing a page: combine the previous page image with in-memory changes; allocate new space in the file; if the result is too big, split; always write multiples of the allocation size. Direct I/O can be configured to keep reads and writes out of the filesystem cache.
  • #27: All of these apply in-memory and on disk, so we save both on disk and in the cache.
  • #28: In the same code paths, we compress the data.
  • #29: WiredTiger's file formats are compact, and certain types of compression cannot be turned off: WiredTiger "without compression" is already around 50% of the size.
  • #31: Storage engines are all about not losing your stuff. Pretty standard WAL implementation: before a commit is visible, a log record with all of the changes in the transaction has been flushed to stable storage. Group commit: concurrent log writes are done with a single storage flush. Started with "Scalability of write-ahead logging on multicore and multisocket hardware" by Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis and Anastasia Ailamaki, and there's lots more engineering there.
  • #32: Checkpoints move from one stable point to another one. Our goal was to build a single-node engine, and for that reason, we had to run without a log without losing durability.
  • #35: Lookaside tables.
  • #37: A very large data set over time. Blue: a btree tails off over time as the probability of a page being found in cache decreases (that's why the random nature of the insert matters). Red/green are flatter: only maintain the recent updates in cache, and merge the updates in the background. LSM is write-optimized, though: the reason the btree is primary is that read-mostly workloads generally behave better in a btree than in LSM. What we want to do eventually is enable the conversion of an LSM tree into a btree (if you think of a forest of btrees collapsing into a single btree, that matches nicely with the typical workload of inserting a lot of data and then processing that data). Ideally, we'd also be able to reverse that process on demand.