A Technical Introduction to WiredTiger
info@wiredtiger.com
Presenter
•  Keith Bostic
•  Co-architect WiredTiger
•  Senior Staff Engineer MongoDB
Ask questions as we go, or
keith.bostic@mongodb.com
3
This presentation is not…
•  How to write stand-alone WiredTiger apps
– contact info@wiredtiger.com
•  How to configure MongoDB with WiredTiger for your workload
4
WiredTiger
•  Embedded database engine
– general purpose toolkit
– high performing: scalable throughput with low latency
•  Key-value store (NoSQL)
•  Schema layer
– data typing, indexes
•  Single-node
•  OO APIs
– Python, C, C++, Java
•  Open Source
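To make the API bullets above concrete, here is a minimal sketch of the embedded C API in the style of WiredTiger's standard access example: open a connection, create a table, and use a cursor to insert and scan. The home directory, table name and formats are illustrative, and error handling is omitted for brevity.

```c
#include <stdio.h>
#include <wiredtiger.h>

int
main(void)
{
    WT_CONNECTION *conn;
    WT_SESSION *session;
    WT_CURSOR *cursor;
    const char *key, *value;

    /* Open (or create) a database directory and get a session. */
    wiredtiger_open("WT_HOME", NULL, "create", &conn);
    conn->open_session(conn, NULL, NULL, &session);

    /* Create a table of string keys and string values. */
    session->create(session, "table:access", "key_format=S,value_format=S");

    /* Cursors are the unit of data access: insert a pair, then scan. */
    session->open_cursor(session, "table:access", NULL, NULL, &cursor);
    cursor->set_key(cursor, "key1");
    cursor->set_value(cursor, "value1");
    cursor->insert(cursor);

    cursor->reset(cursor);
    while (cursor->next(cursor) == 0) {
        cursor->get_key(cursor, &key);
        cursor->get_value(cursor, &value);
        printf("%s -> %s\n", key, value);
    }

    conn->close(conn, NULL);
    return (0);
}
```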
5
Deployments
•  Amazon AWS
•  ORC/Tbricks: financial trading solution
And, most important of all:
•  MongoDB: next-generation document store
You may have seen this:
or this…
8
MongoDB’s Storage Engine API
•  Allows different storage engines to "plug-in"
– different workloads have different performance characteristics
– mmapV1 is not ideal for all workloads
– more flexibility
•  mix storage engines on same replica set/sharded cluster
•  Opportunity to innovate further
– HDFS, encrypted, other workloads
•  WiredTiger is MongoDB’s general-purpose workhorse
Topics
Ø  WiredTiger Architecture
•  In-memory performance
•  Record-level concurrency
•  Compression
•  Durability and the journal
•  Future features
10
Motivation for WiredTiger
•  Traditional engines struggle with modern hardware:
– lots of CPU cores
– lots of RAM
•  Avoid thread contention for resources
– lock-free algorithms, for example, hazard pointers
– concurrency control without blocking
•  Hotter cache, more work per I/O
– big blocks
– compact file formats
11
WiredTiger Architecture
[Architecture diagram: the Python, C and Java APIs sit on top of a schema and cursor layer; inside the WiredTiger engine, transactions, snapshots and the cache drive row storage, column storage and page read/write, with logging and block management below; on disk, block management writes the database files and logging writes the log files.]
12
Column-store, LSM
•  Column-store
– implemented inside the B+tree
– 64-bit record-number keys
– keys are implicit: a record's key is determined by its position in the tree
– variable-length or fixed-length
•  LSM
– forest of B+trees (row-store or column-store)
– bloom filters (fixed-length column-store)
•  Mix-and-match
– sparse, wide table: column-store primary, LSM indexes
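As a rough sketch of how these storage choices are selected: the table type is chosen with session->create() configuration strings. The table names below are made up, and exact option spellings should be checked against the WiredTiger documentation for your version.

```c
#include <wiredtiger.h>

/* Given an open session, create tables with different underlying storage. */
static void
create_tables(WT_SESSION *session)
{
    /* Row store: ordered, variable-length keys and values. */
    session->create(session, "table:rows", "key_format=S,value_format=S");

    /* Variable-length column store: 64-bit record-number keys ("r"). */
    session->create(session, "table:cols", "key_format=r,value_format=S");

    /* Fixed-length column store: record-number keys, 8-bit values. */
    session->create(session, "table:bits", "key_format=r,value_format=8t");

    /* LSM: a forest of B+trees, with Bloom filters on the chunks. */
    session->create(session, "table:lsm",
        "type=lsm,key_format=S,value_format=S,lsm=(bloom=true)");
}
```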
Topics
ü  WiredTiger Architecture
Ø  In-memory performance
•  Record-level concurrency
•  Compression
•  Durability and the journal
•  Future features
14
Trees in cache
[Diagram: B+trees in cache; root pages point to internal pages, which point to leaf pages, all through ordinary in-memory pointers; a non-resident child is referenced in the tree but has not yet been read into cache.]
15
Hazard Pointers
[Diagram: the same in-cache trees; before using a page, a reader publishes a hazard pointer to it and issues a memory flush (step 1), so eviction threads can tell the page is in use.]
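A conceptual sketch of the publish-then-recheck pattern the diagram and the hazard-pointer idea imply, not WiredTiger's actual code: the thread-slot array, types and function names are invented for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 64
typedef struct page PAGE;

/* One hazard slot per thread, on non-shared cache lines in a real system. */
static _Atomic(PAGE *) hazard[MAX_THREADS];

/* Reader side: publish the page pointer, then re-check it is still in the
 * tree before dereferencing it; clear hazard[tid] again when done. */
static PAGE *
page_acquire(int tid, _Atomic(PAGE *) *slot_in_tree)
{
    for (;;) {
        PAGE *page = atomic_load(slot_in_tree);
        if (page == NULL)
            return NULL;                  /* not resident: caller must read it in */
        atomic_store(&hazard[tid], page); /* publish (acts as the memory flush) */
        if (atomic_load(slot_in_tree) == page)
            return page;                  /* still in the tree: safe to use */
        atomic_store(&hazard[tid], NULL); /* raced with eviction: retry */
    }
}

/* Eviction side bears the cost: scan all slots before discarding a page. */
static bool
page_can_evict(PAGE *page)
{
    for (int i = 0; i < MAX_THREADS; i++)
        if (atomic_load(&hazard[i]) == page)
            return false;
    return true;
}
```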
16
Pages in cache
[Diagram: data files on disk hold page images; in cache, a clean page is an on-disk page image plus an in-memory index, and a dirty page additionally carries a set of updates layered on top of its image.]
17
Skiplists
•  Updates stored in skiplists
– ordered linked lists with forward “skip” pointers
•  William Pugh, 1989
– Pugh's claims: simpler than balanced trees, as fast as binary search, less space
– in practice: roughly binary-search performance, helped by cache prefetch
– but more space than a packed, pre-existing data set
•  Implementation
– insert without locking
– forward/backward traversal without locking, while inserting
– removal requires locking
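A conceptual sketch of a lock-free skiplist insert, not WiredTiger's implementation: only the lowest level is shown, and the node type and helper are invented for illustration (duplicate keys are not handled). The point is that the new node is fully built before a single atomic compare-and-swap publishes it, so concurrent readers never see a partially constructed entry.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    const char *key;
    _Atomic(struct node *) next;   /* level-0 pointer only, for brevity */
} NODE;

static bool
skiplist_insert(_Atomic(NODE *) *head, const char *key)
{
    NODE *node = malloc(sizeof(NODE));
    if (node == NULL)
        return false;
    node->key = key;

    for (;;) {
        /* Walk to the insert position: prev points at the first key >= ours. */
        _Atomic(NODE *) *prev = head;
        NODE *cur = atomic_load(prev);
        while (cur != NULL && strcmp(cur->key, key) < 0) {
            prev = &cur->next;
            cur = atomic_load(prev);
        }

        /* Point the new node at its successor, then publish it atomically. */
        atomic_store(&node->next, cur);
        if (atomic_compare_exchange_strong(prev, &cur, node))
            return true;            /* readers now see the new node */
        /* Lost a race with a concurrent insert at the same spot: retry. */
    }
}
```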
18
In-memory performance
•  Cache trees/pages optimized for in-memory access
•  Follow pointers to traverse a tree
•  No locking to read or write
•  Keep updates separate from initial data
– updates are stored in skiplists
– updates are atomic in almost all cases
•  Do structural changes (eviction, splits) in background threads
Topics
ü  WiredTiger Architecture
ü  In-memory performance
Ø  Record-level concurrency
•  Compression
•  Durability and the journal
•  Future features
20
Multiversion Concurrency Control (MVCC)
•  Multiple versions of records maintained in cache
•  Readers see most recently committed version
– read-uncommitted or snapshot isolation available
– configurable per-transaction or per-handle
•  Writers can create new versions concurrent with readers
•  Concurrent updates to a single record cause write conflicts
– one of the updates wins
– other generally retries with back-off
•  No locking, no lock manager
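A sketch of what this looks like through the public transaction API, assuming a session and cursor opened as in the earlier example; the retry limit and back-off policy are illustrative, not recommendations.

```c
#include <unistd.h>
#include <wiredtiger.h>

/* Update a single record under snapshot isolation, retrying with back-off
 * if a concurrent writer wins the conflict (WT_ROLLBACK). */
static int
update_with_retry(WT_SESSION *session, WT_CURSOR *cursor,
    const char *key, const char *value)
{
    int ret;

    for (int attempt = 0;; attempt++) {
        session->begin_transaction(session, "isolation=snapshot");
        cursor->set_key(cursor, key);
        cursor->set_value(cursor, value);
        ret = cursor->update(cursor);
        if (ret == 0)
            return (session->commit_transaction(session, NULL));

        session->rollback_transaction(session, NULL);
        if (ret != WT_ROLLBACK || attempt >= 10)
            return (ret);              /* real error, or give up retrying */
        usleep(1000u << attempt);      /* back off before retrying */
    }
}
```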
21
Pages in cache
[Diagram: the same cache layout as before, with the dirty page's updates now shown held in a skiplist.]
22
MVCC In Action
[Diagram: an on-disk page image with its in-memory index; successive updates (update1, update2, …), each tagged with a transaction ID and value, are chained in front of the on-page value; a reader's snapshot decides which entry in the chain, or the on-page value itself, is visible.]
Topics
ü  WiredTiger Architecture
ü  In-memory performance
ü  Record-level concurrency
Ø  Compression
•  Durability and the journal
•  Future features
24
Block manager
•  Block allocation
– fragmentation
– allocation policy
•  Checksums
– block compression is at a higher level
•  Checkpoints
– involved in durability guarantees
•  Opaque address cookie
– stored as internal page key’s “value”
•  Pluggable
25
Write path
[Diagram: the same cache layout; when a dirty page is written, its on-disk image and in-memory updates are reconciled during the write into new page images, which go to newly allocated space in the data files.]
26
In-memory Compression
•  Prefix compression
– index keys usually have a common prefix
– rolling, computed per block; keys may need to be instantiated in memory for search performance
•  Huffman/static encoding
– burns CPU
•  Dictionary lookup
– single value per page
•  Run-length encoding
– column-store values
27
On-disk Compression
•  Compression algorithms:
– snappy [default]: good compression, low overhead
– LZ4: good compression, low overhead, better page layout
– zlib: better compression, high overhead
– pluggable
•  Optional
– compressing filesystem instead
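A sketch of how these options are typically expressed as session->create() configuration (block_compressor, prefix_compression, dictionary); availability of a given compressor depends on how WiredTiger was built, and the values shown are illustrative.

```c
#include <wiredtiger.h>

/* Given an open session, create a table with on-disk block compression plus
 * the in-memory encodings from the previous slide (illustrative values). */
static void
create_compressed_table(WT_SESSION *session)
{
    session->create(session, "table:docs",
        "key_format=S,value_format=u,"
        "block_compressor=snappy,"    /* or lz4 / zlib, if built in */
        "prefix_compression=true,"    /* shared key prefixes */
        "dictionary=1000");           /* per-page dictionary, up to 1000 entries */
}
```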
28
Compression in Action
[Chart: compression results for the flights database.]
Topics
ü  WiredTiger Architecture
ü  In-memory performance
ü  Record-level concurrency
ü  Compression
Ø  Durability and the journal
•  Future features
30
Journal and Recovery
•  Write-ahead logging (aka journal) enabled by default
•  Only written at transaction commit
– only write redo records
•  Log records are compressed
•  Group commit for concurrency
•  Automatic log archival / removal
– bounded by checkpoint frequency
•  On startup, find a consistent checkpoint in the metadata
– use the checkpoint to figure out how much to roll forward
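A sketch of enabling the journal at the connection level through wiredtiger_open(); the log and checkpoint configuration groups exist in the public API, but the specific values here are illustrative.

```c
#include <wiredtiger.h>

/* Open a connection with the write-ahead log (journal) enabled, log records
 * compressed, and a background checkpoint every 60 seconds, which also
 * bounds how much log must be kept and replayed at startup. */
static int
open_with_journal(const char *home, WT_CONNECTION **connp)
{
    return (wiredtiger_open(home, NULL,
        "create,"
        "log=(enabled=true,compressor=snappy),"
        "checkpoint=(wait=60)",
        connp));
}
```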
31
Durability without Journaling
•  MongoDB’s MMAP storage requires the journal for consistency
– running with “nojournal” is unsafe
•  WiredTiger is a no-overwrite data store
– with “nojournal”, updates since the last checkpoint may be lost
– data will still be consistent
– checkpoints every N seconds by default
•  Replication can guarantee durability
– the network is generally faster than disk I/O
Topics
ü  WiredTiger Architecture
ü  In-memory performance
ü  Record-level concurrency
ü  Compression
ü  Durability and the journal
Ø  Future features
33
34
What’s next for WiredTiger?
•  Our Big Year of Tuning
– applications doing “interesting” things
– stalls during checkpoints with 100GB+ caches
– MongoDB capped collections
•  Encryption
•  Advanced transactional semantics
– updates not stable until confirmed by replica majority
35
WiredTiger LSM support
•  Random insert workloads
•  Data set much larger than cache
•  Query performance less important
•  Background maintenance overhead acceptable
•  Bloom filters
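For such workloads, an LSM tree is created and tuned through the lsm configuration group; the sizes and Bloom-filter parameters below are illustrative, not recommendations.

```c
#include <wiredtiger.h>

/* Given an open session, create an LSM tree for a random-insert,
 * larger-than-cache workload; Bloom filters let point lookups skip
 * chunks that cannot contain the key. */
static void
create_lsm_table(WT_SESSION *session)
{
    session->create(session, "table:ingest",
        "type=lsm,key_format=S,value_format=u,"
        "lsm=(chunk_size=100MB,bloom=true,"
        "bloom_bit_count=16,bloom_hash_count=8)");
}
```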
36
37
Benchmarks
Mark Callaghan at Facebook: http://smalldatum.blogspot.com/
Thanks!
Questions?
Keith Bostic
keith.bostic@mongodb.com


Editor's Notes

  • #3: Feel free to ask questions as we go, hopefully there will be a few minutes for Q&A at the end. Also happy to discuss by email, let me know how we can help.
  • #5: Build a toolkit: one path is to build special-purpose engines to handle specific workloads, another path is to handle complex/changing workloads.
  • #8: This kind of positive feedback isn’t common in engineering groups.
  • #10: The structure for the rest of the talk.
  • #11: Traditional storage engine designs struggle with modern hardware. I/O, relative to memory, is a worse trade than ever before: trade CPU for I/O wherever possible. Big block I/O: if we have to do I/O, bring in a lot of data.
  • #12: Moderately complex inside. Outside: APIs, handle + method. A top-level schema layer, where every table and its associated indexes are defined. Operations are transactionally protected, implemented in terms of in-memory snapshots. Operations go to a row-store engine (ordered key/value pairs) or a column-store engine (the key is a 64-bit record number). Block management is intended to be pluggable itself. On disk, key/value pair files and log files.
  • #13: From now on, I’m going to mostly be talking about “trees” in-memory, without distinguishing what type of tree – here’s the overview, after which it’s just a page in-memory.
  • #14: WiredTiger focuses on in-memory performance: I/O means you’ve lost the war, you’re only trying to retreat gracefully.
  • #15: WiredTiger does have root pages and internal pages, with leaf pages at the bottom: binary search of each page yields the child page for the subsequent search. Importantly, pointers in memory are not disk offsets (in many engines, in-memory objects find each other using disk offsets, so, for example, a transition from an internal page to a leaf page means handing the cache subsystem a disk offset and getting back the in-memory pointer, possibly after a read). The WiredTiger in-memory tree is exactly that: an in-memory optimized tree. This is good (fast in-memory operations), this is bad (we have a translation step after reading, and before writing, disk blocks). We knew we wanted to modify our in-memory representation without changing our on-disk format (avoid upgrades!), and we knew a lot of our compression algorithms would require translation before writing anyway (for example, our on-disk pages have no indexing information, which saves about 15-20% of the disk footprint in some workloads).
  • #16: To make this efficient, pointers need to be protected: once data is larger than cache, there needs to be a check to ensure a pointer is valid. There's a background thread doing eviction of pages. Hazard pointers can be thought of as micro-logging. A reader or writer publishes the memory address of a page it wants on a non-shared cache line; after that publish, if the pointer is still valid, it can proceed. Eviction threads must check those locations to ensure a page is not currently in use: eviction bears the burden, readers/writers go fast. Design principle: application threads never wait; shift work from application threads to system-internal threads.
  • #17: Different in-memory representation vs. on-disk – opportunities to optimize plus some interesting features. Read the on-disk page into cache, generally as a read-only image. Add indexing information (for binary search) on top of that image. Updates are layered on top of that image, including new key/value pairs inserted between existing keys. If the page grows too large, background threads will deal with it. Lots of magic in traversal: threads must go back and forth between the original image and the updated image. Writing a page: combine the previous page image with in-memory changes; if the result is too big, split; allocate new space in the file; always write multiples of the allocation size.
  • #19: To summarize: note we haven't talked about locking at all: application threads can retrieve and update data without ever acquiring a lock. Justin Levandoski's Bw-tree work: they've avoided taking pages out of circulation during splits, interesting.
  • #21: Readers don’t block writers, writers don’t block readers, writers don’t block writers, again, no locks have been acquired.
  • #22: If there have been no updates, then it’s easy, the on-page item is the correct item for any query. If there are updates, each update has associated with it a transaction ID, and that transaction ID combined with the transaction’s snapshot, determines the correct value.
  • #23: The index references the original page image; once updates are installed, readers/writers have to check for updates. The first update in the list is generally the one we want; if it's not yet visible, other updates are checked. If no update is visible, the value on the original page must be the one we want. All updates are done with atomic operations, swapping a new pointer into place, so readers run concurrently with updates. Writing the page to disk requires processing all of this information and determining the values to write.
  • #26: Page-write transforms the in-memory version: selecting values to write, page-splitting, all sorts of compression, checksums. Checkpoints are simply another "snapshot reader", so they run concurrently with other readers and writers. Writing a page: combine the previous page image with in-memory changes; allocate new space in the file; if the result is too big, split; always write multiples of the allocation size. Direct I/O can be configured to keep reads and writes out of the filesystem cache.
  • #27: All of these apply in-memory and on disk, so we save both on disk and in the cache.
  • #28: In the same code paths, we compress the data.
  • #29: WiredTiger's file formats are compact, and certain types of compression cannot be turned off: WiredTiger "without compression" is already around 50% of the size.
  • #31: Storage engines are all about not losing your stuff. Pretty standard WAL implementation: before a commit is visible, a log record with all of the changes in the transaction has been flushed to stable storage. Group commit: concurrent log writes are done with a single storage flush. Started with "Scalability of write-ahead logging on multicore and multisocket hardware" by Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis and Anastasia Ailamaki, and there's lots more engineering there.
  • #32: Checkpoints move from one stable point to another one. Our goal was to build a single-node engine, and for that reason, we had to run without a log without losing durability.
  • #35: Lookaside tables.
  • #37: A very large data set over time. Blue: a btree tails off over time as the probability of a page being found in cache decreases (that's why the random nature of the insert matters). Red/green are flatter: only maintain the recent updates in cache, and merge the updates in the background. LSM is write-optimized, though: the reason the btree is primary is that read-mostly workloads generally behave better in a btree than in LSM. What we want to do eventually is enable the conversion of an LSM tree into a btree (if you think of a forest of btrees collapsing into a single btree, that matches nicely with the typical workload of inserting a lot of data and then processing that data). Ideally, we'd also be able to reverse that process on demand.