Software Design for Persistent
Memory Systems
Howard Chu
CTO, Symas Corp. hyc@symas.com
2018-03-07
InfoQ.com: News & Community Site
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/nvram-systems
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
Personal Intro
● Howard Chu
– Founder and CTO Symas Corp.
– Developing Free/Open Source software since
1980s
● GNU compiler toolchain, e.g. "gmake -j", etc.
● Many other projects...
● I never use a software package without contributing to it
– Worked for NASA/JPL, wrote software for Space
Shuttle, etc.
Personal Intro
● Career Highlights
– 2011- Author of LMDB, world's smallest, fastest, and most
reliable embedded database engine
– 1998- Main developer of OpenLDAP, world's most
scalable distributed data store
– 1995 Author of PC-Enterprise/Mac, world's fastest
AppleTalk stack and AppleShare file server
– 1993 Author of faster-than-realtime speech recognition
using Motorola 68030
– 1991 Inventor of parallel make support in GNU make
Topics
● What is Persistent Memory?
● What system-level support exists?
● How do we leverage this in applications?
What is Persistent Memory
● Non-volatile, doesn't lose contents when system
is powered off
● Can be thought of as battery-backed DRAM
– billed as byte-addressable storage, but really is still
constrained to cacheline granularity
– being used as a new layer in system memory
hierarchy, between regular DRAM and secondary
storage (SSD, HDD)
– ideally, will replace regular DRAM completely
What is Persistent Memory
● STT-MRAM is the leading technology for now
– performance equivalent to DRAM
– endurance approaching DRAM (10^12 vs 10^15 writes)
– ST-DDR3, ST-DDR4 DIMMs available - drop-in compatible
with DDR3/DDR4
– Still lags in density, 256Mbit parts reaching market now
● Fabricated on 40nm process
● Compared to 8Gbit DDR4 DRAM chips already mainstream, on
10nm process
– Production on 22nm process expected later this year
What is Persistent Memory
● Other possibilities exist
– actual battery-backed DRAM DIMMs (BBU DIMM)
● offered up to 72 hours of persistence
● deprecated, no longer marketed
– Flash-backed DRAM DIMMs (NVDIMM)
● typically with a super-capacitor onboard
● copies DRAM to flash on system shutdown
● All of these are more expensive than regular
DRAM
System-Level Support
● Requires both BIOS and OS support
– POST must use non-destructive memory test, or
just skip memory test
– Kernel must recognize NV memory
– Linux kernel boot args can be used to explicitly
mark memory as persistent
– Current state of OS support is extremely primitive
System-Level Support
● Kernel treats persistent memory as a block device
– you can create a filesystem on top and use it as a glorified
RAMdisk
● Congratulations, welcome to the state of the art of 1986.
– you can use it as cache dedicated to a particular set of
devices
● using dm-cache, bcache, flashcache, etc.
● but these solutions are written for Flash SSDs, and aren't optimal
for persistent RAM
– current designs assume only a small subset of system
memory is persistent
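In practice, the block-device route above looks something like this; a sketch assuming a Linux kernel with pmem support, where the memmap boot parameter carves a region of RAM out as emulated persistent memory (the device name /dev/pmem0 and the mount point are illustrative):

```shell
# Kernel boot argument (setup, takes effect after reboot):
#   memmap=4G!12G   -> treat the 4 GiB starting at offset 12 GiB as pmem
# The kernel then exposes that region as a block device, /dev/pmem0.

# The "glorified RAMdisk" route: put a filesystem on it and mount with
# DAX, so loads and stores bypass the page cache and hit the media.
mkfs.ext4 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem
```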
System-Level Support
● Future support must account for systems with
100% persistent memory
– Kernel page cache manager must be modified to
utilize hot cache contents left by previous bootup
– "persistent memory" must become just "memory" -
used for system-wide device caching, instead of
isolated in its own block device
System-Level Support
● Whether system is 100% persistent RAM or not,
memory should be managed by kernel and not require
direct management at user level
– current usage as distinct block device requires a user to
manually manage it
● explicitly copy files to it
● when the space gets full the user must choose some files to
delete, in order to make room for new files
– instead, used as part of the system cache, the OS can
page data in and out as needed, without any user
intervention
Application Design
● Mindset
● Design Concepts
● Implementation Choices
● Other Details
– Concurrency Control
– Free Space Management
– Byte Addressability
● Endgame
Application Design
● Requires a different mindset
– Should not view "memory" and "storage" as distinct
concepts - must adopt "single-level store"
● Storage and RAM are interchangeable, via memory-
mapping
– Data structures that are intended to be persistent
must be written atomically - interruption of updates
must not leave corrupt or inconsistent states
– Avoid temptation to take "memory-only" / "main
memory" design approach
Application Design
● Problems with "main memory" approach
– A law of computing: data always grows to exceed
the size of available space
– There will always be larger/slower/cheaper memory
in addition to fast in-core memory: there will always
be a hierarchy of storage
– You must design for growth, and take this hierarchy
into account
Design Concepts
● Essentially, persistent data structures must
provide ACID transaction semantics
– persistent RAM gives Durability, implicitly
– the rest is up to you
● Atomicity can be actual, or effective
– Actual: you only support modifications that can be
performed with a single atomic update
– Effective: you use undo/redo logs to allow recovery
from interrupted updates
Design Concepts
● If you go for "effective atomicity" you'll need to have
complex locking mechanisms to protect intermediate
update states
● Once you go down the path of complex locking, you
also have to deal with deadlocks, backoffs, and retries
● All of this involves a great deal of additional code on
top of the actual data structure code
● Complex locking will not scale well across multiple
CPU sockets
Design Concepts
● If you use undo/redo logs you'll need to build a robust
crash detection mechanism, as well as a crash recovery
procedure to recover from incomplete transactions
● The undo log will also be needed to execute transaction
abort/rollback in normal (non-crashed) operation
● The log will be a central bottleneck in all write
operations
● Logs will need explicit management - pruning/etc
Design Concepts
● Better approach is to use MVCC (Multi-Version
Concurrency Control) with a single pointer to
the current version
– Once a new version has been constructed, a single
atomic write to the version pointer can be used to
make it visible
– Since each transaction operates on its own version
of the data structure, transactions have perfect
Isolation
Design Concepts
● Best solution, based on constraints so far:
– data structure must be storage oriented, for growth - not a
memory-only structure
– data structure must have atomic update visibility
● Use a B+tree
– inherently suited to caching, memory hierarchy
– using Copy-on-Write, can expose a new modification simply
by updating a pointer to the root of a new tree version
● a new update can be simply aborted/rolled back just by omitting
the pointer update, no undo/redo logs needed
Implementation
● Successful implementation requires explicit
control over memory layout of data structures
– structures must be CPU cacheline aligned, both for
performance and for integrity
– this precludes implementing in most higher level
languages
Implementation
● We're now clearly talking about a storage library
– there are a lot of details to manage, but they can be
hidden in a library
– written in a low level language
– should use something like C
● easily callable from any other language
● mature, portable, flexible
● direct control over memory layout
– allows identical layout for "in-memory" and "on-disk" representation
More Design Choices
● Multi-process concurrency, or just multi-thread?
– Multi-thread in a single process is simpler
● doesn't require shared memory for interprocess coordination
– Multi-process concurrency is more flexible
● allows administrative tools to query and operate regardless of whether
the main application is running
● Single-writer or multiple writer?
– Single-writer is simpler, eliminates possibility of deadlocks
– Multi-writer requires complex locking, conflict detection
● and still boils down to single-writer anyway, given the requirement of
atomic visibility
Implementation
● Use mmap to expose data to callers
– Use a read-only mmap, otherwise random
overwrites will be persisted, causing unrecoverable
corruption
– Pointers to data in map can be returned directly to
callers on data fetch requests, thus avoiding
expensive malloc/copy operations
● This requires that data values are always stored
contiguously, even if values are larger than B+tree page
size
Implementation
● Can optionally use writable mmap
– Opens a window to corruption vulnerability
– Requires explicit cache flush instructions, to ensure
writes are pushed from CPU cache out to RAM (if not
using msync)
– No performance benefit over readonly mmap
● writing a page requires that it first get faulted in, wasted effort if
the entire page is going to be overwritten
– May not be worth the cost in reliability and portability
● forcing a CPU cache flush is highly system-dependent
Concurrency Control
● Systems commonly offer reader/writer semantics
– 1 writer can operate exclusively, or arbitrary number
of readers
– writer and readers cannot operate simultaneously
● Done properly, an MVCC-based design allows
readers to run wait-free, taking no locks
– writer should be able to operate concurrently with
arbitrary number of readers
Free Space Management
● With MVCC, storage space rapidly fills up with
old/obsolete versions of data
● Most applications will have no use for the old
versions
● Reclaiming space from obsolete versions will
be critical for long term usability
● "Background" garbage collection (GC) is a
commonly practiced approach but is not viable
Free Space Management
● Background GC assumes there's always spare CPU and
I/O capacity
– GC can consume more CPU and I/O bandwidth than the actual
user workload
● which then leads to requiring complex runtime profiling and throttling
implementations
– Thus it will either require over-provisioning of system resources,
or GC will always cause user-visible pauses in processing
● Better to track page usage in foreground and reuse old
pages when they become available
– Yields consistent write throughput without any pauses
Free Space Management
● Tracking page availability has a direct impact on
concurrency
– Must record which readers are referencing which old versions,
to know which old versions can be purged/reclaimed
– Could just use a simple counter, recording the oldest version
still in use
● but accessing the counter becomes a bottleneck for readers
– Better to use an array with one slot per reader
● array slots must be cacheline aligned
● slots can be updated by readers and checked by writers without
taking any locks
Byte Addressability
● Highly touted feature of NVRAM-based storage
● Largely a red herring
– Can be useful for current RAMdisk-style approaches,
but these are evolutionary dead ends
– Eventually the industry will wake up to the fact that
reinventing reset-survivable RAMdisks was a waste of
time and money
– NVRAM will eventually be integral to the system cache,
and the system cache is necessarily page-based
Endgame
● Based on the given design constraints:
– atomicity, persistence, robustness, simplicity,
efficiency
– single-level store, blurring the line between memory
and storage
● You'll end up with something that looks a lot like
LMDB
LMDB Overview
● LMDB "Lightning Memory-Mapped Database"
● embedded key/value store implemented with a
B+tree
● as the name indicates, it uses memory mapped
data
– defaults to read-only mmap
– zero-copy reads: retrieved data points directly into
mmap
– zero-copy writes: optionally supports writable mmap
LMDB Overview
● full ACID transaction semantics
● MVCC concurrency control
– writers don't block readers, readers don't block
writers
– a pair of page pointers are used to point to the
current tree version
● single writer
– no need for callers to handle deadlocks or retries
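For reference, this is roughly what the caller's side looks like with LMDB's C API (requires liblmdb, compile with -llmdb; the ./testdb directory must already exist; error handling elided for brevity):

```c
#include <lmdb.h>

int main(void)
{
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key  = { 5, (void *)"hello" };
    MDB_val data = { 5, (void *)"world" };
    MDB_val out;

    mdb_env_create(&env);
    mdb_env_open(env, "./testdb", 0, 0664);

    /* Write txn: at most one at a time, so no deadlocks to handle. */
    mdb_txn_begin(env, NULL, 0, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);
    mdb_put(txn, dbi, &key, &data, 0);
    mdb_txn_commit(txn);                /* the atomic root flip */

    /* Read txn: wait-free; out.mv_data points straight into the mmap. */
    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_get(txn, dbi, &key, &out);      /* zero-copy: no malloc */
    mdb_txn_abort(txn);

    mdb_env_close(env);
    return 0;
}
```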
LMDB Overview
● No undo/redo logs
– Uses Copy-on-Write
– Intermediate tree states are never visible, cannot be corrupted
by system crashes
● No garbage collection
– space freed by a transaction is recorded in a 2nd B+tree living
in the same space
– writers reuse whatever available free space as needed
● No tuning or administrative overhead
– zero-config
LMDB Overview
● Unrivalled read performance on any hardware and any data volume
– 1 billion record DB, ~120GB, on HP DL585 G5 with 128GB RAM, 16 cores
– 16 read threads concurrent with 1 write thread
Summary
● Persistent RAM is approaching price parity with
regular DRAM, will be more common soon
● Current OS support is primitive and needs further
improvement
● If you enjoy low level programming, the design
constraints of writing an always-consistent data
structure may be interesting to explore
● Otherwise, just use LMDB and don't worry about it
Questions?
