NVDIMM Overview
Technology, Linux, and Xen
What are NVDIMMs?
• A standard for allowing NVRAM to be exposed as normal memory

• Potential to dramatically change the way software is written
But..
• They have a number of surprising problems to solve

• Particularly with respect to Xen

• Incomplete specifications and confusing terminology
Goals
• Give an overview of NVDIMM concepts, terminology, and architecture

• Introduce some of the issues they bring up with respect to operating systems

• Introduce some of the issues faced with respect to Xen

• Propose a “sketch” of a design solution
Current Technology
• Currently available: NVDIMM-N

• DRAM with flash + capacitor backup

• Strictly more expensive than a normal
DIMM with the same storage capacity
Coming soon: Terabytes?
[Diagram: page tables and the SPA → DPA mapping]
Terminology
• Physical DIMM

• DPA: DIMM Physical Address

• SPA: System Physical Address
PMEM: RAM-like access
• DRAM-like access

• 1-1 mapping between DPA and SPA

• Interleaved across DIMMs

• Similar to RAID-0

• Better performance

• Worse reliability
[Diagram: an SPA range striped across each DIMM's DPA 0–$N]
PBLK: Disk-like access
• Disk-like access

• Control region

• 8 KiB data “window”

• One window per NVDIMM device

• Never interleaved

• Useful for software RAID

• Also useful when SPA space < NVRAM size

• 48 address bits (256TiB)

• 46 physical bits (64TiB)
[Diagram: PBLK data window mapping a portion of DPA 0–$N into SPA]
How is this mapping set up?
• Firmware sets up the mapping at boot

• May be modifiable using BIOS / vendor-specific tool

• Exposes information via ACPI
ACPI: Devices
• NVDIMM Root Device

• Expect one per system

• NVDIMM Device

• One per NVDIMM

• Information about size, manufacturer, &c
ACPI: NFIT Table
• NVDIMM Firmware Interface Table

• PMEM information:

• SPA ranges

• Interleave sets

• PBLK information

• Control regions

• Data window regions
Practical issues for using NVDIMM
• How to partition up the NVRAM and share it between operating systems

• Knowing the correct way to access each area (PMEM / PBLK)

• Detecting when interleaving / layout has changed
Dividing things up: Namespaces
• “Namespace”: Think partition

• PMEM namespace and interleave sets

• PBLK namespaces 

• “Type UUID” to define how it’s used

• think DOS partitions: “Linux ext2/3/4”,
“Linux Swap”, “NTFS”, &c
How do we store namespace information?
• Reading via PMEM and PBLK will give you different results

• PMEM Interleave sets may change across reboots

• So we can’t store it inside the visible NVRAM
Label area: Per-NVDIMM storage
• One “label area” per NVDIMM device (aka
physical DIMM)

• Label describes a single contiguous DPA
region 

• Namespaces made out of labels

• Accessed via ACPI AML methods

• Pure read / write
[Diagram: label area stored on each NVDIMM alongside its NVRAM]
How an OS Determines Namespaces
• Read the ACPI NFIT to determine:

• How many NVDIMMs you have

• Where PMEM is mapped

• Read the label area for each NVDIMM

• Piece together the namespaces described

• Double-check them against the interleave sets (from the NFIT table)

• Access PMEM regions by offsets into SPA ranges (from the NFIT table)

• Access PBLK regions by programming control / data windows (from the NFIT table)
[Diagram: labels on each NVDIMM assembled into namespaces within SPA]
XPDDS18: NVDIMM Overview - George Dunlap, Citrix
Key points
• “Namespace”: Partition

• “Label area”: Partition table

• NVDIMM devices / SPA ranges / etc defined in ACPI static NFIT table

• Label area accessed via ACPI AML methods
NVDIMMs in Linux
ndctl
• Create / destroy namespaces

• Four modes 

• raw

• sector

• fsdax

• devdax
The ideal interface
• Guest processes map a normal file, and magically get permanent storage
The obvious solution
• Make a namespace into a block device

• Put a normal partition inside the namespace

• Put a normal filesystem inside a partition

• Have mmap() map to the system memory directly
Issue: Sector write atomicity
• Disk sector writes are atomic: all-or-nothing

• memcpy() can be interrupted in the middle

• Block Translation Table (BTT): an abstraction that guarantees write
atomicity

• ‘sector mode’

• But this means mmap() needs a separate buffer
Issue: Page struct
• To keep track of userspace mappings, Linux needs a ‘page struct’

• 64 bytes per 4k page

• 1 TiB of PMEM requires 16 GiB of page structs (1/64 of capacity)

• Solution: Use PMEM to store a ‘page struct’

• Use a superblock to designate areas of the namespace to be used for this
purpose (allocated on namespace creation)
Issue: Filesystems and block location
• Filesystems want to be able to move blocks around

• Difficult interactions with the write() system call and with DMA
Issue: Interactions with the page cache
Mode summary: Raw
• Block mode access to full SPA range

• No support for namespaces

• Therefore, no UUIDs / superblocks; page structs must be stored in main
memory

• Supports DAX
Mode summary: Sector
• Block mode with BTT for sector atomicity

• Supports namespaces

• No DAX / direct mmap() support
Mode summary: fsdax
• Block mode access to a namespace

• Supports page structs in main memory, or in the namespace

• Must be chosen at time of namespace creation

• Supports filesystems with DAX

• But there’s some question about the safety of this
Mode summary: devdax
• Character device access to namespace

• Does not support filesystems

• page structs must be contained within the namespace

• Supports mmap()

• “No interaction with kernel page cache”

• Character devices don’t have a “size”, so you have to remember the namespace size yourself
Summary
• Four ways of mapping with different advantages and disadvantages

• It seems clear that Linux is still figuring out how best to use PMEM
NVDIMM, Xen, and dom0
Issue: Xen and AML
• Reading the label areas can only be done via AML

• Xen cannot do AML

• The ACPI spec requires that only a single entity execute AML

• That must be domain 0
Issue: struct page
• In order to track mappings to guests, the hypervisor needs a struct page

• “frametable”

• 32 or 40 bytes per page
Issue: RAM vs MMIO
• Dom0 is free to map any SPA

• RAM: Page reference counts taken

• MMIO: No reference counts taken

• If we “promote” NVDIMM SPA ranges from MMIO to RAM, existing dom0
mappings won’t have a reference count
PMEM promotion, continued
• Three basic options:

• 1a: Trust that dom0 has unmapped before promotion

• 1b: Check to make sure that dom0 unmapped before promotion

• 2: Automatically take appropriate reference counts on promotion

• Checking / converting seem about equivalent:

• A: Keep track of all unpromoted PMEM regions in dom0 pagetables

• B: Brute force search of dom0 pagetables (PV) or p2m table (PVH)
Solution: Where to get PMEM for frame table
• “Manually” set-aside namespaces

• Namespace with custom Xen UUID

• Allocated from within a namespace (like Linux)
Solution: Promoting NVRAM
• Hypercall to allow dom0 to “promote” specific PMEM regions (probably full namespaces)

• Allow dom0 to specify two types of “scratch” PMEM for frame tables

• PMEM tied to specific other PMEM regions

• PMEM generally available for any PMEM region
Proposal: Promoting PMEM mappings
• Keep a list of all unpromoted PMEM mappings, automatically increase
reference counts on promotion

• Alternately:

• For PV: Require unmapping without checking

• For PVH: Require dom0 to map PMEM into a single contiguous p2m
range
NVDIMM, Xen, and domUs
(Virtual NVDIMMs)
Issue: Guest compatibility
• PVH is our “next generation” guest type

• Requiring QEMU for this is a last resort
PMEM only, no interleaving
• Don’t bother with virtual PBLK

• Each vNVDIMM will be exposed as a single non-interleaved chunk with its
own label area
Issue: Guest label area
• Each vNVDIMM needs two distinct chunks (data and label area)

• To begin with, specify two files / devices

• Eventually: Both contained in a single file (w/ metadata?)
Issue: How to read label areas
• Expose via ACPI?

• Map label areas into “secret” section of p2m

• Implement AML which reads and writes this secret section

• Expose via PV interface?
Questions?
