Intel® Optane™ Data Center Persistent Memory
Architecture (Jane) and Performance (Lily)
Presenters: Lily Looi, Jianping Jane Xu
Co-Authors: Asher Altman, Mohamed Arafa, Kaushik Balasubramanian, Kai Cheng, Prashant Damle, Sham Datta, Chet
Douglas, Kenneth Gibson, Benjamin Graniello, John Grooms, Naga Gurumoorthy, Ivan Cuevas Escareno, Tiffany
Kasanicky, Kunal Khochare, Zhiming Li, Sreenivas Mandava, Rick Mangold, Sai Muralidhara, Shamima Najnin, Bill Nale,
Jay Pickett, Shekoufeh Qawami, Tuan Quach, Bruce Querbach, Camille Raad, Andy Rudoff, Ryan Saffores, Ian Steiner,
Muthukumar Swaminathan, Shachi Thakkar, Vish Viswanathan, Dennis Wu, Cheng Xu
08/19/2019
Notices & Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks. Configurations on slides 18 and 20.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.

Performance results are based on testing as of Feb. 22, 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product or component can be absolutely secure.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

*Other names and brands may be claimed as property of others. Intel, the Intel logo, Xeon, the Xeon logo, Optane, and the Optane logo are trademarks of Intel Corporation in the United States and other countries. © Intel Corporation.
1. Intel® Optane™ DC Persistent Memory Architecture

A Breakthrough with a New Interface Protocol, Memory Controller, Media, and Software Stack
Memory-Storage Gap

[Figure: memory-storage hierarchy pyramid and access-distribution curve]
- CPU caches (core L1/L2, LLC): pico- to nano-second access
- Memory sub-system (DRAM, hot tier): 10s of GB, <100 nanosecs
- SSD (warm tier, e.g. Intel® 3D NAND SSD): 10s of TB, <100 microsecs
- Network storage, HDD/tape (cold tier): 10s of TB, <100 millisecs
- Access distribution: hot data is accessed more often, cooler data less often; between the memory sub-system and storage sits the memory-storage gap
Close the Memory-Storage Gap

[Figure: the same hierarchy, with two new tiers filling the gap]
- Goal: optimize performance given a cost and power budget; move data closer to compute; maintain persistency
- CPU caches (core L1/L2, LLC): pico- to nano-second access
- Memory sub-system (DRAM, hot tier): 10s of GB, <100 nanosecs
- New memory-like tier: 100s of GB, <1 microsec
- New storage-like tier: 1s of TB, <10 microsecs
- SSD (warm tier, e.g. Intel® 3D NAND SSD): 10s of TB, <100 microsecs
- Network storage, HDD/tape (cold tier): 10s of TB, <100 millisecs
Intel® Optane™ Media Technology

- Cell: high resistivity = '0', low resistivity = '1'
- Attributes: non-volatile; potentially fast write; high density; non-destructive fast read; low voltage; integrable with logic; bit alterable
- Cross-point structure: selectors allow dense packing and individual access to bits
- Scalable: memory layers can be stacked in a 3D manner
- Breakthrough material advances: compatible switch and memory cell materials
- High performance: cell and array architecture that can switch fast
- First-generation capacities: 128 GB, 256 GB, 512 GB
Intel® Optane™ DC Persistent Memory Module Architecture

Key features (numbered callouts in the module block diagram):
1. DQ buffers present a single load to the host
2. Host SMBus: SPD visible to the CPU; the Optane controller provides thermal-sensor (TSOD) functionality
3. Address Indirection Table
4. Integrated PMIC controlled by the Optane controller
5. On-DIMM firmware storage
6. On-DIMM power-fail safe with auto-detection
[Diagram: DDR4 connector carrying the data, command/address, and SMBus/I2C buses; DQ buffers; Optane controller; Optane media devices on the NVM bus; SPD; PMIC and PMIC bus; flash interface for firmware; separate power rails for NVM, memory controller, and DQ buffer/logic]
Intel® Optane™ DC Persistent Memory Controller Architecture

[Block diagram: DDR4 slot on host CPU → DCPMM memory interface → controller logic → Optane™ media interface → media channels → Optane™ media devices]
Controller blocks:
- Address mapping logic with an address-mapping cache
- Scheduler with read queue and write queue; refresh engine
- Encrypt/decrypt with key management and DRNG
- ECC/scrambler and error-handling logic
- Media management; power & thermal management microcontroller
- Capacitors ("caps for flushes") to complete in-flight writes on power loss
Intel® Optane™ DC Persistent Memory SW Enabling Stack

[Diagram: persistent-memory software stack, user space over kernel space]
- Management path: a management UI and management library sit on top of the generic NVDIMM driver
- Storage path: applications use the standard file API through a file system, or standard raw device access, both served by the generic NVDIMM driver in kernel space
- Persistent-memory path: applications use the standard file API on a pmem-aware file system; direct access ("DAX") maps file memory into the application through MMU mappings, giving user-space load/store access with no kernel code in the data path
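In practice, the DAX path boils down to a few system calls. A minimal sketch (mine, not from the deck), assuming Linux with a DAX-capable file system (e.g. ext4 or XFS mounted with -o dax) and a hypothetical file path:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Older headers may lack these flags; values match the Linux UAPI. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

int main(void)
{
    /* Hypothetical file on a pmem-aware (DAX) file system. */
    int fd = open("/mnt/pmem/example", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, 4096) != 0)
        return 1;

    /* MAP_SYNC (Linux 4.15+): page tables stay synchronous with the file
     * metadata, so flushing CPU caches is enough to persist stores; no
     * page cache sits between the application and the media. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    strcpy(p, "hello, persistent memory");  /* plain store: byte-addressable */
    msync(p, 4096, MS_SYNC);                /* portable flush to durability  */

    munmap(p, 4096);
    close(fd);
    return 0;
}
```

With MAP_SYNC, a store followed by a cache flush is durable; msync() is the portable way to force that flush from ordinary application code.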
Key Feature Deep Dive: ADR

Asynchronous DRAM Refresh (ADR), the power-fail protection feature:
1. AC power loss de-asserts PWROK
2. Platform logic then asserts ADR_Trigger
3. The PCH starts the programmable ADR timer
4. The PCH asserts the SYNC message
5. The PCU in the processor detects the SYNC message bit and sends AsyncSR to the memory controller (MC)
6. The MC flushes the write pending queue (WPQ)
7. After the ADR timer expires, the PCH asserts the ADR_COMPLETE pin
[Diagram: the platform power supply and platform logic drive PWROK and ADR_Trigger into the PCH (ADR timer, ADR_SYNC, ADR_GPIO, ADR_Complete); inside the CPU, the PCU signals the MC, whose WPQ drains to the Intel® Optane™ DC PMM; the core L1/L2/L3 caches sit above the MC]
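From the software side, ADR means durability is reached once a store leaves the CPU caches; steps 1-7 above guarantee the WPQ drains to media on power loss. A minimal sketch (illustrative, not from the deck) of that flush step using the CLWB and SFENCE intrinsics:

```c
#include <immintrin.h>  /* _mm_clwb, _mm_sfence (compile with -mclwb) */
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Illustrative sketch: push dirty lines out of the CPU caches into the
 * ADR-protected domain. Once the stores reach the memory controller's
 * write pending queue (WPQ), ADR drains them to Optane media even on
 * AC power loss. */
static void flush_to_adr_domain(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clwb((void *)p);  /* write back the line; may keep it cached  */
    _mm_sfence();             /* order flushes before any later stores    */
}
```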
Memory Mode

• Large memory capacity
• No software/application changes required
  • To mimic traditional memory, data is "volatile"
  • Volatile-mode key cleared and regenerated every power cycle
• DRAM is "near memory"
  • Used as a write-back cache
  • Managed by the host memory controller
  • Caching operates within the same host memory controller, not across controllers
  • The ratio of far/near memory (PMEM/DRAM) can vary
• Overall latency
  • Same as DRAM for a cache hit
  • DC persistent memory + DRAM for a cache miss
[Diagram: cores with L1/L2 caches and a shared L3 issue MOV instructions to the memory controller; DRAM acts as near memory in front of persistent-memory far memory]
App Direct Mode: Persistent Memory

• PMEM-aware software/application required
• Adds a new tier between DRAM and block storage (SSD/HDD)
• Industry open-standard programming model and Intel PMDK
• In-place persistence
  • No paging, context switching, interrupts, nor kernel code in the data path
• Byte addressable like memory
  • Load/store access, no page caching
  • Cache coherent
  • Ability to do DMA & RDMA
• Minimum required power-fail-protected domain: the memory subsystem; SW makes sure data is flushed to the durability domain using CLFLUSHOPT or CLWB
[Diagram: a core with L1/L2 and L3 CPU caches issues MOV to the memory controller, which fronts both DRAM memory and persistent memory; the controller's WPQ lies inside the power-fail-protected domain]
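For application code, PMDK's libpmem wraps the mapping and CLFLUSHOPT/CLWB flushing described above. A small sketch under assumed paths (link with -lpmem; /mnt/pmem/pool is a hypothetical pool file):

```c
#include <libpmem.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map a hypothetical pool file, creating it if needed. */
    char *addr = pmem_map_file("/mnt/pmem/pool", 4096, PMEM_FILE_CREATE,
                               0644, &mapped_len, &is_pmem);
    if (addr == NULL)
        return 1;

    const char msg[] = "durable update";
    memcpy(addr, msg, sizeof(msg));

    if (is_pmem)
        pmem_persist(addr, sizeof(msg)); /* user-space flush + fence       */
    else
        pmem_msync(addr, sizeof(msg));   /* fall back to msync on non-pmem */

    pmem_unmap(addr, mapped_len);
    return 0;
}
```

pmem_persist() performs the user-space flush-and-fence sequence entirely without kernel involvement; the is_pmem check keeps the same code correct on ordinary storage.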
2. Performance

Intel® Optane™ DC Persistent Memory for larger data, better performance/$, and new paradigms
Intel® Optane™ DC Persistent Memory Latency

• Roughly 1000x lower latency than the NAND NVMe SSD shown
• Note: 4K granularity gives about the same performance as 256B
[Chart: read idle latency at smaller granularity (vs. 4K) for Intel® Optane™ DC persistent memory, Intel Optane DC SSD P4800X, and Intel DC P4610 NVMe SSD; lower is better]
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Intel® Optane™ DC Persistent Memory Latency

Read idle latency ranges from 180ns to 340ns (vs. DRAM ~70ns). Performance can vary based on:
• 64B random vs. 256B granularity
• Read/write mix
• Power level (programmable 12-18W; graph is at 18W)
[Chart: read idle latency at smaller granularity (vs. 4K), with curves for Read (64B), Read (256B), Read/Write (64B), and Read/Write (256B), alongside Intel Optane DC SSD P4800X and Intel DC P4610 NVMe SSD]
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Memory Mode Transaction Flow

[Diagram: two panels, "DRAM cache hit" and "DRAM cache miss"; in each, a core's request passes through the CPU caches to the memory controller, which checks near-memory DRAM and, on a miss, also fetches from far memory]
• Good locality means near-DRAM performance
  • Cache hit: latency same as DRAM
  • Cache miss: latency of DRAM + Intel® Optane™ DC persistent memory (see the worked estimate below)
• Performance varies by workload; the best workloads have these traits:
  • Good locality, for a high DRAM cache hit rate
  • Low memory bandwidth demand
• Other factors:
  • #reads > #writes
  • Configured memory size vs. workload size
https://software.intel.com/en-us/articles/prepare-for-the-next-generation-of-memory
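To make the hit/miss bullets concrete, here is a back-of-envelope latency model (my illustration; the ~70ns DRAM and 180-340ns persistent-memory figures come from the latency slides above):

```latex
% h = DRAM cache hit rate, L_D = DRAM latency, L_P = persistent-memory latency
L_{avg} = h\,L_D + (1-h)\,(L_D + L_P) = L_D + (1-h)\,L_P

% Example with L_D \approx 70\,\mathrm{ns},
% L_P \approx 250\,\mathrm{ns} (mid-range of 180-340 ns), h = 0.9:
L_{avg} \approx 70 + 0.1 \times 250 = 95\,\mathrm{ns}
```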
Memory Mode Performance vs. Locality & Load

• A synthetic traffic generator represents different types of workloads
• Vary the size of buffers to emulate more or less locality
• A very large data size (much larger than the DRAM cache) causes a higher miss rate
[Chart: MLC bandwidth (GB/s) vs. DRAM cache miss rate (0-45%), 100% reads, at three demand levels: light (13GB/s), medium (33GB/s), heavy (max)]
• High bandwidth delivered for high-demand workloads
• High demand + poor locality = degradation
• Medium/low-demand workloads still meet requirements
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Memory Mode Performance/Load/Locality: Configuration

System configuration (2-2-2 topology):
- Platform: Neon City
- CPU: CLX-B0; 28 cores/socket, 1 socket, 2 threads per core
- Memory: 6x 16GB DDR + 6x 128GB AEP QS
- SUT OS: Fedora 4.20.6-200.fc29.x86_64
- BKC: WW08
- BIOS: PLYXCRB1.86B.0576.D20.1902150028 (mbf50656_0400001c)
- FW: 01.00.00.5355
- Security: variants 1, 2, & 3 patched
- Test date: 4/5/2019

MLC parameters: --loaded_latency -d<varies> -t200

Buffer size (GB) per thread vs. DRAM cache miss rate (2-2-2):
  Miss rate (%)   Read    Write
  ~0              0.1     0.1
  ~10             1.0     0.7
  ~25             4.5     1.8
  ~40             9.0     4.5
Enable More Redis VM Instances with Sub-ms SLA

Notes: (1) one Redis Memtier instance per VM; (2) max throughput scenario, which will scale better at a lower operating point. The chart compares 1 VM per core and 2 VMs per core(2).

  VM size   DRAM baseline   MM capacity   Throughput vs. DRAM   Summary                     VMs
  45GB      768GB           1TB           111%, meets SLA       42% more VMs @ lower cost   14 -> 20
  90GB      768GB           1TB           147%, meets SLA       42% more VMs @ lower cost   7 -> 10

Throughput: higher is better; latency: lower is better (must be 1 ms or less).
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Redis Configuration

                                  Configuration 1 (1LM)                Configuration 2 (Memory Mode, 2LM)
Test by                           Intel                                Intel
Test date                         02/22/2019                           02/22/2019
Platform                          Neon City                            Neon City
# Nodes                           1                                    1
# Sockets                         2                                    2
CPU                               Intel® Xeon® Platinum 8276, 165W     Intel® Xeon® Platinum 8276, 165W
Cores/socket, threads/socket      28/56                                28/56
HT                                On                                   On
BIOS version                      PLYXCRB1.86B.0573.D10.1901300453     PLYXCRB1.86B.0573.D10.1901300453
BKC version                       WW06                                 WW06
AEP FW version                    5346 (QS AEP)                        5346 (QS AEP)
DDR config (slots/cap/speed)      12 slots / 32GB / 2666               12 slots / 16GB / 2666
DCPMM config (slots/cap/speed)    12 slots / 32GB / 2666               12 slots / 128GB, 256GB, 512GB / 2666
Total memory/node (DDR, DCPMM)    768GB, 0                             192GB; 1TB, 1.5TB, 3TB, or 6TB
NICs                              2x 40Gb                              2x 40Gb
OS                                Fedora 27                            Fedora 27
Kernel                            4.20.4-200.fc29.x86_64               4.20.4-200.fc29.x86_64
AEP mode                          1LM                                  Memory Mode (2LM)
Workload & version                Redis 4.0.11                         Redis 4.0.11
Other SW                          memtier_benchmark-1.2.12             memtier_benchmark-1.2.12
                                  (80/20 read/write); 1K record size   (80/20 read/write); 1K record size
VMs (type, vcpu/VM, VM OS)        KVM, 1/VM, CentOS 7.0                KVM, 1/VM, CentOS 7.0
App Direct Mode Transaction Flow

Traditional read to page fault vs. App Direct read:
• Traditional read to a page fault (disk): (1) software fault handling, (2) 4K transfer from disk, (3) request returned
• App Direct accesses memory directly, one cacheline at a time
  • Avoids the software and 4K-transfer overhead
  • Cores can still access DRAM normally, even on the same channel
[Diagram: left, a faulting read flows through the CPU caches and memory controller to an SSD, pulling a 4K page into DRAM; right, a load returns one cacheline directly from persistent memory through the memory controller]
Redis Example (with Software Change)

• Reduce TCO by moving a large portion of data from DRAM to Intel® Optane™ DC persistent memory
• Optimize performance by using the values stored in persistent memory instead of creating a separate copy of the log in SSD (only a pointer is written to the log)
• Direct access vs. disk protocol
[Diagram: baseline, a write request stores the key/value in DRAM and appends the operation to the AOF log file; with PMEM, a write request stores the key in DRAM, stores the value in PMEM, and appends only a pointer to the AOF log file]
Moving the value to App Direct reduces DRAM use and speeds logging by 2.27x
(open-source Redis SET, AOF=always update, 1K data size, 28 instances)
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
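A sketch of the modified write path (illustrative only, not Redis source; a flat pmem region with a bump allocator stands in for the real value store, and file paths are hypothetical):

```c
#include <libpmem.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical value store: a pmem-mapped region used as a bump allocator. */
static char  *pmem_base;
static size_t pmem_off;

/* Store the value durably in persistent memory; return its offset. */
static uint64_t store_value_pmem(const void *val, size_t len)
{
    uint64_t off = pmem_off;
    pmem_memcpy_persist(pmem_base + off, val, len); /* copy + flush + fence */
    pmem_off += len;
    return off;
}

/* Append a small fixed-size pointer record to the AOF instead of the value. */
static void append_pointer_to_aof(FILE *aof, const char *key,
                                  uint64_t off, size_t len)
{
    fprintf(aof, "SETPTR %s %llu %zu\n",
            key, (unsigned long long)off, len);
    fflush(aof); /* AOF=always: flush on every operation */
}

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    pmem_base = pmem_map_file("/mnt/pmem/values", 1 << 20, PMEM_FILE_CREATE,
                              0644, &mapped_len, &is_pmem);
    if (pmem_base == NULL)
        return 1;

    FILE *aof = fopen("appendonly.aof", "a");
    if (aof == NULL)
        return 1;

    const char value[] = "user:1001 {...}";
    uint64_t off = store_value_pmem(value, sizeof(value));
    append_pointer_to_aof(aof, "user:1001", off, sizeof(value));

    fclose(aof);
    pmem_unmap(pmem_base, mapped_len);
    return 0;
}
```

The value is durable the moment pmem_memcpy_persist() returns, so the AOF only needs a fixed-size pointer record; that is what shrinks the log write and yields the speedup quoted above.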
Spark SQL OAP Cache

• Intel® Optane™ DC persistent memory as cache
• More affordable than similar-capacity DRAM
• Significantly lower overhead for I/O-intensive workloads

  Configuration                  Query time
  768GB DRAM                     1417s
  192GB DRAM + 1TB App Direct    171s

8x improvement in Apache Spark* SQL I/O-intensive queries for analytics (3TB scale factor)
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Summary

• Intel® Optane™ DC Persistent Memory closes the DDR memory and storage gap
  • Architected for persistence
  • Provides large capacity that scales workloads to new heights
  • Offers a new way to manage data flows, with unprecedented integration into the system and platform
  • Optimized for performance: orders of magnitude faster than NAND
• Memory Mode for large, affordable volatile memory
• App Direct Mode for persistent memory