Intel® Optane™ Data Center Persistent Memory
Architecture (Jane) and Performance (Lily)
Presenters: Lily Looi, Jianping Jane Xu
Co-Authors: Asher Altman, Mohamed Arafa, Kaushik Balasubramanian, Kai Cheng, Prashant Damle, Sham Datta, Chet
Douglas, Kenneth Gibson, Benjamin Graniello, John Grooms, Naga Gurumoorthy, Ivan Cuevas Escareno, Tiffany
Kasanicky, Kunal Khochare, Zhiming Li, Sreenivas Mandava, Rick Mangold, Sai Muralidhara, Shamima Najnin, Bill Nale,
Jay Pickett, Shekoufeh Qawami, Tuan Quach, Bruce Querbach, Camille Raad, Andy Rudoff, Ryan Saffores, Ian Steiner,
Muthukumar Swaminathan, Shachi Thakkar, Vish Viswanathan, Dennis Wu, Cheng Xu
08/19/2019
Notices & Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks. Configurations on slides 18 and 20.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.

Performance results are based on testing as of Feb. 22, 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product or component can be absolutely secure.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

*Other names and brands may be claimed as property of others. Intel, the Intel logo, Xeon, the Xeon logo, Optane, and the Optane logo are trademarks of Intel Corporation in the United States and other countries. © Intel Corporation.
1. Intel® Optane™ DC Persistent Memory Architecture

A Breakthrough with a New Interface Protocol, Memory Controller, Media, and Software Stack
Memory-Storage Gap

[Figure: memory-storage hierarchy pyramid and access-distribution curve]
- CPU caches (core L1/L2, LLC): pico- to nano-second access
- Memory sub-system (DRAM, hot tier): 10s of GB, <100 nanosecs
- SSD (warm tier, e.g. Intel® 3D NAND SSD): 10s of TB, <100 microsecs
- Network storage, HDD/tape (cold tier): 10s of TB, <100 millisecs
- Access distribution: hot data is accessed more often, cooler data less often; between the memory sub-system and storage sits the memory-storage gap
Close the Memory-Storage Gap

[Figure: the same hierarchy, with two new tiers filling the gap]
- Goal: optimize performance given a cost and power budget; move data closer to compute; maintain persistency
- CPU caches (core L1/L2, LLC): pico- to nano-second access
- Memory sub-system (DRAM, hot tier): 10s of GB, <100 nanosecs
- New memory-like tier: 100s of GB, <1 microsec
- New storage-like tier: 1s of TB, <10 microsecs
- SSD (warm tier, e.g. Intel® 3D NAND SSD): 10s of TB, <100 microsecs
- Network storage, HDD/tape (cold tier): 10s of TB, <100 millisecs
Intel® Optane™ Media Technology

- Cell: high resistivity = '0', low resistivity = '1'
- Attributes: non-volatile; potentially fast write; high density; non-destructive fast read; low voltage; integrable with logic; bit alterable
- Cross-point structure: selectors allow dense packing and individual access to bits
- Scalable: memory layers can be stacked in a 3D manner
- Breakthrough material advances: compatible switch and memory cell materials
- High performance: cell and array architecture that can switch fast
- First-generation capacities: 128 GB, 256 GB, 512 GB
Intel® Optane™ DC Persistent Memory Module Architecture

Key features (numbered callouts in the module block diagram):
1. DQ buffers present a single load to the host
2. Host SMBus: SPD visible to the CPU; the Optane controller provides thermal-sensor (TSOD) functionality
3. Address Indirection Table
4. Integrated PMIC controlled by the Optane controller
5. On-DIMM firmware storage
6. On-DIMM power-fail safe with auto-detection
[Diagram: DDR4 connector carrying the data, command/address, and SMBus/I2C buses; DQ buffers; Optane controller; Optane media devices on the NVM bus; SPD; PMIC and PMIC bus; flash interface for firmware; separate power rails for NVM, memory controller, and DQ buffer/logic]
Intel® Optane™ DC Persistent Memory Controller Architecture

[Block diagram: DDR4 slot on host CPU → DCPMM memory interface → controller logic → Optane™ media interface → media channels → Optane™ media devices]
Controller blocks:
- Address mapping logic with an address-mapping cache
- Scheduler with read queue and write queue; refresh engine
- Encrypt/decrypt with key management and DRNG
- ECC/scrambler and error-handling logic
- Media management; power & thermal management microcontroller
- Capacitors ("caps for flushes") to complete in-flight writes on power loss
Intel® Optane™ DC Persistent Memory SW Enabling Stack

[Diagram: persistent-memory software stack, user space over kernel space]
- Management path: a management UI and management library sit on top of the generic NVDIMM driver
- Storage path: applications use the standard file API through a file system, or standard raw device access, both served by the generic NVDIMM driver in kernel space
- Persistent-memory path: applications use the standard file API on a pmem-aware file system; direct access ("DAX") maps file memory into the application through MMU mappings, giving user-space load/store access with no kernel code in the data path
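In practice, the DAX path boils down to a few system calls. A minimal sketch (mine, not from the deck), assuming Linux with a DAX-capable file system (e.g. ext4 or XFS mounted with -o dax) and a hypothetical file path:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Older headers may lack these flags; values match the Linux UAPI. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

int main(void)
{
    /* Hypothetical file on a pmem-aware (DAX) file system. */
    int fd = open("/mnt/pmem/example", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, 4096) != 0)
        return 1;

    /* MAP_SYNC (Linux 4.15+): page tables stay synchronous with the file
     * metadata, so flushing CPU caches is enough to persist stores; no
     * page cache sits between the application and the media. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    strcpy(p, "hello, persistent memory");  /* plain store: byte-addressable */
    msync(p, 4096, MS_SYNC);                /* portable flush to durability  */

    munmap(p, 4096);
    close(fd);
    return 0;
}
```

With MAP_SYNC, a store followed by a cache flush is durable; msync() is the portable way to force that flush from ordinary application code.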
Key Feature Deep Dive: ADR

Asynchronous DRAM Refresh (ADR), the power-fail protection feature:
1. AC power loss de-asserts PWROK
2. Platform logic then asserts ADR_Trigger
3. The PCH starts the programmable ADR timer
4. The PCH asserts the SYNC message
5. The PCU in the processor detects the SYNC message bit and sends AsyncSR to the memory controller (MC)
6. The MC flushes the write pending queue (WPQ)
7. After the ADR timer expires, the PCH asserts the ADR_COMPLETE pin
[Diagram: the platform power supply and platform logic drive PWROK and ADR_Trigger into the PCH (ADR timer, ADR_SYNC, ADR_GPIO, ADR_Complete); inside the CPU, the PCU signals the MC, whose WPQ drains to the Intel® Optane™ DC PMM; the core L1/L2/L3 caches sit above the MC]
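From the software side, ADR means durability is reached once a store leaves the CPU caches; steps 1-7 above guarantee the WPQ drains to media on power loss. A minimal sketch (illustrative, not from the deck) of that flush step using the CLWB and SFENCE intrinsics:

```c
#include <immintrin.h>  /* _mm_clwb, _mm_sfence (compile with -mclwb) */
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Illustrative sketch: push dirty lines out of the CPU caches into the
 * ADR-protected domain. Once the stores reach the memory controller's
 * write pending queue (WPQ), ADR drains them to Optane media even on
 * AC power loss. */
static void flush_to_adr_domain(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clwb((void *)p);  /* write back the line; may keep it cached  */
    _mm_sfence();             /* order flushes before any later stores    */
}
```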
Memory Mode

• Large memory capacity
• No software/application changes required
  • To mimic traditional memory, data is "volatile"
  • Volatile-mode key cleared and regenerated every power cycle
• DRAM is "near memory"
  • Used as a write-back cache
  • Managed by the host memory controller
  • Caching operates within the same host memory controller, not across controllers
  • The ratio of far/near memory (PMEM/DRAM) can vary
• Overall latency
  • Same as DRAM for a cache hit
  • DC persistent memory + DRAM for a cache miss
[Diagram: cores with L1/L2 caches and a shared L3 issue MOV instructions to the memory controller; DRAM acts as near memory in front of persistent-memory far memory]
App Direct Mode: Persistent Memory

• PMEM-aware software/application required
• Adds a new tier between DRAM and block storage (SSD/HDD)
• Industry open-standard programming model and Intel PMDK
• In-place persistence
  • No paging, context switching, interrupts, nor kernel code in the data path
• Byte addressable like memory
  • Load/store access, no page caching
  • Cache coherent
  • Ability to do DMA & RDMA
• Minimum required power-fail-protected domain: the memory subsystem; SW makes sure data is flushed to the durability domain using CLFLUSHOPT or CLWB
[Diagram: a core with L1/L2 and L3 CPU caches issues MOV to the memory controller, which fronts both DRAM memory and persistent memory; the controller's WPQ lies inside the power-fail-protected domain]
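For application code, PMDK's libpmem wraps the mapping and CLFLUSHOPT/CLWB flushing described above. A small sketch under assumed paths (link with -lpmem; /mnt/pmem/pool is a hypothetical pool file):

```c
#include <libpmem.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map a hypothetical pool file, creating it if needed. */
    char *addr = pmem_map_file("/mnt/pmem/pool", 4096, PMEM_FILE_CREATE,
                               0644, &mapped_len, &is_pmem);
    if (addr == NULL)
        return 1;

    const char msg[] = "durable update";
    memcpy(addr, msg, sizeof(msg));

    if (is_pmem)
        pmem_persist(addr, sizeof(msg)); /* user-space flush + fence       */
    else
        pmem_msync(addr, sizeof(msg));   /* fall back to msync on non-pmem */

    pmem_unmap(addr, mapped_len);
    return 0;
}
```

pmem_persist() performs the user-space flush-and-fence sequence entirely without kernel involvement; the is_pmem check keeps the same code correct on ordinary storage.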
2. Performance

Intel® Optane™ DC Persistent Memory for larger data, better performance/$, and new paradigms
Intel® Optane™ DC Persistent Memory Latency

• Roughly 1000x lower latency than the NAND NVMe SSD shown
• Note: 4K granularity gives about the same performance as 256B
[Chart: read idle latency at smaller granularity (vs. 4K) for Intel® Optane™ DC persistent memory, Intel Optane DC SSD P4800X, and Intel DC P4610 NVMe SSD; lower is better]
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Intel® Optane™ DC Persistent Memory Latency

Read idle latency ranges from 180ns to 340ns (vs. DRAM ~70ns). Performance can vary based on:
• 64B random vs. 256B granularity
• Read/write mix
• Power level (programmable 12-18W; graph is at 18W)
[Chart: read idle latency at smaller granularity (vs. 4K), with curves for Read (64B), Read (256B), Read/Write (64B), and Read/Write (256B), alongside Intel Optane DC SSD P4800X and Intel DC P4610 NVMe SSD]
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Memory Mode Transaction Flow

[Diagram: two panels, "DRAM cache hit" and "DRAM cache miss"; in each, a core's request passes through the CPU caches to the memory controller, which checks near-memory DRAM and, on a miss, also fetches from far memory]
• Good locality means near-DRAM performance
  • Cache hit: latency same as DRAM
  • Cache miss: latency of DRAM + Intel® Optane™ DC persistent memory (see the worked estimate below)
• Performance varies by workload; the best workloads have these traits:
  • Good locality, for a high DRAM cache hit rate
  • Low memory bandwidth demand
• Other factors:
  • #reads > #writes
  • Configured memory size vs. workload size
https://software.intel.com/en-us/articles/prepare-for-the-next-generation-of-memory
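To make the hit/miss bullets concrete, here is a back-of-envelope latency model (my illustration; the ~70ns DRAM and 180-340ns persistent-memory figures come from the latency slides above):

```latex
% h = DRAM cache hit rate, L_D = DRAM latency, L_P = persistent-memory latency
L_{avg} = h\,L_D + (1-h)\,(L_D + L_P) = L_D + (1-h)\,L_P

% Example with L_D \approx 70\,\mathrm{ns},
% L_P \approx 250\,\mathrm{ns} (mid-range of 180-340 ns), h = 0.9:
L_{avg} \approx 70 + 0.1 \times 250 = 95\,\mathrm{ns}
```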
Memory Mode Performance vs. Locality & Load

• A synthetic traffic generator represents different types of workloads
• Vary the size of buffers to emulate more or less locality
• A very large data size (much larger than the DRAM cache) causes a higher miss rate
[Chart: MLC bandwidth (GB/s) vs. DRAM cache miss rate (0-45%), 100% reads, at three demand levels: light (13GB/s), medium (33GB/s), heavy (max)]
• High bandwidth delivered for high-demand workloads
• High demand + poor locality = degradation
• Medium/low-demand workloads still meet requirements
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Memory Mode Performance/Load/Locality: Configuration

System configuration (2-2-2 topology):
- Platform: Neon City
- CPU: CLX-B0; 28 cores/socket, 1 socket, 2 threads per core
- Memory: 6x 16GB DDR + 6x 128GB AEP QS
- SUT OS: Fedora 4.20.6-200.fc29.x86_64
- BKC: WW08
- BIOS: PLYXCRB1.86B.0576.D20.1902150028 (mbf50656_0400001c)
- FW: 01.00.00.5355
- Security: variants 1, 2, & 3 patched
- Test date: 4/5/2019

MLC parameters: --loaded_latency -d<varies> -t200

Buffer size (GB) per thread vs. DRAM cache miss rate (2-2-2):
  Miss rate (%)   Read    Write
  ~0              0.1     0.1
  ~10             1.0     0.7
  ~25             4.5     1.8
  ~40             9.0     4.5
Enable More Redis VM Instances with Sub-ms SLA

Notes: (1) one Redis Memtier instance per VM; (2) max throughput scenario, which will scale better at a lower operating point. The chart compares 1 VM per core and 2 VMs per core(2).

  VM size   DRAM baseline   MM capacity   Throughput vs. DRAM   Summary                     VMs
  45GB      768GB           1TB           111%, meets SLA       42% more VMs @ lower cost   14 -> 20
  90GB      768GB           1TB           147%, meets SLA       42% more VMs @ lower cost   7 -> 10

Throughput: higher is better; latency: lower is better (must be 1 ms or less).
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Redis Configuration

                                  Configuration 1 (1LM)                Configuration 2 (Memory Mode, 2LM)
Test by                           Intel                                Intel
Test date                         02/22/2019                           02/22/2019
Platform                          Neon City                            Neon City
# Nodes                           1                                    1
# Sockets                         2                                    2
CPU                               Intel® Xeon® Platinum 8276, 165W     Intel® Xeon® Platinum 8276, 165W
Cores/socket, threads/socket      28/56                                28/56
HT                                On                                   On
BIOS version                      PLYXCRB1.86B.0573.D10.1901300453     PLYXCRB1.86B.0573.D10.1901300453
BKC version                       WW06                                 WW06
AEP FW version                    5346 (QS AEP)                        5346 (QS AEP)
DDR config (slots/cap/speed)      12 slots / 32GB / 2666               12 slots / 16GB / 2666
DCPMM config (slots/cap/speed)    12 slots / 32GB / 2666               12 slots / 128GB, 256GB, 512GB / 2666
Total memory/node (DDR, DCPMM)    768GB, 0                             192GB; 1TB, 1.5TB, 3TB, or 6TB
NICs                              2x 40Gb                              2x 40Gb
OS                                Fedora 27                            Fedora 27
Kernel                            4.20.4-200.fc29.x86_64               4.20.4-200.fc29.x86_64
AEP mode                          1LM                                  Memory Mode (2LM)
Workload & version                Redis 4.0.11                         Redis 4.0.11
Other SW                          memtier_benchmark-1.2.12             memtier_benchmark-1.2.12
                                  (80/20 read/write); 1K record size   (80/20 read/write); 1K record size
VMs (type, vcpu/VM, VM OS)        KVM, 1/VM, CentOS 7.0                KVM, 1/VM, CentOS 7.0
App Direct Mode Transaction Flow

Traditional read to page fault vs. App Direct read:
• Traditional read to a page fault (disk): (1) software fault handling, (2) 4K transfer from disk, (3) request returned
• App Direct accesses memory directly, one cacheline at a time
  • Avoids the software and 4K-transfer overhead
  • Cores can still access DRAM normally, even on the same channel
[Diagram: left, a faulting read flows through the CPU caches and memory controller to an SSD, pulling a 4K page into DRAM; right, a load returns one cacheline directly from persistent memory through the memory controller]
Redis Example (with Software Change)

• Reduce TCO by moving a large portion of data from DRAM to Intel® Optane™ DC persistent memory
• Optimize performance by using the values stored in persistent memory instead of creating a separate copy of the log in SSD (only a pointer is written to the log)
• Direct access vs. disk protocol
[Diagram: baseline, a write request stores the key/value in DRAM and appends the operation to the AOF log file; with PMEM, a write request stores the key in DRAM, stores the value in PMEM, and appends only a pointer to the AOF log file]
Moving the value to App Direct reduces DRAM use and speeds logging by 2.27x
(open-source Redis SET, AOF=always update, 1K data size, 28 instances)
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
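A sketch of the modified write path (illustrative only, not Redis source; a flat pmem region with a bump allocator stands in for the real value store, and file paths are hypothetical):

```c
#include <libpmem.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical value store: a pmem-mapped region used as a bump allocator. */
static char  *pmem_base;
static size_t pmem_off;

/* Store the value durably in persistent memory; return its offset. */
static uint64_t store_value_pmem(const void *val, size_t len)
{
    uint64_t off = pmem_off;
    pmem_memcpy_persist(pmem_base + off, val, len); /* copy + flush + fence */
    pmem_off += len;
    return off;
}

/* Append a small fixed-size pointer record to the AOF instead of the value. */
static void append_pointer_to_aof(FILE *aof, const char *key,
                                  uint64_t off, size_t len)
{
    fprintf(aof, "SETPTR %s %llu %zu\n",
            key, (unsigned long long)off, len);
    fflush(aof); /* AOF=always: flush on every operation */
}

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    pmem_base = pmem_map_file("/mnt/pmem/values", 1 << 20, PMEM_FILE_CREATE,
                              0644, &mapped_len, &is_pmem);
    if (pmem_base == NULL)
        return 1;

    FILE *aof = fopen("appendonly.aof", "a");
    if (aof == NULL)
        return 1;

    const char value[] = "user:1001 {...}";
    uint64_t off = store_value_pmem(value, sizeof(value));
    append_pointer_to_aof(aof, "user:1001", off, sizeof(value));

    fclose(aof);
    pmem_unmap(pmem_base, mapped_len);
    return 0;
}
```

The value is durable the moment pmem_memcpy_persist() returns, so the AOF only needs a fixed-size pointer record; that is what shrinks the log write and yields the speedup quoted above.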
Spark SQL OAP Cache

• Intel® Optane™ DC persistent memory as cache
• More affordable than similar-capacity DRAM
• Significantly lower overhead for I/O-intensive workloads

  Configuration                  Query time
  768GB DRAM                     1417s
  192GB DRAM + 1TB App Direct    171s

8x improvement in Apache Spark* SQL I/O-intensive queries for analytics (3TB scale factor)
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Summary

• Intel® Optane™ DC Persistent Memory closes the DDR memory and storage gap
  • Architected for persistence
  • Provides large capacity that scales workloads to new heights
  • Offers a new way to manage data flows, with unprecedented integration into the system and platform
  • Optimized for performance: orders of magnitude faster than NAND
• Memory Mode for large, affordable volatile memory
• App Direct Mode for persistent memory