Breaking Through the Memory Wall
Breaking Through the Memory Wall with CXL™
Michael Ocampo, Sr. Product Manager, Ecosystem/Cloud-Scale Interop Lab
Agenda
• Breaking Through the Memory Wall
• Memory Bound Use Cases
• CXL for Modular Shared Infrastructure
• Critical CXL Collaboration Happening Now
• Call to Action
Breaking Through the Memory Wall
Challenges with Previous Attempts
1. Memory bandwidth and capacity did not scale efficiently
2. Latency inferior to local CPU memory
3. Not deployable at scale
4. Not easily adopted by existing applications
Breaking Through the Memory Wall with CXL
1. Increase server memory bandwidth and capacity by 50%
2. Reduce latency by 25%
3. Standard DRAM for flexible supply chain and cost
4. Seamlessly expand memory for existing and new applications
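The 50% figure maps directly onto the system-under-test configuration shown later in the deck: 8 local DDR5 channels plus 4 CXL-attached modules holding half the local capacity. A back-of-the-envelope sketch of that headroom (the per-link bandwidth equivalence is an assumption for this sketch, not a measured figure):

```python
# Back-of-the-envelope headroom from adding CXL-attached DIMMs,
# using the TPC-H system-under-test numbers from this deck.
local_gb = 8 * 64    # 8x 64GB DDR5-5600 on the CPU's own channels
cxl_gb = 4 * 64      # 4x 64GB DDR5-5600 behind CXL controllers

capacity_gain = (local_gb + cxl_gb) / local_gb   # 768/512 = 1.5, i.e. +50%

# Bandwidth scales with the number of interleave targets, assuming each
# CXL link sustains roughly one DDR channel's worth of traffic.
local_channels, cxl_channels = 8, 4
bandwidth_gain = (local_channels + cxl_channels) / local_channels  # 1.5

print(capacity_gain, bandwidth_gain)  # -> 1.5 1.5
```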
Memory Bound Use Cases
• eCommerce & Business Intelligence
• Online Transaction Processing (OLTP): what is happening?
• Online Analytics Processing (OLAP): what has happened?
• AI Inferencing
• Recommendation Engines
• Semantic Cache
Opportunity for CXL to Boost MySQL Database Performance
Opportunity for CXL to Boost Vector Database Performance
[Diagram: users send queries to an inference server; the server queries/stores embeddings in a vector database and returns inference responses through REST-hosted models]
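The speaker notes spell out the semantic-cache flow behind this diagram: hash the incoming query to form an inference key, check the store for a previously computed result, and only fall through to the model on a miss. A minimal sketch of that loop (the dict-backed store and names like `run_model` are illustrative stand-ins; a real deployment would use a vector database with similarity search rather than exact-match hashing):

```python
import hashlib

# Illustrative stand-in for a vector DB / Redis-backed semantic cache.
cache = {}

def inference_key(query: str) -> str:
    # Hash of the input query (plus any metadata) becomes the cache key.
    return hashlib.sha256(query.encode()).hexdigest()

def cached_inference(query: str, run_model):
    key = inference_key(query)
    if key in cache:              # hit: return the stored inference
        return cache[key]
    result = run_model(query)     # miss: run the query through the model
    cache[key] = result           # store it for subsequent callers
    return result
```

Because the cache is decoupled from the inference server's process life cycle, multiple server instances can share the same store, as the notes point out.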
OLTP & OLAP Results
[Chart: TPC-H query times (normalized) for Q1–Q22 and the average, DRAM+CXL vs. DRAM-only]
System Under Test Configuration
CPU: 5th Gen Intel® Xeon® Scalable Processor (Single-Socket)
Storage: 4x NVMe PCIe 4.0 SSDs
Local Memory: 512GB (8x 64GB DDR5-5600)
CXL-Attached Memory: 256GB (4x 64GB DDR5-5600)
Mode: 12-Way Heterogeneous Interleaving
Benchmark: TPC-H (1000 scale factor)
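The "12-Way Heterogeneous Interleaving" mode spreads consecutive blocks across all eight local channels and all four CXL modules. A toy model of that placement (real hardware interleaves at cacheline or page granularity with its own target ordering; this sketch only illustrates the rotation):

```python
# Toy model of 12-way heterogeneous interleaving:
# 8 local DDR5 channels plus 4 CXL-attached modules = 12 targets.
targets = [f"ddr{i}" for i in range(8)] + [f"cxl{i}" for i in range(4)]

def target_for(block: int) -> str:
    # Consecutive blocks rotate across all 12 targets, so a sequential
    # scan draws bandwidth from local and CXL-attached memory at once.
    return targets[block % len(targets)]

placements = [target_for(b) for b in range(24)]
# Over 24 consecutive blocks, every target serves exactly 2 of them.
```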
Cut Big Query Times in Half with CXL Memory
OLAP
System Under Test Configuration
CPU: 4th Gen Intel® Xeon® Scalable Processor (Single-Socket)
Storage: 2x NVMe PCIe 4.0 SSDs
Local Memory: 128GB (8x 16GB DDR5-4800)
CXL-Attached Memory: 128GB (2x 64GB DDR5-5600)
Mode: Memory Tiering (MemVerge Memory Machine)
Benchmark: Sysbench (Percona Labs TPC-C Scripts)
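Unlike the interleaved OLAP setup, this OLTP run uses memory tiering: hot pages stay in local DRAM while colder pages are demoted to CXL-attached memory. MemVerge's actual placement engine is proprietary; the sketch below is a generic access-frequency policy meant only to illustrate the idea:

```python
# Generic hot/cold tiering sketch (illustrative; not MemVerge's algorithm).
# Pages with the highest recent access counts stay in the DRAM tier;
# everything else is placed in the CXL-attached tier.
def tier_pages(access_counts: dict, dram_slots: int):
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    dram = set(ranked[:dram_slots])    # hottest pages -> local DRAM
    cxl = set(ranked[dram_slots:])     # colder pages -> CXL memory
    return dram, cxl

dram, cxl = tier_pages({"a": 90, "b": 5, "c": 40, "d": 1}, dram_slots=2)
# dram == {"a", "c"}, cxl == {"b", "d"}
```

In a real tiering stack this decision runs continuously against hardware access telemetry, promoting and demoting pages as the working set shifts.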
150% More TPS with Only 15% More CPU Utilization
[Chart: transactions per second (TPS, normalized) vs. clients (0–1000), DRAM vs. DRAM + CXL]
[Chart: CPU utilization (normalized) vs. clients (0–1000), DRAM vs. DRAM + CXL]
OLTP
Breaking the Memory Wall for Databases
Without CXL (popular certified & supported SAP HANA® hardware): 48 DIMMs across two 2-socket systems; high kW, high TCO
With CXL (optimized hardware for in-memory databases): 56 DIMMs in one 2-socket system; lower kW, lower TCO
Interleaving across CXL-attached memory: local DDR5-4800 channels plus CXL-attached DDR5-4800 over x16 links
2.33x memory capacity and 1.66x memory bandwidth per socket with CXL
Lower TCO for memory-intensive applications
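The 2.33x capacity figure follows from the DIMM counts on this slide (the speaker notes give 12 DIMMs per socket without CXL versus 28 per socket with it). Checking the arithmetic:

```python
# Per-socket DIMM counts implied by the slide:
# 48 DIMMs across two 2-socket systems -> 12 per socket without CXL;
# 56 DIMMs in one 2-socket system      -> 28 per socket with CXL.
without_cxl = 48 / (2 * 2)
with_cxl = 56 / 2

capacity_ratio = with_cxl / without_cxl
print(round(capacity_ratio, 2))  # -> 2.33, matching the slide's claim
```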
Modular Shared Infrastructure (M-SIF)
• OCP Alignment with M-SIF and CMS:
  • Shared Elements with CXL Support
  • Pluggable Multi-Purpose Module (PMM)
  • 1U Support for High Density
  • Standard DIMM Support
  • High-Power Connector (200W–600W)
  • Hot-Plug Support
• Challenges:
  • Signal Integrity
  • Link Bifurcation & Configuration
  • Latency/Performance
  • DIMM Interoperability
[Diagram: logical wiring within the enclosure. A Core Element (16–24 DIMMs per host, CPU on a Host Interface Board) connects across a PWR/PCIe/CXL backplane through PCIe/CXL retimers and a CXL.mem switch to Shared Elements (8–16 DIMMs per HIB) populated with PMMs and CXL controllers; shared elements serve JBOF, JBOG, and JBOM roles across Nodes 1–5]
Enabling CXL Connectivity for M-SIF
Local, Direct CXL-Attached (MB / PCI CEM connectivity):
• Use Case: Memory Expansion
• Real-Time Apps
Short-Reach, CXL-Attached (midplane or backplane connectivity via PCIe retimer):
• Use Case: JBOM Enablement
• Intelligent Tiering/Placement
Long-Reach, CXL-Attached (PCIe cabling via PCIe retimers):
• Use Case: Shared/Pooled Memory
• High-Capacity In-Memory Compute
Built with Leo CXL Memory Connectivity controllers and Aries PCIe/CXL Smart Retimers.
Extending Reach for PCIe/CXL Memory
[Chart: relative MLC bandwidth and latency for CXL, CXL + 1 retimer, and CXL + 2 retimers, normalized to the no-retimer case]
Minimal Impact to Performance with Extended Reach
Critical Collaboration Happening Now
• DIMM Stability & Performance
• OS Development & Feature Testing
• CXL 2.0 RAS & Telemetry
• SW Integration & Orchestration
• High-Performance Interleaving
• High-Capacity Memory Density Tiering
Focus areas: Host & DDRx Interop, Cloud-Scale Fleet Management, Breaking Through the Memory Wall
Call to Action
Visit: check out how we smashed through the Memory Wall [OCP Map and where we are]
Learn More: www.opencompute.org
• wiki/server/DC-MHS/M-SIF Base Spec: 0.5
• wiki/server/CMS: CMM Proposal
• wiki/hardware_management/DC-SCM: 2.0
CXL Resources
• Linux: https://pmem.io/ndctl/collab
• Interop: https://www.asteralabs.com/interop
Ecosystem Alliance Contact:
• michael.ocampo@asteralabs.com
See the Demos: Booth A11
• Scale memory bandwidth and capacity
• Scale parallel processing of accelerators
• Scale rack connectivity over copper (Ethernet)
Thank you!
Open Discussion
Editor's Notes

  • #5: Peak compute has increased by 60,000x over the past 20 years, as opposed to 100x for DRAM and 30x for interconnect bandwidth. Emerging memory technologies that try to address this problem take a long time to adopt, and must be able to scale and be manufactured reliably. With ours, we're using standard DRAM…
  • #6: eCommerce & Data Warehousing: boosts OLTP throughput by 2.5x and cuts OLAP query times in half. Caching LLM prompts is a great way to reduce expenses: it uses vector search to identify similar prompts and returns the stored response; if there are no similar responses, the request is passed on to the LLM provider to generate the completion. A vector DB doesn't need to be bound to the inference server; it can be decoupled from the process life cycle, enabling multiple instances to leverage the same cache. After receiving a query, the inference server: (1) computes a hash of the input query, including the tensor and some metadata, which becomes the inference key; (2) checks Redis for a previously run inference; (3) returns that inference if it exists; (4) runs the tensor through the model if the inference does not exist; (5) stores the inference in the vector DB; (6) returns the inference.
  • #8: 12 DIMMs per socket vs. 28 DIMMs per socket; 6 channels per socket vs. 16 channels per socket. RAM per core is roughly the same: 48 DIMMs × 256GB = 12.3TB (~76GB per core); 56 × 256GB = 14.3TB (~75GB per core). Ice Lake supports 40 cores per socket (4 sockets = 160 cores); Genoa supports 96 cores (2 sockets = 192 cores). PCIe 4.0 system: Lenovo ThinkSystem SR860 V2. PCIe 5.0 system: ThinkSystem SR675 V3.
  • #9: M-SIF: Shared Infrastructure: Improve interoperability related to share infrastructure enclosures with multiple serviceable modules. Modules containing elements are blind-matable and hot-pluggable into shared infrastructure enclosure. PMM: 167 mm long x 125 mm wide x 18.30 mm thick form factor 6.5” long x 4.9” wide x .72” thick
  • #10: Backplane Specification:  OIF CEI-28G-LR: Delivers low crosstalk noise and low insertion loss while minimizing channel performance variation for every differential pair. Enables a scalable migration path beyond 25Gb/s.
  • #12: At Astera, we’ve launched a Cloud-Scale Interop Lab for our retimer and CXL Type 3 device. Not only are we testing with all major CPU and memory vendors, but OS vendors as well. We believe seamless interoperability is critical for ecosystem enablement, and there are a myriad of host systems & DDR5-4800 & -5600 modules that we support with our Leo Memory Connectivity Platform. Jonathan Zhang, Sr. Director of Software Engineering at Astera Labs, presented some of the challenges of DDR5 interop testing yesterday in the CMS track, but I’ll just mention here that it takes a tremendous engineering effort to continuously run regression tests on as many DIMMs as possible, but of course, we “don’t want to confuse the work behind the work with delivering actual customer value.” So, it’s important to continue collaborating across the ecosystem to optimize benchmarks with different HW & SW configurations to understand what works best per use case and customer scenario. RAS & Telemetry are also vital for the usability of the technology. These features are essential to support pre- and post- OS, which we’re working diligently on with customers and partners to support software development and integration to configure, control CXL memory, as well as log events and alert the host as needed. This is really a continuum to support not just orchestration for cloud-scale fleet management, but for managing memory tiers as well as hypervisors. Thanks to community efforts through the CXL Linux Sync, we’re seeing significant progress for core infrastructure services, such as media management, error handling, and CXL performance monitoring Through this foundational work, we’re seeing fruitful collaborations with all OS vendors to support applications on bare-metal, VMs, containers and back-end applications, such as databases, without major software changes or provisioning scripts.
  • #18: eCommerce & Data Warehousing: boosts OLTP throughput by 2.5x and cuts OLAP query times in half. AI Inferencing: boosts ResNet-50 IPS by 1.5x and accelerates vector search by 2x.