Offloading Deep Dive
Efstathios Efstathiou
Agenda
Introduction
Definition of offloading (DB view)
Offloading techniques we can use
Demo-Time ☺
Findings
Q&A
Introduction
About me
Married
Linux since 1998
Oracle since 2000
OCM & OCP
Master Database Engineer @BIT since 2014
Definition of offloading (DB view)
In general:
«Everything that saves resources on
the database server»
Definition of offloading (DB view)
Examples of offloading implementations
NIC (TCP/IP Offload, iSCSI Offload, Infiniband RDMA, NVMe)
Storage Adapters (RAID Calculation, SCSI)
Math Co-Processors
FPGAs
DMA-Engines
Distributed Computing (e.g. using MPI)
Remote DB Engine (Hadoop Connector, Gluent)
Definition of offloading (DB view)
How is it done on Exadata?
Offloading via the DMA engine of the InfiniBand HCA
Enables Remote DMA (RDMA) operations (DB to cell)
The storage cell can be accessed at near-zero CPU cost
A DMA operation has higher latency than PIO via the CPU, so it is good for large
amounts of data (e.g. DWH) but worse for OLTP
The task can be distributed
For example, a node is instructed via an MPI call to execute a sub-query and to transmit the start
or end memory address of its result to the requester (DB server)
The DB server now only needs to merge the partial results.
In this sense, the DB server acts more like a client
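The distributed flow on this slide follows a scatter/gather pattern. The sketch below is a toy model only, not how Exadata is implemented: the cells are simulated with a thread pool instead of MPI ranks, and `CELLS`, `scan_cell` and `offloaded_sum` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each simulated "storage cell" holds one slice of a
# table of (region, amount) rows.
CELLS = [
    [("north", 10), ("south", 5)],
    [("north", 7), ("east", 3)],
    [("south", 1), ("east", 9)],
]


def scan_cell(rows, region):
    """Runs on the cell: filter and pre-aggregate locally, return a partial sum."""
    return sum(amount for r, amount in rows if r == region)


def offloaded_sum(region):
    """Runs on the DB server: send the sub-query to every cell, then merge."""
    with ThreadPoolExecutor(max_workers=len(CELLS)) as pool:
        partials = list(pool.map(lambda rows: scan_cell(rows, region), CELLS))
    # The only work left on the DB server is merging the partial results.
    return sum(partials)


print(offloaded_sum("north"))
```

The DB server never sees the raw rows, only one number per cell, which is why it behaves more like a client.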
Offloading techniques we can use
The following devices have a DMA engine:
RDMA-enabled network adapters and Infiniband cards
Intel I/OAT DMA chip on Xeon boards (for NVMe SSDs)
PCIe switch cards
PLX-based NVMe controllers
Or the PCIe chip in your Intel Xeon computer ;-)
Lowest latency
Offloading techniques we can use
The following protocols have (R)DMA support:
iSCSI over RDMA
NFS over RDMA
NVMe over Fabrics (RDMA-based) or RDMA Block Device
Needs the least CPU
Good starting point
Offloading techniques we can use
Comparison (Native PCIe fabric vs. NVMe over Fabrics)
Native PCIe fabric has significantly less latency
Setup with PCIe-JBOF is less complex than NVMe over Fabrics
Throughput is identical
Offloading techniques we can use
That PCIe is quite cool… What other tricks can it do?
DMA-Engine like Infiniband
Connect multiple PCIe root complexes via Non-Transparent Bridge
Network protocol IPoPCIe analogous to IPoIB, but performs way better
Device Sharing via I/O Virtualization (SR-IOV, MR-IOV)
Offloading techniques we can use
How do we get the system really fast?
Answer: Memory!
The only question is:
Which memory?
Where is it located?
How is it structured?
Demo-Time ☺
Demo 1: Device Sharing
Description
Host 1 has an SR-IOV-capable NIC
Host 1 initializes a Virtual Function
Through the Non-Transparent Bridge (NTB), Host 2 can access that function by loading the device driver for the NIC
https://www.youtube.com/watch?v=GPh0Ms3dfPo
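On Linux, the "Host 1 initializes a Virtual Function" step is typically done through the `sriov_totalvfs` and `sriov_numvfs` attributes of the PCI device in sysfs. The sketch below assumes that interface; the helper name `enable_virtual_functions` is hypothetical, and the NTB side of the demo (Host 2 loading the driver) is not covered.

```python
from pathlib import Path


def enable_virtual_functions(device_dir, count):
    """Request `count` SR-IOV virtual functions for the PCI device at `device_dir`."""
    device = Path(device_dir)
    total = int((device / "sriov_totalvfs").read_text().strip())
    if not 0 <= count <= total:
        raise ValueError(f"device supports at most {total} virtual functions")

    numvfs = device / "sriov_numvfs"
    current = int(numvfs.read_text().strip())
    if current == count:
        return count
    if current != 0 and count != 0:
        # The kernel expects the VF count to be reset to 0 before it is changed.
        numvfs.write_text("0")
    numvfs.write_text(str(count))
    return count
```

A call would look like `enable_virtual_functions("/sys/bus/pci/devices/0000:03:00.0", 1)`, where the PCI address is a placeholder and root privileges are required.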
Demo-Time ☺
Demo 1: Device Sharing
Expected behaviour
Works as designed ☺
Depending on the PCIe switch chip used, there are device driver dependencies
Demo-Time ☺
Demo 2: DMA-Transfer
Description
Host 1 and Host 2 are fitted with a PCIe-switch-based host card and connected back to back
The PLX SDK comes with a sample program supporting PIO and DMA transfer
We measure the overall throughput and CPU load
https://www.youtube.com/watch?v=LNPBr3WvuNg
Demo-Time ☺
Demo 2: DMA-Transfer
Expected behaviour
Large data transfer benefits from DMA (DWH) ☺
Small, time critical transfers have less latency with PIO (OLTP)
You’ll need both modes
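Why both modes are needed can be illustrated with a simple cost model: a fixed setup latency plus streaming time. All numbers below are hypothetical placeholders, not measurements from the demo, and the names `PIO`, `DMA`, `faster_mode` and `crossover_bytes` are introduced only for this sketch.

```python
# Hypothetical parameters: PIO starts quickly but streams slowly,
# DMA pays a higher setup cost but streams fast.
PIO = {"setup_us": 0.5, "bandwidth_gb_s": 1.0}
DMA = {"setup_us": 5.0, "bandwidth_gb_s": 10.0}


def transfer_time_us(size_bytes, setup_us, bandwidth_gb_s):
    """Fixed setup cost plus streaming time, in microseconds."""
    bytes_per_us = bandwidth_gb_s * 1e3  # 1 GB/s equals 1000 bytes per microsecond
    return setup_us + size_bytes / bytes_per_us


def faster_mode(size_bytes):
    """Return the mode with the lower modelled transfer time."""
    pio = transfer_time_us(size_bytes, **PIO)
    dma = transfer_time_us(size_bytes, **DMA)
    return "PIO" if pio <= dma else "DMA"


def crossover_bytes():
    """Transfer size at which both modes take the same time."""
    per_byte_gap = 1 / (PIO["bandwidth_gb_s"] * 1e3) - 1 / (DMA["bandwidth_gb_s"] * 1e3)
    return (DMA["setup_us"] - PIO["setup_us"]) / per_byte_gap


for size in (512, 8192, 1_000_000):
    print(size, faster_mode(size))
```

Below the crossover size the setup latency dominates (OLTP-style small transfers); above it the bandwidth dominates (DWH-style bulk transfers).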
Demo-Time ☺
Demo 3: Fabric Attached Memory (PCIe) and Oracle RAC
Description
Database and memory hosts are fitted with a PCIe-switch-based host card and connected to a central PCIe switch
The memory hosts' physical DRAM is expanded with OptaneGrid 3D XPoint into an SDM pool (mirrored via PCIe NTB)
Database servers expose a tiered PMEM device using local DRAM (mirrored via PCIe NTB) and the remote SDM pool (accessed over PCIe NTB)
ASM High Redundancy on top of the PMEM devices, with preferred mirror read and device mapper path swapping
[Diagram: database hosts db0, db1, db2 (RAC, ASM; each with PMEM and DRAM expansion) and memory hosts mem0, mem1, mem2 (each with SDM, DRAM, Optane Grid), connected through a PCIe switch in one NTB domain]
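ASM High Redundancy keeps three copies of every extent, and preferred mirror read means a node reads its preferred (typically local) copy first. The toy model below only illustrates that read path. It is not ASM code, the class `MirroredExtent` is hypothetical, and using db0, db1 and db2 as failure group names is an assumption.

```python
class MirroredExtent:
    """Toy model of a three-way mirrored block, one copy per failure group."""

    def __init__(self, failure_groups):
        self.copies = {name: None for name in failure_groups}
        self.online = set(failure_groups)

    def write(self, data):
        # Every online copy is updated so any mirror can serve later reads.
        for name in self.online:
            self.copies[name] = data

    def read(self, preferred):
        # Try the preferred mirror first, then fall back to the others.
        order = [preferred] + [n for n in self.copies if n != preferred]
        for name in order:
            if name in self.online:
                return name, self.copies[name]
        raise IOError("no mirror available")


extent = MirroredExtent(["db0", "db1", "db2"])
extent.write(b"block")
print(extent.read(preferred="db1"))
```

If the preferred mirror goes offline, the read is served from one of the remaining copies, which is what keeps the database available when one PMEM device is lost.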
Demo-Time ☺
Demo 3: Fabric Attached Memory (PCIe) and Oracle RAC
16 GB/s throughput per licensable core (4 cores, 8 threads per DB node)
85 % of native aggregated memory controller performance
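A quick back-of-the-envelope reading of these two figures, under two assumptions that the slide does not state: throughput scales linearly across the 4 cores of a node, and the 85 % refers to the same per-core figure.

```python
PER_CORE_GB_S = 16         # from the slide
CORES_PER_NODE = 4         # from the slide
FRACTION_OF_NATIVE = 0.85  # from the slide

# Assumption: linear scaling across the cores of one DB node.
per_node_gb_s = PER_CORE_GB_S * CORES_PER_NODE

# Assumption: 85 % applies to the per-core throughput figure.
implied_native_per_core_gb_s = PER_CORE_GB_S / FRACTION_OF_NATIVE

print(f"per node: {per_node_gb_s} GB/s")
print(f"implied native per core: {implied_native_per_core_gb_s:.1f} GB/s")
```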
Findings
Generic offloading is possible per se, but differently than expected:
Fabric Attached Memory
Yes, the DB is running in memory (mirrored)
Question is:
In which server’s memory (local or remote)?
How do we access it (local memory extension or DMA call)?
How is it constructed (DRAM or Software Defined Memory)?
With the right combination of PCIe switch and storage modules, you
can get it to work
Any PCIe-capable host can use Fabric Attached Memory per se
An OpenMCCA-compatible PCIe switch (PLX 9700) and high-performance M.2 SSDs
such as Optane Memory or fast NVMe modules are required
Q&A
Thanks to our supporters
Contact Information
elgreco@linux.com
Thanks

Offloading for Databases - Deep Dive
