OpenCAPI: Next Generation of
Acceleration for the Cognitive Era
—
Brian Allison
OpenCAPI Technology and Enablement
E-mail: ballis@us.ibm.com
Industry Collaboration and Innovation
2
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium
3
Key Messages Throughout
• IBM's Strong History in IO Offerings
• OpenCAPI is an Open IO Standard
• Not tied to Power – Architecture Agnostic
• High Performance – No OS/Hypervisor/FW Overhead with Low Latency and High Bandwidth
• Programming Ease
• Very Low Accelerator Design Overhead
• Supports heterogeneous environments – Use Cases
• Ideal for Accelerated Computing and SCM
• Optimized for use within a single system node
• Products exist today!
4
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium
5
What’s Driving the Creation of a High Perf. Bus
• Historical silicon technology improvements out of steam
• More cores on a processor help but you’ll never have enough; especially for today’s emerging
workloads (analytics, artificial intelligence, machine learning, real time analysis, etc.)
• New advanced memory technologies are changing the economics of computing
• Companies realizing the need to offload the microprocessors from routine algorithms to meet demand and improve system performance → Accelerated Computing
• Accelerators - If you are going to use them, you need a lot of data in/out quickly
6
Computation Data Access
© 2018 IBM Corporation
POWER Processor Technology Roadmap

| Generation | Availability | Process | Focus | Highlights |
|---|---|---|---|---|
| POWER7 | 1H10 | 45 nm | Enterprise | 8 cores, SMT4, eDRAM L3 cache |
| POWER7+ | 2H12 | 32 nm | Enterprise | 2.5x larger L3 cache, on-die acceleration, zero-power core idle state |
| POWER8 Family | 1H14 – 2H16 | 22 nm | Enterprise & Big Data Optimized | Up to 12 cores, SMT8, CAPI acceleration, high bandwidth GPU attach |
| POWER9 Family | 2H17 – 2H18+ | 14 nm | Built for the Cognitive Era | Enhanced core and chip architecture optimized for emerging workloads; processor family with scale-up and scale-out optimized silicon; premier platform for accelerated computing |

7
Open Coherent Accelerator Processor Interface
Industry driving these necessary Accelerator attributes
• High performance (move a lot of data quickly; bandwidth, latency)
• Fulfill various accelerator form factors (e.g., GPUs, FPGAs, ASICs)
• Introduction of device coherency requirements
• Emergence of complex Storage and Memory solutions
• Growing demand for network performance with increased
computational demand
• Need to be architecture agnostic to enable ecosystem growth and adoption
8
Accelerated Computing Example
IBM’s AC922 built for Accelerated Computing
• Classic system topology setup with two POWER9 sockets, each socket having at least two NVIDIA GPUs
Does Accelerated Computing work?
• Summit Supercomputer, an IBM-built supercomputer now running at the
Department of Energy’s (DOE) Oak Ridge National Laboratory, captured the
number one spot with a performance of 143.5 petaflops on High Performance
Linpack (HPL), the benchmark used to rank the TOP500 list. Summit has 4,356
nodes, each one equipped with two 22-core Power9 CPUs, and six NVIDIA Tesla
V100 GPUs. The nodes are linked together with a Mellanox dual-rail EDR
InfiniBand network.
9
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium 10
Why OpenCAPI and what is it?
• OpenCAPI is a new 'bottoms-up' IO standard
• Key Attributes of OpenCAPI 3.0
  • Open IO Standard – Choice for developers and others to contribute and grow an ecosystem
  • Coherent interface – Microprocessor memory, accelerators and caches share the same memory space
  • Architecture agnostic – Not tied to Power; capable of going beyond the Power Architecture
  • High performance – No OS/Hypervisor/FW overhead for Low Latency and High Bandwidth
  • Ease of programming
  • Ease of implementation with minimal accelerator design overhead
  • Ideal for accelerated computing and SCM, including various form factors (FPGA, GPU, ASIC, TPU, etc.)
  • Optimized for use within a single system node
  • Supports heterogeneous environments – Use Cases
• OpenCAPI 3.1
  • Applies OpenCAPI technology to standard DRAM attached off the microprocessor
  • Based on an Open Memory Interface (OMI)
  • Further tuned for extremely low latency
11
OpenCAPI Key Attributes
12
[Diagram: Any OpenCAPI Enabled Processor (TL/DL, 25Gb I/O) attached to an Accelerated OpenCAPI Device containing the Accelerated Function and TLx/DLx]
1. Architecture agnostic bus – Applicable with any system/microprocessor architecture
2. Optimized for High Bandwidth and Low Latency
3. High performance 25Gbps PHY design with zero ‘overhead’
4. Coherency - Attached devices operate natively within application’s user space and coherently with host microprocessor
5. Virtual addressing enables low overhead with no Kernel, hypervisor or firmware involvement; security benefit
6. Wide range of Use Cases and access semantics
7. CPU coherent device memory (Home Agent Memory)
8. Architected for both Classic Memory and emerging Advanced Storage Class Memory
9. Minimal OpenCAPI design overhead (FPGA less than 5%)
[Device-side diagram labels: Caches, Application]
• Storage/Compute/Network, etc.
• ASIC/FPGA/FFSA
• FPGA, SoC, GPU Accelerator
• Load/Store or Block Access
[Memory options shown: Standard System Memory, Advanced SCM Solutions, Buffered System Memory, OpenCAPI Memory Buffers, Device Memory]
Use Cases - A True Heterogeneous Architecture Built Upon OpenCAPI
OpenCAPI 3.0
OpenCAPI 3.1
OpenCAPI specifications are
downloadable from the website
at www.opencapi.org
- Register
- Download
13
© 2018 IBM Corporation
Proposed POWER Processor Technology and I/O Roadmap

| Processor | Year | Cores | Process | Architecture Highlights | Sustained Memory Bandwidth | Standard I/O Interconnect | Advanced I/O Signaling | Advanced I/O Architecture |
|---|---|---|---|---|---|---|---|---|
| POWER7 (POWER7 Architecture) | 2010 | 8 | 45nm | New micro-architecture, new process technology | 65 GB/s | PCIe Gen2 | N/A | N/A |
| POWER7+ | 2012 | 8 | 32nm | Enhanced micro-architecture, new process technology | 65 GB/s | PCIe Gen2 | N/A | N/A |
| POWER8 (POWER8 Architecture) | 2014 | 12 | 22nm | New micro-architecture, new process technology | 210 GB/s | PCIe Gen3 | N/A | CAPI 1.0 |
| POWER8 w/ NVLink | 2016 | 12 | 22nm | Enhanced micro-architecture with NVLink | 210 GB/s | PCIe Gen3 | 20 GT/s, 160 GB/s | CAPI 1.0, NVLink |
| P9 SO (POWER9 Architecture) | 2017 | 12/24 | 14nm | New micro-architecture, direct-attach memory, new process technology | 150 GB/s | PCIe Gen4 x48 | 25 GT/s, 300 GB/s | CAPI 2.0, OpenCAPI 3.0, NVLink |
| P9 SU | 2018 | 12/24 | 14nm | Enhanced micro-architecture, buffered memory | 210 GB/s | PCIe Gen4 x48 | 25 GT/s, 300 GB/s | CAPI 2.0, OpenCAPI 3.0, NVLink |
| P9 w/ Adv. I/O | 2019 | 12/24 | 14nm | Enhanced micro-architecture, new memory subsystem | 350+ GB/s | PCIe Gen4 x48 | 25 GT/s, 300 GB/s | CAPI 2.0, OpenCAPI 4.0, NVLink |
| POWER10 | 2020+ | TBA | TBA | New micro-architecture, new technology | 435+ GB/s | PCIe Gen5 | 32 & 50 GT/s | TBA |

Statement of Direction, Subject to Change
14
POWER9 IO Features
[Diagram: POWER9 silicon die, offered in various packages (scale-out, scale-up)]
• PCIe Gen4 – 8 and 16Gbps PHY; protocols supported: PCIe Gen3 x16 and PCIe Gen4 x8, CAPI 2.0 on PCIe Gen4
• 25Gbps PHY – protocols supported: OpenCAPI 3.0, NVLink 2.0
POWER9 IO Leading the Industry
• PCIe Gen4
• CAPI 2.0 (Power)
• NVLink 2.0
• OpenCAPI 3.0
15
Why do I care about Virtual Addressing?
• An OpenCAPI device operates in the virtual address spaces of the applications that it supports
• Eliminates kernel and device driver software overhead
• Allows device to operate on application memory without kernel-level data copies/pinned pages
• Simplifies programming effort to integrate accelerators into applications
• Culmination => Improves Accelerator Performance
• The Virtual-to-Physical Address Translation occurs in the host CPU
• Reduces design complexity of OpenCAPI accelerator development
• Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
• Security - Since the OpenCAPI device never has access to a physical address, this eliminates the
possibility of a defective or malicious device accessing memory locations belonging to the kernel or
other applications that it is not authorized to access
16
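To make the virtual-addressing benefit concrete, here is a minimal sketch; the descriptor layout and afu_submit() helper are hypothetical and not part of the OpenCAPI specification or the reference designs. The point it illustrates is that a work descriptor can carry plain user-space virtual addresses, with no pinning or bounce-buffer copy, because address translation happens in the host CPU.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical work descriptor: the AFU consumes effective (virtual)
 * addresses directly, since the host CPU translates them on its behalf. */
struct afu_work_descriptor {
    uint64_t src_ea;   /* effective address of the source buffer      */
    uint64_t dst_ea;   /* effective address of the destination buffer */
    uint64_t length;   /* bytes to process                             */
    uint64_t flags;    /* operation-specific control bits              */
};

/* Stand-in for the real submission path (for example an MMIO doorbell
 * write issued through the reference user library after attaching).   */
static void afu_submit(struct afu_work_descriptor *queue_slot,
                       const struct afu_work_descriptor *desc)
{
    memcpy(queue_slot, desc, sizeof(*desc));
}

int main(void)
{
    size_t len = 1 << 20;
    char *src = malloc(len), *dst = malloc(len);
    static struct afu_work_descriptor queue_slot;

    if (!src || !dst)
        return 1;

    struct afu_work_descriptor d = {
        .src_ea = (uint64_t)(uintptr_t)src,  /* no pinning, no staging copy */
        .dst_ea = (uint64_t)(uintptr_t)dst,
        .length = len,
        .flags  = 0,
    };
    afu_submit(&queue_slot, &d);  /* accelerator works directly on the
                                     application's own memory           */
    free(src);
    free(dst);
    return 0;
}
```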
17
OpenCAPI vs. I/O Device Driver – Because minimizing SW path length is crucial for performance

Typical I/O Model Flow using a Device Driver:
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration (application dependent) → Poll / Interrupt Completion → Copy or Unpin Result Data → Return From DD Completion
The call/copy/notify path costs roughly 3,000 + 1,000 + 1,000 instructions (~4.9µs) and the completion/copy/return path roughly 300 + 10,000 instructions (~7.9µs), for a total of ~13µs of data prep.

Flow with an OpenCAPI Model:
Shared Memory Notify Accelerator (400 instructions, 0.3µs) → Acceleration (application dependent, equal to the flow above) → Shared-memory completion or fast thread wake-up (100 instructions, 0.06µs)
Total: 0.36µs

[Diagram: Processor with TL/DL 25Gb I/O and Host Memory, attached over OpenCAPI to an FPGA/SoC/GPU exposing Function0 … Functionn behind TLx/DLx, with Device Memory]
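The difference in path length is visible in host code. Below is a minimal sketch under stated assumptions: the left flow models a conventional driver-based path and the right flow models an already-attached OpenCAPI AFU that shares memory with the process. All function names are placeholders (stubs), not a real driver or library interface, and a real implementation would add memory barriers around the doorbell and completion flags.

```c
#include <stdint.h>
#include <stddef.h>

/* --- Classic device-driver flow (~13us of data prep, per the slide). ---
 * Stand-ins for the kernel round trips a driver-based flow requires.     */
static void dd_call_and_pin_source(const void *src, size_t len) { (void)src; (void)len; }
static void mmio_notify_via_driver(void)                        { }
static void poll_or_interrupt_completion(void)                  { }
static void unpin_and_copy_result(void *dst, size_t len)        { (void)dst; (void)len; }

static void classic_io_flow(const void *src, void *dst, size_t len)
{
    dd_call_and_pin_source(src, len);   /* DD call + copy/pin source data      */
    mmio_notify_via_driver();           /* MMIO notify accelerator             */
    /* ... acceleration runs ... */
    poll_or_interrupt_completion();     /* poll / interrupt completion         */
    unpin_and_copy_result(dst, len);    /* copy/unpin result + return from DD  */
}

/* --- OpenCAPI flow (~0.36us): notify and completion are plain
 *     shared-memory operations in the application's address space. --- */
static void opencapi_flow(volatile uint64_t *doorbell, volatile uint64_t *done)
{
    *doorbell = 1;              /* shared-memory notify (~400 instructions)   */
    while (*done == 0)          /* shared-memory completion or fast thread    */
        ;                       /* wake-up (~100 instructions)                */
}
```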
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium 18
So How is Accelerated Computing Leveraged?
Okay, I'm sold. But how do I leverage this technology?
1. Isolate and identify your heavily used workloads, algorithms and/or applications
2. Place this workload onto an accelerator
3. The path of least resistance and greatest flexibility is a programmable FPGA
4. Purchase any of the OpenCAPI vendor FPGA cards and an OpenCAPI-enabled system and start developing your solution!
19
Comparison of Memory Paradigms
[Diagrams: Basic DDR attach (Processor Chip → DDR4/5 data); OpenCAPI 3.1 attach (Processor Chip → DLx/TLx → emerging Storage Class Memory data); Tiered Memory (Processor Chip → DLx/TLx → DDR4/5 as Tier 1 Memory and DLx/TLx → SCM as Tier 2 Memory)]
• OpenCAPI 3.1 architecture: an ultra-low-latency ASIC memory buffer chip adds only ~7–10 ns on top of native DDR direct connect
• Storage Class Memory can be tiered with traditional DDR memory, all built upon the OpenCAPI 3.1 & 3.0 architecture
• Load/Store semantics are still available
• Storage Class Memories have the potential to be the next disruptive technology; examples include ReRAM, MRAM and Z-NAND, all racing to become the de facto main memory
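On Linux, OpenCAPI home-agent or OMI-attached memory is typically surfaced as ordinary system memory, often as its own NUMA node, so "load/store semantics" really does mean plain pointers. The sketch below assumes such a NUMA node exists (node 1 here is an illustrative assumption) and uses libnuma only to place the buffer there; after allocation, access is normal loads and stores.

```c
#include <numa.h>      /* link with -lnuma */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return 1;
    }

    /* Assumption for illustration: the OpenCAPI/OMI-attached memory
     * is exposed as NUMA node 1 on this machine. */
    const int ocapi_node = 1;
    const size_t sz = 1 << 20;

    char *buf = numa_alloc_onnode(sz, ocapi_node);
    if (!buf)
        return 1;

    /* Plain loads and stores: no special API once the memory is part
     * of the system address space. */
    memset(buf, 0xA5, sz);
    printf("first byte: 0x%02x\n", (unsigned char)buf[0]);

    numa_free(buf, sz);
    return 0;
}
```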
Acceleration Paradigms with Great Performance
21
• Egress Transform – Examples: Encryption, Compression, Erasure prior to delivering data to the network or storage
• Bi-Directional Transform – Examples: NoSQL such as Neo4J with Graph Node Traversals, etc.
• Memory Transform – Examples: Machine or Deep Learning such as Natural Language Processing, sentiment analysis, or other Actionable Intelligence using OpenCAPI attached memory
• Basic work offload
• Needle-in-a-Haystack Engine (only the needles from a large haystack of data are sent to the processor) – Examples: Database searches, joins, intersections, merges
• Ingress Transform – Examples: Video Analytics, Network Security, Deep Packet Inspection, Data Plane Accelerator, Video Encoding (H.265), High Frequency Trading, etc.
[Each paradigm diagram shows a Processor Chip attached over DLx/TLx to an accelerator (Acc) with its data]
OpenCAPI is ideal for acceleration due to the bandwidth to/from accelerators, best-of-breed latency, and the flexibility of an open architecture
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium 22
IBM AC922
Air Cooled
23
Power9 Systems with OpenCAPI
System Details
• 2-Socket 2U
• Up to 40 cores
• Up to 2TB memory (16 DDR4 DIMMs)
• 4 Gen4 PCIe Slots, 3 CAPI2.0 Enabled
• 2 2.5” SFF Drive Bays
• 4 OpenPOWER Mezzanine Sockets
• Up to 4 NVLink V100 GPUs
• Up to 4 socketed OpenCAPI Adapters*
• Up to 1 cabled OpenCAPI card w/ SlimSAS adapter*
* Future Support
24
Power9 Systems with OpenCAPI
• Dual Socket
• OCP 48V
• Up to 4TB memory (32 DDR4 DIMMs)
• 5 PCIe Gen4 Slots, 3 CAPI2.0 Enabled
• 2 x8 OCP Gen4 PCIe Mezz Slots CAPI2.0 Enabled
• 4 x8 25 Gbps SlimSAS Accelerator Ports
• Supports up to 4 OpenCAPI cabled adapters
Google & Rackspace Zaius
Motherboard
Rackspace Barreleye G2 Server
25
• Zaius Motherboard
• Full-depth 48V
Open Rack V2
• 2OU Chassis
• High Density, Hot
Swap Storage Bay
• 4 x8 25Gb/s
Coherent Attach
Ports
“The OpenCAPI accelerator and
software ecosystem is growing
rapidly. With collateral available via
the Open Compute website,
accelerator developers find it easy to
design and test their solutions on our
platform.”
Adi Gangidi, Senior Design
Engineer with Rackspace
Power9 Systems with OpenCAPI
26
• 2 Socket 2U
• Up to 48 cores
• Up to 4TB memory (32 DDR4 DIMMs)
• 4 Gen4 PCIe Slots, CAPI2.0 Enabled
• 6 Gen3 PCIe Slots
• Up to 24 SFF / 12 LFF Drives
• 4 x8 25 Gbps Ports
• Up to 4 cabled OpenCAPI
Adapters*
Mihawk
“In order to provide the best backend architecture in AI, Big Data, and
Cloud applications, Wistron POWER9 system design incorporates OpenCAPI
technology through 25Gbps high speed link to dramatically change the
traditional data transition method. This design not only improves GPU
performance, but also utilizes next generation advanced memory, coherent
network, storage, and FPGA. This is an ideal system infrastructure to
meet next decade computing world challenges.”
Donald Hwang, CTO and President of EBG at Wistron Corporation.
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium 27
28
OpenCAPI and CAPI2 Adapters
Nallatech 250S+
Storage Expansion
• Xilinx US+ KU15P FPGA
• 4 GB DDR4
• PCIe Gen4 x8 and CAPI2
• 4x M.2 Slots
• M.2 to MiniSAS or Oculink for U.2
drive support
CAPI Flash API, Accelerated DB, Burst
Buffer
Nallatech 250-SoC
Multipurpose Converged Network /
Storage
• Xilinx Zynq US+ ZU19EG FPGA
• 8/16 GB DDR4, 4/8 GB DDR4 ARM
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 4 x8 Oculink Ports support NVMe,
Network, or OpenCAPI
• 2 100Gb QSFP28 Cages
Mellanox Innova2
Network + FPGA
• Xilinx US+ KU15P FPGA
• Mellanox CX5 NIC
• 16 GB DDR4
• PCIe Gen4 x8
• 2 25Gb SFP Cages
• x8 25Gb/s OpenCAPI Support
Network Acceleration (NFV, Packet
Classification), Security Acceleration
29
OpenCAPI and CAPI2 Adapters
AlphaData ADM-9V3
High Performance Reconfigurable
Computing
• Xilinx US+ VU3P FPGA
• 16 / 32 GB DDR4
• PCIe Gen3 x16 or Gen4 x8 and CAPI2
• 2 QSFP28 Cages
• x8 25Gb/s OpenCAPI SlimSAS
Data Center, Network Accel, HPC, HFT
Bittware XUPVV4
Massive FPGA
• Xilinx US+ VU13P FPGA
• 4 288-pin DIMM Slots, DDR4 or Dual QDR
• Up to 512GB DDR4
• PCIe Gen3 x16, CAPI2 Capable
• 4 100Gb QSFP28 Cages
• 2 x8 25Gb/s OpenCAPI Support
Optimized for Thermal Performance for
Large acceleration in the Data Center
AlphaData ADM-9H7
Large FPGA with 8GB HBM
• Xilinx US+ VU37P FPGA + HBM
• 8GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 2 x8 25 Gb/s OpenCAPI Ports (support
up to 50 GB/s)
• 4 100Gb QSFP28 Cages
AlphaData ADM-9H3
Large FPGA with 8GB HBM
• Xilinx Virtex US+ VU33P-3 FPGA + HBM
• 8GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 1 x8 25 Gb/s OpenCAPI Ports (support
up to 50 GB/s)
• 2 100Gb QSFP28 Cages
ML/DL, Inference, System Modeling, HPC
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium
30
TLx and DLx Reference Designs in an FPGA
• TLx and DLx will be provided as reference designs to OpenCAPI consortium members
  • Associated reference design specifications for TLx and DLx will also be delivered along with the RTL
• Open Verilog – free to enhance, improve or leverage pieces of the reference design for your own accelerator development
• Designed for 64B packet flow running at 400MHz
• Xilinx Vivado 2017.1 TLx and DLx statistics on a VU3P device (table below)
31

| VU3P Resources | CLB FlipFlops | LUT as Logic | LUT Memory | Block RAM Tile |
|---|---|---|---|---|
| DLx | 9392/788160 (1.19%) | 19026/394080 (4.82%) | 0/197280 (0%) | 7.5/720 (1.0%) |
| TLx | 13806/788160 (1.75%) | 8463/394080 (2.14%) | 2156/197280 (1.09%) | 0/720 (0%) |
| Total | 23108/788160 (2.94%) | 27849/394080 (6.98%) | 2156/197280 (1.09%) | 7.5/720 (1.0%) |

Because power efficiency and size do matter
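As a rough sanity check on the 64B-per-cycle, 400 MHz design point noted above (assuming 8 lanes at ~25 Gb/s each, and ignoring protocol overhead):

```latex
\[
64\,\mathrm{B} \times 400\,\mathrm{MHz} = 25.6\ \mathrm{GB/s}
\;\approx\;
\frac{8 \times 25\,\mathrm{Gb/s}}{8\ \mathrm{bits/B}} = 25\ \mathrm{GB/s}
\]
```

So the reference TLx/DLx datapath can keep pace with the raw bandwidth of the x8 25Gb/s link.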
Elements of the OpenCAPI Simulation Environment
32
OpenCAPI Device
• Customer application and accelerator
• Operating system enablement
  • Little Endian Linux
  • Reference kernel driver (ocxl)
  • Reference user library (libocxl)
• Hardware and reference designs to enable coherent acceleration
[Diagram: host side with Core, OS, App (software), libocxl, ocxl, TL/DL over the 25G link and coherent memory, connected by a cable to the device side with TLx/DLx, Accelerated Function(s) and coherent memory]
• OCSE (OpenCAPI Simulation Environment) models the red-outlined area of the diagram
• OCSE enables AFU and application co-simulation when the reference libocxl and reference TLx/DLx are used
• Contributed to the Enablement Workgroup under the OpenCAPI consortium
Exerciser Examples – Provided to OCC Members
• MemCopy
  • The MemCopy example is a data mover from a source address to a destination address using Virtual Addressing and includes these features:
    • Configuration and MMIO register space
    • acTag table used for Bus/Device/Function and Process ID identification
    • 512 processes/contexts and 32 engines supporting up to 2K transfers using 64B, 128B, or 256B operations
• Memory Home Agent
  • The Memory Home Agent example implements memory off the endpoint OpenCAPI accelerator to act as a coherent extension to the host processor memory
  • The Memory Home Agent example includes these features:
    • Configuration and MMIO register space
    • Individual and pipelined operation for memory loads and stores
    • Interrupts, with error details reported to software through MMIO registers
    • Sparse Address Mapping feature to extend 1 MB of real space to 4 TB of address space
• Open examples – free to enhance, improve or leverage pieces of the exerciser examples for your own accelerator development (see the sketch below)
33
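As an illustration of how a MemCopy-style exerciser is typically driven from the host, the sketch below maps the AFU's per-process MMIO space (for example via the reference user library after attaching), writes source/destination virtual addresses and a length, rings a control register, and polls a status register. The register offsets and names here are invented for the sketch; the real MemCopy exerciser documents its own register map.

```c
#include <stdint.h>

/* Hypothetical per-process MMIO register offsets for a MemCopy-style AFU. */
enum {
    REG_SRC_EA  = 0x00,  /* source effective (virtual) address        */
    REG_DST_EA  = 0x08,  /* destination effective (virtual) address   */
    REG_LENGTH  = 0x10,  /* transfer size (moved as 64B/128B/256B ops) */
    REG_CONTROL = 0x18,  /* write 1 to start an engine                */
    REG_STATUS  = 0x20,  /* non-zero once the copy has completed      */
};

static inline void mmio_write64(volatile uint8_t *mmio, uint64_t off, uint64_t val)
{
    *(volatile uint64_t *)(mmio + off) = val;
}

static inline uint64_t mmio_read64(volatile uint8_t *mmio, uint64_t off)
{
    return *(volatile uint64_t *)(mmio + off);
}

/* Drive one copy through the AFU; 'mmio' is the per-process MMIO area
 * that the application mapped after opening and attaching to the AFU. */
void afu_memcopy(volatile uint8_t *mmio, const void *src, void *dst, uint64_t len)
{
    mmio_write64(mmio, REG_SRC_EA,  (uint64_t)(uintptr_t)src);
    mmio_write64(mmio, REG_DST_EA,  (uint64_t)(uintptr_t)dst);
    mmio_write64(mmio, REG_LENGTH,  len);
    mmio_write64(mmio, REG_CONTROL, 1);          /* kick the engine   */
    while (mmio_read64(mmio, REG_STATUS) == 0)   /* poll completion   */
        ;
}
```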
Reference Card Design Number 1
• Definition of the FPGA reference card(s) is driven as part of the Enablement Work Group
• Definition of the cable(s) is driven as part of the PHY Mechanical Work Group
• A representative diagram is shown below
34
[Diagram: FPGA card attached through the Amphenol SlimSAS OpenCAPI connector; the OpenCAPI protocol, including discovery, configuration and enumeration, runs over the 25Gbps link; the convenient fixture supplies power and ground only]
Reference Card Design Number 2
• Definition of the FPGA reference card(s) is driven as part of the Enablement Work Group
• Definition of the cable(s) will be driven as part of the PHY Mechanical Work Group
• A representative diagram is shown below
35
[Diagram: Amphenol SlimSAS OpenCAPI connector, mezzanine-based form factor]
Table of Enablement Deliveries
36

| Item | Delivery Name | Where to Obtain | Available |
|---|---|---|---|
| OpenCAPI 3.0 TLx and DLx Reference Xilinx FPGA Designs (RTL and Specifications) | <snapshot>.tar.gz | Enablement WG | Today |
| Xilinx Vivado Project Build with Memcopy Exerciser | Vivado Project Flow | Enablement WG | Today |
| Device Discovery and Configuration Specification and RTL | OpenCAPI 3.0 Configuration Sub-System Reference Design Specification | Enablement WG Causeway | Today |
| AFU Interface Specification | TLX 3.0 Reference Design.pdf | Enablement WG Causeway | Today |
| 25Gbps PHY Signal Specification | OC PHY 25G Specification | PHY Signaling WG Causeway | Today |
| 25Gbps PHY Mechanical Specification | 25Gbps Interface Mechanical Spec | PHY Mechanical WG Causeway | Today |
| OpenCAPI Simulation Environment (OCSE) | ocse-<version>.tar.gz, OpenCAPIDemokit.pdf | Enablement WG | Today |
| Memcopy and Memory Home Agent Exercisers | MCP3 and LPC, <snapshot>.tar.gz | Enablement WG | Today |
| Reference Driver | LIBOCXL | Ubuntu 18.04, GitHub | Today |
SmartDV OpenCAPI VIP Environment
contains:
• Complete regression suite
• Usage examples
• Detailed documentation
• User’s Guide and Release Notes
Benefits
• Complete Verification of
OpenCAPI Design
• Easy to Use
• Simplify Result Analysis
• Runs in every sim environment
37
SmartDV
OpenCAPI Verification IP
http://www.smart-dv.com/vip/opencapi.html
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
Design Enablement
Performance Metrics
OpenCAPI Consortium
38
CAPI and OpenCAPI Performance
39

| DMA Operation | CAPI 1.0 (PCIe Gen3 x8, measured BW @8Gb/s) | CAPI 2.0 (PCIe Gen4 x8, measured BW @16Gb/s) | OpenCAPI 3.0 (25Gb/s x8, measured BW @25Gb/s) |
|---|---|---|---|
| 128B DMA Read | 3.81 GB/s | 12.57 GB/s | 22.1 GB/s |
| 128B DMA Write | 4.16 GB/s | 11.85 GB/s | 21.6 GB/s |
| 256B DMA Read | N/A | 13.94 GB/s | 22.1 GB/s |
| 256B DMA Write | N/A | 14.04 GB/s | 22.0 GB/s |

CAPI 1.0: POWER8, introduced in 2013. CAPI 2.0: POWER9, second generation. OpenCAPI 3.0: POWER9, an open architecture designed from a clean slate and focused on bandwidth and latency. Measured with Xilinx KU60/VU3P FPGAs.
Latency Test Results

| Link | Host / FPGA | Measured CPU Turnaround | Jitter | Adapter Stack | Total Latency |
|---|---|---|---|---|---|
| OpenCAPI Link | P9 OpenCAPI (3.9GHz Core, 2.4GHz Nest) / Xilinx FPGA VU3P | 298ns‡ (includes host TL, DL, PHY) | 2ns | TLx, DLx, PHYx (80ns‖) | 378ns† |
| PCIe G4 Link | P9 PCIe Gen4 / Xilinx FPGA VU3P | est. <337ns | – | Xilinx PCIe HIP (218ns¶) | est. <555ns§ |
| PCIe G3 Link | P9 PCIe Gen3 (3.9GHz Core, 2.4GHz Nest) / Altera FPGA Stratix V | 337ns | 7ns | Altera PCIe HIP (400ns¶) | 737ns§ |
| PCIe G3 Link | Kaby Lake PCIe Gen3* / Altera FPGA Stratix V | 376ns | 31ns | Altera PCIe HIP (400ns¶) | 776ns§ |

* Intel Core i7 7700 Quad-Core 3.6GHz (4.2GHz Turbo Boost)
† Derived from round-trip time minus simulated FPGA app time
‡ Derived from round-trip time minus simulated FPGA app time and simulated FPGA TLx/DLx/PHYx time
§ Derived from measured CPU turnaround time plus vendor-provided HIP latency
‖ Derived from simulation
¶ Vendor-provided latency statistic
RACE TO ZERO LATENCY
BECAUSE JITTER MATTERS
40
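Per the footnotes, each total-latency figure is simply the measured (or estimated) CPU turnaround plus the corresponding adapter-stack latency:

```latex
\begin{aligned}
\text{OpenCAPI: } & 298\,\mathrm{ns} + 80\,\mathrm{ns} = 378\,\mathrm{ns}\\
\text{PCIe Gen4 (est.): } & 337\,\mathrm{ns} + 218\,\mathrm{ns} = 555\,\mathrm{ns}\\
\text{PCIe Gen3 (POWER9): } & 337\,\mathrm{ns} + 400\,\mathrm{ns} = 737\,\mathrm{ns}\\
\text{PCIe Gen3 (Kaby Lake): } & 376\,\mathrm{ns} + 400\,\mathrm{ns} = 776\,\mathrm{ns}
\end{aligned}
```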
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
Design Enablement
Performance Metrics
OpenCAPI Consortium
41
OpenCAPI Consortium Overview
Goals
1. Provide a forum that gives the industry the ability to innovate on the next-generation bus protocol
2. Drive hardware/software innovation to enable choice and efficiency in data center
architectures
3. Build an ecosystem with the flexibility to build servers and data centers best suited for their
computational demands
Mission
Create an open coherent high performance bus interface based on a new bus standard called Open
Coherent Accelerator Processor Interface (OpenCAPI) and grow the ecosystem that utilizes this interface.
Incorporated September 13, 2016
Announced October 14, 2016
42
OpenCAPI Consortium
• Open forum founded by AMD, Google, IBM, Mellanox, and Micron
• Manage the OpenCAPI specification, establish enablement, grow the ecosystem
• Currently about 40 members
• Why its own consortium? It is architecture agnostic and thus capable of going beyond the Power Architecture
• Consortium now established
• Established Board of Directors (AMD, Google, IBM, Mellanox Technologies, Micron, NVIDIA, Western
Digital, Xilinx)
• Governing Documents (Bylaws, IPR Policy, Membership) with established Membership Levels
• Technical Steering Committee and Marketing/Communications Committee
• Website www.opencapi.org
• OpenCAPI 3.0 and 3.1 specifications, as contributed to the consortium, are available as the starting point for the Work Groups
• OpenCAPI 4.0 has now been added to the website!
• AFU Coherent Data Caching of System Memory
• AFU Address Translation Caching (allows posted operations to system memory)
Incorporated September 13, 2016
Announced October 14, 2016
43
OpenCAPI Workgroup Status
46

| Item | Availability | Specs Available for Review |
|---|---|---|
| OpenCAPI Technical Steering Committee | Up and running | WG Process Spec |
| Marketing & Communications Committee | Up and running | Regular Newsletters |
| PHY Signaling Workgroup | Up and running | Today |
| PHY Mechanical Workgroup | Up and running | May 2019 |
| TL Architecture Specification Workgroup | Up and running | Today |
| DL Architecture Specification Workgroup | Up and running | April 2019 |
| Enablement Workgroup | Up and running | Today |
| Compliance Workgroup | Up and running | Ongoing |
| Accelerator/Memory Workgroup | Forthcoming | – |
Cross Industry Collaboration and Innovation
47
OpenCAPI Protocol
Welcoming new members in all areas of the ecosystem: systems and software, SW, research & academic, products and services, deployment, SoC, and accelerator solutions
Membership Entitlement Details
Strategic level - $25K
• Draft and Final Specifications and
enablement
• License for Product development
• Workgroup participation and voting
• TSC participation
• Vote on new Board Members
• Nominate and/or run for officer election
• Prominent listing in appropriate materials
Observing level - $5K
• Final Specifications and enablement
• License for Product development
Contributor level - $15K
• Draft and Final Specifications and
enablement
• License for Product development
• Workgroup participation and voting
• TSC participation
• Submit proposals
Academic and Non-Profit level - Free
• Final Specifications and enablement
• Workgroup participation and voting
48
OpenCAPI Consortium Next Steps
JOIN TODAY!
www.opencapi.org
49