HPC Transformation with AI
COGNITIVE SYSTEMS
Ing. Florin Manaila
Senior Architect and Inventor
Cognitive Systems (Distributed Deep Learning and HPC)
IBM Systems Hardware Europe
Member of the IBM Academy of Technology (AoT)
March 24, 2020
Technical R&D Today: Disruption
Cognitive Systems Europe / March 24 / © 2020 IBM Corporation
Knowledge Discovery Pipeline
Infrastructure Demands for AI
Equipped for volumes of data
§ Flexible storage for a range of data demands
Inference
§ Versatile, power-efficient data center accelerators
§ Advanced I/O for minimal latency
§ Scalability and distributed data center capability
Training
§ Powerful data center accelerators with coherence
§ Advanced I/O for high bandwidth and low latency
§ Proven scalability
*** IBM and Business Partner Internal Use Only ***
Next-Generation Infrastructure Stack
AI Workflow
Distributed Deep Learning
Common options
§ Single accelerator: 1x accelerator (longer training time)
§ Data parallel: 4x accelerators
§ Model parallel: 4x accelerators
§ Data and model parallel: 4x n accelerators across System 1 ... System n (shorter training time)
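The data-parallel option can be illustrated with a minimal sketch. It assumes a toy 1-D linear model, equal shard sizes, and a plain Python function standing in for the all-reduce step; none of these names come from the slides. Each of the four hypothetical accelerators computes a gradient on its own shard, and the averaged gradient matches what a single accelerator would compute on the whole batch.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair for a toy model y ~ w * x


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Mean gradient of 0.5 * (w*x - y)^2 with respect to w on one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for an all-reduce: every worker receives the mean of all values."""
    return sum(values) / len(values)


def data_parallel_step(w: float, shards: List[List[Sample]], lr: float) -> float:
    """One update: gradients are computed per shard, averaged, then applied."""
    grads = [local_gradient(w, shard) for shard in shards]
    return w - lr * all_reduce_mean(grads)


# Hypothetical batch, split evenly over 4 accelerators.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [batch[0:1], batch[1:2], batch[2:3], batch[3:4]]

w_parallel = data_parallel_step(0.0, shards, lr=0.01)
w_single = 0.0 - 0.01 * local_gradient(0.0, batch)
```

With equal shard sizes the mean of the per-shard means equals the full-batch mean, so both updates agree; unequal shards would need a weighted average.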
Data-Parallel Framework: Distributed Learning
Large Dataset
§ Node 0: Partition 0
o GPU 0: Partition (0,0)
o GPU 1: Partition (0,1)
o GPU 2: Partition (0,2)
o GPU 3: Partition (0,3)
§ Node 1: Partition 1
o GPU 0: Partition (1,0)
o GPU 1: Partition (1,1)
o GPU 2: Partition (1,2)
o GPU 3: Partition (1,3)
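The two-level partitioning (dataset to nodes, node partition to GPUs) can be sketched as follows. This is an illustrative assumption about how the split could be done with contiguous chunks; the function names are hypothetical and the slides do not specify the actual sharding scheme.

```python
from typing import Dict, List, Tuple


def partition(items: List[int], parts: int) -> List[List[int]]:
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def hierarchical_partition(
    dataset: List[int], nodes: int, gpus_per_node: int
) -> Dict[Tuple[int, int], List[int]]:
    """Map (node, gpu) to its slice: first split per node, then per GPU."""
    result: Dict[Tuple[int, int], List[int]] = {}
    for n, node_part in enumerate(partition(dataset, nodes)):
        for g, gpu_part in enumerate(partition(node_part, gpus_per_node)):
            result[(n, g)] = gpu_part
    return result


# 16 sample indices over 2 nodes with 4 GPUs each, as in the diagram.
parts = hierarchical_partition(list(range(16)), nodes=2, gpus_per_node=4)
```

Here `parts[(0, 0)]` plays the role of Partition (0,0) and `parts[(1, 3)]` the role of Partition (1,3).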
Rearchitecting the hardware for AI
Experimentation Scaling Production
Architecture for large IBM HPC Cluster
Hardware overview for bare-metal / K8s
Architecture for large IBM HPC/AI Cluster
Hardware overview for bare-metal / K8s
Why Capable and High BW Interconnects?
§ Each part of an application runs on the best compute location
o But there are performance and programmability challenges
o Desire a highly-capable interconnect between PEs
§ Low-latency communication and high data bandwidth
§ Fine-grained + bulk data transfers
§ Consistent, unified view of memory
§ Hardware cache coherence & atomic operations
PE Type A (e.g. CPU): Large, low-latency Memory
PE Type B (e.g. GPU): Small, High-bandwidth Memory
Heterogeneous systems are attractive for efficient performance
Enterprise AI Hardware Portfolio Expansion
TRAIN: IBM Power AC922, Powering the Fastest Supercomputer
§ Best training platform with 4x faster model iteration
§ ~6x data throughput with NVLink to GPUs
§ Synergistic HW/SW offerings for ease of use and leadership performance
§ NVIDIA V100 SXM2 GPUs
DATA: IBM Power IC922, Storage Dense Server
§ Enterprise ready cloud deployment with RH OpenShift and Power Systems reliability
§ Superior I/O for data movement: PCIe Gen 4
§ Superior price/performance
INFERENCE: IBM Power IC922, Deploy AI into Production
§ Superior density and throughput to inference accelerators
§ Open design for accelerator flexibility
§ Deploy inference at scale with SW capabilities leveraging superior IO
§ NVIDIA T4 GPUs
§ Upcoming: FPGAs and ASICs
Inference server details
Form Factor
§ 19” Rack 2U Server
POWER9 Processor
§ 2 dd2.3x P9 Nimbus chips (LaGrange pkg)
§ TDP: 225 W
§ 12 (160 W), 16, 20 cores (SMT <= 4)
Memory
§ Direct Attach Memory
§ 32 DDR4 ISDIMM Slots @2400 MHz (double drop)
§ 16 DDR4 ISDIMMs @2667 MHz (single drop)
§ 16, 32, 64 GB RDIMMs
§ 2 TB Max memory
§ 340 GB/s peak memory BW (with 16x DIMMs)
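The quoted peak memory bandwidth can be checked with a rough calculation. The sketch below assumes one DIMM per memory channel in the 16-DIMM configuration, a 64-bit (8-byte) DDR4 data bus per DIMM, and that "2667 MHz" means 2667 mega-transfers per second; these are assumptions, not statements from the slides.

```python
def peak_memory_bandwidth_gbs(
    channels: int, mega_transfers_per_s: float, bus_bytes: int = 8
) -> float:
    """Theoretical peak in GB/s: channels * transfers per second * bytes per transfer."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


# 16 DIMMs at 2667 MT/s, 8 bytes per transfer (assumed values, see above).
bw_16_dimms = peak_memory_bandwidth_gbs(channels=16, mega_transfers_per_s=2667)
```

Under these assumptions the result is about 341 GB/s, which is consistent with the roughly 340 GB/s listed above. Sustained bandwidth in practice is lower than this theoretical peak.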
10 Integrated I/O Slots – Standard PCIe Riser
§ 2 PCIe G3 x16 FHFL Slots (Supports double-wide accelerator)
§ 2 PCIe G4 x16 LP Slots
§ 2 PCIe G3 x8 FHFL Slots (physically x16)
§ 2 PCIe G3 x8 FHHL Slots
§ 2 PCIe G3 x16 LP Slots
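To put the slot list in perspective, the raw per-direction bandwidth of each slot type can be estimated from the PCIe signaling rate (8 GT/s for Gen3, 16 GT/s for Gen4) and the 128b/130b line encoding. This is a sketch of the theoretical link rate only; protocol overhead is ignored, and the `slots` table is simply a transcription of the list above.

```python
# Transfer rate per lane in GT/s for the PCIe generations used in this server.
GT_PER_S = {3: 8.0, 4: 16.0}
ENCODING = 128 / 130  # 128b/130b line encoding used by Gen3 and Gen4


def pcie_bandwidth_gbs(gen: int, lanes: int) -> float:
    """Raw per-direction link bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes


# (generation, electrical lanes, slot count) transcribed from the slot list.
slots = [(3, 16, 2), (4, 16, 2), (3, 8, 2), (3, 8, 2), (3, 16, 2)]

total_slots = sum(count for _, _, count in slots)
for gen, lanes, count in slots:
    print(f"{count} x Gen{gen} x{lanes}: {pcie_bandwidth_gbs(gen, lanes):.2f} GB/s each")
```

A Gen3 x16 slot comes out at roughly 15.75 GB/s per direction and a Gen4 x16 slot at roughly twice that.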
Inference server details
Internal Storage
§ Integrated Storage Controller = None
§ 24x 2.5” SAS/SATA
Native I/O
§ 2x USB 3.0 in rear
§ 2x 1G baseT (one shared mgmt) + 1x 1G dedicated IPMI
§ Serial port, VGA port
§ TPM2.0 via Nuvoton NPCT650ABAWX included (for Secure OS and trusted boot)
MTM (machine type – model)
§ 9183-22X
Accelerators
§ Nvidia T4 Accelerator 16GB PCIe3 x16 LP
§ More to come
Networking
§ Mellanox MCX555A-ECAT 1-PORT EDR 100Gb IB CONNECTX-5 GEN3 PCIe x16 CAPI CAPABLE
§ Mellanox MCX556A-ECAT 2-PORT EDR 100Gb IB CONNECTX-5 GEN4 PCIe x16 CAPI CAPABLE
§ Mellanox MCX516A-CDAT 2-PORT 100Gb ROCE EN CONNECTX-5 GEN4 PCIe x16
§ Mellanox MCX4121A-XCAT 2-PORT 10Gb NIC&ROCE SR/Cu PCIe 3.0
§ Mellanox MCX4121A-ACAT 2-PORT 25/10Gb NIC&ROCE SR/Cu PCIe 3.0
§ Marvell BCM957810A1008ICDM 2-PORT E'NET (2X10 10Gb), PCIe Gen 2 X8
§ Marvell BCM957800A1006ICDM QUAD E'NET (2X1 + 2X10 10Gb), PCIe Gen 2 X8
§ Broadcom BCM5719-4P 1Gb E'NET(UTP) 4-PORT ADPTR, PCIE-x4
Fiber Channel
§ Broadcom LPe16002B-M6 2-PORT FIBER CHANNEL(16Gb/s), PCIE3-8X
§ Broadcom LPe32002-M2 2-PORT FIBER CHANNEL(32Gb/s), PCIE3-8X
Inference server details
8x SAS/SATA 8x SAS/SATA 8x SAS/SATA
9183-22X
OS Support – LE
§ RHEL 7.6-alt
RAS
§ Concurrent Maintenance disks
§ Redundant Hot plug Power
§ Redundant Hot plug fans
§ Customer Install and Repair
§ Simplified Op Panel
§ In-rack system service
BMC Service Processor
§ Aspeed AST2500
§ OpenBMC
Certifications
§ FCC Class A
§ ASHRAE A2 Environment (10-35 °C)
§ Acoustics Datacenter 1A
HDD Drives
HDD; 600GB; 2.5"; 10k; SAS; 12Gb/s; 4Kn/512e; SED
HDD; 1200GB; 2.5"; 10k; SAS; 12Gb/s; 4Kn/512e; SED
HDD; 2400GB; 2.5"; 10k; SAS; 12Gb/s; 4Kn/512e; Non-SED
SSD Drives
SSD; 240GB; 2.5"; SATA; 6Gb/s; 1.4 DWPD; NonSED
SSD; 960GB; 2.5"; SATA; 6Gb/s; 2.5 DWPD; NonSED
SSD; 1920GB; 2.5"; SATA; 6Gb/s; 2.5 DWPD; NonSED
SSD; 3840GB; 2.5"; SATA; 6Gb/s; 2.5 DWPD; NonSED
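The DWPD (drive writes per day) rating of the SSDs translates into a daily write budget by multiplying it with the drive capacity. The sketch below applies that relation to the listed SSD options; the warranty period over which the rating holds is not given in the slides, so total lifetime writes are deliberately left out.

```python
from typing import Dict, Tuple


def daily_write_budget_gb(capacity_gb: float, dwpd: float) -> float:
    """Data that may be written per day, in GB, at the rated endurance."""
    return capacity_gb * dwpd


# (capacity in GB, DWPD) transcribed from the SSD list.
ssd_options: Tuple[Tuple[int, float], ...] = (
    (240, 1.4),
    (960, 2.5),
    (1920, 2.5),
    (3840, 2.5),
)

budgets: Dict[int, float] = {
    capacity: daily_write_budget_gb(capacity, dwpd) for capacity, dwpd in ssd_options
}
```

For example, the 3840 GB drive at 2.5 DWPD allows about 9600 GB of writes per day, while the 240 GB drive at 1.4 DWPD allows about 336 GB per day.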
CONTROLLERS
Broadcom (LSI) MegaRAID 9361-8i SAS3 Controller w/ 8 internal ports (2GB Cache) PCIe 3.0 x8 LP with cables
Broadcom 9300-8i PCIe gen3 x8 LP with cables
Broadcom 9305-16i PCIe gen3 x8 LP with cables
OpenBMC is a free open source management software Linux distribution
Feature List:
§ REST Management
§ IPMI
§ SSH based SOL
§ Power and Cooling Management
§ Event Logs
§ Zeroconf discoverable
§ Sensors
§ Inventory
§ LED Management
§ Host Watchdog
§ Simulation
§ Code Update Support for multiple BMC/BIOS images
§ POWER On Chip Controller (OCC) Support
Features In Progress:
§ Full IPMI 2.0 Compliance with DCMI
§ Verified Boot
§ HTML5 JavaScript Web User Interface
§ BMC RAS
IBM is the OpenBMC Community Leader
§ Facebook
§ Google
§ IBM
§ Intel
§ Microsoft
§ OCP
Next-Generation Software Stack
What’s in the training of deep neural networks?
Neural network model
Billions of parameters
Gigabytes
Computation
Iterative gradient based search
Millions of iterations
Mainly matrix operations
Data
Millions of images, sentences
Terabytes
Workload characteristics: Both compute and data intensive!
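The "iterative gradient based search" built from "mainly matrix operations" can be illustrated with a minimal sketch: gradient descent on a tiny least-squares problem, written in plain Python. The model, data, learning rate, and iteration count are illustrative assumptions; real deep-learning training uses far larger models and accelerated matrix libraries.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(a: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def gradient(x: Matrix, y: Vector, w: Vector) -> Vector:
    """Gradient of 1/(2n) * ||Xw - y||^2, which is X^T (Xw - y) / n."""
    residual = [p - t for p, t in zip(matvec(x, w), y)]
    return [g / len(y) for g in matvec(transpose(x), residual)]


def train(x: Matrix, y: Vector, lr: float = 0.5, iterations: int = 200) -> Vector:
    """Repeat: compute the gradient, take a small step against it."""
    w = [0.0] * len(x[0])
    for _ in range(iterations):
        g = gradient(x, y, w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


# Toy data generated by the weights [2, 3].
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]
w = train(x, y)
```

Every iteration is dominated by the two matrix-vector products, which is the part that accelerators speed up at scale.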
AI Infrastructure Stack
ON-CLOUD and ON-PREM
§ Micro-Services / Applications: Segment Specific (Finance, Retail, Healthcare, Automotive)
§ APIs (external and in-house): Speech, Vision, NLP, Sentiment
§ Machine & Deep Learning Libraries & Frameworks: TensorFlow, Caffe, PyTorch, SparkML, Snap.ML
§ Distributed Computing: Spark, MPI
§ Data Lake & Data Stores: Hadoop HDFS, NoSQL DBs, Parallel File System
§ Accelerated Infrastructure
§ Transform & Prep Data (ETL)
§ Governance AI (Fairness, Explainable AI, Model Health, Accuracy)
Watson ML Community Edition (WMLCE)
IBM Watson Machine Learning Community Edition: Docker Containers
IBM Watson Machine Learning Community Edition: Universal Base Images (UBI)
Thank you
Florin Manaila
—
florin.manaila@de.ibm.com
ibm.com




