Cognitive Systems / v3.1 / May 28 / © 2018 IBM Corporation
IBM PowerAI LMS and DDL
—
Florin Manaila
Senior IT Architect and Inventor
Cognitive Systems (HPC and Deep Learning)
IBM Systems Hardware Europe
AGENDA
• PowerAI Overview
• Large Model Support (LMS)
• Distributed Deep Learning Library (DDL)
• LMS Demo
New ways of innovation: traditional runtime vs. self-aware runtime

Traditional runtime:
• Decisions made at design time, ad hoc, based on guesses about the future
• No understanding of user goals
• Optimizes for system metrics (utilization)
• Static: Decide, Act; does not improve without user action

Self-aware runtime:
• Decisions made at runtime, evidence-based
• Understands user goals
• Optimizes for application metrics (science accomplished)
• Improves without user action: Observe, Decide, Act
AI Infrastructure Stack

• Transform & Prep Data (ETL)
• Micro-Services / Applications: segment specific (finance, retail, healthcare, automotive)
• AI APIs (e.g., Watson): speech, vision, NLP, sentiment
• In-House APIs
• Machine & Deep Learning Libraries & Frameworks: TensorFlow, Caffe, PyTorch, SparkML, Snap.ML
• Distributed Computing: Spark, MPI
• Data Lake & Data Stores: Hadoop HDFS, NoSQL DBs, parallel file system
• Accelerated Infrastructure

Offering tiers: Vision, Enterprise (L1-L3 support), Base
What’s in the training of deep neural networks?

• Neural network model: billions of parameters, gigabytes
• Computation: iterative gradient-based search, millions of iterations, mainly matrix operations
• Data: millions of images or sentences, terabytes

Goal: search for the best parameters to make the model fit the data.
Workload characteristics: both compute and data intensive!
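The "gigabytes of parameters" figure above follows from simple arithmetic. A minimal sketch (illustrative numbers, not measurements from this deck) sizing one copy of a model's parameters:

```python
def model_bytes(n_params: int, bytes_per_param: int = 4) -> int:
    """Memory for one copy of the parameters (FP32 by default)."""
    return n_params * bytes_per_param

# A hypothetical 1-billion-parameter model stored in FP32:
gib = model_bytes(1_000_000_000) / 2**30
print(f"{gib:.1f} GiB")  # roughly 3.7 GiB for the weights alone
```

Training additionally keeps gradients and optimizer state, so the working set is typically several times the raw weight size, which is what motivates LMS later in the deck.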
Data processing stages for distributed deep learning:

• Training data on storage: NVMe, SSD
• Coordination and data prep: POWER9 CPU, PCIe Gen 4
• GPU computation: GPU, 2nd-generation NVLink
• Parameter data exchange across systems: network, NVLink, GPU memory

Hillery Hunter, IBM, GTC 2018
Balanced system design for seamless data movement: IBM AC922 architecture

• Two POWER9 CPUs connected by a 4-byte X-Bus, each with DDR4 memory (170 GB/s) and coherent access to system memory (2 TB)
• Each CPU attaches to two NVIDIA V100 GPUs over NVLink at 150 GB/s; GPU-to-GPU NVLink also runs at 150 GB/s
• PCIe Gen4 / CAPI 2.0 to a Multi-Host Socket Direct InfiniBand adapter
• PCIe NVMe flash storage
NVLink CPU-GPU: reducing communication time, keeping hardware utilized

§ NVLink reduces communication time and overhead
§ Data moves GPU-to-GPU and memory-to-GPU faster, for shorter training times

ImageNet / AlexNet, minibatch size = 128:
• x86-based PCIe-attached GPU system: 170 ms
• POWER8 + Tesla P100 + NVLink: 78 ms

Advantage: data communication and GPU performance
Hillery Hunter, IBM, GTC 2018
IBM PowerAI at a glance
Large Model Support

Traditional model support (competitors): the GPU hangs off the CPU (DDR4) over PCIe; limited memory on the GPU forces a trade-off in model size / data resolution.

PowerAI with Large Model Support (LMS): the GPU connects to the POWER CPU (DDR4) over NVLink; use system memory and the GPU to support more complex models and higher-resolution data.
Large AI Models Train ~4 Times Faster
POWER9 servers with NVLink to GPUs vs. x86 servers with PCIe to GPUs

Caffe with LMS (Large Model Support), runtime of 1000 iterations, GoogLeNet model on an enlarged ImageNet dataset (2240x2240):
• Xeon x86 E5-2640 v4 with 4x V100 GPUs: 3.1 hours
• Power AC922 with 4x V100 GPUs: 49 minutes
3.8x faster
LMS Usage in Caffe (PowerAI 5.1)

LMS enables processing of high-definition images, large models, and higher batch sizes that don’t fit in GPU memory today (the maximum GPU memory available on NVIDIA P100 GPUs is 16 GB).

LMS options:
• -lms <size in KB>
• -lms_frac <x>, where 0 < x < 1.0

Example of running IBM Caffe with LMS for a deep residual network (ResNet-152):
/opt/DL/caffe-ibm/bin/caffe train -gpu 0,1,2,3 -solver=solver.prototxt -lms 10000 -lms_frac=0.5

Note that suitable “lms” and “lms_frac” values depend on the following factors:
• batch size used
• model used
• number of GPUs used
• system memory available
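To make the tuning trade-off concrete, here is a hypothetical helper (not part of PowerAI; the function name, headroom value, and thresholds are all assumptions) sketching one way to pick an -lms_frac starting point: keep as much of the working set GPU-resident as fits in the 16 GB card, minus some headroom:

```python
def suggest_lms_frac(working_set_gb: float, gpu_mem_gb: float = 16.0,
                     headroom_gb: float = 2.0) -> float:
    """Fraction of the working set to keep GPU-resident, in (0, 1)."""
    usable = gpu_mem_gb - headroom_gb
    if working_set_gb <= usable:
        return 0.9  # everything fits; keep almost all tensors on the GPU
    # Otherwise keep only the fraction that fits, with a small floor.
    return max(0.1, round(usable / working_set_gb, 2))

print(suggest_lms_frac(28.0))  # ~28 GB working set on a 16 GB GPU -> 0.5
```

In practice the batch size, model, GPU count, and available system memory listed above all shift this estimate, so the value is a starting point for experimentation, not a formula.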
PowerAI Distributed Deep Learning Library (DDL)

A communication library for distributed deep learning training:
• Enables deep learning software to scale to hundreds of servers with GPUs
• Works across a variety of system sizes
• Works with a variety of network types and switch topologies

Released results @ 256 P100 GPUs:
• Better scaling efficiency than Facebook AI Research: 95% (IBM) vs. <90% (FB)
• Higher image-recognition accuracy than Microsoft: 33.8% (IBM) vs. 29.8% (MS)

TECHNICAL DETAILS: https://arxiv.org/abs/1708.02188
Costly inter-GPU communication time can limit scaling
Example: DevBox with 4 TITAN X GPUs
Training cycle: computation plus communication
Hillery Hunter, IBM, GTC 2018
Scaling within a system suffers without peer-to-peer
Example: 16 GPUs on PCIe in a system (topology matrix of PIX/PXB PCIe paths omitted)
P2P GPUDirect is key
Hillery Hunter, IBM, GTC 2018
What does DDL do? DDL for TensorFlow:

1. Places the job on the GPU local to the CPU (negotiating to use the NVLink interface)
2. Places the job on its nearest neighbor, to leverage NVLink GPU:GPU communication
3. Places the job on the same system, on the other socket
4. Sends the job, integrating RDMA over IB (not present in the frameworks themselves), to a remote system and its first GPU

The same kind of intelligence you see in good HPC job schedulers, but created with specific tuning for our architecture.
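The four-step preference order can be sketched as a cheapest-link-first search. This is an illustrative model only (the names and the slot bookkeeping are hypothetical, not the DDL API):

```python
# Placement tiers, cheapest interconnect first, per the list above.
PLACEMENT_ORDER = [
    "local GPU on this CPU (NVLink CPU:GPU)",
    "nearest-neighbor GPU (NVLink GPU:GPU)",
    "GPU on the other socket of the same system",
    "first GPU of a remote system (RDMA over IB)",
]

def place(free: list) -> str:
    """Pick the first placement tier that still has capacity.
    `free` holds the number of open slots per tier, cheapest first."""
    for tier, slots in zip(PLACEMENT_ORDER, free):
        if slots > 0:
            return tier
    raise RuntimeError("no capacity anywhere")

print(place([0, 0, 1, 4]))  # local and neighbor full -> other socket
```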
Communication paths
PowerAI DDL: fully utilize bandwidth for links within each node and across all nodes, so learners communicate as efficiently as possible.
(Diagram: two nodes on a network switch; each node has two POWER CPUs with DDR4, four GPUs with GPU memory, PCIe-attached storage, and IB/Ethernet network links.)
Hillery Hunter, IBM, GTC 2018
PowerAI DDL training ResNet-50, ImageNet-1K, Caffe (speedup vs. number of nodes, ideal vs. actual):

#GPUs               4     8     16    32    64    128   256
#Nodes              1     2     4     8     16    32    64
Speedup             1.0   2.0   3.9   7.9   15.5  30.5  60.6
Scaling efficiency  1.00  1.00  .98   .99   .97   .95   .95
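The scaling-efficiency row is simply speedup divided by node count, relative to the one-node baseline. Recomputing it from the speedup row:

```python
nodes   = [1, 2, 4, 8, 16, 32, 64]
speedup = [1.0, 2.0, 3.9, 7.9, 15.5, 30.5, 60.6]

# Efficiency = measured speedup / ideal (linear) speedup.
efficiency = [s / n for s, n in zip(speedup, nodes)]
for n, e in zip(nodes, efficiency):
    print(f"{n:2d} nodes: {e:.1%}")
```

The last entry, 60.6 / 64, is about 94.7%, which the deck rounds to .95.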
PowerAI DDL training ResNet-101, ImageNet-22K, Caffe (speedup vs. number of nodes, ideal vs. actual):

#GPUs               4     8     64    256
#Nodes              1     2     16    64
Speedup             1.0   1.8   3.9   13.8
Scaling efficiency  1.00  .92   .86   .85

7 hours to 33.8% top-1 accuracy using 256 GPUs
TensorFlow 1.4 Performance on IBM POWER9 with V100: single node

35% more images processed per second vs. tested x86 systems

ResNet-50 testing on the ILSVRC 2012 dataset (aka ImageNet 2012): training on 1.2M images, validation on 50K images

§ Results are based on IBM internal measurements running 1000 iterations of HPM ResNet-50 on 1.2M images and validation on 50K images, with the dataset from ILSVRC 2012, also known as ImageNet 2012.
§ Software: TensorFlow 1.4.0 framework and HPM ResNet-50 from https://github.com/tensorflow/benchmarks.git (commit: f5d85aef), with the following parameters: batch size: 64 per GPU; iterations: 1100; data: ImageNet; local-parameter-device: gpu; variable-update: replicated

Date of testing: November 26, 2017
TensorFlow 1.4 Performance on IBM POWER9 with V100: multiple nodes

Distributed deep learning: IBM POWER9™ with NVIDIA Tesla V100 results in 2.3x more data processed on TensorFlow versus tested x86 systems (2.3x more images processed per second).

The PowerAI Distributed Deep Learning (DDL) library provides innovative distribution methods enabling AI frameworks to scale to multiple servers, leveraging all attached GPUs.

ResNet-50 testing on the ILSVRC 2012 dataset (also known as ImageNet 2012): training on 1.2M images, validation on 50K images

Date of testing: December 2, 2017
DDL TF operator functions/semantics

1. Init function: must be called before any real TF operators. Typically, this op can be executed on the CPU using an additional session. The input is the DDL configuration, which informs the targeted network topology and the mapping of learners onto it. The output consists of MPI information (rank, size) and the GPU assignment for TF.
2. Bcast function: broadcast synchronizes all the trainable parameters (i.e., weights and biases) before training. Broadcast can be called once Init has been called and has completed on the assigned GPU device. Each and every trainable parameter must be broadcast to ensure good convergence.
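The broadcast semantics can be shown in plain Python (illustrative only, not the DDL API): every learner overwrites its randomly initialized parameters with the root learner's copy, so all replicas start identical:

```python
def bcast(params_per_learner, root=0):
    """Return each learner's parameters after broadcasting from `root`."""
    root_copy = list(params_per_learner[root])
    return [list(root_copy) for _ in params_per_learner]

# Three learners with divergent random initializations:
learners = [[0.3, -0.1], [0.7, 0.2], [-0.5, 0.9]]
print(bcast(learners))  # every learner now holds [0.3, -0.1]
```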
3. AllReduceN function: an aggregated version of AllReduce. Essentially, it takes an array of N tensors, performs an allreduce in a single shot, and returns an array of N reduced tensors. The benefits of using AllReduceN are better performance and simpler integration.
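A minimal sketch of the AllReduceN semantics in plain Python (illustrative, not the DDL API; sum is assumed as the reduction): each of the N gradient tensors is reduced element-wise across all learners in one call, and every learner receives the same N results:

```python
def all_reduce_n(tensors_per_learner):
    """tensors_per_learner[learner][i] -> reduced[i], summed over learners."""
    n = len(tensors_per_learner[0])
    reduced = []
    for i in range(n):
        # Gather tensor i from every learner, then sum element-wise.
        cols = [learner[i] for learner in tensors_per_learner]
        reduced.append([sum(vals) for vals in zip(*cols)])
    return reduced

# Two learners, each holding N = 2 gradient tensors:
grads = [
    [[1.0, 2.0], [10.0]],
    [[3.0, 4.0], [20.0]],
]
print(all_reduce_n(grads))  # [[4.0, 6.0], [30.0]]
```

Aggregating all N tensors in one shot, rather than issuing N separate allreduces, is what the slide credits for the better performance and simpler integration.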
Thank you

Florin Manaila
Senior IT Architect and Inventor
Cognitive Systems (HPC and Deep Learning)
florin.manaila@de.ibm.com
Presented at the OpenPOWER Boot Camp in Zurich.