Scalable Deep Learning on Distributed GPUs
Emiliano Molinaro, PhD
eScience Center & Institut for Matematik og Datalogi
Syddansk Universitet, Odense, Denmark
Outline

• Introduction
• Deep Neural Networks:
  - Model training
  - Scalability on multiple GPUs
• Distributed architectures:
  - Async Parameter Server
  - Sync AllReduce
• Data pipeline
• Application: Variational Autoencoder for clustering of single cell gene expression levels
• Summary
Introduction
Deep Neural Networks (DNNs) and Deep Learning (DL) are increasingly integral to public and private research and to industrial applications
MAIN REASONS:
- rise in computational power
- advances in data science
- high availability of IoT and big data
APPLICATIONS:
- speech recognition
- image classification
- computer vision
- anomaly detection
- recommender systems
- …
Growing data volumes and model complexity demand more computing power and memory, such as High Performance Computing (HPC) resources
Efficient parallel and distributed algorithms and frameworks for scaling DL are crucial to speed up the training process and to handle big data processing and analysis
Deep Neural Network
(Figure: fully connected network — input layer, hidden layers, output layer)
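As a concrete reference for the network sketched above, here is a minimal TensorFlow/Keras version of such a fully connected DNN; the layer widths and input size are arbitrary placeholders, not values from the talk.

```python
import tensorflow as tf

# Minimal fully connected DNN: input layer -> hidden layers -> output layer.
# Layer sizes below are illustrative placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),               # input layer
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer
])
model.summary()
```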
Training loop
(Figure: training loop)
- training data → input pipeline
- forward propagation: predictions
- loss function
- backward propagation: gradients
- update model parameters
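A minimal TensorFlow sketch of this loop, assuming a toy model and a toy dataset (none of the names below come from the talk):

```python
import tensorflow as tf

# Toy model standing in for the real one.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

# Input pipeline feeding batches of (features, labels) training data.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 20]),
     tf.random.uniform([1024], maxval=10, dtype=tf.int64))
).batch(32)

@tf.function
def train_step(x, y):
    # Forward propagation: predictions and loss function.
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_fn(y, logits)
    # Backward propagation: gradients of the loss w.r.t. the parameters.
    grads = tape.gradient(loss, model.trainable_variables)
    # Update the model parameters.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

for x_batch, y_batch in dataset:
    train_step(x_batch, y_batch)
```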
Training on a single device
(Figure: CPU and a single GPU — compute gradients and loss function, update model parameters)
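On a single machine TensorFlow places the heavy operations on the GPU automatically; a small sketch of making that placement explicit (the device name assumes one visible GPU):

```python
import tensorflow as tf

# List the accelerators TensorFlow can see on this node.
print(tf.config.list_physical_devices("GPU"))

# Pin a computation to the first GPU explicitly.
with tf.device("/GPU:0"):
    x = tf.random.normal([256, 64])
    w = tf.Variable(tf.random.normal([64, 10]))
    logits = tf.matmul(x, w)   # runs on GPU:0 if a GPU is available
```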
Scaling on multiple GPUs
(Figure: one CPU host connected to four GPUs)
Scaling on multiple devices
Data parallelism
DNN copied on each worker and trained with a subset of the input data (sketched below)

(Figure: training data → input pipeline → worker 1, worker 2, worker 3; parameters sync: average gradients)
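A sketch of single-node data parallelism with tf.distribute.MirroredStrategy; the model and data are placeholders. Each GPU holds a replica of the model and receives a slice of every batch, and the gradients are averaged across replicas before the parameter update.

```python
import tensorflow as tf

# One replica per visible GPU (falls back to a single CPU replica if none).
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Toy data; the global batch is split across the replicas automatically.
x = tf.random.normal([1024, 20])
y = tf.random.uniform([1024], maxval=10, dtype=tf.int64)
model.fit(x, y, batch_size=32 * strategy.num_replicas_in_sync, epochs=1)
```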
Model parallelism
DNN is divided across the different workers

(Figure: training data flows through layers split across worker 1, worker 2 and worker 3)
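TensorFlow has no single switch for model parallelism; a minimal hand-rolled sketch that splits a two-layer model across two GPUs is shown below. The device names are assumptions, and soft device placement lets it fall back to the CPU if fewer GPUs are present.

```python
import tensorflow as tf

tf.config.set_soft_device_placement(True)  # fall back gracefully if < 2 GPUs

# First layer lives on GPU:0, second layer on GPU:1; activations cross devices.
with tf.device("/GPU:0"):
    w1 = tf.Variable(tf.random.normal([64, 512]))
with tf.device("/GPU:1"):
    w2 = tf.Variable(tf.random.normal([512, 10]))

@tf.function
def forward(x):
    with tf.device("/GPU:0"):
        h = tf.nn.relu(tf.matmul(x, w1))
    with tf.device("/GPU:1"):
        return tf.matmul(h, w2)

logits = forward(tf.random.normal([32, 64]))
```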
Data parallelism architectures
Async Parameter Server

(Figure: parameter servers PS1, PS2 connected to worker 1, worker 2, worker 3)
- parameter servers store the variables
- all workers are independent
- workers do the bulk of the computation
- the architecture is easy to scale
- downside: workers can get out of sync and delay convergence

Sync AllReduce

(Figure: worker 1, worker 2, worker 3, worker 4 exchanging gradients directly)
- approach more common on systems with fast accelerators
- no parameter servers: each worker has its own copy of the model parameters
- all workers are synchronized
- workers communicate among themselves to propagate the gradients and update the model parameters
- AllReduce algorithms are used to combine the gradients, depending on the type of communication: Ring AllReduce, NVIDIA's NCCL, …
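With TensorFlow's tf.distribute API, the two architectures map roughly onto ParameterServerStrategy and MultiWorkerMirroredStrategy. A sketch follows, assuming TF 2.4+ and that each process is launched with a TF_CONFIG environment variable describing the cluster; in practice only one of the two branches would be used.

```python
import tensorflow as tf

USE_ALLREDUCE = True  # pick one architecture per job

if USE_ALLREDUCE:
    # Sync AllReduce: every worker keeps a copy of the model parameters and the
    # gradients are combined with collective ops (NCCL here, suited to NVLink GPUs).
    options = tf.distribute.experimental.CommunicationOptions(
        implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
    strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)
else:
    # Async Parameter Server: variables live on parameter-server tasks and the
    # workers train independently, driven by a coordinator process.
    resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
```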
Pipelining
The input pipeline preprocesses the data and makes it available for training on the GPUs
However, GPUs process data and perform calculations much faster than a CPU, so the input can become a bottleneck
To avoid starving the accelerators, build an efficient ETL data pipeline:
1. Extract phase: read data from persistent storage
2. Transform phase: apply different transformations to the input data
(shuffle, repeat, map, batch, …)
3. Load phase: provide the processed data to the accelerator for training
Parallelize the ETL phases (see the sketch below):
- read multiple files in parallel
- distribute data transformation operations over multiple CPU cores
- prefetch data for the next step during backward propagation
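A sketch of such a pipeline with tf.data; the file pattern and the feature schema in parse_example are hypothetical placeholders.

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE on TF < 2.4

def parse_example(serialized):
    # Hypothetical schema: one fixed-length float feature vector per record.
    features = {"x": tf.io.FixedLenFeature([100], tf.float32)}
    return tf.io.parse_single_example(serialized, features)

dataset = (
    tf.data.Dataset.list_files("data/train-*.tfrecord")   # Extract: list the shards
    .interleave(tf.data.TFRecordDataset,                  # read several files in parallel
                cycle_length=4, num_parallel_calls=AUTOTUNE)
    .shuffle(buffer_size=10_000)                          # Transform: shuffle ...
    .map(parse_example, num_parallel_calls=AUTOTUNE)      # ... and parse on several CPU cores
    .batch(256)
    .prefetch(AUTOTUNE)                                   # Load: prepare the next batch
)                                                         # while the GPU is training
```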
Single cell RNA sequencing data
(Figure: clustering of cell types — CD14+ Monocytes, Double negative T cells, Mature B cell, CD8 Effector, NK cells, Plasma cell, FCGR3A+ Monocytes, CD8 Naive, Megakaryocytes, Immature B cell, Dendritic cells, pDC)
Variational Autoencoder (VAE)
research done in collaboration with Institut for Biokemi og Molekylær Biologi, SDU
~10k peripheral blood mononuclear cells
Dataset from: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
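The talk does not give the VAE architecture, but the general idea of a variational autoencoder over expression profiles can be sketched as below; the dimensions and layer sizes are made-up placeholders, not the model used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers

n_genes = 2000    # input dimension: gene expression levels per cell (placeholder)
latent_dim = 10   # latent space used for clustering (placeholder)

class Sampling(layers.Layer):
    """Reparameterization trick: draw z ~ N(mu, sigma^2)."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# Encoder: expression profile -> parameters of the latent distribution.
enc_in = layers.Input(shape=(n_genes,))
h = layers.Dense(256, activation="relu")(enc_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])
encoder = tf.keras.Model(enc_in, [z_mean, z_log_var, z])

# Decoder: latent vector -> reconstructed expression profile.
dec_in = layers.Input(shape=(latent_dim,))
h = layers.Dense(256, activation="relu")(dec_in)
dec_out = layers.Dense(n_genes, activation="softplus")(h)
decoder = tf.keras.Model(dec_in, dec_out)

# Training minimizes a reconstruction loss plus the KL divergence between
# N(z_mean, exp(z_log_var)) and the standard normal prior; the latent codes
# are then used to cluster the cells.
```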
Scaling of VAE training

Simulation done on GPU nodes of the Puhti supercomputer, CSC, Finland

Node specs:
- CPU: 2 x 20-core Intel Xeon Gold 6230 @ 2.1 GHz, 384 GiB memory
- GPU: 4 x NVIDIA V100 (32 GB each), connected with NVLink

Collective ops:
- TensorFlow ring-based gRPC
- NVIDIA's NCCL

Scaling performance depends on the device topology and the AllReduce algorithm
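For reference, a short sketch of how the two collective implementations compared above can be selected through the tf.distribute API (TF 2.4+); which one performs better depends on the interconnect between the devices.

```python
import tensorflow as tf

# Ring-based gRPC collectives vs NVIDIA's NCCL.
ring = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.RING)
nccl = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)

# NCCL is typically preferred on NVLink-connected GPUs.
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=nccl)
```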
Summary
❖ Scaling DNNs on multiple GPUs is crucial to speed up model training and
make DL suitable for big data processing and analysis
❖ Implementation of distributed architecture/parallel programming depends on
the model complexity and the device topology
❖ Construction of efficient input data pipelines improves training performance
❖ Further applications:
- development of algorithms for parameter optimization
(hyper-parameter search, architecture search)
- integration of distributed training with big data frameworks
(Spark, Hadoop, etc.)
- integration with cloud computing services (SDUCloud)