Training Deep Learning Models
on Multi-GPUs Systems
Dmitry Spodarets
Odessa / Machine Learning Meetup / 27.02.2019
Multi-GPUs Systems
Distributed training
Data parallel vs model parallel
Faster or larger models?
Distributed TensorFlow training
https://www.youtube.com/watch?v=bRMGoPqsn20
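The data-parallel half of this slide can be illustrated with a small pure-Python sketch. It is only a toy under stated assumptions: a one-parameter model y = w * x, a batch that divides evenly across workers, and hypothetical helper names (grad_mse, shard, data_parallel_grad) that do not come from the talk. Every worker keeps a full copy of w, sees one shard of the batch, and the per-worker gradients are averaged. Model parallelism (splitting the model itself across devices) is not shown.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def shard(seq, num_workers):
    """Split a sequence into num_workers contiguous, equal-sized shards.

    Assumes len(seq) is divisible by num_workers.
    """
    size = len(seq) // num_workers
    return [seq[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_grad(w, xs, ys, num_workers):
    """Average of the gradients each worker computes on its own shard."""
    local_grads = [
        grad_mse(w, shard_x, shard_y)
        for shard_x, shard_y in zip(shard(xs, num_workers), shard(ys, num_workers))
    ]
    return sum(local_grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # data generated by the "true" weight 2.0
w = 0.5

full_batch = grad_mse(w, xs, ys)
parallel = data_parallel_grad(w, xs, ys, num_workers=4)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, which is why data parallelism buys speed without changing the update rule.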
Distributed training
Asynchronous vs Synchronous
Fast or precise?
Keras - multi-GPU training is not automatic :(
https://keras.io/utils/#multi_gpu_model
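The "fast or precise?" trade-off can be sketched with a toy simulation. This is not how any particular framework implements it; it is a simplified model, assuming a single shared weight, a quadratic loss, and workers that all read the same snapshot before their updates land one after another. The function names are hypothetical.

```python
TARGET = 3.0  # minimum of the toy loss f(w) = (w - TARGET) ** 2


def grad(w):
    """Gradient of the toy loss."""
    return 2.0 * (w - TARGET)


def synchronous(w, lr, num_workers, rounds):
    """All workers compute on the same weights; one averaged update per round."""
    for _ in range(rounds):
        grads = [grad(w) for _ in range(num_workers)]
        w -= lr * sum(grads) / num_workers
    return w


def asynchronous(w, lr, num_workers, rounds):
    """Each worker applies its own update as soon as it is ready.

    All workers read the same snapshot, so every update after the first
    in a round is computed from stale weights.
    """
    for _ in range(rounds):
        snapshot = w
        for _ in range(num_workers):
            w -= lr * grad(snapshot)
    return w


sync_w = synchronous(0.0, lr=0.3, num_workers=4, rounds=10)
async_w = asynchronous(0.0, lr=0.3, num_workers=4, rounds=10)
small_sync = synchronous(0.0, lr=0.05, num_workers=4, rounds=10)
small_async = asynchronous(0.0, lr=0.05, num_workers=4, rounds=10)
```

In this toy setup the asynchronous loop makes more progress per round when the step size is small, but the stale gradients make it overshoot and diverge at a step size the synchronous loop handles without trouble.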
Bottlenecks
RAM / CPU I/O
Connections
NVLink (200 GB/sec)
NVSwitches (300 GB/sec)
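A rough back-of-the-envelope calculation shows why the interconnect matters. The sketch below assumes a ring all-reduce, in which each GPU sends about 2 * (N - 1) / N times the gradient size per step, and treats the quoted figures as usable bandwidth, which is optimistic. The model size (100 million float32 parameters), the PCIe 3.0 x16 figure of roughly 16 GB/sec, and the function name are assumptions added for illustration, not numbers from the talk.

```python
def allreduce_seconds(num_params, num_gpus, bandwidth_gb_s, bytes_per_param=4):
    """Estimated time for one ring all-reduce of the gradients.

    Ignores latency and overlap with computation; bandwidth is in GB/sec.
    """
    payload_gb = num_params * bytes_per_param / 1e9
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / bandwidth_gb_s


NUM_PARAMS = 100_000_000  # hypothetical model size
NUM_GPUS = 4

estimates = {
    "PCIe 3.0 x16 (~16 GB/sec, assumed)": allreduce_seconds(NUM_PARAMS, NUM_GPUS, 16),
    "NVLink (200 GB/sec)": allreduce_seconds(NUM_PARAMS, NUM_GPUS, 200),
    "NVSwitch (300 GB/sec)": allreduce_seconds(NUM_PARAMS, NUM_GPUS, 300),
}

for link, seconds in estimates.items():
    print(f"{link}: {seconds * 1000:.1f} ms per gradient exchange")
```

Under these assumptions the exchange costs tens of milliseconds over PCIe and a few milliseconds over NVLink or NVSwitch, and that cost is paid on every training step.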
Horovod: a distributed training framework for
TensorFlow, Keras, PyTorch, and MXNet
Horovod Stack
● Plugs into TensorFlow via custom op mechanism
● Uses MPI for worker discovery and reduction coordination
● Uses NVIDIA NCCL for actual reduction on the server and across servers
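The "reduction" in the bullets above is an all-reduce: every worker ends up with the sum (or average) of all workers' gradients. A common way to do this is the ring all-reduce described in Uber's Horovod write-up. The following pure-Python simulation is a sketch of that idea only. It does not call NCCL or MPI, and the function name and data layout are made up for illustration.

```python
def ring_allreduce(worker_chunks):
    """Simulate a ring all-reduce.

    worker_chunks[i][c] is chunk c (a list of numbers) held by worker i;
    there must be exactly as many chunks as workers. Returns the state of
    every worker afterwards: each holds the element-wise sum of all inputs.
    """
    n = len(worker_chunks)
    state = [[list(chunk) for chunk in chunks] for chunks in worker_chunks]

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so the sends are simultaneous.
        sends = []
        for i in range(n):
            c = (i - step) % n
            sends.append((c, list(state[i][c])))
        for i, (c, payload) in enumerate(sends):
            dst = (i + 1) % n
            state[dst][c] = [a + b for a, b in zip(state[dst][c], payload)]

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            c = (i + 1 - step) % n
            sends.append((c, list(state[i][c])))
        for i, (c, payload) in enumerate(sends):
            dst = (i + 1) % n
            state[dst][c] = payload

    return state


# Four workers, each holding a 4-chunk gradient vector.
inputs = [
    [[k + 1], [10 * (k + 1)], [100 * (k + 1)], [1000 * (k + 1)]]
    for k in range(4)
]
result = ring_allreduce(inputs)
```

Each worker only ever talks to its neighbour in the ring, so the traffic per worker stays roughly constant as more workers are added.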
Horovod Example - Keras
import keras
from keras import backend as K
import tensorflow as tf
import horovod.keras as hvd
# Initialize Horovod.
hvd.init()
# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))
# Build model…
model = ...
opt = keras.optimizers.Adadelta(1.0)
# Add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt, metrics=['accuracy'])
# Broadcast initial variable states from rank 0 to all other processes.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x_train, y_train, callbacks=callbacks, epochs=10, validation_data=(x_test, y_test))
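Two per-process decisions usually accompany a script like the one above: scaling the learning rate with the number of workers, and letting only one process write checkpoints and print progress. The helpers below are a hypothetical sketch that takes plain integers so it can run without Horovod installed. In a real script the values would come from hvd.size() and hvd.rank(). Linear learning-rate scaling is a common heuristic and not a guarantee. It is also not part of the slide's code, which passes a fixed 1.0 to Adadelta.

```python
def scaled_learning_rate(base_lr, num_workers):
    """Linear scaling heuristic: the effective batch grows with the worker count."""
    return base_lr * num_workers


def should_write_checkpoint(rank):
    """Only rank 0 saves, so workers do not overwrite each other's files."""
    return rank == 0


def fit_verbosity(rank):
    """Keras-style verbose flag: progress output from rank 0 only."""
    return 1 if rank == 0 else 0


# Example values for a 4-process job; in practice use hvd.size() / hvd.rank().
num_workers = 4
for rank in range(num_workers):
    lr = scaled_learning_rate(1.0, num_workers)
    print(rank, lr, should_write_checkpoint(rank), fit_verbosity(rank))
```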
Running Horovod
To run on a machine with 4 GPUs:
$ mpirun -np 4 \
    -H localhost:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py
To run on 4 machines with 4 GPUs each:
$ mpirun -np 16 \
    -H server1:4,server2:4,server3:4,server4:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py
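The host list and the process count together decide which GPU each process ends up on. The sketch below assumes that slot-based mapping fills every slot on one host before moving to the next, so the local rank (the value the training script uses to pin a GPU) simply counts up within each host. The parsing helpers are hypothetical and only mimic the -H host:slots syntax. They do not call MPI.

```python
def parse_hosts(spec):
    """Parse an 'host:slots,host:slots' string into (host, slots) pairs."""
    hosts = []
    for entry in spec.split(","):
        name, slots = entry.split(":")
        hosts.append((name, int(slots)))
    return hosts


def placement(spec, num_procs):
    """Return (global_rank, host, local_rank) for each process.

    Assumes hosts are filled slot by slot, in the order listed.
    """
    layout = []
    for name, slots in parse_hosts(spec):
        for local_rank in range(slots):
            if len(layout) == num_procs:
                return layout
            layout.append((len(layout), name, local_rank))
    return layout


layout = placement("server1:4,server2:4,server3:4,server4:4", 16)
for global_rank, host, local_rank in layout:
    print(f"rank {global_rank:2d} -> {host}, GPU {local_rank}")
```

Under that assumption, global rank 5 is the second process on server2 and uses GPU 1 there, matching the one-GPU-per-process pinning in the Keras example.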
Running Horovod
https://eng.uber.com/horovod/
Which GPU to Get for Deep Learning
http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/
Dmitry Spodarets
d.spodarets@flyelephant.net
https://flyelephant.net/gpu-dedicated-servers
