Horovod
Uber’s Open Source Distributed Deep Learning
Framework for TensorFlow
Alex Sergeev, Machine Learning Platform, Uber Engineering
@alsrgv
Deep Learning @ Uber
● Self-Driving Vehicles
● Trip Forecasting
● Fraud Detection
● … and many more!
TensorFlow
● Most popular open source framework for deep learning
● Combines high performance with the ability to tinker with low-level model details
● Has end-to-end support from research to production
Going Distributed
● Train very large models
● Speed up model training
(Figure: Model Parallelism vs. Data Parallelism)
Going Distributed Cont.
● Modern GPUs have a lot of RAM
● Vast majority of use cases are data-parallel
● Facebook demonstrated training ResNet-50 on ImageNet in 1 hour (arxiv.org/abs/1706.02677)
Parameter Server Technique
tf.train.Server()
tf.train.ClusterSpec()
tf.train.replica_device_setter()
tf.train.SyncReplicasOptimizer()
(Diagram: parameter servers and worker GPU towers)
Parameter Server Technique - Example Script
Image Source: TensorFlow -- https://www.tensorflow.org/deploy/distributed
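The example script itself appears only as an image in the original slides. As context, here is a minimal sketch of the between-graph replication pattern from the linked TensorFlow guide, assuming TF 1.x APIs; the cluster addresses, flag handling, and replica counts below are illustrative placeholders, not the exact script from the slide:

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("job_name", "worker", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
FLAGS = flags.FLAGS

# Illustrative cluster: one parameter server and two worker GPU towers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    server.join()  # parameter servers only host and serve variables
else:
    # Variables land on the parameter servers; compute ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index, cluster=cluster)):
        loss = ...  # build the model
        opt = tf.train.AdagradOptimizer(0.01)
        # Aggregate gradients from all workers before each update (synchronous SGD).
        opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=2, total_num_replicas=2)
        train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())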
Parameter Server Technique - Performance
With the ImageNet dataset of 1.3M images, this setup can train ResNet-101 for one
epoch in 3.5 minutes. Scaling efficiency on 128 GPUs is only 42%, however.
How Can We Do Better?
● Re-think necessary complexity for data-parallel case
● Improve communication algorithm
● Use RDMA-capable networking (RoCE, InfiniBand)
Meet Horovod
● Distributed training framework for TensorFlow
● Inspired by work of Baidu, Facebook, et al.
● Uses bandwidth-optimal communication protocols
○ Makes use of RDMA (RoCE, InfiniBand) if available
● Seamlessly installs on top of TensorFlow via
pip install horovod
● Named after a traditional Russian folk dance in which participants dance in a circle with linked hands
Horovod Technique
Patarasuk, P., & Yuan, X. (2009). Bandwidth optimal all-reduce algorithms for clusters of workstations.
Journal of Parallel and Distributed Computing, 69(2), 117-124. doi:10.1016/j.jpdc.2008.09.002
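The reference above describes the ring allreduce: each worker exchanges fixed-size chunks of its buffer with its two ring neighbors, so total traffic per worker stays roughly constant as the number of workers grows. The following single-process NumPy simulation only illustrates that data flow; in Horovod the actual exchange happens over MPI/NCCL:

import numpy as np

def ring_allreduce(worker_buffers):
    # Simulate summing equal-length buffers "held" by n workers via ring allreduce.
    n = len(worker_buffers)
    chunks = [np.array_split(np.asarray(buf, dtype=np.float64), n) for buf in worker_buffers]

    # Scatter-reduce: after n-1 steps, worker r holds the complete sum of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy()) for r in range(n)]
        for src, c, data in sends:
            dst = (src + 1) % n
            chunks[dst][c] = chunks[dst][c] + data

    # Allgather: circulate the completed chunks until every worker has all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy()) for r in range(n)]
        for src, c, data in sends:
            chunks[(src + 1) % n][c] = data

    return [np.concatenate(c) for c in chunks]

# Example: 4 simulated workers, each holding a gradient vector filled with its rank.
grads = [np.full(10, r, dtype=np.float64) for r in range(4)]
print(ring_allreduce(grads)[0])  # every worker ends up with the same sum: all 6.0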
Horovod Stack
● Plugs into TensorFlow via custom op mechanism
● Uses MPI for worker discovery and reduction coordination
● Uses NVIDIA NCCL for actual reduction on the server and across servers
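In practice the custom op shows up as an ordinary TensorFlow operation. A brief sketch based on the public horovod.tensorflow API of the time (the average-by-default behavior is assumed here):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# hvd.allreduce() inserts the MPI/NCCL-backed custom op into the graph and,
# by default, returns the average of the tensor across all ranks.
local_grad = tf.random_normal([3])
averaged_grad = hvd.allreduce(local_grad)

# hvd.DistributedOptimizer is conceptually a thin wrapper that applies
# hvd.allreduce() to every gradient produced by the wrapped optimizer
# before the weight update is applied.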
Horovod Example
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01)
# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
# Add hook to broadcast variables from rank 0 to all other processes during initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Make training operation
train_op = opt.minimize(loss)
# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs",
                                       config=config, hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
Horovod Example - Keras
import keras
from keras import backend as K
import tensorflow as tf
import horovod.keras as hvd
# Initialize Horovod.
hvd.init()
# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))
# Build model…
model = ...
opt = keras.optimizers.Adadelta(1.0)
# Add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt, metrics=['accuracy'])
# Broadcast initial variable states from rank 0 to all other processes.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x_train, y_train,
          callbacks=callbacks,
          epochs=10,
          validation_data=(x_test, y_test))
Horovod Example - Estimator API
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Build model...
def cnn_model_fn(features, labels, mode):
    loss = ...
    opt = tf.train.AdagradOptimizer(0.01)
    # Add Horovod Distributed Optimizer
    opt = hvd.DistributedOptimizer(opt)
    train_op = opt.minimize(loss=loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
# Add hook to broadcast variables from rank 0 to all other processes during initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir="/tmp/mnist_convnet_model",
    config=tf.estimator.RunConfig(session_config=config))
mnist_classifier.train(input_fn=train_input_fn, steps=100, hooks=hooks)
Running Horovod
● MPI takes care of launching processes on all machines
● Run on a 4 GPU machine (Open MPI 3.0.0):
○ $ mpirun -np 4 \
      -H localhost:4 \
      -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
      python train.py
● Run on 4 machines with 4 GPUs (Open MPI 3.0.0):
○ $ mpirun -np 16 \
      -H server1:4,server2:4,server3:4,server4:4 \
      -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
      python train.py
● Boilerplate mpirun arguments are easily hidden in a convenience script
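The slides do not show that convenience script; one hypothetical way to hide the mpirun boilerplate is a small Python wrapper along these lines (the script name, defaults, and host list are made up for illustration). Newer Horovod releases ship a horovodrun launcher that plays the same role.

# launch_horovod.py -- hypothetical wrapper, not part of Horovod itself
import subprocess
import sys

def launch(hosts, gpus_per_host, script, *script_args):
    num_procs = len(hosts) * gpus_per_host
    host_arg = ",".join("%s:%d" % (h, gpus_per_host) for h in hosts)
    cmd = ["mpirun", "-np", str(num_procs),
           "-H", host_arg,
           "-bind-to", "none", "-map-by", "slot",
           "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH",
           sys.executable, script] + list(script_args)
    subprocess.check_call(cmd)

if __name__ == "__main__":
    # Equivalent to the 16-process example above.
    launch(["server1", "server2", "server3", "server4"], 4, "train.py")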
Debugging - Horovod Timeline
● Discovered that ResNet-152 has a lot of tiny tensors
● Added Tensor Fusion - smart batching of small tensors that yields large gains (bigger gains on less-optimized networks)
Horovod Performance
With Horovod, the same ResNet-101 can be trained for one epoch on ImageNet in 1.5 minutes.
Scaling efficiency improves to 88%, making it twice as efficient as standard distributed TensorFlow.
Horovod Performance Cont.
RDMA further improves efficiency - by 30% for VGG-16.
Practical Aspects - Initialization
● Use a broadcast operation to make sure all workers start with the same weights
● Otherwise, the averaged gradient will not point towards the minimum (shown in red)
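The examples above do this with BroadcastGlobalVariablesHook (TensorFlow) and BroadcastGlobalVariablesCallback (Keras). When managing the session manually, the same effect comes from running the broadcast op once after initialization; a short sketch using the public horovod.tensorflow API:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
w = tf.get_variable("w", shape=[10], initializer=tf.random_normal_initializer())
bcast = hvd.broadcast_global_variables(0)  # op that copies rank 0's variables to every rank

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # each rank starts with different random weights
    sess.run(bcast)                              # rank 0's weights now overwrite them everywhere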
Practical Aspects - Data Partitioning
● Shuffle the dataset
● Partition records among workers
● Train by sequentially reading the partition
● After epoch is done, reshuffle and partition again
NOTE: make sure that all partitions contain the same number of batches; otherwise, training will deadlock (see the sketch below).
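A minimal sketch of this partitioning scheme using hvd.rank() and hvd.size(); the dataset size, batch size, and helper name are illustrative, and the final trim keeps the per-worker batch counts equal as the note above requires:

import numpy as np
import horovod.tensorflow as hvd

hvd.init()
num_records, batch_size = 100000, 64

def epoch_partition(epoch):
    # Same seed on every worker -> identical shuffle; then strided split by rank.
    rng = np.random.RandomState(seed=epoch)
    indices = rng.permutation(num_records)
    part = indices[hvd.rank()::hvd.size()]
    # Trim so every worker sees the same number of full batches (avoids deadlock).
    batches_per_worker = num_records // hvd.size() // batch_size
    return part[:batches_per_worker * batch_size]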
Practical Aspects - Random Sampling
● Shuffle the dataset
● Train by randomly reading data from whole dataset
● After epoch is done, reshuffle
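The random-sampling variant is even simpler: each worker draws batches independently from the whole dataset, using its rank to get a different random stream (an illustrative sketch; names and sizes are placeholders):

import numpy as np
import horovod.tensorflow as hvd

hvd.init()
num_records, batch_size = 100000, 64
rng = np.random.RandomState(seed=hvd.rank())  # different random stream per worker

def next_batch_indices():
    # Sample uniformly from the full dataset; some records may repeat, others may be skipped.
    return rng.randint(0, num_records, size=batch_size)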
Practical Aspects - Data
● Random sampling may cause some records to be read multiple times in a single epoch, while others are not read at all
● In practice, both approaches typically yield the same results
● Conclusion: use the most convenient option for your case
● Remember: validation can also be distributed, but make sure to average validation results from all the workers when using learning rate schedules that depend on validation
○ Horovod comes with MetricAverageCallback for Keras
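In Keras this is a one-line addition to the callbacks list from the earlier example; MetricAverageCallback averages metrics across ranks at the end of each epoch:

import horovod.keras as hvd

hvd.init()
callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Average validation (and training) metrics across all workers after each epoch.
    hvd.callbacks.MetricAverageCallback(),
]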
Practical Aspects - Learning Rate Adjustment
● Facebook, in the paper “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour” (arxiv.org/abs/1706.02677), recommends linear scaling of the learning rate:
○ LR_N = LR_1 * N
○ Requires smooth warmup during the first K epochs, as shown below
○ Works up to batch size 8192
● Horovod comes with LearningRateWarmupCallback for Keras
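Put together in Keras, that means scaling the base learning rate by hvd.size() and adding the warmup callback; the base learning rate and warmup length below are illustrative values, not a recommendation:

import keras
import horovod.keras as hvd

hvd.init()
# Linear scaling: LR_N = LR_1 * N, where N = hvd.size() is the number of workers.
opt = keras.optimizers.SGD(lr=0.0125 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    hvd.callbacks.MetricAverageCallback(),
    # Ramp the learning rate up smoothly over the first few epochs.
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1),
]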
Practical Aspects - Learning Rate Adjustment Cont.
● Yang You, Igor Gitman, and Boris Ginsburg, in the paper “Large Batch Training of Convolutional Networks” (arxiv.org/abs/1708.03888), demonstrated scaling to batches of 32K examples
○ Uses per-layer adaptive learning rate scaling
● Google published the paper “Don't Decay the Learning Rate, Increase the Batch Size” (arxiv.org/abs/1711.00489), arguing that the typical learning rate decay can be replaced with an increase of the batch size
Practical Aspects - Checkpointing
● Typically, a server has multiple GPUs
● To avoid clashes, write checkpoints, TensorBoard logs and other artifacts only on worker 0:
○ if hvd.rank() == 0:
      # write checkpoint
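For example, in the Keras setup shown earlier, the checkpoint and TensorBoard callbacks can be appended only on rank 0 (paths below are illustrative):

import keras
import horovod.keras as hvd

hvd.init()
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# Only worker 0 writes checkpoints, TensorBoard logs and other artifacts,
# so multiple GPU processes on the same server do not clash.
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint("/tmp/checkpoint-{epoch}.h5"))
    callbacks.append(keras.callbacks.TensorBoard(log_dir="/tmp/tf_logs"))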
Practical Results at Uber
● Used Facebook’s learning rate adjustment technique
● Trained convolutional networks and LSTMs in hours instead of days or weeks, with the same final accuracy
● You can do that, too!
Giving Back
Horovod is available on GitHub today:
https://github.com/uber/horovod
Thank you!
Horovod on our Eng Blog: https://eng.uber.com/horovod
Michelangelo on our Eng Blog: https://eng.uber.com/michelangelo
ML at Uber on YouTube: http://t.uber.com/ml-meetup
Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information to
any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.