Distributed TensorFlow on Hops (Papis London, April 2018)

Techniques for Distributed TensorFlow on Hops
Jim Dowling
CEO, Logical Clocks AB
Assoc Prof, KTH Stockholm
Senior Researcher, RISE SICS
jim_dowling
Europe 2018

©2018 Logical Clocks AB. All Rights Reserved
AI Hierarchy of Needs
Pyramid levels, top to bottom:
• DDL (Distributed Deep Learning)
• Deep Learning, RL, Automated ML
• A/B Testing, Experimentation, ML
• B.I. Analytics, Metrics, Aggregates, Features, Training/Test Data
• Reliable Data Pipelines, ETL, Unstructured and Structured Data Storage, Real-Time Data Ingestion
Roles shown beside the pyramid: Data Engineers, Data Scientists, Data Science-Engineers

Distributed Deep Learning is Important
How to improve the state of the art in deep learning?*
“We have three main lines of attack:
1. We can search for improved model architectures.
2. We can scale computation.
3. We can create larger training data sets.”
*https://blog.acolyer.org/2018/03/28/deep-learning-scaling-is-predictable-empirically/

More Data means Better Predictions
[Figure: prediction performance (y-axis) against amount of labelled data (x-axis), with curves for Traditional ML and Deep Neural Nets. Annotation: "Hand-crafted can outperform".]

How much Labelled Data do we Need?*
*https://arxiv.org/pdf/1712.00409.pdf

More Data and Prediction Improvement*
[Figure: generalization error plot.]
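The paper cited on the preceding slide (arXiv 1712.00409) reports that generalization error falls roughly as a power law in the number of training examples. The sketch below assumes that form, error(m) = alpha * m^beta_g + gamma, and uses hypothetical constants chosen only for illustration. It shows how much extra data is needed to halve the error when the irreducible term gamma is zero.

```python
def generalization_error(m, alpha, beta_g, gamma=0.0):
    """Power-law learning curve: error(m) = alpha * m**beta_g + gamma,
    where m is the number of training examples and beta_g < 0."""
    return alpha * m ** beta_g + gamma


def data_factor_to_halve_error(beta_g):
    """With gamma = 0, halving the error needs 2**(1/|beta_g|) times more data."""
    return 2 ** (1.0 / abs(beta_g))


# Hypothetical constants, not taken from the slides or the paper's tables.
alpha, beta_g = 10.0, -0.35
m = 1_000_000

factor = data_factor_to_halve_error(beta_g)
e1 = generalization_error(m, alpha, beta_g)
e2 = generalization_error(m * factor, alpha, beta_g)

print(f"data factor to halve error: {factor:.2f}x")
print(f"error at m: {e1:.4f}, error at {factor:.2f} * m: {e2:.4f}")
```

The closer the exponent is to zero, the steeper the data requirement becomes, which is one reason scaling out data pipelines and training goes hand in hand.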

Better Model and Prediction Improvement*
[Figure: generalization error plot.]

Get Better Models with More Compute
“Methods that scale with computation are the future of AI”*
- Rich Sutton (A Founding Father of Reinforcement Learning)
*https://www.youtube.com/watch?v=EeMCEQa85tw

Parallel Experiments to Find Better Models
• Model Architecture Search*
- Explore on smaller datasets, then scale to larger datasets => enables more searches.
• SOTA on CIFAR10 (2.13% top-1 error), SOTA on ImageNet (3.8% top-5 error)
- 450 GPU / 7 days
- 900 TPU / 5 days
*https://arxiv.org/abs/1802.01548

Parallel Experiments (1/4)
The Outer Loop (hyperparameters):
“I have to run a hundred experiments to find the best model,” he complained, as he showed me his Jupyter notebooks. “That takes time. Every experiment takes a lot of programming, because there are so many different parameters.”
[Rants of a Data Scientist]
[Figure labels: Time, Hops]

Need for a Distributed Filesystem
[Diagram labels: Driver; Experiment 1 ... Experiment N; TensorBoard; Distributed FS (training/test data, evaluation results, experiment configurations, etc.)]

More on Why we need a Distributed Filesystem
• Datasets are getting larger
• Model checkpointing
• Model-architecture search
• Hyperparameter search
• Hierarchical Filesystems (fast)
- HDFS / HopsFS
- Ceph, GlusterFS
• Object Stores (slow)
- S3, GCS, WFS
PLUG for HopsFS*
*http://www.logicalclocks.com/fixing-the-small-files-problem-in-hdfs/

What about Distributed Training?

More Compute should mean Faster Training
[Figure: training performance (y-axis) against available compute (x-axis), with curves for Single-Host and Distributed.]

Distributed Training (2/4)
[Figure labels: Weeks, Time]
The Inner Loop (training):
“All these experiments took a lot of computation
— we used hundreds of GPUs/TPUs for days.
Much like a single modern computer can
outperform thousands of decades-old machines,
we hope that in the future these experiments will
become household.”
[Google SoTA ImageNet, Cifar-10, March18]
[Figure labels: Mins, Hops]

Reduce DNN Training Time
In 2017, Facebook reduced training time on ImageNet for a CNN from 2 weeks to 1 hour by scaling out to 256 GPUs using Ring-AllReduce on Caffe2.*
*https://arxiv.org/abs/1706.02677

Distributed Training: Theory and Practice
Image from @hardmaru on Twitter.

Asynchronous vs Synchronous SGD
• Synchronous Stochastic Gradient Descent (SGD) is now dominant
“Revisiting Distributed Synchronous SGD”, Chen et al, ICLR 2016
https://research.google.com/pubs/pub45187.html
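As a minimal sketch of what "synchronous" means here, assume each worker computes a gradient on an equal-sized shard of a mini-batch and one update is applied only after all gradients have been averaged. The toy linear model, data, and variable names below are illustrative assumptions, not code from the talk. With equal shards, the averaged step matches a single-host step on the whole batch.

```python
import numpy as np


def gradient(w, X, y):
    """Gradient of the loss 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)


rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))            # one global mini-batch of 64 examples
y = X @ np.array([1.0, -2.0, 0.5])      # synthetic targets
w = np.zeros(3)
lr = 0.1
num_workers = 4

# Each worker computes a gradient on its own equal-sized shard of the batch.
X_shards = np.split(X, num_workers)
y_shards = np.split(y, num_workers)
worker_grads = [gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]

# Synchronous step: wait for every worker, average, then apply one update.
avg_grad = np.mean(worker_grads, axis=0)
w_sync = w - lr * avg_grad

# Reference: the same step computed on the whole batch by a single host.
w_single = w - lr * gradient(w, X, y)

print(np.allclose(w_sync, w_single))
```

An asynchronous scheme would instead let each worker apply its gradient as soon as it is ready, so updates could be computed from stale weights.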

Synchronous Distributed SGD Algorithms not all Equal
[Figure: training performance (y-axis) against available compute (x-axis), with curves for Parameter Servers and AllReduce.]

Ring-AllReduce vs Parameter Server
[Diagram: Ring-AllReduce with GPU 0 to GPU 3 arranged in a ring, each with a send and a recv link to its neighbours; Parameter Server with GPU 1 to GPU 4 connected to the Param Server(s).]
Network Bandwidth is the Bottleneck for Distributed Training
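The ring pattern in the diagram can be simulated in a few lines. The sketch below is an in-process NumPy simulation, not Horovod or NCCL code, and the function name ring_allreduce is hypothetical. Each worker splits its vector into N chunks, a scatter-reduce phase passes partial sums around the ring, and an allgather phase circulates the finished chunks.

```python
import numpy as np


def ring_allreduce(tensors):
    """Simulate a sum ring-allreduce over equal-length 1-D arrays, one per
    worker. Returns the per-worker results and the number of elements each
    worker sent."""
    n = len(tensors)
    # Each worker splits its own tensor into n chunks.
    chunks = [np.array_split(t.astype(float), n) for t in tensors]
    sent = [0] * n

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the full
    # sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step happen together.
        outgoing = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            c = (i - step) % n
            chunks[dst][c] = chunks[dst][c] + outgoing[i]
            sent[i] += outgoing[i].size

    # Phase 2, allgather: pass the fully reduced chunks around the ring.
    for step in range(n - 1):
        outgoing = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            c = (i + 1 - step) % n
            chunks[dst][c] = outgoing[i]
            sent[i] += outgoing[i].size

    return [np.concatenate(c) for c in chunks], sent


# Four workers ("GPU 0" to "GPU 3"), each with its own gradient vector.
grads = [np.arange(8) * (rank + 1) for rank in range(4)]
results, sent = ring_allreduce(grads)

print(results[0])   # every worker ends with the same summed vector
print(sent)         # elements sent per worker
```

Each worker sends 2 * (N - 1) chunks of size M / N, so per-worker traffic stays below 2 * M however many workers join the ring.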

AllReduce outperforms Parameter Servers
16 servers with 4 P100 GPUs each (64 GPUs), connected by a RoCE-capable 25 Gbit/s network (synthetic data).
[Chart: images processed per second.*]
For Bigger Models, Parameter Servers don’t scale
*https://github.com/uber/horovod
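A back-of-the-envelope cost model helps explain the scaling gap. The sketch below assumes the simplest case of a single parameter server that receives a full gradient of M bytes from every worker, and compares it with the 2 * (N - 1) / N * M bytes each worker sends in a ring-allreduce. The 100 MiB model size is hypothetical, and real deployments usually shard parameters across several servers, so treat this only as an illustration of the trend.

```python
def ring_allreduce_bytes_per_worker(model_bytes, num_workers):
    """Bytes each worker sends in a ring-allreduce: 2 * (N - 1) / N * M."""
    return 2 * (num_workers - 1) / num_workers * model_bytes


def single_ps_bytes_received(model_bytes, num_workers):
    """Bytes one parameter server receives per step when every worker
    pushes a full gradient of size M."""
    return num_workers * model_bytes


M = 100 * 1024 * 1024   # hypothetical: 100 MiB of gradients per step
for n in (4, 16, 64):
    ring = ring_allreduce_bytes_per_worker(M, n)
    ps = single_ps_bytes_received(M, n)
    print(f"N={n:3d}  ring, per worker: {ring / M:.2f} x M"
          f"   single PS, inbound: {ps / M:.0f} x M")
```

Under these assumptions the ring's per-worker traffic is bounded, while the load on a lone parameter server grows linearly with both worker count and model size.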

ML in Production: Machine Learning Pipelines

A Machine Learning Pipeline with TensorFlow
[Diagram: pipeline stages Data Collection, Feature Extraction, Data Transformation & Verification, Experimentation, Training, Test, Serving. Frameworks: Spark, TensorFlow, TfServing. Infrastructure: Distributed FS, Message Queue, Resource Manager with GPU support, Kubernetes. Roles: Data Engineering, Data Science, Ops.]

Hops Small Data ML Pipeline
[Diagram: Hops (Kafka/HopsFS/Spark/TensorFlow/Kubernetes) hosting the stages Data Collection, Feature Extraction, Data Transformation & Verification, Test, with TensorFlow and TfServing, used by Project Teams (Data Engineers/Scientists).]

Hops Big Data ML Pipeline
[Diagram: Hops (Kafka/HopsFS/Spark/TensorFlow/Kubernetes) hosting the stages Data Collection, Feature Extraction, Data Transformation & Verification, Test, with PySpark, TensorFlow and TfServing, used by Project Teams (Data Engineers/Scientists).]

Why not Kubeflow?
•Operational Reasons
-No Integrated Enterprise Security Framework
• Encryption-in-Transit, Encryption-at-Rest
-Stateful services not designed for Kubernetes
• Distributed Storage, Kafka, Databases
•Usability Reasons
-Not a Fully Managed Platform
• Write YML files and restart just to install a new Python library
-Slow startup times for applications/notebooks

Machine Learning Pipelines in Code

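Small Data Preparation with tf.data API
    import tensorflow as tf  # TF 1.x; these transformations later moved to tf.data.experimental

    # IMAGES_DIR (a file pattern) and parser_fn (parses one record) are defined elsewhere.
    def input_fn(batch_size):
        files = tf.data.Dataset.list_files(IMAGES_DIR)

        def tfrecord_dataset(filename):
            return tf.data.TFRecordDataset(filename,
                num_parallel_reads=32, buffer_size=8*1024*1024)

        # Read several TFRecord files concurrently; sloppy=True allows out-of-order records.
        dataset = files.apply(tf.contrib.data.parallel_interleave(
            tfrecord_dataset, cycle_length=32, sloppy=True))
        # Fuse parsing and batching into a single step.
        dataset = dataset.apply(tf.contrib.data.map_and_batch(parser_fn, batch_size,
            num_parallel_batches=4))
        # Keep up to 4 batches ready ahead of the training step.
        dataset = dataset.prefetch(4)
        return dataset
[Pipeline stages: Data Acquisition, Clean/Transform Data, Feature Extraction, Experimentation, Training, Test + Serve]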

Big Data Preparation with PySpark
    from mmlspark import ImageTransformer

    # IMAGE_PATH and the SparkSession `spark` are defined elsewhere.
    images = spark.readImages(IMAGE_PATH, recursive=True,
                              sampleRatio=0.1).cache()
    tr = (ImageTransformer().setOutputCol("transformed")
          .resize(height=200, width=200)
          .crop(0, 0, height=180, width=180))
    smallImages = tr.transform(images).select("transformed")

Hyperparam Opt. with Tf/Spark on Hops
    def model_fn(learning_rate, dropout):
        import tensorflow as tf
        from hops import tensorboard, hdfs, devices
        [TensorFlow Code here]

    from hops import experiment
    args_dict = {'learning_rate': [0.001, 0.005, 0.01],
                 'dropout': [0.5, 0.6]}
    experiment.launch(spark, model_fn, args_dict)
Launch TF jobs in Spark Executors
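One plausible reading of the args_dict above is a grid search, where every combination of learning rate and dropout becomes one trial. The sketch below shows only that expansion in plain Python. It does not use the hops library, and whether experiment.launch expands the dictionary this way is an assumption. The functions grid and toy_model_fn are hypothetical, and a thread pool stands in for Spark executors.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

args_dict = {'learning_rate': [0.001, 0.005, 0.01],
             'dropout': [0.5, 0.6]}


def grid(args):
    """Expand a dict of value lists into one kwargs dict per combination."""
    names = list(args)
    return [dict(zip(names, values))
            for values in itertools.product(*(args[n] for n in names))]


def toy_model_fn(learning_rate, dropout):
    """Hypothetical stand-in for a training function; returns a fake score."""
    return -abs(learning_rate - 0.005) - abs(dropout - 0.5)


trials = grid(args_dict)

# Threads stand in for Spark executors here; each one runs a single trial.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(lambda kw: toy_model_fn(**kw), trials))

# Pick the trial with the highest score.
best_score, best_params = max(zip(scores, trials), key=lambda pair: pair[0])

print(len(trials), "trials; best:", best_params)
```

Because the trials are independent, they can run in parallel with no communication between them, unlike distributed training of a single model.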

HyperParam Opt. Visualization on TensorBoard
Hyperparam Opt Results Visualization

Distributed Training with Horovod on Hops
    import horovod.tensorflow as hvd

    def conv_model(feature, target, mode):
        …..

    hvd.init()                            # start Horovod on every worker
    opt = hvd.DistributedOptimizer(opt)   # wrap the existing optimizer so gradients are averaged with allreduce
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..

    from hops import allreduce
    allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
“Pure” TensorFlow code

Hops API
•Python (also Java/Scala)
-Manage tensorboard, Load/save models in HDFS
-Horovod, TensorFlowOnSpark
-Parallel experiments
• Gridsearch
• Model Architecture Search with Genetic Algorithms
-Secure Streaming Analytics with Kafka/Spark/Flink
• SSL/TLS certs, Avro Schema, Endpoints for Kafka/Zookeeper/etc

TensorFlow Model Serving

Hops: Next Generation Hadoop*
• Faster: 16x Throughput
• Bigger: 37x Number of files
• Scale Challenge Winner (2017)
• GPUs in YARN
*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi

Project Model for Sensitive Data/GDPR
[Diagram labels: Project-X, Project-42, Project-All; Engineering, Kafka Topic, Shared DB, Topic, CompanyDB; FX Project, FX Topic, FX DB, FX Data Stream, Shared Analytics, FX team.]
Ismail et al, Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata, ICDCS 2017

Hopsworks Data Platform

A Project is a Grouping of Users and Data
[Diagram labels: Proj-42, Proj-X, Proj-All; Projects sandbox Private Data; Shared Topic, Topic, /Projs/My/Data, CompanyDB.]
Ismail et al, Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata, ICDCS 2017

How are Projects used?
[Diagram labels: Engineering, Kafka Topic; FX Project, FX Topic, FX DB, FX Data Stream; Shared Interactive Analytics; FX team.]

Python in the Cluster: Per-Project Conda Envs
Python libraries are usable by Spark/Tensorflow

TensorFlow in Hopsworks
[Diagram labels: Experiments, TensorBoard, TensorFlow Serving, FeatureStore, Kafka, Hive, HopsFS, YARN, Public Cloud or On-Premise.]

Summary
• The future of Deep Learning is Distributed
https://www.oreilly.com/ideas/distributed-tensorflow
• Hops is a new Data Platform with first-class support for Python / Deep Learning / ML / Data Governance / GPUs
“It is starting to look like deep learning workflows of the future feature autotuned architectures running with autotuned compute schedules across arbitrary backends.”*
Andrej Karpathy - Head of AI @ Tesla
*https://twitter.com/karpathy/status/972701240017633281

The Team
Active:
Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman
Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel,
Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson,
August Bonds, Filotas Siskos, Mahmoud Hamed.
Alumni:
Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram
Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto
Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro,
Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos
Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid
Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Fanti Machmount Al
Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, ArunaKumari Yedurupaka,
Tobias Johansson, Roberto Bampi, Roshan Sedar.
www.hops.io
@hopshadoop

Spark Scikit-learn integration
    from sklearn import svm, datasets
    from spark_sklearn import GridSearchCV

    iris = datasets.load_iris()
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svr = svm.SVC()
    # `sc` is an existing SparkContext; the grid points are evaluated on the cluster.
    clf = GridSearchCV(sc, svr, parameters)
    clf.fit(iris.data, iris.target)
