Jim Dowling Assoc Prof, KTH
Senior Researcher, RISE SICS
CEO, Logical Clocks AB
SPARK & TENSORFLOW
AS-A-SERVICE
#EUai8
Hops
Newton confirmed what many suspected
• In August 1684, Halley
visited Newton:
“What type of curve does
a planet describe in its
orbit about the sun,
assuming an inverse
square law of attraction?”
2#EUai8
• In June 2017,
Facebook showed
how to reduce training
time on ImageNet for
a Deep CNN from 2
weeks to 1 hour by
scaling out to 256
GPUs.
3#EUai8
https://arxiv.org/abs/1706.02677
Facebook confirmed what many suspected
AI Hierarchy of Needs
5
DDL
(Distributed
Deep Learning)
Deep Learning,
RL, Automated ML
A/B Testing, Experimentation, ML
B.I. Analytics, Metrics, Aggregates,
Features, Training/Test Data
Reliable Data Pipelines, ETL, Unstructured and
Structured Data Storage, Real-Time Data Ingestion
[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]
AI Hierarchy of Needs
6
DDL
(Distributed
Deep Learning)
Deep Learning,
RL, Automated ML
A/B Testing, Experimentation, ML
B.I. Analytics, Metrics, Aggregates,
Features, Training/Test Data
Reliable Data Pipelines, ETL, Unstructured and
Structured Data Storage, Real-Time Data Ingestion
[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]
Analytics
Prediction
AI Hierarchy of Needs
7
DDL
(Distributed
Deep Learning)
Deep Learning,
RL, Automated ML
A/B Testing, Experimentation, ML
B.I. Analytics, Metrics, Aggregates,
Features, Training/Test Data
Reliable Data Pipelines, ETL, Unstructured and
Structured Data Storage, Real-Time Data Ingestion
Hops
[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]
Deep Learning Hierarchy of Scale
8#EUai8
DDL
AllReduce
on GPU Servers
DDL with GPU Servers
and Parameter Servers
Parallel Experiments on GPU Servers
Single GPU
Many GPUs on a Single GPU Server
Training Time for ImageNet: Weeks, Days, Days/Hours, Hours, Minutes (decreasing as you move up the hierarchy)
Deep Learning Hierarchy of Scale
9#EUai8
Public
Clouds
On-Premise
Single GPU
Multiple GPUs on a Single GPU Server
DDL
AllReduce
on GPU Servers
DDL with GPU Servers
and Parameter Servers
Single GPU
Many GPUs on a Single GPU Server
Parallel Experiments on GPU Servers
Single Host DL
Distributed DL
DNN Training Time and Researcher Productivity
• Distributed Deep Learning
– Interactive analysis!
– Instant gratification!
• Single Host Deep Learning
– Google-Envy
10
“My Model’s Training.”
Training
What Hardware do you Need?
• SingleRoot PCI
Complex Server*
– 10 Nvidia GTX 1080Ti
• 11 GB Memory
– 256 GB RAM
– 2 Intel Xeon CPUs
– 2x56 Gb InfiniBand
15K Euro
• Nvidia DGX-1
– 8 Nvidia Tesla P100/V100
• 16 GB Memory
– 512 GB RAM
– 2 Intel Xeon CPUs
– 4x100 Gb InfiniBand
– NVLink**
up to 150K Euro
*https://www.servethehome.com/single-root-or-dual-root-for-deep-learning-gpu-to-gpu-systems
**https://www.microway.com/hpc-tech-tips/comparing-nvlink-vs-pci-e-nvidia-tesla-p100-gpus-openpower-servers/
12#EUai8
SingleRoot
Complex Server
with 10 GPUs
[Images from: https://www.microway.com/product/octoputer-4u-10-gpu-server-single-root-complex/ ]
Tensorflow GAN Training Example*
13#EUai8
*https://www.servethehome.com/deeplearning11-10x-nvidia-gtx-1080-ti-single-root-deep-learning-server-part-1/
Cluster of Commodity GPU Servers
14#EUai8
InfiniBand
Max 1-2 GPU Servers per Rack (2-4 kW per server)
Spark and TF – Cluster Integration
15#EUai8
Training Data and Model Store
Cluster Manager
Single GPU
Experiment
Parallel Experiments
(HyperParam Tuning)
Distributed
Training Job
Deprecated
A mix of commodity GPUs and more powerful GPUs is good for (1) parallel experiments and (2) distributed training.
GPU Resource Requests in Hops
16#EUai8
HopsYARN (Supports GPUs-as-a-Resource)
4 GPUs on any host
10 GPUs on 1 host
100 GPUs on 10 hosts with ‘Infiniband’
20 GPUs on 2 hosts with ‘Infiniband_P100’
Hops
HopsFS
HopsFS: Next Generation HDFS*
17
16x Throughput (Faster)
37x Number of Files (Bigger)
Scale Challenge Winner (2017)
Small Files**
*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
**https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf
TensorFlow Spark API Integration
• Tight Integration
– Databricks’ Tensorframes and Deep Learning Pipelines
• Loose Integration
– TensorFlow-on-Spark, Hops TfLauncher
• PySpark as a wrapper for TensorFlow
18#EUai8
Deep Learning Pipelines
19#EUai8
graph = tf.Graph()
with tf.Session(graph=graph) as sess:
    image_arr = utils.imageInputPlaceholder()
    frozen_graph = tfx.strip_and_freeze_until(…)
    transformer = TFImageTransformer(…)

image_df = readImages("/data/myimages")
processed_image_df = transformer.transform(image_df)
…

select image, driven_by_007(image) as probability from car_examples
order by probability desc limit 6

Inference is possible with Spark SQL
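For context, a minimal sketch (not from the slides) of how an image model might be exposed to Spark SQL as the driven_by_007 UDF used above, assuming Deep Learning Pipelines' registerKerasImageUDF helper; the model path and the temp-view registration are illustrative assumptions:

# Hedged sketch: expose a saved Keras image model as a Spark SQL UDF.
# Assumes sparkdl's registerKerasImageUDF; the model path is hypothetical.
from sparkdl import registerKerasImageUDF

registerKerasImageUDF("driven_by_007", "/models/car_detector.h5")

# Make the image DataFrame queryable from SQL before running the query above:
image_df.createOrReplaceTempView("car_examples")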
Hops TfLauncher – TF in Spark
def model_fn(learning_rate, dropout):
    import tensorflow as tf
    from hops import tensorboard, hdfs, devices
    …..

from hops import tflauncher
args_dict = {'learning_rate': [0.001], 'dropout': [0.5]}
tflauncher.launch(spark, model_fn, args_dict)
20
Launch TF jobs as Mappers in Spark
“Pure” TensorFlow code
in the Executor
Hops TfLauncher – Parallel Experiments
21#EUai8
def model_fn(learning_rate, dropout):
    …..

from hops import tflauncher
args_dict = {'learning_rate': [0.001, 0.005, 0.01],
             'dropout': [0.5, 0.6, 0.7]}
tflauncher.launch(spark, model_fn, args_dict)
Launches 3 Executors with 3 different Hyperparameter
settings. Each Executor can have 1-N GPUs.
New TensorFlow APIs
tf.data.Dataset tf.estimator.Estimator tf.data.Iterator
22#EUai8
def model_fn(features, labels, mode, params):
    …
    dataset = tf.data.TFRecordDataset(["/v/f1.tfrecord", "/v/f2.tfrecord"])
    dataset = dataset.map(...)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(32)
    iterator = Iterator.from_dataset(dataset)
    ….

nn = tf.estimator.Estimator(model_fn=model_fn, params=dict_hyp_params)
Prefer over RDDs-to-feed_dict
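As a complementary sketch (assuming TF 1.4-era APIs; not from the slides), the Estimator is usually driven by an input_fn that returns the next-element tensors of the tf.data pipeline. Here parse_fn and the step count are placeholders:

def input_fn():
    # Build the input pipeline and hand its next-element tensors to the Estimator
    dataset = tf.data.TFRecordDataset(["/v/f1.tfrecord", "/v/f2.tfrecord"])
    dataset = dataset.map(parse_fn)            # parse_fn: placeholder TFRecord parser (assumption)
    dataset = dataset.shuffle(buffer_size=10000).batch(32)
    return dataset.make_one_shot_iterator().get_next()

nn = tf.estimator.Estimator(model_fn=model_fn, params=dict_hyp_params)
nn.train(input_fn=input_fn, steps=10000)       # step count is illustrative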
Distributed TensorFlow
• AllReduce
– Horovod by Uber with MPI/NCCL
– Baidu AllReduce/MPI in TensorFlow/contrib
• Distributed Parameter Servers
– TensorFlow-on-Spark
– Distributed TensorFlow
23#EUai8
DDL
AllReduce
on GPU Servers
DDL with GPU Servers
and Parameter Servers
Asynchronous SGD vs Synchronous SGD
• Synchronous Stochastic Gradient Descent (SGD) now dominant,
due to improved convergence guarantees:
– “Revisiting Synchronous SGD”, Chen et al, ICLR 2016
https://research.google.com/pubs/pub45187.html
24
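For reference (standard notation, not from the slides), synchronous SGD with data parallelism over N workers averages the per-worker gradients before every update:

% Each worker k computes a gradient on its mini-batch B_k; the gradients are
% averaged (e.g. via AllReduce or a parameter server) and applied identically:
w_{t+1} = w_t - \eta \, \frac{1}{N} \sum_{k=1}^{N} \nabla L\!\left(w_t;\, B_k\right)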
Distributed TF with Parameter Servers
25
Synchronous SGD
with Data Parallelism
Tensorflow-on-Spark (Yahoo!)
• Rewrite TensorFlow apps to Distributed TensorFlow
• Two modes:
1. feed_dict: RDD.mapPartitions()
2. TFReader + queue_runner: direct HDFS access from Tensorflow
26 [Image from https://www.slideshare.net/Hadoop_Summit/tensorflowonspark-scalable-tensorflow-learning-on-spark-clusters]
TFonSpark with Spark Streaming
27#EUai8
[Image from https://www.slideshare.net/Hadoop_Summit/tensorflowonspark-scalable-tensorflow-learning-on-spark-clusters]
All-Reduce/MPI
28
GPU 0
GPU 1
GPU 2
GPU 3
send
send
send
send
recv
recv
recv
recv
AllReduce: Minimize Inter-Host B/W
29
A single slow worker or comms link is enough to bottleneck DNN training time.
AllReduce Algorithm
• AllReduce sums all gradients in the N layers (L1..LN) using N GPUs in parallel (simplified steps shown). In the figures below, L1₀ denotes the gradient of layer L1 computed on GPU 0, and so on.
GPU 0
GPU 1
GPU 2
GPU 3
L1 L2 L3 L4
L1 L2 L3 L4
L1 L2 L3 L4
L1 L2 L3 L4
Backprop
AllReduce Algorithm
GPU 0
GPU 1
GPU 2
GPU 3
L1₀+L1₁+L1₂+L1₃ L2 L3 L4
Backprop
L1₀+L1₁+L1₂+L1₃ L2 L3 L4
L1₀+L1₁+L1₂+L1₃ L2 L3 L4
L1₀+L1₁+L1₂+L1₃ L2 L3 L4
• Aggregate Gradients from the first layer (L1) while
sending Gradients for L2
AllReduce Algorithm
GPU 0
GPU 1
GPU 2
GPU 3
Backprop
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3 L4
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3 L4
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3 L4
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3 L4
• Broadcast Gradients from higher layers while
computing Gradients at lower layers.
AllReduce Algorithm
GPU 0
GPU 1
GPU 2
GPU 3
Backprop
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3₀+L3₁+L3₂+L3₃ L4
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3₀+L3₁+L3₂+L3₃ L4
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3₀+L3₁+L3₂+L3₃ L4
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3₀+L3₁+L3₂+L3₃ L4
• Nearly there.
AllReduce Algorithm
GPU 0
GPU 1
GPU 2
GPU 3
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3₀+L3₁+L3₂+L3₃ L4₀+L4₁+L4₂+L4₃
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3₀+L3₁+L3₂+L3₃ L4₀+L4₁+L4₂+L4₃
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3₀+L3₁+L3₂+L3₃ L4₀+L4₁+L4₂+L4₃
L1₀+L1₁+L1₂+L1₃ L2₀+L2₁+L2₂+L2₃ L3₀+L3₁+L3₂+L3₃ L4₀+L4₁+L4₂+L4₃
• Finished an iteration.
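A minimal sketch of this layer-wise summation using mpi4py (an assumption for illustration; this is not Horovod's or Baidu's actual implementation, which use NCCL/MPI ring AllReduce and fuse small tensors):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD              # one rank per GPU/worker
num_workers = comm.Get_size()

def allreduce_gradients(layer_grads):
    """Sum each layer's gradient across all workers, then average."""
    averaged = []
    # Backprop yields gradients from the last layer back to the first, so each
    # layer can be reduced as soon as it is ready, overlapping comms and compute.
    for grad in reversed(layer_grads):
        summed = np.empty_like(grad)
        comm.Allreduce(grad, summed, op=MPI.SUM)   # L_i0 + L_i1 + ... across GPUs
        averaged.append(summed / num_workers)
    averaged.reverse()                             # restore layer order L1..LN
    return averaged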
Hops AllReduce/Horovod/TensorFlow
35#EUai8
import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …..

def main(_):
    hvd.init()
    # Wrap the optimizer so gradients are averaged across workers via AllReduce
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
“Pure” TensorFlow code
Parameter Server vs AllReduce (Uber)*
36
*https://github.com/uber/horovod
Setup: 16 servers with 4 P100 GPUs each, connected by a 40 Gbit/s network (synthetic data).
The VGG model is larger.
Dist. Synchronous SGD: N/W is the Bottleneck
37
[Bar chart: amount of work over time for 1 GPU vs 4 GPUs, with network (N/W) communication phases interleaved between compute phases]
Reduce N/W Comms Time, Increase Computation Time
Amdahl’s Law
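For reference (standard Amdahl's law, not from the slides): if a fraction s of each iteration is serialized network communication and the rest parallelizes perfectly over N GPUs, the achievable speedup is

% s = serial (communication) fraction of an iteration, N = number of GPUs.
% No matter how many GPUs are added, speedup is capped at 1/s.
S(N) = \frac{1}{\,s + \frac{1-s}{N}\,} \;\le\; \frac{1}{s}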
Hopsworks: TensorFlow/Spark-as-a-Service
38#EUai8
Hopsworks: Full AI Hierarchy of Needs
39
Develop Train Test Deploy
MySQL Cluster
Hive
InfluxDB
ElasticSearch
Kafka
Projects, Datasets, Users
HopsFS / YARN
Spark, Flink, Tensorflow
Jupyter, Zeppelin
Jobs, Kibana, Grafana
REST API
Hopsworks
Hopsworks Abstractions
40
A Project is a Grouping of Users and Data
[Diagram: projects Proj-42, Proj-X, and Proj-AllCompanyDB sharing a Kafka topic and the dataset /Projs/My/Data]
Ismail et al, Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata, ICDCS 2017
Per-Project Conda Libs in Hopsworks
41#EUai8
Dela*
42
Peer-to-Peer Search and Download for Huge DataSets
(ImageNet, YouTube8M, MsCoCo, Reddit, etc)
*http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)
DEMO
43#EUai8
Register and Play for today:
http://spark.hops.site
Conclusions
• Many good frameworks for TF and Spark
– TensorFlowOnSpark, Deep Learning Pipelines
• Hopsworks support for TF and Spark
– GPUs-as-a-Resource in HopsYARN
– TfLauncher, TensorFlow-on-Spark, Horovod
– Jupyter with Conda Support
• More on GPU-Servers at www.logicalclocks.com
44#EUai8
Active: Jim Dowling, Seif Haridi, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis,
Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds,
Filotas Siskos, Mahmoud Hamed.
Alumni: Roberto Bampi, ArunaKumari Yedurupaka, Tobias Johansson, Fanti Machmount Al Samisti,
Braulio Grana, Adam Alpire, Zahin Azher Rashid, Vasileios Giannokostas, Johan Svedlund Nordström,
Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan,
Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro,
Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias,
Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik,
Lalith Suresh, Mariano Valles, Ying Lieu.
Please Follow Us! @hopshadoop
Hops Heads
Please Star Us! http://github.com/hopshadoop/hopsworks