UNET: Massive Scale DNN on Spark
Deep Neural Net
[Figure: a feed-forward deep neural net with an input layer and hidden layers 1–3.]
Convolutional Neural Net
Overview
 Components: Solver, Parameter Server, Model Splits.
 Massive scale: data parallel & model parallel.
 Training methods: asynchronous and synchronous.
 Algorithms: RBM, DA, SGD, CNN, LSTM, AdaGrad, L1/L2, L-BFGS, CG, etc.
 Extensibility: can be extended to any algorithm that can be modeled as a data flow.
 Highly optimized, with a lock-free implementation and a software pipeline that maximizes performance.
 Highly flexible and modular, supporting arbitrary networks.
Architecture: Data / Model Parallel
[Figure: one Solver RDD (1 partition) drives three replicated Model RDDs (3 partitions each, e.g. Model1_1, Model1_2, Model1_3), which exchange parameters with one Parameter Server RDD (3 partitions: PS_1, PS_2, PS_3, each fronted by a queue Q).]
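A minimal sketch of this layout, assuming a local Spark session; the RDD names and contents are illustrative placeholders rather than UNET's actual API:

```scala
// Illustrative only: one single-partition solver RDD, one 3-partition
// parameter-server RDD, and three replicated 3-partition model RDDs.
import org.apache.spark.sql.SparkSession

object ArchitectureSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unet-architecture-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Solver: a single-partition RDD that drives training and monitoring.
    val solverRdd = sc.parallelize(Seq("solver"), numSlices = 1)

    // Parameter server: one RDD with 3 partitions (PS_1, PS_2, PS_3),
    // each holding a shard of the parameters.
    val psRdd = sc.parallelize(Seq("PS_1", "PS_2", "PS_3"), numSlices = 3)

    // Three replicated model RDDs, each split into 3 partitions
    // (e.g. Model1_1, Model1_2, Model1_3 for the first replica).
    val modelRdds = (1 to 3).map { replica =>
      sc.parallelize((1 to 3).map(p => s"Model${replica}_$p"), numSlices = 3)
    }

    println(s"solver partitions = ${solverRdd.getNumPartitions}")
    println(s"ps partitions     = ${psRdd.getNumPartitions}")
    modelRdds.foreach(m => println(s"model partitions  = ${m.getNumPartitions}"))
    spark.stop()
  }
}
```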
Data Parallel
Component: Models & Parameter Server
Multiple models are trained independently.
Each model fits one split of the training data and calculates the sub-gradient.
Asynchronously, each model updates/retrieves parameters to/from the parameter server.
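A toy sketch of this asynchronous loop in plain Scala; the ParamServer and subGradient names are hypothetical, and a simple lock stands in for UNET's lock-free implementation:

```scala
// Toy async data-parallel loop: two replicas, one shared parameter server.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object AsyncDataParallelSketch {
  // Toy parameter server shard; a plain lock is used here for brevity,
  // whereas UNET uses a lock-free implementation.
  final class ParamServer(dim: Int) {
    private val params = Array.fill(dim)(0.0)
    def pull(): Array[Double] = synchronized(params.clone())
    def push(grad: Array[Double], lr: Double): Unit = synchronized {
      grad.indices.foreach(i => params(i) -= lr * grad(i))
    }
  }

  // Placeholder sub-gradient; in UNET this comes from the model split's
  // forward/backward pass over its share of the training data.
  def subGradient(w: Array[Double], split: Seq[Double]): Array[Double] =
    w.map(x => split.map(d => x - d).sum / split.size)

  def main(args: Array[String]): Unit = {
    val ps = new ParamServer(dim = 4)
    val splits = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0)) // two data splits, two replicas
    val replicas = splits.map { split =>
      Future {                                      // each replica runs asynchronously
        (1 to 100).foreach { _ =>
          val w = ps.pull()                         // retrieve current parameters
          ps.push(subGradient(w, split), lr = 0.01) // update with the sub-gradient
        }
      }
    }
    Await.result(Future.sequence(replicas), 30.seconds)
    println(ps.pull().mkString(", "))
  }
}
```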
Data Parallel
(2 replicated Models with 1 Parameter Server)
[Figure: two replicated models (ModelX, ModelY) synchronizing parameters with one Parameter Server through a queue Q.]
Model Parallel
The model is huge and cannot be held on one machine.
Training is computationally heavy.
The model is partitioned into multiple splits.
Each split may be located on a different physical machine.
Model Parallel
(3 Partitions)
Data communication:
• node-level
• group-level
Control traffic over RPC; data traffic over Netty.
[Figure: one Master coordinating three Executors, each hosting one model partition.]
Data / Model Parallel
[Figure: the combined data/model parallel layout from the Architecture slide: one Solver RDD (1 partition), one Parameter Server RDD (3 partitions: PS_1, PS_2, PS_3 with queues Q), and three replicated Model RDDs (3 partitions each).]
A Simple Network
[Figure: an example network: Convolutional layer → Fully Mesh layer → Softmax layer, plus a Facility Master.]
Parameter Management
 ParamMgr.Node for a fully meshed layer:
managed by each individual node.
 ParamMgr.Group for a convolutional layer:
shared by all nodes in the group and managed by the group, which gathers/scatters the parameters from/to its members; the members may be located in different executors.
 ParamMgr.Const for the softmax master layer:
the parameters are constant.
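A hypothetical sketch of the three strategies; the trait and class names are assumptions for illustration, not UNET's real classes:

```scala
// Assumed interfaces for the three parameter-management strategies.
object ParamMgrSketch {
  sealed trait ParamMgr { def get(nodeId: Int): Array[Double] }

  // ParamMgr.Node: each node owns and manages its own parameters
  // (fully meshed layer).
  final class NodeMgr(dim: Int) extends ParamMgr {
    private val byNode = scala.collection.mutable.Map.empty[Int, Array[Double]]
    def get(nodeId: Int): Array[Double] =
      byNode.getOrElseUpdate(nodeId, Array.fill(dim)(0.0))
  }

  // ParamMgr.Group: one shared copy for all nodes in the group (convolutional
  // layer); the group gathers/scatters it to members, possibly on other executors.
  final class GroupMgr(shared: Array[Double]) extends ParamMgr {
    def get(nodeId: Int): Array[Double] = shared
    def scatter(update: Array[Double]): Unit =
      update.indices.foreach(i => shared(i) = update(i))
  }

  // ParamMgr.Const: fixed parameters (softmax master layer).
  final class ConstMgr(const: Array[Double]) extends ParamMgr {
    def get(nodeId: Int): Array[Double] = const
  }

  def main(args: Array[String]): Unit = {
    val node  = new NodeMgr(dim = 2)
    val group = new GroupMgr(Array(0.5, 0.5))
    val const = new ConstMgr(Array(1.0))
    println(s"${node.get(7).mkString(",")} | ${group.get(7).mkString(",")} | ${const.get(7).mkString(",")}")
  }
}
```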
Parameter Type (Link vs. Node)
[Figure: a neuron i with node parameters q_{i,1} … q_{i,4}, left-link parameters q^l_{1,i}, q^l_{2,i}, q^l_{3,i} (links from layer l), and right-link parameters q^{l+1}_{i,1}, q^{l+1}_{i,2}, q^{l+1}_{i,3} (links to layer l+1).]
1. Each parameter is associated with either a link or a node.
2. Each node/link may have multiple parameters associated with it.
3. Link parameters are managed by the upstream node.
4. Each category of parameters may be managed by either the node or the group.
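A minimal sketch of rule 3, assuming a hypothetical Node type that stores the parameters of its outgoing links keyed by downstream node id:

```scala
// Rule 3 in miniature: a node stores its outgoing link parameters.
object LinkVsNodeParamsSketch {
  final case class Node(id: Int,
                        nodeParams: Vector[Double],           // parameters on the node itself
                        outgoingLinkParams: Map[Int, Double]) // keyed by downstream node id

  def main(args: Array[String]): Unit = {
    // Node 1 manages the parameters of its links to downstream nodes 10, 11, 12.
    val n1 = Node(1, Vector(0.1, 0.2), Map(10 -> 0.5, 11 -> -0.3, 12 -> 0.7))
    println(s"node ${n1.id} manages ${n1.outgoingLinkParams.size} outgoing link params")
  }
}
```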
Network Partitioning
• The DNN network is organized into layers.
• Each layer is defined as a three-dimensional cube (x, y, z).
• Each dimension can be arbitrarily partitioned, defined by (sx, sy, sz), where s specifies the number of partitions along that dimension.
• One layer can span multiple executors, and one partition is the basic unit distributed across executors (see the sketch below).
[Figure: a layer cube partitioned along x (sx = 3), y (sy = 2), and z (sz = 3).]
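An assumed illustration of this partitioning: each node coordinate (x, y, z) is mapped to a partition id given the split counts (sx, sy, sz); the flattening formula is one reasonable choice, not necessarily UNET's:

```scala
// Map a node coordinate (x, y, z) to its partition id given (sx, sy, sz).
object NetworkPartitionSketch {
  def partitionOf(x: Int, y: Int, z: Int,
                  dimX: Int, dimY: Int, dimZ: Int,
                  sx: Int, sy: Int, sz: Int): Int = {
    val px = x * sx / dimX   // which slice along x
    val py = y * sy / dimY   // which slice along y
    val pz = z * sz / dimZ   // which slice along z
    (px * sy + py) * sz + pz // flatten (px, py, pz) into one partition id
  }

  def main(args: Array[String]): Unit = {
    // A 6 x 4 x 6 layer partitioned with (sx, sy, sz) = (3, 2, 3) -> 18 partitions.
    val ids = for (x <- 0 until 6; y <- 0 until 4; z <- 0 until 6)
      yield partitionOf(x, y, z, 6, 4, 6, 3, 2, 3)
    println(s"distinct partitions = ${ids.distinct.size}") // 18
  }
}
```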
Software Components
 Layer: a logical group in the deep neural net.
 Group: a logical unit with similar input/output topology and functionality; a group can have further subgroups.
 Node: the basic computation unit providing neuron functionality.
 Connection: defines the network topology between layers, such as fully meshed, convolutional, tiled convolutional, etc.
 Adaptors: map remote upstream/downstream neurons to local neurons in the topology defined by the connections.
 Function: defines the activation of each neuron.
 Master: provides central aggregation and scatter for softmax neurons.
 Solver: the central place that drives model training and monitoring.
 Parameter Server: the server used by neurons to update/retrieve parameters.
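A rough outline, with assumed case-class names, of how these components could relate to one another:

```scala
// Assumed case classes showing one possible containment/topology relationship.
object ComponentOutlineSketch {
  final case class Node(id: Int)                                        // basic neuron unit
  final case class Group(nodes: Seq[Node], subgroups: Seq[Group] = Nil) // groups may nest
  final case class Layer(groups: Seq[Group])
  // Connection: topology between two layers (fully meshed, convolutional, ...).
  final case class Connection(from: Layer, to: Layer, kind: String)

  def main(args: Array[String]): Unit = {
    val l1   = Layer(Seq(Group((1 to 4).map(Node(_)))))
    val l2   = Layer(Seq(Group((5 to 6).map(Node(_)))))
    val conn = Connection(l1, l2, kind = "fully-meshed")
    println(s"${conn.kind}: ${l1.groups.head.nodes.size} -> ${l2.groups.head.nodes.size} nodes")
  }
}
```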
Memory Overhead
 A neuron does not need to keep the inputs from upstream; it only keeps the aggregation record.
 The calculation is associative on both the forward and backward paths (through the function-split trick).
 The link gradient is calculated and updated in the upstream node.
 Memory overhead is O(N + M), where N is the number of neurons and M is the number of parameters.
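A toy illustration of the aggregation point: the neuron below keeps only a running accumulator of its weighted inputs (the accumulation is associative), so per-neuron memory is constant regardless of fan-in; tanh is just a placeholder activation:

```scala
// A neuron keeps only a running accumulator of weighted inputs.
object StreamingNeuronSketch {
  final class Neuron {
    private var acc = 0.0                                           // the only per-example state
    def receive(weightedInput: Double): Unit = acc += weightedInput // associative accumulation
    def activate(): Double = { val out = math.tanh(acc); acc = 0.0; out }
  }

  def main(args: Array[String]): Unit = {
    val n = new Neuron
    // Upstream contributions may arrive in any order, possibly from
    // different executors; only their sum matters.
    Seq(0.2, -0.5, 0.9, 0.1).foreach(n.receive)
    println(n.activate())
  }
}
```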
Network Overhead
 A neuron forwards the same output to its upstream/downstream neurons.
 The receiving neurons compute the input or update the gradient.
 A neuron forwards its output to an executor only if that executor hosts neurons requesting it.
 A neuron forwards its output to an executor only once, regardless of the number of neurons requesting it.
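A small sketch of the last rule, assuming a hypothetical placement table from downstream neuron id to executor; deduplicating destinations by executor means each output is shipped at most once per executor:

```scala
// Ship a neuron's output to each executor at most once.
object ForwardOnceSketch {
  // Hypothetical placement: downstream neuron id -> executor hosting it.
  val placement: Map[Int, String] =
    Map(10 -> "exec-1", 11 -> "exec-1", 12 -> "exec-2", 13 -> "exec-3")

  // Destinations are the distinct executors hosting requesting neurons.
  def destinations(downstream: Set[Int]): Set[String] =
    downstream.flatMap(placement.get)

  def main(args: Array[String]): Unit = {
    // Four consuming neurons, but the output goes to only three executors.
    println(destinations(Set(10, 11, 12, 13))) // Set(exec-1, exec-2, exec-3)
  }
}
```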
Complexity
Memory: O(M + N), independent of the network partitioning mechanism.
M: the number of parameters
N: the number of nodes
Communication: O(N)
Realized by
 each node managing its outgoing link parameters instead of its incoming link parameters
 the trick of splitting the function across the layers
Distributed Pipeline
 MicroBatch: the number of training examples in one pipeline stage.
 max_buf: the length of the pipeline.
 Batch algorithms: performance improves significantly when the training data set is big enough to fully populate the pipeline.
 SGD: the improvement is limited, because the pipeline cannot be fully populated if the miniBatch size is not big enough. (A toy schedule is sketched below.)
[Figure: micro batches i+1 … i+4 flowing through Executors 1–4 over time steps T1–T4, each executor working one pipeline stage behind the previous one.]
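A toy schedule, not UNET code, that reproduces the figure: executor e works on micro batch (t - e + 1) at time step t, so every stage stays busy once the pipeline fills:

```scala
// Print which micro batch each executor processes at each time step.
object PipelineScheduleSketch {
  def main(args: Array[String]): Unit = {
    val executors = 4
    for (t <- 1 to 6; e <- 1 to executors) {
      val microBatch = t - e + 1 // stage e lags stage e-1 by one step
      if (microBatch >= 1)
        println(s"T$t: Executor $e processes micro batch i+$microBatch")
    }
  }
}
```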
Connections
 Easily extensible through Adaptors.
 An Adaptor maps global status to local status.
 Fully Meshed
 (Tiled) Convolutional
 NonShared Convolutional
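A hypothetical Adaptor sketch with assumed names: a connection type determines which global upstream neurons feed each local neuron, and the adaptor exposes that mapping locally:

```scala
// Assumed Adaptor interface: which global upstream ids feed a local neuron.
object AdaptorSketch {
  trait Adaptor {
    def upstreamOf(localId: Int): Seq[Int]
  }

  // Fully meshed: every upstream neuron feeds every local neuron.
  final class FullyMeshedAdaptor(upstreamSize: Int) extends Adaptor {
    def upstreamOf(localId: Int): Seq[Int] = 0 until upstreamSize
  }

  // 1-D convolutional: a sliding window of `kernel` upstream neurons per local neuron.
  final class ConvAdaptor(upstreamSize: Int, kernel: Int, stride: Int) extends Adaptor {
    def upstreamOf(localId: Int): Seq[Int] = {
      val start = localId * stride
      start until math.min(start + kernel, upstreamSize)
    }
  }

  def main(args: Array[String]): Unit = {
    println(new FullyMeshedAdaptor(5).upstreamOf(2))                  // all 5 upstream ids
    println(new ConvAdaptor(8, kernel = 3, stride = 2).upstreamOf(2)) // ids 4, 5, 6
  }
}
```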