Tutorial:
Deep Learning Implementations
and Frameworks
Seiya Tokui*, Kenta Oono*, Atsunori Kanemura+, Toshihiro Kamishima+
*Preferred Networks, Inc. (PFN)
{tokui,oono}@preferred.jp
+National Institute of Advanced Industrial Science and Technology (AIST)
atsu-kan@aist.go.jp, mail@kamishima.net
Overview of this tutorial
• 1st session (KO, 8:30 ‒ 10:00)
  • Introduction
  • Basics of neural networks
  • Common design of neural network implementations
• 2nd session (ST, 10:30 ‒ 12:30)
  • Differences of deep learning frameworks
  • Coding examples of frameworks
  • Conclusion
Common Design of
Deep Learning Frameworks
Kenta Oono <oono@preferred.jp>
Preferred Networks Inc.
2016/4/19 DLIF Tutorial @ PAKDD2016
Objective of this part
• How deep learning frameworks represent various neural networks.
• How deep learning frameworks realize the training procedure of neural networks.
• The technology stack that is common to most deep learning frameworks.
Steps for training neural networks
Prepare the training dataset
Initialize the Neural Network (NN) parameters
Repeat until meeting some criterion:
  Prepare for the next (mini)batch
  Define how to compute the loss of this batch
  Compute the loss (forward prop)
  Compute the gradient (backprop)
  Update the NN parameters
Save the NN parameters
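As a concrete illustration of these steps, here is a minimal sketch (plain NumPy, not taken from the tutorial) that trains a logistic-regression "network" with minibatch SGD:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)                            # prepare the training dataset
t = (X[:, 0] + X[:, 1] > 0).astype(np.float64)

W, b = np.zeros(20), 0.0                           # initialize the NN parameters
lr, batchsize = 0.1, 100

for epoch in range(10):                            # repeat until meeting some criterion
    perm = rng.permutation(len(X))
    for i in range(0, len(X), batchsize):          # prepare for the next (mini)batch
        xb, tb = X[perm[i:i + batchsize]], t[perm[i:i + batchsize]]
        y = 1.0 / (1.0 + np.exp(-(xb.dot(W) + b))) # compute the loss (forward prop)
        y = np.clip(y, 1e-7, 1 - 1e-7)
        loss = -np.mean(tb * np.log(y) + (1 - tb) * np.log(1 - y))
        gy = (y - tb) / len(xb)                    # compute the gradient (backprop)
        gW, gb = xb.T.dot(gy), gy.sum()
        W -= lr * gW                               # update the NN parameters
        b -= lr * gb

np.savez('model.npz', W=W, b=b)                    # save the NN parameters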
Technology stack of DL framework
name | functions | example
Graphical interface | — | DIGITS, TensorBoard
Machine learning workflow management | dataset management, training loop | Keras, Lasagne, Blocks, TF Learn
Computational graph management | build computational graph, forward prop / backprop | Theano, TensorFlow, Torch.nn
Multi-dimensional array library | linear algebra | NumPy, CuPy, Eigen, torch (core)
Numerical computation package | matrix operation, convolution | BLAS, cuBLAS, cuDNN
Hardware | — | CPU, GPU
Technology stack of DL framework
name | functions | example
Graphical interface | — | DIGITS, TensorBoard
Machine learning workflow management | dataset management, training loop | Keras, Lasagne, Blocks, TF Learn
Computational graph management | build computational graph, forward prop / backprop | Theano, TensorFlow, Torch.nn
Multi-dimensional array library | linear algebra | NumPy, CuPy, Eigen, torch (core)
Numerical computation package | matrix operation, convolution | BLAS, cuBLAS, cuDNN
Hardware | — | CPU, GPU
Neural Network as a Computational Graph
• In its simplest form, an NN is represented as a computational graph (CG): a stack of bipartite DAGs (directed acyclic graphs) consisting of data nodes and operator nodes.
• Example:
y = x1 * x2
z = y - x3
corresponds to the graph in which the data nodes x1 and x2 feed the operator node mul to produce the data node y, and y and x3 feed the operator node sub to produce z.
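A minimal sketch (not from the slides) of how such a graph could be represented in code: data nodes hold values, operator nodes reference their input and output data nodes.

class DataNode:
    def __init__(self, name, value=None):
        self.name, self.value = name, value

class OperatorNode:
    def __init__(self, op, inputs, output):
        self.op, self.inputs, self.output = op, inputs, output
    def forward(self):
        self.output.value = self.op(*[x.value for x in self.inputs])

x1, x2, x3 = DataNode('x1', 2.0), DataNode('x2', 3.0), DataNode('x3', 1.0)
y, z = DataNode('y'), DataNode('z')
graph = [OperatorNode(lambda a, b: a * b, [x1, x2], y),   # y = x1 * x2
         OperatorNode(lambda a, b: a - b, [y, x3], z)]    # z = y - x3
for node in graph:   # execute operator nodes in topological order
    node.forward()
print(z.value)       # 5.0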
Example: Multi-layer Perceptron (MLP)
Graph: x → Affine(W1, b1) → h1 → ReLU → a1 → Affine(W2, b2) → h2 → ReLU → a2 → Softmax → y → cross-entropy loss against the target t.
It is a choice of implementation whether the CG includes the weights and biases (W1, b1, W2, b2) as data nodes.
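A NumPy sketch of this forward pass (shapes and initialization are illustrative only):

import numpy as np

def relu(a):
    return np.maximum(a, 0)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = np.random.randn(32, 784)                       # minibatch of inputs
t = np.random.randint(0, 10, size=32)              # target labels
W1, b1 = 0.01 * np.random.randn(784, 100), np.zeros(100)
W2, b2 = 0.01 * np.random.randn(100, 10), np.zeros(10)

a1 = relu(x.dot(W1) + b1)                          # Affine (W1, b1) + ReLU
a2 = relu(a1.dot(W2) + b2)                         # Affine (W2, b2) + ReLU
y = softmax(a2)                                    # Softmax
loss = -np.mean(np.log(y[np.arange(len(t)), t]))   # cross-entropy loss against t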
Example: Recurrent Neural Network (RNN)
Graph: starting from the initial state h0, the RNN unit consumes x1 to produce h1, then consumes x2 to produce h2, and so on up to xT and hT.
The RNN unit can be:
• Affine + activation function
• LSTM (Long Short-Term Memory)
• GRU (Gated Recurrent Unit)
Each application of the unit computes ht from xt and ht-1 using the shared parameters W and b.
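A sketch of the simplest RNN unit (Affine + activation), applied repeatedly with shared parameters W (split here into Wx and Wh) and b; dimensions are illustrative.

import numpy as np

D, H = 50, 100                                    # input and hidden sizes
Wx, Wh = 0.01 * np.random.randn(D, H), 0.01 * np.random.randn(H, H)
b = np.zeros(H)

def rnn_unit(x_t, h_prev):
    # h_t = tanh(x_t Wx + h_{t-1} Wh + b)
    return np.tanh(x_t.dot(Wx) + h_prev.dot(Wh) + b)

xs = [np.random.randn(1, D) for _ in range(10)]   # x1, ..., xT
h = np.zeros((1, H))                              # h0
for x_t in xs:                                    # the unrolled graph reuses the same unit
    h = rnn_unit(x_t, h)                          # h1, h2, ..., hT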
Example: Stacked RNN
Graph: the first RNN layer maps x1, ..., xT (from the initial state h0) to the hidden states h1, ..., hT; a second RNN layer stacked on top maps h1, ..., hT (from the initial state z0) to z1, ..., zT; an Affine + Softmax layer on top produces the output y.
Example: RNN with control flow nodes
Graph (schematic): a loop-enter node receives the initial state h0 and the input x; a predicate node decides whether to continue; a switch node routes the state s either back through the RNN unit (pred = True), which updates it to s', or to the loop-end node (pred = False), which emits the output y. The RNN unit uses the shared parameters W and b.
• TensorFlow has control flow nodes (e.g. cond, switch, while).
• Since the CG contains a loop, some mechanism is necessary that resolves the dependencies between nodes and schedules the order of calculation.
Automatic Differentiation
• Computes the gradient of specified data nodes (e.g. the loss) with respect to each data node.
• Each operator node must have a backward operation that calculates gradients w.r.t. its inputs from gradients w.r.t. its outputs (a realization of the chain rule).
• e.g. The Function class of Chainer has a backward method.
• e.g. Each layer class of Caffe has Backward_cpu and Backward_gpu methods.
• e.g. Autograd wraps most NumPy functions with thin wrappers that add the gradient computation as a closure.
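A sketch of such backward operations, modelled loosely on Chainer's Function style (illustrative code, not any framework's actual implementation):

class Mul:
    def forward(self, x1, x2):
        self.x1, self.x2 = x1, x2          # keep inputs for the backward pass
        return x1 * x2
    def backward(self, gy):
        # chain rule: d(x1*x2)/dx1 = x2, d(x1*x2)/dx2 = x1
        return gy * self.x2, gy * self.x1

class Sub:
    def forward(self, y, x3):
        return y - x3
    def backward(self, gz):
        return gz, -gz                     # d(y-x3)/dy = 1, d(y-x3)/dx3 = -1

mul, sub = Mul(), Sub()
y = mul.forward(2.0, 3.0)                  # forward: y = x1 * x2
z = sub.forward(y, 1.0)                    #          z = y - x3
gy, gx3 = sub.backward(1.0)                # backward, starting from dz/dz = 1
gx1, gx2 = mul.backward(gy)                # gx1 = 3.0, gx2 = 2.0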
Backprop through CG
y = x1 * x2
z = y - x3
Starting from ∇z z = 1 at the output, gradients flow backward through the graph: the sub node yields ∇y z (= 1) and ∇x3 z (= -1), and the mul node yields ∇x1 z (= x2) and ∇x2 z (= x1).
Backprop as extended graphs
y = x1 * x2
z = y - x3
The backward pass can itself be expressed by extending the forward graph with new nodes: dz feeds an identity node to give dy (since ∂z/∂y = 1) and a negation node to give dx3 (∂z/∂x3 = -1); dy and x2 feed a mul node to give dx1, and dy and x1 feed another mul node to give dx2. Forward propagation and backward propagation are then both ordinary computational graphs.
Example: Theano
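The original slide shows a Theano code example that is not reproduced in this text. A minimal sketch of what such an example might look like, using Theano's public API on the running example:

import theano
import theano.tensor as T

x1, x2, x3 = T.dscalars('x1', 'x2', 'x3')   # symbolic data nodes
y = x1 * x2                                 # operator nodes added by overloaded operators
z = y - x3

# T.grad extends the graph with backward nodes (automatic differentiation)
gx1, gx2, gx3 = T.grad(z, [x1, x2, x3])

# compiling applies Theano's graph optimizations before generating code
f = theano.function([x1, x2, x3], [z, gx1, gx2, gx3])
print(f(2.0, 3.0, 1.0))   # z = 5.0, dz/dx1 = 3.0, dz/dx2 = 2.0, dz/dx3 = -1.0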
Technology stack of DL framework
name | functions | example
Graphical interface | — | DIGITS, TensorBoard
Machine learning workflow management | dataset management, training loop | Keras, Lasagne, Blocks, TF Learn
Computational graph management | build computational graph, forward prop / backprop | Theano, TensorFlow, Torch.nn
Multi-dimensional array library | linear algebra | NumPy, CuPy, Eigen, torch (core)
Numerical computation package | matrix operation, convolution | BLAS, cuBLAS, cuDNN
Hardware | — | CPU, GPU
Numerical optimizer
• Many gradient-based optimization algorithms are implemented.
• Stochastic Gradient Descent (SGD) is implemented in most DL frameworks.
• Which optimizer works best depends on the concrete task.

w: parameters of the neural network
θ: state of the optimizer
L: loss function
Γ: optimizer-specific update function

initialize w, θ
until the criterion is met:
  get data (x, y)
  calculate ∇w L(x, y; w)
  w, θ ← Γ(w, θ, ∇w L)
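A minimal sketch of the update rule Γ for plain SGD and for SGD with momentum (here the optimizer state θ is the velocity v; this is illustrative, not any framework's optimizer class):

import numpy as np

def sgd(w, grad, lr=0.01):
    return w - lr * grad                 # w ← w - lr * ∇w L

def momentum_sgd(w, v, grad, lr=0.01, mu=0.9):
    v = mu * v - lr * grad               # update the optimizer state θ (the velocity)
    return w + v, v                      # update the parameters w

w, v = np.zeros(10), np.zeros(10)
grad = np.random.randn(10)               # stands in for ∇w L(x, y; w)
w, v = momentum_sgd(w, v, grad)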
Serialization
• Save/load snapshots of the training process in a specified format (e.g. HDF5, npz, protobuf):
  • the model being trained (= architecture and parameters of the NN)
  • the state of the training procedure (e.g. epoch, learning rate, momentum)
• Serialization enhances the portability of models:
  • publishing pre-trained models (e.g. Caffe's Model Zoo; MXNet and TensorFlow also distribute pre-trained models)
  • importing pre-trained models from other DL frameworks
  • e.g. Chainer supports the BVLC-official reference models of Caffe.
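A sketch of saving and loading a snapshot in npz format (parameters plus some optimizer/loop state); real frameworks serialize richer objects, e.g. into HDF5 or protobuf:

import numpy as np

snapshot = {'W1': np.random.randn(784, 100), 'b1': np.zeros(100),
            'epoch': np.array(3), 'learning_rate': np.array(0.01)}
np.savez('snapshot.npz', **snapshot)      # save the snapshot to disk

loaded = np.load('snapshot.npz')          # load it back later or on another machine
W1, epoch = loaded['W1'], int(loaded['epoch'])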
Computational optimizer
• Converts CGs into simplified, more efficient ones (e.g. Theano's graph optimizations).
y = x1 * x2
z = y - x3
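A toy sketch of such a graph rewrite (not Theano's actual machinery): expressions are nested tuples ('op', arg1, arg2), and the rule x * 1 → x simplifies the graph.

def simplify(expr):
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr[0], simplify(expr[1]), simplify(expr[2])
    if op == 'mul' and b == 1:            # rewrite rule: x * 1 -> x
        return a
    return (op, a, b)

expr = ('sub', ('mul', 'x1', 1), 'x3')    # (x1 * 1) - x3
print(simplify(expr))                     # ('sub', 'x1', 'x3')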
Abstraction of ML workflow
• Offers typical training/validation/evaluation procedures as APIs.
• Users call a single API and do not have to write the procedure manually.
• e.g. the fit and evaluate methods of the Model class in Keras (see the sketch after the steps below).
Prepare the training dataset
Initialize the Neural Network (NN) parameters
Repeat until meeting some criterion:
  Prepare for the next (mini)batch
  Define how to compute the loss of this batch
  Compute the loss (forward prop)
  Compute the gradient (backprop)
  Update the NN parameters
Save the NN parameters
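In Keras, for example, the whole loop above collapses into a few calls. The following is a rough sketch; argument names and defaults vary between Keras versions.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation

X = np.random.randn(1000, 20)
Y = np.random.randint(0, 2, size=(1000, 1))

model = Sequential([Dense(64, input_dim=20), Activation('relu'),
                    Dense(1), Activation('sigmoid')])
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X, Y, batch_size=32)        # runs the whole training loop
loss, acc = model.evaluate(X, Y)      # runs the evaluation procedure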
Graphical interface
• Computational graph management
  • editor, visualizer
• Visualization of the training procedure
  • visualization of feature maps, outputs of NNs, etc.
  • curves of error and accuracy over training
• Performance monitoring
  • e.g. throughput, latency, memory usage
Technology stack of DL framework
name | functions | example
Graphical interface | — | DIGITS, TensorBoard
Machine learning workflow management | dataset management, training loop | Keras, Lasagne, Blocks, TF Learn
Computational graph management | build computational graph, forward prop / backprop | Theano, TensorFlow, Torch.nn
Multi-dimensional array library | linear algebra | NumPy, CuPy, Eigen, torch (core)
Numerical computation package | matrix operation, convolution | BLAS, cuBLAS, cuDNN
Hardware | — | CPU, GPU
GPU support
• CUDA: computing platform for GPGPU on NVIDIA GPUs
  • language extension, compiler, libraries, etc.
• DL frameworks provide wrappers around CUDA:
  • GPU-array libraries that utilize cuBLAS, cuRAND, etc.
  • layer implementations based on cuDNN (e.g. convolution, sigmoid, LSTM)
• Designed so that CPU and GPU can be switched easily:
  • e.g. users can write CPU/GPU-agnostic code
  • e.g. CPU/GPU can be switched via environment variables
• Some frameworks support OpenCL as a GPU environment, but CUDA is more popular for now.
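A sketch of CPU/GPU-agnostic code in the NumPy/CuPy style: the same function runs on either device because CuPy mirrors the NumPy API (this assumes CuPy is installed for the GPU path and falls back to NumPy otherwise).

import numpy as np

def get_array_module(x):
    try:
        import cupy
        return cupy.get_array_module(x)   # numpy or cupy, depending on where x lives
    except ImportError:
        return np

def affine(x, W, b):
    xp = get_array_module(x)              # xp is numpy for CPU arrays, cupy for GPU arrays
    return xp.dot(x, W) + b

x = np.random.randn(32, 784).astype(np.float32)
W, b = np.random.randn(784, 100).astype(np.float32), np.zeros(100, dtype=np.float32)
h = affine(x, W, b)                       # the same code would accept CuPy arrays on a GPU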
Multi-dimensional array library (CPU / GPU)
• In charge of the concrete computation on data nodes.
• Heavily depends on BLAS (CPU) or CUDA / the CUDA Toolkit libraries (GPU).
• CPU
  • third-party libraries: Eigen::Tensor, NumPy
  • written from scratch: ND4J (DL4J), mshadow (MXNet)
• GPU
  • third-party libraries: Eigen::Tensor, PyCUDA, gpuarray
  • written from scratch: ND4J (DL4J), mshadow (MXNet), CuPy (Chainer)
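Each graph operation ultimately lowers to one or a few calls into the numerical computation package below this layer; for instance (illustrative NumPy code), an Affine forward is a single matrix multiplication, which NumPy dispatches to BLAS (GEMM) and a GPU array library would dispatch to cuBLAS:

import numpy as np

x = np.random.randn(32, 784).astype(np.float32)   # minibatch held by a data node
W = np.random.randn(784, 100).astype(np.float32)
b = np.zeros(100, dtype=np.float32)
h = x.dot(W) + b                                   # one GEMM plus a broadcast add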
Which device to use?
• The GPU is (by far) faster than the CPU in most cases.
  • Most tensor calculations consist of element-wise operations, matrix multiplications, and convolutions.
• Exceptional cases:
  • The mini-batch technique is difficult to apply.
    • e.g. variable-length training data
    • e.g. the architecture of the NN depends on the training data
  • GPU calculation cannot hide the cost of transferring data to the GPU.
    • e.g. the minibatch size is too small
Technology stack of Chainer
name | Chainer's stack
Graphical interface | —
Machine learning workflow management | —
Computational graph management | Chainer
Multi-dimensional array library | NumPy (CPU), CuPy (GPU)
Numerical computation package | BLAS (CPU); cuBLAS, cuRAND, cuDNN (GPU)
Hardware | CPU, GPU
Technology stack of TensorFlow
name | TensorFlow's stack
Graphical interface | TensorBoard
Machine learning workflow management | TF Learn
Computational graph management | TensorFlow
Multi-dimensional array library | Eigen::Tensor
Numerical computation package | BLAS (CPU); cuBLAS, cuRAND, cuDNN (GPU)
Hardware | CPU, GPU
Technology stack of Theano
name | Theano's stack
Graphical interface | —
Machine learning workflow management | —
Computational graph management | Theano
Multi-dimensional array library | NumPy (CPU), libgpuarray (GPU)
Numerical computation package | BLAS (CPU); CUDA, OpenCL, CUDA Toolkit (GPU)
Hardware | CPU, GPU
Technology stack of Keras
name | Keras's stack
Graphical interface | —
Machine learning workflow management | Keras
Computational graph management and below | Theano or TensorFlow: Keras delegates to the technology stack of whichever backend is selected
Hardware | CPU, GPU
Summary
• Most DL frameworks have many components in common and can be organized into a similar technology stack.
• At the upper layers of the stack, frameworks are designed to help users follow typical ML workflows.
• At the middle layers, manipulations of computational graphs are automated.
• At the lower layers, optimized tensor calculations are implemented.
• How these components are realized differs between frameworks, as we will see in the following part.
memorandum
Training of Neural Networks
• L is designed so that its value gets smaller as the prediction becomes more accurate.
• In the deep learning context:
  • L is represented by a neural network
  • w are the parameters of the neural network

argmin_w Σ_{(x, y)} L(x, y; w)
w: parameters
x: feature vector
y: training label
L: loss function
e.g. a classification problem
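As a concrete instance (illustrative NumPy code, not from the slides), the softmax cross-entropy loss of a linear classifier averaged over training pairs (x, y):

import numpy as np

def loss(W, b, X, y):
    logits = X.dot(W) + b
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_p[np.arange(len(y)), y])   # smaller when predictions are more accurate

X = np.random.randn(100, 20)
y = np.random.randint(0, 3, size=100)
W, b = np.zeros((20, 3)), np.zeros(3)
print(loss(W, b, X, y))                            # ≈ log 3 for the untrained (uniform) predictor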
Layer = function + data nodes
• Layers (e.g. fully connected layers, convolutional layers) can be considered as functions with parameters to be optimized.
• In most modern frameworks, the parameters of layers can be considered as data nodes in the computational graph.
• Frameworks need to differentiate which data nodes are parameters to be optimized and which are data points.
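A sketch of a layer as "function + parameter data nodes" (illustrative code, not any framework's actual class): the layer owns parameter nodes that the optimizer updates, and acts as a function of its input data node.

import numpy as np

class Linear:
    def __init__(self, n_in, n_out):
        # parameters: data nodes that are targets of optimization
        self.params = {'W': 0.01 * np.random.randn(n_in, n_out), 'b': np.zeros(n_out)}
    def __call__(self, x):
        return x.dot(self.params['W']) + self.params['b']

layer = Linear(784, 100)
x = np.random.randn(32, 784)   # x is a data node holding a data point, not a parameter
h = layer(x)                   # an optimizer would update layer.params but never x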
Execution Engine
• It resolves the dependencies between data nodes and schedules the execution of the parts of the computational graph (especially in multi-node or multi-GPU settings).
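A toy sketch of such scheduling (illustrative only): count the dependencies of each node and run a node as soon as all of its inputs are ready, i.e. execute the CG in a topological order.

from collections import deque

# graph for y = x1 * x2, z = y - x3, given as node -> list of input nodes
deps = {'x1': [], 'x2': [], 'x3': [], 'y': ['x1', 'x2'], 'z': ['y', 'x3']}

def schedule(deps):
    remaining = {n: len(ins) for n, ins in deps.items()}
    users = {n: [] for n in deps}
    for n, ins in deps.items():
        for i in ins:
            users[i].append(n)
    ready = deque(n for n, c in remaining.items() if c == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)                  # a real engine would launch the computation here
        for u in users[n]:
            remaining[u] -= 1
            if remaining[u] == 0:
                ready.append(u)
    return order

print(schedule(deps))                    # e.g. ['x1', 'x2', 'x3', 'y', 'z']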