APSys Presentation Final copy2

IMPLEMENTATION AND EVALUATION OF DEEP NEURAL NETWORKS
(DNN) ON MAINSTREAM HETEROGENEOUS SYSTEMS
JUNLI GU, MAOHUA ZHU, ZHITAO ZHOU, FENG ZHANG
ZHEN LIN, QIANFENG ZHANG, MAURICIO BRETERNITZ
AMD (RESEARCH)
JUNE 26, 2014
JUNLI.GU@AMD.COM

| DNN PROJECT2
BACKGROUND
 What is a Deep Neural Network (DNN)?
‒ 3~8 hidden layers, millions to billions of parameters
‒ DNN + Big Data is a leading recent direction in machine learning
 Rich Varieties of DNN Structures
‒ MLP (Multi-Layer Perceptron)/ AutoEncoder
‒ CNN (Convolutional Neural Network)
‒ DBN (Deep belief network)/RBM (Restricted Boltzmann Machine)
 DNN Applications
‒ Speech Recognition
‒ Image Classification/recognition/retrieval
‒ Document retrieval, Handwriting recognition
‒ OCR…
 Industry Use of DNN
‒ Google, Yahoo, Baidu, Alibaba, Tencent, iFlytek, Microsoft, Bank and Finance
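[Figure: a deep neural network with an input layer, three hidden layers (hidden1, hidden2, hidden3), and an output layer; neurons are joined by weighted connections]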

MOTIVATION
DNN challenges hardware:
Computation Heavy, Memory Heavy and Parallel Execution
Fortunately, DNNs offer rich data/model parallelism
==> GPU massive hardware parallelism
==> Heterogeneous Platforms:
Clusters of CPU+GPU, or APU server?
Note: APU is a processor with both CPU and GPU on the same die.

CPU+GPU CLUSTER
 Existing Platforms
‒ CPU cluster (scale out)
‒ CPU + GPU clusters (scale up + scale out)
 Bottlenecks
‒ GPU device memory size limitation for DNN data/model
‒ Every 250M parameters require 1GB memory
‒ Communication overheads are bottleneck
‒ Intra-node between CPU and GPU, and inter-node
‒ GPU is big and power hungry, low density
• Google Brain’s 1000 processor system
• Stanford Univ., Andrew Y. Ng et al., “Deep learning with COTS HPC systems”, International Conference on Machine Learning, 2013
[Figure: a CPU+GPU cluster; in each node the CPUs connect to four GPUs over PCIe, and the nodes are linked by an InfiniBand connection]
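The "every 250M parameters require 1GB" figure on this slide is consistent with 4-byte single-precision storage. That data type is an assumption here (the slide does not state it, although the later benchmarks use sgemm). A minimal sketch of the arithmetic, using a hypothetical helper `param_bytes`:

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters are stored in single precision


def param_bytes(num_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Memory needed for the parameters alone (no activations or gradients)."""
    return num_params * bytes_per_param


gigabytes = param_bytes(250_000_000) / 1e9
print(f"250M parameters -> {gigabytes:.1f} GB")
```

Activations, gradients, and mini-batch data add to this, so the parameter footprint is only a lower bound on device memory use.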

APU AND APU SERVER
 APU
‒ In 2009, AMD launched the first chip integrated with both CPU and GPU
‒ Programming through OpenCL
 Architectural Advantages
‒ Unified memory address space: GPU and CPU share a very large memory
‒ Very efficient data sharing: no data copy
‒ Fully coherent memory
‒ Sharing through pointers
 APU Server
‒ High density, low power data server
‒ Customized fast FABRIC
‒ Advanced research under way on an internal prototype
[Figure: an APU with CPU and GPU attached to shared memory (HSA features); an APU server built from 8x8x8 = 512 nodes. Credit: AMD SeaMicro]

SOME QUICK TAKEAWAYS
A CPU+GPU cluster gets a 2x speedup with 6x more power.
2.4 APUs can achieve the same performance with 2.5x less power.
APUs can be integrated into high-density, power-efficient data centers to reduce complexity and cost.

OUTLINE
 Background and Motivation
 DNN Algorithm Architectures
‒ MLP (Multi-Layer Perceptron)
‒ Autoencoder
 Evaluation on Multiple Platforms
 Bottleneck Analysis
 Conclusions and Next Plan

DNN ALGORITHM ARCHITECTURE 1– MLP
 MLP (Multi-Layer Perceptron)
‒ Speech recognition
‒ Layers of matrix multiply + non-linear functions
 Compute Patterns
‒ Layers of matrix multiplication
‒ Representative of the compute-intensive part of most DNNs
‒ CPU prepares data, GPU computes
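MLP Structure
Layer sizes, input to output: 1100, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 9304
Adjacent layers are fully connected
Parameter space: 44 million (layer size: 1k-2k-2k-2k-2k-2k-2k-2k-2k-9k)
Forward/Backward Propagation
[Figure: x -> (z1, a1) -> (z2, a2) -> (z3, a3) across the input layer, two hidden layers, and the output layer, with weights w1, w2, w3; forward propagation runs from input to output, back propagation runs from the error back toward the input]
Forward propagation (x and the activations are row vectors):
$z_1 = x w_1 + b_1, \quad a_1 = f(z_1)$
$z_2 = a_1 w_2 + b_2, \quad a_2 = f(z_2)$
$z_3 = a_2 w_3 + b_3, \quad a_3 = f(z_3)$
Error:
$e = \frac{1}{2}(y - a_3)^2$
Back propagation ($\odot$ is the element-wise product):
$\delta_3 = -(y - a_3) \odot f'(z_3), \quad \frac{\partial e}{\partial w_3} = a_2^T \delta_3$
$\delta_2 = (\delta_3 w_3^T) \odot f'(z_2), \quad \frac{\partial e}{\partial w_2} = a_1^T \delta_2$
$\delta_1 = (\delta_2 w_2^T) \odot f'(z_1), \quad \frac{\partial e}{\partial w_1} = x^T \delta_1$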
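The forward and backward equations on this slide can be sketched in plain Python and checked numerically. This is only an illustration under stated assumptions: the sigmoid is an assumed choice for f, the tiny layer sizes (4-5-5-3) are hypothetical, activations are row vectors so each delta is propagated as delta times the transposed weight matrix, and list-based matrices stand in for the BLAS calls the deck actually uses.

```python
import math
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def hadamard(p, q):
    """Element-wise product (the .* in the slide)."""
    return [[u * v for u, v in zip(r1, r2)] for r1, r2 in zip(p, q)]


def f(z):
    """Sigmoid non-linearity (an assumed choice for f)."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in z]


def f_prime_from_a(a):
    """For the sigmoid, f'(z) can be written as a * (1 - a)."""
    return [[v * (1.0 - v) for v in row] for row in a]


def forward(x, ws, bs):
    """Return the activations [x, a1, a2, a3]."""
    acts = [x]
    for w, b in zip(ws, bs):
        z = [[v + bv for v, bv in zip(row, b)] for row in matmul(acts[-1], w)]
        acts.append(f(z))
    return acts


def cost(x, y, ws, bs):
    a_out = forward(x, ws, bs)[-1]
    return 0.5 * sum((yv - av) ** 2 for yr, ar in zip(y, a_out) for yv, av in zip(yr, ar))


def backward(x, y, ws, bs):
    """Return the gradients [de/dw1, de/dw2, de/dw3]."""
    acts = forward(x, ws, bs)
    # delta3 = -(y - a3) .* f'(z3)
    diff = [[-(yv - av) for yv, av in zip(yr, ar)] for yr, ar in zip(y, acts[-1])]
    delta = hadamard(diff, f_prime_from_a(acts[-1]))
    grads = [None] * len(ws)
    for k in reversed(range(len(ws))):
        grads[k] = matmul(transpose(acts[k]), delta)  # previous activation^T * delta
        if k > 0:  # push delta one layer back through w^T
            delta = hadamard(matmul(delta, transpose(ws[k])), f_prime_from_a(acts[k]))
    return grads


random.seed(0)
sizes = [4, 5, 5, 3]  # hypothetical toy sizes, not the 1100-2048-...-9304 model
ws = [[[random.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
      for n_in, n_out in zip(sizes, sizes[1:])]
bs = [[0.0] * n_out for n_out in sizes[1:]]
x = [[random.random() for _ in range(sizes[0])] for _ in range(2)]
y = [[random.random() for _ in range(sizes[-1])] for _ in range(2)]

grads = backward(x, y, ws, bs)

# Central finite difference on one weight of w2 as a sanity check.
eps = 1e-6
ws[1][2][3] += eps
c_plus = cost(x, y, ws, bs)
ws[1][2][3] -= 2 * eps
c_minus = cost(x, y, ws, bs)
ws[1][2][3] += eps
numeric = (c_plus - c_minus) / (2 * eps)
print("analytic:", grads[1][2][3], "numeric:", numeric)
```

Every layer reduces to one matrix product in the forward pass and two in the backward pass, which is why the deck treats matrix multiplication as the dominant cost.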

DNN ALGORITHM ARCHITECTURE 2–AUTOENCODER
 Autoencoder + L-BFGS Training
‒ Used for pre-training (Hinton et al., 2006)
‒ Semantic retrieval (Krizhevsky et al., 2011)
‒ L-BFGS has good scalability (Le et al., 2011)
 Compute Patterns
‒ A mix of CPU compute with GPU compute
‒ Frequent CPU-GPU interactions and data transfers
‒ A good fit to leverage APU advantages
Autoencoder Structure
Layer sizes: 3072 (input layer) - 6144 - 1024 (code) - 6144 - 3072 (reconstruction layer, output), with weights W1, W2, W2^T, W1^T
1. Encode the input, then reconstruct it from the code to compute the cost
2. Parameter space: 25 million (layer size: 3k-6k-1k-6k-3k)
L-BFGS Training Algorithm
[Flowchart: on the CPU, L-BFGS computes a new direction and tries a new step length; on the GPU, forward propagation and back propagation produce the cost and gradients; if the line search condition is met (Y), L-BFGS computes a new direction, otherwise (N) it tries a new step length]
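The CPU/GPU split in the L-BFGS flowchart can be sketched as a loop in which the direction update and line search run on the host and only the cost-and-gradient call would be offloaded. The sketch below makes several assumptions that are not in the slides: the Armijo sufficient-decrease test stands in for the unspecified "line search condition", step lengths are halved on failure, and a toy quadratic (`toy_cost_and_grad`, hypothetical) replaces the autoencoder's forward and backward propagation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: approximate -H^{-1} * grad from stored (s, y) pairs."""
    q = list(grad)
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):  # newest to oldest
        a = dot(s, q) / dot(y, s)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    if s_hist:  # initial Hessian scaling from the newest pair
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
        q = [gamma * qi for qi in q]
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):  # oldest to newest
        b = dot(y, q) / dot(y, s)
        q = [qi + (a - b) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]


def minimize(cost_and_grad, w, iters=200, history=5, c1=1e-4):
    s_hist, y_hist = [], []
    cost, grad = cost_and_grad(w)
    for _ in range(iters):
        if dot(grad, grad) < 1e-20:
            break
        d = lbfgs_direction(grad, s_hist, y_hist)       # CPU: compute new direction
        step, slope = 1.0, dot(grad, d)
        while True:
            w_new = [wi + step * di for wi, di in zip(w, d)]
            cost_new, grad_new = cost_and_grad(w_new)   # GPU: get cost and gradients
            # Line search condition (Armijo, an assumed choice)
            if cost_new <= cost + c1 * step * slope or step < 1e-12:
                break
            step *= 0.5                                  # try new step length
        s = [a - b for a, b in zip(w_new, w)]
        y = [a - b for a, b in zip(grad_new, grad)]
        if dot(s, y) > 1e-12:                            # keep only curvature-positive pairs
            s_hist = (s_hist + [s])[-history:]
            y_hist = (y_hist + [y])[-history:]
        w, cost, grad = w_new, cost_new, grad_new
    return w, cost


CURV = [1.0, 10.0, 100.0]    # hypothetical curvatures of the toy problem
TARGET = [1.0, -2.0, 3.0]    # hypothetical minimizer


def toy_cost_and_grad(w):
    """Stand-in for autoencoder forward + backward propagation."""
    cost = 0.5 * sum(c * (wi - t) ** 2 for c, wi, t in zip(CURV, w, TARGET))
    grad = [c * (wi - t) for c, wi, t in zip(CURV, w, TARGET)]
    return cost, grad


w_opt, final_cost = minimize(toy_cost_and_grad, [0.0, 0.0, 0.0])
print(w_opt, final_cost)
```

Each trial step length costs one full forward and backward pass, so the loop alternates between host and device many times per iteration. That is the frequent CPU-GPU interaction the slide points to.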

OUTLINE
‒Implementation on APUs and GPUs
‒Performance/power/perf_per_watt comparison

EVALUATION METHODOLOGY AND PLATFORMS
 Implementations based on commercial BLAS libraries
‒ Mainstream X86 CPUs: C++ & math library
‒ AMD APUs & GPUs: OpenCL & CLAMDBLAS
‒ Mainstream GPU: CUDA C & CUBLAS (for competitive comparison)
 Platforms
Device Category | Device Name | Throughput (GFLOPS) | Price (RMB) | TDP (Watt) | Versions run (CPU / AMD OCL / CUDA) | Note
CPU | Mainstream x86 | 848 | 2240 | 84 | √ √ | Realtime power traces
APU series | AMD APU A10-7850K | 856 | 1299 | 95 | √ | Realtime power traces
APU series | Mainstream x86 SOC | 848 | 2240 | 84 | √ | Realtime power traces
Consumer GPU | AMD HD7970 | 3788.8 | 2000 | 250 | √ | TDP used
Consumer GPU | Mainstream GPU | 3977 | 3799 | 250 | √ √ | TDP used

EVALUATION METHODOLOGY AND PLATFORMS-CONT.
 Evaluation results indicate per-unit training speed
‒ CNN not tested, as work is still under development
‒ MLP and Autoencoder tested; results are initial
‒ DNN model parameters and mini-batch sizes align with Internet industry practice
‒ Single-node results presented
‒ Further optimizations are ongoing

MLP MODEL (VOICE RECOGNITION)
• Kaveri (95 W) vs. Mainstream x86: 1.8x speedup
• Kaveri (95 W) vs. Mainstream x86 SOC’s: 3.7x speedup
Mini-batch size: 1024
CPU prepares data, GPU computes
Note: CLAMDBLAS offers an architecture-aware optimization tool called clAmdBlasTune. Make sure to run it before the first use on a given processor.

PERFORMANCE/POWER/PERF_PER_WATT
 APU achieves the highest Perf./watt
E.g., 1.2x compared to GPU
 GPU achieves 5x perf. with 7x power
 CPU gets 60% perf. with 1.9x power
Speed, power, and perf. per watt (all normalized to APU)
Device | Performance Per Watt Ratio | Performance Ratio | Power Ratio
A10-7850K | 1 | 1 | 1
Mainstream x86 | 0.3 | 0.6 | 1.9
Mainstream x86 SOC's | 0.22 | 0.3 | 1.3
AMD HD7970 | 0.7 | 4.9 | 7.3
Mainstream GPU | 0.8 | 6.2 | 7.3
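The performance-per-watt ratio in this chart is the performance ratio divided by the power ratio. A small sketch that recomputes it from the plotted values, assuming the numbers are read in the device order shown on the axis (the function name `perf_per_watt` is hypothetical):

```python
# (performance ratio, power ratio), both normalized to the A10-7850K APU
MLP_RATIOS = {
    "A10-7850K": (1.0, 1.0),
    "Mainstream x86": (0.6, 1.9),
    "Mainstream x86 SOC's": (0.3, 1.3),
    "AMD HD7970": (4.9, 7.3),
    "Mainstream GPU": (6.2, 7.3),
}


def perf_per_watt(perf_ratio, power_ratio):
    """Performance per watt relative to the APU baseline."""
    return perf_ratio / power_ratio


for name, (perf, power) in MLP_RATIOS.items():
    print(f"{name:22s} {perf_per_watt(perf, power):.2f}")
```

The recomputed values land close to the plotted 0.3, 0.22, 0.7, and 0.8; the small differences are consistent with rounding in the chart labels.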

AUTOENCODER (IMAGE AND DOCUMENT RETRIEVAL)
• Algorithm is a mix of CPU+GPU compute
• APU vs. Mainstream x86: 8% slowdown
• APU vs. Mainstream x86 SOC’s: 3.8x speedup
 The larger the batch size, the bigger the advantage the APU shows.
Data: CIFAR10, Mini-batch size: 2048
CPU: L-BFGS; GPU: Autoencoder forward and backward propagation

PERFORMANCE/POWER/PERF_PER_WATT
 APU achieves the highest Perf./watt
E.g., 2x compared to dGPU
 GPU achieves 2x perf. with 5x power
 CPU gets 90% perf. with 1.4x power
Speed, power, and perf. per watt (all normalized to APU)
Device | Performance Per Watt Ratio | Performance Ratio | Power Ratio
A10-7850K | 1 | 1 | 1
Mainstream x86 | 0.65 | 0.9 | 1.4
Mainstream x86 SOC's | 0.3 | 0.3 | 0.9
AMD HD7970 | 0.46 | 2.2 | 4.8
Mainstream GPU | 0.5 | 2.4 | 4.8

REAL CASE TRAINING
 MNIST Training through MLP Model
‒ Handwritten digits, 60000 images
‒ Mini-batch size 1024, 200 epochs
‒ Accuracy 97% with random initial weights
‒ Accuracy 98% with pre-trained weights
Process | Metric | APU A10-7850K | GPU HD7970 | GPU vs. APU
Training | Time | 362 second | 192 second | 1.9x speedup
Training | Average Power | 47 Watt | 250 Watt | 5.3x power
Training | Energy | 17k Joule | 40k Joule | 2.4x energy
Predicting | Time | 8.1 second | 3.5 second | 2.3x speedup
Predicting | Average Power | 37 Watt | 250 Watt | 6.8x power
Predicting | Energy | 300 Joule | 875 Joule | 2.9x energy
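The energy rows in this table follow from energy = average power x time. A short sketch that recomputes them from the reported time and power values (the helper `energy_joules` is a hypothetical name):

```python
def energy_joules(avg_power_watt, time_second):
    """Energy in joules from average power and elapsed time."""
    return avg_power_watt * time_second


# (average power in watts, time in seconds) as reported on the slide
RUNS = {
    ("training", "APU"): (47, 362),
    ("training", "GPU HD7970"): (250, 192),
    ("predicting", "APU"): (37, 8.1),
    ("predicting", "GPU HD7970"): (250, 3.5),
}

for (phase, device), (watt, sec) in RUNS.items():
    print(f"{phase:10s} {device:10s} {energy_joules(watt, sec):8.0f} J")
```

Three of the four products match the table (about 17k, 300, and 875 Joule). The GPU training product is 48,000 Joule, while the slide reports 40k Joule; the slide does not explain the difference.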

DNN PERFORMANCE BOTTLENECKS
 DNN computation is usually converted to matrix multiplication, which consumes the major part of the time.
‒ People use BLAS libraries provided for commercial processors.
 The weight matrix is transposed during back propagation.
‒ Access flips between row order and column order between fprop and bprop.
 Data transfers between CPU and GPU can consume most of the time, especially for large images.
‒ Task assignment: CPU prepares the data, GPU computes
‒ APU can remove the overheads through the zero-copy technique

FURTHER ANALYSIS-WEIGHT MATRIX TRANSPOSE
 Weight matrices will be transposed during back propagation (on BP’s critical path)
‒ $z = W^T \sigma$
 What is the most efficient way to transpose on different platforms?
‒ sgemm, sgemm_T, GPU_Tran + sgemm, CPU_Tran + sgemm
 Note: using the CPU to transpose the matrix gives the worst performance, because the CPU takes about an order of magnitude longer to transpose while the GPU waits idle
Micro benchmark: transpose 2kx2k matrix A and multiply $A^T B$
Platform | AMD GPU (FX8320+HD7970) | FX8320+Mainstream GPU | AMD APU (A10-7850K)
sgemm | 8.62 ms | 6.09 ms | 53.26 ms
sgemm_T | 17.69 ms | 6.31 ms | 83.3 ms
GPU Tran + sgemm | 9.56 ms | 6.34 ms | 55.46 ms
CPU Tran + sgemm | 55.88 ms | 67.46 ms | 86.8 ms
FURTHER ANALYSIS-DATA TRANSFER OVERHEADS
 Data transfer overheads between CPU and GPU have been pointed out (A. et al., 2013) as the bottleneck of DNN acceleration.
 First, we use the autoencoder to quantify the data transfer overheads.
 Data transfer time increases linearly with data size. It is very difficult to train on real-world-size images without removing this bottleneck.
Data transfer overheads on CPU + Mainstream GPU, one forward prop. and backward prop. (data transfer time as % of total time)
Input data size | 256-batch | 512-batch | 1024-batch | 2048-batch
3072 | 15% | 18% | 18% | 21%
5120 | 24% | 25% | 27% | 33%
7168 | 33% | 34% | 38% | 40%
40% of the time is spent moving data for 48x48 RGB images
DATA TRANSFER OVERHEADS
 How to avoid data copy through the zero-copy technique on APUs?
‒ APU: Zero-copy improves performance by 10%
‒ GPUs: Zero-copy degrades performance by 3.5x for AMD HD7970 and 8.7x for Mainstream GPU.
Zero-copy technique:
‒ APUs: CPU and GPU share the same piece of memory, efficient
‒ GPUs: GPU accesses host memory through PCIe, slow
Experiment design: CPU initializes 2kx2k matrices (A, B), GPU performs C=A*B
Matrix multiplication performance comparison between copy and zero-copy (execution time in ms, kernel plus data transfer)
Platform | Copy | Zero Copy
Kaveri | 45 | 41
HD7970 | 19 | 67
Mainstream GPU | 23 | 199
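The 10%, 3.5x, and 8.7x figures quoted on this slide can be recomputed from the execution times in the chart. The sketch assumes the six plotted values pair up as (copy, zero-copy) per platform in the order shown on the axis; `zero_copy_ratio` is a hypothetical helper.

```python
# Total execution time in ms: (copy, zero-copy)
TIMES_MS = {
    "Kaveri": (45, 41),
    "HD7970": (19, 67),
    "Mainstream GPU": (23, 199),
}


def zero_copy_ratio(copy_ms, zero_copy_ms):
    """Values above 1 mean zero-copy is slower than an explicit copy."""
    return zero_copy_ms / copy_ms


for name, (copy_ms, zero_ms) in TIMES_MS.items():
    print(f"{name:15s} zero-copy / copy = {zero_copy_ratio(copy_ms, zero_ms):.2f}")
```

On the APU the ratio falls below 1 (zero-copy is about 10% faster), while on both discrete GPUs it is well above 1 because every access crosses PCIe.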

CONCLUSIONS-APU SERVER ADVANTAGES
BASED ON AUTOENCODER RESULTS
Cluster of CPU + GPU
 2.4x speedup
 6x more power
AMD APU Server
 2.4 APUs can achieve similar performance with ~2.5x less power
 2.5x higher performance given the same power budget
TCO (Total cost of ownership)
 APU server achieves the same performance at ~1.8x lower cost
Architectural Advantages
 APU servers remove the GPU’s device memory limitation and data transfer bottleneck, which makes them a better fit for Big Data inputs
NEXT PLAN-AMD SOLUTIONS
 H/W solutions: Parallel implementation on systems and system level evaluation
‒ CPU + GPUs cluster
‒ APU server
 S/W solutions: OpenCL Implementation of DNN specific kernels
‒ OpenCL implementations and optimizations, applicable to general heterogeneous platforms
 Set up real world application scenarios with external company’s involvement and apply AMD solutions
to industry

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

BACK UP SLIDES

| HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-4628
SYSTEM OVERVIEW
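[Figure: APU system overview. A CPU cluster and a GPU cluster connect through a directory to DRAM; a direct-access bus (used for graphics) links the GPU cluster to memory; invalidation traffic flows through the directory because GPU compute accesses must stay coherent; arrow thickness indicates bandwidth]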

SYSTEM OVERVIEW
[Figure: GPU cluster. Many compute units (CUs), each with its own L1 cache, share a GPU L2 cache; bandwidth is very high and the L2 has a high miss rate. Each CU contains instruction fetch/decode, a register file, a 4x4 array of execution units (Ex), local scratchpad memory, and a coalescer that feeds the L1]

SEAMICRO
