A Dataflow Processing Chip for Training Deep Neural Networks

A Dataflow Processing Chip for Training Deep Neural Networks
Dr. Chris Nicol
Chief Technology Officer
Wave Computing Copyright 2017.

Founded in 2010
• Tallwood Venture Capital
• Southern Cross Venture Partners
Headquartered in Campbell, CA
• World class team of 53 dataflow, data science, and systems experts
• 60+ patents
Invented Dataflow Processing Unit (DPU) architecture to
accelerate deep learning training by up to 1000x
• Coarse Grain Reconfigurable Array (CGRA) Architecture
• Static scheduling of data flow graphs onto massive array of processors
Now accepting qualified customers for Early Access Program
Wave Computing. Copyright 2017.
Wave Computing Profile

Extended training time due to increasing size of datasets
• Weeks to tune and train typical deep learning models
Hardware for accelerating ML was created for other applications
• GPUs for graphics, FPGA’s for RTL emulation
Data coming in “from the edge” is growing faster than
the datacenter can accommodate/use it…
Design
•Neural network
architecture
•Cost functions
Tune
•Parameter
initialization
•Learning rate
•Mini-batch size
Train
•Accuracy
•Convergence
Rate
Deploy
•For testing
Deploy
•For production
➢ Problem: Model
development times can
take days or weeks
Challenges of Machine Learning

Source: Google; http://download.tensorflow.org/paper/whitepaper2015.pdf
• Co-processors must
wait on the CPU for
instructions
• This limits
performance and
reduces efficiency
and scalability
• Restricts embedded
use cases to
inferencing-only
GPU waiting on CPU
Figure 13: EEG visualization of Inception training showing CPU and GPU activity.
Problems with Existing Solutions

Times
Times
I/O
Softmax
Plus
Plus
Mem I/OSigmoid
Programmed on
Deep Learning
Software
Run on Wave
Dataflow
Processor
Times
Times
Plus
Plus
Softmax
Sigmoid
Deep Learning
Networks are
Dataflow
Graphs
Wave Dataflow Processor
WaveFlow Agent Library
Wave Dataflow Processor is Ideal for Deep Learning

DDR4 DDR4
HMCHMC
HMCHMC
PCIe Gen3 x16 MCU
AXI4
AXI4
AXI4
AXI4
Secure DPU
Program
Buffer
Secure DPU
Program
Loader
16ff CMOS Process Node 16K Processors,
8192 DPU Arithmetic Units
Self-Timed,
MPP Synchronization
181 Peak Tera-Ops, 7.25 Tera
Bytes/sec Bisection Bandwidth
16 MB Distributed
Data Memory
8 MB Distributed
Instruction Memory
1.71 TB/s I/O Bandwidth
4096 Programmable FIFOs
270 GB/s Peak
Memory Bandwidth
2048 outstanding
memory requests
4 Billion 16-Byte Random
Access Transfers / sec
4 Hybrid Memory
Cube Interfaces
2 DDR4 Interfaces
PCIe Gen3 16-Lane
Host interface
32-b Andes N9 MCU 1 MB Program
Store for Paging
Hardware Engine for Fast
Loading of AES Encrypted
Programs
Up to 32 Programmable
dynamic reconfiguration zones
Variable Fabric Dimensions
(User Programmable at Boot)
Wave Dataflow Processing Unit
Chip Characteristics & Design Features
• Clock-less CGRA is robust to Process, Voltage & Temperature.
• Distributed memory architecture for parallel processing
• Optimized for data flow graph execution
• DMA-driven architecture – overlapping I/O and computation

Key DPU Board Features
• 65,536 CGRA Processing Elements
• 4 Wave DPU chips per board
• Modular, flexible design
• Multiple DPU boards per Wave
Compute Appliance
• Off-the-shelf components
• 32GB of ultra high-speed DRAM
• 512GB of DDR4 DRAM
• FPGA for high-speed
board-to-board communication
Wave Current Generation DPU Board

• Best-in-class, highly scalable deep learning training and inference
• More than orders of magnitude better compute-power efficiency
• Plug-and-play node in a datacenter network -- Big Data – Hadoop, Yarn, Spark, Kafka
• Native support of Google TensorFlow (initially)
Wave’s Solution: Dataflow Computer for Deep Learning

Pipelined 1KB Single Port Data RAM /w BIST & ECC
Pipelined 256-entry Instruction RAM /w ECC
Quad of PEs are fully
connected
PE c
PE a PE b
PE d
Dataflow Processing Element (PE)

• 16 Processor CLUSTER: a full custom tiled GDSII block
• Fully-Connected PE Quads with fan-out
• 8 DPU Arithmetic Units
– Per-cycle grouping into 8, 16, 24, 32, 64-b Operations
– Pipelined MAC Units with (un)Signed Saturation
– Support for floating point emulation
– Barrel Shifter, Bit Processor
– SIMD and MIMD instruction classes
– Data driven
• 16KB Data RAM
• 16 Instruction RAMs
• Full custom semi-static digital circuits
• Robust PVT insensitive operation
– Scalable to low voltages
– No global signals, no global clocks
Cluster of 16 Dataflow PEs

Each cluster has a pipelined instruction-driven word-level switch
Each cluster has a 4 independent pipelined
instruction driven byte-switches
Word switch supports fan-out and fan-in
fan-in
All switches have
registers for Router use
to avoid congestion
“valid” and “invalid” data in the switch enables fan-in
fan-out
Hybrid CGRA Architecture

From Asleep to Active
• Word switch fabric remains active
• If valid data arrives at switch input AND switch executes
instruction to send data to one of Quads THEN wake up PEs
• Copy PC from word switch to PE and byte switch iRAMs
• Send the incoming data to the PEs
From Active to Asleep
• A PE executes a “sleep” instruction
• All PE & byte switch execution is suspended
• PE can opt for fast wakeup or slow wakeup
(deep sleep with lower power)
Data-Driven Power Management

Compute Machine with AXI4 Interfaces

24 Compute Machines
Wave DPU Hierarchy

Wave DPU Memory Hierarchy

• Clock skew and jitter limit cycle time with traditional clock distribution
• Self-timed “done” signal from PEs if they are awake. Programmable
tuning of margin.
• Synchronized with neighboring Clusters to minimize skew
• 1-sigma local mismatch ~1.3ps and global + local mismatch ~6ps at
140ps cycle time
Clock Distribution and Generation
Network across entire Fabric
6-10 GHz Auto-Calibrated Clock Distribution

-4
-5
-3
-4
-6
-7
-5
-6
-2
-3
-1
-2
-4
-5
-3
-4
Up-counter in each cluster
initialized to -(1+Manhattan
distance from end cluster)
End cluster
Start cluster
When counter reaches 0, either:
- Reset the processors
- Suspend processors for configuration (at PC=0)
- Enable processors to execute (from PC=0)
-4
-5
-3
-4
-6
-7
-5
-6
-2
-3
-1
-2
-4
-5
-3
-4
Pre-program 4
clusters to ENTER
config mode.
Counters
operating
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
DMA new kernel
instructions into
cluster I-mems -4
-5
-3
-4
-6
-7
-5
-6
-2
-3
-1
-2
-4
-5
-3
-4
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Old
kernel
Enter config mode Exit config modePropagate SignalPropagate Signal
Step 1 Step 2 Step 3 Step 4
Propagate control signal
from start cluster to end
cluster. Advances 1
cluster per cycle.
Pre-program 4
clusters to EXIT
config mode.
All clusters running in-synch
New
kernel
SW controls this process to manage surge current
Propagate signal
starts the up-counter
in each cluster.
Counters
operating
Counters
operating
New
kernel
Reset, Configuration Modes
Stop
(config mode)
(running)
(running)
(running)
(running)
(running)
(running) (running)
(running)
(running)

1
1
1
1
2
2
2
2
-
-
-
-
2
2
2
2
Kernel1 and kernel2 mounted.
3
3
3
3
I/O
Mount(kernel3)
(note: I/Os at bottom)
New Kernel
goes here!
1
1
1
1
2
2
2
2
3
3
3
3
2
2
2
2
After mount(kernel3)
I/O I/O
I/O
Just-in-time route
through Kernel 2
The I/Os cannot
go here!
• Runtime resource manager in lightweight host.
• Mount(). Online placement algorithm with maxrects management of empty clusters.
• Uses “porosity map” for each kernel showing route-thru opportunities. (SDK provides this)
• Just-in-time Place & Route (using A*) of I/Os through other kernels without functional side-effects.
• Unmount(). Removes paths through other kernels.
• Machines are combined for mounting large kernels. Partitioned during unmount().
• Periodic garbage collection used for cleanup.
• Average mount time < 1ms
Runtime resource manager performing mount()
Dynamic Reconfiguration

• WFG Compiler
• WFG Linker
• WFG Simulator
• DF agent partitioning
• DFG throughput
optimization
• Runs on Session Host
WaveFlow Execution
Engine
• Resource Manager
• Monitors
• Drivers
• Runs on a Wave Deep
Learning Computer
WaveFlow Session
Manager
WaveFlow
Agent Library
• BLAS 1,2,3
• CONV2D
• SoftMax, etc.
WaveFlow SDK
On line
Off line Encrypted Agent Code
WaveFlow Software Stack

WaveFlow agents are pre compiled off-line using WaveFlow SDK
• Wave provides a complete agent library for TensorFlow
• Customer can create additional agents for differentiation
Customer supplied
agent source code
Wave supplied
agent source code
WaveFlow
Agent Library
Wave SDK
• WFG Compiler
• WFG Linker
• WFG Simulator
• WFG Debugger
• MATMUL
• Batchnorm
• Relu, etc.
Your new
DNN training
technique
Encrypted Agent Code
WaveFlow Agent Library

SATSolver WFG Compiler
LLVM Frontend
WFG Linker
AssemblerArchitectural Simulator WaveFlow
Agents
WFG
Simulator
ML function (gemm, sigmoid, …)
To appear in ICCAD 2017
WFG = Wave Flow Graph
WaveFlow SDK

Kernels are islands of machine code scheduled onto machine cycles
Example: Sum of Products on 16 PEs in a single cluster
WFG of Sum of ProductsSum of Products Kernel
PE 0 to 15
Time
mov mov mov mov mov mov mov mov mov mov mov mov mov mov mov mov
mac mac mac mac mac mac mac mac mac mac mac mac mac mac mac mac
mov mov movcr mov mov mov mov mov mov mov movcr movi mov mov mov
movcr mov mov mov movi mov mov mov movcr mov mov mov add8 mov
mov mov mov mov mov mov add8 mov mov mov incc8 mov movcr
mov mov mov movcr mov mov mov mov mov mov mov mov mov mov mov
mov mov mov mov mov mov mov mov mov mov mov mov movcr mov mov
mov mov mov memr movcr mov memr mov mov mov mov mov mov
mov movcr mov mov mov mov movcr mov mov mov movcr mov mov mov
mov mov mac mac mov mov mov mov mov mov mac mac mov mov mov mov
mac mac mov mov mac mac mac mac mac mac mov mov mac mac mac mac
mov mov mov memr mov mov mov mov mov mov mov mov mov movcr
mov mov mov movcr mov mov memr mov mov mov mov mov mov
incc8 mov mov mov mov incc8 mov mov mov mov mov mov mov mov
mov mov mov mov mov mov incc8 mov mov mov movcr movcr memw
mov mov mov mov mov mov mov movcr mov mov mov mov mov
mov mov memr mov mov mov memr incc8 mov mov mov mov st mov mov
mov mov mov mov mov mov mov mov mov mov mov mov mov mov
memr mov mov mov memr mov mov mov mov mov mov
mov mov mov mov mov mov mov mov mov mov mov cmuxi mov mov
mov mov mov memr mov mov mov memr mov mov mov mov mov mov mov
mov mov mov mov mov mov mov mov mov mov mov mov mov mov mov
mov mov mov mov mov mov mov mov mov mov mov mov mov
mov memr mov mov memr mov mov mov mov mov mov mov
mov mov mov mov mov mov mov mov mov mov mov mov
mov mov mov mov mov mov mov mov mov mov mov
mov mov mov mov mov mov mov memr mov mov mov mov mov mov
mov mov memr mov mov mov mov mov mov mov mov mov
mov mov mov mov mov mov mov mov mov mov mov mov mov mov mov
mov mov mov mov mov mov mov
mov mov mov mov mov mov mov mov mov
mov mov mov mov mov
mov mov mov
Wave SDK: Compiler Produces Kernels

Session Manager Partitions & Maps to DPUs & Memory
Inference Graph generated directly from Keras model by Wave Compiler
Wave Flow Graph Format
Mapping Inception V4 to DPUs
Single Node
64-DPU Computer

Benchmarks on a single node 64-DPU Data Flow Computer
• ImageNet training, 90 epochs, 1.28M images, 224x224x3
• Seq2Seq training using parameters from https://papers.nips.cc/paper/5346-sequence-to-sequence-
learning-with-neural-networks.pdf by I. Sutskever, O. Vinyals & Q. Le
Network Inferencing
(Images/sec)
Training time
AlexNet 962,000 40 mins
GoogleNet 420,000 1 hour 45 mins
Squeezenet 75,000 3 hours
Seq2Seq - 7 hours 15 min
Deep Neural Network Performance

Wave is now accepting qualified customers to its Early Access Program (EAP)
Provides select companies access to a Wave machine learning computer for testing
and benchmarking months before official system sales begin
For details about participation in the limited number of EAP positions,
contact info@wavecomp.com
Wave’s Early Access Program

A Dataflow Processing Chip for Training Deep Neural Networks

More Related Content

What's hot (20)

Similar to A Dataflow Processing Chip for Training Deep Neural Networks (20)

More from inside-BigData.com (20)

Recently uploaded (20)

A Dataflow Processing Chip for Training Deep Neural Networks