CONDOR: An automated framework to accelerate convolutional neural networks on FPGA

CONDOR
AN AUTOMATED FRAMEWORK TO ACCELERATE
CONVOLUTIONAL NEURAL NETWORKS ON FPGA
Soda-430-438 Woz Lounge 
Berkeley, CA 
May 23rd, 2018
Niccolò Raspa, Marco Bacis,  
Giuseppe Natale, Marco D. Santambrogio

Convolutional Neural Networks

Deep Convolutional Neural Networks

CNN on silicon
GPU, ASIC, FPGA

LeNet - Training
PROTOTXT 
CAFFEMODEL

LeNet - Deployment
PROTOTXT 
CAFFEMODEL
?

Manual Design
• Extract the parameters and the weights
• Write the code
• Synthesis
• Evaluate Design
• Package IP
• Iterate

Automatic Design
CONDOR
PROTOTXT 
CAFFEMODEL

Framework Architecture
FRONTEND: Parse structure of the CNN
CORE LOGIC: Creation of HW Accelerator
BACKEND: Deployment

Create DAG computation
PROTOTXT + CAFFEMODEL -> DAG of layers:
Input Data -> Convolution -> Pooling -> Convolution -> Pooling -> Fully Connected -> Fully Connected

Input Data
• Input dimension: (28, 28, 1)

Convolution
• Input dimension: (28, 28, 1)
• Output dimension: (24, 24, 20)
• Kernel: 5
• Padding: 0
• Stride: 1

Pooling
• Input dimension: (24, 24, 20)
• Output dimension: (12, 12, 20)
• Kernel: 2
• Padding: 0
• Stride: 2
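
The dimensions annotated on each node follow from the kernel, padding and stride. A minimal sketch of that arithmetic, assuming floor rounding; the helper names `out_size`, `conv_shape` and `pool_shape` are illustrative and not part of CONDOR:

```python
def out_size(size, kernel, padding, stride):
    """Spatial output size of a sliding window (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_shape(in_shape, num_filters, kernel, padding=0, stride=1):
    """(height, width, channels) after a convolution with num_filters filters."""
    h, w, _ = in_shape
    return (out_size(h, kernel, padding, stride),
            out_size(w, kernel, padding, stride),
            num_filters)


def pool_shape(in_shape, kernel, padding=0, stride=1):
    """(height, width, channels) after pooling; the channel count is unchanged."""
    h, w, c = in_shape
    return (out_size(h, kernel, padding, stride),
            out_size(w, kernel, padding, stride),
            c)


conv1 = conv_shape((28, 28, 1), num_filters=20, kernel=5, padding=0, stride=1)
pool1 = pool_shape(conv1, kernel=2, padding=0, stride=2)
print(conv1, pool1)
```

For these two layers the result matches the shapes on the slide. Caffe rounds pooling output sizes up rather than down, which makes no difference here because the sizes divide evenly.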

Map computation in hardware
[Figure: FPGA area split among Convolution, Pooling and Fully Connected layers]

Integration with SDAccel
CONDOR

What if I don’t have an FPGA?
CONDOR

Features
• Cloud Integration via Amazon F1 Instances
• Automatic creation of a hardware accelerator for FPGA
• Tune the tradeoff between performance and power consumption
• Support for the main deep learning libraries

Roadmap
Automated Framework
Methodology for Acceleration of CNN
Integration with Caffe
Cloud Integration
M. Bacis, G. Natale, E. Del Sozzo, and M. D. Santambrogio.
“A Pipelined and Scalable Dataflow Implementation of Convolutional Neural Networks on FPGA”
In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Giuseppe Natale, Marco Bacis and Marco Domenico Santambrogio.
“On how to design dataflow FPGA-based accelerators for Convolutional Neural Networks”
In: 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)
*
2017
Support the new standard ONNX
2018
Open Source Release
2019?
Extend Methodology
XOHW Competition
*

CONDOR
ARCHITECTURAL CHOICES FOR THE FPGA-BASED
ACCELERATION OF CNNs

Framework
FRONTEND: Parse structure of the CNN
CORE LOGIC: Creation of HW Accelerator
BACKEND: Deployment

FPGAs and CNNs
Dataflow Computation
Data reusability
Distributed Architecture

Our first approach
[1] Giuseppe Natale, Marco Bacis, Marco D. Santambrogio
“On how to design dataflow FPGA-based accelerators for Convolutional Neural Networks”, ISVLSI 2017
[Figure: DMA-fed dataflow pipeline of CONV, POOL, CONV and LINEAR stages, with w/b inputs]

Bigger networks, bigger FPGAs… or not?
• Weights don’t fit in the on-chip BRAMs

• Unrolling leads to an explosion in DSP (multiplier) usage

Methodology improvements
• No complete unrolling - partial accumulations

• Generic set of “one size fits all” blocks

• Semi-dataflow architecture

• More complex data movement

Customizable data flow
[Figure: a Datamover/Control block with in, w/b and out ports, driving Conv, Pooling and ReLU blocks]

Customizable data flow
[Figure: a sequence of Conv, ReLU and Pool layers mapped onto four instances of the Datamover/Control block, each with Conv, Pooling and ReLU units and in, w/b and out ports]

Dataflow Blocks
[Figure: MAC unit combining weights and input into a result]
• Convolution, Pooling, ReLU etc…

• Non-uniform memory partitioning

• Streaming pattern

• Optimal full buffering

• Concurrent accesses
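
One common way to combine a streaming pattern with full buffering is a sliding-window line buffer: for a K x K window over rows of width W, keeping the last (K - 1) * W + K pixels is enough to emit one window per incoming pixel. The sketch below is a software model under that assumption (stride 1, no padding, single channel). `stream_windows` is a hypothetical name, and this is not presented as CONDOR's actual block design.

```python
from collections import deque


def stream_windows(pixels, width, k):
    """Yield k x k windows from a row-major pixel stream (stride 1, no padding)."""
    depth = (k - 1) * width + k      # full-buffering depth
    buf = deque(maxlen=depth)        # oldest pixel is dropped automatically
    for idx, px in enumerate(pixels):
        buf.append(px)
        row, col = divmod(idx, width)
        if row >= k - 1 and col >= k - 1:
            # buf[0] is now the top-left pixel of the current window
            yield [[buf[r * width + c] for c in range(k)] for r in range(k)]


image = list(range(16))              # 4 x 4 image streamed row by row
windows = list(stream_windows(image, width=4, k=3))
print(len(windows), windows[0])
```

In hardware the k * k taps would be read concurrently from a partitioned buffer; the nested list comprehension only models which buffer positions are tapped.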

Partial accumulations approach
• Custom level of parallelism

• Compute subset of both input/output feature maps

• Accumulation done with a FIFO and/or from DDR
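
The idea can be sketched in software by tiling a layer over subsets of input and output channels and accumulating partial sums. To stay short, the example assumes each feature map is a single value (a 1 x 1 layer) and uses integers. `in_par` and `out_par` are hypothetical names for the chosen level of parallelism, and the `acc` list stands in for the FIFO or DDR accumulation buffer.

```python
def direct(weights, x):
    """Reference: every output computed in one pass over all inputs."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def tiled(weights, x, in_par, out_par):
    """Same result, computed out_par outputs and in_par inputs at a time."""
    n_out, n_in = len(weights), len(x)
    acc = [0] * n_out                          # partial sums kept between passes
    for o0 in range(0, n_out, out_par):        # subset of output feature maps
        for i0 in range(0, n_in, in_par):      # subset of input feature maps
            for o in range(o0, min(o0 + out_par, n_out)):
                partial = sum(weights[o][i] * x[i]
                              for i in range(i0, min(i0 + in_par, n_in)))
                acc[o] += partial              # accumulate with earlier passes
    return acc


weights = [[1, 2, 3, 4, 5],
           [6, 7, 8, 9, 10],
           [-1, 0, 1, 0, -1]]
x = [1, -2, 3, -4, 5]
print(direct(weights, x), tiled(weights, x, in_par=2, out_par=2))
```

Only `in_par * out_par` multipliers are needed per pass, which is what keeps DSP usage bounded compared with full unrolling.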

Memory control and data buffering
• Memory-mapped to streaming and vice versa

• Exploit the maximum transaction size and bursts
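
A small software model of the two conversions, assuming word-addressed memory represented as a Python list. The names `burst_plan`, `mm2s`, `s2mm` and `max_burst` are illustrative and do not come from CONDOR or any vendor API.

```python
def burst_plan(start, length, max_burst):
    """Split a contiguous access of `length` words into (address, size) bursts."""
    plan = []
    addr, end = start, start + length
    while addr < end:
        size = min(max_burst, end - addr)   # use the largest allowed burst
        plan.append((addr, size))
        addr += size
    return plan


def mm2s(memory, start, length, max_burst):
    """Memory-mapped to stream: fetch in bursts, emit one word at a time."""
    for addr, size in burst_plan(start, length, max_burst):
        for word in memory[addr:addr + size]:
            yield word


def s2mm(stream, memory, start, max_burst):
    """Stream to memory-mapped: collect words and write them back in bursts."""
    pending, addr = [], start
    for word in stream:
        pending.append(word)
        if len(pending) == max_burst:
            memory[addr:addr + max_burst] = pending
            addr += max_burst
            pending = []
    if pending:                              # final, shorter burst
        memory[addr:addr + len(pending)] = pending


ddr = list(range(100, 110))
out = [0] * 10
s2mm(mm2s(ddr, 0, 10, max_burst=4), out, 0, max_burst=4)
print(burst_plan(0, 10, 4), out == ddr)
```

Ten words with a maximum burst of four take three transactions instead of ten single-word accesses.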

Weights Double Buffering
• Masks weight-loading latency

• Avoids flushing the MAC pipeline on each iteration
[Figure: ping and pong weight buffers]
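
A rough cycle-count model of why ping-pong buffering masks the loading latency. It assumes that loading the next weight tile and computing on the current one overlap perfectly; the cycle counts are made up for illustration and are not measurements from the talk.

```python
def cycles_single_buffer(load, compute):
    """Each iteration waits for its weights, then computes."""
    return sum(l + c for l, c in zip(load, compute))


def cycles_double_buffer(load, compute):
    """Load tile i+1 into the idle buffer while the MACs consume tile i."""
    total = load[0]                       # only the first load is fully exposed
    for i, c in enumerate(compute):
        next_load = load[i + 1] if i + 1 < len(load) else 0
        total += max(c, next_load)        # the slower of the two sets the pace
    return total


load = [10, 10, 10, 10]                   # cycles to fetch each weight tile
compute = [30, 30, 30, 30]                # cycles to process each tile
print(cycles_single_buffer(load, compute))   # 160
print(cycles_double_buffer(load, compute))   # 130
```

When loading is faster than computing, as in this example, every load except the first is hidden behind computation.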

Input Caching
• Reduces memory accesses

• Stores entire input for a layer

• Used for small layers (avoids many small transactions)
[Figure: Datamover/Control block with an Input Cache on its in port]

Architecture Evaluation
Setup: Alphadata Virtex-7
• 5 MB BRAM
• 2880 DSPs (27x15 bits mult)
• 1 DDR port (512 bits wide)
• 115.2 GFLOPs max (100 MHz)

Results: VGG16 Network
• 30.6 GOPs, 56 MB parameters
• 4 input, 4 output ports
• 27.2 GFLOPs estimated
• 14.4 GFLOPs reached
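
To put the figures in proportion, the ratios can be worked out directly. This sketch assumes that the 30.6 GOPs figure is the operation count of the evaluated VGG16 workload for one inference and that one operation counts as one FLOP; neither is stated on the slide.

```python
peak_gflops = 115.2        # board maximum at 100 MHz
estimated_gflops = 27.2    # estimated for the 4-input, 4-output configuration
reached_gflops = 14.4      # measured
vgg16_gops = 30.6          # operations in the evaluated workload

vs_estimate = reached_gflops / estimated_gflops
vs_peak = reached_gflops / peak_gflops
seconds_per_inference = vgg16_gops / reached_gflops

print(f"{vs_estimate:.1%} of the estimate")
print(f"{vs_peak:.1%} of the board peak")
print(f"{seconds_per_inference:.3f} s per inference")
```

Under those assumptions the design reaches a little over half of its own estimate and one eighth of the board peak.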

Lessons Learned
• Floating point is dead, long live the fixed!

• Off-chip memory vs On-chip memory

• Old hardware vs New Hardware

Next Steps
• Possibility to use URAMs as on-chip storage (33.75 MB)

• Higher number of DSPs (~2.3X)

• Efficient multiplication (8-bit fixed point -> 2 mul/DSP)

• Higher memory BW (4 DDR ports)
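
The "2 mul/DSP" point relies on a well-known packing trick: two 8-bit operands that share a common factor can be placed in separate bit fields of one wide operand, so that a single multiplication yields both products. The sketch below shows the arithmetic for unsigned values only. Signed operands need an extra correction step that is omitted here, and whether the 24-bit packed operand fits a multiplier port depends on the device. `packed_mul` is a hypothetical name.

```python
SHIFT = 16                   # an 8-bit x 8-bit unsigned product fits in 16 bits
MASK = (1 << SHIFT) - 1


def packed_mul(a, b, c):
    """Return (a * c, b * c) using one wide multiplication (unsigned 8-bit)."""
    for v in (a, b, c):
        if not 0 <= v < 256:
            raise ValueError("operands must be unsigned 8-bit values")
    wide = ((a << SHIFT) | b) * c        # the single multiplication
    # b * c <= 255 * 255 < 2**16, so the low field never carries into the high one
    return wide >> SHIFT, wide & MASK


print(packed_mul(200, 17, 255))
```

Two activations multiplied by the same weight, or two weights applied to the same activation, are the cases where the shared factor `c` arises naturally.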

Next Steps
[Figure: planned architecture with MAC/Window, FSM, Acc/ReLU and Pooling blocks plus Weights and I/O Buffer; port labels: 512 in, 1024 out, 2-64 out]

“A Framework with Cloud Integration for  
CNN Acceleration on FPGA Devices”
Marco Bacis
marco.bacis@mail.polimi.it
Niccolò Raspa
niccolo.raspa@mail.polimi.it
Giuseppe Natale
giuseppe.natale@polimi.it
Marco D. Santambrogio
marco.santambrogio@polimi.it
twitter.com/CondorAtNECST
facebook.com/CondorAtNECST
