CNN Dataflow Implementation on FPGAs
Marco Bacis, Giuseppe Natale, Emanuele Del Sozzo, Marco Domenico Santambrogio
Politecnico di Milano
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB)
marco.bacis@mail.polimi.it
Oracle HQ
Wednesday, 7th June 2017
Introduction
Issues
● Huge set of weights and data
● Memory-bound computation
Challenges
● Need for a design that scales in terms of memory and resources without losing performance
Our Solution
Methodology for CNN acceleration on FPGA with:
● Exploitation of the dataflow pattern of CNN operations
● Independent modules with a parametric level of parallelism
● Streaming + dataflow computational paradigm with efficient memory access
Iterative Stencil Loops
Spatial dependencies
Memory bound
Enable efficient solutions in terms of performance and power
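To make the pattern concrete: an iterative stencil loop recomputes every grid point from a fixed neighbourhood of the previous timestep, and repeats this for several timesteps. The sketch below is a minimal illustration only. The 5-point averaging stencil and the function names are assumptions chosen for the example and do not come from the slides.

```python
def stencil_step(grid):
    """Apply one 5-point averaging stencil step; border cells are copied unchanged."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Spatial dependency: each output needs its four neighbours.
            out[y][x] = (grid[y][x] + grid[y - 1][x] + grid[y + 1][x]
                         + grid[y][x - 1] + grid[y][x + 1]) / 5.0
    return out


def iterative_stencil_loop(grid, timesteps):
    """Repeat the stencil step; each timestep consumes the previous one's output."""
    for _ in range(timesteps):
        grid = stencil_step(grid)
    return grid
```

Each output point reads five inputs for only a handful of arithmetic operations, which is the low compute-to-memory ratio behind the "memory bound" label.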
Streaming Stencil Timestep (SST)
● Independent modules communicating over FIFOs
● Concurrent memory access and optimal full buffering
● Scalable without increasing external memory use
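The slides do not show how the buffering is organised internally. One common way to obtain full buffering for a streamed stencil is a sliding buffer that holds just enough pixels to form each window, so every input is read from external memory once. The sketch below assumes a raster-order pixel stream and a square k x k window; `stream_windows` and `convolve_stream` are hypothetical names used for illustration.

```python
from collections import deque


def stream_windows(pixels, width, k):
    """Yield k x k windows from a raster-order pixel stream.

    Only (k - 1) * width + k pixels are buffered at any time.
    """
    depth = (k - 1) * width + k
    buf = deque(maxlen=depth)  # oldest pixel is dropped automatically
    for i, p in enumerate(pixels):
        buf.append(p)
        row, col = divmod(i, width)
        if row >= k - 1 and col >= k - 1:
            # buf[depth - 1] is the newest pixel, i.e. the window's bottom-right.
            yield [
                [buf[depth - 1 - ((k - 1 - dy) * width + (k - 1 - dx))]
                 for dx in range(k)]
                for dy in range(k)
            ]


def convolve_stream(pixels, width, kernel):
    """Consume a pixel stream and emit one multiply-accumulate result per window."""
    k = len(kernel)
    for win in stream_windows(pixels, width, k):
        yield sum(win[dy][dx] * kernel[dy][dx]
                  for dy in range(k) for dx in range(k))
```

Because both functions are generators, a consumer can pull results while the producer is still streaming, which loosely mirrors modules connected by FIFOs.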
Proposed Approach
Implementation
1. Convolution Module Structure
2. Fully Connected Module Structure
3. Network Design
Convolution Module Structure
SST
Convolution Module - Parameters
• Input/Output Height
• Input/Output Width
• Number of Input Feature Maps
• Number of Output Feature Maps
• Kernel Height
• Kernel Width
• Number of Input Ports (# input FMs received per cycle)
• Number of Output Ports (# output FMs sent per cycle)
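The parameter list above can be gathered into a single configuration record. The sketch below is an assumption-laden illustration: the class name, the field names, and the derived properties are hypothetical. It assumes stride 1 with no padding, which matches the 32 x 32 to 28 x 28 step on the network slides, and it assumes the feature maps are split evenly across ports.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class ConvModuleParams:
    in_height: int
    in_width: int
    in_fms: int      # number of input feature maps
    out_fms: int     # number of output feature maps
    kernel_h: int
    kernel_w: int
    in_ports: int    # input FMs received per cycle
    out_ports: int   # output FMs sent per cycle

    @property
    def out_height(self):
        # Assumption: stride 1, no padding.
        return self.in_height - self.kernel_h + 1

    @property
    def out_width(self):
        return self.in_width - self.kernel_w + 1

    @property
    def input_beats(self):
        # Transfers needed to deliver all input FMs for one pixel position.
        return ceil(self.in_fms / self.in_ports)

    @property
    def output_beats(self):
        # Transfers needed to emit all output FMs for one pixel position.
        return ceil(self.out_fms / self.out_ports)
```

For example, `ConvModuleParams(32, 32, 3, 12, 5, 5, in_ports=3, out_ports=4)` describes a first CIFAR-10 convolution with illustrative port counts: all three input maps arrive together, and the twelve outputs leave in three groups of four.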
Fully Connected Module Structure
● Treated as a 1x1 convolution
● “Compressed” streaming approach
● 1 input port, 1 output port
● Low-latency floating-point accumulation
● Issue for pipelining
● Multiple accumulators + loop unrolling
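The pipelining issue arises because a floating-point add takes several cycles, so a single running sum forces each addition to wait for the previous one. One common remedy, consistent with the "multiple accumulators" bullet but not necessarily the authors' exact scheme, is to interleave several independent partial sums and reduce them at the end. The function names and the default of four accumulators are assumptions for illustration.

```python
def dot_interleaved(weights, inputs, n_acc=4):
    """Dot product using n_acc independent partial sums.

    Consecutive products go to different accumulators, so no addition
    depends on the one issued immediately before it.
    """
    acc = [0.0] * n_acc
    for i, (w, x) in enumerate(zip(weights, inputs)):
        acc[i % n_acc] += w * x
    # Final reduction of the partial sums.
    total = 0.0
    for a in acc:
        total += a
    return total


def fully_connected(weight_rows, inputs, n_acc=4):
    """One output per weight row, each computed with interleaved accumulation."""
    return [dot_interleaved(row, inputs, n_acc) for row in weight_rows]
```

In hardware the number of accumulators would typically be chosen to cover the adder latency. Reordering floating-point additions this way can change the rounding of the result slightly compared with a strictly sequential sum.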
Network Design
● Convolutional Module
  ● Memory structure based on I/O ports
  ● Single vs multi-channel memory cores
● Pooling Module
  ● Independent from channel
  ● One module for each previous output port
● Fully-Connected Module -> single pipelined core
Experimental Evaluation
● Two evaluation designs
  ● CIFAR-10 network: Conv -> Pool -> Conv -> Pool -> Lin -> Lin
  ● USPS network: Conv -> Pool -> Conv -> Lin
● Different design choices as a proof-of-concept of the methodology
● Tested on a Xilinx VC707 board
CIFAR-10 Network
Layer    Kernel   Input size   Inputs   Outputs
Conv 1   5 x 5    32 x 32      3 FMs    12 FMs
Pool 1   2 x 2    28 x 28      12 FMs   12 FMs
Conv 2   5 x 5    14 x 14      12 FMs   36 FMs
Pool 2   2 x 2    10 x 10      36 FMs   36 FMs
Lin 1    -        -            900      36
Lin 2    -        -            36       10
USPS Network
Layer    Kernel   Input size   Inputs   Outputs
Conv 1   5 x 5    16 x 16      1 FM     6 FMs
Pool 1   2 x 2    12 x 12      6 FMs    6 FMs
Conv 2   5 x 5    6 x 6        6 FMs    16 FMs
Lin 1    -        -            64       10
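The layer dimensions listed for the two networks can be cross-checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions without padding and non-overlapping pooling windows, which is what the listed sizes imply (32 to 28 after a 5 x 5 kernel, 28 to 14 after a 2 x 2 pool). The function name and the layer encoding are hypothetical.

```python
def flattened_features(input_size, layers):
    """Return the number of values entering the first linear layer.

    layers is a list of ("conv", kernel, out_fms) or ("pool", window) tuples
    applied to a square input of side input_size.
    """
    size, fms = input_size, None
    for kind, *args in layers:
        if kind == "conv":
            kernel, fms = args
            size = size - kernel + 1   # stride 1, no padding (assumed)
        elif kind == "pool":
            (window,) = args
            size = size // window      # non-overlapping windows (assumed)
        else:
            raise ValueError(f"unknown layer kind: {kind}")
    return fms * size * size


CIFAR10 = [("conv", 5, 12), ("pool", 2), ("conv", 5, 36), ("pool", 2)]
USPS = [("conv", 5, 6), ("pool", 2), ("conv", 5, 16)]
```

With these definitions, `flattened_features(32, CIFAR10)` gives 900 (36 maps of 5 x 5) and `flattened_features(16, USPS)` gives 64 (16 maps of 2 x 2), matching the input widths of the first linear layer in each network.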
Experimental Results
Performance improvements with increased batch size
Performance and Power Efficiency Results
               Dataset    GFLOPS   GFLOPS/W   Images/s
Test Case 1    USPS       5.2      0.25       172414
Test Case 2    CIFAR-10   28.4     1.19       7809
MSR Work [1]   CIFAR-10   -        -          2318

FPGA Resource Usage
               Flip-Flops   LUTs     BRAM     DSP Slices
Test Case 1    41.10%       50.86%   3.50%    55.04%
Test Case 2    61.77%       71.24%   22.82%   74.32%

[1] K. Ovtcharov et al., “Accelerating Deep Convolutional Neural Networks Using Specialized Hardware”, Microsoft Research Whitepaper, 2015
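Two derived figures follow directly from the reported numbers: the power implied by dividing GFLOPS by GFLOPS/W, and the CIFAR-10 throughput ratio against the MSR result. The slides do not state either value, so the sketch below is only arithmetic on the table entries; the dictionary layout and function names are hypothetical.

```python
# Values copied from the performance table.
RESULTS = {
    "USPS": {"gflops": 5.2, "gflops_per_w": 0.25, "images_per_s": 172414},
    "CIFAR-10": {"gflops": 28.4, "gflops_per_w": 1.19, "images_per_s": 7809},
}
MSR_CIFAR10_IMAGES_PER_S = 2318


def implied_power_w(entry):
    """Power in watts implied by GFLOPS / (GFLOPS/W)."""
    return entry["gflops"] / entry["gflops_per_w"]


def throughput_ratio(ours, baseline):
    """How many times more images per second than the baseline."""
    return ours / baseline


usps_power = implied_power_w(RESULTS["USPS"])
cifar_power = implied_power_w(RESULTS["CIFAR-10"])
cifar_ratio = throughput_ratio(RESULTS["CIFAR-10"]["images_per_s"],
                               MSR_CIFAR10_IMAGES_PER_S)
```

This gives roughly 20.8 W for the USPS design, about 23.9 W for the CIFAR-10 design, and a throughput ratio of about 3.4 over the MSR figure. The power values inherit the rounding of the table entries, and the throughput comparison is indicative only, since the slides do not detail the baseline's platform or network.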
Conclusions
● Modular and scalable methodology to accelerate CNNs on FPGAs using a dataflow approach
● Performance improvement over large batches
● High level pipeline between layers
● Improved memory bandwidth utilization
● High scalability given limited resources
Future Work
● Multi-FPGA / split-layers approach
● Automatic DSE / CAD tool
● Different precision / data type
Questions?
Marco Bacis
marco.bacis@mail.polimi.it
References
M. Bacis, G. Natale, E. Del Sozzo, and M. D. Santambrogio, “A Pipelined and Scalable Dataflow Implementation of Convolutional Neural Networks on FPGA”, IPDPS Workshops (RAW), May 2017
M. Bacis, G. Natale, and M. D. Santambrogio, “On how to design dataflow FPGA-based accelerators for Convolutional Neural Networks”, ISVLSI Conference, July 2017 (to appear)
