© 2020 SPRL, SoE, Santa Clara University
New Methods for Implementation of 2-D
Convolution for Convolutional Neural
Network (CNN)
Tokunbo Ogunfunmi
Signal Processing Research Lab (SPRL),
Electrical & Computer Engr. (ECEN) Dept.,
School of Engineering,
Santa Clara University.
September 2020
Outline
➢ Motivation
➢ Challenges in Implementing 2-D Convolution for CNNs
➢ Method #1
➢ Method #2
➢ Future Work
➢ Summary and Conclusions
Convolutional Neural Networks
• CNNs are the most popular models for vision tasks such as image classification and segmentation.
• CNNs are computationally intensive.
• Both computation and data movement require energy.
• Data reads and writes are the major energy consumers.
• Activations, partial sums and weights make up most of the data moved.
2D Convolution Operation
• Weights are multiplied by the input feature map and accumulated.
• "Kernel" and "weights" are used synonymously.
• Filters in CNNs convolve over multiple channels.
Image source: https://cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.5.6-ConvolutionalNetworks.pdf
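The multiply-and-accumulate described above can be written out as a plain loop nest. This is a minimal reference sketch, not the presenter's code: the function name `conv2d` and the `[C][H][W]` list layout are assumptions made for illustration, and stride 1 with no padding is assumed.

```python
def conv2d(ifmap, weights):
    """Direct 2-D convolution (CNN style, no kernel flip), stride 1, no padding.

    ifmap:   [C][H][W] input feature map (nested lists)
    weights: [C][K][K] one filter spanning all C input channels
    returns: [H-K+1][W-K+1] output feature map
    """
    channels, height, width = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    k = len(weights[0])
    out = [[0] * (width - k + 1) for _ in range(height - k + 1)]
    for r in range(height - k + 1):
        for c in range(width - k + 1):
            acc = 0
            # Accumulate over every channel and every kernel position.
            for ch in range(channels):
                for kr in range(k):
                    for kc in range(k):
                        acc += ifmap[ch][r + kr][c + kc] * weights[ch][kr][kc]
            out[r][c] = acc
    return out
```

Every output pixel re-reads K x K x C inputs, which is the data movement the two methods in this talk try to cut down.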
Challenges in FPGA Implementation of DNNs
[Figure: block diagram of an FPGA accelerator. Labels: External Memory (input image and weights), on-chip buffer for input pixels and weights, Computation Engine made of processing elements (PEs), on-chip output buffer, output feature map pixel, External Memory. Loop tiling is tied to the input size and the output feature size; loop unrolling is tied to the computation engine. Numbered markers 1 to 4 map the blocks to the challenges below.]
• Challenge 1: Huge memory transfer (input and output)
• Challenge 2: Large on-chip buffers (input and output)
• Challenge 3: Large compute
• Challenge 4: Complicated scheduling and dataflow control
Method #1
FIFO Based
Convolution – Tile Based
• Convolution with a 3x3 kernel.
• At least 3 rows of pixels must be read into line buffers.
• Tile-based processing works better with 4 line buffers.
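As a rough illustration of the line-buffer idea above, the sketch below consumes the image one row at a time and produces an output row as soon as three rows are buffered. The name `line_buffer_conv3x3` is hypothetical, and only three buffers are modelled; the fourth buffer mentioned on the slide is not.

```python
from collections import deque


def line_buffer_conv3x3(rows, kernel):
    """Stream image rows; emit one output row once 3 rows are buffered."""
    line_buffers = deque(maxlen=3)  # the oldest row is dropped automatically
    out = []
    for row in rows:
        line_buffers.append(row)
        if len(line_buffers) < 3:
            continue  # still filling the first three rows
        out.append([
            sum(line_buffers[kr][c + kc] * kernel[kr][kc]
                for kr in range(3) for kc in range(3))
            for c in range(len(row) - 2)
        ])
    return out
```

Each input row stays in a buffer for three consecutive output rows, so it is fetched from memory only once.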
Proposed Dataflow
• The method is proposed for VGG16, which uses only 3x3 kernels.
• It can be extended to other kernel sizes as well.
• The proposed method aims to reduce the read and write bandwidth.
• It aims to read the input feature map only once.
[4]. A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An architecture to accelerate convolution in deep neural networks,”
IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 4, pp. 1349–1362, 2017.
The Basic Idea and an Example
[Figure: worked numeric example of the FIFO dataflow. Input pixels are multiplied by the kernel values 1, 0, -1, 2, 0, -2, 1, 0, -1; the partial sums pass through Partial sum FIFO1 and Partial sum FIFO2 and the finished values are collected in the Output FIFO.]
Processing Element (PE)
• Each PE uses 3 FIFOs to compute the convolution output.
• Partial sums are stored in 2 FIFOs.
• The 3rd FIFO is used to accumulate outputs.
• Partial sum FIFO size = width of the input image.
• This is rounded up to 256 in the case of VGG16.
• There are 2 such FIFOs.
• The output FIFO is used to combine the outputs of the processing elements.
• Output FIFO size for VGG16: 256 x 256 = 65,536 (64K) entries.
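One way to picture a PE built from two partial-sum FIFOs and an output FIFO is the software model below. It is a sketch under assumptions, not the presented hardware: the names `row_products` and `fifo_conv3x3` are hypothetical, and the rule that each incoming row starts one output row, advances a second and finishes a third is my reading of the dataflow.

```python
from collections import deque


def row_products(row, krow):
    """Valid 1-D sliding dot product of one input row with one kernel row."""
    k = len(krow)
    return [sum(row[c + j] * krow[j] for j in range(k))
            for c in range(len(row) - k + 1)]


def fifo_conv3x3(image, kernel):
    height = len(image)
    fifo1 = deque()        # partial sums holding kernel row 0 only
    fifo2 = deque()        # partial sums holding kernel rows 0 and 1
    output_fifo = deque()  # finished output pixels
    for r, row in enumerate(image):  # every input row is read exactly once
        p0, p1, p2 = (row_products(row, kernel[i]) for i in range(3))
        if fifo2:  # finishes output row r-2
            output_fifo.extend(fifo2.popleft() + p for p in p2)
        if fifo1:  # advances output row r-1
            fifo2.extend(fifo1.popleft() + p for p in p1)
        if r <= height - 3:  # starts output row r
            fifo1.extend(p0)
    out_w = len(image[0]) - 2
    flat = list(output_fifo)
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]
```

In this model each partial-sum FIFO never holds more than one row of values, which is consistent with sizing it to the image width.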
Processing Element (PE) Architecture
Parallel Implementation
• Example of how a 64-channel input feature (IF) map is processed in groups of 4 channels.
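The grouping can be sketched as an outer loop over channel groups whose results accumulate into one output map. The function name `conv_by_channel_groups` is hypothetical, and the idea that the 4 channels of a group run on parallel PEs is an assumption; here they are simply looped over.

```python
def conv_by_channel_groups(ifmap, weights, group_size=4):
    """Accumulate a multi-channel 3x3 convolution one channel group at a time.

    ifmap:   [C][H][W] input feature map
    weights: [C][3][3] one filter
    Returns (output feature map, number of groups processed).
    """
    channels, height, width = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    out = [[0] * (width - 2) for _ in range(height - 2)]
    groups = 0
    for start in range(0, channels, group_size):
        # Assumption: in hardware these channels would be handled in parallel.
        for ch in range(start, min(start + group_size, channels)):
            for r in range(height - 2):
                for c in range(width - 2):
                    out[r][c] += sum(
                        ifmap[ch][r + kr][c + kc] * weights[ch][kr][kc]
                        for kr in range(3) for kc in range(3))
        groups += 1
    return out, groups
```

With 64 channels and a group size of 4, the loop makes 16 passes over the output map.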
Hardware Platform
• The Xilinx PYNQ-Z1 has a Zynq-7000 SoC.
• It has an ARM processor running at 650 MHz.
• The programmable logic runs at 100 MHz.
• The programmable logic can be controlled using Python code.
Image source: https://reference.digilentinc.com/_media/reference/programmable-logic/pynq-z1/pynq-z1-1.png
Architecture Implementation
• The architecture was implemented in C++.
• High-level synthesis (HLS) was used to convert the C++ to hardware.
• An RTL IP was created and built into a block design.
• Xilinx Vivado was used to synthesize, place and route the block design.
• A 16-bit fixed-point format was used for the weights and partial sums.
• Python code using the PYNQ library was used to implement the convolution layer operation.
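To make the 16-bit fixed-point arithmetic concrete, here is a small software model. The slides do not say how the 16 bits are split, so the Q8.8 layout (`FRAC_BITS = 8`), the saturation behaviour and all function names below are assumptions for illustration only.

```python
FRAC_BITS = 8  # assumed Q8.8 split; the slides only say "fixed point 16"
INT16_MIN, INT16_MAX = -32768, 32767


def saturate(value):
    """Clamp to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def to_fixed(x):
    return saturate(round(x * (1 << FRAC_BITS)))


def to_float(q):
    return q / (1 << FRAC_BITS)


def fixed_mul(a, b):
    # The raw product has 2*FRAC_BITS fractional bits; shift back to FRAC_BITS.
    return saturate((a * b) >> FRAC_BITS)


def fixed_mac(acc, a, b):
    """One multiply-accumulate step on fixed-point operands."""
    return saturate(acc + fixed_mul(a, b))
```

For example, `fixed_mul(to_fixed(1.5), to_fixed(2.0))` yields the same integer as `to_fixed(3.0)`.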
FPGA Utilization
Includes the resources required for the AXI DMAs, block RAMs and other blocks.
Results and Comparisons
[10] Y. Chen, T. Krishna, J.S. Emer and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep
convolutional neural networks” IEEE Journal of Solid State Circuits, vol. 52, no. 1, pp. 127-138, January 2017.
[4]. A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An architecture to accelerate convolution in deep neural networks,”
IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 4, pp. 1349–1362, 2017.
Method #2
Single Partial Product 2-D
(SPP2D)
Convolution Operation
Input:
i0 i1 i2 i3 i4
i5 i6 i7 i8 i9
i10 i11 i12 i13 i14
i15 i16 i17 i18 i19
i20 i21 i22 i23 i24
Kernel:
w0 w1 w2
w3 w4 w5
w6 w7 w8
Output:
o0 o1 o2
o3 o4 o5
o6 o7 o8
Each output pixel is the sum (Σ) of nine partial products:
o0 = i0*w0 + i1*w1 + i2*w2 + i5*w3 + i6*w4 + i7*w5 + i10*w6 + i11*w7 + i12*w8
o1 = i1*w0 + i2*w1 + i3*w2 + i6*w3 + i7*w4 + i8*w5 + i11*w6 + i12*w7 + i13*w8
o2 = i2*w0 + i3*w1 + i4*w2 + i7*w3 + i8*w4 + i9*w5 + i12*w6 + i13*w7 + i14*w8
o3 = i5*w0 + i6*w1 + i7*w2 + i10*w3 + i11*w4 + i12*w5 + i15*w6 + i16*w7 + i17*w8
o4 = i6*w0 + i7*w1 + i8*w2 + i11*w3 + i12*w4 + i13*w5 + i16*w6 + i17*w7 + i18*w8
o5 = i7*w0 + i8*w1 + i9*w2 + i12*w3 + i13*w4 + i14*w5 + i17*w6 + i18*w7 + i19*w8
o6 = i10*w0 + i11*w1 + i12*w2 + i15*w3 + i16*w4 + i17*w5 + i20*w6 + i21*w7 + i22*w8
o7 = i11*w0 + i12*w1 + i13*w2 + i16*w3 + i17*w4 + i18*w5 + i21*w6 + i22*w7 + i23*w8
o8 = i12*w0 + i13*w1 + i14*w2 + i17*w3 + i18*w4 + i19*w5 + i22*w6 + i23*w7 + i24*w8
Consider an input of size 5x5 and a kernel of size 3x3. We consider a convolution with stride 1 and no padding (a padding of zero), which gives a 3x3 output.
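The index pattern of the partial products can be generated rather than written out by hand. In this sketch the function name `partial_product_terms` is hypothetical; it returns, for each output index, the (input index, weight index) pairs that are multiplied and summed, using the same row-major numbering as the slide.

```python
def partial_product_terms(n=5, k=3):
    """Map each output index to its list of (input index, weight index) pairs."""
    out_w = n - k + 1  # stride 1, no padding
    terms = {}
    for orow in range(out_w):
        for ocol in range(out_w):
            terms[orow * out_w + ocol] = [
                ((orow + kr) * n + (ocol + kc), kr * k + kc)
                for kr in range(k) for kc in range(k)
            ]
    return terms
```

For the 5x5 input and 3x3 kernel this gives 9 outputs of 9 terms each, 81 partial products in total.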
Convolution Operation
[Figure: the 5x5 input grid (i0 to i24) and the 3x3 output grid (o0 to o8), each shown three times.]
The frequency of use of an input pixel is written N(x), where x is the number of times the pixel is used. For example, i0 has frequency 1, written N(1).
Convolution Operation
We use the notation N(x) for the frequency of use of an input pixel, where x is the frequency.
For example, pixel i12 has frequency N(9): it is used 9 times, once with each of the 9 kernel elements.
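The frequency of use can be counted directly by sliding the window over the input and incrementing every pixel it covers. A minimal sketch, with the hypothetical name `use_frequency` and stride 1 and no padding assumed:

```python
def use_frequency(n=5, k=3):
    """Count how many output windows touch each input pixel."""
    freq = [[0] * n for _ in range(n)]
    for orow in range(n - k + 1):
        for ocol in range(n - k + 1):
            for kr in range(k):
                for kc in range(k):
                    freq[orow + kr][ocol + kc] += 1
    return freq
```

For the 5x5 example the centre entry (pixel i12) comes out as 9 and the corner entry (pixel i0) as 1.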
Pattern of Input Pixel Frequency in Sliding Window
Pattern of the frequency with which input pixels are needed in the existing* implementation
N(9) pixels always lie in the center of the input (an (N-4)x(N-4) region, where N is the input dimension), while all other frequencies lie in the boundary region, which is two pixels deep.
Patterns in Existing Implementation
Pattern of the frequency with which input pixels are needed in the existing* implementation
N(3) pixels always lie in the center of the input, while all other frequencies lie at the top and bottom, in bands two pixels deep.
* A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An architecture to accelerate convolution in deep neural networks,”
IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 4, pp. 1349–1362, 2017.
Generalized Equation for Pattern of Input Pixels for Sliding
Window Operation
[Figure: N x N input with a two-pixel-deep border (2 rows at the top and bottom, 2 columns at the left and right). N(9) pixels fill the interior, N(6) and N(3) pixels lie along the edges, and N(4), N(2) and N(1) pixels lie in the corners.]

Input size N   N(9)    N(6), N(3) (each)   N(4), N(1) (each)   N(2)
5              1       4                   4                   8
6              4       8                   4                   8
7              9       12                  4                   8
10             36      24                  4                   8
14             100     40                  4                   8
28             576     96                  4                   8
56             2704    208                 4                   8
112            11664   432                 4                   8
224            48400   880                 4                   8

Number of inputs with N(x)   Generalized expression
N(9)                         (Hinput - 4) x (Winput - 4)
N(6) and N(3) (each)         ((Hinput - 4) x 2) + ((Winput - 4) x 2)
N(4) and N(1) (each)         4
N(2)                         2 x 4

Hinput and Winput are the height and width of the input; both are N pixels in this example.
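The counts in the table can be checked against a brute-force histogram. In this sketch `frequency_histogram` and `closed_form` are hypothetical names; the closed form uses (N - 4) x (N - 4) for N(9), which is the expression that reproduces the tabulated values (1 for N = 5, 48400 for N = 224).

```python
from collections import Counter


def frequency_histogram(n, k=3):
    """Count how many input pixels have each frequency of use."""
    # Per-axis count of window positions that cover index i (stride 1, no padding).
    axis = [min(i, n - k) - max(i - k + 1, 0) + 1 for i in range(n)]
    # The 2-D frequency of a pixel is the product of its row and column counts.
    return Counter(a * b for a in axis for b in axis)


def closed_form(n):
    """Generalized expressions for a square N x N input and a 3x3 kernel."""
    edge = 2 * (n - 4) + 2 * (n - 4)
    return {9: (n - 4) * (n - 4), 6: edge, 3: edge, 4: 4, 1: 4, 2: 2 * 4}
```

Summing the six counts gives N x N for every size in the table, so no input pixel is left out.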
SPP2D – Input stream
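• Only i12 occupies all 9 multipliers, one for each of the 9 weights.
• The other pixels are grouped into complementary sets, whose members use disjoint weights that together cover all 9 weights.
• Complementary sets: (i7,i22), (i17,i2), (i11,i14), (i6,i9,i21,i24), (i16,i1,i19,i4), (i13,i10), (i8,i5,i23,i20), (i18,i3,i15,i0)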
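Whether a group of pixels is complementary can be checked mechanically: list the weights each pixel is multiplied by and confirm that the group covers w0 to w8 exactly once. The helper names below are hypothetical. The fourth set is written with i9, as in the weight-by-cycle table of the optimized input stream; with i19 in its place, w8 would be used twice and w2 not at all.

```python
def weights_used(pixel, n=5, k=3):
    """Indices of the kernel weights that multiply a given input pixel."""
    r, c = divmod(pixel, n)
    out_w = n - k + 1
    return {kr * k + kc
            for kr in range(k) for kc in range(k)
            if 0 <= r - kr < out_w and 0 <= c - kc < out_w}


def is_complementary(pixels, k=3):
    """True if the pixels use every weight exactly once between them."""
    used = []
    for p in pixels:
        used.extend(weights_used(p))
    return sorted(used) == list(range(k * k))


COMPLEMENTARY_SETS = [
    (18, 3, 15, 0), (16, 1, 19, 4), (17, 2), (8, 5, 23, 20),
    (6, 9, 21, 24), (7, 22), (13, 10), (11, 14), (12,),
]
```

The nine sets together contain each of the 25 input pixels exactly once.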
SPP2D – Optimized Input stream
Two benefits of combining input pixels into complementary sets:
1. All multipliers are occupied.
2. The output is reached faster, theoretically in 9 cycles for this arrangement.
[Figure: the 5x5 input (i0 to i24), the 3x3 kernel (w0 to w8) and the 3x3 output (o0 to o8).]

Clock cycle        1               2               3          4               5               6          7          8          9
N(x)               N(4),N(2),N(1)  N(4),N(2),N(1)  N(6),N(3)  N(4),N(2),N(1)  N(4),N(2),N(1)  N(6),N(3)  N(6),N(3)  N(6),N(3)  N(9)
Complementary set  i18+i3+i15+i0   i16+i1+i19+i4   i17+i2     i8+i5+i23+i20   i6+i9+i21+i24   i7+i22     i13+i10    i11+i14    i12
w0                 w0i0            w0i1            w0i2       w0i5            w0i6            w0i7       w0i10      w0i11      w0i12
w1                 w1i3            w1i1            w1i2       w1i8            w1i6            w1i7       w1i13      w1i11      w1i12
w2                 w2i3            w2i4            w2i2       w2i8            w2i9            w2i7       w2i13      w2i14      w2i12
w3                 w3i15           w3i16           w3i17      w3i5            w3i6            w3i7       w3i10      w3i11      w3i12
w4                 w4i18           w4i16           w4i17      w4i8            w4i6            w4i7       w4i13      w4i11      w4i12
w5                 w5i18           w5i19           w5i17      w5i8            w5i9            w5i7       w5i13      w5i14      w5i12
w6                 w6i15           w6i16           w6i17      w6i20           w6i21           w6i22      w6i10      w6i11      w6i12
w7                 w7i18           w7i16           w7i17      w7i23           w7i21           w7i22      w7i13      w7i11      w7i12
w8                 w8i18           w8i19           w8i17      w8i23           w8i24           w8i22      w8i13      w8i14      w8i12
SPP2D – Partial Products Sorted into their Outputs
Input pixel value applied to each weight, one complementary set per clock cycle (weight values in the header):

Set             w0=2  w1=2  w2=3  w3=1  w4=2  w5=1  w6=1  w7=3  w8=1
i18+i3+i15+i0   6     4     4     2     3     3     2     3     3
i16+i1+i19+i4   5     5     6     1     1     3     1     1     3
i17+i2          7     7     7     8     8     8     8     8     8
i8+i5+i23+i20   8     1     1     8     1     1     9     2     2
i6+i9+i21+i24   2     2     5     2     2     5     9     9     4
i7+i22          8     8     8     8     8     8     10    10    10
i13+i10         5     9     9     5     9     9     5     9     9
i11+i14         2     2     8     2     2     8     2     2     8
i12             3     3     3     3     3     3     3     3     3

The partial products highlighted in red contribute to the first output pixel, o0:
Σ = 12 + 10 + 21 + 8 + 4 + 8 + 5 + 6 + 3 = 77

Output:
o0 = 77   o1 = 75   o2 = 93
o3 = 69   o4 = 68   o5 = 82
o6 = 81   o7 = 98   o8 = 85
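The numeric example can be replayed in software. The sketch below is an illustrative model, not the presented hardware: `spp2d` is a hypothetical name, the pixel values in `IMAGE` are read off the rows of the table above (for instance the i12 row gives i12 = 3), and the routing rule sends the product of pixel (r, c) and weight (kr, kc) to output (r - kr, c - kc).

```python
IMAGE = [6, 5, 7, 4, 6,
         8, 2, 8, 1, 5,
         5, 2, 3, 9, 8,
         2, 1, 8, 3, 3,
         9, 9, 10, 2, 4]
WEIGHTS = [2, 2, 3, 1, 2, 1, 1, 3, 1]
SCHEDULE = [(18, 3, 15, 0), (16, 1, 19, 4), (17, 2), (8, 5, 23, 20),
            (6, 9, 21, 24), (7, 22), (13, 10), (11, 14), (12,)]


def spp2d(image, weights, schedule, n=5, k=3):
    """Accumulate outputs one complementary set (one clock cycle) at a time."""
    out_w = n - k + 1
    outputs = [0] * (out_w * out_w)
    busy = []  # number of multipliers used in each cycle
    for cycle_pixels in schedule:
        used = 0
        for p in cycle_pixels:
            r, c = divmod(p, n)
            for kr in range(k):
                for kc in range(k):
                    orow, ocol = r - kr, c - kc
                    if 0 <= orow < out_w and 0 <= ocol < out_w:
                        # Each partial product is computed once and routed
                        # to the single output it belongs to.
                        outputs[orow * out_w + ocol] += image[p] * weights[kr * k + kc]
                        used += 1
        busy.append(used)
    return outputs, busy
```

Calling `spp2d(IMAGE, WEIGHTS, SCHEDULE)` reproduces the nine output values listed above, with all 9 multipliers in use in each of the 9 cycles.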
SPP2D – Hardware Architecture
[Figure: block diagram with the blocks External Memory, Weight Buffer (9 weights), Input Buffer (input stream), Multiplier, Selector Accumulator and Output Buffer.]
SPP2D – Hardware Architecture
• Delivers the output in 9 cycles for a 5x5 input and a 3x3 kernel.
• The architecture expands the 25-pixel input matrix into an 81-pixel input stream, which would require a large buffer.
• The selector accumulator (mux) in this example is designed for a 5x5 input and 3x3 weights. It needs to scale to an input size of 224x224 for the VGG16 example.
Results and Comparisons
Our algorithm is 9x faster than the sliding-window approach and 3x faster than the implementation by Warren Gross and co-authors [1].
[1] A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An architecture to accelerate convolution in deep neural networks,”
IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 4, pp. 1349–1362, 2017.
Future Work
Future work (1)
• Use Compression: CNNs can be compressed to INT8 with minimal impact on accuracy.
• More Processing Elements (PEs) can be implemented.
• Faster operation
• Compress weights and activations to reduce bandwidth requirement.
• The utilization percentage of Method #1 FIFOs for the later layers of the CNNs is low
Future work (2)
• Better utilization of FIFOs for later layers of CNNs of Method #1.
• Better utilization of Multipliers for layers of CNNs of Method #2.
• These two methods can be utilized for other non-FPGA platforms e.g. ASICs, CPUs,
GPUs, etc.
• Demonstrate scalability to practical sizes such as 224x224.
Summary and Conclusions
• We presented two new methods for 2-D convolution that offer considerable reductions in power and computational complexity and improved efficiency, resulting in a considerably better architecture.
• The first method is based on FIFOs and computes convolution results from row-wise inputs, as opposed to traditional tile-based processing, giving considerably reduced latency.
• The second method, Single Partial Product 2-D (SPP2D) convolution, avoids recalculating partial products and reduces repeated reads of the input.
• Hardware implementation results showing the improvements were presented.
References & Acknowledgements
Reference 1: A FIFO Based Accelerator for CNNs
Reference 2: A Fast 2-D Convolution Technique for Deep Neural Networks
Acknowledgements
Xilinx University Program
Vineet Panchbaiyye, Santa Clara University
Anaam Ansari, Santa Clara University
Questions & Answers
Contact Information:
Tokunbo Ogunfunmi
Santa Clara University
Email: Togunfunmi@scu.edu

More Related Content

PDF
IRJET-Artificial Neural Networks to Determine Source of Acoustic Emission and...
PDF
IRJET - Object Detection using Deep Learning with OpenCV and Python
PDF
Compressed sensing techniques for sensor data using unsupervised learning
PDF
Bhadale group of companies ai neural networks and algorithms catalogue
PDF
Dario izzo - Machine Learning methods and space engineering
PPTX
2022 03 22_蔡煒俊_u-net_convolutional_networks_for_biomedical_image_segmentation
PDF
184816386 x mining
PDF
CSMR11b.ppt
IRJET-Artificial Neural Networks to Determine Source of Acoustic Emission and...
IRJET - Object Detection using Deep Learning with OpenCV and Python
Compressed sensing techniques for sensor data using unsupervised learning
Bhadale group of companies ai neural networks and algorithms catalogue
Dario izzo - Machine Learning methods and space engineering
2022 03 22_蔡煒俊_u-net_convolutional_networks_for_biomedical_image_segmentation
184816386 x mining
CSMR11b.ppt

What's hot (12)

PDF
IBM Cloud Paris Meetup 20180517 - Deep Learning Challenges
PDF
One-Pass Clustering Superpixels
PDF
Image Steganography Based On Non Linear Chaotic Algorithm
PDF
Deep Learning based Segmentation Pipeline for Label-Free Phase-Contrast Micro...
PDF
Defense_thesis
PDF
Transfer Learning Model for Image Segmentation by Integrating U-NetPlusPlus a...
PDF
A Random Forest using a Multi-valued Decision Diagram on an FPGa
PDF
Picmet15sasaki20150805.ppt
PDF
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...
PDF
Deep Learning Initiative @ NECSTLab
PDF
Dynamic Two-Stage Image Retrieval from Large Multimodal Databases
PPT
Number systems
IBM Cloud Paris Meetup 20180517 - Deep Learning Challenges
One-Pass Clustering Superpixels
Image Steganography Based On Non Linear Chaotic Algorithm
Deep Learning based Segmentation Pipeline for Label-Free Phase-Contrast Micro...
Defense_thesis
Transfer Learning Model for Image Segmentation by Integrating U-NetPlusPlus a...
A Random Forest using a Multi-valued Decision Diagram on an FPGa
Picmet15sasaki20150805.ppt
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...
Deep Learning Initiative @ NECSTLab
Dynamic Two-Stage Image Retrieval from Large Multimodal Databases
Number systems
Ad

Similar to “New Methods for Implementation of 2-D Convolution for Convolutional Neural Networks,” a Presentation from Santa Clara University (20)

PDF
20320140503007
PDF
International Journal of Computational Engineering Research (IJCER)
PDF
Investigating the Performance of NoC Using Hierarchical Routing Approach
PDF
Investigating the Performance of NoC Using Hierarchical Routing Approach
PPTX
[20240930_LabSeminar_Huy]GinAR: An End-To-End Multivariate Time Series Foreca...
PDF
FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customiz...
PDF
20320140503011
PDF
Biologically inspired deep residual networks
PPTX
PPTX
Learning biologically relevant features using convolutional neural networks f...
PDF
Design and Structuring of a Multiprocessor System based on Transputers
PPTX
[20240415_LabSeminar_Huy]Deciphering Spatio-Temporal Graph Forecasting: A Cau...
PDF
MSc Thesis Presentation
PDF
Design & analysis various basic logic gates usingQuantum Dot Cellular Automat...
PDF
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
PDF
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
PDF
Design of fault tolerant algorithm for network on chip router using field pr...
PDF
REVIEW ON OBJECT DETECTION WITH CNN
PPTX
NS-CUK Seminar: H.B.Kim, Review on "Inductive Representation Learning on Lar...
PDF
Implementation of first order statistical processor on FPGA for feature extra...
20320140503007
International Journal of Computational Engineering Research (IJCER)
Investigating the Performance of NoC Using Hierarchical Routing Approach
Investigating the Performance of NoC Using Hierarchical Routing Approach
[20240930_LabSeminar_Huy]GinAR: An End-To-End Multivariate Time Series Foreca...
FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customiz...
20320140503011
Biologically inspired deep residual networks
Learning biologically relevant features using convolutional neural networks f...
Design and Structuring of a Multiprocessor System based on Transputers
[20240415_LabSeminar_Huy]Deciphering Spatio-Temporal Graph Forecasting: A Cau...
MSc Thesis Presentation
Design & analysis various basic logic gates usingQuantum Dot Cellular Automat...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Design of fault tolerant algorithm for network on chip router using field pr...
REVIEW ON OBJECT DETECTION WITH CNN
NS-CUK Seminar: H.B.Kim, Review on "Inductive Representation Learning on Lar...
Implementation of first order statistical processor on FPGA for feature extra...
Ad

More from Edge AI and Vision Alliance (20)

PDF
“Visual Search: Fine-grained Recognition with Embedding Models for the Edge,”...
PDF
“Optimizing Real-time SLAM Performance for Autonomous Robots with GPU Acceler...
PDF
“LLMs and VLMs for Regulatory Compliance, Quality Control and Safety Applicat...
PDF
“Simplifying Portable Computer Vision with OpenVX 2.0,” a Presentation from AMD
PDF
“Quantization Techniques for Efficient Deployment of Large Language Models: A...
PDF
“Introduction to Data Types for AI: Trade-Offs and Trends,” a Presentation fr...
PDF
“Introduction to Radar and Its Use for Machine Perception,” a Presentation fr...
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
PDF
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
PDF
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
PDF
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
PDF
“ONNX and Python to C++: State-of-the-art Graph Compilation,” a Presentation ...
PDF
“Beyond the Demo: Turning Computer Vision Prototypes into Scalable, Cost-effe...
PDF
“Running Accelerated CNNs on Low-power Microcontrollers Using Arm Ethos-U55, ...
PDF
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
PDF
“A Re-imagination of Embedded Vision System Design,” a Presentation from Imag...
PDF
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
PDF
“Evolving Inference Processor Software Stacks to Support LLMs,” a Presentatio...
PDF
“Efficiently Registering Depth and RGB Images,” a Presentation from eInfochips
PDF
“How to Right-size and Future-proof a Container-first Edge AI Infrastructure,...
“Visual Search: Fine-grained Recognition with Embedding Models for the Edge,”...
“Optimizing Real-time SLAM Performance for Autonomous Robots with GPU Acceler...
“LLMs and VLMs for Regulatory Compliance, Quality Control and Safety Applicat...
“Simplifying Portable Computer Vision with OpenVX 2.0,” a Presentation from AMD
“Quantization Techniques for Efficient Deployment of Large Language Models: A...
“Introduction to Data Types for AI: Trade-Offs and Trends,” a Presentation fr...
“Introduction to Radar and Its Use for Machine Perception,” a Presentation fr...
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
“ONNX and Python to C++: State-of-the-art Graph Compilation,” a Presentation ...
“Beyond the Demo: Turning Computer Vision Prototypes into Scalable, Cost-effe...
“Running Accelerated CNNs on Low-power Microcontrollers Using Arm Ethos-U55, ...
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
“A Re-imagination of Embedded Vision System Design,” a Presentation from Imag...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“Evolving Inference Processor Software Stacks to Support LLMs,” a Presentatio...
“Efficiently Registering Depth and RGB Images,” a Presentation from eInfochips
“How to Right-size and Future-proof a Container-first Edge AI Infrastructure,...

Recently uploaded (20)

PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Electronic commerce courselecture one. Pdf
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
cuic standard and advanced reporting.pdf
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Approach and Philosophy of On baking technology
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PPTX
Programs and apps: productivity, graphics, security and other tools
PPTX
Big Data Technologies - Introduction.pptx
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PPTX
Cloud computing and distributed systems.
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Network Security Unit 5.pdf for BCA BBA.
Building Integrated photovoltaic BIPV_UPV.pdf
Electronic commerce courselecture one. Pdf
MIND Revenue Release Quarter 2 2025 Press Release
cuic standard and advanced reporting.pdf
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Reach Out and Touch Someone: Haptics and Empathic Computing
Approach and Philosophy of On baking technology
Mobile App Security Testing_ A Comprehensive Guide.pdf
Programs and apps: productivity, graphics, security and other tools
Big Data Technologies - Introduction.pptx
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Cloud computing and distributed systems.
NewMind AI Weekly Chronicles - August'25-Week II
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
The Rise and Fall of 3GPP – Time for a Sabbatical?
Digital-Transformation-Roadmap-for-Companies.pptx
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Per capita expenditure prediction using model stacking based on satellite ima...
Network Security Unit 5.pdf for BCA BBA.

“New Methods for Implementation of 2-D Convolution for Convolutional Neural Networks,” a Presentation from Santa Clara University

  • 1. © 2020 SPRL, SoE, Santa Clara University New Methods for Implementation of 2-D Convolution for Convolutional Neural Network (CNN) Tokunbo Ogunfunmi Signal Processing Research Lab (SPRL), Electrical & Computer Engr. (ECEN) Dept., School of Engineering, Santa Clara University. September 2020
  • 2. © 2020 SPRL, SoE, Santa Clara University Outline ➢ Motivation ➢ Challenges in Implementing 2-D Convolution for CNNs ➢ Method #1 ➢ Method #2 ➢ Future Work ➢ Summary and Conclusions 2
  • 3. © 2020 SPRL, SoE, Santa Clara University Convolutional Neural Networks • CNNs are most popular for vision tasks like image classification and segmentation. • CNNs are computationally intensive. • Computation and data movement requires energy. • Data read and write major energy consumer. • Activations, partial sums and weights constitute the most amount of data moved. 3
  • 4. © 2020 SPRL, SoE, Santa Clara University 2D Convolution Operation • Weights multiplied by input feature map and accumulated. • Kernel or weights are synonymous. • Filters in CNNs convolve over multiple channels. Image Source: https://cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.5.6-ConvolutionalNetworks.pdf 4
  • 5. © 2020 SPRL, SoE, Santa Clara University Challenges in FPGA Implementation of DNNs Computation Engine PE PE PE PE On chip Output buffer External Memory FPGA Input image And weights Input pixels and weights Input pixels and weights Output Feature Map Pixel Input size Loop Tiling Loop Unrolling Output feature size Loop Tiling On chip buffer for input and weights 2 3 2 External Memory 1 1 Challenge 1 : Huge Memory Transfer (Input and Output) 2 Challenge 2 : Large Onchip Buffers (Input and Output) 3 Challenge 3: Large Compute 4 Challenge 4: Complicated Scheduling and Dataflow Control 4 1 5
  • 6. © 2020 SPRL, SoE, Santa Clara University Method #1 FIFO Based
  • 7. © 2020 SPRL, SoE, Santa Clara University Convolution – Tile Based • Conv. with 3x3 kernel • Need to read at least 3 rows of pixels into line buffers. • Better tile based processing with 4 line buffers. 7
  • 8. © 2020 SPRL, SoE, Santa Clara University Proposed Dataflow • Method proposed for VGG16 which has only 3x3 kernel • Can be extended to other kernel sizes as well • The proposed method aims to reduce the read and write bandwidth. • Aims to read the input feature map only once. [4]. A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An architecture to accelerate convolution in deep neural networks,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 4, pp. 1349–1362, 2017. 8
  • 9. © 2020 SPRL, SoE, Santa Clara University The Basic Idea and an Example Partial sum FIFO1 Output FIFO Partial sum FIFO2 33 14 4 16 7 23 1 21 0 12 7 14 5 14 9 19 5 20 4 87 11 2 10 0 75 85 67 95 75 65 82 90 11 5 14 3 23 2 17 8 1 0 -1 2 0 -2 1 0 -1 -23 -53 -43 -22 -44 -67 -50 - 100 - 153 -55 - 110 - 153 -13 -26 -48 -13 -80 9
  • 10. © 2020 SPRL, SoE, Santa Clara University Processing Element (PE) • Uses 3 FIFOs to compute convolution output • Partial sums are stored in 2 FIFOs. • 3rd FIFO used to accumulate outputs • Partial sum FIFO size = width of the input image. • Rounded up to 256 in case of VGG16 • 2 such FIFOs • Output FIFO used to combine output of Processing elements • Output FIFO size for VGG16 =>256x256 = 64k 10
  • 11. © 2020 SPRL, SoE, Santa Clara University Processing Element (PE) Architecture 11
  • 12. © 2020 SPRL, SoE, Santa Clara University Parallel Implementation • Example of how 64 channel Input Feature (IF) map is processed in groups of 4. 12
  • 13. © 2020 SPRL, SoE, Santa Clara University Hardware Platform • XILINX PYNQZ1 has a ZNYQ 7000 soc. • Has an ARM processor running at 650 MHz. • Programable logic works at 100 MHz. • Programmable logic can be controlled using Python code Image source : https://reference.digilentinc.com/_media/reference/programmable-logic/pynq-z1/pynq-z1-1.png
  • 14. © 2020 SPRL, SoE, Santa Clara University Architecture Implementation • The architecture was implemented using C++ • HLS used to convert C++ to hardware. • RTL IP created and built into block design • Xilinx Vivado used to synthesize, place and route the block design • Fixed point 16 format was used for the weights and partial sums. • Python code using PYNQ library used to implement the Conv. Layer operation. 14
  • 15. © 2020 SPRL, SoE, Santa Clara University FPGA Utilization Includes the space required for AXI DMAs, Block RAMs other blocks. 15
  • 16. © 2020 SPRL, SoE, Santa Clara University Results and Comparisons [10] Y. Chen, T. Krishna, J.S. Emer and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks” IEEE Journal of Solid State Circuits, vol. 52, no. 1, pp. 127-138, January 2017. [4]. A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An architecture to accelerate convolution in deep neural networks,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 4, pp. 1349–1362, 2017. 16
  • 17. © 2020 SPRL, SoE, Santa Clara University Method #2 Single Partial Product 2-D (SPP2D)
  • 18. © 2020 SPRL, SoE, Santa Clara University Convolution Operation i0 i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 i18 i19 i20 i21 i22 i23 i24 i0*w0 i1*w1 i2*w2 i5*w3 i6*w4 i7*w5 i10*w6 i11*w7 i12*w8 Input o0 o1 o2 o3 o4 o5 o6 o7 o8 w0 w1 w2 w3 w4 w5 w6 w7 w8 Kernel Output i1*w0 i2*w1 i3*w2 i6*w3 i7*w4 i8*w5 i11*w6 i12*w7 i13*w8 i2*w0 i3*w1 i4*w2 i7*w3 i8*w4 i9*w5 i12*w6 i13*w7 i14*w8 i5*w0 i6*w1 i7*w2 i10*w3 i11*w4 i12*w5 i15*w6 i16*w7 i17*w8 i6*w0 i7*w1 i8*w2 i11*w3 i12*w4 i13*w5 i16*w6 i17*w7 i18*w8 i7*w0 i8*w1 i9*w2 i12*w3 i13*w4 i14*w5 i17*w6 i18*w7 i19*w8 i10*w0 i11*w1 i12*w2 i15*w3 i16*w4 i17*w5 i20*w6 i21*w7 i22*w8 i11*w0 i12*w1 i13*w2 i16*w3 i17*w4 i18*w5 i21*w6 i22*w7 i23*w8 Σ i12*w0 i13*w1 i14*w2 i17*w3 i18*w4 i19*w5 i22*w6 i23*w7 i24*w8 Consider and input of size 5x5,kernel of size 3x3.We consider a convolution operation with stride 1 and with zero padding. 18
  • 19. © 2020 SPRL, SoE, Santa Clara University o0 o1 o2 o3 o4 o5 o6 o7 o8 o0 o1 o2 o3 o4 o5 o6 o7 o8 o0 o1 o2 o3 o4 o5 o6 o7 o8 Convolution Operation i0 i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 i18 i19 i20 i21 i22 i23 i24 i0 i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 i18 i19 i20 i21 i22 i23 i24 i0 i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 i18 i19 i20 i21 i22 i23 i24 Frequency of use of an input pixels is N(x) where x is the frequency itself. For example i0 has frequency 1 19
  • 20. © 2020 SPRL, SoE, Santa Clara University Convolution Operation We use the notation N(x) to convey the frequency of use for an input pixel, here x is the frequency. For example, pixel i12 has frequency N(9). It is the input pixel that is used 9 times with all 9 kernel elements. 20
  • 21. © 2020 SPRL, SoE, Santa Clara University Pattern of Input Pixel Frequency in Sliding Window Pattern of the frequency with which input pixels are needed in the existing* implementation N(9) pixels always lies in the center of the input ( (N-4)x(N-4) where N is input dimension) while all the other frequencies lie on the periphery boundary which is two pixels deep. 21
  • 22. © 2020 SPRL, SoE, Santa Clara University Patterns in Existing Implementation Pattern of the frequency with which input pixels are needed in the existing* implementation N(3) pixels always lies in the center of the input while all the other frequencies lie on top and bottom and are two pixels deep A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An architecture to accelerate convolution in deep neural networks,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 4, pp. 1349–1362, 2017. 22
  • 23. © 2020 SPRL, SoE, Santa Clara University Generalized Equation for Pattern of Input Pixels for Sliding Window Operation N(9) N(6) N(3) N(4) N(2) N(1) N pixels N pixels 2 columns 2 columns 2 rows 2 rows input size N(9) N(6) and N(3) N(4) and N(1) N(2) 5 1 4 4 8 6 4 8 4 8 7 9 12 4 8 10 36 24 4 8 14 100 40 4 8 28 576 96 4 8 56 2704 208 4 8 112 11664 432 4 8 224 48400 880 4 8 Number of inputs with N(x) Generalized Expression N(9) (Hinput – 2) x (Winput – 2) N(6) and N(3) ((Hinput – 4) x 2 ) + ((Winput – 4)x2) N(4) and N(1) 4 N(2) 2x4 Hinput and Winput are dimension of input and are N pixels in this example 23
  • 24. SPP2D – Input stream • Only i12 occupies all 9 multipliers, one per weight. • The remaining pixels are grouped into complementary sets, whose members together use each of the 9 weights exactly once. • Complementary sets: (i7, i22), (i17, i2), (i11, i14), (i6, i9, i21, i24), (i16, i1, i19, i4), (i13, i10), (i8, i5, i23, i20), (i18, i3, i15, i0)
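The complementary property can be checked mechanically. The sketch below assumes a 5x5 input, a 3x3 kernel with stride 1 and no padding, and row-major pixel and weight indices. The fourth listed set is taken as (i6, i9, i21, i24), which is the grouping the partial-product table on the next slide uses. The helper names `weights_used` and `is_complementary` are illustrative.

```python
def weights_used(pixel, size=5, k=3):
    """Indices of the kernel weights that multiply a given input pixel."""
    r, c = divmod(pixel, size)
    n_out = size - k + 1
    used = []
    for kr in range(k):
        for kc in range(k):
            # Weight (kr, kc) meets pixel (r, c) when the window sits at (r - kr, c - kc).
            if 0 <= r - kr < n_out and 0 <= c - kc < n_out:
                used.append(kr * k + kc)
    return used


def is_complementary(pixels, k=3):
    """True when the pixels together use every weight exactly once."""
    used = sorted(w for p in pixels for w in weights_used(p))
    return used == list(range(k * k))


SETS = [
    (7, 22), (17, 2), (11, 14), (6, 9, 21, 24), (16, 1, 19, 4),
    (13, 10), (8, 5, 23, 20), (18, 3, 15, 0), (12,),
]

for pixel_set in SETS:
    print(pixel_set, is_complementary(pixel_set))
```

Every set passes, and the nine sets together cover all 25 input pixels once each. A set such as (i6, i19, i21, i24) fails the check, because i19 and i24 both need w8 while no pixel supplies w2.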
  • 25. SPP2D – Optimized Input stream Combining input pixels into complementary sets has two benefits: 1. All multipliers are occupied in every cycle. 2. The output is reached faster, theoretically in 9 cycles for this arrangement.
    [Figure: 5x5 input (i0 to i24), 3x3 kernel (w0 to w8) and 3x3 output (o0 to o8).]
    Partial products formed in each clock cycle (one row per cycle, one column per weight):
    | Cycle | Complementary set | N(x)             | w0    | w1    | w2    | w3    | w4    | w5    | w6    | w7    | w8    |
    |-------|-------------------|------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
    | 1     | i18+i3+i15+i0     | N(4), N(2), N(1) | w0i0  | w1i3  | w2i3  | w3i15 | w4i18 | w5i18 | w6i15 | w7i18 | w8i18 |
    | 2     | i16+i1+i19+i4     | N(4), N(2), N(1) | w0i1  | w1i1  | w2i4  | w3i16 | w4i16 | w5i19 | w6i16 | w7i16 | w8i19 |
    | 3     | i17+i2            | N(6), N(3)       | w0i2  | w1i2  | w2i2  | w3i17 | w4i17 | w5i17 | w6i17 | w7i17 | w8i17 |
    | 4     | i8+i5+i23+i20     | N(4), N(2), N(1) | w0i5  | w1i8  | w2i8  | w3i5  | w4i8  | w5i8  | w6i20 | w7i23 | w8i23 |
    | 5     | i6+i9+i21+i24     | N(4), N(2), N(1) | w0i6  | w1i6  | w2i9  | w3i6  | w4i6  | w5i9  | w6i21 | w7i21 | w8i24 |
    | 6     | i7+i22            | N(6), N(3)       | w0i7  | w1i7  | w2i7  | w3i7  | w4i7  | w5i7  | w6i22 | w7i22 | w8i22 |
    | 7     | i13+i10           | N(6), N(3)       | w0i10 | w1i13 | w2i13 | w3i10 | w4i13 | w5i13 | w6i10 | w7i13 | w8i13 |
    | 8     | i11+i14           | N(6), N(3)       | w0i11 | w1i11 | w2i14 | w3i11 | w4i11 | w5i14 | w6i11 | w7i11 | w8i14 |
    | 9     | i12               | N(9)             | w0i12 | w1i12 | w2i12 | w3i12 | w4i12 | w5i12 | w6i12 | w7i12 | w8i12 |
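  • 26. SPP2D – Partial Products Sorted into their Outputs
    Weights: w0 = 2, w1 = 2, w2 = 3, w3 = 1, w4 = 2, w5 = 1, w6 = 1, w7 = 3, w8 = 1
    Outputs: o0 = 77, o1 = 75, o2 = 93, o3 = 69, o4 = 68, o5 = 82, o6 = 81, o7 = 98, o8 = 85
    Input pixel value presented to each multiplier in each cycle (one row per complementary set, one column per weight):
    | Complementary set          | w0 | w1 | w2 | w3 | w4 | w5 | w6 | w7 | w8 |
    |----------------------------|----|----|----|----|----|----|----|----|----|
    | i18+i3+i15+i0              | 6  | 4  | 4  | 2  | 3  | 3  | 2  | 3  | 3  |
    | i16+i1+i19+i4              | 5  | 5  | 6  | 1  | 1  | 3  | 1  | 1  | 3  |
    | i17+i2                     | 7  | 7  | 7  | 8  | 8  | 8  | 8  | 8  | 8  |
    | i8+i5+i23+i20              | 8  | 1  | 1  | 8  | 1  | 1  | 9  | 2  | 2  |
    | i6+i9+i21+i24              | 2  | 2  | 5  | 2  | 2  | 5  | 9  | 9  | 4  |
    | i7+i22                     | 8  | 8  | 8  | 8  | 8  | 8  | 10 | 10 | 10 |
    | i13+i10                    | 5  | 9  | 9  | 5  | 9  | 9  | 5  | 9  | 9  |
    | i11+i14                    | 2  | 2  | 8  | 2  | 2  | 8  | 2  | 2  | 8  |
    | i12                        | 3  | 3  | 3  | 3  | 3  | 3  | 3  | 3  | 3  |
    | Σ (partial products of o0) | 12 | 10 | 21 | 8  | 4  | 8  | 5  | 6  | 3  |
    The partial products highlighted in red on the original slide contribute to the first output pixel. The Σ row lists them; they sum to o0 = 77.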
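The numbers on this slide can be reproduced in software. The sketch below is a behavioral model only, not the hardware design: it streams one complementary set per cycle, computes each partial product once, and routes it to the output it belongs to. The pixel values are read off the table rows (for example i0 = 6, i12 = 3), the schedule uses (i6, i9, i21, i24) as its fifth set, and the function names are illustrative.

```python
WEIGHTS = [2, 2, 3, 1, 2, 1, 1, 3, 1]  # w0..w8 from the slide
PIXELS = [6, 5, 7, 4, 6,               # i0..i24, row-major, read off the table
          8, 2, 8, 1, 5,
          5, 2, 3, 9, 8,
          2, 1, 8, 3, 3,
          9, 9, 10, 2, 4]
SCHEDULE = [(18, 3, 15, 0), (16, 1, 19, 4), (17, 2), (8, 5, 23, 20),
            (6, 9, 21, 24), (7, 22), (13, 10), (11, 14), (12,)]  # one set per cycle


def spp2d(pixels, weights, schedule, size=5, k=3):
    """Accumulate outputs from one complementary set per cycle."""
    n_out = size - k + 1
    outputs = [0] * (n_out * n_out)
    for pixel_set in schedule:
        for p in pixel_set:
            r, c = divmod(p, size)
            for kr in range(k):
                for kc in range(k):
                    out_r, out_c = r - kr, c - kc
                    # Route w(kr, kc) * pixel to the output whose window covers it.
                    if 0 <= out_r < n_out and 0 <= out_c < n_out:
                        outputs[out_r * n_out + out_c] += weights[kr * k + kc] * pixels[p]
    return outputs


def sliding_window(pixels, weights, size=5, k=3):
    """Reference result: direct stride-1, unpadded 2-D convolution (no kernel flip)."""
    n_out = size - k + 1
    return [sum(weights[kr * k + kc] * pixels[(r + kr) * size + (c + kc)]
                for kr in range(k) for kc in range(k))
            for r in range(n_out) for c in range(n_out)]


result = spp2d(PIXELS, WEIGHTS, SCHEDULE)
print(result)  # [77, 75, 93, 69, 68, 82, 81, 98, 85]
```

The streamed result matches both the direct sliding-window computation and the outputs o0 to o8 listed on the slide.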
  • 27. SPP2D – Hardware Architecture [Block diagram with blocks labeled External Memory, Weight Buffer, Input Buffer, Multiplier, Selector Accumulator and Output Buffer, and signals labeled 9 weights and Input Stream.]
  • 28. SPP2D – Hardware Architecture • Delivers the output in 9 cycles for a 5x5 input and a 3x3 kernel. • The architecture expands the 25-pixel input matrix into a stream of 81 pixels, which requires a large buffer. • The mux-based selector accumulator in this example is designed for a 5x5 input and 3x3 weights; it needs to scale to a 224x224 input for the VGG16 example.
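The 25-to-81 expansion follows from feeding 9 multipliers for 9 cycles. As a rough sketch, assuming the same scheme (one multiplier slot per partial product, 3x3 kernel, stride 1, no padding) were carried over unchanged to larger inputs, the stream length can be estimated as follows. The 224x224 figure is an extrapolation under that assumption, not a result from the talk, and `stream_slots` is an illustrative name.

```python
def stream_slots(h, w, k=3):
    """Number of pixel slots in the expanded stream: one per partial product."""
    return k * k * (h - k + 1) * (w - k + 1)


for size in (5, 224):
    pixels = size * size
    slots = stream_slots(size, size)
    print(f"{size}x{size}: {pixels} input pixels -> {slots} stream slots "
          f"({slots / pixels:.2f}x)")
```

For the 5x5 example this gives the 81 slots quoted on the slide; for larger inputs the expansion factor approaches 9, which is why the buffer size and the selector accumulator are the scaling concerns.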
  • 29. Results and Comparisons Our algorithm is 9x faster than the sliding window and 3x faster than the Warren Gross implementation [1]. [1] A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An architecture to accelerate convolution in deep neural networks,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 4, pp. 1349–1362, 2017.
  • 30. Future Work
  • 31. Future work (1) • Use compression: CNNs can be compressed to INT8 with minimal impact on accuracy. • More Processing Elements (PEs) can be implemented. • Faster operation. • Compress weights and activations to reduce the bandwidth requirement. • The utilization of the Method #1 FIFOs is low for the later layers of CNNs.
  • 32. Future work (2) • Better utilization of the Method #1 FIFOs for the later layers of CNNs. • Better utilization of the Method #2 multipliers across CNN layers. • The two methods can be applied to non-FPGA platforms, e.g. ASICs, CPUs and GPUs. • Demonstrate scalability to practical sizes such as 224x224.
  • 33. Summary and Conclusions
  • 34. Summary and Conclusions • We presented two new methods for 2-D convolution that offer considerable reductions in power and computational complexity and improved efficiency, giving a considerably better architecture. • The first method is based on FIFOs and computes convolution results from row-wise inputs, as opposed to traditional tile-based processing, giving considerably reduced latency. • The second method, Single Partial Product 2-D (SPP2D) Convolution, prevents recalculation of partial weights and reduces input reuse. • Hardware implementation results with improvements are presented.
  • 35. References & Acknowledgements Reference 1: A FIFO Based Accelerator for CNNs. Reference 2: A Fast 2-D Convolution Technique for Deep Neural Networks. Acknowledgements: Xilinx University Program; Vineet Panchbaiyye, Santa Clara University; Anaam Ansari, Santa Clara University.
  • 36. Questions & Answers Contact Information: Tokunbo Ogunfunmi, Santa Clara University. Email: Togunfunmi@scu.edu