An FPGA-based Acceleration
Methodology and Performance
Model for Iterative Stencils

Enrico Reggiani, Giuseppe Natale, Carlo Moroni, Marco D.
Santambrogio
Reconfigurable Architectures Workshop
Vancouver, British Columbia, Canada
May 21, 2018
Giuseppe Natale - giuseppe.natale@polimi.it
HPC
Iterative Stencil Algorithms
[Figure: the stencil σ maps each cell (i, j) of an N×M grid at timestep t to the same cell of the grid at timestep t+1; i ranges over 0..N-1 and j over 0..M-1.]
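As an illustration of the pattern in the figure, here is a minimal software sketch of an iterative stencil. It assumes a 5-point Jacobi-2D update with a 0.2 weight and pass-through borders; the function names `jacobi2d_step` and `iterate` are hypothetical and are not taken from the slides.

```python
def jacobi2d_step(grid):
    """Apply one 5-point update to every interior cell; border cells are copied."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # Assumed stencil: average of the cell and its four neighbours.
            new[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1])
    return new


def iterate(grid, timesteps):
    """Timestep t+1 is computed from the complete grid of timestep t."""
    for _ in range(timesteps):
        grid = jacobi2d_step(grid)
    return grid


# A 5x5 grid with a single non-zero cell in the centre.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 5.0
result = iterate(grid, 1)
```

Each cell costs only a handful of arithmetic operations per element loaded, and no timestep can start before the previous one has produced its inputs.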
Applications
Rationale
Problems with Iterative Stencils
Target Architectures
• Low Operational Intensity

• Synchronization between timesteps
Tiling [1, 2]
[1] V. Bandishti, I. Pananilath, U. Bondhugula, Tiling Stencil Computations to Maximize Parallelism. SC 2012: 11
[2] J. Holewinski, L.-N. Pouchet, P. Sadayappan, High-performance Code Generation for Stencil Computations on GPU Architectures. ICS 2012
Previous work
G. Natale, G. Stramondo, P. Bressana, R. Cattaneo, D. Sciuto, M. D. Santambrogio,
A polyhedral model-based framework for dataflow implementation on FPGA devices of iterative stencil loops. ICCAD 2016: 77
Microarchitecture for a single stencil iteration: Streaming Stencil Timestep (SST)
• Streaming computation
• Dataflow blocks
• Non-uniform memory partitioning
[Figure: SST block diagram with the labels FIFO Channel, Filter (×5), Demux, and PE.]
[Figure: on-chip buffering over an N×M grid for the 5-point stencil A[i-1][j], A[i][j-1], A[i][j], A[i][j+1], A[i+1][j].]
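The streaming computation with on-chip buffering can be sketched in software as follows. This is only an analogy under stated assumptions, not the authors' hardware design: the grid arrives in row-major order, a sliding window of 2M+1 elements plays the role of the on-chip buffer, the update is an assumed 5-point average, and `sst_stream` is a hypothetical name.

```python
from collections import deque


def sst_stream(stream, n, m):
    """Consume an n x m grid in row-major order; yield the updated grid in the same order."""
    # The window spans A[i-1][j] .. A[i+1][j] in row-major order: 2*m + 1 elements.
    window = deque(maxlen=2 * m + 1)
    for idx, value in enumerate(stream):
        window.append(value)
        out = idx - m                  # linear index of the cell that can now be emitted
        if out < 0:
            continue
        i, j = divmod(out, m)
        centre = len(window) - 1 - m   # position of A[i][j] inside the window
        if 0 < i < n - 1 and 0 < j < m - 1:
            yield 0.2 * (window[centre] + window[centre - m] + window[centre + m]
                         + window[centre - 1] + window[centre + 1])
        else:
            yield window[centre]       # border cells pass through unchanged
    # The last row consists of border cells only; flush it from the window.
    yield from list(window)[-m:]


n, m = 3, 4
flat = [float(k * k) for k in range(n * m)]
updated = list(sst_stream(iter(flat), n, m))
```

Each input element is read exactly once, and the buffer size depends only on the row length M, not on the number of rows.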
Previous work
Complete accelerator: chain of SSTs
• Constant off-chip BW requirements
• Dataflow pipelining
[Figure: off-chip memory connected to a chain SST 1, SST 2, …, SST N-1, SST N; each SST is drawn with a FIFO channel, filters, a demux, and a PE.]
Electronic Design Automation framework
• Based on the polyhedral model
• Relies on High Level Synthesis
• Supported languages: C/C++
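A rough software analogy of the SST chain, assuming a 1D 3-point average to keep the example short: each stage is a generator that consumes the previous stage's output, so the input is read from "off-chip memory" once regardless of how many timesteps are chained. The names `sst_1d`, `chain`, and `OffChipMemory` are hypothetical.

```python
from collections import deque


class OffChipMemory:
    """Counts how many elements are read from the (simulated) off-chip memory."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def stream(self):
        for value in self.data:
            self.reads += 1
            yield value


def sst_1d(stream):
    """One streaming timestep of a 3-point average; the two border cells pass through.

    Assumes a non-empty input stream.
    """
    window = deque(maxlen=3)
    for idx, value in enumerate(stream):
        window.append(value)
        if idx == 0:
            continue
        if idx == 1:
            yield window[0]                                  # left border
        else:
            yield (window[0] + window[1] + window[2]) / 3.0  # interior cell
    yield window[-1]                                         # right border


def chain(stream, num_ssts):
    """Connect num_ssts stages back to back, one per timestep."""
    for _ in range(num_ssts):
        stream = sst_1d(stream)
    return stream


memory = OffChipMemory([0.0, 0.0, 9.0, 0.0, 0.0])
result = list(chain(memory.stream(), 2))
```

In this sketch `memory.reads` equals the input length whether 2 or 20 stages are chained, which is the software counterpart of the constant off-chip bandwidth requirement.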
Contributions Overview
Issues with previous work
• Relies on HLS
• No intra-iterations parallelism
Proposed Improvements
• HDL-based design
• Intra-iterations parallelization strategy
• Performance model
Intra-iterations parallelization
Parallelization pattern: consecutive updates
• Read 3 elements per clock cycle from the memory channel
[Figure: filter chain with FIFOs and border handling over A[0..N][0..M], feeding PE0, PE1, and PE2. Elements and the PEs that consume them:
  A[i-1][j] → PE0, A[i-1][j+1] → PE1, A[i-1][j+2] → PE2
  A[i][j-1] → PE0
  A[i][j] → PE1/PE0
  A[i][j+1] → PE2/PE1/PE0
  A[i][j+2] → PE2/PE1
  A[i][j+3] → PE2
  A[i+1][j] → PE0, A[i+1][j+1] → PE1, A[i+1][j+2] → PE2]
[Figure: grids at timestep t and timestep t+1; PE1, PE2, and PE3 update consecutive cells along j.]
• Minimum impact on on-chip memory requirements
• Data reuse among PEs
• Maximize off-chip BW usage
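The data reuse among PEs can be made concrete with a small sketch. It assumes three PEs updating three consecutive cells of one row with a 5-point stencil and a 0.2 weight; `neighbourhood` and `parallel_update` are hypothetical names, and the dictionary merely stands in for the shared filter outputs.

```python
PES = 3  # assumed parallelism degree: PE0, PE1, PE2


def neighbourhood(i, j):
    """Coordinates read by the 5-point stencil centred on (i, j)."""
    return [(i - 1, j), (i, j - 1), (i, j), (i, j + 1), (i + 1, j)]


def parallel_update(grid, i, j):
    """Update cells (i, j) .. (i, j + PES - 1) from one shared set of fetched elements."""
    consumers = {}  # coordinate -> list of PE indices that need it
    for pe in range(PES):
        for coord in neighbourhood(i, j + pe):
            consumers.setdefault(coord, []).append(pe)
    # Each distinct element is fetched once and shared by every PE that needs it.
    fetched = {(r, c): grid[r][c] for (r, c) in consumers}
    results = [0.2 * sum(fetched[coord] for coord in neighbourhood(i, j + pe))
               for pe in range(PES)]
    return results, consumers


grid = [[float(r * 10 + c * c) for c in range(6)] for r in range(3)]
results, consumers = parallel_update(grid, 1, 1)
unique_fetches = len(consumers)      # distinct elements actually fetched
naive_fetches = PES * 5              # fetches if every PE loaded its own copy
```

Under these assumptions the three PEs share 11 distinct elements instead of fetching 15, and the element A[i][j+1] is consumed by all three PEs.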
Performance Model
[Figure: two plots titled "Bandwidth" and "Total Transfer Time". Axis labels: Bandwidth [GB/s] (1.00 to 2.75), Total Transfer Time [s] (0 to 0.4), Number of Packets (0 to 2000), Total Transferred Bytes (0 to 1×10^9).]
Experimental Setup
• Xilinx VC707 board, Virtex-7 FPGA, PCIe 2.0 x8

• 250 MHz target frequency (200 MHz in previous work)

• 2 GB/s BW (800 MB/s in previous work)
[1] S. Grauer-Gray, L. Xu et al., Auto-tuning a high-level language targeted to GPU codes. InPar 2012
[2] J. A. Stratton et al., Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, vol 127, 2012
[3] V. Bandishti, I. Pananilath, U. Bondhugula, Tiling Stencil Computations to Maximize Parallelism. SC 2012: 11
Results (1)
Performance and Energy Efficiency Trend
[Figure: performance (GFLOPS) and power efficiency (GFLOPS/W) against the number of SSTs, compared with previous work [1]. Panels:
  Jacobi-2D: # of SSTs 1, 4, 8, 15, 25, 40, 50
  Jacobi-3D: # of SSTs 1, 4, 8, 16, 25
  Heat-3D: # of SSTs 1, 4, 8, 16, 25
  Row labelled "Parallel.": 4 4 4 1 4 4
  Annotations: ~22x performance, ~8x power efficiency]
[1] G. Natale, G. Stramondo, P. Bressana, R. Cattaneo, D. Sciuto, M. D. Santambrogio,
A polyhedral model-based framework for dataflow implementation on FPGA devices of iterative stencil loops. ICCAD 2016: 77
Results (2)
Model Accuracy
[Figure: predicted vs. actual total execution time (s) against the number of iterations queued, for (a) Jacobi-2D, (b) Game of Life, (c) American Put Option, (d) Jacobi-3D, (e) Heat-3D, (f) 3d7pt.]
Resource Consumption Trend
[Figure: LUT, FF, BRAM, and DSP utilization (%) against the number of iterations chained, for Jacobi-2D, Jacobi-3D, and Heat-3D.]
Conclusions
• Exploit intra- and inter-iterations parallelism
• Efficient on-chip storage
• Performance model
Future Work
• Scale to a multi-FPGA system using custom interconnection boards designed in collaboration with Elysis
Giuseppe Natale - giuseppe.natale@polimi.it
Slides will be available @
www.slideshare.net/necstlab
facebook.com/groups/ReconfigurableArchitecturesWorkshop
Roofline Model
P = min(π, β × I), where I = W / Q
P : attainable performance
π : peak performance
β : peak bandwidth
I : operational intensity
W : work
Q : memory traffic
Williams, Samuel, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures.
Communications of the ACM 52.4 (2009): 65-76.
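A small worked example of the roofline bound, under loudly stated assumptions: the peak performance (100 GFLOPS) and peak bandwidth (10 GB/s) are hypothetical values, not measurements from this work, and the stencil is assumed to perform 5 floating-point operations per cell while moving 8 bytes per cell (one 4-byte read, one 4-byte write). The function name `attainable_performance` is also hypothetical.

```python
def attainable_performance(peak_flops, peak_bandwidth, work, traffic):
    """Roofline bound: P = min(peak performance, peak bandwidth * I), with I = W / Q."""
    intensity = work / traffic                      # FLOP per byte
    return min(peak_flops, peak_bandwidth * intensity), intensity


PEAK_FLOPS = 100e9       # hypothetical peak performance, FLOP/s
PEAK_BANDWIDTH = 10e9    # hypothetical peak bandwidth, byte/s
FLOPS_PER_CELL = 5       # assumed: 4 additions and 1 multiplication
BYTES_PER_CELL = 8       # assumed: one 4-byte read and one 4-byte write

# One timestep per pass over memory: low operational intensity, bandwidth bound.
p_single, i_single = attainable_performance(
    PEAK_FLOPS, PEAK_BANDWIDTH, FLOPS_PER_CELL, BYTES_PER_CELL)

# Twenty timesteps fused on chip: the work grows while the traffic stays the same.
p_chained, i_chained = attainable_performance(
    PEAK_FLOPS, PEAK_BANDWIDTH, 20 * FLOPS_PER_CELL, BYTES_PER_CELL)
```

With these assumed numbers a single timestep per pass is limited by bandwidth, while fusing timesteps raises I until the bound becomes the peak performance.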
