An FPGA-based acceleration methodology and performance model for iterative stencils


Enrico Reggiani, Giuseppe Natale, Carlo Moroni, Marco D. Santambrogio
Reconfigurable Architectures Workshop
Vancouver, British Columbia, Canada
May 21, 2018
Giuseppe Natale - giuseppe.natale@polimi.it

Iterative Stencil Algorithms
[Figure: an N × M grid, indexed by i (0 to N-1) and j (0 to M-1), at timestep t is mapped by the stencil σ to the grid at timestep t+1]
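As a concrete instance of the pattern in the figure, the sketch below applies a 5-point Jacobi-2D stencil for several timesteps. It is a minimal illustration, not code from the presentation: the function names, the 0.2 coefficient, and the choice to copy border cells unchanged are assumptions.

```python
def jacobi_2d_step(grid):
    """Apply one 5-point Jacobi timestep; border cells are copied unchanged (assumption)."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # Every output cell at timestep t+1 reads only timestep-t values.
            new[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1])
    return new


def iterate(grid, timesteps):
    """Run the stencil repeatedly; timestep t+1 cannot start before t is complete."""
    for _ in range(timesteps):
        grid = jacobi_2d_step(grid)
    return grid


# Hypothetical input: a single hot cell in the middle of a 5 x 5 grid.
hot = [[0.0] * 5 for _ in range(5)]
hot[2][2] = 5.0
for row in iterate(hot, 1):
    print(row)
```

The outer loop over timesteps is where the synchronization between iterations comes from: the whole grid is swept once per timestep.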

Applications

Rationale
Problems with Iterative Stencils
• Low Operational Intensity
• Synchronization between timesteps
Target Architectures

Tiling [1, 2]
[1] V. Bandishti, I. Pananilath, U. Bondhugula, Tiling Stencil Computations to Maximize Parallelism. SC 2012: 11
[2] J. Holewinski, L.-N. Pouchet, P. Sadayappan, High-performance Code Generation for Stencil Computations on GPU Architectures. ICS 2012

Previous work
G. Natale, G. Stramondo, P. Bressana, R. Cattaneo, D. Sciuto, M. D. Santambrogio, A polyhedral model-based framework for dataflow implementation on FPGA devices of iterative stencil loops. ICCAD 2016: 77
Microarchitecture for a single stencil iteration: Streaming Stencil Timestep (SST)
• Streaming computation
• Dataflow blocks
• Non-uniform memory partitioning
[Figure: SST microarchitecture built from Filter blocks, FIFO channels, a Demux, and a PE]
[Figure: on-chip buffering of the stencil window A[i-1][j], A[i][j-1], A[i][j], A[i][j+1], A[i+1][j] over an N × M grid]
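The streaming and on-chip buffering idea can be mimicked in software. The sketch below is a Python analogy under stated assumptions, not the hardware design from the cited work: the name `sst`, the 0.2 coefficient, and the pass-through border policy are all hypothetical. The grid arrives as a row-major stream, and only the 2·M + 1 most recent elements are retained, which is exactly the span between A[i-1][j] and A[i+1][j].

```python
from collections import deque


def sst(stream, n, m):
    """Yield one Jacobi-2D timestep of an n x m grid streamed in row-major order."""
    window = deque(maxlen=2 * m + 1)   # stands in for the on-chip buffer
    for k, value in enumerate(stream):
        window.append(value)
        c = k - m                      # flat index of the cell whose south neighbour just arrived
        if c < 0:
            continue
        base = k - len(window) + 1     # flat index of the element stored in window[0]
        i, j = divmod(c, m)
        centre = window[c - base]
        if i == 0 or j == 0 or j == m - 1:
            yield centre               # border cells pass through unchanged (assumption)
        else:
            yield 0.2 * (centre + window[c - m - base] + window[c + m - base]
                         + window[c - 1 - base] + window[c + 1 - base])
    # The last row never receives south neighbours: flush it unchanged.
    yield from list(window)[-m:]


# Hypothetical 4 x 4 input, flattened row by row.
grid = [float(k * k) for k in range(16)]
result = list(sst(iter(grid), 4, 4))
print(result[5], result[10])
```

The buffer size depends only on the row length M, not on the number of rows, so the same structure handles arbitrarily tall grids.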

Complete accelerator: chain of SSTs
[Figure: off-chip memory feeding a chain of SSTs (SST 1, SST 2, …, SST N-1, SST N), each with its own Demux and PE]
• Constant off-chip BW requirements
• Dataflow pipelining
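A small software analogy may help show why chaining keeps the off-chip bandwidth requirement constant. In the sketch below, Python generators stand in for FIFO-connected SSTs, and a 1D 3-point averaging stencil replaces the real kernels for brevity; `stage`, `chain`, and `off_chip_reader` are hypothetical names, not part of the presented framework.

```python
def stage(stream):
    """One timestep of a 3-point 1D averaging stencil over a stream (borders copied)."""
    prev = cur = None
    for nxt in stream:
        if cur is not None:
            yield cur if prev is None else (prev + cur + nxt) / 3.0
        prev, cur = cur, nxt
    if cur is not None:
        yield cur                      # the last element is a border


def chain(stream, timesteps):
    """Connect `timesteps` stages back to back; each consumes its predecessor's output."""
    for _ in range(timesteps):
        stream = stage(stream)
    return stream


def off_chip_reader(data, counter):
    """Model the off-chip memory source and count how many elements are fetched."""
    for x in data:
        counter["reads"] += 1
        yield x


data = [0.0, 0.0, 9.0, 0.0, 0.0]
for n_sst in (1, 3):
    counter = {"reads": 0}
    out = list(chain(off_chip_reader(data, counter), n_sst))
    print(n_sst, "SSTs:", counter["reads"], "off-chip reads ->", out)
```

Whatever the chain length, each input element is fetched once and each result is produced once; the intermediate timesteps only ever exist inside the pipeline.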

Electronic Design Automation Framework
• Based on the polyhedral model

• Relies on High Level Synthesis

• Supported languages: C/C++

Contributions Overview
Issues with previous work
• Relies on HLS
• No intra-iteration parallelism
Proposed Improvements
• HDL-based design
• Intra-iteration parallelization strategy
• Performance model

Intra-iteration parallelization
[Figure: SST with three PEs (PE0, PE1, PE2). The input A[0..N][0..M] arrives over the memory channel, 3 elements are read per clock cycle, and blocks labeled Filter, FIFO, and Border route each element to the PEs that use it:]
A[i-1][j]: PE0    A[i-1][j+1]: PE1    A[i-1][j+2]: PE2
A[i][j-1]: PE0
A[i][j]: PE0, PE1
A[i][j+1]: PE0, PE1, PE2
A[i][j+2]: PE1, PE2
A[i][j+3]: PE2
A[i+1][j]: PE0    A[i+1][j+1]: PE1    A[i+1][j+2]: PE2
Parallelization Pattern: consecutive updates
[Figure: grids at timestep t and timestep t+1, with PE1, PE2, and PE3 updating consecutive elements along j]
• Minimum impact on on-chip memory requirements
• Data reuse among PEs
• Maximize off-chip BW usage
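The data reuse among PEs can be checked with a few lines of code. The sketch below assumes a 5-point stencil and that PE p updates A[i][j + p]; the function name `consumers` and the offset list are illustrative assumptions. For three PEs it lists which PEs read each input element and how many distinct elements are needed in total.

```python
from collections import defaultdict

# Offsets (di, dj) read by a 5-point stencil centred on A[i][j] (assumed kernel shape).
OFFSETS = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]


def consumers(num_pes, offsets=OFFSETS):
    """Map each input offset, relative to A[i][j], to the PEs that read it
    when PE p updates A[i][j + p]."""
    usage = defaultdict(list)
    for pe in range(num_pes):
        for di, dj in offsets:
            usage[(di, dj + pe)].append(pe)
    return dict(usage)


usage = consumers(3)
for (di, dj), pes in sorted(usage.items()):
    readers = ", ".join(f"PE{p}" for p in pes)
    print(f"A[i{di:+d}][j{dj:+d}] is read by {readers}")
print(len(usage), "distinct inputs instead of", 3 * len(OFFSETS))
```

Because neighbouring updates overlap along j, adding a PE adds only a few new inputs to the window, which is consistent with the small impact on on-chip memory noted above.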

Performance Model
[Figure: bandwidth [GB/s] and total transfer time [s], plotted against the number of packets and the total transferred bytes]
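The slide does not spell out the model's equations, so the following is only a simplified bandwidth-bound estimate under loudly stated assumptions, not the authors' performance model: bandwidth is treated as constant, each pass streams the grid once through the chain, and a pass advances as many timesteps as there are SSTs. The function and parameter names are hypothetical.

```python
import math


def estimated_time_s(grid_bytes, iterations, num_ssts, bandwidth_bytes_per_s):
    """Rough bandwidth-bound execution time estimate (assumption, not the published model).

    Each pass streams the whole grid once and advances `num_ssts` timesteps,
    so `iterations` timesteps need ceil(iterations / num_ssts) passes.
    """
    passes = math.ceil(iterations / num_ssts)
    return passes * grid_bytes / bandwidth_bytes_per_s


# Hypothetical workload: 4096 x 4096 single-precision grid, 100 iterations,
# 25 chained SSTs, and the 2 GB/s link quoted in the experimental setup.
grid_bytes = 4096 * 4096 * 4
print(estimated_time_s(grid_bytes, 100, 25, 2e9))
```

A real model would also need to account for bandwidth varying with transfer size, as the plot on this slide suggests, and for pipeline fill latency.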

Experimental Setup
• Xilinx VC707 board, Virtex-7 FPGA, PCIe 2.0 x8
• 250 MHz target frequency (200 MHz in previous work)
• 2 GB/s BW (800 MB/s in previous work)
[1] S. Grauer-Gray, L. Xu et al., Auto-tuning a high-level language targeted to GPU codes. InPar 2012
[2] J. A. Stratton et al., Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, vol. 127, 2012
[3] V. Bandishti, I. Pananilath, U. Bondhugula, Tiling Stencil Computations to Maximize Parallelism. SC 2012: 11

Results (1)
[Figure: Performance and Energy Efficiency Trend. Three panels plot performance (GFLOPS) and power efficiency (GFLOPS/W) against the number of SSTs: Jacobi-2D (1, 4, 8, 15, 25, 40, 50 SSTs), Jacobi-3D (1, 4, 8, 16, 25 SSTs), and Heat-3D (1, 4, 8, 16, 25 SSTs). Entries listed under "Parallel.": 4, 4, 4, 1, 4, 4.]
~22x Performance
~8x Power Efficiency
[1] G. Natale, G. Stramondo, P. Bressana, R. Cattaneo, D. Sciuto, M. D. Santambrogio,

Results (2)
[Figure: Model Accuracy. Predicted versus actual total execution time (s) against the number of iterations queued, for (a) Jacobi-2D, (b) Game of Life, (c) American Put Option, (d) Jacobi-3D, (e) Heat-3D, (f) 3d7pt]
[Figure: Resource Consumption Trend. Utilization (%) of LUT, FF, BRAM, and DSPs against the number of iterations chained, for Jacobi-2D (1 to 50), Jacobi-3D (1 to 25), and Heat-3D (1 to 15)]

Conclusions
• Exploit intra- and inter-iteration parallelism

• Efficient on-chip storage

• Performance Model
Slides will be available @
www.slideshare.net/necstlab
facebook.com/groups/ReconfigurableArchitecturesWorkshop
Future Work
• Scale on a multi-FPGA system using custom interconnection boards designed in collaboration with Elysis

Roofline Model
$P = \min(\pi, \beta \cdot I)$, with $I = W / Q$
P : attainable performance
π : peak performance
β : peak bandwidth
I : operational intensity
W : work
Q : memory traffic
Williams, Samuel, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures.
Communications of the ACM 52.4 (2009): 65-76.
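The roofline bound is easy to evaluate numerically. The sketch below is illustrative only: the function name, the 100 GFLOPS peak, and the per-cell work and traffic figures (5 floating-point operations and 8 bytes, i.e. one 4-byte read and one 4-byte write assuming perfect on-chip reuse) are assumptions; only the 2 GB/s bandwidth comes from the experimental setup slide.

```python
def attainable_performance(peak_perf, peak_bw, work, traffic):
    """Roofline bound: P = min(pi, beta * I), with I = W / Q.

    peak_perf: pi, in FLOP/s; peak_bw: beta, in bytes/s;
    work: W, in FLOP; traffic: Q, in bytes.
    """
    intensity = work / traffic
    return min(peak_perf, peak_bw * intensity)


# Hypothetical single Jacobi-2D timestep per off-chip pass.
one_step = attainable_performance(100e9, 2e9, work=5, traffic=8)
# Hypothetical chain of 25 timesteps per pass: 25x the work for the same traffic.
chained = attainable_performance(100e9, 2e9, work=25 * 5, traffic=8)
print(one_step / 1e9, "GFLOP/s vs", chained / 1e9, "GFLOP/s")
```

Under these assumptions the kernel stays bandwidth-bound in both cases, but performing more timesteps per off-chip pass raises the operational intensity and therefore the bound.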
