A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner Up Award at Digilent Design Contest 2014 Japan Region)

A High Performance Heterogeneous
FPGA-based Accelerator
with PyCoRAM
Team: PyCoRAMist
Shinya Takamaeda-Yamazaki
Tokyo Institute of Technology
JSPS Research Fellow (DC1)
February 21, 2014
Digilent Design Contest @TED Yokohama

The 1st IPSJ SIG-ARC High-Performance
Processor Design Contest (Jan 2014 @Tokyo)
n  A competition of developing a fast
computing system for the specified
applications on the specified platform
n  FPGA board: Digilent Atlys
l  FPGA: Xilinx Spartan-6 LX45
DRAM: DDR2-800 (1.6GB/s)
2014-02-21 Shinya T-Y. Tokyo Tech

4 Specified Contest Applications
Hybrid System of CPU core + HW Accelerator
Matrix Mult & Stencil: Suitable for HW Accelerators
Sort & Shortest Path: Difficult for HW Accelerators

Application   Description                    Requirements for Memory System
310_sort      Integer Sort                   Low Latency
320_mm        Matrix-Matrix Multiplication   High Bandwidth
330_stencil   9-Point Stencil (Integer)      High Bandwidth
340_spath     Shortest Path Search           Low Latency

How to Implement an Accelerator?
n  HDL? NO WAY! It’s so annoying L
l  Implementing the entire system using HDL is hard, because ...
•  Scheduling logic of computations and memory accesses
–  Double buffering requires complicated logics
–  State machine implementation is so annoying and error-prone
l  But, we want define the pipeline design in cycle-level
•  Essential for high performance of FPGA-based accelerators
–  HDL is still good weapon to write just a computation logic
–  The modern high-level synthesis tools are still not effective
n  Memory abstractions make up happy?
CoRAM Memory Architecture

CoRAM (Connected RAM) [Chung+,FPGA’11]
n  Abstract Memory System for FPGAs
l  High-level abstraction for memory management
•  Decoupling computing logics and memory access behaviors
•  Memory access patterns in software model (C language)
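[Figure: CoRAM architecture. HW kernels (computing logic) read and write CoRAM memories (abstracted on-chip memories). Control threads (memory access patterns in C) manage the read/write transfers between CoRAM memories and off-chip memory, and communicate with the HW kernels through CoRAM channels (communication FIFOs/registers).]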

PyCoRAM [Takamaeda+,CARL’13]
n  Python-based implementation of CoRAM memory
architecture for modern FPGA EDKs
l  CoRAM memory abstraction for EDK development flow
n  Key features
l  Control Thread in Python
•  We developed Python-to-Verilog HLS Compiler from scratch
l  AMBA AXI4 Interconnect for on-chip interconnect
•  For IP-core based development on Xilinx Platform Studio (XPS)

PyCoRAM Microarchitecture
[Figure: PyCoRAM microarchitecture. Blocks shown: user I/O, user logic, CoRAM Channel, CoRAM Register, CoRAM Memory with DMAC, CoRAM Stream with DMAC, control thread (FSM), and GPIO.]

[Figure: the same microarchitecture, annotated. The user logic is modeled in RTL (Verilog HDL); the control thread describes the memory access pattern in Python.]
def calc_sum(times):
    ram = CoramMemory(idx=0, datawidth=32, size=1024)
    channel = CoramChannel(idx=0, datawidth=32)
    addr = 0
    sum = 0
    for i in range(times):
        ram.write(0, addr, 128)    # Transfer (off-chip DRAM to BRAM)
        channel.write(addr)        # Notification to User-logic
        sum += channel.read()      # Wait for Notification from User-logic
        addr += 128 * (32/8)
    print('sum=', sum)             # $display Verilog system task

calc_sum(8)
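Note: the control thread above only runs inside the PyCoRAM flow. To follow its data movement on a desktop, one option is a small software model. The sketch below is a minimal one under stated assumptions: MockCoramMemory and MockCoramChannel are hypothetical stand-ins, not PyCoRAM classes; the argument order of write() is read as (BRAM address, DRAM byte address, length in words); and the user logic is assumed to sum the 128 transferred words and return the result through the channel.

```python
from collections import deque

WORD_BYTES = 4  # 32-bit words, matching datawidth=32 on the slide


class MockCoramMemory:
    """Hypothetical stand-in for CoramMemory: copies DRAM words into a BRAM list."""

    def __init__(self, dram, size):
        self.dram = dram          # list of ints modelling off-chip DRAM words
        self.bram = [0] * size    # on-chip memory contents

    def write(self, ram_addr, dram_addr, length):
        # DRAM -> BRAM transfer; dram_addr is assumed to be a byte address.
        start = dram_addr // WORD_BYTES
        self.bram[ram_addr:ram_addr + length] = self.dram[start:start + length]


class MockCoramChannel:
    """Hypothetical stand-in for CoramChannel with the user logic folded in."""

    def __init__(self, ram, length):
        self.ram = ram
        self.length = length
        self.replies = deque()

    def write(self, value):
        # Notification to the user logic, modelled as an immediate block sum.
        self.replies.append(sum(self.ram.bram[:self.length]))

    def read(self):
        # Wait for the user logic's reply (already queued in this model).
        return self.replies.popleft()


def calc_sum(times, dram):
    ram = MockCoramMemory(dram, size=1024)
    channel = MockCoramChannel(ram, length=128)
    addr = 0
    total = 0
    for _ in range(times):
        ram.write(0, addr, 128)
        channel.write(addr)
        total += channel.read()
        addr += 128 * WORD_BYTES
    return total


dram = list(range(8 * 128))   # 8 blocks of 128 words
result = calc_sum(8, dram)
print("sum=", result)
```

The model is sequential, so it shows the order of transfers and notifications but not the concurrency between the control thread and the user logic on the FPGA.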

PyCoRAM IP
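[Figure: PyCoRAM IP inside the FPGA. It contains the same blocks as the microarchitecture (user I/O, user logic, CoRAM Channel, CoRAM Register, CoRAM Memory, CoRAM Stream, control thread (FSM), GPIO). The DMACs of CoRAM Memory and CoRAM Stream each have an AXI I/F to the AXI4 interconnect, which connects to the DRAM controller.]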

FPGA Accelerator for PROCON
n  6-stage MIPS-core + UART loader + Two accelerators
l  XPS automatically synthesizes AXI4 interconnections
l  Clock frequency: logic and AXI: 100MHz, DRAM: 400MHz
AXI4 Interconnect (32-bit, Shared-bus)
DRAM Controller
PyCoRAM
Abstraction
L1-D Cache
(2-way, 32KB,
64bytes/line)
6-stage
MIPS-core
PyCoRAM
Abstraction
Memory
Loader
UART
PyCoRAM
Abstraction
Matrix
Multiplication
Accelerator
PyCoRAM
Abstraction
9-point
Stencil
Accelerator
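Note: the clock and bus-width figures on this slide can be related to the bandwidth numbers quoted elsewhere in the deck (1.6GB/s DRAM peak, about 400MB/s for matrix multiplication, about 130MB/s for the stencil). The sketch below is only back-of-the-envelope arithmetic; it assumes the shared bus moves at most one 32-bit beat per 100MHz cycle, which the slides do not state.

```python
DRAM_PEAK_BYTES_PER_S = 1.6e9   # DDR2-800 figure from the platform slide
AXI_CLOCK_HZ = 100e6            # logic and AXI clock from this slide
AXI_WIDTH_BYTES = 32 // 8       # 32-bit shared bus

# Assumed upper bound: one bus beat per AXI clock cycle.
axi_peak_bytes_per_s = AXI_CLOCK_HZ * AXI_WIDTH_BYTES
fraction_of_dram = axi_peak_bytes_per_s / DRAM_PEAK_BYTES_PER_S

# The stencil slide quotes 1/12 of the total memory bandwidth.
stencil_bytes_per_s = DRAM_PEAK_BYTES_PER_S / 12

print(axi_peak_bytes_per_s / 1e6, "MB/s on the bus")
print(fraction_of_dram, "of the DRAM peak")
print(round(stencil_bytes_per_s / 1e6, 1), "MB/s for the stencil")
```

Under that assumption the bus ceiling is 400MB/s, one quarter of the DRAM peak, which is the same figure the matrix multiplication slide reports. Whether the bus is actually the limiting factor is not confirmed by the deck.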

[Figure: the same system block diagram with percentage annotations on its blocks: 9.8%, 4.5%, 0.4%, 2.5%, 28.1%, 22.5%, 6.3%.]

Matrix-Matrix Multiplication Accelerator
n  Each row of matrix A/B/C is stored on CoRAM memories
l  Data movements between on-chip memory and DRAM are
managed by control threads of PyCoRAM
l  Fully-occupied pipeline for every cycle
l  Double buffering of computations and transmission of mat B
•  Mat B is transposed in advance by the other CoRAM hardware
•  1/4 of the total memory bandwidth is utilized (about 400MB/s)
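[Figure: matrix multiplication accelerator. Computing logic (Verilog HDL): CoRAM Memories 0 to 2 hold A, B, and C; A and B feed an 8-stage multiply pipeline followed by an accumulator (× then +, producing sum), and a check sum is accumulated alongside. The control logic communicates with the control thread (Python) through CoRAM Channel 0.]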
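Note: the scheme on this slide (B transposed in advance, rows streamed through a multiply-accumulate pipeline, double buffering of B) can be modeled in software. The sketch below is an illustration under assumptions, not the contest design: matmul_double_buffered is a hypothetical name, the two-entry buffers list stands in for the two on-chip buffers, and the "transfer" of the next row is an ordinary assignment rather than a concurrent DMA.

```python
def matmul_double_buffered(a, b):
    """Software model: C = A x B using rows of B^T and ping-pong buffers."""
    bt = [list(col) for col in zip(*b)]   # transpose B in advance
    n_cols = len(bt)
    c = []
    for a_row in a:
        buffers = [bt[0], None]           # ping-pong buffers for rows of B^T
        c_row = []
        for j in range(n_cols):
            if j + 1 < n_cols:
                # On hardware this transfer would overlap with the computation.
                buffers[(j + 1) % 2] = bt[j + 1]
            current = buffers[j % 2]
            acc = 0
            for x, y in zip(a_row, current):
                acc += x * y              # one multiply-accumulate per element
            c_row.append(acc)
        c.append(c_row)
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul_double_buffered(a, b)
print(c)
```

Transposing B first means both operands of each dot product are read sequentially, which suits row-wise transfers from DRAM into on-chip memory.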

Stencil Computation Accelerator
n  3 arrays for source and 1 array for result by CoRAM
l  Data movements between on-chip memory and DRAM are
managed by control threads of PyCoRAM
l  The pipeline consumes data of 3 points for every cycle
•  (Sum of input data within latest 3 cycles) / 9
l  Write back of the result, then read the next array
•  1/12 of the total memory bandwidth is utilized (about 130MB/s)
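[Figure: stencil accelerator. Computing logic (Verilog HDL): CoRAM Memories 0 to 3 hold the three source arrays (d0, d1, d2) and the result (rslt); the sources feed a 41-stage add-divide pipeline (+ then /), and a check sum is accumulated alongside. The control logic communicates with the control thread (Python) through CoRAM Channel 0.]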
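Note: the rule "(sum of input data within latest 3 cycles) / 9" can be read as a sliding window over column sums of three adjacent rows. The sketch below is a software model under that reading; stencil_row is a hypothetical name, values are assumed non-negative so that Python's floor division matches integer division, and the 41-stage pipelining is not modeled.

```python
def stencil_row(d0, d1, d2):
    """9-point integer average for the interior points of the middle row d1."""
    window = [0, 0, 0]   # column sums from the latest three cycles
    out = []
    for i, (p0, p1, p2) in enumerate(zip(d0, d1, d2)):
        window[i % 3] = p0 + p1 + p2       # three points consumed this cycle
        if i >= 2:
            out.append(sum(window) // 9)   # result centred on column i - 1
    return out


rows = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
averaged = stencil_row(*rows)
print(averaged)
```

Each input point is read once per row pass, which is why three points per cycle are enough to produce one nine-point result per cycle.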

L1 Data Cache for MIPS-core
n  CoRAM Memory as Data Memory
l  Data replacements are managed by the control thread
•  When a cache miss occurs, a handling request is issued to the CT
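[Figure: L1 data cache. Cache logic (Verilog HDL): CoRAM Memories 0 and 1 with data arrays D0 and D1; Tag0 and Tag1 are each compared (=) to produce the select signal of a MUX that outputs the read data. Signals shown: Addr, Write Data, Write Enable, Read Enable, Read Data, Stall. The control logic communicates with the control thread (Python) through CoRAM Channel 0.]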
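Note: with the parameters given earlier in the deck (2-way, 32KB, 64 bytes/line) the cache has 256 sets, so an address splits into a 6-bit offset, an 8-bit index, and a tag. The sketch below models the lookup and the miss hand-off in software. It is an illustration under assumptions: TwoWayCache and split_address are hypothetical names, the refill method stands in for the control thread, only reads are modeled (no write-back), and the alternating victim choice is an assumption because the slides do not name a replacement policy.

```python
LINE_BYTES = 64
WAYS = 2
CACHE_BYTES = 32 * 1024
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 256 sets


def split_address(addr):
    """Return (tag, index, offset) for a byte address."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % SETS
    tag = addr // (LINE_BYTES * SETS)
    return tag, index, offset


class TwoWayCache:
    """Read-only software model of a 2-way set-associative cache."""

    def __init__(self, dram):
        self.dram = dram
        self.tags = [[None] * WAYS for _ in range(SETS)]
        self.lines = [[None] * WAYS for _ in range(SETS)]
        self.victim = [0] * SETS   # assumed policy: alternate between ways
        self.misses = 0

    def _refill(self, tag, index):
        # Stands in for the control thread handling a miss request.
        way = self.victim[index]
        self.victim[index] ^= 1
        base = (tag * SETS + index) * LINE_BYTES
        self.lines[index][way] = bytes(self.dram[base:base + LINE_BYTES])
        self.tags[index][way] = tag
        self.misses += 1
        return way

    def read_byte(self, addr):
        tag, index, offset = split_address(addr)
        for way in range(WAYS):
            if self.tags[index][way] == tag:   # tag compare, then select
                return self.lines[index][way][offset]
        way = self._refill(tag, index)         # the pipeline would stall here
        return self.lines[index][way][offset]


dram = bytes(i % 251 for i in range(64 * 1024))
cache = TwoWayCache(dram)
first = cache.read_byte(100)                     # miss, line is fetched
second = cache.read_byte(101)                    # hit in the same line
third = cache.read_byte(100 + LINE_BYTES * SETS) # same set, other way
print(SETS, first, second, third, cache.misses)
```

Two addresses that are LINE_BYTES * SETS apart map to the same set, so the second way lets both stay resident.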

Evaluation
n  Evaluation targets
l  Reference design provided by the contest committee (Ref)
l  6-stage MIPS-core+L1 Cache (6-stage)
l  6-stage MIPS-core+L1 Cache + Accelerators (6-stage+ACC)
n  Application dataset
l  Dataset provided for first round match
n  FPGA EDA tools
l  Xilinx Platform Studio 14.6, PlanAhead 14.6
•  Optimization goal: Speed, Optimization Effort: High
•  AXI4 Interconnect: 32-bit Shared bus (Area optimized)
n  Compiler for MIPS-core
l  gcc 4.3.3 (-O3)

Performance
n  =Execution time (not including data transfer time)
n  Drastic speed up compared to the reference design
l  The 6-stage+MIPS-core achieves 3.5 times faster speed
l  The 6-stage+MIPS-core+Accelerators achieves 13.2 times faster
speed at average, 47.1 times faster at maximum
Relative performance (higher is better)
Configuration   310_sort   320_mm   330_stencil   340_spath   Gmean
6-stage         3.9        1.4      5.9           4.7         3.5
6-stage+ACC     3.9        35.2     47.1          4.7         13.2

Execution time [sec]
Configuration   310_sort   320_mm   330_stencil   340_spath
Ref             14.2       14.2     16.0          20.8
6-stage         3.6        9.8      2.7           4.4
6-stage+ACC     3.6        0.4      0.3           4.4
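Note: the Gmean column of the relative-performance chart is the geometric mean of the four per-application speedups. The short check below recomputes it from the chart values; the variable names are illustrative only.

```python
import math

# Per-application speedups over the reference design, as read from the chart.
speedup_6stage = {"310_sort": 3.9, "320_mm": 1.4, "330_stencil": 5.9, "340_spath": 4.7}
speedup_acc = {"310_sort": 3.9, "320_mm": 35.2, "330_stencil": 47.1, "340_spath": 4.7}


def gmean(values):
    """Geometric mean computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


gmean_6stage = round(gmean(speedup_6stage.values()), 1)
gmean_acc = round(gmean(speedup_acc.values()), 1)
print(gmean_6stage, gmean_acc)
```

The geometric mean keeps the two large accelerator speedups from dominating the summary, which is why the average (13.2) sits well below the maximum (47.1).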

Conclusion
n  From IPSJ SIG-ARC High-Performance Processor
Design Contest
n  Development of a heterogeneous FPGA-based
accelerator with PyCoRAM
l  Heterogeneous system of MIPS-core and two accelerators
l  47.1 times faster than the reference design
n  The tool-chain and framework are available on GitHub
l  PyCoRAM: http://shtaxxx.github.io/PyCoRAM/
l  Pyverilog: http://shtaxxx.github.io/Pyverilog/
