A High Performance Heterogeneous
FPGA-based Accelerator
with PyCoRAM
Team: PyCoRAMist
Shinya Takamaeda-Yamazaki
Tokyo Institute of Technology
JSPS Research Fellow (DC1)
February 21, 2014
Digilent Design Contest @TED Yokohama
The 1st IPSJ SIG-ARC High-Performance
Processor Design Contest (Jan 2014 @Tokyo)
■ A competition to develop a fast computing system for the
  specified applications on the specified platform
■ FPGA board: Digilent Atlys
  ● FPGA: Xilinx Spartan-6 LX45
  ● DRAM: DDR2-800 (1.6GB/s)
2014-02-21 Shinya T-Y. Tokyo Tech 2
4 Specified Contest Applications
Hybrid System of CPU core + HW Accelerator

Application   Description                    Requirements for Memory System
310_sort      Integer Sort                   Low Latency
320_mm        Matrix-Matrix Multiplication   High Bandwidth
330_stencil   9-Point Stencil (Integer)      High Bandwidth
340_spath     Shortest Path Search           Low Latency

Matrix Mult & Stencil: Suitable for HW Accelerators
Sort & Shortest Path: Difficult for HW Accelerators
How to Implement an Accelerator?
■ HDL? NO WAY! It's so annoying ☹
  ● Implementing the entire system in HDL is hard, because ...
    • Scheduling logic is needed for computations and memory accesses
      – Double buffering requires complicated logic
      – State machine implementation is so annoying and error-prone
  ● But we want to define the pipeline design at the cycle level
    • Essential for high performance of FPGA-based accelerators
      – HDL is still a good weapon for writing just the computation logic
      – Modern high-level synthesis tools are still not effective
■ Memory abstractions make us happy?
CoRAM Memory Architecture
CoRAM (Connected RAM) [Chung+,FPGA’11]
■ Abstract Memory System for FPGAs
  ● High-level abstraction for memory management
    • Decoupling computing logic from memory access behavior
    • Memory access patterns are described in a software model (C language)
[Figure: CoRAM architecture. HW kernels (computing logic) read and write CoRAM memories (abstracted on-chip memories) and exchange messages with control threads through CoRAM channels (communication FIFOs/registers). Control threads (memory access patterns in C) manage the CoRAM memories and read/write the off-chip memory.]
PyCoRAM [Takamaeda+,CARL’13]
■ Python-based implementation of the CoRAM memory
  architecture for modern FPGA EDKs
  ● CoRAM memory abstraction for the EDK development flow
■ Key features
  ● Control threads in Python
    • We developed a Python-to-Verilog HLS compiler from scratch
  ● AMBA AXI4 for the on-chip interconnect
    • For IP-core based development on Xilinx Platform Studio (XPS)
PyCoRAM Microarchitecture
[Figure: PyCoRAM microarchitecture block diagram. Blocks: User I/O, User Logic, CoRAM Channel, CoRAM Register, CoRAM Memory (with DMAC), CoRAM Stream (with DMAC), Control Thread, FSM, GPIO.]
PyCoRAM Microarchitecture
[Figure: the same block diagram with two annotations: "Modeled in RTL (Verilog HDL)" and "Memory Access Pattern in Python". The Python control thread shown on the slide follows.]

    def calc_sum(times):
        ram = CoramMemory(idx=0, datawidth=32, size=1024)
        channel = CoramChannel(idx=0, datawidth=32)
        addr = 0
        sum = 0
        for i in range(times):
            ram.write(0, addr, 128)    # Transfer (off-chip DRAM to BRAM)
            channel.write(addr)        # Notification to User-logic
            sum += channel.read()      # Wait for Notification from User-logic
            addr += 128 * (32/8)
        print('sum=', sum)             # $display Verilog system task

    calc_sum(8)
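To try the control-thread logic on a host PC without the PyCoRAM tool-chain, the CoRAM objects can be replaced by stand-ins. The sketch below is such a software model. `MockCoramMemory`, `MockCoramChannel`, the `dram` list, and the user-logic behaviour (summing the 128 words just transferred) are all assumptions made for illustration; they are not the PyCoRAM API or the generated hardware.

```python
# Host-side stand-ins for the CoRAM objects (hypothetical, not the PyCoRAM API).

class MockCoramMemory:
    """Models an on-chip BRAM that is filled by DMA from off-chip DRAM."""

    def __init__(self, dram, size):
        self.dram = dram            # list of 32-bit words standing in for DRAM
        self.bram = [0] * size      # on-chip memory contents

    def write(self, bram_addr, dram_byte_addr, num_words):
        # Copy num_words 32-bit words from DRAM into BRAM (DRAM to BRAM),
        # mirroring the direction given in the slide's comment.
        first = dram_byte_addr // 4
        self.bram[bram_addr:bram_addr + num_words] = \
            self.dram[first:first + num_words]


class MockCoramChannel:
    """Models the FIFO pair between the control thread and user logic."""

    def __init__(self, user_logic):
        self.user_logic = user_logic
        self.reply = None

    def write(self, value):
        # Notification to user logic: run it immediately and keep its reply.
        self.reply = self.user_logic(value)

    def read(self):
        # Wait for the notification from user logic (already available here).
        return self.reply


def calc_sum(times, dram, words=128):
    ram = MockCoramMemory(dram, size=1024)
    # Assumed user logic: sum the words that were just transferred.
    channel = MockCoramChannel(lambda _addr: sum(ram.bram[:words]))
    addr = 0
    total = 0
    for _ in range(times):
        ram.write(0, addr, words)
        channel.write(addr)
        total += channel.read()
        addr += words * (32 // 8)   # advance by 128 words of 4 bytes each
    return total


dram = list(range(8 * 128))         # 1024 words: 0, 1, ..., 1023
print('sum=', calc_sum(8, dram))
```

In this model the channel call is synchronous, whereas in hardware the control thread and the user logic run concurrently and the read blocks until the user logic replies.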
PyCoRAM Microarchitecture
[Figure: PyCoRAM IP inside the FPGA. The PyCoRAM IP contains User I/O, User Logic, CoRAM Channel, CoRAM Register, CoRAM Memory, CoRAM Stream, Control Thread, FSM, and GPIO. Each DMAC has an AXI I/F that connects through the AXI4 Interconnect to the DRAM Controller.]
FPGA Accelerator for PROCON
■ 6-stage MIPS-core + UART loader + two accelerators
  ● XPS automatically synthesizes the AXI4 interconnections
  ● Clock frequency: logic and AXI: 100MHz, DRAM: 400MHz
[Figure: system block diagram. Four blocks, each attached through a PyCoRAM abstraction to the AXI4 Interconnect (32-bit, shared bus), which connects to the DRAM Controller: (1) 6-stage MIPS-core with L1-D cache (2-way, 32KB, 64 bytes/line), (2) memory loader with UART, (3) matrix multiplication accelerator, (4) 9-point stencil accelerator.]
[Figure: the same system block diagram, annotated with per-block percentages: 9.8%, 4.5%, 0.4%, 2.5%, 28.1%, 22.5%, 6.3%.]
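The bandwidth figures quoted in this deck can be cross-checked with simple arithmetic. The sketch below assumes that peak bandwidth is bus width times transfer rate, and that the DDR2 interface on the board is 16 bits wide (an assumption, chosen because it reproduces the 1.6GB/s quoted for DDR2-800). The function and variable names are illustrative.

```python
def peak_bandwidth_mb_s(bus_bits, mega_transfers_per_s):
    """Peak bandwidth in MB/s of a bus that is bus_bits wide."""
    return bus_bits / 8 * mega_transfers_per_s


# DDR2-800: 800 mega-transfers per second; the 16-bit width is an assumption.
dram_peak = peak_bandwidth_mb_s(16, 800)
# AXI4 interconnect: 32-bit shared bus clocked at 100MHz (from the slide).
axi_peak = peak_bandwidth_mb_s(32, 100)

print('DRAM peak  :', dram_peak, 'MB/s')
print('AXI peak   :', axi_peak, 'MB/s')
print('AXI / DRAM :', axi_peak / dram_peak)
```

Under these assumptions the 32-bit shared bus at 100MHz tops out at 400MB/s, one quarter of the DRAM peak. That lines up with the 1/4 utilization (about 400MB/s) reported for the matrix multiplication accelerator, which suggests, although the slides do not say so, that the interconnect rather than the DRAM sets the limit there.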
Matrix-Matrix Multiplication Accelerator
■ Each row of matrices A, B, and C is stored in CoRAM memories
  ● Data movement between on-chip memory and DRAM is
    managed by PyCoRAM control threads
  ● The pipeline is fully occupied in every cycle
  ● Double buffering overlaps computation with the transfer of matrix B
    • Matrix B is transposed in advance by other CoRAM hardware
    • 1/4 of the total memory bandwidth is utilized (about 400MB/s)
[Figure: matrix multiplication datapath. Computing logic (Verilog HDL): CoRAM Memories 0 to 2 hold rows of A, B, and C; an 8-stage multiply pipeline (×) feeds an accumulator (+, sum), and a checksum (check sum) is accumulated alongside. Control logic communicates with the control thread (Python) through CoRAM Channel 0.]
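The dataflow on this slide can be mimicked in software. The following is a minimal sketch, assuming integer matrices held as lists of rows. It transposes B first so that every inner product streams two rows, and it alternates between two row buffers to stand in for double buffering. The function names and the buffer handling are illustrative assumptions, not the accelerator's Verilog or its control thread.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul_rowwise(a, b):
    """Multiply a by b, streaming a row of a against a row of b-transposed."""
    bt = transpose(b)                 # done in advance, as on the slide
    n = len(bt)
    c = []
    for a_row in a:
        c_row = []
        buffers = [bt[0], None]       # two buffers for rows of b-transposed
        for j in range(n):
            cur = buffers[j % 2]
            # "Prefetch" the next row into the other buffer; in hardware this
            # transfer would overlap with the multiply-accumulate loop below.
            if j + 1 < n:
                buffers[(j + 1) % 2] = bt[j + 1]
            acc = 0
            for x, y in zip(a_row, cur):
                acc += x * y          # one multiply-accumulate per element pair
            c_row.append(acc)
        c.append(c_row)
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_rowwise(a, b))
```

Transposing B turns the column accesses of a textbook matrix multiply into sequential row accesses, which is what lets a burst transfer fill a buffer with exactly the operands the next inner product needs.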
Stencil Computation Accelerator
■ CoRAM holds 3 arrays for the source and 1 array for the result
  ● Data movement between on-chip memory and DRAM is
    managed by PyCoRAM control threads
  ● The pipeline consumes 3 points of data in every cycle
    • (Sum of input data within the latest 3 cycles) / 9
  ● The result is written back, then the next array is read
    • 1/12 of the total memory bandwidth is utilized (about 130MB/s)
[Figure: stencil datapath. Computing logic (Verilog HDL): CoRAM Memories 0 to 3 supply the inputs d0, d1, d2 and store the result (rslt); a 41-stage add-divide pipeline (+, /) computes the output, and a checksum (check sum) is accumulated alongside. Control logic communicates with the control thread (Python) through CoRAM Channel 0.]
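The "3 points per cycle" formulation can be modeled in a few lines. This is a sketch under stated assumptions: each loop iteration stands for one cycle, only interior points are produced, and the division is Python floor division (the slide does not state the border handling or the hardware's rounding). The names `stencil_row` and `stencil` are illustrative.

```python
from collections import deque


def stencil_row(up, mid, down):
    """9-point integer stencil for one output row from three source rows.

    Each iteration reads one point from each of the three source rows and
    keeps the column sums of the latest 3 cycles.
    """
    window = deque(maxlen=3)          # column sums of the latest 3 cycles
    out = []
    for d0, d1, d2 in zip(up, mid, down):
        window.append(d0 + d1 + d2)
        if len(window) == 3:
            out.append(sum(window) // 9)   # floor division (assumed)
    return out


def stencil(grid):
    """Apply the stencil to every interior point of a 2-D list of integers."""
    return [stencil_row(grid[r - 1], grid[r], grid[r + 1])
            for r in range(1, len(grid) - 1)]


grid = [[9] * 5 for _ in range(4)]
print(stencil(grid))
```

Summing each column once and reusing it for three consecutive outputs means only 3 new points are read per output instead of 9, which matches the slide's description of the pipeline input.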
L1 Data Cache for MIPS-core
■ CoRAM memory as data memory
  ● Data replacement is managed by the control thread
    • When a cache miss occurs, a handling request is issued to the control thread (CT)
[Figure: L1 data cache. Cache logic (Verilog HDL): CoRAM Memories 0 and 1 hold the data arrays D0 and D1; Tag0 and Tag1 are compared (=) to drive the select input of a MUX that produces Read Data. Signals: Addr, Write Data, Write Enable, Read Enable, Stall, plus pipeline registers (reg). Control logic communicates with the control thread (Python) through CoRAM Channel 0.]
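The cache geometry given earlier in the deck (2-way, 32KB, 64 bytes/line) fixes the address split, and the miss path can be modeled as a callback that stands in for the control thread. The sketch below is a tag-only model under assumptions: LRU replacement is assumed (the slides do not name a policy), write-back of dirty lines is not modeled, and `TwoWayCache`, `split_address`, and `refill` are hypothetical names.

```python
LINE_BYTES = 64
WAYS = 2
CACHE_BYTES = 32 * 1024
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 256 sets


def split_address(addr):
    """Split a byte address into (tag, set index, byte offset)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % SETS
    tag = addr // (LINE_BYTES * SETS)
    return tag, index, offset


class TwoWayCache:
    """Tag store only; a miss calls refill(), standing in for the control thread."""

    def __init__(self, refill):
        self.tags = [[None] * WAYS for _ in range(SETS)]
        self.lru = [0] * SETS          # way to evict next (LRU is an assumption)
        self.refill = refill
        self.misses = 0

    def access(self, addr):
        tag, index, _ = split_address(addr)
        for way in range(WAYS):
            if self.tags[index][way] == tag:
                self.lru[index] = 1 - way      # the other way is now older
                return True                    # hit
        # Miss: the pipeline would stall while the control thread moves data.
        self.misses += 1
        victim = self.lru[index]
        self.refill(addr - addr % LINE_BYTES, victim)
        self.tags[index][victim] = tag
        self.lru[index] = 1 - victim
        return False


requests = []
cache = TwoWayCache(lambda line_addr, way: requests.append((line_addr, way)))
for a in (0x0000, 0x0004, 0x4000, 0x0000, 0x8000, 0x4000):
    cache.access(a)
print('sets:', SETS, 'misses:', cache.misses, 'refills:', requests)
```

The three addresses 0x0000, 0x4000, and 0x8000 all map to set 0, so the third one forces an eviction; the `1 - way` bookkeeping only works because there are exactly two ways.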
Evaluation
■ Evaluation targets
  ● Reference design provided by the contest committee (Ref)
  ● 6-stage MIPS-core + L1 cache (6-stage)
  ● 6-stage MIPS-core + L1 cache + accelerators (6-stage+ACC)
■ Application dataset
  ● Dataset provided for the first-round match
■ FPGA EDA tools
  ● Xilinx Platform Studio 14.6, PlanAhead 14.6
    • Optimization goal: Speed, Optimization effort: High
    • AXI4 Interconnect: 32-bit shared bus (area optimized)
■ Compiler for MIPS-core
  ● gcc 4.3.3 (-O3)
Performance
■ Performance = execution time (not including data transfer time)
■ Drastic speedup compared to the reference design
  ● The 6-stage MIPS-core is 3.5 times faster (geometric mean)
  ● The 6-stage MIPS-core + accelerators is 13.2 times faster
    on average (geometric mean) and 47.1 times faster at maximum

Relative performance (speedup over Ref)
Application   6-stage   6-stage+ACC
310_sort      3.9       3.9
320_mm        1.4       35.2
330_stencil   5.9       47.1
340_spath     4.7       4.7
Gmean         3.5       13.2

Execution time [sec]
Application   Ref    6-stage   6-stage+ACC
310_sort      14.2   3.6       3.6
320_mm        14.2   9.8       0.4
330_stencil   16.0   2.7       0.3
340_spath     20.8   4.4       4.4
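The "Gmean" column can be reproduced from the four per-application speedups. The short check below computes a geometric mean with the standard library; the input values are the rounded numbers read from the chart, so small rounding differences are to be expected, and the function name is illustrative.

```python
import math


def geometric_mean(values):
    """Geometric mean, computed as the exponential of the mean logarithm."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Relative performance per application, in the order
# 310_sort, 320_mm, 330_stencil, 340_spath (values read from the chart).
six_stage = [3.9, 1.4, 5.9, 4.7]
six_stage_acc = [3.9, 35.2, 47.1, 4.7]

print('6-stage     Gmean: %.1f' % geometric_mean(six_stage))
print('6-stage+ACC Gmean: %.1f' % geometric_mean(six_stage_acc))
```

Both results round to the reported 3.5 and 13.2, which indicates that the "average" speedup on the slide is a geometric rather than an arithmetic mean.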
Conclusion
■ From the IPSJ SIG-ARC High-Performance Processor
  Design Contest
■ Development of a heterogeneous FPGA-based
  accelerator with PyCoRAM
  ● Heterogeneous system of a MIPS-core and two accelerators
  ● Up to 47.1 times faster than the reference design
■ The tool-chain and framework are available on GitHub
  ● PyCoRAM: http://shtaxxx.github.io/PyCoRAM/
  ● Pyverilog: http://shtaxxx.github.io/Pyverilog/

A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner Up Award at Digilent Design Contest 2014 Japan Region)
