FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems
Ryohei Kobayashi, Kenji Kise
Tokyo Institute of Technology, Japan
MCSoC-15 @ Turin, Italy
Embedded Multicore/Many-core Architectures
13:40-14:10, September 23, 2015
Introduction
l Multi-core and many-core processors have
been mainstream to accelerate applications
by parallel processing
l # of cores has been increased depending
on Moore s Law
Multi-core and Many-core Processor
2
2006
Core 2 Duo
Conroe
(65nm)
2
Cores
2008
Core 2 Quad
Yorkfield
(45nm)
4
Cores
2010
Xeon 7500
Nehalem EX
(32nm)
8
Cores
2012
Xeon Phi
Knights Corner
(22nm)
50+
Cores
l The end of Moore s Law means that
approaches relying on the following points
are hopeless...
Ø The increase in # of cores (especially)
Ø Implementation of rich features
Ø Etc...
Will Moore s Law Continue?
3
• Intel: Yes. But we have motivation for accelerators [1]. That's why we bought Altera.
• Microsoft: No, Moore's Law is ending. That's why hardware specialization will be critical [2].
The dedicated hardware era is coming!
Microsoft is a trademark of the Microsoft group of companies.
[1] Gupta et al., Xeon+FPGA Platform for the Data Center, CARL 2015 (co-located with ISCA 2015)
[2] Putnam et al., A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA 2014
l Sorting is a fundamental computation kernel
This work: Sorting Accelerator
8
Databases[1] Image Processing[2] Data Compression[3]
[1] Rene Mueller et al, Sorting Networks on FPGAs, The VLDB Journal 2012
[2] Ratnayake, K et al,
An FPGA Architecture of Stable-Sorting on a Large Data Volume : Application to Video Signals,
CISS 2007
[3] Martinez, J et al,
An FPGA-based parallel sorting architecture for the Burrows Wheeler transform,
ReConFig 2005
l Sorting is a fundamental computation kernel
This work: Sorting Accelerator
9
[1] Rene Mueller et al, Sorting Networks on FPGAs, The VLDB Journal 2012
[2] Ratnayake, K et al,
An FPGA Architecture of Stable-Sorting on a Large Data Volume : Application to Video Signals,
CISS 2007
[3] Martinez, J et al,
An FPGA-based parallel sorting architecture for the Burrows Wheeler transform,
ReConFig 2005
Databases[1] Image Processing[2] Data Compression[3]
!
Sorting accelerators fulfilling the
following requirements do not exist...
Problem
ü  High Performance
ü  Customizable
ü  Open sourced
Our Proposed Sorting Accelerator
• The proposed accelerator uses the following sorting architectures:
  ➢ The sorting network
  ➢ The merge sorter tree
[Figure: the proposed sorting accelerator, combining a 4-input sorting network with a merge sorter tree]
l A sorting architecture composed of wires
and comparators
l Example: Sorting 4 values in the network
Ø Smaller and larger values are carried to the top
and bottom
The Sorting Network*
12
1
4
3
2
4
3
2
1
Bubble sort network with 4-inputs and 4-outputs
* Donald E. Knuth. The Art of Computer Programming. 1998.
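The comparator structure described above can be sketched in software. This is a minimal model, assuming an ascending order with the smallest value on wire 0; the names `compare_exchange`, `BUBBLE_NETWORK_4`, and `sort_network` are illustrative and are not taken from the FACE source code.

```python
def compare_exchange(wires, i, j):
    """Model one comparator: smaller value goes to wire i, larger to wire j."""
    if wires[i] > wires[j]:
        wires[i], wires[j] = wires[j], wires[i]


# Comparators of a 4-input bubble sort network (adjacent wires only).
# Assumed wiring for illustration: three passes of shrinking length.
BUBBLE_NETWORK_4 = [(0, 1), (1, 2), (2, 3), (0, 1), (1, 2), (0, 1)]


def sort_network(values, network=BUBBLE_NETWORK_4):
    """Push the values through a fixed sequence of comparators."""
    wires = list(values)
    for i, j in network:
        compare_exchange(wires, i, j)
    return wires


print(sort_network([1, 4, 3, 2]))
```

Because the comparator sequence is fixed and independent of the data, the same structure maps directly onto wires and comparators in hardware.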
l A data path that executes merge process
The Merge Sorter Tree*
13
>
>
>
FIFO
Sorter Cell>
4-way merge sorter tree
* Dirk Koch et al, FPGASort, FPGA 11
l Sorting process in the merge sorter tree
Ø The data sequences in the leftmost FIFOs must
be sorted
The Merge Sorter Tree
14
>
>
>
>
>
>
>
>
>
89
35
13
22 1
3
3
22
1
2
1
89
5
3
7 5
1
2
3
2
5
22
3
7
98
x
x: Invalid Value
Cycle N Cycle N+1 Cycle N+2
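The merge process can be modelled in a few lines. The sketch below is a functional model only, not a cycle-accurate one: Python generators stand in for the FIFOs, and `sorter_cell` and `merge_tree` are hypothetical names chosen for illustration. It assumes integer inputs in ascending order and a number of input streams that is a power of two.

```python
def sorter_cell(left, right):
    """Merge two ascending streams by repeatedly emitting the smaller head."""
    left, right = iter(left), iter(right)
    a, b = next(left, None), next(right, None)
    while a is not None or b is not None:
        if b is None or (a is not None and a <= b):
            yield a
            a = next(left, None)
        else:
            yield b
            b = next(right, None)


def merge_tree(streams):
    """Connect sorter cells pairwise until a single root stream remains."""
    streams = list(streams)
    while len(streams) > 1:
        streams = [sorter_cell(streams[i], streams[i + 1])
                   for i in range(0, len(streams), 2)]
    return streams[0]


# 4-way tree: four sorted inputs, one sorted output at the root.
print(list(merge_tree([[1, 5, 9], [2, 3], [4, 8], [6, 7]])))
```

If an input stream is not sorted, the root output is not sorted either, which is why the leftmost FIFOs must hold sorted sequences.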
Data Path of the Proposed Sorting Accelerator
[Figure: data path of the accelerator. Labels outside the FPGA: Debug HW, Host PC, DRAM. Labels inside the FPGA: Initial Data Generator, DRAM Controller, UART Controller, Sorting Network (IN and OUT stages), Input Buffer and Output Buffer (built from 512-bit shift registers), Merge Sorter Tree. Labelled bus widths: 512, 64, 32, and 1.]
Example: Sorting 256 elements
l The generated initial data sequence is
stored in the external memory
Sorting 256 Elements from 256 to 1
17
256 255 254 … 64 63 …
3 2 1
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
DRAM
l Initialization is done
Sorting 256 Elements from 256 to 1
18
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
256
255
254
253
252
1
DRAM
l The data is sent to Sorting Network
Sorting 256 Elements from 256 to 1
19
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
256
255
254
…
6463…321
256
255
254
253
252
1
IN OUT
Stage
DRAM
l Sorting Network can sort 16 elements
Ø The initial data sequence turns into 16 sorted
data sequences by passed through this network
Sorting 256 Elements from 256 to 1
20
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
16 … 3 2 1 32 … 19 18 17 256 … 243 242 241
This is sorted
256
255
254
253
252
1
DRAM
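The effect of this first stage on the 256-element example can be sketched as follows. Python's built-in `sorted` stands in for the 16-input sorting network here, and `initial_runs` is a hypothetical helper name, so this shows only the data layout, not the hardware.

```python
def initial_runs(data, width=16):
    """Split the input into groups of `width` elements and sort each group."""
    return [sorted(data[i:i + width]) for i in range(0, len(data), width)]


data = list(range(256, 0, -1))   # 256, 255, ..., 1 as in the example
runs = initial_runs(data)
print(len(runs))                 # number of sorted runs
print(runs[0][:3], runs[-1][:3]) # first values of the first and last run
```

Each run is sorted internally, but the runs are not yet ordered relative to one another; that is the job of the merge sorter tree.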
l The data passed through the network is
stored in Input Buffer, and sent to Merge
Sorter Tree
Sorting 256 Elements from 256 to 1
21
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
241242243256…
225226240…
224…
208…
227
210211
194195
209
193
256
255
254
253
252
1
DRAM
l The root of the tree emits sorted data
sequences
Sorting 256 Elements from 256 to 1
22
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
241242243256…
226227240…
224…
208…
228
210211 209
193 194
195
…
196
201
202
225
203
256
255
254
253
252
1
DRAM
l The data sequence composed of 16 Units
turns into 4 sorted data sequences
Sorting 256 Elements from 256 to 1
23
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
64 … 3 2 1 128 … 67 66 65 256 … 195 194 193
This is sorted (64 elements)
192 … 131 130 129256
255
254
253
252
1
DRAM
l The data is stored in the external memory
Sorting 256 Elements from 256 to 1
24
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
193 194 195 196 197
198
199 200 201 202 203 204 ……
Store Area256
255
254
253
252
1
DRAM
l This data is not fully sorted yet...
Ø This data has to be sent to Merge Sorter Tree
again
Sorting 256 Elements from 256 to 1
25
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
256
255
254
253
252
1
193
194
195
62
63
64
DRAM
l The data is read form DRAM and sent to
Sorting Network
Sorting 256 Elements from 256 to 1
26
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
193
194
195
…
129130…626364
Load Area256
255
254
253
252
1
193
194
195
62
63
64
DRAM
l In this time, this network is a mere data
path because portions of the data sequence
are already sorted
Sorting 256 Elements from 256 to 1
27
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
256
255
254
253
252
1
193
194
195
62
63
64
DRAM
l The data passed through the network is
stored in Input Buffer, and sent to Merge
Sorter Tree
Sorting 256 Elements from 256 to 1
28
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
256
255
254
253
252
1
193
194
195
62
63
64
193194195256…
129130192…
128…
64…
131
6667
23
65
1
DRAM
l The root of the tree emits sorted data
sequences
Sorting 256 Elements from 256 to 1
29
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
256
255
254
253
252
1
193
194
195
62
63
64
64…
1 2 3
…
4
51
52
225
53
193194195256…
129130192…
128…
131
6667 65
DRAM
l The data is stored in the external memory
Sorting 256 Elements from 256 to 1
30
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
1 2 3 4 5
6
7 8 9 10 11 12 ……
Store Area256
255
254
253
252
1
193
194
195
62
63
64
DRAM
l This data is fully sorted !!
Ø The data is fully sorted by passed through the
network and the tree as required
﹣ log# of ways(# of elements/16)
Sorting 256 Elements from 256 to 1
31
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
1
2
3
4
5
256
193
194
195
62
63
64
DRAM
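The whole walkthrough can be checked with a small functional model. This is a sketch under stated assumptions: `sorted` stands in for the 16-input sorting network, the standard-library `heapq.merge` stands in for one k-way merge sorter tree, and `sort_in_passes` is a hypothetical name. It counts how many times the data passes through the tree.

```python
import heapq


def sort_in_passes(data, ways=4, run=16):
    """Sort `data` by building 16-element runs, then merging `ways` runs per pass."""
    runs = [sorted(data[i:i + run]) for i in range(0, len(data), run)]
    passes = 0
    while len(runs) > 1:
        # One pass through the tree: every group of `ways` runs becomes one run.
        runs = [list(heapq.merge(*runs[i:i + ways]))
                for i in range(0, len(runs), ways)]
        passes += 1
    return (runs[0] if runs else []), passes


result, passes = sort_in_passes(list(range(256, 0, -1)), ways=4)
print(passes)        # number of passes through the merge sorter tree
print(result[:5])    # start of the sorted output
```

For 256 elements and a 4-way tree the run count goes 16, 4, 1, which matches the two passes shown on the slides.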
l The fully sorted data is sent to Host PC
Ø To verify that the accelerator accurately works
Sorting 256 Elements from 256 to 1
32
DRAM
Host PC
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
512-bit
shift register
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
1
2
3
4
5
256
193
194
195
62
63
64
Load Area
1 2 3
4
56…
DRAM
l Duplication of the merge sorter tree
Data Path of the Accelerator with the
Duplicated Merge Sorter Tree
33
512-bit
shift register
512-bit
shift register
512-bit
shift register
>
>
>
32
32
32
32
3232
32
32
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512-bit
shift register
32
512
512
512
512
Input Buffer
Merge Sorter Tree
Initial Data
Generator
Sorting
Network
512
512
512
512
DRAM
Controller
UART
Controller
Output Buffer 32512512
512
512
512
512
1
64
FPGA
Duplicated Logics
512-bit
shift register
l Effectiveness
Ø To sort data sequences in parallel
﹣ The sorting logic throughput is improved
Duplication of the Merge Sorter Tree
34
The accelerator with four 4-way trees sorts the initial data sequence
193 194 … 256 129 130 … 192 65 66 … 128 1 2 … 64
256 … 194 193 192 … 130 129 128 … 66 65 64 … 2 1
Initial Data Sequence
Sorting the data in parallel
Tree 0 Tree 1 Tree 2 Tree 3
Executing merge process in a tree
1 2 3 4 5 6 7 8 … … … … 253 254 255 256
Sorting is done!!!
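The partition-then-merge flow in the figure can be sketched as below. The model runs sequentially, so it illustrates only the data movement, not the speed-up; `sorted` stands in for the work done by each duplicated tree, `heapq.merge` for the final merge in a single tree, and `sort_with_duplicated_trees` is a hypothetical name.

```python
import heapq


def sort_with_duplicated_trees(data, trees=4):
    """Give each tree one contiguous partition, then merge the sorted partitions."""
    chunk = max(1, -(-len(data) // trees))   # ceiling division, at least 1
    partitions = [sorted(data[i:i + chunk])
                  for i in range(0, len(data), chunk)]
    return partitions, list(heapq.merge(*partitions))


partitions, result = sort_with_duplicated_trees(list(range(256, 0, -1)), trees=4)
print([(p[0], p[-1]) for p in partitions])   # value range handled by each tree
print(result[:4], result[-4:])
```

In hardware the partitions are sorted at the same time, which is where the throughput gain comes from.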
l This accelerator is customizable by tuning #
of ways and duplicated trees
l This accelerator performance can be
formulated
Ø Designer can estimate accelerator performance
in advance and implement the best one fulfilling
hardware resource constraints
Characteristics of the Accelerator
35
# of required cycles to finish sorting
(k: # of ways, P: # of trees, N: # of elements)
36
Evaluation
l Implementation Platform
Ø Xilinx FPGA VC707 Evaluation Kit
Hardware Setup(1/2)
37
Power In
UART Port
JTAG Port
DDR3 SO-DIMM DRAM
(4GB*, 800MHz/1600Mbps)
* This kit originally has 1GB DDR3 SO-DIMM (800MHz/1600Mbps) memory
Virtex-7 VX485T FPGA
l All logics are implemented in Verilog HDL
l Design Tool: Vivado2014.4
l Operating frequency
Ø Logic: 200MHz,Memory bus: 800MHz
l Initial Data Generator
Ø Supporting the following data-generation types
﹣ A random data sequence using Xorshift*
﹣ A sorted data sequence
﹣ A reverse-order sorted data sequence
Hardware Setup(2/2)
38
* George Marsaglia, Xorshift RNGs, Journal of Statistical Software 2003.
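The three data-generation types can be sketched in software. The slides do not state which Xorshift variant or seed the hardware generator uses, so the 32-bit variant with shifts 13, 17, and 5 and the default seed from Marsaglia's paper are assumptions here; `xorshift32` and `generate` are illustrative names.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(seed=2463534242):
    """Yield 32-bit pseudo-random values (assumed variant: shifts 13, 17, 5)."""
    x = seed & MASK32
    if x == 0:
        raise ValueError("xorshift state must be non-zero")
    while True:
        x ^= (x << 13) & MASK32
        x ^= x >> 17
        x ^= (x << 5) & MASK32
        yield x


def generate(kind, n, seed=2463534242):
    """Return n values of the requested type: 'xorshift', 'sorted', or 'reverse'."""
    if kind == "xorshift":
        rng = xorshift32(seed)
        return [next(rng) for _ in range(n)]
    if kind == "sorted":
        return list(range(1, n + 1))
    if kind == "reverse":
        return list(range(n, 0, -1))
    raise ValueError(f"unknown data type: {kind}")


print(generate("reverse", 4))
print(generate("xorshift", 3, seed=1))
```

Xorshift needs only shifts and XORs, which makes it cheap to implement in FPGA logic.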
Evaluation
• Metrics: sorting time and hardware resource usage
  ➢ Dataset: 256M 32-bit integer values
• Baseline: Intel Core i7-4770 @ 3.4GHz
  ➢ A single thread
  ➢ gcc 4.8.2 (-O3 optimization)
  ➢ Sorting algorithms
    - Merge sort
    - Quick sort
• How the execution time is measured
  ➢ FPGA: from the number of execution cycles
  ➢ CPU: with gettimeofday
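The two measurements can be put on the same scale as sketched below. The 200MHz figure is the logic clock from the hardware setup; `time.perf_counter` and the built-in `sorted` are Python stand-ins for `gettimeofday` and the C sorting routines used in the talk, and both function names are hypothetical.

```python
import time

LOGIC_CLOCK_HZ = 200_000_000  # 200MHz logic clock from the hardware setup


def cycles_to_seconds(cycles, clock_hz=LOGIC_CLOCK_HZ):
    """Convert an FPGA cycle count into seconds."""
    return cycles / clock_hz


def time_cpu_sort(data):
    """Time a CPU sort with a wall-clock timer (stand-in for gettimeofday)."""
    start = time.perf_counter()
    result = sorted(data)
    return result, time.perf_counter() - start


print(cycles_to_seconds(200_000_000))      # one second worth of cycles
_, elapsed = time_cpu_sort(list(range(100_000, 0, -1)))
print(f"{elapsed:.6f} s")
```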
l Sorting performance
Ø It is improved as # of ways and trees is larger
Ø It is independent of data-sequence types
Ø It is almost same as estimated one
Evaluation: Sorting Performance
40
0
10
20
30
40
50
60
m
erge
sortquick
sort
4-w
ay
4-w
ay/2-parallel
4-w
ay/4-parallel
8-w
ay
8-w
ay/2-parallel
8-w
ay/4-parallel
8-w
ay/8-parallel
16-w
ay
16-w
ay/2-parallel
16-w
ay/4-parallel
SortingProcessTime[sec]
xorshift
sorted
reverse
Estimated
Evaluation: Sorting Performance of 8-way/8-parallel
• For a random data sequence, the 8-way/8-parallel configuration is
  ➢ 10.06x faster than merge sort
  ➢ 8.01x faster than quick sort
[Figure: bar chart of sorting process time [sec] (axis from 0 to 60) for merge sort, quick sort, and 8-way/8-parallel; series: xorshift, sorted, reverse]
l Most of the designs can be implemented on
low-end devices
Ø This accelerator is available on even embedded
systems
Customizable: the Performance and the
Hardware Resource Usage
42
4-way
4-way/2-parallel
4-way/4-parallel
8-way
8-way/2-parallel
8-way/4-parallel
8-way/8-parallel
16-way
16-way/2-parallel
16-way/4-parallel
0
2
4
6
8
10
12
0 10000 20000 30000 40000 50000 60000 70000
Speed-upRatio
# of Slices
Cost Performance
Available 15,850 Slices
(Artix-7 XC7A100T)
Available 50,950 Slices
(Kintex-7 XC7K325T)
l FACE is available on GitHub
Ø https://github.com/monotone-RK/FACE
l Currently, FACE can work on Xilinx FPGA
VC707 Evaluation Kit
Ø We will try to port another environment if you
have requests and if possible
Open Sourced
43Xilinx FPGA VC707 Evaluation Kit
44
45
Conclusion
l FACE: Fast and Customizable Sorting
Accelerator for Heterogeneous Many-core
Systems
Ø This accelerator is customizable by tuning # of
ways and duplicated trees
Ø This accelerator performance can be formulated
Ø Open sourced
﹣ Available on GitHub
(https://github.com/monotone-RK/FACE)
l Future Work
Ø Performance evaluation including data transfer
﹣ Bus system like AXI4 or Avalon, NoC, PCIe, etc...
Conclusion
46

FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

  • 1. FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems Ryohei Kobayashi, Kenji Kise Tokyo Institute of Technology, Japan MCSoC-15@Turin, Italy Embedded Multicore/Many-core Architectures 13:40-14:10, September 23, 2015
  • 3. l Multi-core and many-core processors have been mainstream to accelerate applications by parallel processing l # of cores has been increased depending on Moore s Law Multi-core and Many-core Processor 2 2006 Core 2 Duo Conroe (65nm) 2 Cores 2008 Core 2 Quad Yorkfield (45nm) 4 Cores 2010 Xeon 7500 Nehalem EX (32nm) 8 Cores 2012 Xeon Phi Knights Corner (22nm) 50+ Cores
  • 4. l The end of Moore s Law means that approaches relying on the following points are hopeless... Ø The increase in # of cores (especially) Ø Implementation of rich features Ø Etc... Will Moore s Law Continue? 3
  • 5. Will Moore s Law Continue? 4 Yes. But we have motivation for accelerators [1]. That s why we bought Altera [1] Cupta et al, Xeon+FPGA Platform for the Data Center, CARL 2015 (Co-located with ISCA 2015)
  • 6. Will Moore s Law Continue? 5 Microsoft is a trademark of the Microsoft group of companies Yes. But we have motivation for accelerators [1]. That s why we bought Altera [1] Cupta et al, Xeon+FPGA Platform for the Data Center, CARL 2015 (Co-located with ISCA 2015)
  • 7. Will Moore s Law Continue? 6 Microsoft is a trademark of the Microsoft group of companies Yes. But we have motivation for accelerators [1]. That s why we bought Altera No, Moore s Law is ending. That s why hardware specialization will be critical [2] [1] Cupta et al, Xeon+FPGA Platform for the Data Center, CARL 2015 (Co-located with ISCA 2015) [2] Putnam et al, A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA2014
  • 8. Will Moore s Law Continue? 7 Microsoft is a trademark of the Microsoft group of companies Yes. But we have motivation for accelerators [1]. That s why we bought Altera No, Moore s Law is ending. That s why hardware specialization will be critical [2] [1] Cupta et al, Xeon+FPGA Platform for the Data Center, CARL 2015 (Co-located with ISCA 2015) [2] Putnam et al, A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA2014 Dedicated hardware era is coming!!!
  • 9. l Sorting is a fundamental computation kernel This work: Sorting Accelerator 8 Databases[1] Image Processing[2] Data Compression[3] [1] Rene Mueller et al, Sorting Networks on FPGAs, The VLDB Journal 2012 [2] Ratnayake, K et al, An FPGA Architecture of Stable-Sorting on a Large Data Volume : Application to Video Signals, CISS 2007 [3] Martinez, J et al, An FPGA-based parallel sorting architecture for the Burrows Wheeler transform, ReConFig 2005
  • 10. l Sorting is a fundamental computation kernel This work: Sorting Accelerator 9 [1] Rene Mueller et al, Sorting Networks on FPGAs, The VLDB Journal 2012 [2] Ratnayake, K et al, An FPGA Architecture of Stable-Sorting on a Large Data Volume : Application to Video Signals, CISS 2007 [3] Martinez, J et al, An FPGA-based parallel sorting architecture for the Burrows Wheeler transform, ReConFig 2005 Databases[1] Image Processing[2] Data Compression[3] ! Sorting accelerators fulfilling the following requirements do not exist... Problem ü  High Performance ü  Customizable ü  Open sourced
  • 11. 10 Our Proposed Sorting Accelerator
  • 12. l Using the following sorting architectures Ø The sorting network Ø The merge sorter tree Our Proposed Sorting Accelerator 11 1 4 3 2 4 3 2 1 > > > The sorting network Proposed Sorting Accelerator The merge sorter tree
  • 13. l A sorting architecture composed of wires and comparators l Example: Sorting 4 values in the network Ø Smaller and larger values are carried to the top and bottom The Sorting Network* 12 1 4 3 2 4 3 2 1 Bubble sort network with 4-inputs and 4-outputs * Donald E. Knuth. The Art of Computer Programming. 1998.
  • 14. l A data path that executes merge process The Merge Sorter Tree* 13 > > > FIFO Sorter Cell> 4-way merge sorter tree * Dirk Koch et al, FPGASort, FPGA 11
  • 15. l Sorting process in the merge sorter tree Ø The data sequences in the leftmost FIFOs must be sorted The Merge Sorter Tree 14 > > > > > > > > > 89 35 13 22 1 3 3 22 1 2 1 89 5 3 7 5 1 2 3 2 5 22 3 7 98 x x: Invalid Value Cycle N Cycle N+1 Cycle N+2
  • 16. Data Path of the Proposed Sorting Accelerator 15 Debug HW Host PC DRAM > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA IN OUT Stage
  • 18. l The generated initial data sequence is stored in the external memory Sorting 256 Elements from 256 to 1 17 256 255 254 … 64 63 … 3 2 1 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA DRAM
  • 19. l Initialization is done Sorting 256 Elements from 256 to 1 18 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 256 255 254 253 252 1 DRAM
  • 20. l The data is sent to Sorting Network Sorting 256 Elements from 256 to 1 19 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 256 255 254 … 6463…321 256 255 254 253 252 1 IN OUT Stage DRAM
  • 21. l Sorting Network can sort 16 elements Ø The initial data sequence turns into 16 sorted data sequences by passed through this network Sorting 256 Elements from 256 to 1 20 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 16 … 3 2 1 32 … 19 18 17 256 … 243 242 241 This is sorted 256 255 254 253 252 1 DRAM
  • 22. l The data passed through the network is stored in Input Buffer, and sent to Merge Sorter Tree Sorting 256 Elements from 256 to 1 21 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 241242243256… 225226240… 224… 208… 227 210211 194195 209 193 256 255 254 253 252 1 DRAM
  • 23. l The root of the tree emits sorted data sequences Sorting 256 Elements from 256 to 1 22 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 241242243256… 226227240… 224… 208… 228 210211 209 193 194 195 … 196 201 202 225 203 256 255 254 253 252 1 DRAM
  • 24. l The data sequence composed of 16 Units turns into 4 sorted data sequences Sorting 256 Elements from 256 to 1 23 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 64 … 3 2 1 128 … 67 66 65 256 … 195 194 193 This is sorted (64 elements) 192 … 131 130 129256 255 254 253 252 1 DRAM
  • 25. l The data is stored in the external memory Sorting 256 Elements from 256 to 1 24 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 193 194 195 196 197 198 199 200 201 202 203 204 …… Store Area256 255 254 253 252 1 DRAM
  • 26. l This data is not fully sorted yet... Ø This data has to be sent to Merge Sorter Tree again Sorting 256 Elements from 256 to 1 25 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 256 255 254 253 252 1 193 194 195 62 63 64 DRAM
  • 27. l The data is read form DRAM and sent to Sorting Network Sorting 256 Elements from 256 to 1 26 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 193 194 195 … 129130…626364 Load Area256 255 254 253 252 1 193 194 195 62 63 64 DRAM
  • 28. l In this time, this network is a mere data path because portions of the data sequence are already sorted Sorting 256 Elements from 256 to 1 27 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 256 255 254 253 252 1 193 194 195 62 63 64 DRAM
  • 29. l The data passed through the network is stored in Input Buffer, and sent to Merge Sorter Tree Sorting 256 Elements from 256 to 1 28 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 256 255 254 253 252 1 193 194 195 62 63 64 193194195256… 129130192… 128… 64… 131 6667 23 65 1 DRAM
  • 30. l The root of the tree emits sorted data sequences Sorting 256 Elements from 256 to 1 29 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 256 255 254 253 252 1 193 194 195 62 63 64 64… 1 2 3 … 4 51 52 225 53 193194195256… 129130192… 128… 131 6667 65 DRAM
  • 31. l The data is stored in the external memory Sorting 256 Elements from 256 to 1 30 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 1 2 3 4 5 6 7 8 9 10 11 12 …… Store Area256 255 254 253 252 1 193 194 195 62 63 64 DRAM
  • 32. l This data is fully sorted !! Ø The data is fully sorted by passed through the network and the tree as required ﹣ log# of ways(# of elements/16) Sorting 256 Elements from 256 to 1 31 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 1 2 3 4 5 256 193 194 195 62 63 64 DRAM
  • 33. l The fully sorted data is sent to Host PC Ø To verify that the accelerator accurately works Sorting 256 Elements from 256 to 1 32 DRAM Host PC > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree 512-bit shift register Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA 1 2 3 4 5 256 193 194 195 62 63 64 Load Area 1 2 3 4 56… DRAM
  • 34. l Duplication of the merge sorter tree Data Path of the Accelerator with the Duplicated Merge Sorter Tree 33 512-bit shift register 512-bit shift register 512-bit shift register > > > 32 32 32 32 3232 32 32 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512-bit shift register 32 512 512 512 512 Input Buffer Merge Sorter Tree Initial Data Generator Sorting Network 512 512 512 512 DRAM Controller UART Controller Output Buffer 32512512 512 512 512 512 1 64 FPGA Duplicated Logics 512-bit shift register
  • 35. l Effectiveness Ø To sort data sequences in parallel ﹣ The sorting logic throughput is improved Duplication of the Merge Sorter Tree 34 The accelerator with four 4-way trees sorts the initial data sequence 193 194 … 256 129 130 … 192 65 66 … 128 1 2 … 64 256 … 194 193 192 … 130 129 128 … 66 65 64 … 2 1 Initial Data Sequence Sorting the data in parallel Tree 0 Tree 1 Tree 2 Tree 3 Executing merge process in a tree 1 2 3 4 5 6 7 8 … … … … 253 254 255 256 Sorting is done!!!
  • 36. l This accelerator is customizable by tuning # of ways and duplicated trees l This accelerator performance can be formulated Ø Designer can estimate accelerator performance in advance and implement the best one fulfilling hardware resource constraints Characteristics of the Accelerator 35 # of required cycles to finish sorting (k: # of ways, P: # of trees, N: # of elements)
  • 38. l Implementation Platform Ø Xilinx FPGA VC707 Evaluation Kit Hardware Setup(1/2) 37 Power In UART Port JTAG Port DDR3 SO-DIMM DRAM (4GB*, 800MHz/1600Mbps) * This kit originally has 1GB DDR3 SO-DIMM (800MHz/1600Mbps) memory Virtex-7 VX485T FPGA
  • 39. l All logics are implemented in Verilog HDL l Design Tool: Vivado2014.4 l Operating frequency Ø Logic: 200MHz,Memory bus: 800MHz l Initial Data Generator Ø Supporting the following data-generation types ﹣ A random data sequence using Xorshift* ﹣ A sorted data sequence ﹣ A reverse-order sorted data sequence Hardware Setup(2/2) 38 * George Marsaglia, Xorshift RNGs, Journal of Statistical Software 2003.
  • 40. l  Point: Sorting Process Time and hardware resource usage Ø Dataset: 256M 32-bits integer values l  Opponent: Intel Corei7-4770 @ 3.4GHz Ø A single thread Ø gcc 4.8.2 (-O3 optimization) Ø Sorting algorithm ﹣ Merge sort ﹣ Quick sort l  How to measure the execution time Ø FPGA -> to get execution cycles Ø CPU -> to use gettimeofday Evaluation 39
  • 41. l Sorting performance Ø It is improved as # of ways and trees is larger Ø It is independent of data-sequence types Ø It is almost same as estimated one Evaluation: Sorting Performance 40 0 10 20 30 40 50 60 m erge sortquick sort 4-w ay 4-w ay/2-parallel 4-w ay/4-parallel 8-w ay 8-w ay/2-parallel 8-w ay/4-parallel 8-w ay/8-parallel 16-w ay 16-w ay/2-parallel 16-w ay/4-parallel SortingProcessTime[sec] xorshift sorted reverse Estimated
  • 42. Evaluation: Sorting Performance of 8-way/8-parallel: For a random data sequence, the 8-way/8-parallel design is 10.06x faster than merge sort and 8.01x faster than quick sort. [Figure: bar chart of sorting process time [sec] for merge sort, quick sort, and 8-way/8-parallel; series: xorshift, sorted, and reverse.]
  • 43. l Most of the designs can be implemented on low-end devices Ø This accelerator is available on even embedded systems Customizable: the Performance and the Hardware Resource Usage 42 4-way 4-way/2-parallel 4-way/4-parallel 8-way 8-way/2-parallel 8-way/4-parallel 8-way/8-parallel 16-way 16-way/2-parallel 16-way/4-parallel 0 2 4 6 8 10 12 0 10000 20000 30000 40000 50000 60000 70000 Speed-upRatio # of Slices Cost Performance Available 15,850 Slices (Artix-7 XC7A100T) Available 50,950 Slices (Kintex-7 XC7K325T)
  • 44. l FACE is available on GitHub Ø https://github.com/monotone-RK/FACE l Currently, FACE can work on Xilinx FPGA VC707 Evaluation Kit Ø We will try to port another environment if you have requests and if possible Open Sourced 43Xilinx FPGA VC707 Evaluation Kit
  • 47. l FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems Ø This accelerator is customizable by tuning # of ways and duplicated trees Ø This accelerator performance can be formulated Ø Open sourced ﹣ Available on GitHub (https://github.com/monotone-RK/FACE) l Future Work Ø Performance evaluation including data transfer ﹣ Bus system like AXI4 or Avalon, NoC, PCIe, etc... Conclusion 46