A Cost-Effective and Scalable
Merge Sorter Tree on FPGAs
☆Takuma Usui, Thiem Van Chu, and Kenji Kise
Tokyo Institute of Technology, Japan
Department of Computer Science
CANDAR’16@Hiroshima, Japan
11:35-12:00 (Presentation: 20min, Q&A: 5min),
November 24, 2016
Executive summary
 Integer sorting is a very important computing kernel which
can be accelerated using FPGAs.
 FPGA resources are too limited to build a high performance
merge sorter tree.
 We propose cost-effective and scalable merge sorter tree designs that achieve high performance with a small FPGA resource requirement.
 In our evaluation, the architecture achieves 52.4x lower FPGA slice usage without serious throughput degradation.
Introduction
Sorting is important
 Integer sorting is a fundamental computation kernel
[Figure: sorting underlies database operations, image processing, and data compression]
Merge Sorter Tree [4]
 It merges multiple sorted record sequences.
 𝐾: the number of input leaves (called “ways”)
[Figure: 4-way merge sorter tree with stages 2, 1, and 0, built from FIFOs and sorter cells (<); the input leaves (ways) 1 8, 2 4, 3 7, and 0 9 are merged into 0 1 2 3 4 7 8 9]
[4] Dirk Koch et al., “FPGASort”, FPGA’11
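As a software illustration of the structure described on this slide, the following minimal Python sketch builds a binary tree of 2-way sorter cells over 𝐾 sorted leaves. It is a functional model only, not the hardware design: Python generators stand in for the FIFOs, the names `sorter_cell` and `merge_sorter_tree` are hypothetical, and the example values are the leaves read from the slide's figure.

```python
from typing import Iterable, Iterator, List


def sorter_cell(a: Iterator[int], b: Iterator[int]) -> Iterator[int]:
    """2-way merge: repeatedly emit the smaller head of two sorted inputs."""
    done = object()  # marks an exhausted input
    x, y = next(a, done), next(b, done)
    while x is not done and y is not done:
        if x <= y:
            yield x
            x = next(a, done)
        else:
            yield y
            y = next(b, done)
    while x is not done:  # flush whatever is left on input a
        yield x
        x = next(a, done)
    while y is not done:  # flush whatever is left on input b
        yield y
        y = next(b, done)


def merge_sorter_tree(leaves: List[Iterable[int]]) -> Iterator[int]:
    """Connect K sorted leaves (K a power of two) through a tree of sorter cells."""
    level = [iter(leaf) for leaf in leaves]
    assert level and len(level) & (len(level) - 1) == 0, "K must be a power of two"
    while len(level) > 1:
        # One tree stage: pair up neighbouring streams.
        level = [sorter_cell(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


result = list(merge_sorter_tree([[1, 8], [2, 4], [3, 7], [0, 9]]))
print(result)
```

For 𝐾 = 4 the sketch instantiates three sorter cells, matching the three cells drawn in the figure.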
Performance and our purpose
 Sorting time: 𝑂(log_𝐾(#records))
►Increasing 𝐾 is effective
 FPGA resource requirement: 𝑂(𝐾)
►A tree with 𝐾 ≥ 2,048 cannot be implemented even on a large FPGA
 Our purpose: build an optimal architecture for large trees
[Figure: 4-way merge sorter tree (𝐾 = 4) with stages 2, 1, and 0, built from FIFOs and sorter cells (<)]
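The two scaling rules on this slide can be made concrete with a small Python sketch. It assumes that sorting starts from runs of length 1 and that each pass merges 𝐾 runs into one, so the number of passes grows with the base-𝐾 logarithm of the record count; it also counts the 𝐾 − 1 sorter cells of a full binary tree. The function names are hypothetical and the model ignores memory bandwidth.

```python
def merge_passes(num_records: int, k: int) -> int:
    """Number of K-way merge passes needed to turn runs of length 1 into one sorted run."""
    passes, run_length = 0, 1
    while run_length < num_records:
        run_length *= k  # each pass merges K runs into one
        passes += 1
    return passes


def normal_tree_cells(k: int) -> int:
    """A full binary tree with K leaves contains K - 1 sorter cells."""
    return k - 1


for k in (2, 16, 1024, 4096):
    print(k, merge_passes(2**20, k), normal_tree_cells(k))
```

Under these assumptions, sorting 2^20 records takes 20 passes with 𝐾 = 2 but only 2 passes with 𝐾 = 1,024, while the number of sorter cells grows from 1 to 1,023.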
Merge Sorter Tree: Steady state
 In the steady state, only one sorter cell in each stage operates at any one time.
 This property is pointed out in [12].
[Figure: 4-way merge sorter tree in the steady state, with one active sorter cell per stage and the remaining cells non-active]
[12] Megumi Ito et al., “Logic-Saving FPGA-Based Merge Sort on Single Sort Cells” (in Japanese), IPSJ SIG Technical Report, vol. 2014-ARC-208.
Single Sort Cells Merge Sorter Tree (SSC) [12]
 Proposed in [12] to reduce FPGA slice usage to 𝑂(log 𝐾).
►Only 8-way and 16-way trees were built.
►Reduced slices: 19% and 43%, obtained by using BRAMs for the RAM layer.
[Figure: a 4-way merge sorter tree next to the equivalent 4-way SSC; each stage keeps only 1 sorter cell, and the FIFOs of a stage are gathered into a RAM layer]
Cycle N: How to control
 FIFOs are numbered within each stage.
 Cell 0 compares 2 (FIFO 0 of stage 1) with 3 (FIFO 1 of stage 1); since 2 < 3, it sends 2 to the root.
 Cell 0 then sends the request “Refill FIFO 0” to Request queue 1.
[Figure: 4-way SSC (𝐾 = 4) at cycle N; Cell 0 selects FIFO 0 of stage 1 and places the request “Refill FIFO 0” in Request queue 1; the FIFOs of stages 1 and 2 are held in the RAM layer]
Cycle N+1: Execute the request
 It is difficult to detect directly that FIFO 0 of stage 1 is no longer full.
►Cell 1 would have to observe the state of all FIFOs.
 Instead, Cell 1 executes the request issued in the previous cycle:
1. Read 13 and 11 from the two corresponding FIFOs
2. Write the smaller record, 11, to FIFO 0 of stage 1 (selected in the previous cycle)
3. Send the request “Refill FIFO 1” to Request queue 2
[Figure: 4-way SSC (𝐾 = 4) at cycle N+1; Cell 1 executes “Refill FIFO 0” (read, select, and write) and issues “Refill FIFO 1” to Request queue 2]
Cycle N+2: Complete the Request
 The request “Refill FIFO 0” has been completed.
 SSC repeats this operation recursively.
 All cells operate at the same time, every cycle.
 SSC can therefore operate every cycle.
[Figure: 4-way SSC (𝐾 = 4) at cycle N+2; FIFO 0 of stage 1 has been refilled, Cell 1 reads the 2 records that correspond to “Refill FIFO 1”, and the next request is issued to Request queue 2]
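The request-driven control walked through in cycles N to N+2 can be sketched in Python as follows. This is a functional model under stated assumptions, not the authors' circuit: it is not cycle-accurate, every internal FIFO holds a single record, the input leaves are fully loaded in advance, and the names `ssc_merge`, `cell`, and `drain` are hypothetical. What it keeps from the slides is one sorter cell per stage, FIFOs numbered within each stage, and refill requests that carry the ID of the consumed FIFO.

```python
from collections import deque
from typing import Deque, List


def ssc_merge(leaves: List[List[int]]) -> List[int]:
    """Merge K sorted leaves with one sorter cell per stage (functional model)."""
    k = len(leaves)
    stages = k.bit_length() - 1  # log2(K) sorter cells
    assert k >= 2 and (1 << stages) == k, "K must be a power of two"

    # fifos[d][i] is FIFO i of level d. Level 0 is the output, level `stages`
    # holds the input leaves, and the cell of stage d reads level d + 1.
    fifos: List[List[Deque[int]]] = [
        [deque() for _ in range(1 << d)] for d in range(stages + 1)
    ]
    fifos[stages] = [deque(leaf) for leaf in leaves]
    # queues[d] holds the IDs of level-d FIFOs waiting for a refill.
    queues: List[Deque[int]] = [deque() for _ in range(stages + 1)]

    def cell(d: int, target: int) -> None:
        """Refill FIFO `target` of level d from its two source FIFOs."""
        left = fifos[d + 1][2 * target]
        right = fifos[d + 1][2 * target + 1]
        if not left and not right:
            return  # this subtree is exhausted
        take_left = bool(left) and (not right or left[0] <= right[0])
        src = 2 * target if take_left else 2 * target + 1
        fifos[d][target].append(fifos[d + 1][src].popleft())
        if d + 1 < stages:
            queues[d + 1].append(src)  # request: "Refill FIFO src"

    def drain() -> None:
        """Execute pending requests, deepest stage first, until none remain."""
        while any(queues):
            for d in range(stages - 1, 0, -1):
                for _ in range(len(queues[d])):
                    cell(d, queues[d].popleft())

    # Prime every internal FIFO with one record.
    for d in range(1, stages):
        queues[d].extend(range(1 << d))
    drain()

    total = sum(len(leaf) for leaf in leaves)
    out = fifos[0][0]
    while len(out) < total:
        cell(0, 0)  # the root cell emits one record ...
        drain()     # ... and the refill request ripples down the stages
    return list(out)


print(ssc_merge([[1, 8], [2, 4], [3, 7], [0, 9]]))
```

In hardware the stages work concurrently, so the ripple modelled by `drain` overlaps with the following root operations; the sketch serializes it only to keep the data flow easy to follow.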
Proposal of Effective Designs and Evaluation
Design goals of Our Proposal
 Minimum performance degradation from the normal tree
 Minimum FPGA resource requirement
 The clock frequency should not drop seriously as 𝐾 increases.
Designs
1. Baseline design
2. Proposal 1: Critical-path optimized
►For trees that are not very large
3. Proposal 2: Record management with Block RAMs (BRAMs)
►For very large trees
4. Combination of Proposal 1 and Proposal 2
Baseline design
 Minimal design of SSC
 BRAMs are used for the RAM layers (as in [12]).
 A cell sends a request in the form of the ID of the selected FIFO.
 To execute a request, the cell has to give the BRAMs the proper read addresses and write address.
 The address calculation logic converts an ID into these addresses.
[Figure: baseline SSC; Cell 0 issues a request (the ID of the selected FIFO) into Request queue 1, Address calculation logic 1 turns it into read addresses and a write address for the BRAMs, and Cell 1 executes the request]
Address calculation logic
 We focus on the read operation here.
 The address calculation logic is a combinational circuit.
 Each FIFO has a head pointer for reading.
 The address calculation logic holds the FIFO IDs and their head pointers.
►Managed with Distributed RAMs
[Figure: Address calculation logic 1 between Request queue 1 and the BRAMs; a table in Distributed RAMs maps FIFO IDs 0 to 3 to Head 0 to Head 3 and outputs the read addresses]
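A minimal Python sketch of the ID-to-address conversion might look as follows. The layout is an assumption, not taken from the slides: all FIFOs of a stage share one flat address space, FIFO i occupies `depth` consecutive entries, and a request for FIFO t reads the heads of FIFOs 2t and 2t + 1 of the next stage. The class name and the `depth` parameter are hypothetical, and a real design may place the two source FIFOs in separate BRAMs so that both can be read in one cycle.

```python
from typing import List, Tuple


class AddressCalc:
    """Head-pointer table that turns a FIFO ID into two read addresses."""

    def __init__(self, num_fifos: int, depth: int) -> None:
        self.depth = depth                       # entries per FIFO (assumed)
        self.heads: List[int] = [0] * num_fifos  # one head pointer per FIFO

    def read_addresses(self, fifo_id: int) -> Tuple[int, int]:
        """Addresses of the head records of the two source FIFOs of `fifo_id`."""
        left, right = 2 * fifo_id, 2 * fifo_id + 1
        return (left * self.depth + self.heads[left],
                right * self.depth + self.heads[right])

    def consume(self, source_fifo: int) -> None:
        """Advance the head pointer of the source FIFO whose record was selected."""
        self.heads[source_fifo] = (self.heads[source_fifo] + 1) % self.depth


calc = AddressCalc(num_fifos=4, depth=8)
print(calc.read_addresses(0))  # heads of source FIFOs 0 and 1
calc.consume(1)                # the record from FIFO 1 was the smaller one
print(calc.read_addresses(0))  # FIFO 1 now points at its next entry
```

Because the lookup is a pure table read plus a multiply-add on small constants, it is consistent with the slide's description of the logic as combinational.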
Request queue and BRAM cycle latency
 Consider the case where only the top of the request queue is given to the address calculation logic.
 A BRAM emits an entry 1 cycle after it is given a read address.
 As a result, the sorter cell can operate only once every 2 cycles.
[Figure: Cell 1, Request queue 1, and Address calculation logic 1 at cycles N and N+1; the cell stalls at cycle N+1 while it waits for the BRAM read]
Solution
 While the sorter cell is operating, the 2nd request in the queue is given to the address calculation logic.
 The request queue is divided into 2 parts.
 To operate the cells every cycle, an incoming request is sometimes passed straight through.
[Figure: Cell 1, Request queue 1, and Address calculation logic 1 at cycles N and N+1; the entries active at cycle N+1 are already being read while the cell works on the entries active at cycle N]
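The effect of the 1-cycle BRAM latency, and of issuing the next read address early, can be illustrated with a toy cycle count in Python. The model is deliberately simple and is an assumption rather than the authors' timing analysis: every request needs one cycle to issue its read addresses and one cycle for the cell to operate on the returned records, and with overlap the addresses of the next request are issued while the cell operates on the current one.

```python
def cycles_needed(num_requests: int, overlap: bool) -> int:
    """Cycles one sorter cell needs to execute `num_requests` refill requests."""
    cycles = 0
    for i in range(num_requests):
        if overlap and i > 0:
            cycles += 1  # addresses were issued during the previous operation
        else:
            cycles += 2  # issue the addresses, then wait 1 cycle for the BRAM
    return cycles


print(cycles_needed(8, overlap=False))  # top of the queue only
print(cycles_needed(8, overlap=True))   # 2nd request is looked up early
```

In this model 8 requests take 16 cycles without overlap and 9 cycles with it, which corresponds to the slide's move from one operation every 2 cycles to one operation every cycle.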
Request queue full
 The sorter cell has to stall if its output request queue is full.
►This occurs when the main inputs become empty.
 While the cell is stalled, the top of the queue is given to the address calculation logic so that the active entries are kept.
[Figure: Cell 1 at cycles N+1 and N+2; Request queue 2 is full, so Cell 1 stalls while the active entries are kept, and the cell then operates correctly]
Designs
1. Baseline design
2. Proposal 1: Critical-path optimized
►For trees that are not very large
3. Proposal 2: Record management with BRAMs
►For very large trees
4. Combination of Proposal 1 and Proposal 2
Proposal 1: Critical-path optimized
 The rear part of the request queue becomes a 2-entry FIFO.
►The wire that lets the cell operate every cycle is long, so it is divided.
 A pipeline register is inserted after a sorter cell.
[Figure: Proposal 1; Cell 0 issues a request (the ID of the selected FIFO) through Request queue 1 and Address calculation logic 1 to the BRAMs, and Cell 1 issues its request to Request queue 2]
Designs
1. Baseline design
2. Proposal 1: Critical-path optimized
►For trees that are not very large
3. Proposal 2: Record management with BRAMs
►For very large trees
►BRAMs in the address calculation logic
4. Combination of Proposal 1 and Proposal 2
Proposal 2: Record management with BRAMs
 When 𝐾 is very large, the Distributed RAMs in the address calculation logic become too large.
►This decreases performance and increases the slice requirement.
 Proposal: Record management with BRAMs
[Figure: the Distributed RAMs inside Address calculation logic 1 are replaced with BRAMs]
Problem of Record management with BRAMs
 In Proposal 1, the required latency of the logic is 1 cycle.
 A BRAM emits an entry 1 cycle after it is given a read address.
 With BRAMs in the address calculation logic, the required latency of the logic becomes 2 cycles.
 This doubles the required BRAM capacity (please see our manuscript).
[Figure: Cell 0 issues a request through Request queue 1 and Address calculation logic 1 to the BRAMs read by Cell 1]
Overall Design of Proposal 2
 The request queue and the address calculation logic are exchanged.
►The addresses are calculated just after the cell issues a request.
 A request sometimes passes straight through the request queue to the address ports.
 Required latency of the logic: 1 cycle (as in Proposal 1)
 The FIFO capacity becomes the same as in Proposal 1.
[Figure: Proposal 2; Address calculation logic 1 now sits directly after Cell 0, followed by Request queue 1 and the BRAMs read by Cell 1]
Designs
1. Baseline design
2. Proposal 1: Critical-path optimized
►For trees that are not very large
3. Proposal 2: Record management with BRAMs
►For very large trees
►BRAMs in the address calculation logic
4. Combination of Proposal 1 and Proposal 2
Design Combination
 Proposal 2 is effective only for large trees.
 Threshold: 𝐾 = 1,024 (determined by the evaluation).
 We combine Proposal 1 and Proposal 2.
[Figure: combined design; Proposal 1 is used up to the 1,024-way stage and Proposal 2 from the 2,048-way stage]
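One reading of the combination, sketched in Python: each stage of the tree is built with Proposal 1 while it has at most 1,024 ways and with Proposal 2 beyond that. This per-stage interpretation of the figure is an assumption, and `stage_designs` and `THRESHOLD_WAYS` are hypothetical names; only the threshold value comes from the slide.

```python
from typing import List, Tuple

THRESHOLD_WAYS = 1024  # from the slide: Proposal 2 is effective only above this size


def stage_designs(k: int) -> List[Tuple[int, str]]:
    """Return (ways, design) for every stage of a K-way combined SSC, root first."""
    assert k >= 2 and k & (k - 1) == 0, "K must be a power of two"
    plan = []
    ways = 2
    while ways <= k:
        design = "Proposal 1" if ways <= THRESHOLD_WAYS else "Proposal 2"
        plan.append((ways, design))
        ways *= 2
    return plan


for ways, design in stage_designs(4096):
    print(f"{ways:>5} ways: {design}")
```

For a 4,096-way tree this assigns Proposal 1 to the ten stages from 2 to 1,024 ways and Proposal 2 to the 2,048-way and 4,096-way stages.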
Evaluated Designs
 Normal merge sorter tree (Not SSC)
►A component of FACE [11]
 SSC
►Baseline: Baseline design
►Proposal 1: Critical-path optimized
►Proposal 2: Record management with BRAMs
►Combination: Combination of Proposal 1 and Proposal 2
[11] R. Kobayashi et al, “FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems,” MCSoC’15
Evaluation Setup
 Data: 64-bit integer
 16 ≤ 𝐾 ≤ 𝟒, 𝟎𝟗𝟔
 Terms: Resource usage, clock frequency
 Simulation Tool: Synopsys VCS
 Design Tool: Xilinx Vivado 2014.4
►Synthesis option: Flow_PerfOptimized_High
►Implementation option: Performance ExplorePostRoutePhysOpt
 Target FPGA: Xilinx Virtex7 XC7VX485T-2
►It is on a VC707 Evaluation Kit, which is an ordinary evaluation
environment.
28[11] R. Kobayashi et al, “FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems,” MCSoC’15
Slice Usage
 52.4x fewer slices than the normal tree (𝐾 = 1,024, Proposal 1)
 Slice usage is roughly proportional to log 𝐾 in almost all SSCs.
 For 𝐾 ≥ 2,048, Proposal 1 consumes more slices.
►In the combined design, the usage is reduced to 𝑂(log 𝐾).
 The 4,096-way tree (Combination) uses only 1.72% of the slices.
[Figure: slice usage [%] (0 to 2.5) versus number of ways 𝐾 (16 to 4,096) for Baseline, Proposal 1, Proposal 2, and Combination]
Operating Clock Frequency
 The operating clock frequency is almost equal to the merging throughput.
 Baseline is the lowest (about 150 MHz).
 While the degradation relative to Normal is 1.61x in Baseline, it is suppressed to 1.31x in Proposal 1 (𝐾 = 1,024).
 149 million records/s at 𝐾 = 4,096 in Combination
►1.23x better than Baseline
[Figure: clock frequency [MHz] (0 to 300) versus number of ways 𝐾 (16 to 4,096) for Normal, Baseline, Proposal 1, Proposal 2, and Combination]
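The reported throughput can be converted into bytes per second with a few lines of Python. The inputs (149 million records/s, 64-bit records, and the 1.23x factor) are the figures stated on the slides; the Baseline value is only what the 1.23x ratio implies at 𝐾 = 4,096, not a separately reported measurement, and the names are hypothetical.

```python
RECORD_BYTES = 8  # 64-bit records


def throughput_gb_per_s(million_records_per_s: float) -> float:
    """Convert a merging throughput in million records/s into GB/s (10^9 bytes/s)."""
    return million_records_per_s * 1e6 * RECORD_BYTES / 1e9


combination = throughput_gb_per_s(149)  # Combination at K = 4,096
implied_baseline = 149 / 1.23           # million records/s implied by the 1.23x ratio

print(f"Combination: {combination:.3f} GB/s")
print(f"Implied Baseline: {implied_baseline:.0f} million records/s")
```

Under these figures the 4,096-way Combination design merges roughly 1.19 GB/s of 64-bit records.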
Conclusion
 We propose cost-effective and scalable merge sorter tree designs for FPGAs, based on [12].
►For trees with thousands of input leaves
►Several optimizations and record management with BRAMs
 Our proposed optimizations lead to a 1.23x performance improvement over Baseline (𝐾 = 4,096, Combination).
 The slice requirement is reduced to 𝑂(log 𝐾) even for very large 𝐾, without serious performance degradation compared to the normal tree, which consumes 𝑂(𝐾) slices.
►1,024-way: 52.4x fewer slices with only 1.31x performance degradation
►4,096-way: 149 million 64-bit records/s, using 1.72% of the slices
[12] Megumi Ito et al., “Logic-Saving FPGA-Based Merge Sort on Single Sort Cells” (in Japanese), IPSJ SIG Technical Report, vol. 2014-ARC-208.
