FPGAs on the Cloud
Ioannis Tsagatakis
Ioannis Stefanis
MSc in Informatics & Multimedia
Department of Informatics Engineering, TEI of Crete
Embedded Systems

Accelerated Computing: GPUs and FPGAs

Massive Parallelism
● GPU
– SIMD
– Instruction Set
– Fixed Word Sizes
– Simple control logic
● FPGA
– MIMD
– No instruction set
– Any data width
– Complex control logic (FSMs)

AWS F1 FPGA Instances
● Cloud-based FPGA
– No need to buy hardware
● Cloud-based IDE
– Ready-to-use AMI
– HDL: Verilog, VHDL
– SDAccel: C/C++, OpenCL
– AFI tools
● Marketplace
– A new market for IPs
– Secure encrypted AFIs
● f1.2xlarge
– 1 VU9P UltraScale+
● 2.5M logic elements
● 6,800 DSP
– 8 vCPU Cores
– 122GB RAM
– PCIe x16
– $1.6 per hour
● f1.16xlarge
– 8 FPGAs / 64 vCPUs
● Run design and simulation on a C4 instance to save money

The SDAccel Development Environment
● Cloud IDE or local install
● Virtual JTAG interface

The AWS F1 Shell
● Amazon AFI image
● Predefined interface
● Secured, encrypted
● Dynamically (re)loaded
● User can’t see the bits

Kernel Creation: The 2 workflows
● Custom IP must be packaged as an SDAccel kernel
● Strict interface requirements
● Design for performance
● SDAccel provides a Kernel Wizard
● Kernel container file (XO file)
– XML metadata, Vivado project
– RTL files
● Or generate the kernel from OpenCL
● Advanced optimizations
– Memory partitioning
– Loop unrolling
– DSP block inference
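Two of the optimizations listed above, memory partitioning and loop unrolling, can be illustrated with a plain Python stand-in. This is not SDAccel or OpenCL code: the unroll factor of 4 and the cyclic partitioning scheme are hypothetical choices made for illustration. On an FPGA each bank would be a separate memory readable in the same cycle, and the inner loop body would be replicated as parallel hardware.

```python
UNROLL = 4  # hypothetical unroll factor, chosen only for illustration


def cyclic_partition(data, banks=UNROLL):
    """Split data into `banks` lists: element i goes to bank i % banks."""
    return [data[b::banks] for b in range(banks)]


def sum_rolled(data):
    """Reference version: one element per loop iteration."""
    total = 0
    for x in data:
        total += x
    return total


def sum_unrolled(data, banks=UNROLL):
    """Unrolled version: each outer iteration reads one element from every bank."""
    parts = cyclic_partition(data, banks)
    partial = [0] * banks
    for i in range(len(parts[0])):      # parts[0] is never shorter than the other banks
        for b in range(banks):          # the body that hardware would replicate
            if i < len(parts[b]):
                partial[b] += parts[b][i]
    return sum(partial)


data = list(range(10))
print(cyclic_partition(data))
print(sum_rolled(data), sum_unrolled(data))
```

Both functions return the same sum; the unrolled one only changes how the work is arranged, which is the point of the optimization.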

An OpenCL Kernel
● Language support
– Embedded profile (1.0)
– Pipes (2.0)
– Image Objects (2.0)
● N-dimensional ranges (NDRange)
● SIMD vector types
● Math library functions
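The N-dimensional range model can be sketched in plain Python: the kernel body runs once per work-item, and each work-item is identified by its global ID. This is an emulation for illustration only, not OpenCL code; `run_ndrange` and `vector_add` are hypothetical names, and the sequential iteration order here is arbitrary, whereas a real device runs work-items in parallel.

```python
from itertools import product


def run_ndrange(kernel, global_size, *args):
    """Call `kernel` once per work-item; global_id is a tuple of indices."""
    for global_id in product(*(range(n) for n in global_size)):
        kernel(global_id, *args)


def vector_add(global_id, a, b, out):
    """Kernel body: each work-item handles exactly one element."""
    i = global_id[0]
    out[i] = a[i] + b[i]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * len(a)
run_ndrange(vector_add, (len(a),), a, b, out)   # 1-dimensional range of 4 work-items
print(out)
```

A 2-dimensional launch only changes the `global_size` tuple, for example `(height, width)`, and the kernel then reads both `global_id[0]` and `global_id[1]`.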

Creating the Amazon FPGA Image
● Created by an Amazon service
● Securely stored and encrypted
● Developers have no access to the RTL IP
● The distributable awsxclbin contains only the AFI ID

SDAccel Testing and Execution Modes

OpenCL vs CUDA
● CUDA
– SIMD
– Easier programming model
– Restricted memory access patterns
– Faster development
– Vendor lock-in
– Easy deployment
● F1 FPGA
– MIMD
– More complexity
– Harder programming
– Deep pipelining
– Slow development
– Vendor lock-in
– Cloud deployment

Smith–Waterman algorithm (sw_emu)
------FPGA Accelerator Summary --------
Number of SmithWaterman instances on FPGA:16
Total processing elements:512
Length of reference string:256
Length of read(query) string:128
Read-Ref pair block size(HOST to FPGA):1024
Verify Mode is:0
---------------------------------------
Generating read-ref samples
Processing 16384 Samples
HW Block Size: 16384
Total Number of blocks: 1
INFO: [smithwaterman.cpp:654] TIME: [Wed Feb 21 22:37:07 2018] nruns = 1
INFO: [smithwaterman.cpp:655] TIME: [Wed Feb 21 22:37:07 2018] total [ms] = 43326.373
INFO: [smithwaterman.cpp:656] TIME: [Wed Feb 21 22:37:07 2018] Host write [ms] = 0.768
INFO: [smithwaterman.cpp:657] TIME: [Wed Feb 21 22:37:07 2018] Krnl exec [ms] = 43317.977
INFO: [smithwaterman.cpp:658] TIME: [Wed Feb 21 22:37:07 2018] Host read [ms] = 1.029
GCups(based on kernel execution time):0.0115426
GCups(based on total execution time):0.0115403
INFO: [smithwaterman.cpp:679] TIME: [Wed Feb 21 22:37:07 2018] Host2Device rate [mbps] = 15616.602
INFO: [smithwaterman.cpp:691] TIME: [Wed Feb 21 22:37:07 2018] Device2Host rate [mbps] = 1457.154
INFO: [main.cpp:172] TIME: [Wed Feb 21 22:37:07 2018] finished
~/aws-fpga/SDAccel/examples/xilinx/acceleration/smithwaterman
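The GCups figures in the log can be reproduced from the other logged values. The check below is a sketch under one assumption: that "giga" means 2^30 cell updates. That reading is inferred from the numbers (it matches the logged figures) and is not confirmed by the source.

```python
samples = 16384          # "Processing 16384 Samples"
ref_len = 256            # "Length of reference string"
read_len = 128           # "Length of read(query) string"
kernel_ms = 43317.977    # "Krnl exec [ms]"
total_ms = 43326.373     # "total [ms]"

# One scoring-matrix cell per (reference character, read character) pair.
cells = samples * ref_len * read_len


def gcups(cell_updates, milliseconds):
    """Giga cell updates per second, with giga taken as 2**30 (assumption)."""
    return cell_updates / 2**30 / (milliseconds / 1000.0)


print(f"kernel: {gcups(cells, kernel_ms):.7f} GCups")
print(f"total:  {gcups(cells, total_ms):.7f} GCups")
```

The two results differ only in the fourth significant digit because the host transfers (about 1.8 ms) are negligible next to the 43 s spent in software emulation of the kernel.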

Smith–Waterman algorithm (hw_emu)
~/aws-fpga/SDAccel/examples/xilinx/acceleration/smithwaterman
xsimk
Generating read-ref samples
Processing 16384 Samples
HW Block Size: 16384
Total Number of blocks: 1
INFO: [SDx-EM 22] [Wall clock time: 23:05, Emulation time: 0.275298 ms] Data transfer between kernel(s) and global memory(s)
BANK0 RD = 64.316 KB WR = 7.875 KB
BANK1 RD = 0.000 KB WR = 0.000 KB
BANK2 RD = 0.000 KB WR = 0.000 KB
BANK3 RD = 0.000 KB WR = 0.000 KB
…. after many hours …
INFO: [SDx-EM 22] [Wall clock time: 00:27, Emulation time: 4.77014 ms] Data transfer between kernel(s) and global memory(s)
BANK0 RD = 1110.004 KB WR = 138.562 KB
BANK1 RD = 0.000 KB WR = 0.000 KB
BANK2 RD = 0.000 KB WR = 0.000 KB
BANK3 RD = 0.000 KB WR = 0.000 KB
….
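For reference, the computation that both emulation runs exercise is the Smith–Waterman local-alignment score. The following is a minimal software sketch, not the Xilinx example's code: the scoring values (match +2, mismatch -1, linear gap -1) and the function name are assumptions chosen for illustration.

```python
def smith_waterman_score(ref, read, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between `ref` and `read`.

    Scoring values are illustrative assumptions. Only two rows of the
    scoring matrix are kept, since each cell depends on its left, upper
    and upper-left neighbours.
    """
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + step, # align the two characters
                prev[j] + gap,      # gap in the reference
                curr[j - 1] + gap,  # gap in the read
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "ACG"))
```

Every cell on one anti-diagonal of the matrix is independent of the others on that diagonal, which is what lets an FPGA design spread the work across many processing elements.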

Building Times
For the helloworld example
INFO: [XOCC 60-629] Linking for hardware target
INFO: [XOCC 60-895] Target platform: /home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0/xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xpfm
INFO: [XOCC 60-423] Target device: xilinx:aws-vu9p-f1:4ddr-xpr-2pr:4.0
INFO: [XOCC 60-251] Hardware accelerator integration...
Creating Vivado project and starting FPGA synthesis.
................................................................................................................................
Finished 1st of 5 tasks (FPGA synthesis). Elapsed time: 00h 34m 54s.
.....
Finished 2nd of 5 tasks (FPGA logic optimization). Elapsed time: 00h 05m 37s.
...............................
Finished 3rd of 5 tasks (FPGA logic placement). Elapsed time: 00h 43m 50s.
................................
Finished 4th of 5 tasks (FPGA routing). Elapsed time: 00h 56m 33s.
INFO: [XOCC 60-586] Created xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin
INFO: [XOCC 60-791] Total elapsed time: 2h 31m 50s
And then you have to build the AFI ...
Give up building the
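The per-stage times in the log can be added up to see where the build time goes. The sketch below parses only lines copied from the log above. It assumes each "Elapsed time" is per stage rather than cumulative, which the non-increasing values suggest; the helper names are hypothetical.

```python
import re

log = """\
Finished 1st of 5 tasks (FPGA synthesis). Elapsed time: 00h 34m 54s.
Finished 2nd of 5 tasks (FPGA logic optimization). Elapsed time: 00h 05m 37s.
Finished 3rd of 5 tasks (FPGA logic placement). Elapsed time: 00h 43m 50s.
Finished 4th of 5 tasks (FPGA routing). Elapsed time: 00h 56m 33s.
INFO: [XOCC 60-791] Total elapsed time: 2h 31m 50s
"""


def to_seconds(h, m, s):
    return int(h) * 3600 + int(m) * 60 + int(s)


def fmt(seconds):
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m:02d}m {s:02d}s"


stage_re = re.compile(r"\((.+?)\)\. Elapsed time: (\d+)h (\d+)m (\d+)s")
stages = {name: to_seconds(h, m, s) for name, h, m, s in stage_re.findall(log)}
total = to_seconds(*re.search(r"Total elapsed time: (\d+)h (\d+)m (\d+)s", log).groups())

listed = sum(stages.values())
for name, secs in stages.items():
    print(f"{name:25s} {fmt(secs)}  {100 * secs / total:4.1f}%")
print("listed stages:", fmt(listed))
print("not itemized: ", fmt(total - listed))
```

Placement and routing together account for well over half of the total; the remainder that is not itemized covers the fifth task, which the log excerpt does not show, plus any overhead.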

Building Times
Good luck building the Smith-Waterman example

Conclusions
● Moderate* costs
● Easy setup with minor issues
● Cloud-based IDE (RDP) or SSH
● Slow development
● Harder to learn than CUDA
● Good documentation and examples
● Marketplace is still small but promising
● No 3rd-party examples
* Moderate cost; for comparison:
$3,500 Xilinx Virtex-7 FPGA VC707 Evaluation Kit
$13,000 Xilinx Virtex-7 FPGA VC7222 Characterization Kit
$1,500 Intel Xeon Phi 7120P Coprocessor
$1,400 Nvidia GeForce Titan X Pascal
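One way to read "moderate" is to ask how many hours of f1.2xlarge time each hardware price would buy. The sketch below uses only the hourly rate from the f1.2xlarge slide and the prices listed above, in whole dollars. It is a rough comparison under those assumptions and ignores storage, build instances and other charges.

```python
F1_2XLARGE_PER_HOUR = 1.6   # USD per hour, as quoted on the f1.2xlarge slide

hardware_prices = {          # USD, as listed above
    "Xilinx Virtex-7 FPGA VC707 Evaluation Kit": 3500,
    "Xilinx Virtex-7 FPGA VC7222 Characterization Kit": 13000,
    "Intel Xeon Phi 7120P Coprocessor": 1500,
    "Nvidia GeForce Titan X Pascal": 1400,
}


def break_even_hours(price, hourly_rate=F1_2XLARGE_PER_HOUR):
    """Hours of cloud FPGA time that cost as much as buying the hardware."""
    return price / hourly_rate


for name, price in hardware_prices.items():
    hours = break_even_hours(price)
    print(f"{name}: {hours:.0f} h (about {hours / 24:.0f} days of continuous use)")
```

For occasional experiments the cloud instance is far cheaper than any of the boards; for an accelerator that runs around the clock, the comparison reverses within a few months.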

FPGA vs GPU
"Accelerating Compute-Intensive Applications with GPUs and FPGAs"
S. Che, J. Li, J. W. Sheaffer, K. Skadron and J. Lach,
2008 Symposium on Application Specific Processors
● CUDA and the GeForce 8800 GTX GPU
● VHDL and the Xilinx Virtex-II Pro FPGA

Are FPGAs and reconfigurable computing the future?
Video on the cloud? Deep learning?
