Design of Programmable Accelerators for SoCs
Gert Goossens, CEO, Target Compiler Technologies
Abstract
For new wireless standards like 3GPP-LTE, general-purpose processors are running out of steam. The conventional wisdom is that accelerators must be added in the form of hardwired datapaths to deliver the required performance. However, a hardwired datapath offers zero flexibility, which reduces the ability to support evolving or multiple standards. We discuss how C-programmable application-specific processors (ASIPs) can replace fixed-function accelerators without sacrificing performance (throughput, power and gate count). We review different approaches for ASIP design. We illustrate our performance claims with examples from the data plane of wireless baseband modems.
Agenda
  • ASIPs as accelerators in SoCs
  • How to design ASIPs
  • Programmable datapath examples
      • WLAN
      • FFT
  • Conclusions
What do you do when the performance of your main processor is insufficient?
  • Go multicore? Application mapping difficult, resource utilisation unbalanced
  • Add hardwired accelerators? Balanced but inflexible SoC
[Figure: SoC design]
What do you do when the performance of your main processor is insufficient?
  • ASIPs: application-specific processors
      • Anything between general-purpose microprocessor and hardwired datapath
      • Flexibility through programmability and design-time reconfigurability
      • High throughput and low energy, through parallelism and specialisation
  • Balanced and flexible SoC
[Figure: SoC design]
How to Design ASIPs?
  • IP Designer tool-suite
How to Design ASIPs?
Design steps, each followed by its benefits:
  • Algorithm defined in C
      • Raise abstraction level from RTL to ESL
      • Connect hardware and algorithm design teams
  • Datapath structure defined in nML
      • Much faster than RTL design, enables rapid architectural exploration
      • Designer is in control; can use architectural knowledge
  • C compiler maps algorithm onto datapath structure; ISS simulates generated code
      • Tools validate designer's assumptions and performance reached
      • Profiling tool guides architectural exploration
      • Easily reprogrammable in case of bug or spec changes
  • RTL generated automatically
      • Error-free
      • Quick feedback on gate count for every design iteration
      • Low-power optimisations inserted automatically
How to Design ASIPs? Benefits
  • Speed-up design: few weeks per ASIP
  • Design exploration: wide architectural scope, based on processor description language
  • Formal approach increases correctness: 40 production chips, 0 bugs
  • Automatic generation of RTL: competitive with hand-coded RTL
  • Automatic generation of SDK: C compiler, "no assembly required"
Tool Comparison
  • Retargetable ASIP design tools
      • Programmable: yes
      • Architectural specialisation: high
      • Resource sharing: yes
      • Business model: EDA license
      • Architectural style: flexible, using processor description language
      • Example vendors: Target (IP Designer), CoWare (Processor Designer)
  • Configurable ASIP templates
      • Programmable: yes
      • Architectural specialisation: low (within template boundaries)
      • Resource sharing: yes
      • Business model: royalties
      • Architectural style: configurable ASIP template + extension instructions
      • Example vendors: Tensilica, ARC, ASIP Solutions, SiliconHive
  • High-level synthesis from C
      • Programmable: no
      • Architectural specialisation: high
      • Resource sharing: depends on tool
      • Business model: EDA license
      • Architectural style: hardwired datapath, no programmability
      • Example vendors: Mentor (CatapultC), Forte, Synfora, Cadence (C2S)
(*) No strong focus for CoWare?
Programmable Datapath Examples
[Figure legend: examples shown; served by IP Designer]
What is a Programmable Datapath?
  • Hardwired datapath
      • Datapath structure (hardware operators and connectivity) mimics the algorithm's data flow
  • Hardwired datapath with resource sharing
      • Superposition of multiple data-flow patterns
      • Hardware saving benefit, if permitted by throughput spec
      • Requires local modifications to datapath structure and addition of small amounts of control
          • Modification of connectivity → multiplexers
          • Modification of operator behaviour → programmable instead of fixed operators
          • Store intermediate data → local register files instead of registers
      • Controlled from FSM
  • Programmable datapath
      • Datapath with resource sharing, controlled from software
      • Microcode in ROM (design-time programmable), or RAM/flash (post-silicon programmable)
[Figure: datapath diagrams with labels SEQ, PM, DEC, s0, s1, s2; example expressions: d+=(a+b)*c; g+=(e-f)*f;]
Prog. Datapath Example: WLAN
  • Algorithm
      • Design by Motorola Labs [1]
      • 802.11n, equalisation
  • Characteristics
      • Matrix calculations
      • Specialised operators in complex domain: cmpy, conjugate, sqmod
      • Equalisation matrix: multiple dataflow patterns depending on MIMO scheme
[Figure: dataflow patterns per MIMO scheme (SDM, Symmetric SDM + STBC, SDM + STBC), with blocks for matrix inversion, address computations, complex conjugate and square modulus]
[1] Medea+ project "Uppermost"
Prog. Datapath Example: WLAN
  • Programmable datapath design
      • Sample expressions: equalisation matrix
      • Sample expression: matrix inversion
      • 4 identical datapaths in SIMD unit
[Figure: Channel Estimation ASIP with common program control, four GMAC units (GMAC 0 to GMAC 3) and four dual-port memories]
Prog. Datapath Example: WLAN
nML code of gmac instruction

    reg  R[8] <vcmpl> read(tR0, tR1, tR2, tR3, tR4, tR5);
    reg  ACC <vcmpl>;
    pipe P0 <vcmpl>;
    pipe P1 <vcmpl>;
    trn  tC0 <vcmpl>;
    trn  tC1 <vcmpl>;
    trn  tM0 <vcmpl>;
    trn  tM1 <vcmpl>;

    enum gmac_op {mpy_mpy_mac, mac, sq_sq_mac, minv, ...};

    opn gmac(g:gmac_op, r0:c3, r1:c3, r2:c3, r3:c3, r4:c3, r5:c3)
    {
      action {
        stage E1:
          switch (g) {
            case mpy_mpy_mac:
              tC0 = ccnj(tR2 = R[r2]);
              P0 = cmpy(tR1 = R[r1], tC0);
              tC1 = ccnj(tR3 = R[r3]);
              P1 = cmpy(tR4 = R[r4], tC1);
            case mac:
              P0 = tR0 = R[r0];
              P1 = tR5 = R[r5];
            case sq_sq_mac:
              P0 = cmpy(tR1 = R[r1], tR2 = R[r1]);
              P1 = cmpy(tR4 = R[r4], tR3 = R[r4]);
            case minv:
              P0 = tR0 = R[r0];
              tM0 = cmpy(tR1 = R[r1], tR2 = R[r2]);
              tM1 = cmpy(tR4 = R[r4], tR3 = R[r3]);
              P1 = csub(tM0, tM1);
            case ...
          }
        stage E2:
          tM = cmpy(P0, P1);
          ACC = cadd(tM, ACC);
      }
    }

[Slide annotations: Resources; Instruction-set grammar]
Prog. Datapath Example: WLAN
  • C compiler uses advanced graph matching techniques to map dataflow patterns on programmable datapath
[Figure: compilation engine (phase coupling). Inputs: application (C) through the C front-end to a CDFG, processor model (nML) through the nML front-end to an ISG. Phases: source-level transformations, code selection, register allocation, scheduling, code emission. Output: machine code (Elf/Dwarf)]
Prog. Datapath Example: FFT
  • Algorithm
      • Decimation in time
      • Radix-2, radix-4, mixed radix
      • Coefficients: complex (16,16)
      • Data: complex (24,24)
Prog. Datapath Example: FFT
  • Programmable datapath design
      • Datapath structure for CMPY and BFLY can be described in nML and exposed to C compiler
      • CMPY and BFLY each implement a single, fixed dataflow pattern, which can alternatively be hidden in an intrinsic function
      • Intrinsic's behaviour is modelled in C, automatically converted to RTL
[Figure: datapath with memories Mdata and Mcoef, register files A[4] and B[4], units CMPY and BFLY, operations ld A/B, ld C, st A/B; operators shown: * * * * - + + + - -]
Prog. Datapath Example: FFT
  • Instruction-level parallelism: ILP=5
  • Efficient register allocation, scheduling and SW pipelining needed
  • E.g. inner loop for radix-4 FFT, compiled code
      • 4 cycles / iteration
      • 100% resource utilisation

    /* 0 */ DO cnt,LE
    /* 1 */ /* delay slot */
    /* 2 */ md=*pa(next_bfly) | *pb(+s)=b1 | mc=*pr(next_bfly_rdx4) | a2=md*mc | b3,b2=bfly(a2,a3)
    /* 3 */ md=*pa(+s) | *pb(+s)=b3 | mc=*pr(+s) | a3=md*mc | b1,a2=bfly(a1,a2)
    /* 4 */ md=*pa(+s) | *pb(+s)=b0 | mc=*pr(+s) | a1=md*mc | b0,a3=bfly(a0,a3)
    /* 5 */ md=*pa(+s) | *pb(next_bfly)=b2 | mc=*pr(+s) | a0=md*mc | b1,b0=bfly(b1,b0)

[Figure: software pipeline diagram with overlapping LDA, LDC, MPY, BFLY and STB stages]
Prog. Datapath Example: FFT
  • C compiler uses advanced graph search techniques to optimise register utilisation and schedule instructions on programmable datapath
[Figure: compilation engine (phase coupling). Inputs: application (C) through the C front-end to a CDFG, processor model (nML) through the nML front-end to an ISG. Phases: source-level transformations, code selection, register allocation, scheduling, code emission. Output: machine code (Elf/Dwarf)]
Prog. Datapath Example: FFT
  • Results
      • Performance
          • Radix-4: 4 cycles/butterfly, radix-2: 2 cycles/butterfly
          • 4096-point FFT (radix-4): 24,671 cycles
          • 2048-point FFT (2x 1024-pt radix-4 + 1x 2048-pt radix-2): 12,288 cycles
      • RTL metrics
          • 26K gates, 123 MHz clock, 130 nm, DesignWare Basic
      • 600 lines of nML code
      • Custom data path, complex butterfly unit
Conclusion
  • ASIPs make accelerators in SoCs programmable
  • With the IP Designer tool-suite, ASIPs can be designed quickly and programmed efficiently
  • "Programmable datapath" ASIPs offer performance, area and power comparable to hardwired accelerators
      • IP Designer as an alternative to high-level synthesis
  • With ASIPs, multicore SoC architectures become even more prolific
