Target updated track f

Design of Programmable Accelerators for SoCs
Gert Goossens, CEO, Target Compiler Technologies

Abstract
For new wireless standards like 3GPP-LTE, general-purpose processors are running out of steam. Conventional wisdom is that accelerators must be added in the form of hardwired datapaths to deliver the required performance. However, a hardwired datapath offers zero flexibility, which reduces the ability to support evolving or multiple standards. We discuss how C-programmable application-specific processors (ASIPs) can replace fixed-function accelerators without sacrificing performance (throughput, power and gate count). We review different approaches to ASIP design. We illustrate our performance claims with examples from the data plane of wireless baseband modems.

Agenda
- ASIPs as accelerators in SoCs
- How to design ASIPs
- Programmable datapath examples
  - WLAN
  - FFT
- Conclusions

What do you do when the performance of your main processor is insufficient?
- Go multicore? Application mapping is difficult, resource utilisation unbalanced
- Add hardwired accelerators? Balanced but inflexible SoC
(Figure: SoC design)

What do you do when the performance of your main processor is insufficient?
- ASIPs: application-specific processors
  - Anything between general-purpose uP and hardwired datapath
  - Flexibility through programmability and design-time reconfigurability
  - High throughput and low energy, through parallelism and specialisation
- Balanced and flexible SoC
(Figure: SoC design)

How to Design ASIPs? IP Designer tool-suite

How to Design ASIPs?
Design step, followed by its benefits:
- Algorithm defined in C
  - Raise abstraction level from RTL to ESL
  - Connect hardware and algorithm design teams
- Datapath structure defined in nML
  - Much faster than RTL design, enables rapid architectural exploration
  - Designer is in control; can use architectural knowledge
- C compiler maps algorithm onto datapath structure; ISS simulates generated code
  - Tools validate designer’s assumptions and performance reached
  - Profiling tool guides architectural exploration
  - Easily reprogrammable in case of bug or spec changes
- RTL generated automatically
  - Error-free
  - Quick feedback on gate count for every design iteration
  - Low-power optimisations inserted automatically

How to Design ASIPs? Benefits
- Speed-up design: a few weeks per ASIP
- Design exploration: wide architectural scope, based on processor description language
- Formal approach increases correctness: 40 production chips, 0 bugs
- Automatic generation of RTL: competitive to hand-coded RTL
- Automatic generation of SDK: C compiler, “no-assembly-required”

Tool Comparison
- Retargetable ASIP design tools
  - Programmable: Yes
  - Architectural specialisation: High
  - Resource sharing: Yes
  - Business model: EDA license
  - Architectural style: Flexible, using processor description language
  - Example vendors: Target (IP Designer), CoWare (Processor Designer)
- Configurable ASIP templates
  - Programmable: Yes
  - Architectural specialisation: Low (within template boundaries)
  - Resource sharing: Yes
  - Business model: Royalties
  - Architectural style: Configurable ASIP template + extension instructions
  - Example vendors: Tensilica, ARC, ASIP Solutions, SiliconHive
- High-level synthesis from C
  - Programmable: No
  - Architectural specialisation: High
  - Resource sharing: Depends on tool
  - Business model: EDA license
  - Architectural style: Hardwired datapath, no programmability
  - Example vendors: Mentor (CatapultC), Forte, Synfora, Cadence (C2S)
(*) No strong focus for CoWare?

Programmable Datapath Examples
(Figure legend: examples shown; served by IP Designer)

What is a Programmable Datapath?
- Hardwired datapath
  - Datapath structure (hardware operators and connectivity) mimics the algorithm’s data flow
- Hardwired datapath with resource sharing
  - Superposition of multiple data-flow patterns
  - Hardware saving benefit, if permitted by throughput spec
  - Requires local modifications to datapath structure and addition of small amounts of control
    - Modification of connectivity: multiplexers
    - Modification of operator behaviour: programmable instead of fixed operators
    - Store intermediate data: local register files instead of registers
  - Controlled from FSM
- Programmable datapath
  - Datapath with resource sharing, controlled from software
  - Microcode in ROM (design-time programmable), or RAM/flash (post-silicon programmable)
(Figure labels: SEQ, PM, DEC; s0, s1, s2; example expressions d+=(a+b)*c; g+=(e-f)*f;)
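The resource-sharing idea on this slide can be sketched in C. This is only an illustrative model, not the deck's implementation: the type `ctrl_word_t`, the function `run_microcode` and the register numbering are hypothetical names introduced here. One shared add/subtract operator, one multiplier and one accumulator path execute both example expressions, d+=(a+b)*c and g+=(e-f)*f, under control of a small microcode table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical control word: selects the operator behaviour and the
 * register-file entries routed to the shared datapath (the "multiplexers"). */
typedef enum { OP_ADD, OP_SUB } alu_op_t;

typedef struct {
    alu_op_t op;  /* programmable operator: add or subtract */
    int src_x;    /* left input of the adder/subtractor */
    int src_y;    /* right input of the adder/subtractor */
    int src_m;    /* second input of the multiplier */
    int dst;      /* register that accumulates the result */
} ctrl_word_t;

/* Execute n control words on one shared datapath: regs[dst] += (x op y) * m. */
void run_microcode(int32_t regs[], int n_regs, const ctrl_word_t *rom, int n)
{
    for (int i = 0; i < n; i++) {
        const ctrl_word_t *cw = &rom[i];
        assert(cw->src_x < n_regs && cw->src_y < n_regs);
        assert(cw->src_m < n_regs && cw->dst < n_regs);

        int32_t x = regs[cw->src_x];
        int32_t y = regs[cw->src_y];
        int32_t s = (cw->op == OP_ADD) ? x + y : x - y;  /* shared add/sub */
        regs[cw->dst] += s * regs[cw->src_m];            /* shared multiply-accumulate */
    }
}
```

With registers ordered a, b, c, d, e, f, g, the table `{OP_ADD, 0, 1, 2, 3}, {OP_SUB, 4, 5, 5, 6}` runs both expressions on the same hardware model. Changing the table changes the computation without touching the datapath, which is the sense in which the datapath is programmable.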

Prog. Datapath Example: WLAN
- Algorithm
  - Design by Motorola Labs [1]
  - 802.11n, equalisation
- Characteristics
  - Matrix calculations
  - Specialised operators in complex domain: cmpy, conjugate, sqmod
  - Equalisation matrix: multiple dataflow patterns depending on MIMO scheme
(Figure labels: SDM; Symmetric SDM + STBC; SDM + STBC; matrix inversion; address computations; complex conjugate; square modulus)
[1] Medea+ project “Uppermost”

Prog. Datapath Example: WLAN
- Programmable datapath design
  - Sample expressions: equalisation matrix
  - Sample expression: matrix inversion
  - 4 identical datapaths in SIMD unit
(Figure: Channel Estimation ASIP with common program control and four GMAC units, GMAC 0 to GMAC 3, each with a dual-port memory)

Prog. Datapath Example: WLAN
nML code of gmac instruction

    reg R[8] <vcmpl> read(tR0, tR1, tR2, tR3, tR4, tR5);
    reg ACC <vcmpl>;
    pipe P0 <vcmpl>;
    pipe P1 <vcmpl>;
    trn tC0 <vcmpl>;
    trn tC1 <vcmpl>;
    trn tM0 <vcmpl>;
    trn tM1 <vcmpl>;

    enum gmac_op {mpy_mpy_mac, mac, sq_sq_mac, minv, ...};

    opn gmac(g:gmac_op, r0:c3, r1:c3, r2:c3, r3:c3, r4:c3, r5:c3)
    {
      action {
        stage E1:
          switch (g) {
            case mpy_mpy_mac:
              tC0 = ccnj(tR2 = R[r2]);
              P0 = cmpy(tR1 = R[r1], tC0);
              tC1 = ccnj(tR3 = R[r3]);
              P1 = cmpy(tR4 = R[r4], tC1);
            case mac:
              P0 = tR0 = R[r0];
              P1 = tR5 = R[r5];
            case sq_sq_mac:
              P0 = cmpy(tR1 = R[r1], tR2 = R[r1]);
              P1 = cmpy(tR4 = R[r4], tR3 = R[r4]);
            case minv:
              P0 = tR0 = R[r0];
              tM0 = cmpy(tR1 = R[r1], tR2 = R[r2]);
              tM1 = cmpy(tR4 = R[r4], tR3 = R[r3]);
              P1 = csub(tM0, tM1);
            case ...
          }
        stage E2:
          tM = cmpy(P0, P1);
          ACC = cadd(tM, ACC);
      }
    }

(Callouts: Resources; Instruction-set grammar)
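To make the two-stage behaviour of the gmac instruction easier to follow, here is a rough behavioural model in C. It is a sketch under assumptions, not the generated hardware: the source type `vcmpl` is modelled as a single complex value with 32-bit integer parts and no overflow handling, the pipeline registers P0 and P1 become local variables, and the names `cplx_t`, `gmac_op_t` and the array `r` (standing for the operands r0 to r5) are introduced here for illustration. Only the four operations spelled out on the slide are modelled.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t re, im; } cplx_t;  /* assumed scalar stand-in for vcmpl */

static cplx_t cadd(cplx_t a, cplx_t b) { cplx_t z = { a.re + b.re, a.im + b.im }; return z; }
static cplx_t csub(cplx_t a, cplx_t b) { cplx_t z = { a.re - b.re, a.im - b.im }; return z; }
static cplx_t ccnj(cplx_t a)           { cplx_t z = { a.re, -a.im }; return z; }
static cplx_t cmpy(cplx_t a, cplx_t b)
{
    cplx_t z = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return z;
}

typedef enum { MPY_MPY_MAC, MAC, SQ_SQ_MAC, MINV } gmac_op_t;

/* One gmac instruction: stage E1 forms P0 and P1 according to the opcode,
 * stage E2 multiplies them and adds the product to the accumulator. */
cplx_t gmac(gmac_op_t g, const cplx_t R[8], const int r[6], cplx_t acc)
{
    cplx_t p0 = { 0, 0 }, p1 = { 0, 0 };

    for (int i = 0; i < 6; i++)
        assert(r[i] >= 0 && r[i] < 8);  /* 3-bit register indices */

    switch (g) {                         /* stage E1 */
    case MPY_MPY_MAC:
        p0 = cmpy(R[r[1]], ccnj(R[r[2]]));
        p1 = cmpy(R[r[4]], ccnj(R[r[3]]));
        break;
    case MAC:
        p0 = R[r[0]];
        p1 = R[r[5]];
        break;
    case SQ_SQ_MAC:
        p0 = cmpy(R[r[1]], R[r[1]]);
        p1 = cmpy(R[r[4]], R[r[4]]);
        break;
    case MINV:
        p0 = R[r[0]];
        p1 = csub(cmpy(R[r[1]], R[r[2]]), cmpy(R[r[4]], R[r[3]]));
        break;
    }
    return cadd(cmpy(p0, p1), acc);      /* stage E2 */
}
```

All four cases share the same multipliers and the same stage-E2 multiply-accumulate; only the routing in stage E1 differs, which is what lets one instruction format cover several dataflow patterns.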

Prog. Datapath Example: WLAN
- C compiler uses advanced graph matching techniques to map dataflow patterns on programmable datapath
(Figure: compilation engine with phase coupling. Inputs: application (C) and processor model (nML). The C front-end produces a CDFG, the nML front-end produces an ISG. Phases: source-level transformations, code selection, register allocation, scheduling, code emission. Output: machine code (Elf/Dwarf).)

Prog. Datapath Example: FFT
- Algorithm
  - Decimation in time
  - Radix-2, radix-4, mixed radix
  - Coefficients: complex (16,16)
  - Data: complex (24,24)

Prog. Datapath Example: FFT
- Programmable datapath design
  - Datapath structure for CMPY and BFLY can be described in nML and exposed to C compiler
  - CMPY and BFLY each implement a single, fixed dataflow pattern, which can alternatively be hidden in an intrinsic function
  - Intrinsic’s behaviour is modelled in C, automatically converted to RTL
(Figure labels: Mdata, Mcoef, A[4], B[4], CMPY, BFLY, ld A/B, ld C, st A/B; operators * * * * - + + + - -)
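As an illustration of what a C behavioural model of the two intrinsics might look like, the sketch below models CMPY as a complex data-times-coefficient product and BFLY as a complex sum and difference. Several details are assumptions not stated on the slides: the type names `coef_t` and `data_t`, holding the 24-bit data parts in 32-bit fields, the Q15 scaling of the coefficients, truncation toward zero, and the absence of saturation.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int16_t re, im; } coef_t;  /* complex (16,16) coefficient */
typedef struct { int32_t re, im; } data_t;  /* complex (24,24) data, held in 32-bit fields */

/* CMPY: four multiplications, one subtraction, one addition.
 * The coefficient is assumed to be Q15, so the product is divided by 2^15. */
data_t cmpy(data_t d, coef_t c)
{
    int64_t re = (int64_t)d.re * c.re - (int64_t)d.im * c.im;
    int64_t im = (int64_t)d.re * c.im + (int64_t)d.im * c.re;
    data_t out = { (int32_t)(re / 32768), (int32_t)(im / 32768) };
    return out;
}

/* BFLY: complex sum and difference of two inputs (two adds, two subtracts). */
void bfly(data_t x, data_t y, data_t *sum, data_t *diff)
{
    assert(sum && diff);
    sum->re  = x.re + y.re;
    sum->im  = x.im + y.im;
    diff->re = x.re - y.re;
    diff->im = x.im - y.im;
}
```

Because each function is one fixed dataflow pattern, it can be presented to the compiler as a single operation, which matches the `a2=md*mc` and `b3,b2=bfly(a2,a3)` slots seen in the compiled inner loop.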

Prog. Datapath Example: FFT
- Instruction-level parallelism: ILP=5
- Efficient register allocation, scheduling and SW pipelining needed
- E.g. inner loop for radix-4 FFT
- Compiled code: 4 cycles/iteration, 100% resource utilisation

    /* 0 */ DO cnt,LE
    /* 1 */ /* delay slot */
    /* 2 */ md=*pa(next_bfly) | *pb(+s)=b1        | mc=*pr(next_bfly_rdx4) | a2=md*mc | b3,b2=bfly(a2,a3)
    /* 3 */ md=*pa(+s)        | *pb(+s)=b3        | mc=*pr(+s)             | a3=md*mc | b1,a2=bfly(a1,a2)
    /* 4 */ md=*pa(+s)        | *pb(+s)=b0        | mc=*pr(+s)             | a1=md*mc | b0,a3=bfly(a0,a3)
    /* 5 */ md=*pa(+s)        | *pb(next_bfly)=b2 | mc=*pr(+s)             | a0=md*mc | b1,b0=bfly(b1,b0)

(Figure: software-pipelined schedule of the LDA, LDC, MPY, BFLY and STB operations across loop iterations)

Prog. Datapath Example: FFT
- C compiler uses advanced graph search techniques to optimise register utilisation and schedule instructions on programmable datapath
(Figure: compilation engine with phase coupling. Inputs: application (C) and processor model (nML). The C front-end produces a CDFG, the nML front-end produces an ISG. Phases: source-level transformations, code selection, register allocation, scheduling, code emission. Output: machine code (Elf/Dwarf).)

Prog. Datapath Example: FFT
- Results
  - Performance
    - Radix-4: 4 cycles/butterfly, radix-2: 2 cycles/butterfly
    - 4096-point FFT (radix-4): 24,671 cycles
    - 2048-point FFT (2x 1024-pt radix-4 + 1x 2048-pt radix-2): 12,288 cycles
  - RTL metrics
    - 26K gates, 123 MHz clock, 130 nm, DesignWare Basic
    - 600 lines of nML code
    - Custom data path, complex butterfly unit
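The reported cycle counts can be cross-checked against the per-butterfly figures with a little arithmetic. The sketch below counts only ideal inner-loop cycles, assuming an N-point radix-r stage contains N/r butterflies and that a radix-4 FFT of size N has log4(N) stages; the function names are hypothetical. Under these assumptions the 2048-point mixed-radix case gives exactly 12,288 cycles, and the 4096-point case gives 24,576 cycles, 95 fewer than the reported 24,671. The difference is presumably overhead outside the inner loops, which the slides do not break down.

```c
#include <assert.h>

/* Ideal cycles for one FFT stage over n_points:
 * (n_points / radix) butterflies at 4 cycles (radix-4) or 2 cycles (radix-2) each. */
unsigned long stage_cycles(unsigned long n_points, unsigned radix)
{
    assert(radix == 2 || radix == 4);
    unsigned cycles_per_bfly = (radix == 4) ? 4 : 2;
    return (n_points / radix) * cycles_per_bfly;
}

/* Ideal cycles for a pure radix-4 FFT; n_points must be a power of 4. */
unsigned long radix4_fft_cycles(unsigned long n_points)
{
    unsigned long stages = 0;
    for (unsigned long m = n_points; m > 1; m /= 4)
        stages++;                       /* counts log4(n_points) stages */
    return stages * stage_cycles(n_points, 4);
}
```

For example, `radix4_fft_cycles(4096)` evaluates 6 stages of 1024 butterflies at 4 cycles each, and `2 * radix4_fft_cycles(1024) + stage_cycles(2048, 2)` evaluates 2 x 5,120 + 2,048 for the mixed-radix 2048-point transform.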

Conclusion
- ASIPs make accelerators in SoCs programmable
- With the IP Designer tool-suite, ASIPs can be designed quickly and programmed efficiently
- “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators
  - IP Designer as an alternative to high-level synthesis
- With ASIPs, multicore SoC architectures become even more prolific
