WELCOME
MAHA : An Energy Efficient Malleable
Hardware Accelerator For Data
Intensive Applications
Grace Abraham
Roll No: 01
VLSI & ES
CONTENTS
Dept. of ECE 3
MAHA : Malleable Hardware Accelerator
29/07/2015
• INTRODUCTION
• BACKGROUND AND MOTIVATION
• MAHA - OVERALL APPROACH
• NAND FLASH – A CASE STUDY
• SOFTWARE ARCHITECTURE
• RESULTS
• CONCLUSION
INTRODUCTION
• In nanometer technologies, power has emerged as the primary
design constraint
• Ever-increasing demand for low power and high performance
• The Von-Neumann bottleneck (back & forth data transfer) is a barrier to
performance & energy scaling
• Explicit parallelism is used to improve efficiency
• Energy overhead due to data transfer from off-chip to on-chip
memory:
 Low Bandwidth
 High latency
 High energy
• To overcome this, a Malleable Hardware Accelerator (MAHA) is
introduced
• MAHA:
 Implements a reconfigurable computing fabric in the last level of memory
 Enables computing within off-chip memory
Fig 1 : Von-Neumann bottleneck and proposed MAHA framework
• Choice of NAND flash technology for demonstration
• Previous investigations on Processing in memory (PIM)
• MAHA differs from PIM architecture
 Achieves on-demand computation through design modifications to the
off-chip nonvolatile memory organization
 High energy efficiency through parallelism & dynamic customization
• MAHA for data intensive applications
• Area and energy overheads are accurately estimated
• An efficient software flow for mapping applications to MAHA is
presented
• The following sections include:
 Von-Neumann bottleneck barrier
 Introduces MAHA & its hardware architecture
 Realization with a CMOS compatible NAND flash memory
 Evaluation results for MAHA
BACKGROUND & MOTIVATION
• PERFORMANCE BARRIER DUE TO VON-NEUMANN BOTTLENECK
• ENERGY BARRIER FOR DATA-INTENSIVE APPLICATIONS
 Off-chip BW scales poorly in comparison with on-chip transistor density
 On-chip density is likely to improve by 16X from 2011 to 2022
 Off-chip BW is expected to improve by only 40%
 BW available inside the flash array is 4.2×10⁵ GB/s; in contrast, the 16-bit
flash interface delivers only 100 MB/s
 Memory latency and energy must be managed to achieve energy efficiency
 To identify major hurdles to energy scaling:
o Performance of ten common kernels was simulated
o System-level performance metrics, such as cache hit/miss frequency, were noted
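The bandwidth figures above imply a gap of several orders of magnitude between the array-internal bandwidth and the external interface. A back-of-envelope check, using the slide's figures (not measurements):

```python
# Back-of-envelope check of the bandwidth gap quoted above.
# Figures come from the slide, not from measurement.
internal_bw_gb_s = 4.2e5      # bandwidth inside the flash array, in GB/s
interface_bw_gb_s = 0.1       # 16-bit flash interface: 100 MB/s = 0.1 GB/s

gap = internal_bw_gb_s / interface_bw_gb_s
print(f"internal/interface bandwidth ratio = {gap:.1e}")  # ~4.2e+06
```

A gap of roughly four million times is why computing next to the array, rather than shipping data across the interface, is attractive.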
 From Table 1:
o 73% of total energy expended is contributed by access to on-chip instruction & data
cache
o 26% invested in useful computations, including fetch and decode operations
Table 1 : Energy breakdown for a conventional processor executing common computational kernels
• MITIGATING VON-NEUMANN BOTTLENECK THROUGH IN-
MEMORY COMPUTING
 75% of energy in a processor is dissipated in data transport
 Optimizing the compute model for data-intensive tasks can yield
large improvements in energy efficiency
 Two implications for compute model
o Relocate compute resources closer to last level of nonvolatile storage
o Minimizes overhead for data transfer to on-chip execution units
o Replace conventional software pipeline & caches with distributed memory
infrastructure
o Minimizes memory & interconnect power dissipation
MAHA-OVERALL APPROACH
 HARDWARE ARCHITECTURE
• MAHA is a hardware reconfigurable framework
• Consists of an array of processing elements (PEs)
• Communication using a hierarchical interconnect architecture
• The target application to be mapped is represented as a control &
data flow graph (CDFG)
• The software flow partitions the CDFG into smaller multiple-input
multiple-output tasks
• Tasks are mapped to individual PEs
1) COMPUTE LOGIC
 Each compute block or PE is referred to as a memory logic block (MLB)
 A single MLB includes a dense 2-D memory array that stores lookup
tables & data
 A custom datapath with arithmetic units
 A local register file for storing temporary outputs from memory
 The sequence of operations inside an MLB is controlled by a μ-code
controller referred to as a schedule table
2) INTERCONNECT FABRIC
 Tasks mapped to different MLBs communicate via a programmable &
hierarchical interconnect
 The interconnect is time-multiplexed & shared among multiple MLBs
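The schedule-table idea above can be sketched in a few lines. All names and the toy AND lookup table below are hypothetical; the real MLB has a full memory array of LUTs and a custom datapath:

```python
# Minimal sketch of a schedule-table-driven MLB cycle (hypothetical model).
# A lookup table replaces gate-level logic: the operands index into it.
LUT = {("AND", 0, 0): 0, ("AND", 0, 1): 0, ("AND", 1, 0): 0, ("AND", 1, 1): 1}

schedule_table = [                 # one micro-coded entry per cycle:
    ("AND", "r0", "r1", "r2"),     # (op, source reg a, source reg b, dest reg)
    ("AND", "r2", "r1", "r3"),
]

def run_mlb(regs):
    """Execute the schedule against the MLB's local register file."""
    for op, a, b, dest in schedule_table:
        regs[dest] = LUT[(op, regs[a], regs[b])]   # LUT-based compute step
    return regs

regs = run_mlb({"r0": 1, "r1": 1, "r2": 0, "r3": 0})
print(regs["r3"])  # 1
```

The point of the schedule table is that the same fabric is reprogrammed simply by rewriting these entries, with no change to the hardware.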
Fig 2 : (a) Application mapping flow for MAHA
(b) μ-arch details of a single computing block (MLB)
(c) Synchronization among multiple MLBs over shared interconnect
 Sig1 & Sig2 are outputs of MLB A & B at end of cycle 1
 Sig3 & Sig4 are outputs at end of cycle 2
 Signals at the end of each cycle are transmitted over the same local/global
bus to MLB C
 Significant gains in energy efficiency can be obtained by computing
inside the NVM
 MAHA is an attractive low-overhead & energy efficient candidate for
in-memory computing
 In NVM-based MAHA model,
o Multiple NVM arrays are grouped to form a single MLB
o Each MLB processes its local data & communicates with other MLBs
o Distribution of data to multiple MLBs through flash translation layer for mapping
logical address to a physical location in NVM
o Static CMOS logic integrated with NVM to realize MLB
 COMPARISON WITH ALTERNATE ACCELERATORS
• Computing Model
 Frameworks without inherent hardware support for spatio-temporal
computing: FPGA, Chimaera, PipeRench & RaPiD
 Frameworks that support spatio-temporal execution: MATRIX,
MorphoSys
 MAHA is also a spatio-temporal computing framework
• Granularity of computations
 Defined as the width of the smallest PE
 Based on granularity, frameworks are classified as:
o Fine-grained
o Coarse-grained
o Mixed granular
 MAHA is a mixed-granular computing framework
• Computing Fabric
 Hardware accelerators proposed earlier used fine-grained 1-D lookup
tables
 MAHA uses memory for storage & for mapping one or more
multiple-input multiple-output LUTs
• Target Application Domain
 Hardware accelerators proposed earlier target a wide application
space: bit-level computations, signal processing, image processing
 MAHA improves system energy for a variety of data-intensive
applications
NAND FLASH – A CASE STUDY
• Hardware architecture for an off-chip MAHA framework based
on a CMOS-compatible single-level cell (SLC) NAND flash memory
array
• CMOS compatibility allows:
 Integration of MLB controllers, registers, datapath and PI
 Realization using CMOS logic
• SLC is considered due to the availability of open-source area,
power & delay models
• OVERVIEW OF CURRENT FLASH ORGANISATION
 NAND flash memory is organized as a flash array plus a number of
peripheral logic structures
 For a normal flash read:
o 8-b or 16-b I/O bandwidth
o Organized in units of pages & blocks
o Page size – 2 KB
o Each block has 64-128 pages
o The block decoder first selects one of the blocks
o The page decoder selects one of the pages
o The content of the entire page is first read into the page register
o It is then transferred to the flash external interface
Table 2 : Flash organization and performance
Figure 3: Modifications to conventional flash memory to realize MAHA framework.
A small control engine outside the memory array is added to initiate & synchronize parallel operations
inside the memory array
• MODIFICATIONS TO FLASH ARRAY ORGANIZATION
 Modifications to achieve on-demand computation
 Without affecting normal read/write operation
1) Compute Logic Modifications
o A group of N flash blocks is clustered to form a single MLB
o In MLB, blocks are logically divided into LUT blocks & data blocks
o MLB control logic & custom datapath implemented using static CMOS logic
o A custom dual ported asynchronous read register file for storing intermediate
outputs
o Pass-gate multiplexers & a keeper transistor are used for selecting operands
for the LUT
o For Normal NAND flash read, entire page is read at once (2KB)
o For LUT operations, a wide read is avoided due to the smaller operand sizes
o We propose a narrow-read scheme for LUT blocks in which a fraction of a
page is read at a time
o Word-line segmentation incurs hardware overhead
o To minimize this overhead, we read only 64-b words from each block at a time
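A minimal model of the narrow-read scheme, assuming the 2 KB page and 64-bit narrow reads stated above (the circuit-level word-line segmentation itself is not modeled):

```python
# Sketch of the narrow-read idea (hypothetical model, not the circuit).
# A normal flash read returns the whole 2 KB page; a narrow read returns
# one 64-bit word, so far less bit-line/word-line capacitance toggles.
PAGE_BYTES = 2048
WORD_BYTES = 8                        # 64-bit narrow read

def full_page_read(page: bytes) -> bytes:
    """Normal NAND flash read: the entire page at once."""
    assert len(page) == PAGE_BYTES
    return page

def narrow_read(page: bytes, word_idx: int) -> bytes:
    """Return just one 64-bit word of the page (segmented word line)."""
    start = word_idx * WORD_BYTES
    return page[start:start + WORD_BYTES]

page = bytes(range(256)) * 8          # a dummy 2 KB page
print(len(full_page_read(page)), len(narrow_read(page, 3)))  # 2048 8
```

The energy argument follows directly: a LUT operand needs only one word, so reading 8 bytes instead of 2048 avoids activating the full page.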
o Advantage – it improves energy efficiency by lowering the word-line capacitance
o Combinational logic is used to switch between the narrow read for MAHA
operation & the full-page read for normal flash operation
o This logic, together with a narrow-read decoder, controls the AND gates used for segmentation
o Segmentation for data blocks is coarser, with 4096 bits being read out
from each page and stored inside buffers
o A group of such LUT and data blocks constitutes one MLB
o Two planes of the flash array are logically divided into 8 banks, each consisting of
2 MLBs
o Each MLB contains
a. 256 blocks of flash memory
b. 1 LUT block
c. 255 data blocks
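The organization figures above can be sanity-checked against each other (slide figures only):

```python
# Sanity-check of the MLB organization quoted above (slide figures).
banks = 8                     # two planes logically divided into 8 banks
mlbs_per_bank = 2
blocks_per_mlb = 256          # per MLB: 1 LUT block + the rest data blocks

total_mlbs = banks * mlbs_per_bank
data_blocks_per_mlb = blocks_per_mlb - 1
print(total_mlbs, data_blocks_per_mlb)  # 16 255
```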
Figure 4: Modified flash memory array for on-demand reconfigurable computing.
The memory blocks are augmented with local control and compute logic to act as a
hardware reconfigurable unit
2) Routing logic modifications
o Each block communicates with the page register over a shared bus
o To minimize the inter-MLB PI overhead, a set of hierarchical buses is used,
with a multiplexer at each level to select the source of incoming data
o Four levels of hierarchy – banks, sub-banks, subarrays & MLBs
Figure 5 : Hierarchical interconnect architecture to connect a group of MLBs
SOFTWARE ARCHITECTURE
• Figure 6 shows the application mapping flow for the proposed
acceleration platform
• The mapper (application mapping tool) was developed in C
• Key features of the software flow are:
1) Description of the input application using an ISA
 An instruction set is defined for the proposed MAHA framework that
supports common control as well as data flow operations
 Operation types supported by the software architecture:
o bitswC
o bits
o mult
o shift and rotate
o sel
o complex
o load & store
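One way to picture the operation classes is as a decoder-side tag attached to each operation. The mapping below is purely illustrative; only the class names come from the slide, and which host operations fall into each class is an assumption:

```python
# Illustrative tagging of operations into the MAHA op classes listed above.
# Class names are from the slide; the specific op-to-class mapping is hypothetical.
OP_CLASSES = {
    "and": "bitswC", "xor": "bitswC",
    "popcount": "bits",
    "mul": "mult",
    "shl": "shift and rotate", "ror": "shift and rotate",
    "mux": "sel",
    "crc": "complex",
    "ld": "load & store", "st": "load & store",
}

def classify(op: str) -> str:
    # Assume unrecognized operations fall back to the custom-datapath class.
    return OP_CLASSES.get(op, "complex")

print(classify("mul"), classify("ror"))  # mult shift and rotate
```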
Figure 6 : Application mapping flow for proposed MAHA framework
2) Application Mapping to a mixed-granular time-multiplexed
computing fabric
 The mapping process includes two key contributions:
1) Decomposition of fine & coarse grained operations
o During decomposition of load/store operation, memory is allocated in 1
or more MLBs depending on the address size used for load/store & no.
of data blocks present inside each MLB
2) Fusing multiple LUT as well as custom datapath operations
o 3 fusion routines
1) Fusion of random LUT based operations
2) Fusion of bit-sliceable operations
3) Fusion of custom-datapath operations
o In all these, decomposed CDFG is first partitioned into 1 or more vertices
3) Placement & routing for hierarchical interconnect model :
 The software tool places the MLBs in a hierarchical fashion such that
the number of inputs & outputs crossing each module is minimized
 In the bi-partitioning approach, MLBs are first allocated to the first-
level modules, then distributed among second-level modules
 This continues until each MLB has been mapped to the
lowermost memory module
 Routing of signals in the CDFG is performed in the following order
1) Routing of signals which cross each level of the memory hierarchy
2) Routing of primary outputs from each MLB for all levels of the cyclic
schedule
3) Routing of primary inputs to each MLB for all levels of the cyclic
schedule
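The recursive bi-partitioning placement above can be sketched as a toy routine. This greedy version only splits in order and counts cut edges rather than minimizing them, unlike the real placement tool; all names are hypothetical:

```python
# Toy sketch of recursive bi-partitioning placement (hypothetical, greedy).
# Real placement would search for the split that minimizes crossing signals.
def bipartition(mlbs, edges):
    """Split MLB ids into two halves and count edges crossing the cut."""
    half = len(mlbs) // 2
    left, right = set(mlbs[:half]), set(mlbs[half:])
    cut = sum(1 for a, b in edges if (a in left) != (b in left))
    return sorted(left), sorted(right), cut

def place(mlbs, edges, depth=2):
    """Recurse level by level until each leaf module holds one MLB."""
    if depth == 0 or len(mlbs) <= 1:
        return mlbs
    left, right, _ = bipartition(mlbs, edges)
    return [place(left, edges, depth - 1), place(right, edges, depth - 1)]

tree = place([0, 1, 2, 3], edges=[(0, 1), (2, 3), (1, 2)])
print(tree)  # [[[0], [1]], [[2], [3]]]
```

Each nesting level of the result corresponds to one level of the memory hierarchy, which is exactly the structure the router then wires signals across.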
4) Functional validation of the proposed framework :
 The bit-file generation routine accepts the placed & routed netlist &
generates the control or select bits for the following
1) Configuration for programmable switches
2) Schedule table entries which control the sequence of
operations inside each MLB
3) LUT entries to be loaded into the function table
 Bit file generated by the tool can be directly loaded into the
function table
RESULTS
A. Design space exploration for MAHA
B. Energy, Performance, and Overhead estimation
 Estimate the design overhead for the entire MLB as well as for the inter-MLB PI
 Map the benchmark applications to the MAHA framework
 Calculate the area overhead, performance, and energy
requirements for each configuration & select the best configuration
 Cycle time of 20 ns for MAHA operation = bitline precharge time (12 ns) +
intra-MLB delay (3 ns) + inter-MLB signal propagation time (5 ns)
 Area of a single block of the flash array = 5·F² × (N_pages) × (page size)
 Since the LUT block is separate from the data blocks, its area overhead is different
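Worked numbers for the cycle-time and area expressions above. The feature size F is not given on the slide, so 45 nm is an illustrative assumption:

```python
# Worked numbers for the timing and area expressions above.
# F (feature size) is an illustrative assumption; other figures are from the slide.
F = 45e-9                                  # assumed feature size, metres
n_pages = 64                               # pages per block (slide: 64-128)
page_bits = 2048 * 8                       # 2 KB page, in bits

cycle_ns = 12 + 3 + 5                      # precharge + intra-MLB + inter-MLB
block_area_m2 = 5 * F**2 * n_pages * page_bits   # 5*F^2 per flash cell

print(cycle_ns)                            # 20
print(f"block area = {block_area_m2 * 1e12:.0f} um^2")
```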
 The parameters noted are:
o Area overhead
o Latency
o Number of MLBs required to map the application
o Total energy dissipation in the MLBs
o Area & energy for inter-MLB PI
o Size of reconfiguration data
C. Selection of optimal MAHA configuration
o Final configuration
Figure 7: (a) Relative contribution of different components to the total area of the modified
flash (b) Relative contribution of memory & logic components
D. Energy & performance for mapped applications
 Mapping results for a single CDFG instantiation for each of the selected
benchmarks mapped to final MAHA hardware configuration
 For MAHA, the average PI energy is low compared with the average MLB
logic energy
E. Comparison with a conventional GPP
1) Reduction in On-chip & off-chip communication
2) Improvement in execution latency
3) Improvement in energy
4) Improvement in EDP
F. Comparison with FPGA & GPU
G. Hardware emulation based validation
 On average, MAHA reduces the energy requirement by 74% & 84%
relative to FPGA & GPU frameworks, respectively
 MAHA eliminates the high energy overhead of transferring data from off-
chip memory to an FPGA or GPU
 We developed an FPGA-based emulation framework, which validates:
1) Functionality & synchronization of multiple MLBs for several
application kernels
2) Interfacing the MAHA framework with the host processor
 The emulation framework consists of 2 FPGA boards: a DE0 running a host
CPU, & a DE4 consisting of 3 main components
o MAHA framework
o Flash controller
o on board flash memory
 The two boards communicate over a 3-wire SPI in a simple master/slave
configuration
 The slave queries the flash for all available kernels &, upon finding a match,
begins transferring the configuration bits & data for processing to the MAHA
framework
 If no match is found, the slave immediately responds with an error code
 Otherwise, the slave interrupts the host CPU only upon completion
Figure 8 : (a) Overview for off-chip acceleration with MAHA framework
(b)System architecture for FPGA- based hardware emulation framework
(c) Improvement in latency & energy with MAHA –based off-chip acceleration
DISCUSSION
 Before mapping a kernel to an in-memory accelerator, key application &
system parameters can be used to determine whether it will benefit from in-
memory acceleration. These are listed below:
1) g — fraction of total instructions with memory references (loads and stores);
2) f — fraction of total instructions transferred to the off-chip compute engine;
3) c — fraction of instructions translated from the host's ISA to the ISA of the off-chip
compute framework;
4) o — fraction of original instructions that result in an output. A fraction f × c × o
thus produces outputs, which need to be transferred to the host processor;
5) e_offchip — average energy per instruction in the off-chip compute engine;
6) e_txfer — energy expended in the transfer of an output from the off-chip framework
to the host processor;
7) t_offchip — ratio of the cycle time of the off-chip compute framework to that of the host
processor;
8) n — fraction of speedup due to parallelism in the framework;
9) t_txfer — time taken, in processor clock cycles, to transfer an output from the
off-chip compute framework to the host processor.
 T_sys = T_offchip + T_proc + T_txfer
 E_sys = E_offchip + E_proc + E_txfer
Figure 9 : Energy & performance for a hybrid system with a host processor &
off-chip memory based hardware accelerator
CONCLUSION
• MAHA, a hardware acceleration framework, is presented
• It greatly improves energy efficiency for data-intensive applications by
transferring computing kernels to the last level of memory
• Design considerations for modifying an SLC NAND flash memory for on-demand
reconfigurable computing are presented
• Improvement in energy efficiency
• Better efficiency compared with FPGA & GPU
• Future research efforts can be directed toward optimizing the MLB
architecture, interconnect topology & mapper software
REFERENCES
 S. Paul, A. Krishna, W. Qian, R. Karam, and S. Bhunia, "MAHA: An
energy-efficient malleable hardware accelerator for data-intensive
applications"
 V. Govindaraju, C.-H. Ko, and K. Sankaralingam, "Dynamically specialized
datapaths for energy efficient computing," in Proc. IEEE 17th Int.
Symp. High Perform. Comput. Archit. (HPCA), Feb. 2011, pp. 503–514
and more...
Dept. of ECE 41
MAHA : Malleable Hardware Accelerator
29/07/2015
THANK YOU
QUERIES?
More Related Content

PDF
Different Approaches in Energy Efficient Cache Memory
PDF
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
PPT
Runtime Reconfigurable Network-on-chips for FPGA-based Devices
PPT
Runtime Reconfigurable Network-on-chips for FPGA-based Systems
PDF
Efficient Data Center Virtualization with QLogic 10GbE Solutions from HP
PDF
Storage Networking Solutions for High Performance Databases by QLogic
PDF
Guidelines for-early-power-analysis
PPTX
MapReduce
Different Approaches in Energy Efficient Cache Memory
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
Runtime Reconfigurable Network-on-chips for FPGA-based Devices
Runtime Reconfigurable Network-on-chips for FPGA-based Systems
Efficient Data Center Virtualization with QLogic 10GbE Solutions from HP
Storage Networking Solutions for High Performance Databases by QLogic
Guidelines for-early-power-analysis
MapReduce

What's hot (13)

PDF
White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian...
 
PDF
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
PDF
Synergistic processing in cell's multicore architecture
PDF
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
PDF
SPE effiency on modern hardware paper presentation
PDF
Accelerix ISSCC 1998 Paper
PDF
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PDF
Greenplum: Driving the future of Data Warehousing and Analytics
PDF
Greenplum Database Overview
 
PDF
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
PDF
Architectures for parallel
PDF
Greenplum Database on HDFS
PPT
Oracle real application_cluster
White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian...
 
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Synergistic processing in cell's multicore architecture
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
SPE effiency on modern hardware paper presentation
Accelerix ISSCC 1998 Paper
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
Greenplum: Driving the future of Data Warehousing and Analytics
Greenplum Database Overview
 
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architectures for parallel
Greenplum Database on HDFS
Oracle real application_cluster
Ad

Viewers also liked (11)

PPTX
Liquor detection through Automatic Motor locking system ppt
PPT
Automatic room light controller with bidirectional visitor counter
PPTX
Latest ECE Projects Ideas In Various Electronics Technologies
PDF
Project report on self compacting concrete
PDF
Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...
PPS
Schindler case study
PDF
wireless charging of mobile phones using microwave full seminar report
PPTX
OLED 2014 PPT
PPTX
Automatic irrigation 1st review(ieee project ece dept)
PPTX
Artificial eye
PPTX
Wireless charging of mobilephones
Liquor detection through Automatic Motor locking system ppt
Automatic room light controller with bidirectional visitor counter
Latest ECE Projects Ideas In Various Electronics Technologies
Project report on self compacting concrete
Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...
Schindler case study
wireless charging of mobile phones using microwave full seminar report
OLED 2014 PPT
Automatic irrigation 1st review(ieee project ece dept)
Artificial eye
Wireless charging of mobilephones
Ad

Similar to Maha an energy efficient malleable hardware accelerator for data intensive applications (20)

PDF
OpenHPI - Parallel Programming Concepts - Week 4
PDF
Refactoring Applications for the XK7 and Future Hybrid Architectures
PDF
OpenPOWER Summit 2020 - OpenCAPI Keynote
PDF
Processing-in-Memory
PDF
6 open capi_meetup_in_japan_final
PDF
“Addressing Evolving AI Model Challenges Through Memory and Storage,” a Prese...
PPTX
Seminario utovrm
PDF
I understand that physics and hardware emmaded on the use of finete .pdf
PDF
Challenges in Embedded Computing
PPT
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
PDF
VLSI- An Automotive Application Perspective
PDF
Harnessing the Killer Micros
PPTX
Abhaycavirtual memory and the pagehit.pptx
PPT
Power Point Presentation on Virtual Memory.ppt
PDF
Heterogeneous Computing : The Future of Systems
PPTX
Os Module 4_Virtual Memory Management.pptx
PDF
Architecture_L5 (3).pdf wwwwwwwwwwwwwwwwwwwwwwwwwww
PDF
Computer architecture abhmail
PPTX
Ram and types of ram.Cache
PDF
Memory-Driven Near-Data Acceleration and its application to DOME/SKA
OpenHPI - Parallel Programming Concepts - Week 4
Refactoring Applications for the XK7 and Future Hybrid Architectures
OpenPOWER Summit 2020 - OpenCAPI Keynote
Processing-in-Memory
6 open capi_meetup_in_japan_final
“Addressing Evolving AI Model Challenges Through Memory and Storage,” a Prese...
Seminario utovrm
I understand that physics and hardware emmaded on the use of finete .pdf
Challenges in Embedded Computing
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
VLSI- An Automotive Application Perspective
Harnessing the Killer Micros
Abhaycavirtual memory and the pagehit.pptx
Power Point Presentation on Virtual Memory.ppt
Heterogeneous Computing : The Future of Systems
Os Module 4_Virtual Memory Management.pptx
Architecture_L5 (3).pdf wwwwwwwwwwwwwwwwwwwwwwwwwww
Computer architecture abhmail
Ram and types of ram.Cache
Memory-Driven Near-Data Acceleration and its application to DOME/SKA

More from Grace Abraham (7)

PPTX
Embedded system hardware architecture ii
PPTX
Design and implementation of cmos rail to-rail operational amplifiers
PPTX
Clock recovery in mesochronous systems and pleisochronous systems
PPTX
MEMS ACCELEROMETER BASED NONSPECIFIC – USER HAND GESTURE RECOGNITION
PPTX
Implementation of 1 bit full adder using gate diffusion input (gdi) technique
PPTX
Rtl design optimizations and tradeoffs
PPTX
A 128 kbit sram with an embedded energy monitoring circuit and sense amplifie...
Embedded system hardware architecture ii
Design and implementation of cmos rail to-rail operational amplifiers
Clock recovery in mesochronous systems and pleisochronous systems
MEMS ACCELEROMETER BASED NONSPECIFIC – USER HAND GESTURE RECOGNITION
Implementation of 1 bit full adder using gate diffusion input (gdi) technique
Rtl design optimizations and tradeoffs
A 128 kbit sram with an embedded energy monitoring circuit and sense amplifie...

Recently uploaded (20)

PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPTX
OOP with Java - Java Introduction (Basics)
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PDF
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
DOCX
573137875-Attendance-Management-System-original
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPTX
CH1 Production IntroductoryConcepts.pptx
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPT
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPT
introduction to datamining and warehousing
PPTX
Current and future trends in Computer Vision.pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PDF
composite construction of structures.pdf
PPTX
additive manufacturing of ss316l using mig welding
R24 SURVEYING LAB MANUAL for civil enggi
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
OOP with Java - Java Introduction (Basics)
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
CYBER-CRIMES AND SECURITY A guide to understanding
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
573137875-Attendance-Management-System-original
Operating System & Kernel Study Guide-1 - converted.pdf
CH1 Production IntroductoryConcepts.pptx
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
Model Code of Practice - Construction Work - 21102022 .pdf
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
introduction to datamining and warehousing
Current and future trends in Computer Vision.pptx
Internet of Things (IOT) - A guide to understanding
composite construction of structures.pdf
additive manufacturing of ss316l using mig welding

Maha an energy efficient malleable hardware accelerator for data intensive applications

  • 2. MAHA : An Energy Efficient Malleable Hardware Accelerator For Data Intensive Applications Grace Abraham Roll No: 01 VLSI & ES
  • 3. CONTENTS Dept. of ECE 3 MAHA : Malleable Hardware Accelerator 29/07/2015 • INTRODUCTION • BACKGROUND AND MOTIVATION • MAHA - OVERALL APPROACH • NAND FLASH – A CASE STUDY • SOFTWARE ARHITECTURE • RESULTS • CONCLUSION
  • 4. Dept. of ECE 4 MAHA : Malleable Hardware Accelerator 29/07/2015 INTRODUCTION • In the nanometer technology, power has emerged as primary design constraint • Ever increasing demand for low power and high performance • Von-Neumann bottleneck (back & forth data transfer) barrier to performance & energy scaling • To improve efficiency use explicit parallelism • Energy overhead due to data transfer from off-chip to on-chip memory  Low Bandwidth  High latency  High energy
  • 5. Dept. of ECE 5 MAHA : Malleable Hardware Accelerator 29/07/2015 • To overcome this, a Malleable Hardware Accelerator is introduced • MAHA :  Implements a reconfigurable computing fabric in last level memory  Enabling computing within off chip memory Fig 1 : Von-Neumann bottleneck and proposed MAHA framework
  • 6. • Choice of NAND flash technology for demonstration • Previous investigations on Processing in memory (PIM) • MAHA differs from PIM architecture  Achieves on-demand computation by design modifications to the the off-chip nonvolatile memory organization  High energy efficiency through parallelism & dynamic customization • MAHA for data intensive applications • Area and energy overheads are accurately estimated • An efficient software flow for mapping applications to MAHA is presented Dept. of ECE 6 MAHA : Malleable Hardware Accelerator 29/07/2015
  • 7. Dept. of ECE 7 MAHA : Malleable Hardware Accelerator 29/07/2015 • Following sections includes  Von-Neumann bottleneck barrier  Introduces MAHA & its hardware architecture  Realization with a CMOS compatible NAND flash memory  Evaluation results for MAHA
  • 8. Dept. of ECE 8 MAHA : Malleable Hardware Accelerator 29/07/2015 BACKGROUND & MOTIVATION • PERFORMANCE BARRIER DUE TO VON-NEUMANN BOTTLENECK • ENERGY BARRIER FOR DATA-INTENSIVE APPLICATIONS  Off chip BW scales poorly in comparison to on chip transistor density  On chip density is likely to improve by 16X from 2011 to 2022  Off chip BW expected to improve only by 40%  BW available inside flash array is 4.2x105 GB/s in contrast , at 16 bit flash interface is only 100MB/s  Managing latency and energy for memory to achieve energy efficiency  To identify major hurdles to energy scaling o Performance of ten common kernels were simulated o System-level performance metrics, such as cache hit/miss frequency were noted
  • 9. Dept. of ECE 9 MAHA : Malleable Hardware Accelerator 29/07/2015  From table, o 73% of total energy expended is contributed by access to on-chip instruction & data cache o 26% invested in useful computations, including fetch and decode operations Table 1 : Energy breakdown for a conventional processor executing common computational kernels
  • 10. Dept. of ECE 10 MAHA : Malleable Hardware Accelerator 29/07/2015 • MITIGATING VON-NEUMANN BOTTLENECK THROUGH IN- MEMORY COMPUTING  75% of energy in a processor is dissipated in data transport  Optimizing the compute model for data-intensive tasks can cause large improvements in energy efficiency  Two implications for compute model o Relocate compute resources closer to last level of nonvolatile storage o Minimizes overhead for data transfer to on-chip execution units o Replace conventional software pipeline & caches with distributed memory infrastructure o Minimizes memory & interconnect memory power dissipation
  • 11. Dept. of ECE 11 MAHA : Malleable Hardware Accelerator 29/07/2015 MAHA-OVERALL APPROACH  HARDWARE ARCHITECTURE • MAHA is a hardware reconfigurable framework • Consists of an array of processing elements (PEs) • Communication using a hierarchical interconnect architecture • Target application to be mapped is represented as Control & data flow graph (CDFG) • Software flow partitions CDFG into smaller multiple-input multiple output tasks • Tasks are mapped to individual PEs
  • 12. Dept. of ECE 12 MAHA : Malleable Hardware Accelerator 29/07/2015 1) COMPUTE LOGIC 2) INTERCONNECT FABRIC  Each compute block or PE is referred to as memory logic block (MLB)  A single MLB includes a dense 2D memory array which stores lookup table, data  A custom data path with arithmetic units  A local register file for storing temporary outputs from memory  Sequence of operations inside an MLB is controlled by μ-code controller referred to as a schedule table  Tasks mapped to different MLBs communicate via a programmable & hierarchical interconnect  Interconnect is time-multiplexed & shared among multiple MLBs
• 13. Fig 2 : (a) Application mapping flow for MAHA (b) μ-arch details of a single computing block (MLB) (c) Synchronization among multiple MLBs over shared interconnect
• 14.  Sig1 & Sig2 are the outputs of MLBs A & B at the end of cycle 1
  Sig3 & Sig4 are the outputs at the end of cycle 2
  Signals at the end of each cycle are transmitted over the same local/global bus to MLB C
  Significant gains in energy efficiency can be obtained by computing inside the NVM
  MAHA is an attractive low-overhead & energy-efficient candidate for in-memory computing
  In the NVM-based MAHA model,
 o Multiple NVM arrays are grouped to form a single MLB
 o Each MLB processes its local data & communicates with other MLBs
 o Data is distributed to multiple MLBs through the flash translation layer, which maps a logical address to a physical location in NVM
 o Static CMOS logic is integrated with NVM to realize the MLB
• 15. COMPARISON WITH ALTERNATE ACCELERATORS
• Computing Model
  Frameworks without inherent hardware support for spatio-temporal computing - FPGA, Chimaera, PipeRench & RaPiD
  Frameworks that support spatio-temporal execution - MATRIX, MorphoSys
  MAHA is also a spatio-temporal computing framework
• Granularity of Computations
  Defined as the width of the smallest PE
  Based on granularity, frameworks are classified as
 o Fine-grained
 o Coarse-grained
 o Mixed granular
  MAHA is a mixed-granular computing framework
• 16. • Computing Fabric
  Hardware accelerators proposed earlier used fine-grained 1-D lookup tables
  MAHA uses memory for storing & mapping 1 or more multiple-input multiple-output LUTs
• Target Application Domain
  Hardware accelerators proposed earlier target a wide application space: bit-level computations, signal processing, image processing
  MAHA improves system energy for a variety of data-intensive applications
• 17. NAND FLASH – A CASE STUDY
• Hardware architecture for an off-chip MAHA framework based on a CMOS-compatible single-level cell (SLC) NAND flash memory array
• CMOS compatibility allows
  Integration of MLB controllers, registers, datapath and PI
  Realization using CMOS logic
• SLC is considered due to the availability of open-source area, power & delay models
• 18. OVERVIEW OF CURRENT FLASH ORGANISATION
  Organisation of NAND flash memory: a flash array & a number of logic structures
  For a normal flash read,
 o 8-b or 16-b I/O bandwidth
 o Organized in units of pages & blocks
 o Page size - 2 KB
 o Each block has 64-128 pages
 o The block decoder first selects one of the blocks
 o The page decoder then selects one of its pages
 o The content of the entire page is first read into the page register
 o It is then transferred to the flash external interface
Table 2 : Flash Organization and performance
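The block-decoder/page-decoder sequence above amounts to a simple address split. As a sketch, assuming 2 KB pages (from the slide) and 64 pages per block (the low end of the quoted 64-128 range), a byte address decomposes into block, page, and in-page offset:

```python
# Hypothetical address decomposition for the flash organization above:
# the block decoder consumes the block index, the page decoder the page
# index, and the offset selects a byte once the page register is filled.

PAGE_SIZE = 2048        # bytes per page (from the slide)
PAGES_PER_BLOCK = 64    # assumed; slides quote 64-128 pages per block

def decode(byte_addr):
    page_index, offset = divmod(byte_addr, PAGE_SIZE)
    block, page = divmod(page_index, PAGES_PER_BLOCK)
    return block, page, offset

print(decode(300000))   # -> (2, 18, 992)
```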
• 19. Figure 3: Modifications to conventional flash memory to realize MAHA framework. A small control engine outside the memory array is added to initiate & synchronize parallel operations inside the memory array
• 20. MODIFICATIONS TO FLASH ARRAY ORGANIZATION
  Modifications to achieve on-demand computation
  Without affecting normal read/write operation
1) Compute Logic Modifications
 o A group of N flash blocks is clustered to form a single MLB
 o Within an MLB, blocks are logically divided into LUT blocks & data blocks
 o MLB control logic & a custom datapath are implemented using static CMOS logic
 o A custom dual-ported asynchronous-read register file stores intermediate outputs
 o Pass-gate multiplexers & a keeper transistor are used for selecting operands for the LUT
 o For a normal NAND flash read, an entire page is read at once (2 KB)
• 21. o For LUT operations, a wide read is avoided because the operands are much smaller
 o We propose a narrow-read scheme for LUT blocks in which a fraction of a page is read at a time
 o Word line segmentation incurs a hardware overhead
 o To minimize this overhead, we read only 64-b words from each block at a time
• 22. o Advantage - it improves energy efficiency by lowering the word line capacitance
 o Combinational logic is used to switch between the narrow read for MAHA operation & the full-page read for normal flash operation
 o It works with the narrow-read decoder to control the AND gates for segmentation
 o Segmentation for data blocks is coarser, with 4096 bits being read out from each page and stored inside buffers
 o A group of such LUT and data blocks constitutes 1 MLB
 o The two planes of the flash array are logically divided into 8 banks, each consisting of 2 MLBs
 o Each MLB contains
 a. 256 blocks of flash memory
 b. 1 LUT block
 c. 255 data blocks
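The organization above can be sanity-checked with a few lines of arithmetic: 8 banks of 2 MLBs give 16 MLBs, each owning 256 flash blocks (1 LUT + 255 data), and a narrow 64-b LUT read touches only 1/256th of the bits moved by a full 2 KB page read, which is where the word line energy saving comes from.

```python
# Quick check of the MLB/bank geometry and the narrow-read saving
# described in the slides above.

BANKS, MLBS_PER_BANK = 8, 2
BLOCKS_PER_MLB = 256

mlbs = BANKS * MLBS_PER_BANK          # total MLBs across both planes
data_blocks = BLOCKS_PER_MLB - 1      # data blocks per MLB (1 block is LUT)

full_page_bits = 2048 * 8             # a normal full-page read, in bits
narrow_read_bits = 64                 # the proposed narrow LUT read
fraction = narrow_read_bits / full_page_bits

print(mlbs, data_blocks, fraction)    # -> 16 255 0.00390625 (= 1/256)
```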
• 23. Figure 4: Modified flash memory array for on-demand reconfigurable computing. The memory blocks are augmented with local control and compute logic to act as a hardware reconfigurable unit
• 24. 2) Routing Logic Modifications
 o Each block communicates with the page register over a shared bus
 o To minimize the inter-MLB PI overhead, a set of hierarchical buses is used, with a multiplexer at each level to select the source of incoming data
 o 4 levels - banks, sub-banks, subarrays
Figure 5 : Hierarchical interconnect architecture to connect a group of MLBs
• 25. SOFTWARE ARCHITECTURE
• The figure shows application mapping for the proposed acceleration platform
• The mapper (application mapping tool) was developed in C
• Key features of the software flow are
1) Description of the input application using an ISA
  An instruction set is defined for the proposed MAHA framework that supports common control as well as data flow operations
  Operation types supported by the software architecture:
 o bitswC
 o bits
 o mult
 o shift and rotate
 o sel
 o complex
 o load & store
• 26. Figure 6 : Application mapping flow for proposed MAHA framework
• 27. 2) Application mapping to a mixed-granular, time-multiplexed computing fabric
  The mapping process includes 2 key contributions
 1) Decomposition of fine- & coarse-grained operations
 o During decomposition of a load/store operation, memory is allocated in 1 or more MLBs depending on the address size used for the load/store & the number of data blocks present inside each MLB
 2) Fusing multiple LUT as well as custom-datapath operations
 o 3 fusion routines:
 1) Fusion of random LUT-based operations
 2) Fusion of bit-sliceable operations
 3) Fusion of custom-datapath operations
 o In all of these, the decomposed CDFG is first partitioned into 1 or more vertices
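To make the fusion idea concrete, here is an illustrative sketch of what the second routine (fusion of bit-sliceable operations) produces. The fusion heuristics themselves live in the mapper; this only shows the end product: two 4-bit operations, an AND followed by an XOR, collapsed into one multi-input lookup table so that a single memory read replaces two datapath operations.

```python
# Illustrative sketch of a fused multi-input LUT (the operation pair and
# widths are made up for demonstration, not taken from the paper).

def build_fused_lut(width=4):
    """Fuse (a & b) ^ c into one lookup table over all operand values."""
    lut = {}
    for a in range(1 << width):
        for b in range(1 << width):
            for c in range(1 << width):
                lut[(a, b, c)] = (a & b) ^ c
    return lut

fused = build_fused_lut()
val = fused[(0b1100, 0b1010, 0b0001)]   # (0b1100 & 0b1010) ^ 0b0001 = 0b1001
```

The cost of fusion is LUT capacity: the table above has 16^3 = 4096 entries, which is why the mapper must trade depth of fusion against the memory available in each MLB's LUT block.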
• 28. 3) Placement & routing for the hierarchical interconnect model:
  The software tool places the MLBs in a hierarchical fashion such that the number of inputs & outputs crossing each module is minimized
  In the bi-partitioning approach, MLBs are first allocated to the first-level modules, then distributed among the second-level modules
  This continues until each MLB has been mapped to the lowermost memory module
  Routing of signals in the CDFG is performed in the following order
 1) Routing of signals which cross each level of the memory hierarchy
 2) Routing of primary outputs from each MLB for all levels of the cyclic schedule
 3) Routing of primary inputs to each MLB for all levels of the cyclic schedule
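The recursive bi-partitioning above can be sketched as follows. This is a naive even split, not the cut-minimizing heuristic a real placement tool would use: MLBs are divided level by level until each sits in a leaf memory module, and the edges cut at each level are the signals that must be routed across that level of the hierarchy.

```python
# Hedged sketch of recursive bi-partitioning placement (naive halving;
# a real tool would choose partitions to minimize the cut at each level).

def bipartition(mlbs, edges, levels):
    """Return leaf groups and the number of edges cut at each level."""
    groups, cuts = [mlbs], []
    for _ in range(levels):
        nxt, cut = [], 0
        for g in groups:
            left, right = g[: len(g) // 2], g[len(g) // 2:]
            # an edge is cut at this level if its endpoints share this
            # module but fall on opposite sides of the split
            cut += sum(1 for u, v in edges
                       if u in g and v in g and (u in left) != (v in left))
            nxt += [left, right]
        groups, cuts = nxt, cuts + [cut]
    return groups, cuts

mlbs = list(range(8))                  # 8 MLBs to place
edges = [(0, 1), (2, 3), (3, 7)]      # inter-MLB signals in the CDFG
groups, cuts = bipartition(mlbs, edges, 3)
print(cuts)                            # -> [1, 0, 2]
```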
• 29. 4) Functional validation of the proposed framework:
  The bit file generation routine accepts the placed & routed netlist & generates the control or select bits for the following
 1) Configuration for the programmable switches
 2) Schedule table entries which control the sequence of operations inside each MLB
 3) LUT entries to be loaded into the function table
  The bit file generated by the tool can be directly loaded into the function table
• 30. RESULTS
A. Design space exploration for MAHA
B. Energy, performance, and overhead estimation
  Estimate the design overhead for an entire MLB as well as for the inter-MLB PI
  Map the benchmark applications to the MAHA framework
  Calculate the area overhead, performance, and energy requirements for each configuration & select the best configuration
  Cycle time of 20 ns for MAHA operation = bitline precharge time (12 ns) + intra-MLB delay (3 ns) + inter-MLB signal propagation time (5 ns)
  Area of a single block of the flash array = 5 * F^2 * (no. of pages) * (page size)
  Since the LUT block is separate from the data blocks, its area overhead is different
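The two quantities above are easy to reproduce: the 20 ns cycle is the sum of the three delay components, and block area follows the 5 * F^2 * pages * page-size expression. The feature size below is an assumption for illustration; the slides do not state which node the model uses.

```python
# Sanity check of the cycle time and the block-area formula quoted above.

precharge_ns, intra_mlb_ns, inter_mlb_ns = 12, 3, 5
cycle_ns = precharge_ns + intra_mlb_ns + inter_mlb_ns   # 20 ns total

F = 45e-9                    # assumed 45 nm feature size (illustrative)
n_pages = 64                 # pages per block (low end of 64-128 range)
page_bits = 2048 * 8         # 2 KB page, in bits
block_area_m2 = 5 * F**2 * n_pages * page_bits   # ~1.06e-8 m^2 here
```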
• 31. C. Selection of optimal MAHA configuration
  The parameters noted are
 o Area overhead
 o Latency
 o Number of MLBs required to map the application
 o Total energy dissipation in the MLBs
 o Area & energy for the inter-MLB PI
 o Size of the reconfiguration data
 o Final configuration
Figure 7: (a) Relative contribution of different components to the total area of the modified flash (b) Relative contribution of memory & logic components
• 32. D. Energy & performance for mapped applications
  Mapping results for a single CDFG instantiation of each selected benchmark mapped to the final MAHA hardware configuration
  For MAHA, the average PI energy is low compared with the average MLB logic energy
• 33. E. Comparison with a conventional GPP
 1) Reduction in on-chip & off-chip communication
 2) Improvement in execution latency
 3) Improvement in energy
 4) Improvement in EDP
• 34. F. Comparison with FPGA & GPU
  On average, MAHA improves the energy requirement by 74% & 84% over the FPGA & GPU frameworks, respectively
  MAHA eliminates the high energy overhead of transferring data from off-chip memory to the FPGA or GPU
G. Hardware emulation based validation
  We developed an FPGA-based emulation framework which validates
 1) Functionality & synchronization of multiple MLBs for several application kernels
 2) Interfacing of the MAHA framework with the host processor
  The emulation framework consists of 2 FPGA boards: a DE0 running a host CPU, & a DE4 consisting of 3 main components
• 35. o The MAHA framework
 o A flash controller
 o On-board flash memory
  The two boards communicate over 3-wire SPI in a simple master/slave configuration
  The slave queries the flash for all available kernels, & upon finding a match, begins a transfer of the configuration bits & data for processing to the MAHA framework
  If no match is found, the slave immediately responds with an error code
  Otherwise, the slave interrupts the host CPU only upon completion
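The query/dispatch flow above reduces to a small state machine. This sketch models only the control decisions, not the SPI link itself; the error code and the callback-based transfer are hypothetical stand-ins for illustration.

```python
# Illustrative model of the slave-side dispatch logic described above:
# match the requested kernel against what the flash holds, ship the
# configuration bits on a hit, return an error code on a miss.

ERR_NO_KERNEL = 0xFF    # hypothetical error code, not from the slides

def dispatch(kernel, available, transfer):
    if kernel not in available:
        return ERR_NO_KERNEL        # immediate error response to the host
    transfer(available[kernel])     # ship config bits + data to MAHA
    return None                     # host is interrupted only on completion

sent = []
status = dispatch("aes", {"aes": b"\x01\x02"}, sent.append)   # hit
miss = dispatch("fft", {"aes": b"\x01\x02"}, sent.append)     # miss
```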
• 36. Figure 8 : (a) Overview for off-chip acceleration with MAHA framework (b) System architecture for FPGA-based hardware emulation framework (c) Improvement in latency & energy with MAHA-based off-chip acceleration
• 37. DISCUSSION
  Before mapping a kernel to an in-memory accelerator, key application & system primitives can be used to determine whether it will benefit from in-memory acceleration. These are listed below:
 1) g - fraction of total instructions with memory references (loads and stores)
 2) f - fraction of total instructions transferred to the off-chip compute engine
 3) c - fraction of instructions translated from the host's ISA to the ISA of the off-chip compute framework
 4) o - fraction of original instructions which result in an output; a fraction f × c × o thus produces outputs which need to be transferred to the host processor
 5) e_offchip - average energy per instruction in the off-chip compute engine
 6) e_txfer - energy expended in the transfer of an output from the off-chip framework to the host processor
 7) t_offchip - ratio of the cycle time of the off-chip compute framework to that of the host processor
 8) n - fraction of speedup due to parallelism in the framework
 9) t_txfer - time taken, in processor clock cycles, to transfer an output from the off-chip compute framework to the host processor
• 38.  Tsys = Toffchip + Tproc + Ttxfer
  Esys = Eoffchip + Eproc + Etxfer
Figure 9 : Energy & performance for a hybrid system with a host processor & off-chip memory based hardware accelerator
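Using the parameter list from the previous slide, the time model can be sketched numerically. The composition below (offloaded work expanded by the translation factor c, reduced by the parallelism fraction n, plus transfer time for the f × c × o results) is one plausible reading of the slides, not the paper's exact expression, and all input values are made up for illustration.

```python
# Hedged sketch of Tsys = Tproc + Toffchip + Ttxfer, in host-processor
# cycles, for N original instructions (parameter meanings per the
# previous slide; the exact combination is an assumption).

def system_time(N, f, c, o, t_offchip, n, t_txfer):
    t_proc = N * (1 - f)                       # work kept on the host
    t_off = N * f * c * t_offchip * (1 - n)    # off-chip work after speedup
    t_x = N * f * c * o * t_txfer              # returning results to host
    return t_proc + t_off + t_x

t = system_time(N=1_000_000, f=0.8, c=1.2, o=0.05,
                t_offchip=2.0, n=0.9, t_txfer=4.0)
```

Esys decomposes the same way, with e_offchip and e_txfer in place of the time terms.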
• 39. CONCLUSION
• MAHA, a hardware acceleration framework
• Greatly improves energy efficiency for data-intensive applications by transferring the computing kernel to the last level of memory
• Design considerations for modifying an SLC NAND flash memory for on-demand reconfigurable computing are presented
• Improvement in energy efficiency
• Better efficiency compared to FPGA & GPU
• Future research efforts can be directed toward optimizing the MLB architecture, interconnect topology & mapper software
• 40. REFERENCES
  S. Paul, A. Krishna, W. Qian, R. Karam, and S. Bhunia, "MAHA: An energy-efficient malleable hardware accelerator for data-intensive applications"
  V. Govindaraju, C.-H. Ko, and K. Sankaralingam, "Dynamically specialized datapaths for energy efficient computing," in Proc. IEEE 17th Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2011, pp. 503-514
  and more....
• 41. THANK YOU
• 42. QUERIES?