WELCOME
MAHA : An Energy Efficient Malleable
Hardware Accelerator For Data
Intensive Applications
Grace Abraham
Roll No: 01
VLSI & ES
CONTENTS
Dept. of ECE 3
MAHA : Malleable Hardware Accelerator
29/07/2015
• INTRODUCTION
• BACKGROUND AND MOTIVATION
• MAHA - OVERALL APPROACH
• NAND FLASH – A CASE STUDY
• SOFTWARE ARCHITECTURE
• RESULTS
• CONCLUSION
INTRODUCTION
• In nanometer technologies, power has emerged as the primary
design constraint
• Ever-increasing demand for low power and high performance
• The Von-Neumann bottleneck (back & forth data transfer) is a barrier to
performance & energy scaling
• Explicit parallelism is used to improve efficiency
• Energy overhead due to data transfer from off-chip to on-chip
memory:
 Low Bandwidth
 High latency
 High energy
• To overcome this, a Malleable Hardware Accelerator (MAHA) is
introduced
• MAHA:
 Implements a reconfigurable computing fabric in the last level of memory
 Enables computing within off-chip memory
Fig 1 : Von-Neumann bottleneck and proposed MAHA framework
• Choice of NAND flash technology for demonstration
• Previous investigations on Processing in memory (PIM)
• MAHA differs from PIM architecture
 Achieves on-demand computation through design modifications to the
off-chip nonvolatile memory organization
 High energy efficiency through parallelism & dynamic customization
• MAHA for data intensive applications
• Area and energy overheads are accurately estimated
• An efficient software flow for mapping applications to MAHA is
presented
• The following sections include:
 Von-Neumann bottleneck barrier
 Introduces MAHA & its hardware architecture
 Realization with a CMOS compatible NAND flash memory
 Evaluation results for MAHA
BACKGROUND & MOTIVATION
• PERFORMANCE BARRIER DUE TO VON-NEUMANN BOTTLENECK
• ENERGY BARRIER FOR DATA-INTENSIVE APPLICATIONS
 Off-chip BW scales poorly in comparison with on-chip transistor density
 On-chip density is likely to improve by 16X from 2011 to 2022
 Off-chip BW is expected to improve by only 40%
 BW available inside the flash array is 4.2×10⁵ GB/s; in contrast, the 16-bit
flash interface delivers only 100 MB/s
 Memory latency and energy must be managed to achieve energy efficiency
 To identify major hurdles to energy scaling:
o Performance of ten common kernels was simulated
o System-level performance metrics, such as cache hit/miss frequency, were noted
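The bandwidth figures above imply a gap of several orders of magnitude between the array-internal bandwidth and the external interface. A back-of-envelope check, using the slide's figures (not measurements):

```python
# Back-of-envelope check of the bandwidth gap quoted above.
# Figures come from the slide, not from measurement.
internal_bw_gb_s = 4.2e5      # bandwidth inside the flash array, in GB/s
interface_bw_gb_s = 0.1       # 16-bit flash interface: 100 MB/s = 0.1 GB/s

gap = internal_bw_gb_s / interface_bw_gb_s
print(f"internal/interface bandwidth ratio = {gap:.1e}")  # ~4.2e+06
```

A gap of roughly four million times is why computing next to the array, rather than shipping data across the interface, is attractive.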
 From Table 1:
o 73% of total energy expended is contributed by access to on-chip instruction & data
cache
o 26% invested in useful computations, including fetch and decode operations
Table 1 : Energy breakdown for a conventional processor executing common computational kernels
• MITIGATING VON-NEUMANN BOTTLENECK THROUGH IN-
MEMORY COMPUTING
 75% of energy in a processor is dissipated in data transport
 Optimizing the compute model for data-intensive tasks can yield
large improvements in energy efficiency
 Two implications for compute model
o Relocate compute resources closer to last level of nonvolatile storage
o Minimizes overhead for data transfer to on-chip execution units
o Replace conventional software pipeline & caches with distributed memory
infrastructure
o Minimizes memory & interconnect power dissipation
MAHA-OVERALL APPROACH
 HARDWARE ARCHITECTURE
• MAHA is a hardware reconfigurable framework
• Consists of an array of processing elements (PEs)
• Communication using a hierarchical interconnect architecture
• The target application to be mapped is represented as a control &
data flow graph (CDFG)
• The software flow partitions the CDFG into smaller multiple-input
multiple-output tasks
• Tasks are mapped to individual PEs
1) COMPUTE LOGIC
 Each compute block or PE is referred to as a memory logic block (MLB)
 A single MLB includes a dense 2-D memory array that stores lookup
tables & data
 A custom datapath with arithmetic units
 A local register file for storing temporary outputs from memory
 The sequence of operations inside an MLB is controlled by a μ-code
controller referred to as a schedule table
2) INTERCONNECT FABRIC
 Tasks mapped to different MLBs communicate via a programmable &
hierarchical interconnect
 The interconnect is time-multiplexed & shared among multiple MLBs
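The schedule-table idea above can be sketched in a few lines. All names and the toy AND lookup table below are hypothetical; the real MLB has a full memory array of LUTs and a custom datapath:

```python
# Minimal sketch of a schedule-table-driven MLB cycle (hypothetical model).
# A lookup table replaces gate-level logic: the operands index into it.
LUT = {("AND", 0, 0): 0, ("AND", 0, 1): 0, ("AND", 1, 0): 0, ("AND", 1, 1): 1}

schedule_table = [                 # one micro-coded entry per cycle:
    ("AND", "r0", "r1", "r2"),     # (op, source reg a, source reg b, dest reg)
    ("AND", "r2", "r1", "r3"),
]

def run_mlb(regs):
    """Execute the schedule against the MLB's local register file."""
    for op, a, b, dest in schedule_table:
        regs[dest] = LUT[(op, regs[a], regs[b])]   # LUT-based compute step
    return regs

regs = run_mlb({"r0": 1, "r1": 1, "r2": 0, "r3": 0})
print(regs["r3"])  # 1
```

The point of the schedule table is that the same fabric is reprogrammed simply by rewriting these entries, with no change to the hardware.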
Fig 2 : (a) Application mapping flow for MAHA
(b) μ-arch details of a single computing block (MLB)
(c) Synchronization among multiple MLBs over shared interconnect
 Sig1 & Sig2 are outputs of MLB A & B at end of cycle 1
 Sig3 & Sig4 are outputs at end of cycle 2
 Signals at the end of each cycle are transmitted over the same local/global
bus to MLB C
 Significant gains in energy efficiency can be obtained by computing
inside the NVM
 MAHA is an attractive low-overhead & energy efficient candidate for
in-memory computing
 In NVM-based MAHA model,
o Multiple NVM arrays are grouped to form a single MLB
o Each MLB processes its local data & communicates with other MLBs
o Distribution of data to multiple MLBs through flash translation layer for mapping
logical address to a physical location in NVM
o Static CMOS logic integrated with NVM to realize MLB
 COMPARISON WITH ALTERNATE ACCELERATORS
• Computing Model
 Frameworks without inherent hardware support for spatio-temporal
computing: FPGA, Chimaera, PipeRench & RaPiD
 Frameworks that support spatio-temporal execution: MATRIX,
MorphoSys
 MAHA is also a spatio-temporal computing framework
• Granularity of computations
 Defined as the width of the smallest PE
 Based on granularity, frameworks are classified as:
o Fine-grained
o Coarse-grained
o Mixed granular
 MAHA is a mixed-granular computing framework
• Computing Fabric
 Hardware accelerators proposed earlier used fine-grained 1-D lookup
tables
 MAHA uses memory for storage & for mapping one or more
multiple-input multiple-output LUTs
• Target Application Domain
 Hardware accelerators proposed earlier target a wide application
space: bit-level computations, signal processing, image processing
 MAHA improves system energy for a variety of data-intensive
applications
NAND FLASH – A CASE STUDY
• Hardware architecture for an off-chip MAHA framework based
on a CMOS-compatible single-level cell (SLC) NAND flash memory
array
• CMOS compatibility allows:
 Integration of MLB controllers, registers, datapath and PI
 Realization using CMOS logic
• SLC is considered due to the availability of open-source area,
power & delay models
• OVERVIEW OF CURRENT FLASH ORGANISATION
 NAND flash memory is organized as a flash array plus a number of
peripheral logic structures
 For a normal flash read:
o 8-b or 16-b I/O bandwidth
o Organized in units of pages & blocks
o Page size – 2 KB
o Each block has 64-128 pages
o The block decoder first selects one of the blocks
o The page decoder selects one of the pages
o The content of the entire page is first read into the page register
o It is then transferred to the flash external interface
Table 2 : Flash organization and performance
Figure 3: Modifications to conventional flash memory to realize MAHA framework.
A small control engine outside the memory array is added to initiate & synchronize parallel operations
inside the memory array
• MODIFICATIONS TO FLASH ARRAY ORGANIZATION
 Modifications to achieve on-demand computation
 Without affecting normal read/write operation
1) Compute Logic Modifications
o A group of N flash blocks is clustered to form a single MLB
o In MLB, blocks are logically divided into LUT blocks & data blocks
o MLB control logic & custom datapath implemented using static CMOS logic
o A custom dual ported asynchronous read register file for storing intermediate
outputs
o Pass-gate multiplexers & a keeper transistor are used for selecting operands
for the LUT
o For Normal NAND flash read, entire page is read at once (2KB)
o For LUT operations, a wide read is avoided due to the smaller operand sizes
o We propose a narrow-read scheme for LUT blocks in which a fraction of a
page is read at a time
o Word-line segmentation incurs hardware overhead
o To minimize this overhead, we read only 64-b words from each block at a time
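A minimal model of the narrow-read scheme, assuming the 2 KB page and 64-bit narrow reads stated above (the circuit-level word-line segmentation itself is not modeled):

```python
# Sketch of the narrow-read idea (hypothetical model, not the circuit).
# A normal flash read returns the whole 2 KB page; a narrow read returns
# one 64-bit word, so far less bit-line/word-line capacitance toggles.
PAGE_BYTES = 2048
WORD_BYTES = 8                        # 64-bit narrow read

def full_page_read(page: bytes) -> bytes:
    """Normal NAND flash read: the entire page at once."""
    assert len(page) == PAGE_BYTES
    return page

def narrow_read(page: bytes, word_idx: int) -> bytes:
    """Return just one 64-bit word of the page (segmented word line)."""
    start = word_idx * WORD_BYTES
    return page[start:start + WORD_BYTES]

page = bytes(range(256)) * 8          # a dummy 2 KB page
print(len(full_page_read(page)), len(narrow_read(page, 3)))  # 2048 8
```

The energy argument follows directly: a LUT operand needs only one word, so reading 8 bytes instead of 2048 avoids activating the full page.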
o Advantage – it improves energy efficiency by lowering the word-line capacitance
o Combinational logic is used to switch between the narrow read for MAHA
operation & the full-page read for normal flash operation
o This logic, together with a narrow-read decoder, controls the AND gates used for segmentation
o Segmentation for data blocks is coarser, with 4096 bits being read out
from each page and stored inside buffers
o A group of such LUT and data blocks constitutes one MLB
o Two planes of the flash array are logically divided into 8 banks, each consisting of
2 MLBs
o Each MLB contains
a. 256 blocks of flash memory
b. 1 LUT block
c. 255 data blocks
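The organization figures above can be sanity-checked against each other (slide figures only):

```python
# Sanity-check of the MLB organization quoted above (slide figures).
banks = 8                     # two planes logically divided into 8 banks
mlbs_per_bank = 2
blocks_per_mlb = 256          # per MLB: 1 LUT block + the rest data blocks

total_mlbs = banks * mlbs_per_bank
data_blocks_per_mlb = blocks_per_mlb - 1
print(total_mlbs, data_blocks_per_mlb)  # 16 255
```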
Figure 4: Modified flash memory array for on-demand reconfigurable computing.
The memory blocks are augmented with local control and compute logic to act as a
hardware reconfigurable unit
2) Routing logic modifications
o Each block communicates with the page register over a shared bus
o To minimize the inter-MLB PI overhead, a set of hierarchical buses is used,
with a multiplexer at each level to select the source of incoming data
o Four levels of hierarchy – banks, sub-banks, subarrays & MLBs
Figure 5 : Hierarchical interconnect architecture to connect a group of MLBs
SOFTWARE ARCHITECTURE
• Figure 6 shows the application mapping flow for the proposed
acceleration platform
• The mapper (application mapping tool) was developed in C
• Key features of the software flow are:
1) Description of the input application using an ISA
 An instruction set is defined for the proposed MAHA framework that
supports common control as well as data flow operations
 Operation types supported by the software architecture:
o bitswC
o bits
o mult
o shift and rotate
o sel
o complex
o load & store
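One way to picture the operation classes is as a decoder-side tag attached to each operation. The mapping below is purely illustrative; only the class names come from the slide, and which host operations fall into each class is an assumption:

```python
# Illustrative tagging of operations into the MAHA op classes listed above.
# Class names are from the slide; the specific op-to-class mapping is hypothetical.
OP_CLASSES = {
    "and": "bitswC", "xor": "bitswC",
    "popcount": "bits",
    "mul": "mult",
    "shl": "shift and rotate", "ror": "shift and rotate",
    "mux": "sel",
    "crc": "complex",
    "ld": "load & store", "st": "load & store",
}

def classify(op: str) -> str:
    # Assume unrecognized operations fall back to the custom-datapath class.
    return OP_CLASSES.get(op, "complex")

print(classify("mul"), classify("ror"))  # mult shift and rotate
```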
Figure 6 : Application mapping flow for proposed MAHA framework
2) Application Mapping to a mixed-granular time-multiplexed
computing fabric
 The mapping process includes two key contributions:
1) Decomposition of fine & coarse grained operations
o During decomposition of load/store operation, memory is allocated in 1
or more MLBs depending on the address size used for load/store & no.
of data blocks present inside each MLB
2) Fusing multiple LUT as well as custom datapath operations
o 3 fusion routines
1) Fusion of random LUT based operations
2) Fusion of bit-sliceable operations
3) Fusion of custom-datapath operations
o In all these, decomposed CDFG is first partitioned into 1 or more vertices
3) Placement & routing for hierarchical interconnect model :
 The software tool places the MLBs in a hierarchical fashion such that
the number of inputs & outputs crossing each module is minimized
 In the bi-partitioning approach, MLBs are first allocated to the first-
level modules, then distributed among second-level modules
 This continues until each MLB has been mapped to the
lowermost memory module
 Routing of signals in the CDFG is performed in the following order
1) Routing of signals which cross each level of the memory hierarchy
2) Routing of primary outputs from each MLB for all levels of the cyclic
schedule
3) Routing of primary inputs to each MLB for all levels of the cyclic
schedule
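The recursive bi-partitioning placement above can be sketched as a toy routine. This greedy version only splits in order and counts cut edges rather than minimizing them, unlike the real placement tool; all names are hypothetical:

```python
# Toy sketch of recursive bi-partitioning placement (hypothetical, greedy).
# Real placement would search for the split that minimizes crossing signals.
def bipartition(mlbs, edges):
    """Split MLB ids into two halves and count edges crossing the cut."""
    half = len(mlbs) // 2
    left, right = set(mlbs[:half]), set(mlbs[half:])
    cut = sum(1 for a, b in edges if (a in left) != (b in left))
    return sorted(left), sorted(right), cut

def place(mlbs, edges, depth=2):
    """Recurse level by level until each leaf module holds one MLB."""
    if depth == 0 or len(mlbs) <= 1:
        return mlbs
    left, right, _ = bipartition(mlbs, edges)
    return [place(left, edges, depth - 1), place(right, edges, depth - 1)]

tree = place([0, 1, 2, 3], edges=[(0, 1), (2, 3), (1, 2)])
print(tree)  # [[[0], [1]], [[2], [3]]]
```

Each nesting level of the result corresponds to one level of the memory hierarchy, which is exactly the structure the router then wires signals across.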
4) Functional validation of the proposed framework :
 The bit-file generation routine accepts the placed & routed netlist &
generates the control or select bits for the following
1) Configuration for programmable switches
2) Schedule table entries which control the sequence of
operations inside each MLB
3) LUT entries to be loaded into the function table
 Bit file generated by the tool can be directly loaded into the
function table
RESULTS
A. Design space exploration for MAHA
B. Energy, Performance, and Overhead estimation
 Estimate the design overhead for the entire MLB as well as for the inter-MLB PI
 Map the benchmark applications to the MAHA framework
 Calculate the area overhead, performance, and energy
requirements for each configuration & select the best configuration
 Cycle time of 20 ns for MAHA operation = bitline precharge time (12 ns) +
intra-MLB delay (3 ns) + inter-MLB signal propagation time (5 ns)
 Area of a single block of the flash array = 5·F² × (N_pages) × (page size)
 Since the LUT block is separate from the data blocks, its area overhead is different
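Worked numbers for the cycle-time and area expressions above. The feature size F is not given on the slide, so 45 nm is an illustrative assumption:

```python
# Worked numbers for the timing and area expressions above.
# F (feature size) is an illustrative assumption; other figures are from the slide.
F = 45e-9                                  # assumed feature size, metres
n_pages = 64                               # pages per block (slide: 64-128)
page_bits = 2048 * 8                       # 2 KB page, in bits

cycle_ns = 12 + 3 + 5                      # precharge + intra-MLB + inter-MLB
block_area_m2 = 5 * F**2 * n_pages * page_bits   # 5*F^2 per flash cell

print(cycle_ns)                            # 20
print(f"block area = {block_area_m2 * 1e12:.0f} um^2")
```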
 The parameters noted are:
o Area overhead
o Latency
o Number of MLBs required to map the application
o Total energy dissipation in the MLBs
o Area & energy for inter-MLB PI
o Size of reconfiguration data
C. Selection of optimal MAHA configuration
o Final configuration
Figure 7: (a) Relative contribution of different components to the total area of the modified
flash (b) Relative contribution of memory & logic components
D. Energy & performance for mapped applications
 Mapping results for a single CDFG instantiation for each of the selected
benchmarks mapped to final MAHA hardware configuration
 For MAHA, the average PI energy is low compared with the average MLB
logic energy
E. Comparison with a conventional GPP
1) Reduction in On-chip & off-chip communication
2) Improvement in execution latency
3) Improvement in energy
4) Improvement in EDP
F. Comparison with FPGA & GPU
G. Hardware emulation based validation
 On average, MAHA reduces the energy requirement by 74% & 84%
relative to FPGA & GPU frameworks, respectively
 MAHA eliminates the high energy overhead of transferring data from off-
chip memory to an FPGA or GPU
 We developed an FPGA-based emulation framework, which validates:
1) Functionality & synchronization of multiple MLBs for several
application kernels
2) Interfacing the MAHA framework with the host processor
 The emulation framework consists of 2 FPGA boards: a DE0 running a host
CPU, & a DE4 consisting of 3 main components
o MAHA framework
o Flash controller
o on board flash memory
 The two boards communicate over a 3-wire SPI in a simple master/slave
configuration
 The slave queries the flash for all available kernels &, upon finding a match,
begins transferring the configuration bits & data for processing to the MAHA
framework
 If no match is found, the slave immediately responds with an error code
 Otherwise, the slave interrupts the host CPU only upon completion
Figure 8 : (a) Overview for off-chip acceleration with MAHA framework
(b)System architecture for FPGA- based hardware emulation framework
(c) Improvement in latency & energy with MAHA –based off-chip acceleration
DISCUSSION
 Before mapping a kernel to an in-memory accelerator, key application &
system parameters can be used to determine whether it will benefit from in-
memory acceleration. These are listed below:
1) g — fraction of total instructions with memory references (loads and stores);
2) f — fraction of total instructions transferred to the off-chip compute engine;
3) c — fraction of instructions translated from the host's ISA to the ISA of the off-chip
compute framework;
4) o — fraction of original instructions that result in an output. A fraction f × c × o
thus produces outputs, which need to be transferred to the host processor;
5) e_offchip — average energy per instruction in the off-chip compute engine;
6) e_txfer — energy expended in the transfer of an output from the off-chip framework
to the host processor;
7) t_offchip — ratio of the cycle time of the off-chip compute framework to that of the host
processor;
8) n — fraction of speedup due to parallelism in the framework;
9) t_txfer — time taken, in processor clock cycles, to transfer an output from the
off-chip compute framework to the host processor.
 T_sys = T_offchip + T_proc + T_txfer
 E_sys = E_offchip + E_proc + E_txfer
Figure 9 : Energy & performance for a hybrid system with a host processor &
off-chip memory based hardware accelerator
CONCLUSION
• MAHA, a hardware acceleration framework, is presented
• It greatly improves energy efficiency for data-intensive applications by
transferring computing kernels to the last level of memory
• Design considerations for modifying an SLC NAND flash memory for on-demand
reconfigurable computing are presented
• Improvement in energy efficiency
• Better efficiency compared with FPGA & GPU
• Future research efforts can be directed toward optimizing the MLB
architecture, interconnect topology & mapper software
REFERENCES
 S. Paul, A. Krishna, W. Qian, R. Karam, and S. Bhunia, "MAHA: An
energy-efficient malleable hardware accelerator for data-intensive
applications"
 V. Govindaraju, C.-H. Ko, and K. Sankaralingam, "Dynamically specialized
datapaths for energy efficient computing," in Proc. IEEE 17th Int.
Symp. High Perform. Comput. Archit. (HPCA), Feb. 2011, pp. 503–514
and more...
Dept. of ECE 41
MAHA : Malleable Hardware Accelerator
29/07/2015
THANK YOU
QUERIES?
More Related Content

PDF
Different Approaches in Energy Efficient Cache Memory
PDF
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
PPT
Runtime Reconfigurable Network-on-chips for FPGA-based Devices
PPT
Runtime Reconfigurable Network-on-chips for FPGA-based Systems
PDF
Efficient Data Center Virtualization with QLogic 10GbE Solutions from HP
PDF
Storage Networking Solutions for High Performance Databases by QLogic
PDF
Guidelines for-early-power-analysis
PPTX
MapReduce
Different Approaches in Energy Efficient Cache Memory
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
Runtime Reconfigurable Network-on-chips for FPGA-based Devices
Runtime Reconfigurable Network-on-chips for FPGA-based Systems
Efficient Data Center Virtualization with QLogic 10GbE Solutions from HP
Storage Networking Solutions for High Performance Databases by QLogic
Guidelines for-early-power-analysis
MapReduce

What's hot (13)

PDF
White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian...
 
PDF
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
PDF
Synergistic processing in cell's multicore architecture
PDF
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
PDF
SPE effiency on modern hardware paper presentation
PDF
Accelerix ISSCC 1998 Paper
PDF
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PDF
Greenplum: Driving the future of Data Warehousing and Analytics
PDF
Greenplum Database Overview
 
PDF
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
PDF
Architectures for parallel
PDF
Greenplum Database on HDFS
PPT
Oracle real application_cluster
White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian...
 
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Synergistic processing in cell's multicore architecture
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
SPE effiency on modern hardware paper presentation
Accelerix ISSCC 1998 Paper
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
Greenplum: Driving the future of Data Warehousing and Analytics
Greenplum Database Overview
 
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architectures for parallel
Greenplum Database on HDFS
Oracle real application_cluster
Ad

Viewers also liked (11)

PPTX
Liquor detection through Automatic Motor locking system ppt
PPT
Automatic room light controller with bidirectional visitor counter
PPTX
Latest ECE Projects Ideas In Various Electronics Technologies
PDF
Project report on self compacting concrete
PDF
Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...
PPS
Schindler case study
PDF
wireless charging of mobile phones using microwave full seminar report
PPTX
OLED 2014 PPT
PPTX
Automatic irrigation 1st review(ieee project ece dept)
PPTX
Artificial eye
PPTX
Wireless charging of mobilephones
Liquor detection through Automatic Motor locking system ppt
Automatic room light controller with bidirectional visitor counter
Latest ECE Projects Ideas In Various Electronics Technologies
Project report on self compacting concrete
Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...
Schindler case study
wireless charging of mobile phones using microwave full seminar report
OLED 2014 PPT
Automatic irrigation 1st review(ieee project ece dept)
Artificial eye
Wireless charging of mobilephones
Ad

Similar to Maha an energy efficient malleable hardware accelerator for data intensive applications (20)

PDF
OpenHPI - Parallel Programming Concepts - Week 4
PDF
Refactoring Applications for the XK7 and Future Hybrid Architectures
PDF
OpenPOWER Summit 2020 - OpenCAPI Keynote
PDF
Processing-in-Memory
PDF
6 open capi_meetup_in_japan_final
PDF
“Addressing Evolving AI Model Challenges Through Memory and Storage,” a Prese...
PPTX
Seminario utovrm
PDF
I understand that physics and hardware emmaded on the use of finete .pdf
PDF
Challenges in Embedded Computing
PPT
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
PDF
VLSI- An Automotive Application Perspective
PDF
Harnessing the Killer Micros
PPTX
Abhaycavirtual memory and the pagehit.pptx
PPT
Power Point Presentation on Virtual Memory.ppt
PDF
Heterogeneous Computing : The Future of Systems
PPTX
Os Module 4_Virtual Memory Management.pptx
PDF
Architecture_L5 (3).pdf wwwwwwwwwwwwwwwwwwwwwwwwwww
PDF
Computer architecture abhmail
PPTX
Ram and types of ram.Cache
PDF
Memory-Driven Near-Data Acceleration and its application to DOME/SKA
OpenHPI - Parallel Programming Concepts - Week 4
Refactoring Applications for the XK7 and Future Hybrid Architectures
OpenPOWER Summit 2020 - OpenCAPI Keynote
Processing-in-Memory
6 open capi_meetup_in_japan_final
“Addressing Evolving AI Model Challenges Through Memory and Storage,” a Prese...
Seminario utovrm
I understand that physics and hardware emmaded on the use of finete .pdf
Challenges in Embedded Computing
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
VLSI- An Automotive Application Perspective
Harnessing the Killer Micros
Abhaycavirtual memory and the pagehit.pptx
Power Point Presentation on Virtual Memory.ppt
Heterogeneous Computing : The Future of Systems
Os Module 4_Virtual Memory Management.pptx
Architecture_L5 (3).pdf wwwwwwwwwwwwwwwwwwwwwwwwwww
Computer architecture abhmail
Ram and types of ram.Cache
Memory-Driven Near-Data Acceleration and its application to DOME/SKA

More from Grace Abraham (7)

PPTX
Embedded system hardware architecture ii
PPTX
Design and implementation of cmos rail to-rail operational amplifiers
PPTX
Clock recovery in mesochronous systems and pleisochronous systems
PPTX
MEMS ACCELEROMETER BASED NONSPECIFIC – USER HAND GESTURE RECOGNITION
PPTX
Implementation of 1 bit full adder using gate diffusion input (gdi) technique
PPTX
Rtl design optimizations and tradeoffs
PPTX
A 128 kbit sram with an embedded energy monitoring circuit and sense amplifie...
Embedded system hardware architecture ii
Design and implementation of cmos rail to-rail operational amplifiers
Clock recovery in mesochronous systems and pleisochronous systems
MEMS ACCELEROMETER BASED NONSPECIFIC – USER HAND GESTURE RECOGNITION
Implementation of 1 bit full adder using gate diffusion input (gdi) technique
Rtl design optimizations and tradeoffs
A 128 kbit sram with an embedded energy monitoring circuit and sense amplifie...

Recently uploaded (20)

PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPTX
OOP with Java - Java Introduction (Basics)
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PDF
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
DOCX
573137875-Attendance-Management-System-original
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPTX
CH1 Production IntroductoryConcepts.pptx
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPT
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPT
introduction to datamining and warehousing
PPTX
Current and future trends in Computer Vision.pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PDF
composite construction of structures.pdf
PPTX
additive manufacturing of ss316l using mig welding
R24 SURVEYING LAB MANUAL for civil enggi
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
OOP with Java - Java Introduction (Basics)
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
CYBER-CRIMES AND SECURITY A guide to understanding
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
573137875-Attendance-Management-System-original
Operating System & Kernel Study Guide-1 - converted.pdf
CH1 Production IntroductoryConcepts.pptx
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
Model Code of Practice - Construction Work - 21102022 .pdf
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
introduction to datamining and warehousing
Current and future trends in Computer Vision.pptx
Internet of Things (IOT) - A guide to understanding
composite construction of structures.pdf
additive manufacturing of ss316l using mig welding

Maha an energy efficient malleable hardware accelerator for data intensive applications

  • 2. MAHA : An Energy Efficient Malleable Hardware Accelerator For Data Intensive Applications Grace Abraham Roll No: 01 VLSI & ES
  • 3. CONTENTS Dept. of ECE 3 MAHA : Malleable Hardware Accelerator 29/07/2015 • INTRODUCTION • BACKGROUND AND MOTIVATION • MAHA - OVERALL APPROACH • NAND FLASH – A CASE STUDY • SOFTWARE ARHITECTURE • RESULTS • CONCLUSION
  • 4. Dept. of ECE 4 MAHA : Malleable Hardware Accelerator 29/07/2015 INTRODUCTION • In the nanometer technology, power has emerged as primary design constraint • Ever increasing demand for low power and high performance • Von-Neumann bottleneck (back & forth data transfer) barrier to performance & energy scaling • To improve efficiency use explicit parallelism • Energy overhead due to data transfer from off-chip to on-chip memory  Low Bandwidth  High latency  High energy
  • 5. Dept. of ECE 5 MAHA : Malleable Hardware Accelerator 29/07/2015 • To overcome this, a Malleable Hardware Accelerator is introduced • MAHA :  Implements a reconfigurable computing fabric in last level memory  Enabling computing within off chip memory Fig 1 : Von-Neumann bottleneck and proposed MAHA framework
  • 6. • Choice of NAND flash technology for demonstration • Previous investigations on Processing in memory (PIM) • MAHA differs from PIM architecture  Achieves on-demand computation by design modifications to the the off-chip nonvolatile memory organization  High energy efficiency through parallelism & dynamic customization • MAHA for data intensive applications • Area and energy overheads are accurately estimated • An efficient software flow for mapping applications to MAHA is presented Dept. of ECE 6 MAHA : Malleable Hardware Accelerator 29/07/2015
  • 7. Dept. of ECE 7 MAHA : Malleable Hardware Accelerator 29/07/2015 • Following sections includes  Von-Neumann bottleneck barrier  Introduces MAHA & its hardware architecture  Realization with a CMOS compatible NAND flash memory  Evaluation results for MAHA
  • 8. Dept. of ECE 8 MAHA : Malleable Hardware Accelerator 29/07/2015 BACKGROUND & MOTIVATION • PERFORMANCE BARRIER DUE TO VON-NEUMANN BOTTLENECK • ENERGY BARRIER FOR DATA-INTENSIVE APPLICATIONS  Off chip BW scales poorly in comparison to on chip transistor density  On chip density is likely to improve by 16X from 2011 to 2022  Off chip BW expected to improve only by 40%  BW available inside flash array is 4.2x105 GB/s in contrast , at 16 bit flash interface is only 100MB/s  Managing latency and energy for memory to achieve energy efficiency  To identify major hurdles to energy scaling o Performance of ten common kernels were simulated o System-level performance metrics, such as cache hit/miss frequency were noted
  • 9. Dept. of ECE 9 MAHA : Malleable Hardware Accelerator 29/07/2015  From table, o 73% of total energy expended is contributed by access to on-chip instruction & data cache o 26% invested in useful computations, including fetch and decode operations Table 1 : Energy breakdown for a conventional processor executing common computational kernels
  • 10. Dept. of ECE 10 MAHA : Malleable Hardware Accelerator 29/07/2015 • MITIGATING VON-NEUMANN BOTTLENECK THROUGH IN- MEMORY COMPUTING  75% of energy in a processor is dissipated in data transport  Optimizing the compute model for data-intensive tasks can cause large improvements in energy efficiency  Two implications for compute model o Relocate compute resources closer to last level of nonvolatile storage o Minimizes overhead for data transfer to on-chip execution units o Replace conventional software pipeline & caches with distributed memory infrastructure o Minimizes memory & interconnect memory power dissipation
  • 11. Dept. of ECE 11 MAHA : Malleable Hardware Accelerator 29/07/2015 MAHA-OVERALL APPROACH  HARDWARE ARCHITECTURE • MAHA is a hardware reconfigurable framework • Consists of an array of processing elements (PEs) • Communication using a hierarchical interconnect architecture • Target application to be mapped is represented as Control & data flow graph (CDFG) • Software flow partitions CDFG into smaller multiple-input multiple output tasks • Tasks are mapped to individual PEs
  • 12. Dept. of ECE 12 MAHA : Malleable Hardware Accelerator 29/07/2015 1) COMPUTE LOGIC 2) INTERCONNECT FABRIC  Each compute block or PE is referred to as memory logic block (MLB)  A single MLB includes a dense 2D memory array which stores lookup table, data  A custom data path with arithmetic units  A local register file for storing temporary outputs from memory  Sequence of operations inside an MLB is controlled by μ-code controller referred to as a schedule table  Tasks mapped to different MLBs communicate via a programmable & hierarchical interconnect  Interconnect is time-multiplexed & shared among multiple MLBs
• 13. Fig 2 : (a) Application mapping flow for MAHA (b) μ-arch details of a single computing block (MLB) (c) Synchronization among multiple MLBs over shared interconnect
• 14.  Sig1 & Sig2 are the outputs of MLBs A & B at the end of cycle 1
  Sig3 & Sig4 are the outputs at the end of cycle 2
  Signals at the end of each cycle are transmitted over the same local/global bus to MLB C
  Significant gains in energy efficiency can be obtained by computing inside the NVM
  MAHA is an attractive low-overhead & energy-efficient candidate for in-memory computing
  In the NVM-based MAHA model,
 o Multiple NVM arrays are grouped to form a single MLB
 o Each MLB processes its local data & communicates with other MLBs
 o Data is distributed to multiple MLBs through the flash translation layer, which maps a logical address to a physical location in NVM
 o Static CMOS logic is integrated with NVM to realize the MLB
• 15. COMPARISON WITH ALTERNATE ACCELERATORS
• Computing Model
  Frameworks without inherent hardware support for spatio-temporal computing - FPGA, Chimaera, PipeRench & RaPiD
  Frameworks that support spatio-temporal execution - MATRIX, MorphoSys
  MAHA is also a spatio-temporal computing framework
• Granularity of Computations
  Defined as the width of the smallest PE
  Based on granularity, frameworks are classified as
 o Fine-grained
 o Coarse-grained
 o Mixed granular
  MAHA is a mixed-granular computing framework
• 16. • Computing Fabric
  Hardware accelerators proposed earlier used fine-grained 1-D lookup tables
  MAHA uses memory for storing & mapping 1 or more multiple-input multiple-output LUTs
• Target Application Domain
  Hardware accelerators proposed earlier target a wide application space: bit-level computations, signal processing, image processing
  MAHA improves system energy for a variety of data-intensive applications
• 17. NAND FLASH – A CASE STUDY
• Hardware architecture for an off-chip MAHA framework based on a CMOS-compatible single-level cell (SLC) NAND flash memory array
• CMOS compatibility allows
  Integration of MLB controllers, registers, datapath and PI
  Realization using CMOS logic
• SLC is considered due to the availability of open-source area, power & delay models
• 18. OVERVIEW OF CURRENT FLASH ORGANISATION
  Organisation of NAND flash memory: a flash array & a number of logic structures
  For a normal flash read,
 o 8-b or 16-b I/O bandwidth
 o Organized in units of pages & blocks
 o Page size - 2 KB
 o Each block has 64-128 pages
 o The block decoder first selects one of the blocks
 o The page decoder then selects one of its pages
 o The content of the entire page is first read into the page register
 o It is then transferred to the flash external interface
Table 2 : Flash Organization and performance
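The block-decoder/page-decoder sequence above amounts to a simple address split. As a sketch, assuming 2 KB pages (from the slide) and 64 pages per block (the low end of the quoted 64-128 range), a byte address decomposes into block, page, and in-page offset:

```python
# Hypothetical address decomposition for the flash organization above:
# the block decoder consumes the block index, the page decoder the page
# index, and the offset selects a byte once the page register is filled.

PAGE_SIZE = 2048        # bytes per page (from the slide)
PAGES_PER_BLOCK = 64    # assumed; slides quote 64-128 pages per block

def decode(byte_addr):
    page_index, offset = divmod(byte_addr, PAGE_SIZE)
    block, page = divmod(page_index, PAGES_PER_BLOCK)
    return block, page, offset

print(decode(300000))   # -> (2, 18, 992)
```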
• 19. Figure 3: Modifications to conventional flash memory to realize MAHA framework. A small control engine outside the memory array is added to initiate & synchronize parallel operations inside the memory array
• 20. MODIFICATIONS TO FLASH ARRAY ORGANIZATION
  Modifications to achieve on-demand computation
  Without affecting normal read/write operation
1) Compute Logic Modifications
 o A group of N flash blocks is clustered to form a single MLB
 o Within an MLB, blocks are logically divided into LUT blocks & data blocks
 o MLB control logic & a custom datapath are implemented using static CMOS logic
 o A custom dual-ported asynchronous-read register file stores intermediate outputs
 o Pass-gate multiplexers & a keeper transistor are used for selecting operands for the LUT
 o For a normal NAND flash read, an entire page is read at once (2 KB)
• 21. o For LUT operations, a wide read is avoided because the operands are much smaller
 o We propose a narrow-read scheme for LUT blocks in which a fraction of a page is read at a time
 o Word line segmentation incurs a hardware overhead
 o To minimize this overhead, we read only 64-b words from each block at a time
• 22. o Advantage - it improves energy efficiency by lowering the word line capacitance
 o Combinational logic is used to switch between the narrow read for MAHA operation & the full-page read for normal flash operation
 o It works with the narrow-read decoder to control the AND gates for segmentation
 o Segmentation for data blocks is coarser, with 4096 bits being read out from each page and stored inside buffers
 o A group of such LUT and data blocks constitutes 1 MLB
 o The two planes of the flash array are logically divided into 8 banks, each consisting of 2 MLBs
 o Each MLB contains
 a. 256 blocks of flash memory
 b. 1 LUT block
 c. 255 data blocks
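The organization above can be sanity-checked with a few lines of arithmetic: 8 banks of 2 MLBs give 16 MLBs, each owning 256 flash blocks (1 LUT + 255 data), and a narrow 64-b LUT read touches only 1/256th of the bits moved by a full 2 KB page read, which is where the word line energy saving comes from.

```python
# Quick check of the MLB/bank geometry and the narrow-read saving
# described in the slides above.

BANKS, MLBS_PER_BANK = 8, 2
BLOCKS_PER_MLB = 256

mlbs = BANKS * MLBS_PER_BANK          # total MLBs across both planes
data_blocks = BLOCKS_PER_MLB - 1      # data blocks per MLB (1 block is LUT)

full_page_bits = 2048 * 8             # a normal full-page read, in bits
narrow_read_bits = 64                 # the proposed narrow LUT read
fraction = narrow_read_bits / full_page_bits

print(mlbs, data_blocks, fraction)    # -> 16 255 0.00390625 (= 1/256)
```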
• 23. Figure 4: Modified flash memory array for on-demand reconfigurable computing. The memory blocks are augmented with local control and compute logic to act as a hardware reconfigurable unit
• 24. 2) Routing Logic Modifications
 o Each block communicates with the page register over a shared bus
 o To minimize the inter-MLB PI overhead, a set of hierarchical buses is used, with a multiplexer at each level to select the source of incoming data
 o 4 levels - banks, sub-banks, subarrays
Figure 5 : Hierarchical interconnect architecture to connect a group of MLBs
• 25. SOFTWARE ARCHITECTURE
• The figure shows application mapping for the proposed acceleration platform
• The mapper (application mapping tool) was developed in C
• Key features of the software flow are
1) Description of the input application using an ISA
  An instruction set is defined for the proposed MAHA framework that supports common control as well as data flow operations
  Operation types supported by the software architecture:
 o bitswC
 o bits
 o mult
 o shift and rotate
 o sel
 o complex
 o load & store
• 26. Figure 6 : Application mapping flow for proposed MAHA framework
• 27. 2) Application mapping to a mixed-granular, time-multiplexed computing fabric
  The mapping process includes 2 key contributions
 1) Decomposition of fine- & coarse-grained operations
 o During decomposition of a load/store operation, memory is allocated in 1 or more MLBs depending on the address size used for the load/store & the number of data blocks present inside each MLB
 2) Fusing multiple LUT as well as custom-datapath operations
 o 3 fusion routines:
 1) Fusion of random LUT-based operations
 2) Fusion of bit-sliceable operations
 3) Fusion of custom-datapath operations
 o In all of these, the decomposed CDFG is first partitioned into 1 or more vertices
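To make the fusion idea concrete, here is an illustrative sketch of what the second routine (fusion of bit-sliceable operations) produces. The fusion heuristics themselves live in the mapper; this only shows the end product: two 4-bit operations, an AND followed by an XOR, collapsed into one multi-input lookup table so that a single memory read replaces two datapath operations.

```python
# Illustrative sketch of a fused multi-input LUT (the operation pair and
# widths are made up for demonstration, not taken from the paper).

def build_fused_lut(width=4):
    """Fuse (a & b) ^ c into one lookup table over all operand values."""
    lut = {}
    for a in range(1 << width):
        for b in range(1 << width):
            for c in range(1 << width):
                lut[(a, b, c)] = (a & b) ^ c
    return lut

fused = build_fused_lut()
val = fused[(0b1100, 0b1010, 0b0001)]   # (0b1100 & 0b1010) ^ 0b0001 = 0b1001
```

The cost of fusion is LUT capacity: the table above has 16^3 = 4096 entries, which is why the mapper must trade depth of fusion against the memory available in each MLB's LUT block.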
• 28. 3) Placement & routing for the hierarchical interconnect model:
  The software tool places the MLBs in a hierarchical fashion such that the number of inputs & outputs crossing each module is minimized
  In the bi-partitioning approach, MLBs are first allocated to the first-level modules, then distributed among the second-level modules
  This continues until each MLB has been mapped to the lowermost memory module
  Routing of signals in the CDFG is performed in the following order
 1) Routing of signals which cross each level of the memory hierarchy
 2) Routing of primary outputs from each MLB for all levels of the cyclic schedule
 3) Routing of primary inputs to each MLB for all levels of the cyclic schedule
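The recursive bi-partitioning above can be sketched as follows. This is a naive even split, not the cut-minimizing heuristic a real placement tool would use: MLBs are divided level by level until each sits in a leaf memory module, and the edges cut at each level are the signals that must be routed across that level of the hierarchy.

```python
# Hedged sketch of recursive bi-partitioning placement (naive halving;
# a real tool would choose partitions to minimize the cut at each level).

def bipartition(mlbs, edges, levels):
    """Return leaf groups and the number of edges cut at each level."""
    groups, cuts = [mlbs], []
    for _ in range(levels):
        nxt, cut = [], 0
        for g in groups:
            left, right = g[: len(g) // 2], g[len(g) // 2:]
            # an edge is cut at this level if its endpoints share this
            # module but fall on opposite sides of the split
            cut += sum(1 for u, v in edges
                       if u in g and v in g and (u in left) != (v in left))
            nxt += [left, right]
        groups, cuts = nxt, cuts + [cut]
    return groups, cuts

mlbs = list(range(8))                  # 8 MLBs to place
edges = [(0, 1), (2, 3), (3, 7)]      # inter-MLB signals in the CDFG
groups, cuts = bipartition(mlbs, edges, 3)
print(cuts)                            # -> [1, 0, 2]
```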
• 29. 4) Functional validation of the proposed framework:
  The bit file generation routine accepts the placed & routed netlist & generates the control or select bits for the following
 1) Configuration for the programmable switches
 2) Schedule table entries which control the sequence of operations inside each MLB
 3) LUT entries to be loaded into the function table
  The bit file generated by the tool can be directly loaded into the function table
• 30. RESULTS
A. Design space exploration for MAHA
B. Energy, performance, and overhead estimation
  Estimate the design overhead for an entire MLB as well as for the inter-MLB PI
  Map the benchmark applications to the MAHA framework
  Calculate the area overhead, performance, and energy requirements for each configuration & select the best configuration
  Cycle time of 20 ns for MAHA operation = bitline precharge time (12 ns) + intra-MLB delay (3 ns) + inter-MLB signal propagation time (5 ns)
  Area of a single block of the flash array = 5 * F^2 * (no. of pages) * (page size)
  Since the LUT block is separate from the data blocks, its area overhead is different
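The two quantities above are easy to reproduce: the 20 ns cycle is the sum of the three delay components, and block area follows the 5 * F^2 * pages * page-size expression. The feature size below is an assumption for illustration; the slides do not state which node the model uses.

```python
# Sanity check of the cycle time and the block-area formula quoted above.

precharge_ns, intra_mlb_ns, inter_mlb_ns = 12, 3, 5
cycle_ns = precharge_ns + intra_mlb_ns + inter_mlb_ns   # 20 ns total

F = 45e-9                    # assumed 45 nm feature size (illustrative)
n_pages = 64                 # pages per block (low end of 64-128 range)
page_bits = 2048 * 8         # 2 KB page, in bits
block_area_m2 = 5 * F**2 * n_pages * page_bits   # ~1.06e-8 m^2 here
```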
• 31. C. Selection of optimal MAHA configuration
  The parameters noted are
 o Area overhead
 o Latency
 o Number of MLBs required to map the application
 o Total energy dissipation in the MLBs
 o Area & energy for the inter-MLB PI
 o Size of the reconfiguration data
 o Final configuration
Figure 7: (a) Relative contribution of different components to the total area of the modified flash (b) Relative contribution of memory & logic components
• 32. D. Energy & performance for mapped applications
  Mapping results for a single CDFG instantiation of each selected benchmark mapped to the final MAHA hardware configuration
  For MAHA, the average PI energy is low compared with the average MLB logic energy
• 33. E. Comparison with a conventional GPP
 1) Reduction in on-chip & off-chip communication
 2) Improvement in execution latency
 3) Improvement in energy
 4) Improvement in EDP
• 34. F. Comparison with FPGA & GPU
  On average, MAHA improves the energy requirement by 74% & 84% over the FPGA & GPU frameworks, respectively
  MAHA eliminates the high energy overhead of transferring data from off-chip memory to the FPGA or GPU
G. Hardware emulation based validation
  We developed an FPGA-based emulation framework which validates
 1) Functionality & synchronization of multiple MLBs for several application kernels
 2) Interfacing of the MAHA framework with the host processor
  The emulation framework consists of 2 FPGA boards: a DE0 running a host CPU, & a DE4 consisting of 3 main components
• 35. o The MAHA framework
 o A flash controller
 o On-board flash memory
  The two boards communicate over 3-wire SPI in a simple master/slave configuration
  The slave queries the flash for all available kernels, & upon finding a match, begins a transfer of the configuration bits & data for processing to the MAHA framework
  If no match is found, the slave immediately responds with an error code
  Otherwise, the slave interrupts the host CPU only upon completion
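The query/dispatch flow above reduces to a small state machine. This sketch models only the control decisions, not the SPI link itself; the error code and the callback-based transfer are hypothetical stand-ins for illustration.

```python
# Illustrative model of the slave-side dispatch logic described above:
# match the requested kernel against what the flash holds, ship the
# configuration bits on a hit, return an error code on a miss.

ERR_NO_KERNEL = 0xFF    # hypothetical error code, not from the slides

def dispatch(kernel, available, transfer):
    if kernel not in available:
        return ERR_NO_KERNEL        # immediate error response to the host
    transfer(available[kernel])     # ship config bits + data to MAHA
    return None                     # host is interrupted only on completion

sent = []
status = dispatch("aes", {"aes": b"\x01\x02"}, sent.append)   # hit
miss = dispatch("fft", {"aes": b"\x01\x02"}, sent.append)     # miss
```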
• 36. Figure 8 : (a) Overview for off-chip acceleration with MAHA framework (b) System architecture for FPGA-based hardware emulation framework (c) Improvement in latency & energy with MAHA-based off-chip acceleration
• 37. DISCUSSION
  Before mapping a kernel to an in-memory accelerator, key application & system primitives can be used to determine whether it will benefit from in-memory acceleration. These are listed below:
 1) g - fraction of total instructions with memory references (loads and stores)
 2) f - fraction of total instructions transferred to the off-chip compute engine
 3) c - fraction of instructions translated from the host's ISA to the ISA of the off-chip compute framework
 4) o - fraction of original instructions which result in an output; a fraction f × c × o thus produces outputs which need to be transferred to the host processor
 5) e_offchip - average energy per instruction in the off-chip compute engine
 6) e_txfer - energy expended in the transfer of an output from the off-chip framework to the host processor
 7) t_offchip - ratio of the cycle time of the off-chip compute framework to that of the host processor
 8) n - fraction of speedup due to parallelism in the framework
 9) t_txfer - time taken, in processor clock cycles, to transfer an output from the off-chip compute framework to the host processor
• 38.  Tsys = Toffchip + Tproc + Ttxfer
  Esys = Eoffchip + Eproc + Etxfer
Figure 9 : Energy & performance for a hybrid system with a host processor & off-chip memory based hardware accelerator
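Using the parameter list from the previous slide, the time model can be sketched numerically. The composition below (offloaded work expanded by the translation factor c, reduced by the parallelism fraction n, plus transfer time for the f × c × o results) is one plausible reading of the slides, not the paper's exact expression, and all input values are made up for illustration.

```python
# Hedged sketch of Tsys = Tproc + Toffchip + Ttxfer, in host-processor
# cycles, for N original instructions (parameter meanings per the
# previous slide; the exact combination is an assumption).

def system_time(N, f, c, o, t_offchip, n, t_txfer):
    t_proc = N * (1 - f)                       # work kept on the host
    t_off = N * f * c * t_offchip * (1 - n)    # off-chip work after speedup
    t_x = N * f * c * o * t_txfer              # returning results to host
    return t_proc + t_off + t_x

t = system_time(N=1_000_000, f=0.8, c=1.2, o=0.05,
                t_offchip=2.0, n=0.9, t_txfer=4.0)
```

Esys decomposes the same way, with e_offchip and e_txfer in place of the time terms.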
• 39. CONCLUSION
• MAHA, a hardware acceleration framework
• Greatly improves energy efficiency for data-intensive applications by transferring the computing kernel to the last level of memory
• Design considerations for modifying an SLC NAND flash memory for on-demand reconfigurable computing are presented
• Improvement in energy efficiency
• Better efficiency compared to FPGA & GPU
• Future research efforts can be directed toward optimizing the MLB architecture, interconnect topology & mapper software
• 40. REFERENCES
  S. Paul, A. Krishna, W. Qian, R. Karam, and S. Bhunia, "MAHA: An energy-efficient malleable hardware accelerator for data-intensive applications"
  V. Govindaraju, C.-H. Ko, and K. Sankaralingam, "Dynamically specialized datapaths for energy efficient computing," in Proc. IEEE 17th Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2011, pp. 503-514
  and more....
• 41. THANK YOU
• 42. QUERIES?