Low Power Architecture for JPEG 2000

Dr. P. R. Panda, Associate Professor, IIT-Delhi
Rahul Jain, 2004JVL2433, M.Tech (VDTT), IIT-Delhi
S. Krishnakumar, Cypress Semiconductor, Bangalore
Agenda
   JPEG2000 and 2-D DWT
   Memory Power Optimization
   Existing 2D-DWT Scan Based Architectures
   Proposed Architectures
       Low Power Z-Scan
       Low Power Block Scan
   Optimization and Pipelining Exploration for 2D-DWT
       Proposed DFG Optimization
       Pipeline Study
JPEG2000 Computation Blocks

   Pre-processing (Image Tiling)
   Discrete Wavelet Transform
   Quantization
   Tier-1 Coding (EBCOT)
   Tier-2 Coding (File Formatting and Packing)
Discrete Wavelet Transform
   2D wavelet transform:
       1st:1D wavelet transform to all rows
       2nd:1D wavelet transform to all columns
   Each Row/Column can be computed
    independently
   [Figure: Image -> 1-Level DWT (subbands LL, HL on top; LH, HH below) -> 2-Level DWT (the LL subband is decomposed again into LL, HL, LH, HH)]
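The row-then-column structure above can be sketched in a few lines. This is a minimal illustration only: it uses a simple averaging/differencing (Haar-style) 1D step as a stand-in for the (9,7) filter used in this work, and the function names `haar_1d` and `dwt_2d_one_level` are hypothetical.

```python
def haar_1d(x):
    """One level of a simple Haar-style 1D transform (stand-in filter).
    Returns the low-pass half followed by the high-pass half."""
    low = [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / 2.0 for i in range(0, len(x), 2)]
    return low + high

def dwt_2d_one_level(tile, dwt_1d=haar_1d):
    """1st: 1D transform on every row. 2nd: 1D transform on every column.
    Result layout: LL top-left, HL top-right, LH bottom-left, HH bottom-right."""
    rows_done = [dwt_1d(list(row)) for row in tile]
    cols_done = [dwt_1d(list(col)) for col in zip(*rows_done)]
    # cols_done holds transformed columns; transpose back to row-major order.
    return [list(row) for row in zip(*cols_done)]
```

Each row (and then each column) is transformed without reference to any other, which is what makes the scan order a free design parameter in the architectures that follow.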
Importance of Optimizing Memory System
Energy
   Many emerging media applications like
    JPEG2000 are data intensive
   For ASICs and embedded systems, the memory
    system can contribute up to 90% of the energy
   Multiple memories exist in a SoC design
Optimization approaches
   Fixed memory access patterns
       Optimize memory architecture
   Fixed memory architecture
       Optimize memory access patterns
   Concurrently optimize Memory Architecture and
    Accesses
       Highest Potential
       Algorithm Level
           Reduce memory requirement
           Improve regularity of accesses
       Build optimized memory architecture
           Memory Partitioning
           Custom Circuits
       Option Explored in this Work
Memory Partitioning

   Partition the memory array into smaller banks
    so that only the addressed bank is activated
       improves speed and lowers power
       bit line capacitance reduced
       number of bit cells activated reduced
   At some point the delay and power overhead
    associated with the bank decoding circuit
    dominates (2 to 8 banks typical)
2D-DWT Architectures
   Direct
   Line Based
   Z-Scan
   Optimal Z-Scan (Ref: Mu-Yu Chiu, Kun-Bin Lee, Chein-Wei Jen,
    "Optimal data transfer and buffering schemes for JPEG2000 encoder",
    IEEE Workshop on Signal Processing Systems (SIPS 2003),
    27-29 Aug. 2003, pp. 177-182)
Direct DWT

   Straightforward Architecture
   First Read the Image Row wise computing
    Row-wise 1-D DWT
   Then Read the Image Column wise
    computing Column-wise 1-D DWT
   No On-Chip Buffer Required
   Reads + Writes to Off-Chip Memory =
    2MN+2MN (M =Image Tile ht, N = Image Tile wd)
Data Dependency in (9,7)DWT

   [Figure: data-dependency graph in five rows. Inputs X(i), i = 0..8; odd samples Y(2i+1); even samples Y(2i); odd samples Z(2i+1); even samples Z(2i)]
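The four dependency stages can be sketched as four lifting passes. This is a minimal sketch, assuming the stages correspond to the standard four-step lifting factorization of the 9/7 filter with its usual constants and mirrored borders; the mapping of the Y/Z labels to the passes is an interpretation of the figure, the final scaling step is omitted, and the function names are hypothetical.

```python
# Standard 9/7 lifting constants (assumed; not listed on the slide).
ALPHA, BETA, GAMMA, DELTA = -1.586134342, -0.05298011854, 0.8829110762, 0.4435068522

def _lift(x, start, coeff):
    """In place: add coeff * (left + right neighbour) to samples start, start+2, ...
    A neighbour outside the signal is replaced by its mirror image."""
    n = len(x)
    for i in range(start, n, 2):
        left = x[i - 1] if i - 1 >= 0 else x[i + 1]
        right = x[i + 1] if i + 1 < n else x[i - 1]
        x[i] += coeff * (left + right)

def lifting_97_forward(samples):
    """Return interleaved coefficients (even index: low-pass, odd index: high-pass)."""
    x = [float(v) for v in samples]
    _lift(x, 1, ALPHA)   # X(i)    -> Y(2i+1)
    _lift(x, 0, BETA)    # Y(2i+1) -> Y(2i)
    _lift(x, 1, GAMMA)   # Y(2i)   -> Z(2i+1)
    _lift(x, 0, DELTA)   # Z(2i+1) -> Z(2i)
    return x

def lifting_97_inverse(coeffs):
    """Undo the four passes in reverse order with negated constants."""
    x = [float(v) for v in coeffs]
    _lift(x, 0, -DELTA)
    _lift(x, 1, -GAMMA)
    _lift(x, 0, -BETA)
    _lift(x, 1, -ALPHA)
    return x
```

Because every pass only reads samples of the opposite parity, each output depends on a small neighbourhood of inputs, which is why a row or column computation can be suspended and resumed by saving only a few intermediate values.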
Line-Based DWT
   Read pixels line by line
   Keep the min required number of lines in
    memory
   Row Operation gets full line data
   Column operation is activated as soon as enough
    column data is available, to reduce buffering
   On-Chip Buffer Required = 6*N
   Reads + Writes to Off-Chip Memory =
    MN+MN (M =Image Tile ht, N = Image Tile wd)
Z-Scan DWT
   Do a Z-Scan instead of Line by Line Scan
   Column Processing can start early
   On-Chip Buffer Required = 4*M
   Reads + Writes to Off-Chip Memory =
    MN+MN (M =Image Tile ht, N = Image Tile wd)
Optimal Z-Scan
    Considers the Code-Block size (CW*CH) required by
     Encoding Block in the next phase

   On-Chip Buffer Required = 4*M+4*2*CW
   Reads + Writes to Off-Chip Memory =
    MN+MN (M =Image Tile ht, N = Image Tile wd)
   [Figure: scan window of width 2*CW and height 2*CH]
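The buffer and off-chip traffic figures quoted for the four existing architectures can be tabulated side by side. A minimal sketch, using only the expressions from the preceding slides; the function name is hypothetical and the buffer sizes are in whatever unit the slides use (not stated there).

```python
def dwt_architecture_costs(M, N, CW):
    """Return {architecture: (on-chip buffer, off-chip reads + writes)}.
    M: image tile height, N: image tile width, CW: code-block width."""
    return {
        "Direct":         (0,                  2 * M * N + 2 * M * N),
        "Line-Based":     (6 * N,              M * N + M * N),
        "Z-Scan":         (4 * M,              M * N + M * N),
        "Optimal Z-Scan": (4 * M + 4 * 2 * CW, M * N + M * N),
    }

# Example call with an illustrative (assumed) tile and code-block size.
for name, (buf, traffic) in dwt_architecture_costs(256, 256, 64).items():
    print(f"{name:15s} buffer={buf:6d} off-chip accesses={traffic}")
```

The three scan-based schemes halve the off-chip traffic of the direct approach; they differ only in how much on-chip buffering they need.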
Low Power Z-Scan
   Compute r elements in a row before starting
    with the next row
   For Z Scan r =1
   For Optimal Z-Scan r = 2*CW
   On-Chip Buffer Required = 4*M+4*2*CW
   Reads + Writes to Off-Chip Memory =
    MN+MN (M =Image Tile ht, N = Image Tile wd)
   [Figure: scan pattern in strips of width r over a window of height 2*CH]
Low Power Z-Scan
   r is an integer sub-multiple of 2*CW (2*CW is a multiple of r)
       This accounts for the Code Block Size
   Number of wakeups of the Column Buffer Banks depends
    on r
       Large Value of r not desirable
   Between the resumption of a row computation and
    storing back of intermediate values after calculating
    r row elements the buffer can go into a Low Power
    state
       Large Value of r is desirable
   Access to the buffers
       Row Buffer = 2 per ‘r’ element computation
       Column Buffer = 1 per element computation
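The scan order and the row-buffer access count can be sketched as follows. This is a simplified reading of the slides: it walks the whole tile height in each strip and ignores the 2*CH window of the Optimal Z-Scan; the function names are hypothetical.

```python
def low_power_z_scan(M, N, r):
    """Yield (row, col) visiting r elements of a row before moving to the next row,
    strip by strip across the tile width."""
    for strip_start in range(0, N, r):
        for row in range(M):
            for col in range(strip_start, min(strip_start + r, N)):
                yield (row, col)

def row_buffer_accesses(M, N, r):
    """Two row-buffer accesses per run of r elements:
    one to restore the suspended row state, one to store it back."""
    runs_per_row = (N + r - 1) // r
    return 2 * M * runs_per_row
```

For a tile whose width is a multiple of r this gives 2/r row-buffer accesses per element, so doubling r halves the row-buffer traffic.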
Low Power Block Scan
   Extend the concept of ‘r’ for column processing also
   Reduces the access to column buffer from 1 per
    element to 2/s per element
   To maintain the throughput introduce 2 Transpose
    Buffers (TB1 & TB2)
   Transpose Buffer Accesses
       Row Processor Writes
       Column Processor Reads
       i.e. 2 accesses per element
   TB must be much smaller than Column Buffer
   [Figure: tile divided into blocks of width r and height s; B1 and B3 in the top row, B2 and B4 in the row below]
Working: Low Power Block Scan
   2D-DWT computed in blocks of r*s
   Step 1: Row Processor (RP) computes 1D-DWT on B1
    and writes into TB1
   Step 2: Column Processor (CP) computes 1D-DWT on
    the data in TB1 (B1) and RP computes on B2 and
    writes into TB2
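   Similarly RP and CP alternate between TB1 and TB2
   [Figure: four panels of the ping-pong schedule. (1) RP: B1 into TB1. (2) RP: B2 into TB2 while CP: B1 from TB1. (3) RP: B3 into TB1 while CP: B2 from TB2. (4) RP: B4 into TB2 while CP: B3 from TB1]
   B: Block, RP/CP: Row/Column Processor, TB: Transpose Buffer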
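The alternation between the two transpose buffers can be written out as a schedule. A minimal sketch, assuming blocks are numbered B1, B2, ... in processing order and that odd blocks go through TB1 and even blocks through TB2; the function names are hypothetical.

```python
def transpose_buffer_for(block):
    """Odd-numbered blocks use TB1, even-numbered blocks use TB2 (ping-pong)."""
    return "TB1" if block % 2 == 1 else "TB2"

def block_scan_schedule(num_blocks):
    """Return one entry per step: the block RP writes and the block CP reads.
    RP runs one block ahead of CP; None means the processor is idle."""
    schedule = []
    for step in range(1, num_blocks + 2):
        rp_block = step if step <= num_blocks else None
        cp_block = step - 1 if step >= 2 else None
        schedule.append({
            "step": step,
            "RP": None if rp_block is None else (rp_block, transpose_buffer_for(rp_block)),
            "CP": None if cp_block is None else (cp_block, transpose_buffer_for(cp_block)),
        })
    return schedule

for entry in block_scan_schedule(4):
    print(entry)
```

In every step where both processors are busy they touch different buffers, which is what lets the row and column processors run concurrently without stalling each other.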
Memory Power Analysis
   Memory can be in 3 modes
     Active (Read/Write being done): Pa(n)
     Standby (No Access being done): Pstandby(n)
     Sleep Mode (Data Retention Mode and Cannot Access): Psleep(n)
         To Access from this mode, first wakeup the memory
         Wakeup incurs energy penalty PWakeup(n)
         Let ‘T’ be the minimum clock cycles for the memory to be in sleep mode to
          get any power advantage
   To account for memory banking overhead, multiplexer power
    considered
     Pmux(i,j) be the power for a i:1 multiplexer of bit width j
   Assumption: on-chip memory access latency fits within the clock
    period of 15 ns
   Power values refer to average power dissipation per coefficient
    computation for the corresponding memory component
Row and Column Buffer Power
   With 4-Stage pipelined DWT, ten 16-bit registers need to
    be stored/transferred in case of suspension/resumption
    of line computation (10 x 16 = 160 bits per entry)
   Row Buffer
       Size = 160*M (M: Ht of Image Tile)
       ‘b’ banks, each having 160 columns and M/b rows
       One b:1 Mux of 160 bits required
   Column Buffer
       Size = 160*2*CW (CW: EBCOT code block width, usually 128)
       ‘c’ banks, each having 160 columns and 2*CW/c rows
       One c:1 Mux of 160 bits required
   Column Buffer Power analysis Similar to Row Buffer
    Power analysis
Row Buffer Power
   Accesses to Row Buffer
       2 per ‘r’ elements, i.e. 2/r per element computation
       Only one Bank active at a time, others in Sleep Mode
   Row Buffer Power is:
       Prow= [2*Pa(M/b)+Pmux(b,160)+(r-2)*Ps(M/b)]/r +
        Psleep(M/b)* (b-1)
       Ps = Psleep if (r-2) >= ‘T’ else Ps = Pstandby
   Due to sequential access to the Row Buffer each
    Bank is woken up Once
   Total Row Buffer Power
   PTotal_Row = Prow + [Pw(M/b) * b/(M*r) ]
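The row buffer expressions translate directly into a small function. A minimal sketch: the per-mode power models are passed in as callables because their values come from the memory model and are not given on this slide; the function and parameter names are hypothetical.

```python
def row_buffer_power(M, b, r, T, p_active, p_standby, p_sleep, p_mux, p_wakeup):
    """Average row-buffer power per coefficient computation.
    M: tile height, b: number of banks, r: elements computed per row visit,
    T: minimum cycles in sleep mode for it to pay off.
    p_active/p_standby/p_sleep/p_wakeup take the bank size (rows);
    p_mux takes (inputs, bit width)."""
    rows_per_bank = M / b
    # Ps = Psleep if (r-2) >= T, else Pstandby
    p_idle = p_sleep(rows_per_bank) if (r - 2) >= T else p_standby(rows_per_bank)
    p_row = (2 * p_active(rows_per_bank) + p_mux(b, 160) + (r - 2) * p_idle) / r \
            + p_sleep(rows_per_bank) * (b - 1)
    # Each bank is woken once: Pw(M/b) * b / (M*r)
    return p_row + p_wakeup(rows_per_bank) * b / (M * r)
```

Keeping the power models as arguments makes it easy to sweep b and r with whatever memory model is available.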
Transpose Buffer Power
   2 buffers required of size r*s*16 bits partitioned into ‘d’ banks
    each
   Access and No of Wakeups
     RP: Sequential Order hence d wakeups for r*s elements

     CP: Sequential Order, but in jumps of r elements
          CP reads s elements from d banks
          Each bank has s/d elements
          If s-s/d > ‘T’, then put banks in Sleep mode and no of wakeups per
           element = d/s
   Power
     If (s-s/d >= T) PBuffer = 2* Pa(r*s/d) + Mux Power + 2*(d-1) *
      Psleep(r*s/d)
     Else PBuffer = 2* Pa(r*s/d) + Mux Power + (d-1) * Psleep(r*s/d) + (d-1) *
      Pstandby(r*s/d)
     Mux Power = Pmux(d,16) + Pmux(2,16)
     Wakeup Power = Pw(r*s/d) * PBuffer_Wake
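The two PBuffer cases on this slide can be sketched the same way. The wakeup term is left out because its second factor is not defined on the slide; the power models are caller-supplied callables and all names are hypothetical.

```python
def transpose_buffer_power(r, s, d, T, p_active, p_standby, p_sleep, p_mux):
    """PBuffer for the two transpose buffers, each r*s entries split into d banks.
    p_active/p_standby/p_sleep take the bank size; p_mux takes (inputs, bit width)."""
    bank_size = r * s / d
    mux_power = p_mux(d, 16) + p_mux(2, 16)
    if s - s / d >= T:
        # Idle banks of both buffers can sleep.
        return 2 * p_active(bank_size) + mux_power + 2 * (d - 1) * p_sleep(bank_size)
    # Otherwise the idle banks of one buffer stay in standby.
    return (2 * p_active(bank_size) + mux_power
            + (d - 1) * p_sleep(bank_size) + (d - 1) * p_standby(bank_size))
```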
Memory Architecture
   Row and Column Buffers
       Used as Circular FIFOs
       Replace General Row Decoder with Custom Circuit for
        Addressing
       Similar observation for Transpose Buffer
   Custom Row Decoder
       Counter and a Decoder
           [Figure: Log(n)-bit counter driving a row decoder with n outputs]
       Circular Shift Register (CSR)
           Flip Flop corresponding to the accessed row stores ‘1’
           A lot of power dissipated at FF clock pins
       Proposed Power Efficient CSR
           During shifting only 2 FF
            need to be enabled
           Use Clock Gating for others
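The benefit of gating can be illustrated by counting how many flip-flop clock pins toggle per shift. A minimal behavioural sketch, not a power model: it assumes a one-hot register of n >= 2 flip-flops that advances by one row per access, and the function name is hypothetical.

```python
def simulate_csr(n, shifts, clock_gated):
    """One-hot circular shift register selecting one of n rows.
    Returns (final one-hot state, total flip-flop clock events)."""
    state = [1] + [0] * (n - 1)
    clock_events = 0
    for _ in range(shifts):
        hot = state.index(1)
        nxt = (hot + 1) % n
        # Gated: only the FF giving up the '1' and the FF receiving it are clocked.
        # Ungated: every FF sees the clock edge.
        clock_events += 2 if clock_gated else n
        state[hot], state[nxt] = 0, 1
    return state, clock_events
```

With gating the clocked flip-flop count per shift stays at 2 regardless of n, whereas the plain CSR clocks all n flip-flops, so the gap widens as the buffer gets deeper.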
Comparison of 3 Row Decoders
   [Figure: two plots for CSR, ClockGated CSR and Cntr+RD. Power Comparison: Power (uW) vs. Bits (8 to 512). Area Comparison: Area (um^2) vs. Bits (8 to 512)]
   Proposed Row Decoder (clock-gated CSR) is up to 90% and
    84% more power efficient than the CSR and
    Cntr+Decoder, respectively
   Area Penalty of about 15%
Memory Energy Modeling
    Active Energy modeled using eCACTI
        eCACTI models leakage current also
        Models Cache Power
        Modified to get SRAM power
    Standby Energy
        IStandby = 1.83 nA at Vdd = 1V [Qin05]
    Sleep Mode Energy
        ISleep = 0.55 nA at Vdd = 0.49V [Qin05]
    Wakeup Energy
        Ewakeup = 0.57 fJ * no of bits in SRAM
H. Qin, et al., "Standby supply voltage minimization for deep sub-micron
SRAM", IEEE Microelectronics Journal, Aug 2005, vol. 36, pp. 789-800
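These figures allow a rough estimate of how long a bank must stay asleep before sleeping pays off. The sketch below is an illustration under stated assumptions, not the calculation used in this work: it assumes the quoted currents are per bit cell, that the break-even point is where the wakeup energy equals the energy saved relative to standby, and that the clock period is the 15 ns assumed earlier in the deck. All names are hypothetical.

```python
import math

I_STANDBY, V_STANDBY = 1.83e-9, 1.0    # A, V  (standby, from the slide)
I_SLEEP, V_SLEEP = 0.55e-9, 0.49       # A, V  (sleep, from the slide)
E_WAKEUP_PER_BIT = 0.57e-15            # J per bit (from the slide)
T_CLK = 15e-9                          # s, clock period assumed in the deck

def break_even_cycles():
    """Smallest whole number of idle cycles for which sleeping saves energy,
    assuming per-bit currents and E_wakeup = (P_standby - P_sleep) * t."""
    p_standby = I_STANDBY * V_STANDBY
    p_sleep = I_SLEEP * V_SLEEP
    seconds = E_WAKEUP_PER_BIT / (p_standby - p_sleep)
    return math.ceil(seconds / T_CLK)

print(break_even_cycles())
```

Because both the saving and the wakeup energy scale with the number of bits, the estimate is independent of bank size under these assumptions.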
Architecture Comparison




   8 Banks for row and column buffer in all the 3
    architectures
   Low Power Block Scan
       r =16 and s = 16
Optimization and Pipeline
Exploration
DFG Optimization
4 Stage Pipelining
                    Critical Path is Ta + Tm
                    Initiation Interval =1,
                     Resource Requirement
                        4 Multipliers
                        8 Adders
                        11 Registers
                            6 Pipelining Registers
                            4 for e1-e4
                            1 for Z4
                    Initiation Interval =2
                     Resource Requirement
                        2 Multipliers
                        4 Adders
                        9 Registers
Reducing Scaling Step Multipliers
   After each 1D DWT, multiply Low Pass Coeffs with k
    and High Pass with 1/k
   Delay the De-Interleaving of coefficients to save
    75% Multiplications
   With Throughput of 2,
    1 multiplication per cycle,
    hence 1 multiplier required
   Other Architectures require
    4 multipliers, 2 each for
    row and column processor
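One way to arrive at the 75% figure is to count multiplications per subband. The sketch assumes the saving comes from combining the row and column scale factors once both passes are done, so that LL is scaled by k*k, HH by 1/(k*k), and LH and HL by k*(1/k) = 1; this reading is an interpretation, not spelled out on the slide, and the function name is hypothetical.

```python
def scaling_multiplications(M, N):
    """Return (immediate, deferred, fraction saved) for an M x N tile (M, N even).
    immediate: scale every coefficient after the row pass and after the column pass.
    deferred:  scale once at the end; only LL and HH have a non-unity factor."""
    immediate = 2 * M * N
    coeffs_per_subband = (M // 2) * (N // 2)
    deferred = 2 * coeffs_per_subband      # LL and HH only; LH and HL need none
    return immediate, deferred, 1 - deferred / immediate

print(scaling_multiplications(8, 8))
```

Under this count the deferred scheme needs one multiplication for every two coefficients, which matches the single multiplier quoted for a throughput of 2.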
Pipeline Study
   Optimized DFG pipelined from 2-Stages to 8-
    Stages
   Study done to get the most power efficient
    strategy
   Impact of Pipelining on Clock Network Power
    also Accounted
Clock Tree Power Model

    H-Tree Network Assumed
    Buffer Energy also considered
    No of levels increase with
     increasing registers
          More Interconnect
          More Buffers




http://www.acsel-lab.com/Projects/detclocking/power_comparison.htm
Energy Components of Different Pipeline
Schemes
Conclusion
   “Low-Power Z-Scan” and “Low Power
    Block Scan” derived using different memory
    subsystem optimization techniques
   Optimizing the memory subsystem can result
    in up to 90% power savings
   1D-DWT DFG optimization proposed
   4-Stage pipelining on the optimized DFG is the
    most energy-efficient pipelined architecture
Thank You
   “A Power-Efficient Architecture for the 2-D
    Discrete Wavelet Transform”, Submitted to IEEE
    VLSI Design and Test Symposium, 2006
   “Memory Architecture Exploration for Power-
    Efficient 2D-Discrete Wavelet Transform”,
    Submitted to CODES+ISSS 2006
   “Optimization and Pipeline Exploration of 2D-
    Discrete Wavelet Transform”, Submitted to
    CASES 2006

